Authors: Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li
Abstract: Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often only focus on improving safety against harmful prompts, causing LLMs to become over-cautious and refuse to respond to benign prompts. Therefore, a key objective of safe alignment is to enhance safety while simultaneously reducing false refusals. In this paper, we introduce Energy-Driven Steering (EDS), a novel, fine-tuning free framework designed to resolve this challenge through dynamic, inference-time intervention. We trained a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable (false refusal or jailbreak) states and low energy to desirable (helpful response or safe reject) ones. During inference, EBM maps the LLM's internal activations to an "energy landscape". We use the gradient of the energy function to dynamically steer the LLM's hidden states to low energy regions, correcting the model to generate a desirable response in real-time without modifying its weights. This method decouples behavioral control from the model's core knowledge, offering a flexible solution with minimal computational overhead. Extensive experiments across a wide range of models show our method successfully achieves this objective: it substantially lowers false refusal rates. For example, raising compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining the baseline safety performance. Our work presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety.
Authors: Edward Y. Chang, Ethan Y. Chang
Abstract: Large language models can change answers under harmless edits that matter in practice: RAG outputs flip when passages are reordered, fine-tuning erodes invariances learned at pretraining, debate or chain-of-thought prompts take path-dependent routes, and compiler fusion or reordering perturbs logits near decision boundaries. These failures violate intended invariances, break continuous integration, and force teams to trade safety for speed. The effects are small yet distributed across layers and positions, sensitive to context length and evaluation order, and costly to repair with retraining or formal verification. We present WILSON, a minimal post-hoc diagnostic suite that converts simple loop and reordering checks on internal representations into system signals. WILSON combines an inverse-free curvature map over positions and layers, computed with JVPs and Hutchinson probes, with activation-level commutators that flag reorder risk. Signals are cheap to compute, model-agnostic for standard Transformers, and exported as thresholds and CSV artifacts for orchestrators. This enables concrete actions: guard RAG against order effects, catch fine-tuning regressions, stabilize debate pathways and long multi-turn contexts, and gate fusions or reorders in deployment. In short, WILSON helps anticipate failures and approve safe optimizations so reliability and throughput can improve together without changing model architecture or training.
Authors: Premt Cara, Kamilia Zaripova, David Bani-Harouni, Nassir Navab, Azade Farshad
Abstract: Rare genetic disease diagnosis faces critical challenges: insufficient patient data, inaccessible full genome sequencing, and the immense number of possible causative genes. These limitations cause prolonged diagnostic journeys, inappropriate treatments, and critical delays, disproportionately affecting patients in resource-limited settings where diagnostic tools are scarce. We propose RareNet, a subgraph-based Graph Neural Network that requires only patient phenotypes to identify the most likely causal gene and retrieve focused patient subgraphs for targeted clinical investigation. RareNet can function as a standalone method or serve as a pre-processing or post-processing filter for other candidate gene prioritization methods, consistently enhancing their performance while potentially enabling explainable insights. Through comprehensive evaluation on two biomedical datasets, we demonstrate competitive and robust causal gene prediction and significant performance gains when integrated with other frameworks. By requiring only phenotypic data, which is readily available in any clinical setting, RareNet democratizes access to sophisticated genetic analysis, offering particular value for underserved populations lacking advanced genomic infrastructure.
Authors: Zipo Jibao, Yingyi Fu, Xinyang Chen, Guoting Chen
Abstract: Real-world time series are influenced by numerous factors and exhibit complex non-stationary characteristics. Non-stationarity can lead to distribution shifts, where the statistical properties of time series change over time, negatively impacting model performance. Several instance normalization techniques have been proposed to address distribution shifts in time series forecasting. However, existing methods fail to account for shifts within individual instances, leading to suboptimal performance. To tackle inner-instance distribution shifts, we propose two novel point-level methods: Learning Distribution (LD) and Learning Conditional Distribution (LCD). LD eliminates internal discrepancies by fitting the internal distribution of input and output with different parameters at different time steps, while LCD utilizes neural networks to predict scaling coefficients of the output. We evaluate the performance of the two methods with various backbone models across public benchmarks and demonstrate the effectiveness of the point-level paradigm through comparative experiments.
Authors: Yuni Lai, Xiaoyu Xue, Linghui Shen, Yulun Wu, Gaolei Li, Song Guo, Kai Zhou, Bin Xiao
Abstract: Language-empowered foundation models (LeFMs), such as CLIP and GraphCLIP, have transformed multimodal learning by aligning visual (or graph) features with textual representations, enabling powerful downstream capabilities like few-shot learning. However, the reliance on small, task-specific support datasets collected in open environments exposes these models to poisoning attacks, where adversaries manipulate the support samples to degrade performance. Existing defenses rely on empirical strategies, which lack formal guarantees and remain vulnerable to unseen and adaptive attacks. Certified robustness offers provable guarantees but has been largely unexplored for few-shot classifiers based on LeFMs. This study seeks to fill these critical gaps by proposing the first provably robust few-shot classifier that is tailored for LeFMs. We term our model Language-empowered Few-shot Certification (\textbf{LeFCert}). It integrates both textual and feature embeddings with an adaptive blending mechanism. To achieve provable robustness, we propose a twofold trimmed mean prototype and derive provable upper and lower bounds for classification scores, enabling certification under worst-case poisoning scenarios. To further enhance the performance, we extend LeFCert with two variants by considering a more realistic and tighter attack budget: LeFCert-L incorporates randomized smoothing to provide Lipschitz continuity and derive robustness under dual budget constraints, and LeFCert-C provides collective certification for scenarios where attackers distribute a shared poisoning budget across multiple samples. Experiments demonstrate that LeFCert achieves state-of-the-art performance, significantly improving both clean and certified accuracy compared to existing baselines. Despite its advanced robustness mechanisms, LeFCert is computationally efficient, making it practical for real-world applications.
Authors: Kiran Smelser, Kaviru Gunaratne, Jacob Miller, Stephen Kobourov
Abstract: Complex, high-dimensional data is ubiquitous across many scientific disciplines, including machine learning, biology, and the social sciences. One of the primary methods of visualizing these datasets is with two-dimensional scatter plots that visually capture some properties of the data. Because visually determining the accuracy of these plots is challenging, researchers often use quality metrics to measure the projection's accuracy and faithfulness to the original data. One of the most commonly employed metrics, normalized stress, is sensitive to uniform scaling (stretching, shrinking) of the projection, despite this act not meaningfully changing anything about the projection. Another quality metric, the Kullback--Leibler (KL) divergence used in the popular t-Distributed Stochastic Neighbor Embedding (t-SNE) technique, is also susceptible to this scale sensitivity. We investigate the effect of scaling on stress and KL divergence analytically and empirically by showing just how much the values change and how this affects dimension reduction technique evaluations. We introduce a simple technique to make both metrics scale-invariant and show that it accurately captures expected behavior on a small benchmark.
Authors: Zipo Jibao, Yingyi Fu, Xinyang Chen, Guoting Chen
Abstract: Recent research demonstrates that linear models achieve forecasting performance competitive with complex architectures, yet methodologies for enhancing linear models remain underexplored. Motivated by the hypothesis that distinct time series instances may follow heterogeneous linear mappings, we propose the Classification Auxiliary Trend-Seasonal Decoupling Linear Model CATS-Linear, employing Classification Auxiliary Channel-Independence (CACI). CACI dynamically routes instances to dedicated predictors via classification, enabling supervised channel design. We further analyze the theoretical expected risks of different channel settings. Additionally, we redesign the trend-seasonal decomposition architecture by adding a decoupling -- linear mapping -- recoupling framework for trend components and complex-domain linear projections for seasonal components. Extensive experiments validate that CATS-Linear with fixed hyperparameters achieves state-of-the-art accuracy comparable to hyperparameter-tuned baselines while delivering SOTA accuracy against fixed-hyperparameter counterparts.
Authors: Pengcheng Deng, Kening Liu, Mengxi Zhou, Mingxi Li, Rui Yang, Chuzhe Cao, Maojun Wang, Zeyu Zhang
Abstract: Genomic Selection (GS) uses whole-genome information to predict crop phenotypes and accelerate breeding. Traditional GS methods, however, struggle with prediction accuracy for complex traits and large datasets. We propose DPCformer, a deep learning model integrating convolutional neural networks with a self-attention mechanism to model complex genotype-phenotype relationships. We applied DPCformer to 13 traits across five crops (maize, cotton, tomato, rice, chickpea). Our approach uses an 8-dimensional one-hot encoding for SNP data, ordered by chromosome, and employs the PMF algorithm for feature selection. Evaluations show DPCformer outperforms existing methods. In maize datasets, accuracy for traits like days to tasseling and plant height improved by up to 2.92%. For cotton, accuracy gains for fiber traits reached 8.37%. On small-sample tomato data, the Pearson Correlation Coefficient for a key trait increased by up to 57.35%. In chickpea, the yield correlation was boosted by 16.62%. DPCformer demonstrates superior accuracy, robustness in small-sample scenarios, and enhanced interpretability, providing a powerful tool for precision breeding and addressing global food security challenges.
Authors: Jiacheng Liu, Peiliang Cai, Qinming Zhou, Yuqi Lin, Deyang Kong, Benhao Huang, Yupei Pan, Haowen Xu, Chang Zou, Junshu Tang, Shikang Zheng, Linfeng Zhang
Abstract: The application of diffusion transformers is suffering from their significant inference costs. Recently, feature caching has been proposed to solve this problem by reusing features from previous timesteps, thereby skipping computation in future timesteps. However, previous feature caching assumes that features in adjacent timesteps are similar or continuous, which does not always hold in all settings. To investigate this, this paper begins with an analysis from the frequency domain, which reveal that different frequency bands in the features of diffusion models exhibit different dynamics across timesteps. Concretely, low-frequency components, which decide the structure of images, exhibit higher similarity but poor continuity. In contrast, the high-frequency bands, which decode the details of images, show significant continuity but poor similarity. These interesting observations motivate us to propose Frequency-aware Caching (FreqCa) which directly reuses features of low-frequency components based on their similarity, while using a second-order Hermite interpolator to predict the volatile high-frequency ones based on its continuity. Besides, we further propose to cache Cumulative Residual Feature (CRF) instead of the features in all the layers, which reduces the memory footprint of feature caching by 99%. Extensive experiments on FLUX.1-dev, FLUX.1-Kontext-dev, Qwen-Image, and Qwen-Image-Edit demonstrate its effectiveness in both generation and editing. Codes are available in the supplementary materials and will be released on GitHub.
Authors: Yunzhen Feng, Parag Jain, Anthony Hartshorn, Yaqi Duan, Julia Kempe
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as \textbf{L}ikelihood \textbf{E}stimation with \textbf{N}egative \textbf{S}amples (\textbf{LENS}). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.
Authors: Jiachen Jiang, Zhen Qin, Zhihui Zhu
Abstract: Channel equalization is fundamental for mitigating distortions such as frequency-selective fading and inter-symbol interference. Unlike standard supervised learning approaches that require costly retraining or fine-tuning for each new task, in-context learning (ICL) adapts to new channels at inference time with only a few examples. However, existing ICL-based equalizers are primarily developed for and evaluated on static channels within the context window. Indeed, to our knowledge, prior principled analyses and theoretical studies of ICL focus exclusively on the stationary setting, where the function remains fixed within the context. In this paper, we investigate the ability of ICL to address non-stationary problems through the lens of time-varying channel equalization. We employ a principled framework for designing efficient attention mechanisms with improved adaptivity in non-stationary tasks, leveraging algorithms from adaptive signal processing to guide better designs. For example, new attention variants can be derived from the Least Mean Square (LMS) adaptive algorithm, a Least Root Mean Square (LRMS) formulation for enhanced robustness, or multi-step gradient updates for improved long-term tracking. Experimental results demonstrate that ICL holds strong promise for non-stationary MIMO equalization, and that attention mechanisms inspired by classical adaptive algorithms can substantially enhance adaptability and performance in dynamic environments. Our findings may provide critical insights for developing next-generation wireless foundation models with stronger adaptability and robustness.
Authors: Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong
Abstract: Instance discrimination is a self-supervised representation learning paradigm wherein individual instances within a dataset are treated as distinct classes. This is typically achieved by generating two disparate views of each instance by applying stochastic transformations, which encourages the model to learn representations that are invariant to the common underlying object across these views.
Authors: Ozgur Guldogan, Neeraj Sarna, Yuanyuan Li, Michael Berger
Abstract: While counterfactual fairness of point predictors is well studied, its extension to prediction sets--central to fair decision-making under uncertainty--remains underexplored. On the other hand, conformal prediction (CP) provides efficient, distribution-free, finite-sample valid prediction sets, yet does not ensure counterfactual fairness. We close this gap by developing Counterfactually Fair Conformal Prediction (CF-CP) that produces counterfactually fair prediction sets. Through symmetrization of conformity scores across protected-attribute interventions, we prove that CF-CP results in counterfactually fair prediction sets while maintaining the marginal coverage property. Furthermore, we empirically demonstrate that on both synthetic and real datasets, across regression and classification tasks, CF-CP achieves the desired counterfactual fairness and meets the target coverage rate with minimal increase in prediction set size. CF-CP offers a simple, training-free route to counterfactually fair uncertainty quantification.
Authors: Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo
Abstract: A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to implicit weight updates (Dherin et al., 2025), we generalize this theory to deep, multi-block transformers. We show how the information contained in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices. We then derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.
Authors: Justin Lin, Julia Fukuyama
Abstract: In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex input-output relationships. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, we associate a SHAP value that quantifies the contribution of that feature to the prediction of that sample. Clustering these SHAP values can provide insight into the data by grouping samples that not only received the same prediction, but received the same prediction for similar reasons. In doing so, we map the various pathways through which distinct samples arrive at the same prediction. To showcase this methodology, we present a simulated experiment in addition to a case study in Alzheimer's disease using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. We also present a novel generalization of the waterfall plot for multi-classification.
Authors: Yikai Zhao, Jiekai Ma
Abstract: Modern time series forecasting increasingly relies on complex ensemble models generated by AutoML systems like AutoGluon, delivering superior accuracy but with significant costs to transparency and interpretability. This paper introduces a comprehensive, dual-approach framework that addresses both the explainability and forecastability challenges in complex time series ensembles. First, we develop a surrogate-based explanation methodology that bridges the accuracy-interpretability gap by training a LightGBM model to faithfully mimic AutoGluon's time series forecasts, enabling stable SHAP-based feature attributions. We rigorously validated this approach through feature injection experiments, demonstrating remarkably high faithfulness between extracted SHAP values and known ground truth effects. Second, we integrated spectral predictability analysis to quantify each series' inherent forecastability. By comparing each time series' spectral predictability to its pure noise benchmarks, we established an objective mechanism to gauge confidence in forecasts and their explanations. Our empirical evaluation on the M5 dataset found that higher spectral predictability strongly correlates not only with improved forecast accuracy but also with higher fidelity between the surrogate and the original forecasting model. These forecastability metrics serve as effective filtering mechanisms and confidence scores, enabling users to calibrate their trust in both the forecasts and their explanations. We further demonstrated that per-item normalization is essential for generating meaningful SHAP explanations across heterogeneous time series with varying scales. The resulting framework delivers interpretable, instance-level explanations for state-of-the-art ensemble forecasts, while equipping users with forecastability metrics that serve as reliability indicators for both predictions and their explanations.
Authors: Gang Liu, Jie Chen, Yihan Zhu, Michael Sun, Tengfei Luo, Nitesh V Chawla, Meng Jiang
Abstract: In-context learning allows large models to adapt to new tasks from a few demonstrations, but it has shown limited success in molecular design. Existing databases such as ChEMBL contain molecular properties spanning millions of biological assays, yet labeled data for each property remain scarce. To address this limitation, we introduce demonstration-conditioned diffusion models (DemoDiff), which define task contexts using a small set of molecule-score examples instead of text descriptions. These demonstrations guide a denoising Transformer to generate molecules aligned with target properties. For scalable pretraining, we develop a new molecular tokenizer with Node Pair Encoding that represents molecules at the motif level, requiring 5.5$\times$ fewer nodes. We curate a dataset containing millions of context tasks from multiple sources covering both drugs and materials, and pretrain a 0.7-billion-parameter model on it. Across 33 design tasks in six categories, DemoDiff matches or surpasses language models 100-1000$\times$ larger and achieves an average rank of 3.63 compared to 5.25-10.20 for domain-specific approaches. These results position DemoDiff as a molecular foundation model for in-context molecular design. Our code is available at https://github.com/liugangcode/DemoDiff.
Authors: Yihao Ang, Peicheng Yao, Yifan Bao, Yushuo Feng, Qiang Huang, Anthony K. H. Tung, Zhiyong Huang
Abstract: Outlier detection in tabular data is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare, where anomalies can cause serious operational and economic impacts. Despite advances in both data mining and deep learning, many existing methods struggle with mixed-type tabular data, often relying on encoding schemes that lose important semantic information. Moreover, they frequently lack interpretability, offering little insight into which specific values cause anomalies. To overcome these challenges, we introduce \textsf{\textbf{RFOD}}, a novel \textsf{\textbf{R}}andom \textsf{\textbf{F}}orest-based \textsf{\textbf{O}}utlier \textsf{\textbf{D}}etection framework tailored for tabular data. Rather than modeling a global joint distribution, \textsf{RFOD} reframes anomaly detection as a feature-wise conditional reconstruction problem, training dedicated random forests for each feature conditioned on the others. This design robustly handles heterogeneous data types while preserving the semantic integrity of categorical features. To further enable precise and interpretable detection, \textsf{RFOD} combines Adjusted Gower's Distance (AGD) for cell-level scoring, which adapts to skewed numerical data and accounts for categorical confidence, with Uncertainty-Weighted Averaging (UWA) to aggregate cell-level scores into robust row-level anomaly scores. Extensive experiments on 15 real-world datasets demonstrate that \textsf{RFOD} consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability for mixed-type tabular data.
Authors: Christopher Yeh, Nicolas Christianson, Adam Wierman, Yisong Yue
Abstract: While deep learning models often achieve high predictive accuracy, their predictions typically do not come with any provable guarantees on risk or reliability, which are critical for deployment in high-stakes applications. The framework of conformal risk control (CRC) provides a distribution-free, finite-sample method for controlling the expected value of any bounded monotone loss function and can be conveniently applied post-hoc to any pre-trained deep learning model. However, many real-world applications are sensitive to tail risks, as opposed to just expected loss. In this work, we develop a method for controlling the general class of Optimized Certainty-Equivalent (OCE) risks, a broad class of risk measures which includes as special cases the expected loss (generalizing the original CRC method) and common tail risks like the conditional value-at-risk (CVaR). Furthermore, standard post-hoc CRC can degrade average-case performance due to its lack of feedback to the model. To address this, we introduce "conformal risk training," an end-to-end approach that differentiates through conformal OCE risk control during model training or fine-tuning. Our method achieves provable risk guarantees while demonstrating significantly improved average-case performance over post-hoc approaches on applications to controlling classifiers' false negative rate and controlling financial risk in battery storage operation.
Authors: Tinnakit Udsa, Can Udomcharoenchaikit, Patomporn Payoungkhamdee, Sarana Nutanong, Norrathep Rattanavipanon
Abstract: Federated learning (FL) enables collaborative training without raw data sharing, but still risks training data memorization. Existing FL memorization detection techniques focus on one sample at a time, underestimating more subtle risks of cross-sample memorization. In contrast, recent work on centralized learning (CL) has introduced fine-grained methods to assess memorization across all samples in training data, but these assume centralized access to data and cannot be applied directly to FL. We bridge this gap by proposing a framework that quantifies both intra- and inter-client memorization in FL using fine-grained cross-sample memorization measurement across all clients. Based on this framework, we conduct two studies: (1) measuring subtle memorization across clients and (2) examining key factors that influence memorization, including decoding strategies, prefix length, and FL algorithms. Our findings reveal that FL models do memorize client data, particularly intra-client data, more than inter-client data, with memorization influenced by training and inferencing factors.
Authors: Mujin Kwun, Depen Morwani, Chloe Huangyuan Su, Stephanie Gil, Nikhil Anand, Sham Kakade
Abstract: Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piece-wise constant, yielding zero gradients everywhere except at quantization thresholds where the derivative is undefined. Most existing methods deal with this issue by relaxing gradient computations with techniques like Straight Through Estimators (STE) and do not provide any guarantees of convergence. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous loss surface. In particular, we introduce LOTION, \textbf{L}ow-precision \textbf{O}ptimization via s\textbf{T}ochastic-no\textbf{I}se sm\textbf{O}othi\textbf{N}g, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the loss surface. Moreover, when using noise derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M- parameter language models.
Authors: Ayush Khot, Miruna Oprescu, Maresa Schr\"oder, Ai Kagawa, Xihaier Luo
Abstract: Causal inference in spatial domains faces two intertwined challenges: (1) unmeasured spatial factors, such as weather, air pollution, or mobility, that confound treatment and outcome, and (2) interference from nearby treatments that violate standard no-interference assumptions. While existing methods typically address one by assuming away the other, we show they are deeply connected: interference reveals structure in the latent confounder. Leveraging this insight, we propose the Spatial Deconfounder, a two-stage method that reconstructs a substitute confounder from local treatment vectors using a conditional variational autoencoder (CVAE) with a spatial prior, then estimates causal effects via a flexible outcome model. We show that this approach enables nonparametric identification of both direct and spillover effects under weak assumptions--without requiring multiple treatment types or a known model of the latent field. Empirically, we extend SpaCE, a benchmark suite for spatial confounding, to include treatment interference, and show that the Spatial Deconfounder consistently improves effect estimation across real-world datasets in environmental health and social science. By turning interference into a multi-cause signal, our framework bridges spatial and deconfounding literatures to advance robust causal inference in structured data.
Authors: David Fenwick, Navid NaderiAlizadeh, Vahid Tarokh, Nicholas Felice, Darin Clark, Jayasai Rajagopal, Anuj Kapadia, Benjamin Wildman-Tobriner, Ehsan Samei, Ehsan Abadi
Abstract: Protocol optimization is critical in Computed Tomography (CT) to achieve high diagnostic image quality while minimizing radiation dose. However, due to the complex interdependencies among CT acquisition and reconstruction parameters, traditional optimization methods rely on exhaustive testing of combinations of these parameters, which is often impractical. This study introduces a novel methodology that combines virtual imaging tools with reinforcement learning to optimize CT protocols more efficiently. Human models with liver lesions were imaged using a validated CT simulator and reconstructed with a novel CT reconstruction toolkit. The optimization parameter space included tube voltage, tube current, reconstruction kernel, slice thickness, and pixel size. The optimization process was performed using a Proximal Policy Optimization (PPO) agent, which was trained to maximize an image quality objective, specifically the detectability index (d') of liver lesions in the reconstructed images. Optimization performance was compared against an exhaustive search performed on a supercomputer. The proposed reinforcement learning approach achieved the global maximum d' across test cases while requiring 79.7% fewer steps than the exhaustive search, demonstrating both accuracy and computational efficiency. The proposed framework is flexible and can accommodate various image quality objectives. The findings highlight the potential of integrating virtual imaging tools with reinforcement learning for CT protocol management.
Authors: Francisco Pascoa, Ian Lalonde, Alexandre Girard
Abstract: Reinforcement learning (RL) policies often fail to generalize to new robots, tasks, or environments with different physical parameters, a challenge that limits their real-world applicability. This paper presents a simple, zero-shot transfer method based on Buckingham's Pi Theorem to address this limitation. The method adapts a pre-trained policy to new system contexts by scaling its inputs (observations) and outputs (actions) through a dimensionless space, requiring no retraining. The approach is evaluated against a naive transfer baseline across three environments of increasing complexity: a simulated pendulum, a physical pendulum for sim-to-real validation, and the high-dimensional HalfCheetah. Results demonstrate that the scaled transfer exhibits no loss of performance on dynamically similar contexts. Furthermore, on non-similar contexts, the scaled policy consistently outperforms the naive transfer, significantly expanding the volume of contexts where the original policy remains effective. These findings demonstrate that dimensional analysis provides a powerful and practical tool to enhance the robustness and generalization of RL policies.
Authors: Shikun Liu, Haoyu Wang, Mufei Li, Pan Li
Abstract: Text embeddings from Large Language Models (LLMs) have become foundational for numerous applications. However, these models typically operate on raw text, overlooking the rich structural information, such as hyperlinks or citations, that provides crucial context in many real-world datasets. This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings by integrating these structural relations directly into the LLM's internal encoding process, rather than relying on traditional post-hoc aggregation. We investigate two primary in-process methods: sequential concatenation and parallel caching. Through extensive zero-shot experiments across retrieval, clustering, classification, and recommendation tasks, we demonstrate that our structure-aware approaches consistently outperform both text-only and post-hoc baselines. Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors. To address the challenge of noisy structural data, we also introduce and validate two effective techniques: Context Distillation and Semantic Balancing. This work provides the first comprehensive analysis of in-process structure-aware encoding, offering a blueprint for building more powerful and contextually aware embedding models.
Authors: Vaibhav Jain, Gerrit Grossmann
Abstract: Reinforcement Learning (RL) agents often struggle in sparse-reward environments where traditional exploration strategies fail to discover effective action sequences. Large Language Models (LLMs) possess procedural knowledge and reasoning capabilities from text pretraining that could guide RL exploration, but existing approaches create rigid dependencies where RL policies must follow LLM suggestions or incorporate them directly into reward functions. We propose a framework that provides LLM-generated action recommendations through augmented observation spaces, allowing RL agents to learn when to follow or ignore this guidance. Our method leverages LLMs' world knowledge and reasoning abilities while maintaining flexibility through soft constraints. We evaluate our approach on three BabyAI environments of increasing complexity and show that the benefits of LLM guidance scale with task difficulty. In the most challenging environment, we achieve 71% relative improvement in final success rates over baseline. The approach provides substantial sample efficiency gains, with agents reaching performance thresholds up to 9 times faster, and requires no modifications to existing RL algorithms. Our results demonstrate an effective method for leveraging LLM planning capabilities to accelerate RL training in challenging environments.
Authors: Xinwen Hu, Yunqing Huang, Nianyu Yi, Peimeng Yin
Abstract: Neural network-based function approximation plays a pivotal role in the advancement of scientific computing and machine learning. Yet, training such models faces several challenges: (i) each target function often requires training a new model from scratch; (ii) performance is highly sensitive to architectural and hyperparameter choices; and (iii) models frequently generalize poorly beyond the training domain. To overcome these challenges, we propose a reusable initialization framework based on basis function pretraining. In this approach, basis neural networks are first trained to approximate families of polynomials on a reference domain. Their learned parameters are then used to initialize networks for more complex target functions. To enhance adaptability across arbitrary domains, we further introduce a domain mapping mechanism that transforms inputs into the reference domain, thereby preserving structural correspondence with the pretrained models. Extensive numerical experiments in one- and two-dimensional settings demonstrate substantial improvements in training efficiency, generalization, and model transferability, highlighting the promise of initialization-based strategies for scalable and modular neural function approximation. The full code is made publicly available on Gitee.
Authors: I. Arda Vurankaya, Mustafa O. Karabag, Wesley A. Suttle, Jesse Milzman, David Fridovich-Keil, Ufuk Topcu
Abstract: We consider a multi-armed bandit setting in which each arm has a public and a private reward distribution. An observer expects an agent to follow Thompson Sampling according to the public rewards, however, the deceptive agent aims to quickly identify the best private arm without being noticed. The observer can observe the public rewards and the pulled arms, but not the private rewards. The agent, on the other hand, observes both the public and private rewards. We formalize detectability as a stepwise Kullback-Leibler (KL) divergence constraint between the actual pull probabilities used by the agent and the anticipated pull probabilities by the observer. We model successful pulling of public suboptimal arms as a % Bernoulli process where the success probability decreases with each successful pull, and show these pulls can happen at most at a $\Theta(\sqrt{T}) $ rate under the KL constraint. We then formulate a maximin problem based on public and private means, whose solution characterizes the optimal error exponent for best private arm identification. We finally propose an algorithm inspired by top-two algorithms. This algorithm naturally adapts its exploration according to the hardness of pulling arms based on the public suboptimality gaps. We provide numerical examples illustrating the $\Theta(\sqrt{T}) $ rate and the behavior of the proposed algorithm.
Authors: Junyi Wu, Guang Lin
Abstract: We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov--Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov--Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch--trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers' equation with $\nu=0.01$, PO-CKAN reduces the mean relative $L^2$ error by approximately 48\% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion--reaction benchmarks.
Authors: Eshika Saxena, Alberto Alfarano, Fran\c{c}ois Charton, Emily Wenger, Kristin Lauter
Abstract: AI-powered attacks on Learning with Errors (LWE), an important hard math problem in post-quantum cryptography, rival or outperform "classical" attacks on LWE under certain parameter settings. Despite the promise of this approach, a dearth of accessible data limits AI practitioners' ability to study and improve these attacks. Creating LWE data for AI model training is time- and compute-intensive and requires significant domain expertise. To fill this gap and accelerate AI research on LWE attacks, we propose the TAPAS datasets, a Toolkit for Analysis of Post-quantum cryptography using AI Systems. These datasets cover several LWE settings and can be used off-the-shelf by AI practitioners to prototype new approaches to cracking LWE. This work documents TAPAS dataset creation, establishes attack performance baselines, and lays out directions for future work.
Authors: S M Rafiuddin
Abstract: Understanding learner emotions in online education is critical for improving engagement and personalized instruction. While prior work in emotion recognition has explored multimodal fusion and temporal modeling, existing methods often rely on static fusion strategies and assume that modality inputs are consistently reliable, which is rarely the case in real-world learning environments. We introduce Edu-EmotionNet, a novel framework that jointly models temporal emotion evolution and modality reliability for robust affect recognition. Our model incorporates three key components: a Cross-Modality Attention Alignment (CMAA) module for dynamic cross-modal context sharing, a Modality Importance Estimator (MIE) that assigns confidence-based weights to each modality at every time step, and a Temporal Feedback Loop (TFL) that leverages previous predictions to enforce temporal consistency. Evaluated on educational subsets of IEMOCAP and MOSEI, re-annotated for confusion, curiosity, boredom, and frustration, Edu-EmotionNet achieves state-of-the-art performance and demonstrates strong robustness to missing or noisy modalities. Visualizations confirm its ability to capture emotional transitions and adaptively prioritize reliable signals, making it well suited for deployment in real-time learning systems
Authors: Michal Podstawski
Abstract: Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from textual graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.
Authors: Fudong Lin, Xu Yuan
Abstract: The imbalance (or long-tail) is the nature of many real-world data distributions, which often induces the undesirable bias of deep classification models toward frequent classes, resulting in poor performance for tail classes. In this paper, we propose a novel two-stage learning approach to mitigate such a majority-biased tendency while preserving valuable information within datasets. Specifically, the first stage proposes a new representation learning technique from the information theory perspective. This approach is theoretically equivalent to minimizing intra-class distance, yielding an effective and well-separated feature space. The second stage develops a novel sampling strategy that selects mathematically informative instances, able to rectify majority-biased decision boundaries without compromising a model's overall performance. As a result, our approach achieves the state-of-the-art performance across various long-tailed benchmark datasets, validated via extensive experiments. Our code is available at https://github.com/fudong03/BNS_IPDPP.
Authors: Motahare Mounesan, Sourya Saha, Houchao Gan, Md. Nurul Absur, Saptarshi Debroy
Abstract: Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
Authors: Thai-Hoang Pham, Jiayuan Chen, Seungyeon Lee, Yuanlong Wang, Sayoko Moroi, Xueru Zhang, Ping Zhang
Abstract: As machine learning (ML) algorithms are increasingly used in medical image analysis, concerns have emerged about their potential biases against certain social groups. Although many approaches have been proposed to ensure the fairness of ML models, most existing works focus only on medical image diagnosis tasks, such as image classification and segmentation, and overlooked prognosis scenarios, which involve predicting the likely outcome or progression of a medical condition over time. To address this gap, we introduce FairTTE, the first comprehensive framework for assessing fairness in time-to-event (TTE) prediction in medical imaging. FairTTE encompasses a diverse range of imaging modalities and TTE outcomes, integrating cutting-edge TTE prediction and fairness algorithms to enable systematic and fine-grained analysis of fairness in medical image prognosis. Leveraging causal analysis techniques, FairTTE uncovers and quantifies distinct sources of bias embedded within medical imaging datasets. Our large-scale evaluation reveals that bias is pervasive across different imaging modalities and that current fairness methods offer limited mitigation. We further demonstrate a strong association between underlying bias sources and model disparities, emphasizing the need for holistic approaches that target all forms of bias. Notably, we find that fairness becomes increasingly difficult to maintain under distribution shifts, underscoring the limitations of existing solutions and the pressing need for more robust, equitable prognostic models.
Authors: Achleshwar Luthra, Priyadarsi Mishra, Tomer Galanti
Abstract: Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: {\em Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives?} We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes, higher temperatures, and its dependence on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time. Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code and project page are available at [\href{https://github.com/DLFundamentals/understanding_ssl_v2}{code}, \href{https://dlfundamentals.github.io/cl-nscl-representation-alignment/}{project page}].
URLs: https://github.com/DLFundamentals/understanding_ssl_v2, https://dlfundamentals.github.io/cl-nscl-representation-alignment/
Authors: T. Ed Li, Junyu Ren
Abstract: Understanding the internal representations of large language models is crucial for ensuring their reliability and safety, with sparse autoencoders (SAEs) emerging as a promising interpretability approach. However, current SAE training methods face feature absorption, where features (or neurons) are absorbed into each other to minimize $L_1$ penalty, making it difficult to consistently identify and analyze model behaviors. We introduce Adaptive Temporal Masking (ATM), a novel training approach that dynamically adjusts feature selection by tracking activation magnitudes, frequencies, and reconstruction contributions to compute importance scores that evolve over time. ATM applies a probabilistic masking mechanism based on statistical thresholding of these importance scores, creating a more natural feature selection process. Through extensive experiments on the Gemma-2-2b model, we demonstrate that ATM achieves substantially lower absorption scores compared to existing methods like TopK and JumpReLU SAEs, while maintaining excellent reconstruction quality. These results establish ATM as a principled solution for learning stable, interpretable features in neural networks, providing a foundation for more reliable model analysis.
Authors: Ammar I Marvi, Nancy G Kanwisher, Meenakshi Khosla
Abstract: The ventral, dorsal, and lateral streams in high-level human visual cortex are implicated in distinct functional processes. Yet, deep neural networks (DNNs) trained on a single task model the entire visual system surprisingly well, hinting at common computational principles across these pathways. To explore this inconsistency, we applied a novel sparse decomposition approach to identify the dominant components of visual representations within each stream. Consistent with traditional neuroscience research, we find a clear difference in component response profiles across the three visual streams -- identifying components selective for faces, places, bodies, text, and food in the ventral stream; social interactions, implied motion, and hand actions in the lateral stream; and some less interpretable components in the dorsal stream. Building on this, we introduce Sparse Component Alignment (SCA), a new method for measuring representational alignment between brains and machines that better captures the latent neural tuning of these two visual systems. Using SCA, we find that standard visual DNNs are more aligned with the ventral than either dorsal or lateral representations. SCA reveals these distinctions with greater resolution than conventional population-level geometry, offering a measure of representational alignment that is sensitive to a system's underlying axes of neural tuning.
Authors: Murray Cutforth, Yiming Yang, Tiffany Fan, Serge Guillas, Eric Darve
Abstract: Many science and engineering problems rely on expensive computational simulations, where a multi-fidelity approach can accelerate the exploration of a parameter space. We study efficient allocation of a simulation budget using a Gaussian Process (GP) model in the binary simulation output case. This paper introduces Bernoulli Parameter Mutual Information (BPMI), a batch active learning algorithm for multi-fidelity GP classifiers. BPMI circumvents the intractability of calculating mutual information in the probability space by employing a first-order Taylor expansion of the link function. We evaluate BPMI against several baselines on two synthetic test cases and a complex, real-world application involving the simulation of a laser-ignited rocket combustor. In all experiments, BPMI demonstrates superior performance, achieving higher predictive accuracy for a fixed computational budget.
Authors: Haolin Liu, Chen-Yu Wei, Julian Zimmert
Abstract: We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting. We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025). We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from $T^{2/3}$ to $\sqrt{T}$. This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs.
Authors: Junxi Yin, Haisen Luo, Zhenyu Li, Yihua Liu, Dan Liu, Zequn Li, Xiaohang Xu
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This leads to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse, limiting model performance. To address this, we introduce Attribution-based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty-aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation to dynamically regulate policy entropy, thus mitigating its collapse. Concurrently, it enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step, ensuring accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state-of-the-art approaches.
Authors: Di Zhang
Abstract: The stochastic multi-armed bandit (MAB) problem is one of the most fundamental models in sequential decision-making, with the core challenge being the trade-off between exploration and exploitation. Although algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling, along with their regret theories, are well-established, existing analyses primarily operate from a time-domain and cumulative regret perspective, struggling to characterize the dynamic nature of the learning process. This paper proposes a novel frequency-domain analysis framework, reformulating the bandit process as a signal processing problem. Within this framework, the reward estimate of each arm is viewed as a spectral component, with its uncertainty corresponding to the component's frequency, and the bandit algorithm is interpreted as an adaptive filter. We construct a formal Frequency-Domain Bandit Model and prove the main theorem: the confidence bound term in the UCB algorithm is equivalent in the frequency domain to a time-varying gain applied to uncertain spectral components, a gain inversely proportional to the square root of the visit count. Based on this, we further derive finite-time dynamic bounds concerning the exploration rate decay. This theory not only provides a novel and intuitive physical interpretation for classical algorithms but also lays a rigorous theoretical foundation for designing next-generation algorithms with adaptive parameter adjustment.
Authors: Maoxin Ji, Tong Wang, Qiong Wu, Pingyi Fan, Nan Cheng, Wen Chen
Abstract: Addressing the problem of Age of Information (AoI) deterioration caused by packet collisions and vehicle speed-related channel uncertainties in Semi-Persistent Scheduling (SPS) for the Internet of Vehicles (IoV), this letter proposes an optimization approach based on Large Language Models (LLM) and Deep Deterministic Policy Gradient (DDPG). First, an AoI calculation model influenced by vehicle speed, vehicle density, and Resource Reservation Interval (RRI) is established, followed by the design of a dual-path optimization scheme. The DDPG is guided by the state space and reward function, while the LLM leverages contextual learning to generate optimal parameter configurations. Experimental results demonstrate that LLM can significantly reduce AoI after accumulating a small number of exemplars without requiring model training, whereas the DDPG method achieves more stable performance after training.
Authors: Yuting Yang, Gang Mei, Zhengjing Ma, Nengxiong Xu, Jianbing Peng
Abstract: Small Earth data are geoscience observations with limited short-term monitoring variability, providing sparse but meaningful measurements, typically exhibiting spatiotemporal correlations. Spatiotemporal forecasting on such data is crucial for understanding geoscientific processes despite their small scale. However, conventional deep learning models for spatiotemporal forecasting requires task-specific training for different scenarios. Foundation models do not need task-specific training, but they often exhibit forecasting bias toward the global mean of the pretraining distribution. Here we propose a simple and robust approach for spatiotemporally correlated small Earth data forecasting. The essential idea is to characterize and quantify spatiotemporal patterns of small Earth data and then utilize tabular foundation models for accurate forecasting across different scenarios. Comparative results across three typical scenarios demonstrate that our forecasting approach achieves superior accuracy compared to the graph deep learning model (T-GCN) and tabular foundation model (TabPFN) in the majority of instances, exhibiting stronger robustness.
Authors: Jonah Botvinick-Greenhouse, Wael H. Ali, Mouhacine Benosman, Saviz Mowlavi
Abstract: We introduce adaptive-basis physics-informed neural networks (AB-PINNs), a novel approach to domain decomposition for training PINNs in which existing subdomains dynamically adapt to the intrinsic features of the unknown solution. Drawing inspiration from classical mesh refinement techniques, we also modify the domain decomposition on-the-fly throughout training by introducing new subdomains in regions of high residual loss, thereby providing additional expressive power where the solution of the differential equation is challenging to represent. Our flexible approach to domain decomposition is well-suited for multiscale problems, as different subdomains can learn to capture different scales of the underlying solution. Moreover, the ability to introduce new subdomains during training helps prevent convergence to unwanted local minima and can reduce the need for extensive hyperparameter tuning compared to static domain decomposition approaches. Throughout, we present comprehensive numerical results which demonstrate the effectiveness of AB-PINNs at solving a variety of complex multiscale partial differential equations.
Authors: Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, Xiaoyi Zeng
Abstract: Recently, a growing body of research has focused on either optimizing CTR model architectures to better model feature interactions or refining training objectives to aid parameter learning, thereby achieving better predictive performance. However, previous efforts have primarily focused on the training phase, largely neglecting opportunities for optimization during the inference phase. Infrequently occurring feature combinations, in particular, can degrade prediction performance, leading to unreliable or low-confidence outputs. To unlock the predictive potential of trained CTR models, we propose a Model-Agnostic Test-Time paradigm (MATT), which leverages the confidence scores of feature combinations to guide the generation of multiple inference paths, thereby mitigating the influence of low-confidence features on the final prediction. Specifically, to quantify the confidence of feature combinations, we introduce a hierarchical probabilistic hashing method to estimate the occurrence frequencies of feature combinations at various orders, which serve as their corresponding confidence scores. Then, using the confidence scores as sampling probabilities, we generate multiple instance-specific inference paths through iterative sampling and subsequently aggregate the prediction scores from multiple paths to conduct robust predictions. Finally, extensive offline experiments and online A/B tests strongly validate the compatibility and effectiveness of MATT across existing CTR models.
Authors: Zhen Yang, Yansong Ma, Lei Chen
Abstract: Traditional Evidence Deep Learning (EDL) methods rely on static hyperparameter for uncertainty calibration, limiting their adaptability in dynamic data distributions, which results in poor calibration and generalization in high-risk decision-making tasks. To address this limitation, we propose the Meta-Policy Controller (MPC), a dynamic meta-learning framework that adjusts the KL divergence coefficient and Dirichlet prior strengths for optimal uncertainty modeling. Specifically, MPC employs a bi-level optimization approach: in the inner loop, model parameters are updated through a dynamically configured loss function that adapts to the current training state; in the outer loop, a policy network optimizes the KL divergence coefficient and class-specific Dirichlet prior strengths based on multi-objective rewards balancing prediction accuracy and uncertainty quality. Unlike previous methods with fixed priors, our learnable Dirichlet prior enables flexible adaptation to class distributions and training dynamics. Extensive experimental results show that MPC significantly enhances the reliability and calibration of model predictions across various tasks, improving uncertainty calibration, prediction accuracy, and performance retention after confidence-based sample rejection.
Authors: Haroon Gharwi, Kai Shu
Abstract: Real-world time series data exhibit non-stationary behavior, regime shifts, and temporally varying noise (heteroscedastic) that degrade the robustness of standard regression models. We introduce the Variability-Aware Recursive Neural Network (VARNN), a novel residual-aware architecture for supervised time-series regression that learns an explicit error memory from recent prediction residuals and uses it to recalibrate subsequent predictions. VARNN augments a feed-forward predictor with a learned error-memory state that is updated from residuals over a short context steps as a signal of variability and drift, and then conditions the final prediction at the current time step. Across diverse dataset domains, appliance energy, healthcare, and environmental monitoring, experimental results demonstrate VARNN achieves superior performance and attains lower test MSE with minimal computational overhead over static, dynamic, and recurrent baselines. Our findings show that the VARNN model offers robust predictions under a drift and volatility environment, highlighting its potential as a promising framework for time-series learning.
Authors: Zhihan Zhang, Xunkai Li, Yilong Zuo, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
Abstract: Text-attributed graphs (TAGs) have emerged as a powerful representation that combines structural connections with fine-grained semantics, supporting a wide range of data-centric applications. However, the performance of graph neural networks (GNNs) on TAGs is highly sensitive to input quality. Our empirical study shows that both traditional GNNs and LLM-enhanced GNNs suffer significant degradation across nine representative scenarios of sparsity, noise, and imbalance, highlighting graph quality as a critical bottleneck. Existing approaches mainly focus on improving model architectures, while neglecting systematic optimization of TAG data itself, leading to limited effectiveness in practice. To address this gap, we propose LAGA (Large Language and Graph Agent), a unified multi-agent framework that treats graph quality control as a first-class, data-centric problem. LAGA integrates four collaborative agents-detection, planning, action, and evaluation-into an automated closed loop. At its core, the action agent employs a dual-encoder and tri-objective design to capture complementary information across modalities and perform holistic graph quality enhancement. Experiments across nine scenarios show that LAGA improves graph quality and achieves state-of-the-art performance across various tasks and backbones, validating data-centric quality optimization as key to reliable TAGs and robust graph learning.
Authors: Xiaofeng Cao, Mingwei Xu, Xin Yu, Jiangchao Yao, Wei Ye, Shengjun Huang, Minling Zhang, Ivor W. Tsang, Yew Soon Ong, James T. Kwok, Heng Tao Shen
Abstract: Learning with high-resource data has demonstrated substantial success in artificial intelligence (AI); however, the costs associated with data annotation and model training remain significant. A fundamental objective of AI research is to achieve robust generalization with limited-resource data. This survey employs agnostic active sampling theory within the Probably Approximately Correct (PAC) framework to analyze the generalization error and label complexity associated with learning from low-resource data in both model-agnostic supervised and unsupervised settings. Based on this analysis, we investigate a suite of optimization strategies tailored for low-resource data learning, including gradient-informed optimization, meta-iteration optimization, geometry-aware optimization, and LLMs-powered optimization. Furthermore, we provide a comprehensive overview of multiple learning paradigms that can benefit from low-resource data, including domain transfer, reinforcement feedback, and hierarchical structure modeling. Finally, we conclude our analysis and investigation by summarizing the key findings and highlighting their implications for learning with low-resource data.
Authors: Junyu Xuan, Wenlong Chen, Yingzhen Li
Abstract: Bayesian Optimisation (BO) is a powerful tool for optimising expensive blackbox functions but its effectiveness diminishes in highdimensional spaces due to sparse data and poor surrogate model scalability While Variational Autoencoder (VAE) based approaches address this by learning low-dimensional latent representations the reconstructionbased objective function often brings the functional distribution mismatch between the latent space and original space leading to suboptimal optimisation performance In this paper we first analyse the reason why reconstructiononly loss may lead to distribution mismatch and then propose HiBBO a novel BO framework that introduces the space consistency into the latent space construction in VAE using HiPPO - a method for longterm sequence modelling - to reduce the functional distribution mismatch between the latent space and original space Experiments on highdimensional benchmark tasks demonstrate that HiBBO outperforms existing VAEBO methods in convergence speed and solution quality Our work bridges the gap between high-dimensional sequence representation learning and efficient Bayesian Optimisation enabling broader applications in neural architecture search materials science and beyond.
Authors: Suraj Kumar Sahoo, Narayanan C Krishnan
Abstract: Learned Optimizers (LOs), a type of Meta-learning, have gained traction due to their ability to be parameterized and trained for efficient optimization. Traditional gradient-based methods incorporate explicit regularization techniques such as Sharpness-Aware Minimization (SAM), Gradient-norm Aware Minimization (GAM), and Gap-guided Sharpness-Aware Minimization (GSAM) to enhance generalization and convergence. In this work, we explore a fundamental question: \textbf{Can regularizers be learned?} We empirically demonstrate that LOs can be trained to learn and internalize the effects of traditional regularization techniques without explicitly applying them to the objective function. We validate this through extensive experiments on standard benchmarks (including MNIST, FMNIST, CIFAR and Neural Networks such as MLP, MLP-Relu and CNN), comparing LOs trained with and without access to explicit regularizers. Regularized LOs consistently outperform their unregularized counterparts in terms of test accuracy and generalization. Furthermore, we show that LOs retain and transfer these regularization effects to new optimization tasks by inherently seeking minima similar to those targeted by these regularizers. Our results suggest that LOs can inherently learn regularization properties, \textit{challenging the conventional necessity of explicit optimizee loss regularization.
Authors: Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
Abstract: Reinforcement learning with verifiable rewards (RLVR) scales the reasoning ability of large language models (LLMs) but remains bottlenecked by limited labeled samples for continued data scaling. Reinforcement learning with intrinsic rewards (RLIR), where the policy model assigns rewards to its own rollouts, enables sustainable scaling in unlabeled settings, yet its performance and stability lag behind RLVR. We trace this gap to a system bias: the model tends to overestimate its high-confidence rollouts, leading to biased and unstable reward estimation. This bias accumulates as training progresses, with deviations from the oracle drifting toward over-reward, causing unstable training. We characterize this bias using three metrics: $\rho_{\text{noise}}$, $\rho_{\text{selfbias}}$, and $\rho_{\text{symbias}}$. We find that $\rho_{\text{noise}}$ and $\rho_{\text{symbias}}$ impact convergence, while $\rho_{\text{selfbias}}$ amplifies both correct and incorrect updates, leading to instability. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models and adapts reward interpolation and rollout selection. Extensive experiments show that RLER improves by +13.6% over RLIR and is only 3.6% below RLVR, achieving stable scaling on unlabeled samples, making it highly applicable.
Authors: Jionghao Lou, Jian Zhang, Zhongmei Li, Lanlan Chen, Enbo Feng
Abstract: The training of deep learning models in seizure prediction requires large amounts of Electroencephalogram (EEG) data. However, acquiring sufficient labeled EEG data is difficult due to annotation costs and privacy constraints. Federated Learning (FL) enables privacy-preserving collaborative training by sharing model updates instead of raw data. However, due to the inherent inter-patient variability in real-world scenarios, existing FL-based seizure prediction methods struggle to achieve robust performance under heterogeneous client settings. To address this challenge, we propose FedL2T, a personalized federated learning framework that leverages a novel two-teacher knowledge distillation strategy to generate superior personalized models for each client. Specifically, each client simultaneously learns from a globally aggregated model and a dynamically assigned peer model, promoting more direct and enriched knowledge exchange. To ensure reliable knowledge transfer, FedL2T employs an adaptive multi-level distillation strategy that aligns both prediction outputs and intermediate feature representations based on task confidence. In addition, a proximal regularization term is introduced to constrain personalized model updates, thereby enhancing training stability. Extensive experiments on two EEG datasets demonstrate that FedL2T consistently outperforms state-of-the-art FL methods, particularly under low-label conditions. Moreover, FedL2T exhibits rapid and stable convergence toward optimal performance, thereby reducing the number of communication rounds and associated overhead. These results underscore the potential of FedL2T as a reliable and personalized solution for seizure prediction in privacy-sensitive healthcare scenarios.
Authors: Kamel Alrashedy, Vriksha Srihari, Zulfiqar Zaidi, Ridam Srivastava, Pradyumna Tambwekar, Matthew Gombolay
Abstract: While researchers have made significant progress in enabling large language models (LLMs) to perform multi-step planning, LLMs struggle to ensure that those plans align with high-level user intent and satisfy symbolic constraints, especially in complex, multi-step domains. Existing reasoning approaches such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and verifier-augmented methods, expand the search space but often yield infeasible actions or hallucinated steps. To overcome these limitations, we propose Constraints-of-Thought (Const-o-T), a framework that provides a structured prior that enables Monte Carlo Tree Search (MCTS) focus search on semantically meaningful paths. Each reasoning step is represented as an (intent, constraint) pair, which serves both to compress the search space and enforce validity. Unlike prior methods that merely generate reasoning traces or validate outputs post hoc, Const-o-T uses (intent, constraint)pairs to actively focus the search toward feasible and meaningful plans. We integrate Const-o-T into MCTS using a structured representation of intent-constraint pairs constraints prune infeasible branches and guide exploration toward semantically valid actions, improving planning efficiency and verifiable decision-making. We demonstrate across three domains Risk game, CAD code generation, and arithmetic reasoning that our approach outperforms baselines, yielding higher accuracy and stronger structural alignment. Our contribution is to demonstrate that Const-of-T offers a generalizable foundation for constraint-guided reasoning, enabling more efficient, constraint-aligned, and domain-adaptable planning with LLMs.
Authors: Xiaolong Tu, Dawei Chen, Kyungtae Han, Onur Altintas, Haoxin Wang
Abstract: Hardware-Aware Neural Architecture Search (HW-NAS) has emerged as a powerful tool for designing efficient deep neural networks (DNNs) tailored to edge devices. However, existing methods remain largely impractical for real-world deployment due to their high time cost, extensive manual profiling, and poor scalability across diverse hardware platforms with complex, device-specific energy behavior. In this paper, we present PlatformX, a fully automated and transferable HW-NAS framework designed to overcome these limitations. PlatformX integrates four key components: (i) an energy-driven search space that expands conventional NAS design by incorporating energy-critical configurations, enabling exploration of high-efficiency architectures; (ii) a transferable kernel-level energy predictor across devices and incrementally refined with minimal on-device samples; (iii) a Pareto-based multi-objective search algorithm that balances energy and accuracy to identify optimal trade-offs; and (iv) a high-resolution runtime energy profiling system that automates on-device power measurement using external monitors without human intervention. We evaluate PlatformX across multiple mobile platforms, showing that it significantly reduces search overhead while preserving accuracy and energy fidelity. It identifies models with up to 0.94 accuracy or as little as 0.16 mJ per inference, both outperforming MobileNet-V2 in accuracy and efficiency. Code and tutorials are available at github.com/amai-gsu/PlatformX.
Authors: Ziyi Wang, Nan Jiang, Guang Lin, Qifan Song
Abstract: Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods adopt weight pruning or low-bit quantization individually, often resulting in suboptimal compression rates to preserve acceptable performance drops. We introduce a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning (SQS), which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to inducing sparsity and model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we provide the consistent result of our proposed variational approach to a sparse and quantized deep neural network. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a line of existing methods with comparable performance drops.
Authors: Changsheng Wang, Yihua Zhang, Dennis Wei, Jinghan Jia, Pin-Yu Chen, Sijia Liu
Abstract: Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data, reinforcing biases, and producing harmful content. These risks have spurred interest in LLM unlearning, the task of removing knowledge associated with undesirable data from pre-trained models. However, most existing methods assume access to clean, well-defined forget data samples, whereas real-world forget data could often be low-quality, synthetically rewritten, or watermarked, casting doubt on the reliability of unlearning. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. By systematically benchmarking state-of-the-art LLM unlearning methods, RMU and NPO, on such noisy forget sets, we find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved. To explain this robustness, we propose a saliency-based interpretation: key semantic components that drive forgetting remain consistently influential despite substantial variation in surface form. This suggests that unlearning algorithms are primarily guided by deep semantic cues rather than shallow lexical patterns.
Authors: Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li
Abstract: Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains. These issues, which degrade model performance, quantization fidelity, and interpretability, arise from a problematic mutual reinforcement mechanism where the model learns an inefficient 'no-op' behavior by focusing attention on tokens with near-zero value states. In this paper, we propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism for performing 'no-op' attention efficiently by directly breaking this cycle. VGA introduces a learnable, data-dependent gate, computed directly from the value vectors (V), to modulate the output. Through a theoretical analysis of the underlying gradients, we show that gating the value-state with a function of itself is more effective at decoupling value and attention score updates than prior methods that gate on input embeddings. This creates a direct regulatory pathway that allows the model to suppress a token's contribution based on its emergent value representation. Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.
Authors: Ian Harshbarger, Calvin Chidambaram
Abstract: Most neural network scheduling research focuses on optimizing static, end-to-end models of fixed width, overlooking dynamic approaches that adapt to heterogeneous hardware and fluctuating runtime conditions. We present Slim Scheduler, a hybrid scheduling framework that integrates a Proximal Policy Optimization (PPO) reinforcement learning policy with algorithmic, greedy schedulers to coordinate distributed inference for slimmable models. Each server runs a local greedy scheduler that batches compatible requests and manages instance scaling based on VRAM and utilization constraints, while the PPO router learns global routing policies for device selection, width ratio, and batch configuration. This hierarchical design reduces search space complexity, mitigates overfitting to specific hardware, and balances efficiency and throughput. Compared to a purely randomized task distribution baseline, Slim Scheduler can achieve various accuracy and latency trade-offs such as: A 96.45% reduction in mean latency and a 97.31% reduction in energy usage dropping accuracy to the slimmest model available (70.3%). It can then accomplish an overall reduction in average latency plus energy consumption with an increase in accuracy at the cost of higher standard deviations of said latency and energy, effecting overall task throughput.
Authors: Zekai Chen, Xunkai Li, Sirui Zhang, Henan Sun, Jia Li, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
Abstract: De novo ligand design is a fundamental task that seeks to generate protein or molecule candidates that can effectively dock with protein receptors and achieve strong binding affinity entirely from scratch. It holds paramount significance for a wide spectrum of biomedical applications. However, most existing studies are constrained by the \textbf{Pseudo De Novo}, \textbf{Limited Docking Modeling}, and \textbf{Inflexible Ligand Type}. To address these issues, we propose MagicDock, a forward-looking framework grounded in the progressive pipeline and differentiable surface modeling. (1) We adopt a well-designed gradient inversion framework. To begin with, general docking knowledge of receptors and ligands is incorporated into the backbone model. Subsequently, the docking knowledge is instantiated as reverse gradient flows by binding prediction, which iteratively guide the de novo generation of ligands. (2) We emphasize differentiable surface modeling in the docking process, leveraging learnable 3D point-cloud representations to precisely capture binding details, thereby ensuring that the generated ligands preserve docking validity through direct and interpretable spatial fingerprints. (3) We introduce customized designs for different ligand types and integrate them into a unified gradient inversion framework with flexible triggers, thereby ensuring broad applicability. Moreover, we provide rigorous theoretical guarantees for each component of MagicDock. Extensive experiments across 9 scenarios demonstrate that MagicDock achieves average improvements of 27.1\% and 11.7\% over SOTA baselines specialized for protein or molecule ligand design, respectively.
Authors: Cl\'ement Morand (STL), Anne-Laure Ligozat (ENSIIE, LISN, STL), Aur\'elie N\'ev\'eol (STL, LISN)
Abstract: Recent Machine Learning (ML) approaches have shown increased performance on benchmarks but at the cost of escalating computational demands. Hardware, algorithmic and carbon optimizations have been proposed to curb energy consumption and environmental impacts. Can these strategies lead to sustainable ML model training? Here, we estimate the environmental impacts associated with training notable AI systems over the last decade, including Large Language Models, with a focus on the life cycle of graphics cards. Our analysis reveals two critical trends: First, the impacts of graphics cards production have increased steadily over this period; Second, energy consumption and environmental impacts associated with training ML models have increased exponentially, even when considering reduction strategies such as location shifting to places with less carbon intensive electricity mixes. Optimization strategies do not mitigate the impacts induced by model training, evidencing rebound effect. We show that the impacts of hardware must be considered over the entire life cycle rather than the sole use phase in order to avoid impact shifting. Our study demonstrates that increasing efficiency alone cannot ensure sustainability in ML. Mitigating the environmental impact of AI also requires reducing AI activities and questioning the scale and frequency of resource-intensive training.
Authors: Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tram\`er
Abstract: How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques-gradient descent, reinforcement learning, random search, and human-guided exploration-we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.
Authors: Jerome Bolte (TSE-R), Quoc-Tung Le (UGA, LJK), Edouard Pauwels (TSE-R)
Abstract: Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard--Perron stable manifold theorem and applies to general semialgebraic $C^2$ functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.
Authors: Junchao Fan, Xiaolin Chang
Abstract: Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their abilities to respond to more strategic threats, (ii) they have trouble causing truly safety-critical events (e.g., collisions), but instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9\% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.
Authors: Jose Tupayachi, Mustafa C. Camur, Kevin Heaslip, Xueping Li
Abstract: Transportation remains a major contributor to greenhouse gas emissions, highlighting the urgency of transitioning toward sustainable alternatives such as electric vehicles (EVs). Yet, uneven spatial distribution and irregular utilization of charging infrastructure create challenges for both power grid stability and investment planning. This study introduces TW-GCN, a spatio-temporal forecasting framework that combines Graph Convolutional Networks with temporal architectures to predict EV charging demand in Tennessee, United States (U.S.). We utilize real-world traffic flows, weather conditions, and proprietary data provided by one of the largest EV infrastructure company in the U.S. to capture both spatial dependencies and temporal dynamics. Extensive experiments across varying lag horizons, clustering strategies, and sequence lengths reveal that mid-horizon (3-hour) forecasts achieve the best balance between responsiveness and stability, with 1DCNN consistently outperforming other temporal models. Regional analysis shows disparities in predictive accuracy across East, Middle, and West Tennessee, reflecting how station density, population, and local demand variability shape model performance. The proposed TW-GCN framework advances the integration of data-driven intelligence into EV infrastructure planning, supporting both sustainable mobility transitions and resilient grid management.
Authors: Emilio Mastriani, Alessandro Costa, Federico Incardona, Kevin Munari, Sebastiano Spinello
Abstract: Concerning machine learning, segmentation models can identify state changes within time series, facilitating the detection of transitions between normal and anomalous conditions. Specific techniques such as Change Point Detection (CPD), particularly algorithms like ChangeFinder, have been successfully applied to segment time series and improve anomaly detection by reducing temporal uncertainty, especially in multivariate environments. In this work, we explored how the integration of segmentation techniques, combined with a heterogeneous ensemble, can enhance anomaly detection in an industrial production context. The results show that applying segmentation as a pre-processing step before selecting heterogeneous ensemble algorithms provided a significant advantage in our case study, improving the AUC-ROC metric from 0.8599 (achieved with a PCA and LSTM ensemble) to 0.9760 (achieved with Random Forest and XGBoost). This improvement is imputable to the ability of segmentation to reduce temporal ambiguity and facilitate the learning process of supervised algorithms. In our future work, we intend to assess the benefit of introducing weighted features derived from the study of change points, combined with segmentation and the use of heterogeneous ensembles, to further optimize model performance in early anomaly detection.
Authors: Atul Shree, Harshith Jupuru
Abstract: CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.
Authors: Kleanthis Avramidis, Tiantian Feng, Woojae Jeong, Jihwan Lee, Wenhui Cui, Richard M Leahy, Shrikanth Narayanan
Abstract: Neurophysiological recordings such as electroencephalography (EEG) offer accessible and minimally invasive means of estimating physiological activity for applications in healthcare, diagnostic screening, and even immersive entertainment. However, these recordings yield high-dimensional, noisy time-series data that typically require extensive pre-processing and handcrafted feature extraction to reveal meaningful information. Recently, there has been a surge of interest in applying representation learning techniques from large pre-trained (foundation) models to effectively decode and interpret biosignals. We discuss the challenges posed for incorporating such methods and introduce BioCodec, an alternative representation learning framework inspired by neural codecs to capture low-level signal characteristics in the form of discrete tokens. Pre-trained on thousands of EEG hours, BioCodec shows efficacy across multiple downstream tasks, ranging from clinical diagnostic tasks and sleep physiology to decoding speech and motor imagery, particularly in low-resource settings. Additionally, we provide a qualitative analysis of codebook usage and estimate the spatial coherence of codebook embeddings from EEG connectivity. Notably, we also document the suitability of our method to other biosignal data, i.e., electromyographic (EMG) signals. Overall, the proposed approach provides a versatile solution for biosignal tokenization that performs competitively with state-of-the-art models. The source code and model checkpoints are shared.
Authors: Yimu Zhang, Yuanshi Liu, Cong Fang
Abstract: In the training of large language models, momentum is widely used and often demonstrated to achieve significant acceleration. However, storing momentum typically presents memory challenges. In this paper, we propose AdaPM, an adaptive training strategy that leverages partial momentum to implement a memory-efficient optimizer. To this end, AdaPM utilizes a non-uniform momentum design: for most blocks, full momentum is not necessary to preserve the performance of the optimization. In the momentum design of AdaPM, to mitigate the bias and performance loss caused by partial momentum, we enhance the partial momentum by a bias correction technique. Empirically, we verify that our approach reduces memory by over $90\%$ in momentum while maintaining both efficiency and performance for pretraining various language models ranging from 60M to 1.5B, as well as for supervised fine-tuning and RLHF. AdaPM can further reduce memory by up to $95\%$ in optimizer states by combining the memory-efficient technique on the second-order statistic, saving over $30\%$ GPU hours for pretraining GPT-2 1.5B.
Authors: Soroush Mahdi, Maryam Amirmazlaghani, Saeed Saravani, Zahra Dehghanian
Abstract: In this paper, we propose a new approach called MemLoss to improve the adversarial training of machine learning models. MemLoss leverages previously generated adversarial examples, referred to as 'Memory Adversarial Examples,' to enhance model robustness and accuracy without compromising performance on clean data. By using these examples across training epochs, MemLoss provides a balanced improvement in both natural accuracy and adversarial robustness. Experimental results on multiple datasets, including CIFAR-10, demonstrate that our method achieves better accuracy compared to existing adversarial training methods while maintaining strong robustness against attacks.
Authors: Zhi Yang, Changwu Huang, Ke Tang, Xin Yao
Abstract: While significant progress has been made in conventional fairness-aware machine learning (ML) and differentially private ML (DPML), the fairness of privacy protection across groups remains underexplored. Existing studies have proposed methods to assess group privacy risks, but these are based on the average-case privacy risks of data records. Such approaches may underestimate the group privacy risks, thereby potentially underestimating the disparity across group privacy risks. Moreover, the current method for assessing the worst-case privacy risks of data records is time-consuming, limiting their practical applicability. To address these limitations, we introduce a novel membership inference game that can efficiently audit the approximate worst-case privacy risks of data records. Experimental results demonstrate that our method provides a more stringent measurement of group privacy risks, yielding a reliable assessment of the disparity in group privacy risks. Furthermore, to promote privacy protection fairness in DPML, we enhance the standard DP-SGD algorithm with an adaptive group-specific gradient clipping strategy, inspired by the design of canaries in differential privacy auditing studies. Extensive experiments confirm that our algorithm effectively reduces the disparity in group privacy risks, thereby enhancing the fairness of privacy protection in DPML.
Authors: Orin Levy, Liad Erez, Alon Cohen, Yishay Mansour
Abstract: We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$ we establish an optimal expected regret bound of $ O (\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|)} $ where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $ \mathcal{F} $ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KT\mathcal{R}_T(\mathcal{O})} + \sqrt{ d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\mathcal{F}$ and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$ which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.
Authors: Petrus Mikkola, Luigi Acerbi, Arto Klami
Abstract: We study density estimation from pairwise comparisons, motivated by expert knowledge elicitation and learning from human feedback. We relate the unobserved target density to a tempered winner density (marginal density of preferred choices), learning the winner's score via score-matching. This allows estimating the target by `de-tempering' the estimated winner density's score. We prove that the score vectors of the belief and the winner density are collinear, linked by a position-dependent tempering field. We give analytical formulas for this field and propose an estimator for it under the Bradley-Terry model. Using a diffusion model trained on tempered samples generated via score-scaled annealed Langevin dynamics, we can learn complex multivariate belief densities of simulated experts, from only hundreds to thousands of pairwise comparisons.
Authors: Suming Qiu, Jing Li, Zhicheng Zhou, Junjie Huang, Linyuan Qiu, Zhijie Sun
Abstract: Large language models (LLMs) often face a trade-off in post-training: improvements on specialized domains frequently come at the expense of general capabilities. Existing solutions attempt to mitigate this tension via regularization, selective parameter updates, or data-centric replay, but each imposes significant costs in computation, data access, or adaptability. Recent work has shown that training signals can be compressed to subsets of logits without severe accuracy loss, suggesting a path toward efficient adaptation. However, naive truncation destabilizes optimization and exacerbates forgetting. We introduce Logits Replay + MoClip, a two-stage framework that compresses supervision in the logit space and stabilizes optimization at the update level. In Stage 0, we record dynamic Top-K token subsets that cover a probability threshold, always including the gold label. In Stage 1, we replay these compact subsets to compute exact renormalized losses, avoiding full softmax computation and implicitly regularizing. To ensure stability, we design MoClip, an optimizer that caps gradient-momentum rotation and applies an arctan2-based rescaling of updates. Empirically, our method improves domain performance on Communication Technology (CT) and NL2SQL tasks while mitigating forgetting on general benchmarks (MMLU, BBH, GPQA, MATH), and reduces training cost by over 40%. Together, these contributions offer a scalable, architecture-agnostic path for domain adaptation of LLMs without sacrificing generalization.
Authors: Jing Li, Zhijie Sun, Zhicheng Zhou, Suming Qiu, Junjie Huang, Haijia Sun, Linyuan Qiu
Abstract: Current knowledge-enhanced large language models (LLMs) rely on static, pre-constructed knowledge bases that suffer from coverage gaps and temporal obsolescence, limiting their effectiveness in dynamic information environments. We present Agentic-KGR, a novel framework enabling co-evolution between LLMs and knowledge graphs (KGs) through multi-round reinforcement learning (RL). Our approach introduces three key innovations: (1) a dynamic schema expansion mechanism that systematically extends graph ontologies beyond pre-defined boundaries during training; (2) a retrieval-augmented memory system enabling synergistic co-evolution between model parameters and knowledge structures through continuous optimization; (3) a learnable multi-scale prompt compression approach that preserves critical information while reducing computational complexity through adaptive sequence optimization. Experimental results demonstrate substantial improvements over supervised baselines and single-round RL approaches in knowledge extraction tasks. When integrated with GraphRAG, our method achieves superior performance in downstream QA tasks, with significant gains in both accuracy and knowledge coverage compared to existing methods.
Authors: Tianyi Chen, Mingcheng Zhu, Zhiyao Luo, Tingting Zhu
Abstract: Electronic Health Records (EHRs) enable deep learning for clinical predictions, but the optimal method for representing patient data remains unclear due to inconsistent evaluation practices. We present the first systematic benchmark to compare EHR representation methods, including multivariate time-series, event streams, and textual event streams for LLMs. This benchmark standardises data curation and evaluation across two distinct clinical settings: the MIMIC-IV dataset for ICU tasks (mortality, phenotyping) and the EHRSHOT dataset for longitudinal care (30-day readmission, 1-year pancreatic cancer). For each paradigm, we evaluate appropriate modelling families--including Transformers, MLP, LSTMs and Retain for time-series, CLMBR and count-based models for event streams, 8-20B LLMs for textual streams--and analyse the impact of feature pruning based on data missingness. Our experiments reveal that event stream models consistently deliver the strongest performance. Pre-trained models like CLMBR are highly sample-efficient in few-shot settings, though simpler count-based models can be competitive given sufficient data. Furthermore, we find that feature selection strategies must be adapted to the clinical setting: pruning sparse features improves ICU predictions, while retaining them is critical for longitudinal tasks. Our results, enabled by a unified and reproducible pipeline, provide practical guidance for selecting EHR representations based on the clinical context and data regime.
Authors: Le-Trung Nguyen, Enzo Tartaglione, Van-Tam Nguyen
Abstract: As AI increasingly shapes daily life, energy consumption and data privacy have become pressing concerns. On-device learning trains models directly on edge devices, cutting energy consumption and safeguarding data privacy. However, the expanding scale of modern neural networks creates a major obstacle for on-device training. Although prior work has concentrated on compact convolutional architectures, we instead apply subspace-based training to transformer models. Motivated by the idea that a model's essential information lies in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method that mitigates the memory bottleneck of backpropagation and boosts inference efficiency in transformer models by restricting training to this subspace. Our results demonstrate that WASI maintains accuracy comparable to vanilla training while reducing memory usage by up to $62\times$ and computational cost (FLOPs) by up to $2\times$. On a Raspberry Pi 5, WASI achieves roughly $1.5\times$ faster training and inference than vanilla training.
Authors: Benedikt Franke, Florian Heinrich, Markus Lange, Arne Raul
Abstract: This paper takes a closer look at Git Re-Basin, an interesting new approach to merge trained models. We propose a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm. With our new algorithm, we find that Re-Basin induces adversarial and perturbation robustness into the merged models, with the effect becoming stronger the more models participate in the hierarchical merging scheme. However, in our experiments Re-Basin induces a much bigger performance drop than reported by the original authors.
Authors: Ling Zhan, Junjie Huang, Xiaoyao Yu, Wenyu Chen, Tao Jia
Abstract: Functional brain network (FBN) modeling often relies on local pairwise interactions, whose limitation in capturing high-order dependencies is theoretically analyzed in this paper. Meanwhile, the computational burden and heuristic nature of current hypergraph modeling approaches hinder end-to-end learning of FBN structures directly from data distributions. To address this, we propose to extract high-order FBN structures under global constraints, and implement this as a Global Constraints oriented Multi-resolution (GCM) FBN structure learning framework. It incorporates 4 types of global constraint (signal synchronization, subject identity, expected edge numbers, and data labels) to enable learning FBN structures for 4 distinct levels (sample/subject/group/project) of modeling resolution. Experimental results demonstrate that GCM achieves up to a 30.6% improvement in relative accuracy and a 96.3% reduction in computational time across 5 datasets and 2 task settings, compared to 9 baselines and 10 state-of-the-art methods. Extensive experiments validate the contributions of individual components and highlight the interpretability of GCM. This work offers a novel perspective on FBN structure learning and provides a foundation for interdisciplinary applications in cognitive neuroscience. Code is publicly available on https://github.com/lzhan94swu/GCM.
Authors: Peichen Xie, Xian Zhang, Shuo Chen
Abstract: Non-determinism and non-reproducibility present significant challenges in deep learning, leading to inconsistent results across runs and platforms. These issues stem from two origins: random number generation and floating-point computation. While randomness can be controlled through deterministic configurations, floating-point inconsistencies remain largely unresolved. To address this, we introduce RepDL, an open-source library that ensures deterministic and bitwise-reproducible deep learning training and inference across diverse computing environments. RepDL achieves this by enforcing correct rounding and order invariance in floating-point computation. The source code is available at https://github.com/microsoft/RepDL .
Authors: Ze Peng, Jian Zhang, Jintao Guo, Lei Qi, Yang Gao, Yinghuan Shi
Abstract: Continual learning seeks the human-like ability to accumulate new skills in machine intelligence. Its central challenge is catastrophic forgetting, whose underlying cause has not been fully understood for deep networks. In this paper, we demystify catastrophic forgetting by revealing that the new-task training is implicitly an adversarial attack against the old-task knowledge. Specifically, the new-task gradients automatically and accurately align with the sharp directions of the old-task loss landscape, rapidly increasing the old-task loss. This adversarial alignment is intriguingly counter-intuitive because the sharp directions are too sparsely distributed to align with by chance. To understand it, we theoretically show that it arises from training's low-rank bias, which, through forward and backward propagation, confines the two directions into the same low-dimensional subspace, facilitating alignment. Gradient projection (GP) methods, a representative family of forgetting-mitigating methods, reduce adversarial alignment caused by forward propagation, but cannot address the alignment due to backward propagation. We propose backGP to address it, which reduces forgetting by 10.8% and improves accuracy by 12.7% on average over GP methods.
Authors: Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang
Abstract: Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
Authors: Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor Tsang
Abstract: Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a student policy with simple MLP structure explore the environment and be online updated via RL algorithm with a reward model. This reward model is associated with a teacher FM model, containing rich information of expert data distribution. Furthermore, the same teacher FM model is utilized to regularize the student policy's behavior to stabilize policy learning. Due to the student's simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.
Authors: Klaus Weinbauer, Tieu-Long Phan, Peter F. Stadler, Thomas G\"artner, Sagar Malhotra
Abstract: Machine learning models that predict the feasibility of chemical reactions have become central to automated synthesis planning. Despite their predictive success, these models often lack transparency and interpretability. We introduce a novel formulation of prime implicant explanations--also known as minimally sufficient reasons--tailored to this domain, and propose an algorithm for computing such explanations in small-scale reaction prediction tasks. Preliminary experiments demonstrate that our notion of prime implicant explanations conservatively captures the ground truth explanations. That is, such explanations often contain redundant bonds and atoms but consistently capture the molecular attributes that are essential for predicting reaction feasibility.
Authors: Jiangwei Chen, Kieu Thao Nguyen Pham, Rachael Hwee Ling Sim, Arun Verma, Zhaoxuan Wu, Chuan-Sheng Foo, Bryan Kian Hsiang Low
Abstract: In collaborative data sharing and machine learning, multiple parties aggregate their data resources to train a machine learning model with better model performance. However, as the parties incur data collection costs, they are only willing to do so when guaranteed incentives, such as fairness and individual rationality. Existing frameworks assume that all parties join the collaboration simultaneously, which does not hold in many real-world scenarios. Due to the long processing time for data cleaning, difficulty in overcoming legal barriers, or unawareness, the parties may join the collaboration at different times. In this work, we propose the following perspective: As a party who joins earlier incurs higher risk and encourages the contribution from other wait-and-see parties, that party should receive a reward of higher value for sharing data earlier. To this end, we propose a fair and time-aware data sharing framework, including novel time-aware incentives. We develop new methods for deciding reward values to satisfy these incentives. We further illustrate how to generate model rewards that realize the reward values and empirically demonstrate the properties of our methods on synthetic and real-world datasets.
Authors: Peteris Daugulis, Vija Vagale, Emiliano Mancini, Filippo Castiglione
Abstract: The problem of choosing appropriate values for missing data is often encountered in the data science. We describe a novel method containing both traditional mathematics and machine learning elements for prediction (imputation) of missing data. This method is based on the notion of distance between shifted linear subspaces representing the existing data and candidate sets. The existing data set is represented by the subspace spanned by its first principal components. Solutions for the case of the Euclidean metric are given.
Authors: Ilyas Varshavskiy, Bonu Boboeva, Shuhrat Khalilbekov, Azizjon Azimi, Sergey Shulgin, Akhlitdin Nizamitdinov, Haitz Saez de Ocariz Borde
Abstract: Machine Learning models in finance are highly susceptible to model drift, where predictive performance declines as data distributions shift. This issue is especially acute in developing economies such as those in Central Asia and the Caucasus - including Tajikistan, Uzbekistan, Kazakhstan, and Azerbaijan - where frequent and unpredictable macroeconomics shocks destabilize financial data. To the best of our knowledge, this is among the first studies to examine drift mitigation methods on financial datasets from these regions. We investigate the use of synthetic outliers, a largely unexplored approach, to improve model stability against unforeseen shocks. To evaluate effectiveness, we introduce a two-level framework that measures both the extent of performance degradation and the severity of shocks. Our experiments on macroeconomic tabular datasets show that adding a small proportion of synthetic outliers generally improves stability compared to baseline models, though the optimal amount varies by dataset and model
Authors: Yuanming Zhang, Yan Lin, Arijit Khan, Huaiyu Wan
Abstract: A prompt is a natural language instruction that defines a specific task for a large language model (LLM) and serves as the primary interface for human-LLM interaction. With the growing deployment of LLMs, diverse prompt datasets are emerging from platforms such as GitHub and social media. These datasets span a wide array of applications and content types, facilitating both broader LLM utilization and improved prompt engineering. In this work, we--for the first time--have compiled an extensive list of prompt datasets sourced from various channels, representing a spectrum of downstream tasks, languages, engineering techniques, attributes, and modalities. We select key representative datasets for systematic analysis, revealing commonalities and differences in prompt construction across categories, distinguishing them from other text corpora like literature and web. We further propose a prompt optimization approach that leverages syntactic embeddings of part-of-speech and dependency structures. By identifying a centroid representation of prompts and guiding LLMs to rewrite prompts toward this centroid, our method improves the meaningfulness of model outputs. We have made our datasets and code available.
Authors: Felix Brandt, Andreas Heuermann, Philip Hannebohm, Bernhard Bachmann
Abstract: This paper presents a residual-informed machine learning approach for replacing algebraic loops in equation-based Modelica models with neural network surrogates. A feedforward neural network is trained using the residual (error) of the algebraic loop directly in its loss function, eliminating the need for a supervised dataset. This training strategy also resolves the issue of ambiguous solutions, allowing the surrogate to converge to a consistent solution rather than averaging multiple valid ones. Applied to the large-scale IEEE 14-Bus system, our method achieves a 60% reduction in simulation time compared to conventional simulations, while maintaining the same level of accuracy through error control mechanisms.
Authors: Till Freihaut, Luca Viano, Emanuele Nevali, Volkan Cevher, Matthieu Geist, Giorgia Ramponi
Abstract: We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM. It improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2}),$ matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, where Behavior Cloning fails to learn.
Authors: Tuan Nguyen, Long Tran-Thanh
Abstract: Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is impractical, and not suitable for third party stakeholders who do not have access to the models. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers versus helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.
Authors: Till Aczel, Lucas Theis, Wattenhofer Roger
Abstract: Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
Authors: Vu Duc Anh Nguyen, Ziyue Li
Abstract: The operational efficiency of railway networks, a cornerstone of modern economies, is persistently undermined by the cascading effects of train delays. Accurately forecasting this delay propagation is a critical challenge for real-time traffic management. While recent research has leveraged Graph Neural Networks (GNNs) to model the network structure of railways, a significant gap remains in developing frameworks that provide multi-step autoregressive forecasts at a network-wide scale, while simultaneously offering the live, interpretable explanations needed for decision support. This paper addresses this gap by developing and evaluating a novel XGeoAI framework for live, explainable, multi-step train delay forecasting. The core of this work is a two-stage, autoregressive Graph Attention Network (GAT) model, trained on a real-world dataset covering over 40% of the Dutch railway network. The model represents the system as a spatio-temporal graph of operational events (arrivals and departures) and is enriched with granular features, including platform and station congestion. To test its viability for live deployment, the model is rigorously evaluated using a sequential, k-step-ahead forecasting protocol that simulates real-world conditions where prediction errors can compound. The results demonstrate that while the proposed GATv2 model is challenged on pure error metrics (MAE) by a simpler Persistence baseline, it achieves consistently higher precision in classifying delay events -- a crucial advantage for a reliable decision support tool.
Authors: Natalie Abreu, Nikhil Vyas, Sham Kakade, Depen Morwani
Abstract: Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
Authors: Rahel Rickenbach, Jelena Trisovic, Alexandre Didier, Jerome Sieber, Melanie N. Zeilinger
Abstract: Although softmax attention drives state-of-the-art performance for sequence models, its quadratic complexity limits scalability, motivating linear alternatives such as state space models (SSMs). While these alternatives improve efficiency, their fundamental differences in information processing remain poorly understood. In this work, we leverage the recently proposed dynamical systems framework to represent softmax, norm and linear attention as dynamical systems, enabling a structured comparison with SSMs by analyzing their respective eigenvalue spectra. Since eigenvalues capture essential aspects of dynamical system behavior, we conduct an extensive empirical analysis across diverse sequence models and benchmarks. We first show that eigenvalues influence essential aspects of memory and long-range dependency modeling, revealing spectral signatures that align with task requirements. Building on these insights, we then investigate how architectural modifications in sequence models impact both eigenvalue spectra and task performance. This correspondence further strengthens the position of eigenvalue analysis as a principled metric for interpreting, understanding, and ultimately improving the capabilities of sequence models.
Authors: Ankush Pratap Singh, Houwei Cao, Yong Liu
Abstract: Curriculum learning (CL) structures training from simple to complex samples, facilitating progressive learning. However, existing CL approaches for emotion recognition often rely on heuristic, data-driven, or model-based definitions of sample difficulty, neglecting the difficulty for human perception, a critical factor in subjective tasks like emotion recognition. We propose CHUCKLE (Crowdsourced Human Understanding Curriculum for Knowledge Led Emotion Recognition), a perception-driven CL framework that leverages annotator agreement and alignment in crowd-sourced datasets to define sample difficulty, under the assumption that clips challenging for humans are similarly hard for machine learning models. Empirical results suggest that CHUCKLE increases the relative mean accuracy by 6.56% for LSTMs and 1.61% for Transformers over non-curriculum baselines, while reducing the number of gradient updates, thereby enhancing both training efficiency and model robustness.
Authors: Xinyi Wang, Jinyi Han, Zishang Jiang, Tingyun Li, Jiaqing Liang, Sihang Jiang, Zhaoqian Dai, Shuguang Ma, Fei Yu, Yanghua Xiao
Abstract: Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds the model's capacity, leading to reward sparsity and inefficient training. While prior work attempts to mitigate this using off-policy data, such as mixing RL with Supervised Fine-Tuning (SFT) or using hints, they often misguide policy updates In this work, we identify a core issue underlying these failures, which we term low training affinity. This condition arises from a large distributional mismatch between external guidance and the model's policy. To diagnose this, we introduce Affinity, the first quantitative metric for monitoring exploration efficiency and training stability. To improve Affinity, we propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework. Instead of providing direct answers, HINT supplies heuristic hints that guide the model to discover solutions on its own, preserving its autonomous reasoning capabilities. Extensive experiments on mathematical reasoning tasks show that HINT consistently outperforms existing methods, achieving state-of-the-art results with models of various scales, while also demonstrating significantly more stable learning and greater data efficiency.Code is available on Github.
Authors: Jerome Sieber, Antonio Orvieto, Melanie N. Zeilinger, Carmen Amo Alonso
Abstract: Deep sequence models, ranging from Transformers and State Space Models (SSMs) to more recent approaches such as gated linear RNNs, fundamentally compute outputs as linear combinations of past value vectors. To draw insights and systematically compare such architectures, we develop a unified framework that makes this output operation explicit, by casting the linear combination coefficients as the outputs of autonomous linear dynamical systems driven by impulse inputs. This viewpoint, in spirit substantially different from approaches focusing on connecting linear RNNs with linear attention, reveals a common mathematical theme across diverse architectures and crucially captures softmax attention, on top of RNNs, SSMs, and related models. In contrast to new model proposals that are commonly evaluated on benchmarks, we derive design principles linking architectural choices to model properties. Thereby identifying tradeoffs between expressivity and efficient implementation, geometric constraints on input selectivity, and stability conditions for numerically stable training and information retention. By connecting several insights and observations from recent literature, the framework both explains empirical successes of recent designs and provides guiding principles for systematically designing new sequence model architectures.
Authors: Yuhao Pan, Xiucheng Wang, Nan Cheng, Wenchao Xu
Abstract: Radio frequency fingerprint identification (RFFI) is a critical technique for wireless network security, leveraging intrinsic hardware-level imperfections introduced during device manufacturing to enable precise transmitter identification. While deep neural networks have shown remarkable capability in extracting discriminative features, their real-world deployment is hindered by receiver-induced variability. In practice, RF fingerprint signals comprise transmitter-specific features as well as channel distortions and receiver-induced biases. Although channel equalization can mitigate channel noise, receiver-induced feature shifts remain largely unaddressed, causing the RFFI models to overfit to receiver-specific patterns. This limitation is particularly problematic when training and evaluation share the same receiver, as replacing the receiver in deployment can cause substantial performance degradation. To tackle this challenge, we propose an RFFI framework robust to cross-receiver variability, integrating adversarial training and style transfer to explicitly disentangle transmitter and receiver features. By enforcing domain-invariant representation learning, our method isolates genuine hardware signatures from receiver artifacts, ensuring robustness against receiver changes. Extensive experiments on multi-receiver datasets demonstrate that our approach consistently outperforms state-of-the-art baselines, achieving up to a 10% improvement in average accuracy across diverse receiver settings.
Authors: Abigail J. Hayes, Tobias Schumacher, Markus Strohmaier
Abstract: Learning on temporal graphs has become a central topic in graph representation learning, with numerous benchmarks indicating the strong performance of state-of-the-art models. However, recent work has raised concerns about the reliability of benchmark results, noting issues with commonly used evaluation protocols and the surprising competitiveness of simple heuristics. This contrast raises the question of which properties of the underlying graphs temporal graph learning models actually use to form their predictions. We address this by systematically evaluating seven models on their ability to capture eight fundamental attributes related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, we analyze how well models learn these attributes. Our findings reveal a mixed picture: models capture some attributes well but fail to reproduce others. With this, we expose important limitations. Overall, we believe that our results provide practical insights for the application of temporal graph learning models, and motivate more interpretability-driven evaluations in temporal graph learning research.
Authors: Yankun Han
Abstract: Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, this paper tracks layerwise Q/K/V weight variance through pretraining and observe depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.
Authors: Gur Keinan, Rotem Torkan, Omer Ben-Porat
Abstract: We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on \emph{single-peaked preferences} -- a well-established structure in social choice theory, where users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
Authors: Hyunin Lee, Yong Zhang, Hoang Vu Nguyen, Xiaoyi Liu, Namyong Park, Christopher Jung, Rong Jin, Yang Wang, Zhigang Wang, Somayeh Sojoudi, Xue Feng
Abstract: Cross-domain sequential recommendation (CDSR) aims to align heterogeneous user behavior sequences collected from different domains. While cross-attention is widely used to enhance alignment and improve recommendation performance, its underlying mechanism is not fully understood. Most researchers interpret cross-attention as residual alignment, where the output is generated by removing redundant and preserving non-redundant information from the query input by referencing another domain data which is input key and value. Beyond the prevailing view, we introduce Orthogonal Alignment, a phenomenon in which cross-attention discovers novel information that is not present in the query input, and further argue that those two contrasting alignment mechanisms can co-exist in recommendation models We find that when the query input and output of cross-attention are orthogonal, model performance improves over 300 experiments. Notably, Orthogonal Alignment emerges naturally, without any explicit orthogonality constraints. Our key insight is that Orthogonal Alignment emerges naturally because it improves scaling law. We show that baselines additionally incorporating cross-attention module outperform parameter-matched baselines, achieving a superior accuracy-per-model parameter. We hope these findings offer new directions for parameter-efficient scaling in multi-modal research.
Authors: Faried Abu Zaid, Tim Katzke, Emmanuel M\"uller, Daniel Neider
Abstract: Unsupervised anomaly detection is often framed around two widely studied paradigms. Deep one-class classification, exemplified by Deep SVDD, learns compact latent representations of normality, while density estimators realized by normalizing flows directly model the likelihood of nominal data. In this work, we show that uniformly scaling flows (USFs), normalizing flows with a constant Jacobian determinant, precisely connect these approaches. Specifically, we prove how training a USF via maximum-likelihood reduces to a Deep SVDD objective with a unique regularization that inherently prevents representational collapse. This theoretical bridge implies that USFs inherit both the density faithfulness of flows and the distance-based reasoning of one-class methods. We further demonstrate that USFs induce a tighter alignment between negative log-likelihood and latent norm than either Deep SVDD or non-USFs, and how recent hybrid approaches combining one-class objectives with VAEs can be naturally extended to USFs. Consequently, we advocate using USFs as a drop-in replacement for non-USFs in modern anomaly detection architectures. Empirically, this substitution yields consistent performance gains and substantially improved training stability across multiple benchmarks and model backbones for both image-level and pixel-level detection. These results unify two major anomaly detection paradigms, advancing both theoretical understanding and practical performance.
Authors: Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou, Caglar Gulcehre, Maksym Andriushchenko, Ameya Prabhu, Jonas Geiping
Abstract: AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a security problem, stress testing with exploits that use the deployment context to subtly complete harmful side tasks, such as backdoor insertion. In practice, most AI control protocols are fundamentally based on LLM monitors, which can become a central point of failure. We study adaptive attacks by an untrusted model that knows the protocol and the monitor model, which is plausible if the untrusted model was trained with a later knowledge cutoff or can search for this information autonomously. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs. Using this tactic, frontier models consistently evade diverse monitors and complete malicious tasks on two main AI control benchmarks. The attack works universally against current protocols that rely on a monitor. Furthermore, the recent Defer-to-Resample protocol even backfires, as its resampling amplifies the prompt injection and effectively reframes it as a best-of-$n$ attack. In general, adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.
Authors: Saeid Mashhadi, Amirhossein Saghezchi, Vesal Ghassemzadeh Kashani
Abstract: This study develops an interpretable machine learning framework to forecast startup outcomes, including funding, patenting, and exit. A firm-quarter panel for 2010-2023 is constructed from Crunchbase and matched to U.S. Patent and Trademark Office (USPTO) data. Three horizons are evaluated: next funding within 12 months, patent-stock growth within 24 months, and exit through an initial public offering (IPO) or acquisition within 36 months. Preprocessing is fit on a development window (2010-2019) and applied without change to later cohorts to avoid leakage. Class imbalance is addressed using inverse-prevalence weights and the Synthetic Minority Oversampling Technique for Nominal and Continuous features (SMOTE-NC). Logistic regression and tree ensembles, including Random Forest, XGBoost, LightGBM, and CatBoost, are compared using the area under the precision-recall curve (PR-AUC) and the area under the receiver operating characteristic curve (AUROC). Patent, funding, and exit predictions achieve AUROC values of 0.921, 0.817, and 0.872, providing transparent and reproducible rankings for innovation finance.
Authors: Florine Hartwig, Josua Sassen, Juliane Braunsmann, Martin Rumpf, Benedikt Wirth
Abstract: Latent manifolds of autoencoders provide low-dimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.
Authors: Erik Larsson, Joel Oskarsson, Tomas Landelius, Fredrik Lindsten
Abstract: Machine learning for weather prediction increasingly relies on ensemble methods to provide probabilistic forecasts. Diffusion-based models have shown strong performance in Limited-Area Modeling (LAM) but remain computationally expensive at sampling time. Building on the success of global weather forecasting models trained based on Continuous Ranked Probability Score (CRPS), we introduce CRPS-LAM, a probabilistic LAM forecasting model trained with a CRPS-based objective. By sampling and injecting a single latent noise vector into the model, CRPS-LAM generates ensemble members in a single forward pass, achieving sampling speeds up to 39 times faster than a diffusion-based model. We evaluate the model on the MEPS regional dataset, where CRPS-LAM matches the low errors of diffusion models. By retaining also fine-scale forecast details, the method stands out as an effective approach for probabilistic regional weather forecasting
Authors: Hrad Ghoukasian, Bonwoo Lee, Shahab Asoodeh
Abstract: We study the problem of sampling from a distribution under local differential privacy (LDP). Given a private distribution $P \in \mathcal{P}$, the goal is to generate a single sample from a distribution that remains close to $P$ in $f$-divergence while satisfying the constraints of LDP. This task captures the fundamental challenge of producing realistic-looking data under strong privacy guarantees. While prior work by Park et al. (NeurIPS'24) focuses on global minimax-optimality across a class of distributions, we take a local perspective. Specifically, we examine the minimax risk in a neighborhood around a fixed distribution $P_0$, and characterize its exact value, which depends on both $P_0$ and the privacy level. Our main result shows that the local minimax risk is determined by the global minimax risk when the distribution class $\mathcal{P}$ is restricted to a neighborhood around $P_0$. To establish this, we (1) extend previous work from pure LDP to the more general functional LDP framework, and (2) prove that the globally optimal functional LDP sampler yields the optimal local sampler when constrained to distributions near $P_0$. Building on this, we also derive a simple closed-form expression for the locally minimax-optimal samplers which does not depend on the choice of $f$-divergence. We further argue that this local framework naturally models private sampling with public data, where the public data distribution is represented by $P_0$. In this setting, we empirically compare our locally optimal sampler to existing global methods, and demonstrate that it consistently outperforms global minimax samplers.
Authors: Shangzhe Li, Dongruo Zhou, Weitong Zhang
Abstract: We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon $H$, precision $\epsilon$ and the policy variance $\sigma^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.
Authors: Iftekhar Ahmed, Tanzil Ebad Chowdhury, Biggo Bushon Routh, Nafisa Tasmiya, Shadman Sakib, Adil Ahmed Chowdhury
Abstract: Kidneys are the filter of the human body. About 10% of the global population is thought to be affected by Chronic Kidney Disease (CKD), which causes kidney function to decline. To protect in danger patients from additional kidney damage, effective risk evaluation of CKD and appropriate CKD monitoring are crucial. Due to quick and precise detection capabilities, Machine Learning models can help practitioners accomplish this goal efficiently; therefore, an enormous number of diagnosis systems and processes in the healthcare sector nowadays are relying on machine learning due to its disease prediction capability. In this study, we designed and suggested disease predictive computer-aided designs for the diagnosis of CKD. The dataset for CKD is attained from the repository of machine learning of UCL, with a few missing values; those are filled in using "mean-mode" and "Random sampling method" strategies. After successfully achieving the missing data, eight ML techniques (Random Forest, SVM, Naive Bayes, Logistic Regression, KNN, XGBoost, Decision Tree, and AdaBoost) were used to establish models, and the performance evaluation comparisons among the result accuracies are measured by the techniques to find the machine learning models with the highest accuracy. Among them, Random Forest as well as Logistic Regression showed an outstanding 99% accuracy, followed by the Ada Boost, XGBoost, Naive Bayes, Decision Tree, and SVM, whereas the KNN classifier model stands last with an accuracy of 73%.
Authors: Shiyuan Luo, Runlong Yu, Shengyu Chen, Yingda Fan, Yiqun Xie, Yanhua Li, Xiaowei Jia
Abstract: Understanding environmental ecosystems is vital for the sustainable management of our planet. However,existing physics-based and data-driven models often fail to generalize to varying spatial regions and scales due to the inherent data heterogeneity presented in real environmental ecosystems. This generalization issue is further exacerbated by the limited observation samples available for model training. To address these issues, we propose Geo-STARS, a geo-aware spatio-temporal modeling framework for predicting stream water temperature across different watersheds and spatial scales. The major innovation of Geo-STARS is the introduction of geo-aware embedding, which leverages geographic information to explicitly capture shared principles and patterns across spatial regions and scales. We further integrate the geo-aware embedding into a gated spatio-temporal graph neural network. This design enables the model to learn complex spatial and temporal patterns guided by geographic and hydrological context, even with sparse or no observational data. We evaluate Geo-STARS's efficacy in predicting stream water temperature, which is a master factor for water quality. Using real-world datasets spanning 37 years across multiple watersheds along the eastern coast of the United States, Geo-STARS demonstrates its superior generalization performance across both regions and scales, outperforming state-of-the-art baselines. These results highlight the promise of Geo-STARS for scalable, data-efficient environmental monitoring and decision-making.
Authors: Ilia Revin, Leon Strelkov, Vadim A. Potemkin, Ivan Kireev, Andrey Savchenko
Abstract: There are many critical challenges in optimizing neural network models, including distributed computing, compression techniques, and efficient training, regardless of their application to specific tasks. Solving such problems is crucial because the need for scalable and resource-efficient models is increasing. To address these challenges, we have developed a new automated machine learning (AutoML) framework, Parameter Efficient Training with Robust Automation (PETRA). It applies evolutionary optimization to model architecture and training strategy. PETRA includes pruning, quantization, and loss regularization. Experimental studies on real-world data with financial event sequences, as well as image and time-series -- benchmarks, demonstrate PETRA's ability to improve neural model performance and scalability -- namely, a significant decrease in model size (up to 75%) and latency (up to 33%), and an increase in throughput (by 13%) without noticeable degradation in the target metric.
Authors: Disharee Bhowmick, Ranjith Ramanathan, Sathyanarayanan N. Aakur
Abstract: Time series data often contain latent temporal structure, transitions between locally stationary regimes, repeated motifs, and bursts of variability, that are rarely leveraged in standard representation learning pipelines. Existing models typically operate on raw or fixed-window sequences, treating all time steps as equally informative, which leads to inefficiencies, poor robustness, and limited scalability in long or noisy sequences. We propose STaTS, a lightweight, unsupervised framework for Structure-Aware Temporal Summarization that adaptively compresses both univariate and multivariate time series into compact, information-preserving token sequences. STaTS detects change points across multiple temporal resolutions using a BIC-based statistical divergence criterion, then summarizes each segment using simple functions like the mean or generative models such as GMMs. This process achieves up to 30x sequence compression while retaining core temporal dynamics. STaTS operates as a model-agnostic preprocessor and can be integrated with existing unsupervised time series encoders without retraining. Extensive experiments on 150+ datasets, including classification tasks on the UCR-85, UCR-128, and UEA-30 archives, and forecasting on ETTh1 and ETTh2, ETTm1, and Electricity, demonstrate that STaTS enables 85-90\% of the full-model performance while offering dramatic reductions in computational cost. Moreover, STaTS improves robustness under noise and preserves discriminative structure, outperforming uniform and clustering-based compression baselines. These results position STaTS as a principled, general-purpose solution for efficient, structure-aware time series modeling.
Authors: Nathan Quiblier, Roy Friedman, Matthew Ricci
Abstract: Dynamical systems in the life sciences are often composed of complex mixtures of overlapping behavioral regimes. Cellular subpopulations may shift from cycling to equilibrium dynamics or branch towards different developmental fates. The transitions between these regimes can appear noisy and irregular, posing a serious challenge to traditional, flow-based modeling techniques which assume locally smooth dynamics. To address this challenge, we propose MODE (Mixture Of Dynamical Experts), a graphical modeling framework whose neural gating mechanism decomposes complex dynamics into sparse, interpretable components, enabling both the unsupervised discovery of behavioral regimes and accurate long-term forecasting across regime transitions. Crucially, because agents in our framework can jump to different governing laws, MODE is especially tailored to the aforementioned noisy transitions. We evaluate our method on a battery of synthetic and real datasets from computational biology. First, we systematically benchmark MODE on an unsupervised classification task using synthetic dynamical snapshot data, including in noisy, few-sample settings. Next, we show how MODE succeeds on challenging forecasting tasks which simulate key cycling and branching processes in cell biology. Finally, we deploy our method on human, single-cell RNA sequencing data and show that it can not only distinguish proliferation from differentiation dynamics but also predict when cells will commit to their ultimate fate, a key outstanding challenge in computational biology.
Authors: Sangyun Lee, Brandon Amos, Giulia Fanti
Abstract: Today's generative models thrive with large amounts of supervised data and informative reward functions characterizing the quality of the generation. They work under the assumptions that the supervised data provides knowledge to pre-train the model, and the reward function provides dense information about how to further improve the generation quality and correctness. However, in the hardest instances of important problems, two problems arise: (1) the base generative model attains a near-zero reward signal, and (2) calls to the reward oracle are expensive. This setting poses a fundamentally different learning challenge than standard reward-based post-training. To address this, we propose BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). Our method is based on the idea that the problem of learning regularities underlying failures can be cast as another, in-loop generative modeling problem. We then leverage this model to assess whether new data resembles previously seen failures and steer the generation away from them. We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks, outperforming existing novelty-bonus approaches by up to several orders of magnitude in success rate, while using fewer reward evaluations.
Authors: Gursimran Singh, Aviral Chharia, Rahul Upadhyay, Vinay Kumar, Luca Longo
Abstract: Electroencephalography (EEG)-based Brain-Computer Interfaces (BCIs) have emerged as a transformative technology with applications spanning robotics, virtual reality, medicine, and rehabilitation. However, existing BCI frameworks face several limitations, including a lack of stage-wise flexibility essential for experimental research, steep learning curves for researchers without programming expertise, elevated costs due to reliance on proprietary software, and a lack of all-inclusive features leading to the use of multiple external tools affecting research outcomes. To address these challenges, we present PyNoetic, a modular BCI framework designed to cater to the diverse needs of BCI research. PyNoetic is one of the very few frameworks in Python that encompasses the entire BCI design pipeline, from stimulus presentation and data acquisition to channel selection, filtering, feature extraction, artifact removal, and finally simulation and visualization. Notably, PyNoetic introduces an intuitive and end-to-end GUI coupled with a unique pick-and-place configurable flowchart for no-code BCI design, making it accessible to researchers with minimal programming experience. For advanced users, it facilitates the seamless integration of custom functionalities and novel algorithms with minimal coding, ensuring adaptability at each design stage. PyNoetic also includes a rich array of analytical tools such as machine learning models, brain-connectivity indices, systematic testing functionalities via simulation, and evaluation methods of novel paradigms. PyNoetic's strengths lie in its versatility for both offline and real-time BCI development, which streamlines the design process, allowing researchers to focus on more intricate aspects of BCI development and thus accelerate their research endeavors. Project Website: https://neurodiag.github.io/PyNoetic
Authors: Yaxin Shi, Abhishek Gupta, Ying Wu, Melvin Wong, Ivor Tsang, Thiago Rios, Stefan Menzel, Bernhard Sendhoff, Yaqing Hou, Yew-Soon Ong
Abstract: Generative AI (GenAI) has achieved remarkable success across a range of domains, but its capabilities remain constrained to statistical models of finite training sets and learning based on local gradient signals. This often results in artifacts that are more derivative than genuinely generative. In contrast, Evolutionary Computation (EC) offers a search-driven pathway to greater diversity and creativity, expanding generative capabilities by exploring uncharted solution spaces beyond the limits of available data. This work establishes a fundamental connection between EC and GenAI, redefining EC as Natural Generative AI (NatGenAI) -- a generative paradigm governed by exploratory search under natural selection. We demonstrate that classical EC with parent-centric operators mirrors conventional GenAI, while disruptive operators enable structured evolutionary leaps, often within just a few generations, to generate out-of-distribution artifacts. Moreover, the methods of evolutionary multitasking provide an unparalleled means of integrating disruptive EC (with cross-domain recombination of evolved features) and moderated selection mechanisms (allowing novel solutions to survive), thereby fostering sustained innovation. By reframing EC as NatGenAI, we emphasize structured disruption and selection pressure moderation as essential drivers of creativity. This perspective extends the generative paradigm beyond conventional boundaries and positions EC as crucial to advancing exploratory design, innovation, scientific discovery, and open-ended generation in the GenAI era.
Authors: Takehiro Ishikawa
Abstract: Recent advancements in QML and SNNs have generated considerable excitement, promising exponential speedups and brain-like energy efficiency to revolutionize AI. However, this paper argues that they are unlikely to displace DNNs in the near term. QML struggles with adapting backpropagation due to unitary constraints, measurement-induced state collapse, barren plateaus, and high measurement overheads, exacerbated by the limitations of current noisy intermediate-scale quantum hardware, overfitting risks due to underdeveloped regularization techniques, and a fundamental misalignment with machine learning's generalization. SNNs face restricted representational bandwidth, struggling with long-range dependencies and semantic encoding in language tasks due to their discrete, spike-based processing. Furthermore, the goal of faithfully emulating the brain might impose inherent inefficiencies like cognitive biases, limited working memory, and slow learning speeds. Even their touted energy-efficient advantages are overstated; optimized DNNs with quantization can outperform SNNs in energy costs under realistic conditions. Finally, SNN training incurs high computational overhead from temporal unfolding. In contrast, DNNs leverage efficient backpropagation, robust regularization, and innovations in LRMs that shift scaling to inference-time compute, enabling self-improvement via RL and search algorithms like MCTS while mitigating data scarcity. This superiority is evidenced by recent models such as xAI's Grok-4 Heavy, which advances SOTA performance, and gpt-oss-120b, which surpasses or approaches the performance of leading industry models despite its modest 120-billion-parameter size deployable on a single 80GB GPU. Furthermore, specialized ASICs amplify these efficiency gains. Ultimately, QML and SNNs may serve niche hybrid roles, but DNNs remain the dominant, practical paradigm for AI advancement.
Authors: Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra
Abstract: Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across four open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently signifies the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts directly with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3 and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard and OpenAI Moderation API), are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode. Through this work, we hope to motivate future research on designing robust TTS strategies that are both effective and secure against diversity-targeted stress tests as illustrated by RefDiv.
Authors: Devleena Das, Rajeev Patwari, Ashish Sirasao
Abstract: Inference optimizations such as quantization, pruning, format and datatype conversion, model export, and serialization can lead to functional degradations in language model task performance. While most efforts on performance recovery for deployment focus on robust quantization techniques, we focus on recovering model accuracies from any sources that degrade model weights, such as improper model serialization. In this work, we propose Recover-LoRA, a lightweight and dataset agnostic method to recover accuracy in degraded models. Recover-LoRA uses synthetic data and logit distillation to learn LoRA adapters on selective layers that facilitate aligning the degraded model to its full precision model. We investigate the utility of Recover-LoRA across a diverse set of small language models (SLMs), including models with varying attention architectures, multi-head attention (MHA) and group-query attention (GQA), as well as several evaluation datasets. Our results show that Recover-LoRA recovers model accuracies by 5-17% on MHA and GQA SLMs.
Authors: Aneesh Jonelagadda (Kaliber AI), Christina Hahn (Kaliber AI), Haoze Zheng (Kaliber AI), Salvatore Penachio (Kaliber AI)
Abstract: Long-term memory is essential for natural, realistic dialogue. However, current large language model (LLM) memory systems rely on either brute-force context expansion or static retrieval pipelines that fail on edge-constrained devices. We introduce Mnemosyne, an unsupervised, human-inspired long-term memory architecture designed for edge-based LLMs. Our approach uses graph-structured storage, modular substance and redundancy filters, memory committing and pruning mechanisms, and probabilistic recall with temporal decay and refresh processes modeled after human memory. Mnemosyne also introduces a concentrated "core summary" efficiently derived from a fixed-length subset of the memory graph to capture the user's personality and other domain-specific long-term details such as, using healthcare application as an example, post-recovery ambitions and attitude towards care. Unlike existing retrieval-augmented methods, Mnemosyne is designed for use in longitudinal healthcare assistants, where repetitive and semantically similar but temporally distinct conversations are limited by naive retrieval. In experiments with longitudinal healthcare dialogues, Mnemosyne demonstrates the highest win rate of 65.8% in blind human evaluations of realism and long-term memory capability compared to a baseline RAG win rate of 31.1%. Mnemosyne also achieves current highest LoCoMo benchmark scores in temporal reasoning and single-hop retrieval compared to other same-backboned techniques. Further, the average overall score of 54.6% was second highest across all methods, beating commonly used Mem0 and OpenAI baselines among others. This demonstrates that improved factual recall, enhanced temporal reasoning, and much more natural user-facing responses can be feasible with an edge-compatible and easily transferable unsupervised memory architecture.
Authors: Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, Zhiqiang Xu
Abstract: The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches both zero-shot methods and supervised classifiers largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of `non-ID' behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers while machine-generated texts are in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning method including DeepSVDD and HRN, and score-based learning techniques such as energy-based method, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on DeepFake dataset. Moreover, we test our detection framework on multilingual, attacked, and unseen-model and -domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and demo will be released.
Authors: Raffaele Mura, Giorgio Piras, Kamil\.e Luko\v{s}i\=ut\.e, Maura Pintor, Amin Karbasi, Battista Biggio
Abstract: Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
Authors: Nouar Aldahoul, Yasir Zaki
Abstract: The rapid spread of misinformation on digital platforms threatens public discourse, emotional stability, and decision-making. While prior work has explored various adversarial attacks in misinformation detection, the specific transformations examined in this paper have not been systematically studied. In particular, we investigate language-switching across English, French, Spanish, Arabic, Hindi, and Chinese, followed by translation. We also study query length inflation preceding summarization and structural reformatting into multiple-choice questions. In this paper, we present a multilingual, multi-agent large language model framework with retrieval-augmented generation that can be deployed as a web plugin into online platforms. Our work underscores the importance of AI-driven misinformation detection in safeguarding online factual integrity against diverse attacks, while showcasing the feasibility of plugin-based deployment for real-world web applications.
Authors: Imranur Rahman, Jill Marley, William Enck, Laurie Williams
Abstract: Developers consistently use version constraints to specify acceptable versions of the dependencies for their project. \emph{Pinning} dependencies can reduce the likelihood of breaking changes, but comes with a cost of manually managing the replacement of outdated and vulnerable dependencies. On the other hand, \emph{floating} can be used to automatically get bug fixes and security fixes, but comes with the risk of breaking changes. Security practitioners advocate \emph{pinning} dependencies to prevent against software supply chain attacks, e.g., malicious package updates. However, since \emph{pinning} is the tightest version constraint, \emph{pinning} is the most likely to result in outdated dependencies. Nevertheless, how the likelihood of becoming outdated or vulnerable dependencies changes across version constraint types is unknown. The goal of this study is to aid developers in making an informed dependency version constraint choice by empirically evaluating the likelihood of dependencies becoming outdated or vulnerable across version constraint types at scale. In this study, we first identify the trends in dependency version constraint usage and the patterns of version constraint type changes made by developers in the npm, PyPI, and Cargo ecosystems. We then modeled the dependency state transitions using survival analysis and estimated how the likelihood of becoming outdated or vulnerable changes when using \emph{pinning} as opposed to the rest of the version constraint types. We observe that among outdated and vulnerable dependencies, the most commonly used version constraint type is \emph{floating-minor}, with \emph{pinning} being the next most common. We also find that \emph{floating-major} is the least likely to result in outdated and \emph{floating-minor} is the least likely to result in vulnerable dependencies.
Authors: Imranur Rahman, Md Rayhanur Rahman
Abstract: Code completion can help developers improve efficiency and ease the development lifecycle. Although code completion is available in modern integrated development environments (IDEs), research lacks in determining what makes a good context for code completion based on the information available to the IDEs for the large language models (LLMs) to perform better. In this paper, we describe an effective context collection strategy to assist the LLMs in performing better at code completion tasks. The key idea of our strategy is to preprocess the repository into smaller code chunks and later use syntactic and semantic similarity-based code chunk retrieval with relative positioning. We found that code chunking and relative positioning of the chunks in the final context improve the performance of code completion tasks.
Authors: Saumya B
Abstract: Brain tumor segmentation is crucial for diagnosis and treatment planning, yet challenges such as class imbalance and limited model generalization continue to hinder progress. This work presents a reproducible evaluation of U-Net segmentation performance on brain tumor MRI using focal loss and basic data augmentation strategies. Experiments were conducted on a publicly available MRI dataset, focusing on focal loss parameter tuning and assessing the impact of three data augmentation techniques: horizontal flip, rotation, and scaling. The U-Net with focal loss achieved a precision of 90%, comparable to state-of-the-art results. By making all code and results publicly available, this study establishes a transparent, reproducible baseline to guide future research on augmentation strategies and loss function design in brain tumor segmentation.
Authors: Tennison Liu, Silas Ruhrberg Est\'evez, David L. Bentley, Mihaela van der Schaar
Abstract: Large-scale scientific datasets -- spanning health biobanks, cell atlases, Earth reanalyses, and more -- create opportunities for exploratory discovery unconstrained by specific research questions. We term this process hypothesis hunting: the cumulative search for insight through sustained exploration across vast and complex hypothesis spaces. To support it, we introduce AScience, a framework modeling discovery as the interaction of agents, networks, and evaluation norms, and implement it as ASCollab, a distributed system of LLM-based research agents with heterogeneous behaviors. These agents self-organize into evolving networks, continually producing and peer-reviewing findings under shared standards of evaluation. Experiments show that such social dynamics enable the accumulation of expert-rated results along the diversity-quality-novelty frontier, including rediscoveries of established biomarkers, extensions of known pathways, and proposals of new therapeutic targets. While wet-lab validation remains indispensable, our experiments on cancer cohorts demonstrate that socially structured, agentic networks can sustain exploratory hypothesis hunting at scale.
Authors: Anubhav Shrimal, Aryan Jain, Soumyajit Chowdhury, Promod Yenigalla
Abstract: Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constraint decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets including Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and and maintaining practical latency.
Authors: Hanieh Shojaei Miandashti, Claus Brenner
Abstract: In addition to accurate scene understanding through precise semantic segmentation of LiDAR point clouds, detecting out-of-distribution (OOD) objects, instances not encountered during training, is essential to prevent the incorrect assignment of unknown objects to known classes. While supervised OOD detection methods depend on auxiliary OOD datasets, unsupervised methods avoid this requirement but typically rely on predictive entropy, the entropy of the predictive distribution obtained by averaging over an ensemble or multiple posterior weight samples. However, these methods often conflate epistemic (model) and aleatoric (data) uncertainties, misclassifying ambiguous in distribution regions as OOD. To address this issue, we present an unsupervised OOD detection approach that employs epistemic uncertainty derived from hierarchical Bayesian modeling of Gaussian Mixture Model (GMM) parameters in the feature space of a deep neural network. Without requiring auxiliary data or additional training stages, our approach outperforms existing uncertainty-based methods on the SemanticKITTI dataset, achieving an 18\% improvement in AUROC, 22\% increase in AUPRC, and 36\% reduction in FPR95 (from 76\% to 40\%), compared to the predictive entropy approach used in prior works.
Authors: Cai Zhou, Chenyu Wang, Dinghuai Zhang, Shangyuan Tong, Yifei Wang, Stephen Bates, Tommi Jaakkola
Abstract: In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.
Authors: Vinayak Sharma, Ashish Padhy, Vijay Jagdish Karanjkar, Sourav Behera, Lord Sen, Shyamapada Mukherjee, Aviral Shrivastava
Abstract: Kolmogorov-Arnold Networks or KANs have shown the ability to outperform classical Deep Neural Networks, while using far fewer trainable parameters for regression problems on scientific domains. Even more powerful has been their interpretability due to their structure being composed of univariate B-Spline functions. This enables us to derive closed-form equations from trained KANs for a wide range of problems. This paper introduces a quantum-inspired variant of the KAN based on Quantum Data Re-uploading~(DR) models. The Quantum-Inspired Re-uploading KAN or QuIRK model replaces B-Splines with single-qubit DR models as the univariate function approximator, allowing them to match or outperform traditional KANs while using even fewer parameters. This is especially apparent in the case of periodic functions. Additionally, since the model utilizes only single-qubit circuits, it remains classically tractable to simulate with straightforward GPU acceleration. Finally, we also demonstrate that QuIRK retains the interpretability advantages and the ability to produce closed-form solutions.
Authors: Linfeng Wang, Susana Campino, Taane G. Clark, Jody E. Phelan
Abstract: Positive selection drives the emergence of adaptive mutations in Mycobacterium tuberculosis, shaping drug resistance, transmissibility, and virulence. Phylogenetic trees capture evolutionary relationships among isolates and provide a natural framework for detecting such adaptive signals. We present a phylogeny-guided graph attention network (GAT) approach, introducing a method for converting SNP-annotated phylogenetic trees into graph structures suitable for neural network analysis. Using 500 M. tuberculosis isolates from four major lineages and 249 single-nucleotide variants (84 resistance-associated and 165 neutral) across 61 drug-resistance genes, we constructed graphs where nodes represented isolates and edges reflected phylogenetic distances. Edges between isolates separated by more than seven internal nodes were pruned to emphasise local evolutionary structure. Node features encoded SNP presence or absence, and the GAT architecture included two attention layers, a residual connection, global attention pooling, and a multilayer perceptron classifier. The model achieved an accuracy of 0.88 on a held-out test set and, when applied to 146 WHO-classified "uncertain" variants, identified 41 candidates with convergent emergence across multiple lineages, consistent with adaptive evolution. This work demonstrates the feasibility of transforming phylogenies into GNN-compatible structures and highlights attention-based models as effective tools for detecting positive selection, aiding genomic surveillance and variant prioritisation.
Authors: Yifan Zhao, Egan Johnson, Prasanth Chatarasi, Vikram Adve, Sasa Misailovic
Abstract: Operator fusion has become a key optimization for deep learning, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers. However, existing tensor compilers struggle to fuse complex reduction computations involving loop-carried dependencies, such as attention mechanisms. The paper introduces Neptune, a tensor compiler for advanced operator fusion for sequences of reduction operators. Neptune presents a new approach for advanced operator fusion, which intentionally breaks some existing dependencies and compensates by constructing algebraic correction expressions that allow the kernel to produce the correct result. On ten attention-based benchmarks, Neptune, starting from simple attention code and a high-level scheduling template, outperforms existing compilers like Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention. Across four different GPU architectures from NVIDIA and AMD, Neptune-generated kernels have average speedup of $1.35\times$ over the next best alternative, demonstrating its effectiveness for deep learning workloads.
Authors: Nicolas Ewen, Jairo Diaz-Rodriguez, Kelly Ramsay
Abstract: Traditional transfer learning typically reuses large pre-trained networks by freezing some of their weights and adding task-specific layers. While this approach is computationally efficient, it limits the model's ability to adapt to domain-specific features and can still lead to overfitting with very limited data. To address these limitations, we propose Structured Output Regularization (SOR), a simple yet effective framework that freezes the internal network structures (e.g., convolutional filters) while using a combination of group lasso and $L_1$ penalties. This framework tailors the model to specific data with minimal additional parameters and is easily applicable to various network components, such as convolutional filters or various blocks in neural networks enabling broad applicability for transfer learning tasks. We evaluate SOR on three few shot medical imaging classification tasks and we achieve competitive results using DenseNet121, and EfficientNetB4 bases compared to established benchmarks.
Authors: Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta
Abstract: Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.
Authors: Graham Tierney, Srikar Katta, Christopher Bail, Sunshine Hillygus, Alexander Volfovsky
Abstract: Many social science questions ask how linguistic properties causally affect an audience's attitudes and behaviors. Because text properties are often interlinked (e.g., angry reviews use profane language), we must control for possible latent confounding to isolate causal effects. Recent literature proposes adapting large language models (LLMs) to learn latent representations of text that successfully predict both treatment and the outcome. However, because the treatment is a component of the text, these deep learning methods risk learning representations that actually encode the treatment itself, inducing overlap bias. Rather than depending on post-hoc adjustments, we introduce a new experimental design that handles latent confounding, avoids the overlap issue, and unbiasedly estimates treatment effects. We apply this design in an experiment evaluating the persuasiveness of expressing humility in political communication. Methodologically, we demonstrate that LLM-based methods perform worse than even simple bag-of-words models using our real text and outcomes from our experiment. Substantively, we isolate the causal effect of expressing humility on the perceived persuasiveness of political statements, offering new insights on communication effects for social media platforms, policy makers, and social scientists.
Authors: Vasuda Trehan, Kevin H. Knuth, M. J. Way
Abstract: The evolution of space technology in recent years, fueled by advancements in computing such as Artificial Intelligence (AI) and machine learning (ML), has profoundly transformed our capacity to explore the cosmos. Missions like the James Webb Space Telescope (JWST) have made information about distant objects more easily accessible, resulting in extensive amounts of valuable data. As part of this work-in-progress study, we are working to create an atmospheric absorption spectrum prediction model for exoplanets. The eventual model will be based on both collected observational spectra and synthetic spectral data generated by the ROCKE-3D general circulation model (GCM) developed by the climate modeling program at NASA's Goddard Institute for Space Studies (GISS). In this initial study, spline curves are used to describe the bin heights of simulated atmospheric absorption spectra as a function of one of the values of the planetary parameters. Bayesian Adaptive Exploration is then employed to identify areas of the planetary parameter space for which more data are needed to improve the model. The resulting system will be used as a forward model so that planetary parameters can be inferred given a planet's atmospheric absorption spectrum. This work is expected to contribute to a better understanding of exoplanetary properties and general exoplanet climates and habitability.
Authors: Proggya Chakraborty, Aaquib Asrar, Jayasree Sengupta, Sipra Das Bit
Abstract: 5G networks enable diverse services such as eMBB, URLLC, and mMTC through network slicing, necessitating intelligent admission control and resource allocation to meet stringent QoS requirements while maximizing Network Service Provider (NSP) profits. However, existing Deep Reinforcement Learning (DRL) frameworks focus primarily on profit optimization without explicitly accounting for service delay, potentially leading to QoS violations for latency-sensitive slices. Moreover, commonly used epsilon-greedy exploration of DRL often results in unstable convergence and suboptimal policy learning. To address these gaps, we propose DePSAC -- a Delay and Profit-aware Slice Admission Control scheme. Our DRL-based approach incorporates a delay-aware reward function, where penalties due to service delay incentivize the prioritization of latency-critical slices such as URLLC. Additionally, we employ Boltzmann exploration to achieve smoother and faster convergence. We implement and evaluate DePSAC on a simulated 5G core network substrate with realistic Network Slice Request (NSLR) arrival patterns. Experimental results demonstrate that our method outperforms the DSARA baseline in terms of overall profit, reduced URLLC slice delays, improved acceptance rates, and improved resource consumption. These findings validate the effectiveness of the proposed DePSAC in achieving better QoS-profit trade-offs for practical 5G network slicing scenarios.
Authors: Gregory Yeghiyan, Jurius Azar, Devson Butani, Chan-Jin Chung
Abstract: This paper presents a real-time spill detection system that utilizes pretrained deep learning models with RGB and thermal imaging to classify spill vs. no-spill scenarios across varied environments. Using a balanced binary dataset (4,000 images), our experiments demonstrate the advantages of thermal imaging in inference speed, accuracy, and model size. We achieve up to 100% accuracy using lightweight models like VGG19 and NasNetMobile, with thermal models performing faster and more robustly across different lighting conditions. Our system runs on consumer-grade hardware (RTX 4080) and achieves inference times as low as 44 ms with model sizes under 350 MB, highlighting its deployability in safety-critical contexts. Results from experiments with a real robot and test datasets indicate that a VGG19 model trained on thermal imaging performs best.
Authors: Rohan Choudhury, Shanchuan Lin, Jianyi Wang, Hao Chen, Qi Zhao, Feng Cheng, Lu Jiang, Kris Kitani, Laszlo A. Jeni
Abstract: Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/
Authors: Rohit Dwivedula, Divyanshu Saxena, Aditya Akella, Swarat Chaudhuri, Daehyeok Kim
Abstract: Policy design for various systems controllers has conventionally been a manual process, with domain experts carefully tailoring heuristics for the specific instance in which the policy will be deployed. In this paper, we re-imagine policy design via a novel automated search technique fueled by recent advances in generative models, specifically Large Language Model (LLM)-driven code generation. We outline the design and implementation of PolicySmith, a framework that applies LLMs to synthesize instance-optimal heuristics. We apply PolicySmith to two long-standing systems policies - web caching and congestion control, highlighting the opportunities unraveled by this LLM-driven heuristic search. For caching, PolicySmith discovers heuristics that outperform established baselines on standard open-source traces. For congestion control, we show that PolicySmith can generate safe policies that integrate directly into the Linux kernel.
Authors: Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, Abha Jha, Hanwen Yang, Rong Xue, Sergey Zakharor, Vitor Guizilini, Yue Wang
Abstract: From loco-motion to dextrous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are a few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dextrous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks across 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.
Authors: Debeshee Das, Luca Beurer-Kellner, Marc Fischer, Maximilian Baader
Abstract: The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.
Authors: Siqi Zhu, David Zhang, Pedro Cisneros-Velarde, Jiaxuan You
Abstract: Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a mutual welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM's response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and mutual welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
Authors: Morris Trestman, Stefan Gugler, Felix A. Faber, O. A. von Lilienfeld
Abstract: Smart training set selections procedures enable the reduction of data needs and improves predictive robustness in machine learning problems relevant to chemistry. We introduce Gradient Guided Furthest Point Sampling (GGFPS), a simple extension of Furthest Point Sampling (FPS) that leverages molecular force norms to guide efficient sampling of configurational spaces of molecules. Numerical evidence is presented for a toy-system (Styblinski-Tang function) as well as for molecular dynamics trajectories from the MD17 dataset. Compared to FPS and uniform sampling, our numerical results indicate superior data efficiency and robustness when using GGFPS. Distribution analysis of the MD17 data suggests that FPS systematically under-samples equilibrium geometries, resulting in large test errors for relaxed structures. GGFPS cures this artifact and (i) enables up to two fold reductions in training cost without sacrificing predictive accuracy compared to FPS in the 2-dimensional Styblinksi-Tang system, (ii) systematically lowers prediction errors for equilibrium as well as strained structures in MD17, and (iii) systematically decreases prediction error variances across all of the MD17 configuration spaces. These results suggest that gradient-aware sampling methods hold great promise as effective training set selection tools, and that naive use of FPS may result in imbalanced training and inconsistent prediction outcomes.
Authors: Hideaki Kim, Tomoharu Iwata
Abstract: The representer theorem is a cornerstone of kernel methods, which aim to estimate latent functions in reproducing kernel Hilbert spaces (RKHSs) in a nonparametric manner. Its significance lies in converting inherently infinite-dimensional optimization problems into finite-dimensional ones over dual coefficients, thereby enabling practical and computationally tractable algorithms. In this paper, we address the problem of estimating the latent triggering kernels--functions that encode the interaction structure between events--for linear multivariate Hawkes processes based on observed event sequences within an RKHS framework. We show that, under the principle of penalized least squares minimization, a novel form of representer theorem emerges: a family of transformed kernels can be defined via a system of simultaneous integral equations, and the optimal estimator of each triggering kernel is expressed as a linear combination of these transformed kernels evaluated at the data points. Remarkably, the dual coefficients are all analytically fixed to unity, obviating the need to solve a costly optimization problem to obtain the dual coefficients. This leads to a highly efficient estimator capable of handling large-scale data more effectively than conventional nonparametric approaches. Empirical evaluations on synthetic datasets reveal that the proposed method attains competitive predictive accuracy while substantially improving computational efficiency over existing state-of-the-art kernel method-based estimators.
Authors: Daiki Yoshikawa, Takashi Matsubara
Abstract: Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
Authors: Yunrui Guan, Krishnakumar Balasubramanian, Shiqian Ma
Abstract: We study generative modeling on convex domains using flow matching and mirror maps, and identify two fundamental challenges. First, standard log-barrier mirror maps induce heavy-tailed dual distributions, leading to ill-posed dynamics. Second, coupling with Gaussian priors performs poorly when matching heavy-tailed targets. To address these issues, we propose Mirror Flow Matching based on a \emph{regularized mirror map} that controls dual tail behavior and guarantees finite moments, together with coupling to a Student-$t$ prior that aligns with heavy-tailed targets and stabilizes training. We provide theoretical guarantees, including spatial Lipschitzness and temporal regularity of the velocity field, Wasserstein convergence rates for flow matching with Student-$t$ priors and primal-space guarantees for constrained generation, under $\varepsilon$-accurate learned velocity fields. Empirically, our method outperforms baselines in synthetic convex-domain simulations and achieves competitive sample quality on real-world constrained generative tasks.
Authors: Ashish Kattamuri, Harshwardhan Fartale, Arpita Vats, Rahul Raja, Ishita Prasad
Abstract: Data contamination poses a significant challenge to reliable LLM evaluation, where models may achieve high performance by memorizing training data rather than demonstrating genuine reasoning capabilities. We introduce RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework that leverages mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features spanning surface-level confidence trajectories and deep mechanistic properties including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR achieves 93\% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7\% accuracy on challenging ambiguous examples. This work demonstrates the potential of mechanistic interpretability for advancing LLM evaluation beyond traditional surface-level metrics.
Authors: Siyuan Chen, Minghao Guo, Caoliwen Wang, Anka He Chen, Yikun Zhang, Jingjing Chai, Yin Yang, Wojciech Matusik, Peter Yichen Chen
Abstract: Biomolecular interaction modeling has been substantially advanced by foundation models, yet they often produce all-atom structures that violate basic steric feasibility. We address this limitation by enforcing physical validity as a strict constraint during both training and inference with a uniffed module. At its core is a differentiable projection that maps the provisional atom coordinates from the diffusion model to the nearest physically valid conffguration. This projection is achieved using a Gauss-Seidel scheme, which exploits the locality and sparsity of the constraints to ensure stable and fast convergence at scale. By implicit differentiation to obtain gradients, our module integrates seamlessly into existing frameworks for end-to-end ffnetuning. With our Gauss-Seidel projection module in place, two denoising steps are sufffcient to produce biomolecular complexes that are both physically valid and structurally accurate. Across six benchmarks, our 2-step model achieves the same structural accuracy as state-of-the-art 200-step diffusion baselines, delivering approximately 10 times faster wall-clock speed while guaranteeing physical validity.
Authors: Nisha Pillai, Aditi Virupakshaiah, Harrison W. Smith, Amanda J. Ashworth, Prasanna Gowda, Phillip R. Owens, Adam R. Rivers, Bindu Nanduri, Mahalingam Ramkumar
Abstract: Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.
Authors: Yuki Nii, Futa Waseda, Ching-Chun Chang, Isao Echizen
Abstract: AI-based colorization has shown remarkable capability in generating realistic color images from grayscale inputs. However, it poses risks of copyright infringement -- for example, the unauthorized colorization and resale of monochrome manga and films. Despite these concerns, no effective method currently exists to prevent such misuse. To address this, we introduce the first defensive paradigm, Uncolorable Examples, which embed imperceptible perturbations into grayscale images to invalidate unauthorized colorization. To ensure real-world applicability, we establish four criteria: effectiveness, imperceptibility, transferability, and robustness. Our method, Perception-Aware Chroma-Restrictive Perturbation (PAChroma), generates Uncolorable Examples that meet these four criteria by optimizing imperceptible perturbations with a Laplacian filter to preserve perceptual quality, and applying diverse input transformations during optimization to enhance transferability across models and robustness against common post-processing (e.g., compression). Experiments on ImageNet and Danbooru datasets demonstrate that PAChroma effectively degrades colorization quality while maintaining the visual appearance. This work marks the first step toward protecting visual content from illegitimate AI colorization, paving the way for copyright-aware defenses in generative media.
Authors: Hamed Mahdavi (Pennsylvania State University), Pouria Mahdavinia (Pennsylvania State University), Samira Malek (Pennsylvania State University), Pegah Mohammadipour (Pennsylvania State University), Alireza Hashemi (City University of New York), Majid Daliri (New York University), Alireza Farhadi (Amirkabir University of Technology), Amir Khasahmadi (Autodesk), Niloofar Mireshghallah (Carnegie Mellon University), Vasant Honavar (Pennsylvania State University)
Abstract: State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
Authors: Weitong Kong, Zichao Zeng, Di Wen, Jiale Wei, Kunyu Peng, June Moh Goo, Jan Boehm, Rainer Stiefelhagen
Abstract: Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.
Authors: Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang, Zhiting Hu, Biwei Huang
Abstract: We study how to endow GUI agents with scalable memory that help generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone's input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about \$4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2\% parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
Authors: Minghao Han, Kiwan Wong, Adrian Wing-Keung Law, Xunyuan Yin
Abstract: In this work, we propose a meta-learning-based Koopman modeling and predictive control approach for nonlinear systems with parametric uncertainties. An adaptive deep meta-learning-based modeling approach, called Meta Adaptive Koopman Operator (MAKO), is proposed. Without knowledge of the parametric uncertainty, the proposed MAKO approach can learn a meta-model from a multi-modal dataset and efficiently adapt to new systems with previously unseen parameter settings by using online data. Based on the learned meta Koopman model, a predictive control scheme is developed, and the stability of the closed-loop system is ensured even in the presence of previously unseen parameter settings. Through extensive simulations, our proposed approach demonstrates superior performance in both modeling accuracy and control efficacy as compared to competitive baselines.
Authors: Manojit Chakraborty, Madhusudan Ghosh, Rishabh Gupta
Abstract: In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
Authors: Muhammad Ali Shafique, Kanwal Mehreen, Muhammad Arham, Maaz Amjad, Sabur Butt, Hamza Farooq
Abstract: Developing a high-performing large language models (LLMs) for low-resource languages such as Urdu, present several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model, that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought based reasoning, bilingual translation, cultural relevance, and ethical safety alignments. This technique significantly enhances the comprehension of Alif-1.0-8B-Instruct model for Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct for Urdu specific-tasks. It also outperformed leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performance and low-resource language LLMs can be developed efficiently and culturally aligned using our modified self-instruct approach. All datasets, models, and code are publicly available at: https://github.com/traversaal-ai/alif-urdu-llm.
Authors: Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Abstract: We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.
Authors: Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu
Abstract: Effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are subsequently used to train an emotion classifier using clean samples from publicly available datasets, and are evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.
Authors: Gurprit Singh, Wenzel Jakob
Abstract: Generative artificial intelligence (AI) has made unprecedented advances in vision language models over the past two years. During the generative process, new samples (images) are generated from an unknown high-dimensional distribution. Markov Chain Monte Carlo (MCMC) methods are particularly effective in drawing samples from such complex, high-dimensional distributions. This makes MCMC methods an integral component for models like EBMs, ensuring accurate sample generation. Gradient-based optimization is at the core of modern generative models. The update step during the optimization forms a Markov chain where the new update depends only on the current state. This allows exploration of the parameter space in a memoryless manner, thus combining the benefits of gradient-based optimization and MCMC sampling. MCMC methods have shown an equally important role in physically based rendering where complex light paths are otherwise quite challenging to sample from simple importance sampling techniques. A lot of research is dedicated towards bringing physical realism to samples (images) generated from diffusion-based generative models in a data-driven manner, however, a unified framework connecting these techniques is still missing. In this course, we take the first steps toward understanding each of these components and exploring how MCMC could potentially serve as a bridge, linking these closely related areas of research. Our course aims to provide necessary theoretical and practical tools to guide students, researchers and practitioners towards the common goal of generative physically based rendering. All Jupyter notebooks with demonstrations associated to this tutorial can be found on the project webpage: https://sinbag.github.io/mcmc/
Authors: Xinhu Li, Ayush Jain, Zhaojing Yang, Yigit Korkmaz, Erdem B{\i}y{\i}k
Abstract: Learning from demonstrations enables experts to teach robots complex tasks using interfaces such as kinesthetic teaching, joystick control, and sim-to-real transfer. However, these interfaces often constrain the expert's ability to demonstrate optimal behavior due to indirect control, setup restrictions, and hardware safety. For example, a joystick can move a robotic arm only in a 2D plane, even though the robot operates in a higher-dimensional space. As a result, the demonstrations collected by constrained experts lead to suboptimal performance of the learned policies. This raises a key question: Can a robot learn a better policy than the one demonstrated by a constrained expert? We address this by allowing the agent to go beyond direct imitation of expert actions and explore shorter and more efficient trajectories. We use the demonstrations to infer a state-only reward signal that measures task progress, and self-label reward for unknown states using temporal interpolation. Our approach outperforms common imitation learning in both sample efficiency and task completion time. On a real WidowX robotic arm, it completes the task in 12 seconds, 10x faster than behavioral cloning, as shown in real-robot videos on https://sites.google.com/view/constrainedexpert .
Authors: Irash Perera (Department of Computer Science and Engineering, University of Moratuwa, Colombo, Sri Lanka), Uthayasanker Thayasivam (Department of Computer Science and Engineering, University of Moratuwa, Colombo, Sri Lanka)
Abstract: Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis, especially for identifying subtle pathological features. This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis. While applied here to the specific problem of COVID-19 diagnosis, the methodology offers a generalizable framework for classifying a wide range of pathologies from CT scans. The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing and augmentation to a disciplined two-phase training strategy that leverages transfer learning effectively. The architecture uniquely integrates features extracted from three parallel branches: Global Average Pooling, Global Max Pooling, and a new Attention-weighted Pooling mechanism. The model was trained and validated on a combined dataset of 2,609 CT slices derived from two distinct datasets. Experimental results demonstrate a superior performance on the validation set, achieving a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset. These findings indicate that a modern, multi-branch architecture, coupled with careful data handling, can achieve performance comparable to or exceeding contemporary state-of-the-art models, thereby proving the efficacy of advanced deep learning techniques for robust medical diagnostics.
Authors: Hao Zeng, Jianguo Huang, Bingyi Jing, Hongxin Wei, Bo An
Abstract: Large reasoning models (LRMs) have achieved remarkable progress in complex problem-solving tasks. Despite this success, LRMs typically suffer from high computational costs during deployment, highlighting a need for efficient inference. A popular direction of efficiency improvement is to switch the LRM between thinking and nonthinking modes dynamically. However, such approaches often introduce additional reasoning errors and lack statistical guarantees for the performance loss, which are critical for high-stakes applications. In this work, we propose Probably Approximately Correct (PAC) reasoning that controls the performance loss under the user-specified performance loss tolerance. In particular, we construct an upper confidence bound on the performance loss, formulated as a monotone function of the uncertainty score, and subsequently determine a threshold for switching to the nonthinking model. Theoretically, using the threshold to switch between the thinking and nonthinking modes ensures bounded performance loss in a distribution-free manner. Our comprehensive experiments on reasoning benchmarks show that the proposed method can save computational budgets and control the user-specified performance loss.
Authors: Aziz Bacha, Thomas George
Abstract: Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores *training feature attribution*, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.
Authors: Mira Raheem, Michael Papazoglou, Bernd Kr\"amer, Neamat El-Tazi, Amal Elgammal
Abstract: Connected health is a multidisciplinary approach focused on health management, prioritizing pa-tient needs in the creation of tools, services, and treatments. This paradigm ensures proactive and efficient care by facilitating the timely exchange of accurate patient information among all stake-holders in the care continuum. The rise of digital technologies and process innovations promises to enhance connected health by integrating various healthcare data sources. This integration aims to personalize care, predict health outcomes, and streamline patient management, though challeng-es remain, particularly in data architecture, application interoperability, and security. Data analytics can provide critical insights for informed decision-making and health co-creation, but solutions must prioritize end-users, including patients and healthcare professionals. This perspective was explored through an agile System Development Lifecycle in an EU-funded project aimed at developing an integrated AI-generated solution for managing cancer patients undergoing immunotherapy. This paper contributes with a collaborative digital framework integrating stakeholders across the care continuum, leveraging federated big data analytics and artificial intelligence for improved decision-making while ensuring privacy. Analytical capabilities, such as treatment recommendations and adverse event predictions, were validated using real-life data, achieving 70%-90% accuracy in a pilot study with the medical partners, demonstrating the framework's effectiveness.
Authors: Mihriban Ceylan, David J. Pr\"omel
Abstract: The universal approximation property uniformly with respect to weakly compact families of measures is established for several classes of neural networks. To that end, we prove that these neural networks are dense in Orlicz spaces, thereby extending classical universal approximation theorems even beyond the traditional $L^p$-setting. The covered classes of neural networks include widely used architectures like feedforward neural networks with non-polynomial activation functions, deep narrow networks with ReLU activation functions and functional input neural networks.
Authors: Giacomo Dimarco, Federica Ferrarese, Lorenzo Pareschi
Abstract: In this work, we propose a data augmentation strategy aimed at improving the training phase of neural networks and, consequently, the accuracy of their predictions. Our approach relies on generating synthetic data through a suitable compartmental model combined with the incorporation of uncertainty. The available data are then used to calibrate the model, which is further integrated with deep learning techniques to produce additional synthetic data for training. The results show that neural networks trained on these augmented datasets exhibit significantly improved predictive performance. We focus in particular on two different neural network architectures: Physics-Informed Neural Networks (PINNs) and Nonlinear Autoregressive (NAR) models. The NAR approach proves especially effective for short-term forecasting, providing accurate quantitative estimates by directly learning the dynamics from data and avoiding the additional computational cost of embedding physical constraints into the training. In contrast, PINNs yield less accurate quantitative predictions but capture the qualitative long-term behavior of the system, making them more suitable for exploring broader dynamical trends. Numerical simulations of the second phase of the COVID-19 pandemic in the Lombardy region (Italy) validate the effectiveness of the proposed approach.
Authors: Simon Idoko, Arun Kumar Singh
Abstract: Centralized trajectory optimization in the joint space of multiple robots allows access to a larger feasible space that can result in smoother trajectories, especially while planning in tight spaces. Unfortunately, it is often computationally intractable beyond a very small swarm size. In this paper, we propose Flow-Opt, a learning-based approach towards improving the computational tractability of centralized multi-robot trajectory optimization. Specifically, we reduce the problem to first learning a generative model to sample different candidate trajectories and then using a learned Safety-Filter(SF) to ensure fast inference-time constraint satisfaction. We propose a flow-matching model with a diffusion transformer (DiT) augmented with permutation invariant robot position and map encoders as the generative model. We develop a custom solver for our SF and equip it with a neural network that predicts context-specific initialization. The initialization network is trained in a self-supervised manner, taking advantage of the differentiability of the SF solver. We advance the state-of-the-art in the following respects. First, we show that we can generate trajectories of tens of robots in cluttered environments in a few tens of milliseconds. This is several times faster than existing centralized optimization approaches. Moreover, our approach also generates smoother trajectories orders of magnitude faster than competing baselines based on diffusion models. Second, each component of our approach can be batched, allowing us to solve a few tens of problem instances in a fraction of a second. We believe this is a first such result; no existing approach provides such capabilities. Finally, our approach can generate a diverse set of trajectories between a given set of start and goal locations, which can capture different collision-avoidance behaviors.
Authors: Yifan Zhu, Lijia Yu, Xiao-Shan Gao
Abstract: In recent years, data poisoning attacks have been increasingly designed to appear harmless and even beneficial, often with the intention of verifying dataset ownership or safeguarding private data from unauthorized use. However, these developments have the potential to cause misunderstandings and conflicts, as data poisoning has traditionally been regarded as a security threat to machine learning systems. To address this issue, it is imperative for harmless poisoning generators to claim ownership of their generated datasets, enabling users to identify potential poisoning to prevent misuse. In this paper, we propose the deployment of watermarking schemes as a solution to this challenge. We introduce two provable and practical watermarking approaches for data poisoning: {\em post-poisoning watermarking} and {\em poisoning-concurrent watermarking}. Our analyses demonstrate that when the watermarking length is $\Theta(\sqrt{d}/\epsilon_w)$ for post-poisoning watermarking, and falls within the range of $\Theta(1/\epsilon_w^2)$ to $O(\sqrt{d}/\epsilon_p)$ for poisoning-concurrent watermarking, the watermarked poisoning dataset provably ensures both watermarking detectability and poisoning utility, certifying the practicality of watermarking under data poisoning attacks. We validate our theoretical findings through experiments on several attacks, models, and datasets.
Authors: Tao Feng, Lizhen Qu, Niket Tandon, Gholamreza Haffari
Abstract: Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.
Authors: Jindong Hong, Wencheng Zhang, Shiqin Qiao, Jianhai Chen, Jianing Qiu, Chuanyang Zheng, Qian Xu, Yun Ji, Qianyue Wen, Weiwei Sun, Hao Li, Huizhen Li, Huichao Wang, Kai Wu, Meng Li, Yijun He, Lingjie Luo, Jiankai Sun
Abstract: Shoulder disorders, such as frozen shoulder (a.k.a., adhesive capsulitis), are common conditions affecting the health of people worldwide, and have a high incidence rate among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework divides the two tasks of action understanding and disease diagnosis, which are respectively completed by two MLLMs. In addition to traditional evaluation indicators, this work proposes a novel metric called Usability Index by the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries has increased by 79.6\% compared with direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.
Authors: Javier Albert-Smet, Zoraida Frias, Luis Mendo, Sergio Melones, Eduardo Yraola
Abstract: Characterizing application-layer user throughput in next-generation networks is increasingly challenging as the higher capacity of the 5G Radio Access Network (RAN) shifts connectivity bottlenecks towards deeper parts of the network. Traditional methods, such as drive tests and operator equipment counters, are costly, limited, or fail to capture end-to-end (E2E) Quality of Service (QoS) and its variability. In this work, we leverage large-scale crowdsourced measurements-including E2E, radio, contextual and network deployment features collected by the user equipment (UE)-to propose an uncertainty-aware and explainable approach for downlink user throughput estimation. We first validate prior 4G methods, improving R^2 by 8.7%, and then extend them to 5G NSA and 5G SA, providing the first benchmarks for 5G crowdsourced datasets. To address the variability of throughput, we apply NGBoost, a model that outputs both point estimates and calibrated confidence intervals, representing its first use in the field of computer communications. Finally, we use the proposed model to analyze the evolution from 4G to 5G SA, and show that throughput bottlenecks move from the RAN to transport and service layers, as seen by E2E metrics gaining importance over radio-related features.
Authors: Marco Siino, Giuseppe Bonomo, Rosario Sorbello, Ilenia Tinnirello
Abstract: The present study investigates the impact of the Rational Discrete Wavelet Transform (RDWT), used as a plug-in preprocessing step for motor imagery electroencephalographic (EEG) decoding prior to applying deep learning classifiers. A systematic paired evaluation (with/without RDWT) is conducted on four state-of-the-art deep learning architectures: EEGNet, ShallowConvNet, MBEEG\_SENet, and EEGTCNet. This evaluation was carried out across three benchmark datasets: High Gamma, BCI-IV-2a, and BCI-IV-2b. The performance of the RDWT is reported with subject-wise averages using accuracy and Cohen's kappa, complemented by subject-level analyses to identify when RDWT is beneficial. On BCI-IV-2a, RDWT yields clear average gains for EEGTCNet (+4.44 percentage points, pp; kappa +0.059) and MBEEG\_SENet (+2.23 pp; +0.030), with smaller improvements for EEGNet (+2.08 pp; +0.027) and ShallowConvNet (+0.58 pp; +0.008). On BCI-IV-2b, the enhancements observed are modest yet consistent for EEGNet (+0.21 pp; +0.044) and EEGTCNet (+0.28 pp; +0.077). On HGD, average effects are modest to positive, with the most significant gain observed for MBEEG\_SENet (+1.65 pp; +0.022), followed by EEGNet (+0.76 pp; +0.010) and EEGTCNet (+0.54 pp; +0.008). Inspection of the subject material reveals significant enhancements in challenging recordings (e.g., non-stationary sessions), indicating that RDWT can mitigate localized noise and enhance rhythm-specific information. In conclusion, RDWT is shown to be a low-overhead, architecture-aware preprocessing technique that can yield tangible gains in accuracy and agreement for deep model families and challenging subjects.
Authors: Zofia Bracha, Pawe{\l} Sakowski, Jakub Micha\'nk\'ow
Abstract: This paper explores the application of deep Q-learning to hedging at-the-money options on the S\&P~500 index. We develop an agent based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, trained to simulate hedging decisions without making explicit model assumptions on price dynamics. The agent was trained on historical intraday prices of S\&P~500 call options across years 2004--2024, using a single time series of six predictor variables: option price, underlying asset price, moneyness, time to maturity, realized volatility, and current hedge position. A walk-forward procedure was applied for training, which led to nearly 17~years of out-of-sample evaluation. The performance of the deep reinforcement learning (DRL) agent is benchmarked against the Black--Scholes delta-hedging strategy over the same period. We assess both approaches using metrics such as annualized return, volatility, information ratio, and Sharpe ratio. To test the models' adaptability, we performed simulations across varying market conditions and added constraints such as transaction costs and risk-awareness penalties. Our results show that the DRL agent can outperform traditional hedging methods, particularly in volatile or high-cost environments, highlighting its robustness and flexibility in practical trading contexts. While the agent consistently outperforms delta-hedging, its performance deteriorates when the risk-awareness parameter is higher. We also observed that the longer the time interval used for volatility estimation, the more stable the results.
Authors: X. Jiang, J. Qiu, K. Gustavsson, B. Mehlig, L. Zhao
Abstract: Artificial gliders are designed to disperse as they settle through a fluid, requiring precise navigation to reach target locations. We show that a compact glider settling in a viscous fluid can navigate by dynamically adjusting its centre-of-mass. Using fully resolved direct numerical simulations (DNS) and reinforcement learning, we find two optimal navigation strategies that allow the glider to reach its target location accurately. These strategies depend sensitively on how the glider interacts with the surrounding fluid. The nature of this interaction changes as the particle Reynolds number Re$_p$ changes. Our results explain how the optimal strategy depends on Re$_p$. At large Re$_p$, the glider learns to tumble rapidly by moving its centre-of-mass as its orientation changes. This generates a large horizontal inertial lift force, which allows the glider to travel far. At small Re$_p$, by contrast, high viscosity hinders tumbling. In this case, the glider learns to adjust its centre-of-mass so that it settles with a steady, inclined orientation that results in a horizontal viscous force. The horizontal range is much smaller than for large Re$_p$, because this viscous force is much smaller than the inertial lift force at large Re$_p$. *These authors contributed equally.
Authors: Alina Elena Baia, Alessio Xompero, Andrea Cavallaro
Abstract: While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs, according to a privacy benchmark, using task-aligned prompts and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
Authors: Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li
Abstract: Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
Authors: Subrat Kishore Dutta, Yuelin Xu, Piyush Pant, Xiao Zhang
Abstract: Recent work has shown that RLHF is highly susceptible to backdoor attacks, poisoning schemes that inject malicious triggers in preference data. However, existing methods often rely on static, rare-token-based triggers, limiting their effectiveness in realistic scenarios. In this paper, we develop GREAT, a novel framework for crafting generalizable backdoors in RLHF through emotion-aware trigger synthesis. Specifically, GREAT targets harmful response generation for a vulnerable user subgroup characterized by both semantically violent requests and emotionally angry triggers. At the core of GREAT is a trigger identification pipeline that operates in the latent embedding space, leveraging principal component analysis and clustering techniques to identify the most representative triggers. To enable this, we present Erinyes, a high-quality dataset of over $5000$ angry triggers curated from GPT-4.1 using a principled, hierarchical, and diversity-promoting approach. Experiments on benchmark RLHF datasets demonstrate that GREAT significantly outperforms baseline methods in attack success rates, especially for unseen trigger scenarios, while largely preserving the response quality on benign inputs.
Authors: Amina Ferrad, Johann Huber, Fran\c{c}ois H\'el\'enon, Julien Gleyze, Mahdi Khoramshahi, St\'ephane Doncieux
Abstract: Robotics research has made significant strides in learning, yet mastering basic skills like object placement remains a fundamental challenge. A key bottleneck is the acquisition of large-scale, high-quality data, which is often a manual and laborious process. Inspired by Graspit!, a foundational work that used simulation to automatically generate dexterous grasp poses, we introduce Placeit!, an evolutionary-computation framework for generating valid placement positions for rigid objects. Placeit! is highly versatile, supporting tasks from placing objects on tables to stacking and inserting them. Our experiments show that by leveraging quality-diversity optimization, Placeit! significantly outperforms state-of-the-art methods across all scenarios for generating diverse valid poses. A pick&place pipeline built on our framework achieved a 90% success rate over 120 real-world deployments. This work positions Placeit! as a powerful tool for open-environment pick-and-place tasks and as a valuable engine for generating the data needed to train simulation-based foundation models in robotics.
Authors: Zirun Zhou, Zhengyang Xiao, Haochuan Xu, Jing Sun, Di Wang, Jingfeng Zhang
Abstract: Recent advances in vision-language-action (VLA) models have greatly improved embodied AI, enabling robots to follow natural language instructions and perform diverse tasks. However, their reliance on uncurated training datasets raises serious security concerns. Existing backdoor attacks on VLAs mostly assume white-box access and result in task failures instead of enforcing specific actions. In this work, we reveal a more practical threat: attackers can manipulate VLAs by simply injecting physical objects as triggers into the training dataset. We propose goal-oriented backdoor attacks (GoBA), where the VLA behaves normally in the absence of physical triggers but executes predefined and goal-oriented actions in the presence of physical triggers. Specifically, based on a popular VLA benchmark LIBERO, we introduce BadLIBERO that incorporates diverse physical triggers and goal-oriented backdoor actions. In addition, we propose a three-level evaluation that categorizes the victim VLA's actions under GoBA into three states: nothing to do, try to do, and success to do. Experiments show that GoBA enables the victim VLA to successfully achieve the backdoor goal in 97 percentage of inputs when the physical trigger is present, while causing zero performance degradation on clean inputs. Finally, by investigating factors related to GoBA, we find that the action trajectory and trigger color significantly influence attack performance, while trigger size has surprisingly little effect. The code and BadLIBERO dataset are accessible via the project page at https://goba-attack.github.io/.
Authors: Pablo G. Arce, Roi Naveiro, David R\'ios Insua
Abstract: The vulnerability of machine learning models to adversarial attacks remains a critical security challenge. Traditional defenses, such as adversarial training, typically robustify models by minimizing a worst-case loss. However, these deterministic approaches do not account for uncertainty in the adversary's attack. While stochastic defenses placing a probability distribution on the adversary exist, they often lack statistical rigor and fail to make explicit their underlying assumptions. To resolve these issues, we introduce a formal Bayesian framework that models adversarial uncertainty through a stochastic channel, articulating all probabilistic assumptions. This yields two robustification strategies: a proactive defense enacted during training, aligned with adversarial training, and a reactive defense enacted during operations, aligned with adversarial purification. Several previous defenses can be recovered as limiting cases of our model. We empirically validate our methodology, showcasing the benefits of explicitly modeling adversarial uncertainty.
Authors: Mira Raheem, Amal Elgammal, Michael Papazoglou, Bernd Kr\"amer, Neamat El-Tazi
Abstract: Artificial intelligence (AI) has the potential to transform healthcare by supporting more accurate diagnoses and personalized treatments. However, its adoption in practice remains constrained by fragmented data sources, strict privacy rules, and the technical complexity of building reliable clinical systems. To address these challenges, we introduce a model driven engineering (MDE) framework designed specifically for healthcare AI. The framework relies on formal metamodels, domain-specific languages (DSLs), and automated transformations to move from high level specifications to running software. At its core is the Medical Interoperability Language (MILA), a graphical DSL that enables clinicians and data scientists to define queries and machine learning pipelines using shared ontologies. When combined with a federated learning architecture, MILA allows institutions to collaborate without exchanging raw patient data, ensuring semantic consistency across sites while preserving privacy. We evaluate this approach in a multi center cancer immunotherapy study. The generated pipelines delivered strong predictive performance, with support vector machines achieving up to 98.5 percent and 98.3 percent accuracy in key tasks, while substantially reducing manual coding effort. These findings suggest that MDE principles metamodeling, semantic integration, and automated code generation can provide a practical path toward interoperable, reproducible, and trustworthy digital health platforms.
Authors: Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda
Abstract: Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model's latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.
Authors: Siu-Kui Au, Zi-Jun Cao
Abstract: Engineering risk is concerned with the likelihood of failure and the scenarios when it occurs. The sensitivity of failure probability to change in system parameters is relevant to risk-informed decision making. Computing sensitivity is at least one level more difficult than the probability itself, which is already challenged by a large number of input random variables, rare events and implicit nonlinear `black-box' response. Finite difference with Monte Carlo probability estimates is spurious, requiring the number of samples to grow with the reciprocal of step size to suppress estimation variance. Many existing works gain efficiency by exploiting a specific class of input variables, sensitivity parameters, or response in its exact or surrogate form. For general systems, this work presents a theory and associated Monte Carlo strategy for computing sensitivity using response values and gradients with respect to sensitivity parameters. It is shown that the sensitivity at a given response threshold can be expressed via the expectation of response gradient conditional on the threshold. Determining the expectation requires conditioning on the threshold that is a zero-probability event, but it can be resolved by the concept of kernel smoothing. The proposed method offers sensitivity estimates for all response thresholds generated in a single Monte Carlo run. It is investigated in a number of examples featuring sensitivity parameters of different nature. As response gradient becomes increasingly available, it is hoped that this work can provide the basis for embedding sensitivity calculations with reliability in the same Monte Carlo run.
Authors: Joachim Diederich
Abstract: We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model's attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.
Authors: Valentin Biller, Lucas Zimmer, Can Erdur, Sandeep Nagar, Daniel R\"uckert, Niklas Bubeck, Jonas Weidner
Abstract: Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git
Authors: Sahab Zandi, Kamesh Korangi, Juan C. Moreno-Paredes, Mar\'ia \'Oskarsd\'ottir, Christophe Mues, Cristi\'an Bravo
Abstract: Small and Medium-sized Enterprises (SMEs) are known to play a vital role in economic growth, employment, and innovation. However, they tend to face significant challenges in accessing credit due to limited financial histories, collateral constraints, and exposure to macroeconomic shocks. These challenges make an accurate credit risk assessment by lenders crucial, particularly since SMEs frequently operate within interconnected firm networks through which default risk can propagate. This paper presents and tests a novel approach for modelling the risk of SME credit, using a unique large data set of SME loans provided by a prominent financial institution. Specifically, our approach employs Graph Neural Networks to predict SME default using multilayer network data derived from common ownership and financial transactions between firms. We show that combining this information with traditional structured data not only improves application scoring performance, but also explicitly models contagion risk between companies. Further analysis shows how the directionality and intensity of these connections influence financial risk contagion, offering a deeper understanding of the underlying processes. Our findings highlight the predictive power of network data, as well as the role of supply chain networks in exposing SMEs to correlated default risk.
Authors: Beige Jerry Jin, Leila Wehbe
Abstract: Current non-invasive neuroimaging techniques trade off between spatial resolution and temporal resolution. While magnetoencephalography (MEG) can capture rapid neural dynamics and functional magnetic resonance imaging (fMRI) can spatially localize brain activity, a unified picture that preserves both high resolutions remains an unsolved challenge with existing source localization or MEG-fMRI fusion methods, especially for single-trial naturalistic data. We collected whole-head MEG when subjects listened passively to more than seven hours of narrative stories, using the same stimuli in an open fMRI dataset (LeBel et al., 2023). We developed a transformer-based encoding model that combines the MEG and fMRI from these two naturalistic speech comprehension experiments to estimate latent cortical source responses with high spatiotemporal resolution. Our model is trained to predict MEG and fMRI from multiple subjects simultaneously, with a latent layer that represents our estimates of reconstructed cortical sources. Our model predicts MEG better than the common standard of single-modality encoding models, and it also yields source estimates with higher spatial and temporal fidelity than classic minimum-norm solutions in simulation experiments. We validated the estimated latent sources by showing its strong generalizability across unseen subjects and modalities. Estimated activity in our source space predict electrocorticography (ECoG) better than an ECoG-trained encoding model in an entirely new dataset. By integrating the power of large naturalistic experiments, MEG, fMRI, and encoding models, we propose a practical route towards millisecond-and-millimeter brain mapping.
Authors: Yavuz Durmazkeser, Patrik Okanovic, Andreas Kirsch, Torsten Hoefler, Nezihe Merve G\"urel
Abstract: We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.
Authors: Nizar El Ghazal, Antoine Caubri\`ere, Valentin Vielzeuf
Abstract: This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
Authors: David-Alexandre Duclos, William Guimont-Martin, Gabriel Jeanson, Arthur Larochelle-Tremblay, Th\'eo Defosse, Fr\'ed\'eric Moore, Philippe Nolet, Fran\c{c}ois Pomerleau, Philippe Gigu\`ere
Abstract: Interest in robotics for forest management is growing, but perception in complex, natural environments remains a significant hurdle. Conditions such as heavy occlusion, variable lighting, and dense vegetation pose challenges to automated systems, which are essential for precision forestry, biodiversity monitoring, and the automation of forestry equipment. These tasks rely on advanced perceptual capabilities, such as detection and fine-grained species classification of individual trees. Yet, existing datasets are inadequate to develop such perception systems, as they often focus on urban settings or a limited number of species. To address this, we present SilvaScenes, a new dataset for instance segmentation of tree species from under-canopy images. Collected across five bioclimatic domains in Quebec, Canada, SilvaScenes features 1476 trees from 24 species with annotations from forestry experts. We demonstrate the relevance and challenging nature of our dataset by benchmarking modern deep learning approaches for instance segmentation. Our results show that, while tree segmentation is easy, with a top mean average precision (mAP) of 67.65%, species classification remains a significant challenge with an mAP of only 35.69%. Our dataset and source code will be available at https://github.com/norlab-ulaval/SilvaScenes.
Authors: Ralf R\"omer, Adrian Kobras, Luca Worbis, Angela P. Schoellig
Abstract: Imitation learning (IL) with generative models, such as diffusion and flow matching, has enabled robots to perform complex, long-horizon tasks. However, distribution shifts from unseen environments or compounding action errors can still cause unpredictable and unsafe behavior, leading to task failure. Early failure prediction during runtime is therefore essential for deploying robots in human-centered and safety-critical environments. We propose FIPER, a general framework for Failure Prediction at Runtime for generative IL policies that does not require failure data. FIPER identifies two key indicators of impending failure: (i) out-of-distribution (OOD) observations detected via random network distillation in the policy's embedding space, and (ii) high uncertainty in generated actions measured by a novel action-chunk entropy score. Both failure prediction scores are calibrated using a small set of successful rollouts via conformal prediction. A failure alarm is triggered when both indicators, aggregated over short time windows, exceed their thresholds. We evaluate FIPER across five simulation and real-world environments involving diverse failure modes. Our results demonstrate that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing methods. We thus consider this work an important step towards more interpretable and safer generative robot policies. Code, data and videos are available at https://tum-lsy.github.io/fiper_website.
Authors: Manuel Vargas Guzm\'an, Jakub Szymanik, Maciej Malicki
Abstract: Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications like logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through iterative application of inference rules. In the literature, these two aspects are often confounded together under the umbrella term of generalization. To sharpen this distinction, we investigated the logical generalization capabilities of pre-trained large language models (LLMs) using the syllogistic fragment as a benchmark for natural language reasoning. Though simple, this fragment provides a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings reveal a significant disparity: while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture integrating symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference, neural components accelerate processing, while symbolic reasoning ensures completeness. Our experiments show that high efficiency is preserved even with relatively small neural components. As part of our proposed methodology, this analysis gives a rationale and highlights the potential of hybrid models to effectively address key generalization barriers in neural reasoning systems.
Authors: Jisu Han, Wonjun Hwang
Abstract: Test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency of dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
Authors: Ruben Pascual, Mikel Sesma-Sara, Aranzazu Jurio, Daniel Paternain, Mikel Galar
Abstract: The audiovisual industry is undergoing a profound transformation as it is integrating AI developments not only to automate routine tasks but also to inspire new forms of art. This paper addresses the problem of producing a virtually unlimited number of novel characters that preserve the artistic style and shared visual traits of a small set of human-designed reference characters, thus broadening creative possibilities in animation, gaming, and related domains. Our solution builds upon DreamBooth, a well-established fine-tuning technique for text-to-image diffusion models, and adapts it to tackle two core challenges: capturing intricate character details beyond textual prompts and the few-shot nature of the training data. To achieve this, we propose a multi-token strategy, using clustering to assign separate tokens to individual characters and their collective style, combined with LoRA-based parameter-efficient fine-tuning. By removing the class-specific regularization set and introducing random tokens and embeddings during generation, our approach allows for unlimited character creation while preserving the learned style. We evaluate our method on five small specialized datasets, comparing it to relevant baselines using both quantitative metrics and a human evaluation study. Our results demonstrate that our approach produces high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters, with human evaluation further reinforcing its effectiveness and highlighting the potential of our method.
Authors: Conor Hassan, Nasrulloh Loka, Cen-You Li, Daolang Huang, Paul E. Chang, Yang Yang, Francesco Silvestrin, Samuel Kaski, Luigi Acerbi
Abstract: Transformer-based models for amortized probabilistic inference, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass marginal prediction. However, many real-world applications, from signal interpolation to multi-column tabular predictions, require coherent joint distributions that capture dependencies between predictions. While purely autoregressive architectures efficiently generate such distributions, they sacrifice the flexible set-conditioning that makes these models powerful for meta-learning. Conversely, the standard approach to obtain joint distributions from set-based models requires expensive re-encoding of the entire augmented conditioning set at each autoregressive step. We introduce a causal autoregressive buffer that preserves the advantages of both paradigms. Our approach decouples context encoding from updating the conditioning set. The model processes the context once and caches it. A dynamic buffer then captures target dependencies: as targets are incorporated, they enter the buffer and attend to both the cached context and previously buffered targets. This enables efficient batched autoregressive generation and one-pass joint log-likelihood evaluation. A unified training strategy allows seamless integration of set-based and autoregressive modes at minimal additional cost. Across synthetic functions, EEG signals, cognitive models, and tabular data, our method matches predictive accuracy of strong baselines while delivering up to 20 times faster joint sampling. Our approach combines the efficiency of autoregressive generative models with the representational power of set-based conditioning, making joint prediction practical for transformer-based probabilistic models.
Authors: Rogier P. Krijnen, Akshay Joshi, Siddhant Kumar, Mathias Peirlinck
Abstract: Fully capturing this behavior in traditional homogenized tissue testing requires the excitation of multiple deformation modes, i.e. combined triaxial shear tests and biaxial stretch tests. Inherently, such multimodal experimental protocols necessitate multiple tissue samples and extensive sample manipulations. Intrinsic inter-sample variability and manipulation-induced tissue damage might have an adverse effect on the inversely identified tissue behavior. In this work, we aim to overcome this gap by focusing our attention to the use of heterogeneous deformation profiles in a parameter estimation problem. More specifically, we adapt EUCLID, an unsupervised method for the automated discovery of constitutive models, towards the purpose of parameter identification for highly nonlinear, orthotropic constitutive models using a Bayesian inference approach and three-dimensional continuum elements. We showcase its strength to quantitatively infer, with varying noise levels, the material model parameters of synthetic myocardial tissue slabs from a single heterogeneous biaxial stretch test. This method shows good agreement with the ground-truth simulations and with corresponding credibility intervals. Our work highlights the potential for characterizing highly nonlinear and orthotropic material models from a single biaxial stretch test with uncertainty quantification.
Authors: Parhom Esmaeili, Virginia Fernandez, Pedro Borges, Eli Gibson, Sebastien Ourselin, M. Jorge Cardoso
Abstract: Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and built a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour/budgets differ from training, (iv) 2D methods perform well with slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.
Authors: Albert Belenguer-Llorens, Carlos Sevilla-Salcedo, Janaina Mourao-Miranda, Vanessa G\'omez-Verdejo
Abstract: Real-world clinical problems are often characterized by multimodal data, usually associated with incomplete views and limited sample sizes in their cohorts, posing significant limitations for machine learning algorithms. In this work, we propose a Bayesian approach designed to efficiently handle these challenges while providing interpretable solutions. Our approach integrates (1) a generative formulation to capture cross-view relationships with a semi-supervised strategy, and (2) a discriminative task-oriented formulation to identify relevant information for specific downstream objectives. This dual generative-discriminative formulation offers both general understanding and task-specific insights; thus, it provides an automatic imputation of the missing views while enabling robust inference across different data sources. The potential of this approach becomes evident when applied to the multimodal clinical data, where our algorithm is able to capture and disentangle the complex interactions among biological, psychological, and sociodemographic modalities.
Authors: So Won Jeong, Percy S. Zhai, Veronika Ro\v{c}ov\'a
Abstract: We propose a generative multivariate posterior sampler via flow matching. It offers a simple training objective, and does not require access to likelihood evaluation. The method learns a dynamic, block-triangular velocity field in the joint space of data and parameters, which results in a deterministic transport map from a source distribution to the desired posterior. The inverse map, named vector rank, is accessible by reversibly integrating the velocity over time. It is advantageous to leverage the dynamic design: proper constraints on the velocity yield a monotone map, which leads to a conditional Brenier map, enabling a fast and simultaneous generation of Bayesian credible sets whose contours correspond to level sets of Monge-Kantorovich data depth. Our approach is computationally lighter compared to GAN-based and diffusion-based counterparts, and is capable of capturing complex posterior structures. Finally, frequentist theoretical guarantee on the consistency of the recovered posterior distribution, and of the corresponding Bayesian credible sets, is provided.
Authors: Yuqian Huo, David Quiroga, Anastasios Kyrillidis, Tirthak Patel
Abstract: Variational quantum algorithms (VQAs) have the potential to demonstrate quantum utility on near-term quantum computers. However, these algorithms often get executed on the highest-fidelity qubits and computers to achieve the best performance, causing low system throughput. Recent efforts have shown that VQAs can be run on low-fidelity qubits initially and high-fidelity qubits later on to still achieve good performance. We take this effort forward and show that carefully varying the qubit fidelity map of the VQA over its execution using our technique, Nest, does not just (1) improve performance (i.e., help achieve close to optimal results), but also (2) lead to faster convergence. We also use Nest to co-locate multiple VQAs concurrently on the same computer, thus (3) increasing the system throughput, and therefore, balancing and optimizing three conflicting metrics simultaneously.
Authors: Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang
Abstract: Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as lack of exceptionally challenging problems, insufficient test case coverage, reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 32 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of top human contestant performance, who usually place above 90th. In contrast, among open-weight reasoning models, GPT-OSS-120B achieves only a 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results will be made publicly available on our website.
Authors: Sondos Mahmoud Bsharat, Zhiqiang Shen
Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. Then we finetune the various sizes of Qwen-2.5 models on P-TTS data. Across a suite of mathematical reasoning AIME2024 & 25, MATH500, and GPQA-Diamond, our P-TTS-7B and 32B models outperform the prior competitive baselines like S1 and S1.1 (1K-shot), achieving absolute accuracy gains of +26.66% and +30.00% on AIME'24 (7B), and +13.34% and +6.67% on AIME'25 (7B); P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24, and +26.63% and +3.33% on AIME'25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on out-of-domain reasoning benchmarks of Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead, and further unlocking the reasoning potential and capabilities of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.
Authors: Omar Besbes, Will Ma, Omar Mouchtaki
Abstract: In this work, we study how the relevance/quality and quantity of past data influence performance by analyzing a contextual Newsvendor problem, in which a decision-maker trades off between underage and overage costs under uncertain demand. We consider a setting in which past demands observed under ``close by'' contexts come from close by distributions and analyze the performance of data-driven algorithms through a notion of context-dependent worst-case expected regret. We analyze the broad class of Weighted Empirical Risk Minimization (WERM) policies which weigh past data according to their similarity in the contextual space. This class includes classical policies such as ERM, k-Nearest Neighbors and kernel-based policies. Our main methodological contribution is to characterize exactly the worst-case regret of any WERM policy on any given configuration of contexts. To the best of our knowledge, this provides the first understanding of tight performance guarantees in any contextual decision-making problem, with past literature focusing on upper bounds via concentration inequalities. We instead take an optimization approach, and isolate a structure in the Newsvendor loss function that allows to reduce the infinite-dimensional optimization problem over worst-case distributions to a simple line search. This in turn allows us to unveil fundamental insights that were obfuscated by previous general-purpose bounds. We characterize actual guaranteed performance as a function of the contexts, as well as granular insights on the learning curve of algorithms.
Authors: Debolina Halder Lina, Arlei Silva
Abstract: Graph Neural Networks (GNNs) have achieved state-of-the-art results in many relevant tasks where decisions might disproportionately impact specific communities. However, existing work on fair GNNs often assumes that either sensitive attributes are fully observed or they are missing completely at random. We show that an adversarial missingness process can inadvertently disguise a fair model through the imputation, leading the model to overestimate the fairness of its predictions. We address this challenge by proposing Better Fair than Sorry (BFtS), a fair missing data imputation model for sensitive attributes. The key principle behind BFtS is that imputations should approximate the worst-case scenario for fairness -- i.e. when optimizing fairness is the hardest. We implement this idea using a 3-player adversarial scheme where two adversaries collaborate against a GNN classifier, and the classifier minimizes the maximum bias. Experiments using synthetic and real datasets show that BFtS often achieves a better fairness x accuracy trade-off than existing alternatives under an adversarial missingness process.
Authors: Shiyuan Luo, Juntong Ni, Shengyu Chen, Runlong Yu, Yiqun Xie, Licheng Liu, Zhenong Jin, Huaxiu Yao, Xiaowei Jia
Abstract: Modeling environmental ecosystems is critical for the sustainability of our planet, but is extremely challenging due to the complex underlying processes driven by interactions amongst a large number of physical variables. As many variables are difficult to measure at large scales, existing works often utilize a combination of observable features and locally available measurements or modeled values as input to build models for a specific study region and time period. This raises a fundamental question in advancing the modeling of environmental ecosystems: how to build a general framework for modeling the complex relationships among diverse environmental variables over space and time? In this paper, we introduce a framework, FREE, that enables the use of varying features and available information to train a universal model. The core idea is to map available environmental data into a text space and then convert the traditional predictive modeling task in environmental science to a semantic recognition problem. Our evaluation on two societally important real-world applications, stream water temperature prediction and crop yield prediction, demonstrates the superiority of FREE over multiple baselines, even in data-sparse scenarios.
Authors: Bo Peng, Yadan Luo, Yonggang Zhang, Yixuan Li, Zhen Fang
Abstract: Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimate scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a \textsc{ConjNorm} method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed \textsc{ConjNorm} has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25$\%$ and 28.19$\%$ (FPR95) on CIFAR-100 and ImageNet-1K, respectively.
Authors: Zhengbao He, Tao Li, Xinwen Cheng, Zhehao Huang, Xiaolin Huang
Abstract: Machine unlearning (MU) aims to eliminate information that has been learned from specific training data, namely forgetting data, from a pre-trained model. Currently, the mainstream of existing MU methods involves modifying the forgetting data with incorrect labels and subsequently fine-tuning the model. While learning such incorrect information can indeed remove knowledge, the process is quite unnatural as the unlearning process undesirably reinforces the incorrect information and leads to over-forgetting. Towards more \textit{natural} machine unlearning, we inject correct information from the remaining data to the forgetting samples when changing their labels. Through pairing these adjusted samples with their labels, the model will tend to use the injected correct information and naturally suppress the information meant to be forgotten. Albeit straightforward, such a first step towards natural machine unlearning can significantly outperform current state-of-the-art approaches. In particular, our method substantially reduces the over-forgetting and leads to strong robustness to hyperparameters, making it a promising candidate for practical machine unlearning.
Authors: Zhaochun Ren, Zhou Yang, Chenglong Ye, Yufeng Wang, Haizhou Sun, Chao Chen, Xiaofei Zhu, Yunbing Wu, Xiangwen Liao
Abstract: In-context learning (ICL) achieves remarkable performance in various domains such as knowledge acquisition, commonsense reasoning, and semantic understanding. However, its performance significantly deteriorates for emotion detection tasks, especially fine-grained emotion recognition. The underlying reasons for this remain unclear. In this paper, we identify the reasons behind ICL's poor performance from the perspective of prototype theory and propose a method to address this issue. Specifically, we conduct extensive pilot experiments and find that ICL conforms to the prototype theory on fine-grained emotion recognition. Based on this theory, we uncover the following deficiencies in ICL: (1) It relies on prototypes (example-label pairs) that are semantically similar but emotionally inaccurate to predict emotions. (2) It is prone to interference from irrelevant categories, affecting the accuracy and robustness of the predictions. To address these issues, we propose an Emotion Context Learning method (E-ICL) on fine-grained emotion recognition. E-ICL relies on more emotionally accurate prototypes to predict categories by referring to emotionally similar examples with dynamic labels. Simultaneously, E-ICL employs an exclusionary emotion prediction strategy to avoid interference from irrelevant categories, thereby increasing its accuracy and robustness. Note that the entire process is accomplished with the assistance of a plug-and-play emotion auxiliary model, without additional training. Experiments on the fine-grained emotion datasets EDOS, Empathetic-Dialogues, EmpatheticIntent, and GoEmotions show that E-ICL achieves superior emotion prediction performance. Furthermore, even when the emotion auxiliary model used is lower than 10% of the LLMs, E-ICL can still boost the performance of LLMs by over 4% on multiple datasets.
Authors: Siyuan Li, Rongchang Zuo, Bofei Liu, Yaoyu He, Peng Liu, Yingnan Zhao
Abstract: Unmanned Combat Aerial Vehicle (UCAV) Within-Visual-Range (WVR) engagement, referring to a fight between two or more UCAVs at close quarters, plays a decisive role on the aerial battlefields. With the development of artificial intelligence, WVR engagement progressively advances towards intelligent and autonomous modes. However, autonomous WVR engagement policy learning is hindered by challenges such as weak exploration capabilities, low learning efficiency, and unrealistic simulated environments. To overcome these challenges, we propose a novel imitative reinforcement learning framework, which efficiently leverages expert data while enabling autonomous exploration. The proposed framework not only enhances learning efficiency through expert imitation, but also ensures adaptability to dynamic environments via autonomous exploration with reinforcement learning. Therefore, the proposed framework can learn a successful policy of `pursuit-lock-launch' for UCAVs. To support data-driven learning, we establish an environment based on the Harfang3D sandbox. The extensive experiment results indicate that the proposed framework excels in this multistage task, and significantly outperforms state-of-the-art reinforcement learning and imitation learning methods. Thanks to the ability of imitating experts and autonomous exploration, our framework can quickly learn the critical knowledge in complex aerial combat tasks, achieving up to a 100% success rate and demonstrating excellent robustness.
Authors: Ayush Jain, Norio Kosaka, Xinhu Li, Kyung-Min Kim, Erdem B{\i}y{\i}k, Joseph J. Lim
Abstract: In reinforcement learning, off-policy actor-critic methods like DDPG and TD3 use deterministic policy gradients: the Q-function is learned from environment data, while the actor maximizes it via gradient ascent. We observe that in complex tasks such as dexterous manipulation and restricted locomotion with mobility constraints, the Q-function exhibits many local optima, making gradient ascent prone to getting stuck. To address this, we introduce SAVO, an actor architecture that (i) generates multiple action proposals and selects the one with the highest Q-value, and (ii) approximates the Q-function repeatedly by truncating poor local optima to guide gradient ascent more effectively. We evaluate tasks such as restricted locomotion, dexterous manipulation, and large discrete-action space recommender systems and show that our actor finds optimal actions more frequently and outperforms alternate actor architectures.
Authors: Kaiyan Zhao, Tsuguchika Tabaru, Kenichi Kobayashi, Takumi Honda, Masafumi Yamazaki, Yoshimasa Tsuruoka
Abstract: Although recent quantized Large Language Models (LLMs), such as BitNet, have paved the way for significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands substantial memory footprints. This is partly because high-precision (i.e., unquantized) weights required for straight-through estimation must be maintained throughout the whole training process. To address this, we explore directly updating the quantized low-precision weights without relying on straight-through estimation during backpropagation, aiming to save memory usage during training. Specifically, we employ a stochastic rounding technique to minimize the information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models of various sizes indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values; (2) extending the bit width to 8 bits achieves performance on par with BitNet b1.58; (3) our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments (BF16/FP8); and (4) our models also support inference using ternary weights, showcasing their flexibility in deployment.
Authors: Kamaljyoti Nath, Varun Kumar, Daniel J. Smith, George Em Karniadakis
Abstract: Improving diesel engine efficiency, reducing emissions, and enabling robust health monitoring have been critical research topics in engine modelling. While recent advancements in the use of neural networks for system monitoring have shown promising results, such methods often focus on component-level analysis, lack generalizability, and physical interpretability. In this study, we propose a novel hybrid framework that combines physics-informed neural networks (PINNs) with deep operator networks (DeepONet) to enable accurate and computationally efficient parameter identification in mean-value diesel engine models. Our method leverages physics-based system knowledge in combination with data-driven training of neural networks to enhance model applicability. Incorporating offline-trained DeepONets to predict actuator dynamics significantly lowers the online computation cost when compared to the existing PINN framework. To address the re-training burden typical of PINNs under varying input conditions, we propose two transfer learning (TL) strategies: (i) a multi-stage TL scheme offering better runtime efficiency than full online training of the PINN model and (ii) a few-shot TL scheme that freezes a shared multi-head network body and computes physics-based derivatives required for model training outside the training loop. The second strategy offers a computationally inexpensive and physics-based approach for predicting engine dynamics and parameter identification, offering computational efficiency over the existing PINN framework. Compared to existing health monitoring methods, our framework combines the interpretability of physics-based models with the flexibility of deep learning, offering substantial gains in generalization, accuracy, and deployment efficiency for diesel engine diagnostics.
Authors: Dipam Goswami, Simone Magistri, Kai Wang, Bart{\l}omiej Twardowski, Andrew D. Bagdanov, Joost van de Weijer
Abstract: Using pre-trained models has been found to reduce the effect of data heterogeneity and speed up federated learning algorithms. Recent works have explored training-free methods using first- and second-order statistics to aggregate local client data distributions at the server and achieve high performance without any training. In this work, we propose a training-free method based on an unbiased estimator of class covariance matrices which only uses first-order statistics in the form of class means communicated by clients to the server. We show how these estimated class covariances can be used to initialize the global classifier, thus exploiting the covariances without actually sharing them. We also show that using only within-class covariances results in a better classifier initialization. Our approach improves performance in the range of 4-26% with exactly the same communication cost when compared to methods sharing only class means and achieves performance competitive or superior to methods sharing second-order statistics with dramatically less communication overhead. The proposed method is much more communication-efficient than federated prompt-tuning methods and still outperforms them. Finally, using our method to initialize classifiers and then performing federated fine-tuning or linear probing again yields better performance. Code is available at https://github.com/dipamgoswami/FedCOF.
Authors: Yuchen Lin, Yong Zhang, Sihan Feng, Hong Zhao
Abstract: Advancements in artificial intelligence call for a deeper understanding of the fundamental mechanisms underlying deep learning. In this work, we propose a theoretical framework to analyze learning dynamics through the lens of dynamical systems theory. We redefine the notions of linearity and nonlinearity in neural networks by introducing two fundamental transformation units at the neuron level: order-preserving transformations and non-order-preserving transformations. Different transformation modes lead to distinct collective behaviors in weight vector organization, different modes of information extraction, and the emergence of qualitatively different learning phases. Transitions between these phases may occur during training, accounting for key phenomena such as grokking. To further characterize generalization and structural stability, we introduce the concept of attraction basins in both sample and weight spaces. The distribution of neurons with different transformation modes across layers, along with the structural characteristics of the two types of attraction basins, forms a set of core metrics for analyzing the performance of learning models. Hyperparameters such as depth, width, learning rate, and batch size act as control variables for fine-tuning these metrics. Our framework not only sheds light on the intrinsic advantages of deep learning, but also provides a novel perspective for optimizing network architectures and training strategies.
Authors: Makoto Otsuka, Kento Kodama, Keisuke Morita, Masayuki Ohzeki
Abstract: This study proposes an approach for removing mislabeled instances from contaminated training datasets by combining surrogate model-based black-box optimization (BBO) with postprocessing and quantum annealing. Mislabeled training instances, a common issue in real-world datasets, often degrade model generalization, necessitating robust and efficient noise-removal strategies. The proposed method evaluates filtered training subsets based on validation loss, iteratively refines loss estimates through surrogate model-based BBO with postprocessing, and leverages quantum annealing to efficiently sample diverse training subsets with low validation error. Experiments on a noisy majority bit task demonstrate the method's ability to prioritize the removal of high-risk mislabeled instances. Integrating D-Wave's clique sampler running on a physical quantum annealer achieves faster optimization and higher-quality training subsets compared to OpenJij's simulated quantum annealing sampler or Neal's simulated annealing sampler, offering a scalable framework for enhancing dataset quality. This work highlights the effectiveness of the proposed method for supervised learning tasks, with future directions including its application to unsupervised learning, real-world datasets, and large-scale implementations.
Authors: Valentyn Melnychuk, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel
Abstract: End-to-end representation learning has become a powerful tool for estimating causal quantities from high-dimensional observational data, but its efficiency remained unclear. Here, we face a central tension: End-to-end representation learning methods often work well in practice but lack asymptotic optimality in the form of the quasi-oracle efficiency. In contrast, two-stage Neyman-orthogonal learners provide such a theoretical optimality property but do not explicitly benefit from the strengths of representation learning. In this work, we step back and ask two research questions: (1) When do representations strengthen existing Neyman-orthogonal learners? and (2) Can a balancing constraint - commonly proposed technique in the representation learning literature - provide improvements to Neyman-orthogonality? We address these two questions through our theoretical and empirical analysis, where we introduce a unifying framework that connects representation learning with Neyman-orthogonal learners (namely, OR-learners). In particular, we show that, under the low-dimensional manifold hypothesis, the OR-learners can strictly improve the estimation error of the standard Neyman-orthogonal learners. At the same time, we find that the balancing constraint requires an additional inductive bias and cannot generally compensate for the lack of Neyman-orthogonality of the end-to-end approaches. Building on these insights, we offer guidelines for how users can effectively combine representation learning with the classical Neyman-orthogonal learners to achieve both practical performance and theoretical guarantees.
Authors: Filip Ekstr\"om Kelvinius, Zheng Zhao, Fredrik Lindsten
Abstract: A recent line of research has exploited pre-trained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on "decoupled diffusion", where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic as well as protein and image data. Further, we demonstrate how the approach can be extended to discrete data.
Authors: Thong Pham, Takashi Nicholas Maeda, Shohei Shimizu
Abstract: Causal additive models provide a tractable yet expressive framework for causal discovery in the presence of hidden variables. However, when unobserved backdoor or causal paths exist between two variables, their causal relationship is often unidentifiable under existing theories. We establish sufficient conditions under which causal directions can be identified in many such cases. In particular, we derive conditions that enable identification of the parent-child relationship in a bow, an adjacent pair of observed variables sharing a hidden common parent. This represents a notoriously difficult case in causal discovery, and, to our knowledge, no prior work has established such identifiability in any causal model without imposing assumptions on the hidden variables. Our conditions rely on new characterizations of regression sets and a hybrid approach that combines independence among regression residuals with conditional independencies among observed variables. We further provide a sound and complete algorithm that incorporates these insights, and empirical evaluations demonstrate competitive performance with state-of-the-art methods.
Authors: Harsh Poonia, Felix Divo, Kristian Kersting, Devendra Singh Dhami
Abstract: Causality in time series can be challenging to determine, especially in the presence of non-linear dependencies. Granger causality helps analyze potential relationships between variables, thereby offering a method to determine whether one time series can predict-Granger cause-future values of another. Although successful, Granger causal methods still struggle with capturing long-range relations between variables. To this end, we leverage the recently successful Extended Long Short-Term Memory (xLSTM) architecture and propose Granger causal xLSTMs (GC-xLSTM). It first enforces sparsity between the time series components by using a novel dynamic loss penalty on the initial projection. Specifically, we adaptively improve the model and identify sparsity candidates. Our joint optimization procedure then ensures that the Granger causal relations are recovered robustly. Our experimental evaluation on six diverse datasets demonstrates the overall efficacy of GC-xLSTM.
Authors: Yijun Pan, Taiwei Shi, Jieyu Zhao, Jiaqi W. Ma
Abstract: Large language models (LLMs) are highly sensitive to even small amounts of unsafe training data, making effective detection and filtering essential for trustworthy model development. Current state-of-the-art (SOTA) detection approaches primarily rely on moderation classifiers, which require significant computation overhead for training and are limited to predefined taxonomies. In this work, we explore data attribution approaches that measure the similarity between individual training samples and a small set of unsafe target examples, based on data representations such as hidden states or gradients. We identify a key limitation in existing methods: unsafe target texts contain both critical tokens that make them unsafe and neutral tokens (e.g., stop words or benign facts) that are necessary to form fluent language, and the latter of which makes the overall representations ``noisy'' for the purpose of detecting unsafe training data. To address this challenge, we propose Denoised Representation Attribution (DRA), a novel representation-based data attribution approach that denoises training and target representations for unsafe data detection. Across tasks of filtering jailbreaks and detecting gender bias, the proposed approach leads to significant improvement for data attribution methods, outperforming SOTA methods that are mostly based on moderation classifiers.
Authors: Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, Tuo Zhao
Abstract: Large Language Models (LLMs) have demonstrated remarkable success across various domains, yet their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit. While adaptive optimizers such as AdamW are widely used, they suffer from critical limitations, including an inability to capture interdependencies between coordinates and high memory consumption. Subsequent research, exemplified by SOAP, attempts to better capture coordinate interdependence but incurs greater memory overhead, limiting scalability for massive LLMs. An alternative approach aims to reduce memory consumption through low-dimensional projection, but this leads to substantial approximation errors, resulting in less effective optimization (e.g., in terms of per-token efficiency). In this paper, we propose COSMOS, a novel hybrid optimizer that leverages the varying importance of eigensubspaces in the gradient matrix to achieve memory efficiency without compromising optimization performance. The design of COSMOS is motivated by our empirical insights and practical considerations. Specifically, COSMOS applies SOAP to the leading eigensubspace, which captures the primary optimization dynamics, and MUON to the remaining eigensubspace, which is less critical but computationally expensive to handle with SOAP. This hybrid strategy significantly reduces memory consumption while maintaining robust optimization performance, making it particularly suitable for massive LLMs. Numerical experiments on various datasets and transformer architectures are provided to demonstrate the effectiveness of COSMOS. Our code is available at https://github.com/lliu606/COSMOS.
Authors: Zhengdao Chen
Abstract: Score-based diffusion models have achieved remarkable progress in various domains with the ability to generate new data samples that do not exist in the training set. In this work, we study the hypothesis that such creativity arises from an interpolation effect caused by a smoothing of the empirical score function. Focusing on settings where the training set lies uniformly in a one-dimensional subspace, we show theoretically how regularized two-layer ReLU neural networks tend to learn approximately a smoothed version of the empirical score function, and further probe the interplay between score smoothing and the denoising dynamics with analytical solutions and numerical experiments. In particular, we demonstrate how a smoothed score function can lead to the generation of samples that interpolate the training data along their subspace while avoiding full memorization. Moreover, we present experimental evidence that learning score functions with neural networks indeed induces a score smoothing effect, including in simple nonlinear settings and without explicit regularization.
Authors: Archie Licudi, Anshul Thakur, Soheila Molaei, Danielle Belgrave, David Clifton
Abstract: Asynchronous federated learning (FL) with heterogeneous clients faces two key issues: curvature-induced loss barriers encountered by standard linear parameter interpolation techniques (e.g. FedAvg) and interference from stale updates misaligned with the server's current optimisation state. To alleviate these issues, we introduce a geometric framework that casts aggregation as curve learning in a Riemannian model space and decouples trajectory selection from update conflict resolution. Within this, we propose AsyncBezier, which replaces linear aggregation with low-degree polynomial (Bezier) trajectories to bypass loss barriers, and OrthoDC, which projects delayed updates via inner product-based orthogonality to reduce interference. We establish framework-level convergence guarantees covering each variant given simple assumptions on their components. On three datasets spanning general-purpose and healthcare domains, including LEAF Shakespeare and FEMNIST, our approach consistently improves accuracy and client fairness over strong asynchronous baselines; finally, we show that these gains are preserved even when other methods are allocated a higher local compute budget.
Authors: Xinling Yu, Ziyue Liu, Hai Li, Yixing Li, Xin Ai, Zhiyu Zeng, Ian Young, Zheng Zhang
Abstract: Thermal analysis is crucial in 3D-IC design due to increased power density and complex heat dissipation paths. Although operator learning frameworks such as DeepOHeat~\cite{liu2023deepoheat} have demonstrated promising preliminary results in accelerating thermal simulation, they face critical limitations in prediction capability for multi-scale thermal patterns, training efficiency, and trustworthiness of results during design optimization. This paper presents DeepOHeat-v1, an enhanced physics-informed operator learning framework that addresses these challenges through three key innovations. First, we integrate Kolmogorov-Arnold Networks with learnable activation functions as trunk networks, enabling an adaptive representation of multi-scale thermal patterns. This approach achieves a 1.25x and 6.29x reduction in error in two representative test cases. Second, we introduce a separable training method that decomposes the basis function along the coordinate axes, achieving 62x training speedup and 31x GPU memory reduction in our baseline case, and enabling thermal analysis at resolutions previously infeasible due to GPU memory constraints. Third, we propose a confidence score to evaluate the trustworthiness of the predicted results, and further develop a hybrid optimization workflow that combines operator learning with finite difference (FD) using Generalized Minimal Residual (GMRES) method for incremental solution refinement, enabling efficient and trustworthy thermal optimization. Experimental results demonstrate that DeepOHeat-v1 achieves accuracy comparable to optimization using high-fidelity finite difference solvers, while speeding up the entire optimization process by $70.6\times$ in our test cases, effectively minimizing the peak temperature through optimal placement of heat-generating components. Open source code is available at https://github.com/xlyu0127/DeepOHeat-v1.
Authors: Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, Han Zhao
Abstract: Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. Our project page is at \href{https://yifei-he.github.io/mergebench/}{https://yifei-he.github.io/mergebench/}.
URLs: https://yifei-he.github.io/mergebench/, https://yifei-he.github.io/mergebench/
Authors: Simone Azeglio, Arianna Di Bernardo
Abstract: Recent advances in diffusion models have demonstrated their remarkable ability to capture complex image distributions, but the geometric properties of the learned data manifold remain poorly understood. We address this gap by introducing a score-based Riemannian metric that leverages the Stein score function from diffusion models to characterize the intrinsic geometry of the data manifold without requiring explicit parameterization. Our approach defines a metric tensor in the ambient space that stretches distances perpendicular to the manifold while preserving them along tangential directions, effectively creating a geometry where geodesics naturally follow the manifold's contours. We develop efficient algorithms for computing these geodesics and demonstrate their utility for both interpolation between data points and extrapolation beyond the observed data distribution. Through experiments on synthetic data with known geometry, Rotated MNIST, and complex natural images via Stable Diffusion, we show that our score-based geodesics capture meaningful transformations that respect the underlying data distribution. Our method consistently outperforms baseline approaches on perceptual metrics (LPIPS) and distribution-level metrics (FID, KID), producing smoother, more realistic image transitions. These results reveal the implicit geometric structure learned by diffusion models and provide a principled way to navigate the manifold of natural images through the lens of Riemannian geometry.
Authors: Alex Heyman, Joel Zylberberg
Abstract: Large language models have recently made great strides in reasoning task performance through chain-of-thought (CoT) strategies trained via reinforcement learning; however, these "reasoning large language models" (RLLMs) remain imperfect reasoners, and understanding the frequencies and causes of their failure modes is important for both users and developers. We test o1-mini, o3-mini, DeepSeek-R1, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, and Grok 3 Mini Beta on graph coloring as a variable-complexity constraint-satisfaction logic problem, and find evidence from both error rate comparisons and CoT/explanation text analysis that RLLMs are prone to hallucinate graph edges not specified in the prompt. This phenomenon persists across multiple problem complexity levels and semantic frames, and it appears to account for a significant fraction of the incorrect answers from every tested model, and the vast majority of them for some models. We also validate the generalizability of this input-conflicting hallucination phenomenon with smaller-scale experiments on a type of stable matching problem. Our results indicate that RLLMs may possess broader issues with misrepresentation of problem specifics, and we offer suggestions for design choices to mitigate this weakness.
Authors: Xinyi Shang, Peng Sun, Fengyuan Liu, Tao Lin
Abstract: This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying three key limitations in existing model-centric approaches, all rooted in a shared bottleneck: knowledge extracted from data is locked to model parameters, hindering its reusability and scalability. To this end, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization, thereby effectively encoding knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency, achieving 13.6% and 6.8% improvements on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of $1.94 \times $ and $1.2 \times$.
Authors: Benjamin Herdeanu, Juan Nathaniel, Carla Roesch, Jatan Buch, Gregor Ramien, Johannes Haux, Pierre Gentine
Abstract: Causal discovery for dynamical systems poses a major challenge in fields where active interventions are infeasible. Most methods used to investigate these systems and their associated benchmarks are tailored to deterministic, low-dimensional and weakly nonlinear time-series data. To address these limitations, we present CausalDynamics, a large-scale benchmark and extensible data generation framework to advance the structural discovery of dynamical causal models. Our benchmark consists of true causal graphs derived from thousands of both linearly and nonlinearly coupled ordinary and stochastic differential equations as well as two idealized climate models. We perform a comprehensive evaluation of state-of-the-art causal discovery algorithms for graph reconstruction on systems with noisy, confounded, and lagged dynamics. CausalDynamics consists of a plug-and-play, build-your-own coupling workflow that enables the construction of a hierarchy of physical systems. We anticipate that our framework will facilitate the development of robust causal discovery algorithms that are broadly applicable across domains while addressing their unique challenges. We provide a user-friendly implementation and documentation on https://kausable.github.io/CausalDynamics.
Authors: Hany Abdulsamad, Sahel Iqbal, Simo S\"arkk\"a
Abstract: Optimal decision-making under partial observability requires agents to balance reducing uncertainty (exploration) against pursuing immediate objectives (exploitation). In this paper, we introduce a novel policy optimization framework for continuous partially observable Markov decision processes (POMDPs) that explicitly addresses this challenge. Our method casts policy learning as probabilistic inference in a non-Markovian Feynman--Kac model that inherently captures the value of information gathering by anticipating future observations, without requiring suboptimal approximations or handcrafted heuristics. To optimize policies under this model, we develop a nested sequential Monte Carlo (SMC) algorithm that efficiently estimates a history-dependent policy gradient under samples from the optimal trajectory distribution induced by the POMDP. We demonstrate the effectiveness of our algorithm across standard continuous POMDP benchmarks, where existing methods struggle to act under uncertainty.
Authors: Arash Afkanpour, Omkar Dige, Fatemeh Tavakoli, Negin Baghbanzadeh, Farnaz Kohankhaki, Elham Dolatabadi
Abstract: Current evaluation frameworks for foundation models rely heavily on static, manually curated benchmarks, limiting their ability to capture the full breadth of model capabilities. This paper introduces Active learning for Capability Evaluation (ACE), a novel framework for scalable, automated, and fine-grained evaluation of foundation models. ACE leverages the knowledge embedded in powerful frontier models to decompose a domain into semantically meaningful capabilities and generates diverse evaluation tasks, significantly reducing human effort. In Mathematics, ACE generated 433 capabilities and 11,800 tasks, covering 94% of Wikipedia-defined skills in the domain while introducing novel, coherent ones. To maximize efficiency, ACE fits a capability model in latent semantic space, allowing reliable approximation of a subject model's performance by evaluating only a subset of capabilities via active learning. It reaches within 0.01 RMSE of exhaustive evaluation by evaluating less than half of capabilities. Compared to static datasets, ACE provides more balanced coverage and uncovers fine-grained differences that aggregate metrics fail to capture. Our results demonstrate that ACE provides a more complete and informative picture of model capabilities, which is essential for safe and well-informed deployment of foundation models.
Authors: Justin Deschenaux, Lan Tran, Caglar Gulcehre
Abstract: Masked generative models (MGMs) are widely used to capture complex data and enable faster generation than autoregressive models (AR) through parallel decoding. However, MGMs typically operate on fixed-length inputs, which can be inefficient: early in sampling, most tokens are masked and carry no information, leading to wasted computation. In contrast, AR models process only tokens generated previously, making early iterations faster. In this work, we introduce the Partition Generative Model (PGM), a novel approach that combines the strengths of AR and MGMs. Rather than masking, PGM partitions tokens into two groups and employs sparse attention to block information flow between them. Since there is no information flow between partitions, the model can process the previously-generated tokens only during sampling, while retaining the ability to generate tokens in parallel and in any order. On OpenWebText, PGMs offer at least $5\times$ improvements in sampling latency and throughput, while producing samples with superior Generative Perplexity, compared to Masked Diffusion Language Models. On ImageNet, PGMs achieve a $7.5\times$ higher throughput than MaskGIT, with only a slight increase in FID (5.54 vs. 5.35). With twice as many sampling steps, the FID reduces to 4.56 while while being $3.9\times$ faster than MaskGIT. Finally, PGMs integrate seamlessly with MGM distillation, providing further inference speedups.
Authors: Maresa Schr\"oder, Justin Hartenstein, Stefan Feuerriegel
Abstract: The average treatment effect (ATE) is widely used to evaluate the effectiveness of drugs and other medical interventions. In safety-critical applications like medicine, reliable inferences about the ATE typically require valid uncertainty quantification, such as through confidence intervals (CIs). However, estimating treatment effects in these settings often involves sensitive data that must be kept private. In this work, we present PrivATE, a novel machine learning framework for computing CIs for the ATE under differential privacy. Specifically, we focus on deriving valid privacy-preserving CIs for the ATE from observational data. Our PrivATE framework consists of three steps: (i) estimating the differentially private ATE through output perturbation; (ii) estimating the differentially private variance in a doubly robust manner; and (iii) constructing the CIs while accounting for the uncertainty from both the estimation and privatization steps. Our PrivATE framework is model agnostic, doubly robust, and ensures valid CIs. We demonstrate the effectiveness of our framework using synthetic and real-world medical datasets. To the best of our knowledge, we are the first to derive a general, doubly robust framework for valid CIs of the ATE under ($\varepsilon,\delta$)-differential privacy.
Authors: Sabina J. Sloman, Michele Caprio, Samuel Kaski
Abstract: Uncertainty-aware machine learners, such as Bayesian neural networks, output a quantification of uncertainty instead of a point prediction. In this work, we provide uncertainty-aware learners with a principled framework to characterize, and identify ways to eliminate, errors that arise from reducible (epistemic) uncertainty. We introduce a principled definition of epistemic error, and provide a decompositional epistemic error bound which operates in the very general setting of imperfect multitask learning under distribution shift. In this setting, the training (source) data may arise from multiple tasks, the test (target) data may differ systematically from the source data tasks, and/or the learner may not arrive at an accurate characterization of the source data. Our bound separately attributes epistemic errors to each of multiple aspects of the learning procedure and environment. As corollaries of the general result, we provide epistemic error bounds specialized to the settings of Bayesian transfer learning and distribution shift within $\epsilon$-neighborhoods. We additionally leverage the terms in our bound to provide a novel definition of negative transfer.
Authors: Sahil Rajesh Dhayalkar
Abstract: We present a formal and constructive simulation framework for nondeterministic finite automata (NFAs) using time-shared, depth-unrolled feedforward networks (TS-FFNs), i.e., acyclic unrolled computations with shared parameters that are functionally equivalent to unrolled recurrent or state-space models. Unlike prior approaches that rely on explicit recurrent architectures or post hoc extraction methods, our formulation symbolically encodes automaton states as binary vectors, transitions as sparse matrix transformations, and nondeterministic branching-including $\varepsilon$-closures-as compositions of shared thresholded updates. We prove that every regular language can be recognized exactly by such a shared-parameter unrolled feedforward network, with parameter count independent of input length. Our construction yields a constructive equivalence between NFAs and neural networks and demonstrates \emph{empirical learnability}: these networks can be trained via gradient descent on supervised acceptance data to recover the target automaton behavior. This learnability, formalized in Proposition 5.1, is the crux of this work. Extensive experiments validate the theoretical results, achieving perfect or near-perfect agreement on acceptance, state propagation, and closure dynamics. This work clarifies the correspondence between automata theory and modern neural architectures, showing that unrolled feedforward networks can perform precise, interpretable, and trainable symbolic computation.
Authors: Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
Abstract: Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been done towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we cover this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. In multimodal retrieval tasks, LEAF improves the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization. We open-source our code ( https://github.com/LIONS-EPFL/LEAF ) and models ( https://huggingface.co/LEAF-CLIP ).
URLs: https://github.com/LIONS-EPFL/LEAF, https://huggingface.co/LEAF-CLIP
Authors: Stephen Smith, Cormac Purcell, Zdenka Kuncic
Abstract: Earth observations from low Earth orbit satellites provide vital information for decision makers to better manage time-sensitive events such as natural disasters. For the data to be most effective for first responders, low latency is required between data capture and its arrival to decision makers. A major bottleneck is in the bandwidth-limited downlinking of the data from satellites to ground stations. One approach to overcome this challenge is to process at least some of the data on-board and prioritise pertinent data to be downlinked. In this work we propose a Physics Aware Neuromorphic Network (PANN) to detect changes caused by natural disasters from a sequence of multi-spectral satellite images and produce a change map, enabling relevant data to be prioritised for downlinking. The PANN used in this study is motivated by physical neural networks comprised of nano-electronic circuit elements known as "memristors" (nonlinear resistors with memory). The weights in the network are dynamic and update in response to varying input signals according to memristor equations of state and electrical circuit conservation laws. The PANN thus generates physics-constrained dynamical output features which are used to detect changes in a natural disaster detection task by applying a distance-based metric. Importantly, this makes the whole model training-free, allowing it to be implemented with minimal computing resources. The PANN was benchmarked against a state-of-the-art AI model and achieved comparable or better results in each natural disaster category. It thus presents a promising solution to the challenge of resource-constrained on-board processing.
Authors: Shijian Xu, Marcello Massimo Negri, Volker Roth
Abstract: Deep learning has achieved remarkable success across many domains, but it has also created a growing demand for interpretability in model predictions. Although many explainable machine learning methods have been proposed, post-hoc explanations lack guaranteed fidelity and are sensitive to hyperparameter choices, highlighting the appeal of inherently interpretable models. For example, linear regression provides clear feature effects through its coefficients. However, such models are often outperformed by more complex neural networks (NNs) that usually lack inherent interpretability. To address this dilemma, we introduce NIMO, a framework that combines inherent interpretability with the expressive power of neural networks. Building on the simple linear regression, NIMO is able to provide flexible and intelligible feature effects. Relevantly, we develop an optimization method based on parameter elimination, that allows for optimizing the NN parameters and linear coefficients effectively and efficiently. By relying on adaptive ridge regression we can easily incorporate sparsity as well. We show empirically that our model can provide faithful and intelligible feature effects while maintaining good predictive performance.
Authors: Aneesh Komanduri, Karuna Bhaila, Xintao Wu
Abstract: Large language models (LLMs) have shown remarkable ability in various language tasks, especially with their emergent in-context learning capability. Extending LLMs to incorporate visual inputs, large vision-language models (LVLMs) have shown impressive performance in tasks such as recognition and visual question answering (VQA). Despite increasing interest in the utility of LLMs in causal reasoning tasks such as causal discovery and counterfactual reasoning, there has been relatively little work showcasing the abilities of LVLMs on visual causal reasoning tasks. We take this opportunity to formally introduce a comprehensive causal reasoning benchmark for multi-modal in-context learning from LVLMs. Our CausalVLBench encompasses three representative tasks: causal structure inference, intervention target prediction, and counterfactual prediction. We evaluate the ability of state-of-the-art open-source LVLMs on our causal reasoning tasks across three causal representation learning datasets and demonstrate their fundamental strengths and weaknesses. We hope that our benchmark elucidates the drawbacks of existing vision-language models and motivates new directions and paradigms in improving the visual causal reasoning abilities of LVLMs.
Authors: Bo Zhao, Robin Walters, Rose Yu
Abstract: Modern deep learning models are highly overparameterized, resulting in large sets of parameter configurations that yield the same outputs. A significant portion of this redundancy is explained by symmetries in the parameter space--transformations that leave the network function unchanged. These symmetries shape the loss landscape and constrain learning dynamics, offering a new lens for understanding optimization, generalization, and model complexity that complements existing theory of deep learning. This survey provides an overview of parameter space symmetry. We summarize existing literature, uncover connections between symmetry and learning theory, and identify gaps and opportunities in this emerging field.
Authors: Keyue Jiang, Jiahao Cui, Xiaowen Dong, Laura Toni
Abstract: Graph generation has emerged as a critical task in fields ranging from drug discovery to circuit design. Contemporary approaches, notably diffusion and flow-based models, have achieved solid graph generative performance through constructing a probability path that interpolates between reference and data distributions. However, these methods typically model the evolution of individual nodes and edges independently and use linear interpolations to build the path. This disentangled interpolation breaks the interconnected patterns of graphs, making the constructed probability path irregular and non-smooth, which causes poor training dynamics and faulty sampling convergence. To address the limitation, this paper first presents a theoretically grounded framework for probability path construction in graph generative models. Specifically, we model the joint evolution of the nodes and edges by representing graphs as connected systems parameterized by Markov random fields (MRF). We then leverage the optimal transport displacement between MRF objects to design a smooth probability path that ensures the co-evolution of graph components. Based on this, we introduce BWFlow, a flow-matching framework for graph generation that utilizes the derived optimal probability path to benefit the training and sampling algorithm design. Experimental evaluations in plain graph generation and molecule generation validate the effectiveness of BWFlow with competitive performance, better training convergence, and efficient sampling.
Authors: Zhongzheng Qiao, Chenghao Liu, Yiming Zhang, Ming Jin, Quang Pham, Qingsong Wen, P. N. Suganthan, Xudong Jiang, Savitha Ramasamy
Abstract: Time series foundation models (TSFMs) demonstrate impressive zero-shot performance for time series forecasting. However, an important yet underexplored challenge is how to effectively finetune TSFMs on specific downstream tasks. While naive finetuning can yield performance gains, we argue that it falls short of fully leveraging TSFMs' capabilities, often resulting in overfitting and suboptimal performance. Given the diverse temporal patterns across sampling scales and the inherent multi-scale forecasting capabilities of TSFMs, we adopt a causal perspective to analyze finetuning process, through which we highlight the critical importance of explicitly modeling multiple scales and reveal the shortcomings of naive approaches. Focusing on encoder-based TSFMs, we propose Multiscale finetuning (MSFT), a simple yet general framework that explicitly integrates multi-scale modeling into the finetuning process. Experimental results on three different backbones (Moirai, Moment and Units) demonstrate that TSFMs finetuned with MSFT not only outperform naive and typical parameter efficient finetuning methods but also surpass state-of-the-art deep learning methods. Codes are available at https://github.com/zqiao11/MSFT.
Authors: Marco Jiralerspong, Esther Derman, Danilo Vucetic, Nikolay Malkin, Bilun Sun, Tianyu Zhang, Pierre-Luc Bacon, Gauthier Gidel
Abstract: A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer a novel, robust RL perspective of this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.
Authors: Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yuzhi Zhang, Yue Wang
Abstract: Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO), improves efficiency but suffers from limited exploration and training instability, limiting its effectiveness on complex reasoning tasks. To address these challenges, we introduce EFRame, an Exploration-Filter-Replay framework that augments GRPO across three dimensions: additional rollouts enable deeper and more targeted exploration, online filtering removes low-quality samples to stabilize gradients and accelerate training, and experience replay amplifies rare yet informative trajectories for stable convergence. This unified framework establishes a principled training cycle that balances exploration, efficiency, and stability. Experiments on diverse reasoning benchmarks demonstrate that EFRame achieves consistent gains, including a 37.9\% relative improvement on Geometry3K over GRPO. EFRame further supports fine-grained sample categorization and precise entropy control, highlighting it as a robust solution for advancing deeper reasoning in LLMs. Our code is available at https://github.com/597358816/EFRame.
Authors: Chang Qi, Matteo Forasassi, Thomas Lukasiewicz, Tommaso Salvatori
Abstract: Predictive coding networks are neural models that perform inference through an iterative energy minimization process, whose operations are local in space and time. While effective in shallow architectures, they suffer significant performance degradation beyond five to seven layers. In this work, we show that this degradation is caused by exponentially imbalanced errors between layers during weight updates, and by predictions from the previous layers not being effective in guiding updates in deeper layers. Furthermore, when training models with skip connections, the energy propagated by the residuals reaches higher layers faster than that propagated by the main pathway, affecting test accuracy. We address the first issue by introducing a novel precision-weighted optimization of latent variables that balances error distributions during the relaxation phase, the second issue by proposing a novel weight update mechanism that reduces error accumulation in deeper layers, and the third one by using auxiliary neurons that slow down the propagation of the energy in the residual connections. Empirically, our methods achieve performance comparable to backpropagation on deep models such as ResNets, opening new possibilities for predictive coding in complex tasks.
Authors: Ismail Labiad, Mathurin Videau, Matthieu Kowalski, Marc Schoenauer, Alessandro Leite, Julia Kempe, Olivier Teytaud
Abstract: Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, exposing gradients during training can leak sensitive information about the underlying data, raising privacy and security concerns such as susceptibility to data poisoning attacks. In contrast, black box optimization methods, which treat the model as an opaque function, relying solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of information flow, we provide non-vacuous generalization bounds and strong theoretical guarantees for differential privacy, robustness to data poisoning attacks, and extraction attacks. In experiments with LLMs, we demonstrate empirically that black-box optimization methods-despite the scalability and computational challenges inherent to black-box approaches-are able to learn, showing how a few iterations of BBoxER improve performance, generalize well on a benchmark of reasoning datasets, and are robust to membership inference attacks. This positions BBoxER as an attractive add-on on top of gradient-based optimization, offering suitability for deployment in restricted or privacy-sensitive environments while also providing non-vacuous generalization guarantees.
Authors: Janghoon Ock, Radheesh Sharma Meda, Srivathsan Badrinarayanan, Neha S. Aluru, Achuth Chandrasekhar, Amir Barati Farimani
Abstract: We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, domain-specific question answering, molecular generation, property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. In a case study targeting BCL-2 in lymphocytic leukemia, the agent autonomously retrieved relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answered mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generated chemically diverse seed molecules and predicted 67 ADMET-related properties, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55. The number of molecules satisfying empirical drug-likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.
Authors: Shengjian Chen
Abstract: We point out that neural networks are not black boxes, and their generalization stems from the ability to dynamically map a dataset to the extrema of the model function. We further prove that the number of extrema in a neural network is positively correlated with the number of its parameters. We then propose a new algorithm that is significantly different from back-propagation algorithm, which mainly obtains the values of parameters by solving a system of linear equations. Some difficult situations, such as gradient vanishing and overfitting, can be simply explained and dealt with in this framework.
Authors: Shan Shen, Shenglu Hua, Jiajun Zou, Jiawei Liu, Jianwang Zhai, Chuan Shi, Wenjian Yu
Abstract: Graph representation learning on Analog-Mixed Signal (AMS) circuits is crucial for various downstream tasks, e.g., parasitic estimation. However, the scarcity of design data, the unbalanced distribution of labels, and the inherent diversity of circuit implementations pose significant challenges to learning robust and transferable circuit representations. To address these limitations, we propose CircuitGCL, a novel graph contrastive learning framework that integrates representation scattering and label rebalancing to enhance transferability across heterogeneous circuit graphs. CircuitGCL employs a self-supervised strategy to learn topology-invariant node embeddings through hyperspherical representation scattering, eliminating dependency on large-scale data. Simultaneously, balanced mean squared error (BMSE) and balanced softmax cross-entropy (BSCE) losses are introduced to mitigate label distribution disparities between circuits, enabling robust and transferable parasitic estimation. Evaluated on parasitic capacitance estimation (edge-level task) and ground capacitance classification (node-level task) across TSMC 28nm AMS designs, CircuitGCL outperforms all state-of-the-art (SOTA) methods, with the $R^2$ improvement of $33.64\% \sim 44.20\%$ for edge regression and F1-score gain of $0.9\times \sim 2.1\times$ for node classification. Our code is available at https://github.com/ShenShan123/CircuitGCL.
Authors: Michael Kearns, Aaron Roth, Emily Ryu
Abstract: We study a distributed learning problem in which learning agents are embedded in a directed acyclic graph (DAG). There is a fixed and arbitrary distribution over feature/label pairs, and each agent or vertex in the graph is able to directly observe only a subset of the features -- potentially a different subset for every agent. The agents learn sequentially in some order consistent with a topological sort of the DAG, committing to a model mapping observations to predictions of the real-valued label. Each agent observes the predictions of their parents in the DAG, and trains their model using both the features of the instance that they directly observe, and the predictions of their parents as additional features. We ask when this process is sufficient to achieve \emph{information aggregation}, in the sense that some agent in the DAG is able to learn a model whose error is competitive with the best model that could have been learned (in some hypothesis class) with direct access to \emph{all} features, despite the fact that no single agent in the network has such access. We give upper and lower bounds for this problem for both linear and general hypothesis classes. Our results identify the \emph{depth} of the DAG as the key parameter: information aggregation can occur over sufficiently long paths in the DAG, assuming that all of the relevant features are well represented along the path, and there are distributions over which information aggregation cannot occur even in the linear case, and even in arbitrarily large DAGs that do not have sufficient depth (such as a hub-and-spokes topology in which the spoke vertices collectively see all the features). We complement our theoretical results with a comprehensive set of experiments.
Authors: Florian E. Dorner, Yatong Chen, Andr\'e F. Cruz, Fanny Yang
Abstract: Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance -- a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and LLama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
Authors: Yufei He, Ruoyu Li, Alex Chen, Yue Liu, Yulin Chen, Yuan Sui, Cheng Chen, Yi Zhu, Luca Luo, Frank Yang, Bryan Hooi
Abstract: Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change, such as regulatory compliance and user risk screening. Current approaches, like offline fine-tuning and standard prompting, are insufficient because they cannot effectively adapt to new knowledge during actual operation. To address this limitation, we propose the Adaptive Reflective Interactive Agent (ARIA), an LLM agent framework designed specifically to continuously learn updated domain knowledge at test time. ARIA assesses its own uncertainty through structured self-dialogue, proactively identifying knowledge gaps and requesting targeted explanations or corrections from human experts. It then systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries. We evaluate ARIA on the realistic customer due diligence name screening task on TikTok Pay, alongside publicly available dynamic knowledge tasks. Results demonstrate significant improvements in adaptability and accuracy compared to baselines using standard offline fine-tuning and existing self-improving agents. ARIA is deployed within TikTok Pay serving over 150 million monthly active users, confirming its practicality and effectiveness for operational use in rapidly evolving environments.
Authors: Tolga Dimlioglu, Anna Choromanska
Abstract: We study centralized distributed data parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of the local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple, yet effective, sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local gradient methods and synchronous gradient averaging, while maintaining communication efficiency. In addition, our loss landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between push and pull strengths, and that its pull-push dynamics is self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
Authors: Yuan-Cheng Yu, Yen-Chieh Ouyang, Chun-An Lin
Abstract: Time-series anomaly detection plays a central role across a wide range of application domains. With the increasing proliferation of the Internet of Things (IoT) and smart manufacturing, time-series data has dramatically increased in both scale and dimensionality. This growth has exposed the limitations of traditional statistical methods in handling the high heterogeneity and complexity of such data. Inspired by the recent success of large language models (LLMs) in multimodal tasks across language and vision domains, we propose a novel unsupervised anomaly detection framework: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection (TriP-LLM). TriP-LLM integrates local and global temporal features through a triple-branch design comprising Patching, Selecting, and Global modules, to encode the input time-series into patch-wise representations, which are then processed by a frozen, pretrained LLM. A lightweight patch-wise decoder reconstructs the input, from which anomaly scores are derived. We evaluate TriP-LLM on several public benchmark datasets using PATE, a recently proposed threshold-free evaluation metric, and conduct all comparisons within a unified open-source framework to ensure fairness. Experimental results show that TriP-LLM consistently outperforms recent state-of-the-art (SOTA) methods across all datasets, demonstrating strong detection capabilities. Furthermore, through extensive ablation studies, we verify the substantial contribution of the LLM to the overall architecture. Compared to LLM-based approaches using Channel Independence (CI) patch processing, TriP-LLM achieves significantly lower memory consumption, making it more suitable for GPU memory-constrained environments. All code and model checkpoints of TriP-LLM are publicly available on https://github.com/YYZStart/TriP-LLM.git
Authors: Sahil Sethi, David Chen, Michael C. Burkhart, Nipun Bhandari, Bashar Ramadan, Brett Beaulieu-Jones
Abstract: Prototype-based neural networks offer interpretable predictions by comparing inputs to learned, representative signal patterns anchored in training data. While such models have shown promise in the classification of physiological data, it remains unclear whether their prototypes capture an underlying structure that aligns with broader clinical phenotypes. We use a prototype-based deep learning model trained for multi-label ECG classification using the PTB-XL dataset. Then without modification we performed inference on the MIMIC-IV clinical database. We assess whether individual prototypes, trained solely for classification, are associated with hospital discharge diagnoses in the form of phecodes in this external population. Individual prototypes demonstrate significantly stronger and more specific associations with clinical outcomes compared to the classifier's class predictions, NLP-extracted concepts, or broader prototype classes across all phecode categories. Prototype classes with mixed significance patterns exhibit significantly greater intra-class distances (p $<$ 0.0001), indicating the model learned to differentiate clinically meaningful variations within diagnostic categories. The prototypes achieve strong predictive performance across diverse conditions, with AUCs ranging from 0.89 for atrial fibrillation to 0.91 for heart failure, while also showing substantial signal for non-cardiac conditions such as sepsis and renal disease. These findings suggest that prototype-based models can support interpretable digital phenotyping from physiologic time-series data, providing transferable intermediate phenotypes that capture clinically meaningful physiologic signatures beyond their original training objectives.
Authors: Shiyan Liu, Bohan Tan, Zhiguang Cao, Yan Jin
Abstract: Graph routing problems play a vital role in web-related networks, where finding optimal paths across graphs is essential for efficient data transmission and content delivery. Classic routing formulations such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) represent fundamental graph optimization challenges. Recent data-driven optimization methods have made significant progress, yet they often face limitations in training efficiency and generalization to large-scale instances. In this paper, we propose a novel Vision-augmented Asymmetric Group Preference Optimization (VAGPO) approach. By leveraging ResNet-based visual encoding and Transformer-based sequential modeling, VAGPO captures both spatial structure and temporal dependencies. Furthermore, we introduce an asymmetric group preference optimization strategy that significantly accelerates convergence compared to commonly used policy gradient methods. Experimental results on generated TSP and CVRP instances, as well as real-world datasets, demonstrate that the proposed VAGPO approach achieves highly competitive solution quality. Additionally, VAGPO exhibits strong generalization to larger instances (up to 1000 nodes) without re-training, highlighting its effectiveness in both learning efficiency and scalability.
Authors: Adam Darmanin, Vince Vella
Abstract: Algorithmic trading requires short-term tactical decisions consistent with long-term financial objectives. Reinforcement Learning (RL) has been applied to such problems, but adoption is limited by myopic behaviour and opaque policies. Large Language Models (LLMs) offer complementary strategic reasoning and multi-modal signal interpretation when guided by well-structured prompts. This paper proposes a hybrid framework in which LLMs generate high-level trading strategies to guide RL agents. We evaluate (i) the economic rationale of LLM-generated strategies through expert review, and (ii) the performance of LLM-guided agents against unguided RL baselines using Sharpe Ratio (SR) and Maximum Drawdown (MDD). Empirical results indicate that LLM guidance improves both return and risk metrics relative to standard RL.
Authors: Yuichi Kondo, Hideaki Iiduka
Abstract: We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules by introducing a novel and simpler Lyapunov function. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning: a constant batch size with a decaying learning rate, an increasing batch size with a decaying learning rate, and an increasing batch size with an increasing learning rate. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms its fixed-hyperparameter counterpart in convergence speed. We also evaluated a warm-up schedule in experiments, which empirically outperformed all other strategies in convergence behavior.
Authors: Bingguang Hao, Zengzhuang Xu, Maolin Wang, Yuntao Wen, Yicheng Chen, Cunyin Peng, Long Chen, Dong Wang, Xiangyu Zhao, Jinjie Gu, Chenyi Zhuang, Ji Zhang
Abstract: The effective training of Large Language Models (LLMs) for function calling faces a critical challenge: balancing exploration of complex reasoning paths with stable policy optimization. Standard methods like Supervised Fine-Tuning (SFT) fail to instill robust reasoning, and traditional Reinforcement Learning (RL) struggles with inefficient exploration. We propose \textbf{EGPO}, a new RL framework built upon Group Relative Policy Optimization (GRPO), designed to address this challenge directly. The core of EGPO is an entropy-enhanced advantage function that integrates the entropy of the model's Chain-of-Thought (CoT) into the policy gradient computation. This encourages the generation of diverse reasoning strategies. To maintain optimization direction, the entropy bonus is carefully constrained by a clipping mechanism. Complemented by a strict, binary reward signal, EGPO effectively guides the model towards discovering structured and accurate tool invocation patterns. On the challenging Berkeley Function Calling Leaderboard (BFCL), a 4B-parameter model trained with EGPO sets a new state-of-the-art among models of comparable size, surpassing a range of strong competitors, including GPT-4o and Gemini-2.5.
Authors: Lixuan He, Jie Feng, Yong Li
Abstract: Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of \textbf{implicit rewards}, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce \textbf{Adaptive Meta Fine-Tuning (AMFT)}, a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a \textbf{meta-gradient adaptive weight controller} that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrats superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamic analysis confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our codes are open-sourced via https://github.com/hlxtsyj/AMFT.
Authors: Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
Abstract: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from the expert, which promotes on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on mathematical reasoning problems and practical tool-use tasks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
URLs: https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord
Authors: Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Giuseppe Alessio D'Inverno, Fabrizio Silvestri, Emanuele Rodol\`a
Abstract: Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
Authors: Jelke Wibbeke, Nico Sch\"onfisch, Sebastian Rohjans, Andreas Rauh
Abstract: In safety-critical applications data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.
Authors: Destiny Okpekpe, Antonio Orvieto
Abstract: Despite the advantageous subquadratic complexity of modern recurrent deep learning models -- such as state-space models (SSMs) -- recent studies have highlighted their potential shortcomings compared to transformers on reasoning and memorization tasks. In this paper, we dive deeper into one of such benchmarks: associative recall (AR), which has been shown to correlate well with language modeling performance, and inspect in detail the effects of scaling and optimization issues in recently proposed token mixing strategies. We first demonstrate that, unlike standard transformers, the choice of learning rate plays a critical role in the performance of modern recurrent models: an issue that can severely affect reported performance in previous works and suggests further research is needed to stabilize training. Next, we show that recurrent and attention-based models exhibit contrasting benefits when scaling in width as opposed to depth, with attention being notably unable to solve AR when limited to a single layer. We then further inspect 1-layer transformers, revealing that despite their poor performance, their training dynamics surprisingly resemble the formation of induction heads, a phenomenon previously observed only in their 2-layer counterparts. Finally, through architectural ablations, we study how components affects Transformer and Mamba's performance and optimization stability.
Authors: Mazyar Taghavi, Rahman Farnoosh
Abstract: Protecting endangered wildlife from illegal poaching presents a critical challenge, particularly in vast and partially observable environments where real-time response is essential. This paper introduces a novel Expectation-Maximization (EM) based latent variable modeling approach in the context of Multi-Agent Reinforcement Learning (MARL) for Unmanned Aerial Vehicle (UAV) coordination in wildlife protection. By modeling hidden environmental factors and inter-agent dynamics through latent variables, our method enhances exploration and coordination under uncertainty.We implement and evaluate our EM-MARL framework using a custom simulation involving 10 UAVs tasked with patrolling protected habitats of the endangered Iranian leopard. Extensive experimental results demonstrate superior performance in detection accuracy, adaptability, and policy convergence when compared to standard algorithms such as Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG). Our findings underscore the potential of combining EM inference with MARL to improve decentralized decisionmaking in complex, high-stakes conservation scenarios. The full implementation, simulation environment, and training scripts are publicly available on GitHub.
Authors: Allan Salihovic, Payam Abdisarabshali, Michael Langberg, Seyyedali Hosseinalipour
Abstract: We provide our perspective on X-Learning (XL), a novel distributed learning architecture that generalizes and extends the concept of decentralization. Our goal is to present a vision for XL, introducing its unexplored design considerations and degrees of freedom. To this end, we shed light on the intuitive yet non-trivial connections between XL, graph theory, and Markov chains. We also present a series of open research directions to stimulate further research.
Authors: Rafael Bischof, Michal Piovar\v{c}i, Michael A. Kraus, Siddhartha Mishra, Bernd Bickel
Abstract: We present HyPINO, a multi-physics neural operator designed for zero-shot generalization across a broad class of parametric PDEs without requiring task-specific fine-tuning. Our approach combines a Swin Transformer-based hypernetwork with mixed supervision: (i) labeled data from analytical solutions generated via the Method of Manufactured Solutions (MMS), and (ii) unlabeled samples optimized using physics-informed objectives. The model maps PDE parametrizations to target Physics-Informed Neural Networks (PINNs) and can handle linear elliptic, hyperbolic, and parabolic equations in two dimensions with varying source terms, geometries, and mixed Dirichlet/Neumann boundary conditions, including interior boundaries. HyPINO achieves strong zero-shot accuracy on seven benchmark problems from PINN literature, outperforming U-Nets, Poseidon, and Physics-Informed Neural Operators (PINO). Further, we introduce an iterative refinement procedure that compares the physics of the generated PINN to the requested PDE and uses the discrepancy to generate a "delta" PINN. Summing their contributions and repeating this process forms an ensemble whose combined solution progressively reduces the error on six benchmarks and achieves over 100x gain in average $L_2$ loss in the best case, while retaining forward-only inference. Additionally, we evaluate the fine-tuning behavior of PINNs initialized by HyPINO and show that they converge faster and to lower final error than both randomly initialized and Reptile-meta-learned PINNs on five benchmarks, performing on par on the remaining two. Our results highlight the potential of this scalable approach as a foundation for extending neural operators toward solving increasingly complex, nonlinear, and high-dimensional PDE problems. The code and model weights are publicly available at https://github.com/rbischof/hypino.
Authors: Lucas Correia, Jan-Christoph Goos, Thomas B\"ack, Anna V. Kononova
Abstract: Truly unsupervised approaches for time series anomaly detection are rare in the literature. Those that exist suffer from a poorly set threshold, which hampers detection performance, while others, despite claiming to be unsupervised, need to be calibrated using a labelled data subset, which is often not available in the real world. This work integrates active learning with an existing unsupervised anomaly detection method by selectively querying the labels of multivariate time series, which are then used to refine the threshold selection process. To achieve this, we introduce a novel query strategy called the dissimilarity-based query strategy (DQS). DQS aims to maximise the diversity of queried samples by evaluating the similarity between anomaly scores using dynamic time warping. We assess the detection performance of DQS in comparison to other query strategies and explore the impact of mislabelling, a topic that is underexplored in the literature. Our findings indicate that DQS performs best in small-budget scenarios, though the others appear to be more robust when faced with mislabelling. Therefore, in the real world, the choice of query strategy depends on the expertise of the oracle and the number of samples they are willing to label. Regardless, all query strategies outperform the unsupervised threshold even in the presence of mislabelling. Thus, whenever it is feasible to query an oracle, employing an active learning-based threshold is recommended.
Authors: Lei Wang, Jieming Bian, Letian Zhang, Jie Xu
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various tasks, but fine-tuning them for domain-specific applications often requires substantial domain-specific data that may be distributed across multiple organizations. Federated Learning (FL) offers a privacy-preserving solution, but faces challenges with computational constraints when applied to LLMs. Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient fine-tuning approach, though a single LoRA module often struggles with heterogeneous data across diverse domains. This paper addresses two critical challenges in federated LoRA fine-tuning: 1. determining the optimal number and allocation of LoRA experts across heterogeneous clients, and 2. enabling clients to selectively utilize these experts based on their specific data characteristics. We propose FedLEASE (Federated adaptive LoRA Expert Allocation and SElection), a novel framework that adaptively clusters clients based on representation similarity to allocate and train domain-specific LoRA experts. It also introduces an adaptive top-$M$ Mixture-of-Experts mechanism that allows each client to select the optimal number of utilized experts. Our extensive experiments on diverse benchmark datasets demonstrate that FedLEASE significantly outperforms existing federated fine-tuning approaches in heterogeneous client settings while maintaining communication efficiency.
Authors: M. Giselle Fern\'andez-Godino, Meir H. Shachar, Kevin Korner, Jonathan L. Belof, Mukul Kumar, Jonathan Lind, William J. Schill
Abstract: The ability to predict how shock waves traverse porous and architected materials is a key challenge in planetary defense and in the pursuit of inertial fusion energy. Yet capturing pore collapse, anomalous Hugoniot responses, and localized heating - phenomena that strongly influence asteroid deflection or fusion ignition - has remained a major challenge despite recent advances in single-field and reduced representations. We introduce a multi-field spatio-temporal model (MSTM) that unifies seven coupled fields - pressure, density, temperature, energy, material distribution, and two velocity components - into a single autoregressive surrogate. Trained on high-fidelity hydrocode data, MSTM captures nonlinear shock-driven dynamics across porous and architected configurations, achieving mean errors of 1.4% and 3.2% respectively, all while delivering over three orders of magnitude in speedup. MSTM reduces mean-squared error and structural dissimilarity by 94% relative torelative to single-field spatio-temporal models. This advance transforms problems once considered intractable into tractable design studies, establishing a practical framework for optimizing meso-structured materials in planetary impact mitigation and inertial fusion energy.
Authors: Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang
Abstract: The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of LLM training process and gives top priorities to detecting and recovering failures in a routine manner. Leveraging parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
Authors: Shuaike Shen, Jiaqing Xie, Zhuo Yang, Antong Zhang, Shuzhou Sun, Ben Gao, Tianfan Fu, Biqing Qi, Yuqiang Li
Abstract: Recent advances in molecular foundation models have shown impressive performance in molecular property prediction and de novo molecular design, with promising applications in areas such as drug discovery and reaction prediction. Nevertheless, most existing approaches rely exclusively on SMILES representations and overlook both experimental spectra and 3D structural information-two indispensable sources for capturing molecular behavior in real-world scenarios. This limitation reduces their effectiveness in tasks where stereochemistry, spatial conformation, and experimental validation are critical. To overcome these challenges, we propose MolSpectLLM, a molecular foundation model pretrained on Qwen2.5-7B that unifies experimental spectroscopy with molecular 3D structure. By explicitly modeling molecular spectra, MolSpectLLM achieves state-of-the-art performance on spectrum-related tasks, with an average accuracy of 0.53 across NMR, IR, and MS benchmarks. MolSpectLLM also shows strong performance on the spectra analysis task, obtaining 15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES, substantially outperforming large general-purpose LLMs. More importantly, MolSpectLLM not only achieves strong performance on molecular elucidation tasks, but also generates accurate 3D molecular structures directly from SMILES or spectral inputs, bridging spectral analysis, molecular elucidation, and molecular design. Code are available at \href{https://github.com/Eurekashen/MolSpectLLM}{https://github.com/Eurekashen/MolSpectLLM}.
URLs: https://github.com/Eurekashen/MolSpectLLM, https://github.com/Eurekashen/MolSpectLLM
Authors: Yang Chen, Menglin Zou, Jiaqi Zhang, Yitan Zhang, Junyi Yang, Gael Gendron, Libo Zhang, Jiamou Liu, Michael J. Witbrock
Abstract: Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often lead to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery, policy imitation with high sample efficiency on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.
Authors: Oussama Kharouiche, Aris Markogiannakis, Xiao Fei, Michail Chatzianastasis, Michalis Vazirgiannis
Abstract: Single-cell RNA sequencing has transformed biology by enabling the measurement of gene expression at cellular resolution, providing information for cell types, states, and disease contexts. Recently, single-cell foundation models have emerged as powerful tools for learning transferable representations directly from expression profiles, improving performance on classification and clustering tasks. However, these models are limited to discrete prediction heads, which collapse cellular complexity into predefined labels that fail to capture the richer, contextual explanations biologists need. We introduce Cell2Text, a multimodal generative framework that translates scRNA-seq profiles into structured natural language descriptions. By integrating gene-level embeddings from single-cell foundation models with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity, generalizing to unseen cells. Empirically, Cell2Text outperforms baselines on classification accuracy, demonstrates strong ontological consistency using PageRank-based similarity metrics, and achieves high semantic fidelity in text generation. These results demonstrate that coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, pointing to a scalable path for label-efficient characterization of unseen cells.
Authors: Mohammadreza Bakhtyari, Bogdan Mazoure, Renato Cordeiro de Amorim, Guillaume Rabusseau, Vladimir Makarenkov
Abstract: We introduce ClustRecNet - a novel deep learning (DL)-based recommendation framework for determining the most suitable clustering algorithms for a given dataset, addressing the long-standing challenge of clustering algorithm selection in unsupervised learning. To enable supervised learning in this context, we construct a comprehensive data repository comprising 34,000 synthetic datasets with diverse structural properties. Each of them was processed using 10 popular clustering algorithms. The resulting clusterings were assessed via the Adjusted Rand Index (ARI) to establish ground truth labels, used for training and evaluation of our DL model. The proposed network architecture integrates convolutional, residual, and attention mechanisms to capture both local and global structural patterns from the input data. This design supports end-to-end training to learn compact representations of datasets and enables direct recommendation of the most suitable clustering algorithm, reducing reliance on handcrafted meta-features and traditional Cluster Validity Indices (CVIs). Comprehensive experiments across synthetic and real-world benchmarks demonstrate that our DL model consistently outperforms conventional CVIs (e.g. Silhouette, Calinski-Harabasz, Davies-Bouldin, and Dunn) as well as state-of-the-art AutoML clustering recommendation approaches (e.g. ML2DAC, AutoCluster, and AutoML4Clust). Notably, the proposed model achieves a 0.497 ARI improvement over the Calinski-Harabasz index on synthetic data and a 15.3% ARI gain over the best-performing AutoML approach on real-world data.
Authors: Ayush Patnaik, Jackson Fogelquist, Adam B Zufall, Stephen K Robinson, Xinfan Lin
Abstract: Lithium plating during fast charging is a critical degradation mechanism that accelerates capacity fade and can trigger catastrophic safety failures. Recent work has identified a distinctive dQ/dV peak above 4.0 V as a reliable signature of plating onset; however, conventional methods for computing dQ/dV rely on finite differencing with filtering, which amplifies sensor noise and introduces bias in peak location. In this paper, we propose a Gaussian Process (GP) framework for lithium plating detection by directly modeling the charge-voltage relationship Q(V) as a stochastic process with calibrated uncertainty. Leveraging the property that derivatives of GPs remain GPs, we infer dQ/dV analytically and probabilistically from the posterior, enabling robust detection without ad hoc smoothing. The framework provides three key benefits: (i) noise-aware inference with hyperparameters learned from data, (ii) closed-form derivatives with credible intervals for uncertainty quantification, and (iii) scalability to online variants suitable for embedded BMS. Experimental validation on Li-ion coin cells across a range of C-rates (0.2C-1C) and temperatures (0-40\deg C) demonstrates that the GP-based method reliably detects plating peaks under low-temperature, high-rate charging, while correctly reporting no peaks in baseline cases. The concurrence of GP-identified differential peaks, reduced charge throughput, and capacity fade measured via reference performance tests confirms the method's accuracy and robustness, establishing a practical pathway for real-time lithium plating detection.
Authors: Diptyaroop Maji, Kang Yang, Prashant Shenoy, Ramesh K Sitaraman, Mani Srivastava
Abstract: Computational decarbonization aims to reduce carbon emissions in computing and societal systems such as data centers, transportation, and built environments. This requires accurate, fine-grained carbon intensity forecasts, yet existing tools have several key limitations: (i) they require grid-specific electricity mix data, restricting use where such information is unavailable; (ii) they depend on separate grid-specific models that make it challenging to provide global coverage; and (iii) they provide forecasts without uncertainty estimates, limiting reliability for downstream carbon-aware applications. In this paper, we present CarbonX, an open-source tool that leverages Time Series Foundation Models (TSFMs) for a range of decarbonization tasks. CarbonX utilizes the versatility of TSFMs to provide strong performance across multiple tasks, such as carbon intensity forecasting and imputation, and across diverse grids. Using only historical carbon intensity data and a single general model, our tool achieves a zero-shot forecasting Mean Absolute Percentage Error (MAPE) of 15.82% across 214 grids worldwide. Across 13 benchmark grids, CarbonX performance is comparable with the current state-of-the-art, with an average MAPE of 9.59% and tail forecasting MAPE of 16.54%, while also providing prediction intervals with 95% coverage. CarbonX can provide forecasts for up to 21 days with minimal accuracy degradation. Further, when fully fine-tuned, CarbonX outperforms the statistical baselines by 1.2--3.9X on the imputation task. Overall, these results demonstrate that CarbonX can be used easily on any grid with limited data and still deliver strong performance, making it a practical tool for global-scale decarbonization.
Authors: Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai
Abstract: The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G$^2$RPO significantly outperforms existing flow-based GRPO baselines,highlighting its effectiveness and robustness.
Authors: Jacob Feitelberg, Dwaipayan Saha, Kyuseong Choi, Zaid Ahmad, Anish Agarwal, Raaz Dwivedi
Abstract: Missing data is a pervasive problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks. However, due to huge variance in performance across real-world domains and time-consuming hyperparameter tuning, no default imputation method exists. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations requiring no fitting or hyperparameter tuning at inference-time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, which enables a $100\times$ speedup over the previous TabPFN imputation method, (ii) a synthetic training data generation pipeline incorporating realistic missingness patterns, which boosts test-time performance, and (iii) MissBench, a comprehensive benchmark for evaluation of imputation methods with $42$ OpenML datasets and $13$ missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute's robust performance compared to $11$ established imputation methods.
Authors: Dong-Dong Wu, Jiacheng Cui, Wei Wang, Zhiqiang Shen, Masashi Sugiyama
Abstract: Conditional diffusion models have achieved remarkable success in various generative tasks recently, but their training typically relies on large-scale datasets that inevitably contain imprecise information in conditional inputs. Such supervision, often stemming from noisy, ambiguous, or incomplete labels, will cause condition mismatch and degrade generation quality. To address this challenge, we propose DMIS, a unified framework for training robust Diffusion Models from Imprecise Supervision, which is the first systematic study within diffusion models. Our framework is derived from likelihood maximization and decomposes the objective into generative and classification components: the generative component models imprecise-label distributions, while the classification component leverages a diffusion classifier to infer class-posterior probabilities, with its efficiency further improved by an optimized timestep sampling strategy. Extensive experiments on diverse forms of imprecise supervision, covering tasks of image generation, weakly supervised learning, and noisy dataset condensation demonstrate that DMIS consistently produces high-quality and class-discriminative samples.
Authors: Yiheng Tao, Yihe Zhang, Matthew T. Dearing, Xin Wang, Yuping Fan, Zhiling Lan
Abstract: Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
Authors: Dongqi Zheng, Wenjin Fu
Abstract: We introduce Constraint-Aware Federated Learning with Lagrangian Dual Optimization (CAFL-L), a principled extension of FedAvg that explicitly incorporates device-level resource constraints including energy, communication, memory, and thermal budgets. CAFL-L employs Lagrangian dual optimization to dynamically adapt training hyperparameters -- freezing depth, local steps, batch size, and communication compression -- while preserving training stability through token-budget preservation via gradient accumulation. Experiments on a character-level language model demonstrate that CAFL-L achieves superior constraint satisfaction compared to standard FedAvg (reducing memory usage by 20% and communication by 95%) while maintaining competitive validation performance, making it practical for deployment on resource-constrained edge devices.
Authors: Iryna Stanishevska
Abstract: Thunderstorm-driven outages are difficult to predict because most storms do not cause damage, convective processes occur rapidly and chaotically, and the available public data are both noisy and incomplete. We develop a 24-48 h early-warning model for summer, thunderstorm-related outages in Michigan using only open sources (EAGLE-I for ground truth; METAR for weather). We use the publicly released EAGLE-I outage dataset (2014-2022), maintained by Oak Ridge National Laboratory for the U.S. Department of Energy. The pipeline preserves convective micro-signals from a sparse station network via parameter-specific kriging with hourly variograms and targeted overdrafting to retain extremes, and builds causal spatio-temporal features (lags/rolling statistics; k-NN/IDW spatial aggregates) capturing precursors of severe convection (moisture advection, wind shifts, and pressure drops). The two-stage model design, combining a logistic gate and an LSTM regressor, limits routine periods and reduces noise exposure. The study uses event-centric metrics (cluster-based hits/misses/false alarms) and peak-conditional MASE (cMASE) in +/-Delta-hour windows around state-level peaks (>= 50,000), with uncertainty quantified by hourly moving-block bootstrap. On the test sample, Two-Stage detects more reference peaks across all windows (e.g., at +/-48 h it records 3/4 vs. 2/4; F1 66.7% vs. 57.1%) with one extra false alarm. Near peaks, it shows modest amplitude gains (2-3% lower cMASE at +/-0-12 h; bootstrap medians +9-13% at +/-6-12 h) but small losses at +/-36-48 h (~3-4%). Overall, errors are comparable to the one-step LSTM baseline. SHAP analysis confirms moisture-advection and wind/gust precursors, underscoring the value of the feature engineering. Despite open-data noise, the feature-driven pipeline yields actionable, event-focused early warnings for thunderstorm outages.
Authors: Hao Wu, Yuan Gao, Xingjian Shi, Shuaipeng Li, Fan Xu, Fan Zhang, Zhihong Zhu, Weiyan Wang, Xiao Luo, Kun Wang, Xian Wu, Xiaomeng Huang
Abstract: To address the dual challenges of inherent stochasticity and non-differentiable metrics in physical spatiotemporal forecasting, we propose Spatiotemporal Forecasting as Planning (SFP), a new paradigm grounded in Model-Based Reinforcement Learning. SFP constructs a novel Generative World Model to simulate diverse, high-fidelity future states, enabling an "imagination-based" environmental simulation. Within this framework, a base forecasting model acts as an agent, guided by a beam search-based planning algorithm that leverages non-differentiable domain metrics as reward signals to explore high-return future sequences. These identified high-reward candidates then serve as pseudo-labels to continuously optimize the agent's policy through iterative self-training, significantly reducing prediction error and demonstrating exceptional performance on critical domain metrics like capturing extreme events.
Authors: Haiquan Qiu, Quanming Yao
Abstract: The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosion. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem. Code is available at https://github.com/ucker/why-low-precision-training-fails.
URLs: https://github.com/ucker/why-low-precision-training-fails.
Authors: Chenlu Ding, Jiancan Wu, Leheng Sheng, Fan Zhang, Yancheng Yuan, Xiang Wang, Xiangnan He
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across vision-language tasks, yet their large-scale deployment raises pressing concerns about memorized private data, outdated knowledge, and harmful content. Existing unlearning approaches for MLLMs typically adapt training-based strategies such as gradient ascent or preference optimization, but these methods are computationally expensive, irreversible, and often distort retained knowledge. In this work, we propose MLLMEraser, an input-aware, training-free framework for test-time unlearning. Our approach leverages activation steering to enable dynamic knowledge erasure without parameter updates. Specifically, we construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs with knowledge-erasure counterparts, capturing both textual and visual discrepancies. To prevent unnecessary interference, we further design an input-aware steering mechanism that adaptively determines when and how the erasure direction should be applied, preserving utility on retained knowledge while enforcing forgetting on designated content. Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines, achieving stronger forgetting performance with lower computational cost and minimal utility degradation.
Authors: Jinhui Bai, Andreas Christmann, Lei Shi
Abstract: In this paper, we propose a novel kernel stochastic gradient descent (SGD) algorithm for large-scale supervised learning with general losses. Compared to traditional kernel SGD, our algorithm improves efficiency and scalability through an innovative regularization strategy. By leveraging the infinite series expansion of spherical radial basis functions, this strategy projects the stochastic gradient onto a finite-dimensional hypothesis space, which is adaptively scaled according to the bias-variance trade-off, thereby enhancing generalization performance. Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses. We prove that both the last iterate and the suffix average converge at minimax-optimal rates, and we further establish optimal strong convergence in the reproducing kernel Hilbert space. Our framework accommodates a broad class of classical loss functions, including least-squares, Huber, and logistic losses. Moreover, the proposed algorithm significantly reduces computational complexity and achieves optimal storage complexity by incorporating coordinate-wise updates from linear SGD, thereby avoiding the costly pairwise operations typical of kernel SGD and enabling efficient processing of streaming data. Finally, extensive numerical experiments demonstrate the efficiency of our approach.
Authors: Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee
Abstract: Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose $\textit{RegCache}$, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
Authors: Alina Ermilova, Dmitrii Kornilov, Sofia Samoilova, Ekaterina Laptenkova, Anastasia Kolesnikova, Ekaterina Podplutova, Senotrusova Sofya, Maksim G. Sharaev
Abstract: Identifying disease interconnections through manual analysis of large-scale clinical data is labor-intensive, subjective, and prone to expert disagreement. While machine learning (ML) shows promise, three critical challenges remain: (1) selecting optimal methods from the vast ML landscape, (2) determining whether real-world clinical data (e.g., electronic health records, EHRs) or structured disease descriptions yield more reliable insights, (3) the lack of "ground truth," as some disease interconnections remain unexplored in medicine. Large language models (LLMs) demonstrate broad utility, yet they often lack specialized medical knowledge. To address these gaps, we conduct a systematic evaluation of seven approaches for uncovering disease relationships based on two data sources: (i) sequences of ICD-10 codes from MIMIC-IV EHRs and (ii) the full set of ICD-10 codes, both with and without textual descriptions. Our framework integrates the following: (i) a statistical co-occurrence analysis and a masked language modeling (MLM) approach using real clinical data; (ii) domain-specific BERT variants (Med-BERT and BioClinicalBERT); (iii) a general-purpose BERT and document retrieval; and (iv) four LLMs (Mistral, DeepSeek, Qwen, and YandexGPT). Our graph-based comparison of the obtained interconnection matrices shows that the LLM-based approach produces interconnections with the lowest diversity of ICD code connections to different diseases compared to other methods, including text-based and domain-based approaches. This suggests an important implication: LLMs have limited potential for discovering new interconnections. In the absence of ground truth databases for medical interconnections between ICD codes, our results constitute a valuable medical disease ontology that can serve as a foundational resource for future clinical research and artificial intelligence applications in healthcare.
Authors: Kristi Topollai, Anna Choromanska
Abstract: The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer's memory by determining how much each past gradient contributes to the current convergence direction. Fundamental momentum methods, such as Nesterov Accelerated Gradient and the Heavy Ball method, as well as more recent optimizers such as AdamW and Lion, all rely on the momentum coefficient that is customarily set to $\beta = 0.9$ and kept constant during model training, a strategy widely used by practitioners, yet suboptimal. In this paper, we introduce an \textit{adaptive memory} mechanism that replaces constant momentum with a dynamic momentum coefficient that is adjusted online during optimization. We derive our method by approximating the objective function using two planes: one derived from the gradient at the current iterate and the other obtained from the accumulated memory of the past gradients. To the best of our knowledge, such a proximal framework was never used for momentum-based optimization. Our proposed approach is novel, extremely simple to use, and does not rely on extra assumptions or hyperparameter tuning. We implement adaptive memory variants of both SGD and AdamW across a wide range of learning tasks, from simple convex problems to large-scale deep learning scenarios, demonstrating that our approach can outperform standard SGD and Adam with hand-tuned momentum coefficients. Finally, our work opens doors for new ways of inducing adaptivity in optimization.
Authors: Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang
Abstract: Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
Authors: Martin Benfeghoul, Teresa Delgado, Adnan Oomerjee, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas
Abstract: Transformers' quadratic computational complexity limits their scalability despite remarkable performance. While linear attention reduces this to linear complexity, pre-training such models from scratch remains, in most cases, prohibitively expensive. Recent post-training linearisation methods convert pre-trained Transformers to linear models efficiently, often using hybrid approaches that combine linear attention with sliding-window softmax. We identify a critical flaw: existing hybrid methods inadvertently bypass the linear component, relying almost entirely on SWA. Component-level diagnostics reveal this previously undetected behaviour stems from overlooked evaluation practices on common-sense benchmarks. We propose three solutions to ensure balanced component usage: (i) inference-time hybridisation of linear-only conversions with sliding-window softmax; (ii) HedgeCATs, combining attention-weight transfer with targeted LoRA fine-tuning; and (iii) Scheduled Sliding-window Dropout (SSD), which stochastically suppresses the softmax branch during training to prevent component collapse. Our methods maintain computational efficiency while recovering most base model performance and ensuring genuine linear attention adoption, restoring the validity of performance attributions in hybrid conversions.
Authors: Masih Aminbeidokhti, Heitor Rapela Medeiros, Srikanth Muralidharan, Eric Granger, Marco Pedersoli
Abstract: Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
Authors: Tingfeng Hong, Pingye Ren, Xinlong Xiao, Chao Wang, Chenyi Lei, Wenwu Ou, Han Li
Abstract: Balancing multiple objectives is critical for user satisfaction in modern recommender and search systems, yet current Multi-Task Fusion (MTF) methods rely on static, manually-tuned weights that fail to capture individual user intent. While Reinforcement Learning (RL) offers a path to personalization, traditional approaches often falter due to training instability and the sparse rewards inherent in these large-scale systems. To address these limitations, we propose Group-relative Reinforcement learning with Adaptive Dirichlet Exploration (GRADE), a novel and robust framework for personalized multi-task fusion. GRADE leverages a critic-free, Group Relative Policy Optimization (GRPO) paradigm, enabling stable and efficient policy learning by evaluating the relative performance of candidate weight groups. Its core innovations include employing the Dirichlet distribution for principled and structured exploration of the weight space, and a composite reward function that combines sparse user feedback with dense model priors and rule-based constraints to guide the search effectively. Deployed in the in-app marketplace of an application with over hundreds of millions daily active users, GRADE significantly outperforms established baselines, achieving substantial gains in rigorous large-scale A/B tests: +0.595\% in CTR, +1.193\% in CVR, +1.788\% in OPM, and +1.568\% in total order volume. Following its strong performance, GRADE has been fully deployed in the marketplace search scenario of Kuaishou, serving hundreds of millions of users.
Authors: Murtaza Rangwala, Farag Azzedin, Richard O. Sinnott, Rajkumar Buyya
Abstract: Decentralized Federated Learning (DFL) enables privacy-preserving collaborative training without centralized servers, but remains vulnerable to Byzantine attacks where malicious clients submit corrupted model updates. Existing Byzantine-robust DFL defenses rely on similarity-based neighbor screening that requires every client to exchange and compare complete high-dimensional model vectors with all neighbors in each training round, creating prohibitive communication and computational costs that prevent deployment at web scale. We propose SketchGuard, a general framework that decouples Byzantine filtering from model aggregation through sketch-based neighbor screening. SketchGuard compresses $d$-dimensional models to $k$-dimensional sketches ($k \ll d$) using Count Sketch for similarity comparisons, then selectively fetches full models only from accepted neighbors, reducing per-round communication complexity from $O(d|N_i|)$ to $O(k|N_i| + d|S_i|)$, where $|N_i|$ is the neighbor count and $|S_i| \le |N_i|$ is the accepted neighbor count. We establish rigorous convergence guarantees in both strongly convex and non-convex settings, proving that Count Sketch compression preserves Byzantine resilience with controlled degradation bounds where approximation errors introduce only a $(1+O(\epsilon))$ factor in the effective threshold parameter. Comprehensive experiments across multiple datasets, network topologies, and attack scenarios demonstrate that SketchGuard maintains identical robustness to state-of-the-art methods while reducing computation time by up to 82% and communication overhead by 50-70% depending on filtering effectiveness, with benefits scaling multiplicatively with model dimensionality and network connectivity. These results establish the viability of sketch-based compression as a fundamental enabler of robust DFL at web scale.
Authors: Kazuki Egashira, Robin Staab, Thibaud Gloaguen, Mark Vero, Martin Vechev
Abstract: Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning in vLLM are applied (Magnitude, Wanda, and SparseGPT), it consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
Authors: Binbin Huang, Luo Luo, Yanghua Xiao, Deqing Yang, Baojian Zhou
Abstract: This work proposes a novel framework based on nested evolving set processes to accelerate Personalized PageRank (PPR) computation. At each stage of the process, we employ a localized inexact proximal point iteration to solve a simplified linear system. We show that the time complexity of such localized methods is upper bounded by $\min\{\tilde{\mathcal{O}}(R^2/\epsilon^2), \tilde{\mathcal{O}}(m)\}$ to obtain an $\epsilon$-approximation of the PPR vector, where $m$ denotes the number of edges in the graph and $R$ is a constant defined via nested evolving set processes. Furthermore, the algorithms induced by our framework require solving only $\tilde{\mathcal{O}}(1/\sqrt{\alpha})$ such linear systems, where $\alpha$ is the damping factor. When $1/\epsilon^2\ll m$, this implies the existence of an algorithm that computes an $\ epsilon $-approximation of the PPR vector with an overall time complexity of $\tilde{\mathcal{O}}\left(R^2 / (\sqrt{\alpha}\epsilon^2)\right)$, independent of the underlying graph size. Our result resolves an open conjecture from existing literature. Experimental results on real-world graphs validate the efficiency of our methods, demonstrating significant convergence in the early stages.
Authors: Larissa Reichart, Cem Ata Baykara, Ali Burak \"Unal, Harlin Lee, Mete Akg\"un
Abstract: Unsupervised multi-source domain adaptation (UMDA) aims to learn models that generalize to an unlabeled target domain by leveraging labeled data from multiple, diverse source domains. While distributed UMDA methods address privacy constraints by avoiding raw data sharing, existing approaches typically assume a small number of sources and fail to scale effectively. Increasing the number of heterogeneous domains often makes existing methods impractical, leading to high computational overhead or unstable performance. We propose GALA, a scalable and robust federated UMDA framework that introduces two key components: (1) a novel inter-group discrepancy minimization objective that efficiently approximates full pairwise domain alignment without quadratic computation; and (2) a temperature-controlled, centroid-based weighting strategy that dynamically prioritizes source domains based on alignment with the target. Together, these components enable stable and parallelizable training across large numbers of heterogeneous sources. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 digit datasets with varied synthetic and real-world domain shifts. Extensive experiments show that GALA consistently achieves competitive or state-of-the-art results on standard benchmarks and significantly outperforms prior methods in diverse multi-source settings where others fail to converge.
Authors: Jacob Trauger, Tyson Trauger, Ambuj Tewari
Abstract: In this paper we will give a characterization of the learnability of forgiving 0-1 loss functions in the finite label multiclass setting. To do this, we create a new combinatorial dimension that is based off of the Natarajan Dimension and we show that a hypothesis class is learnable in our setting if and only if this Generalized Natarajan Dimension is finite. We also show a connection to learning with set-valued feedback. Through our results we show that the learnability of a set learning problem is characterized by the Natarajan Dimension.
Authors: Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang
Abstract: Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tunes with downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at https://github.com/wwhenxuan/SymTime.
Authors: Joe Suk, Yaqi Duan
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. We validate these predictions through controlled bandit simulations.
Authors: Zilin Kang, Chonghua Liao, Tingqiang Xu, Huazhe Xu
Abstract: We propose ERA, a new paradigm that constrains the sampling entropy above given thresholds by applying specially designed activations to the outputs of models. Our approach demonstrates broad effectiveness across different domains: 1) for large language models(LLMs), boosting the AIME 2025 score for Qwen2.5-Math-7B by 37.4%; 2) for continuous control reinforcement learning agents, improving performance by more than 30% over strong baselines such as SAC on the challenging HumanoidBench; 3) for image classification, enhancing ImageNet top-1 accuracy by 0.69% for ResNet-50. These gains are achieved with a computational overhead of less than 7%. Our work validates output activation as a powerful tool for entropy control, opening a new direction for designing simpler and more robust algorithms.
Authors: Yifan Zhang, Michael Sandborn, Stefan Larson, Yu Huang, Kevin Leach
Abstract: Large Language Models (LLMs) are transforming software engineering tasks, including code vulnerability detection-a critical area of software security. However, existing methods often rely on resource-intensive models or graph-based techniques, limiting their accessibility and practicality. This paper introduces K-ASTRO, a lightweight Transformer model that combines semantic embeddings from LLMs with structural features of Abstract Syntax Trees (ASTs) to improve both efficiency and accuracy in code vulnerability detection. Our approach introduces an AST-based augmentation technique inspired by mutation testing, a structure-aware attention mechanism that incorporates augmented AST features, and a joint adaptation pipeline to unify code semantics and syntax. Experimental results on three large-scale datasets, including BigVul, DiverseVul, and PrimeVul-demonstrate state-of-the-art performance while enabling rapid inference on CPUs with minimal training time. By offering a scalable, interpretable, and efficient solution, K-ASTRO bridges the gap between LLM advancements and practical software vulnerability detection, providing open-sourced tools to foster further research.
Authors: Luigi Celona, Simone Bianco, Marco Donzella, Paolo Napoletano
Abstract: State-of-The-Art (SoTA) image captioning models are often trained on the MicroSoft Common Objects in Context (MS-COCO) dataset, which contains human-annotated captions with an average length of approximately ten tokens. Although effective for general scene understanding, these short captions often fail to capture complex scenes and convey detailed information. Moreover, captioning models tend to exhibit bias towards the ``average'' caption, which captures only the more general aspects, thus overlooking finer details. In this paper, we present a novel approach to generate richer and more informative image captions by combining the captions generated from different SoTA captioning models. Our proposed method requires no additional model training: given an image, it leverages pre-trained models from the literature to generate the initial captions, and then ranks them using a newly introduced image-text-based metric, which we name BLIPScore. Subsequently, the top two captions are fused using a Large Language Model (LLM) to produce the final, more detailed description. Experimental results on the MS-COCO and Flickr30k test sets demonstrate the effectiveness of our approach in terms of caption-image alignment and hallucination reduction according to the ALOHa, CAPTURE, and Polos metrics. A subjective study lends additional support to these results, suggesting that the captions produced by our model are generally perceived as more consistent with human judgment. By combining the strengths of diverse SoTA models, our method enhances the quality and appeal of image captions, bridging the gap between automated systems and the rich and informative nature of human-generated descriptions. This advance enables the generation of more suitable captions for the training of both vision-language and captioning models.
Authors: Qinhao Zhou, Yuwen Tan, Boqing Gong, Xiang Xiang
Abstract: Class-incremental learning (CIL) aims to enable models to continuously learn new classes while overcoming catastrophic forgetting. The introduction of pre-trained models has brought new tuning paradigms to CIL. In this paper, we revisit different parameter-efficient tuning (PET) methods within the context of continual learning. We observe that adapter tuning demonstrates superiority over prompt-based methods, even without parameter expansion in each learning session. Motivated by this, we propose incrementally tuning the shared adapter without imposing parameter update constraints, enhancing the learning capacity of the backbone. Additionally, we employ feature sampling from stored prototypes to retrain a unified classifier, further improving its performance. We estimate the semantic shift of old prototypes without access to past samples and update stored prototypes session by session. Our proposed method eliminates model expansion and avoids retaining any image samples. It surpasses previous pre-trained model-based CIL methods and demonstrates remarkable continual learning capabilities. Experimental results on five CIL benchmarks validate the effectiveness of our approach, achieving state-of-the-art (SOTA) performance.
Authors: Elke R. Gizewski, Markus Holzleitner, Lukas Mayer-Suess, Sergiy Pereverzyev Jr., Sergei V. Pereverzyev
Abstract: Most of the recent results in polynomial functional regression have been focused on an in-depth exploration of single-parameter regularization schemes. In contrast, in this study we go beyond that framework by introducing an algorithm for multiple parameter regularization and presenting a theoretically grounded method for dealing with the associated parameters. This method facilitates the aggregation of models with varying regularization parameters. The efficacy of the proposed approach is assessed through evaluations on both synthetic and some real-world medical data, revealing promising results.
Authors: Joris Depoortere, Johan Driesen, Johan Suykens, Hussain Syed Kazmi
Abstract: Deep learning models have gained increasing prominence in recent years in the field of solar pho-tovoltaic (PV) forecasting. One drawback of these models is that they require a lot of high-quality data to perform well. This is often infeasible in practice, due to poor measurement infrastructure in legacy systems and the rapid build-up of new solar systems across the world. This paper proposes SolNet: a novel, general-purpose, multivariate solar power forecaster, which addresses these challenges by using a two-step forecasting pipeline which incorporates transfer learning from abundant synthetic data generated from PVGIS, before fine-tuning on observational data. Using actual production data from hundreds of sites in the Netherlands, Australia and Belgium, we show that SolNet improves forecasting performance over data-scarce settings as well as baseline models. We find transfer learning benefits to be the strongest when only limited observational data is available. At the same time we provide several guidelines and considerations for transfer learning practitioners, as our results show that weather data, seasonal patterns, amount of synthetic data and possible mis-specification in source location, can have a major impact on the results. The SolNet models created in this way are applicable for any land-based solar photovoltaic system across the planet where simulated and observed data can be combined to obtain improved forecasting capabilities.
Authors: Lang Zeng, Weijing Tang, Zhao Ren, Ying Ding
Abstract: The stochastic gradient descent (SGD) algorithm has been widely used to optimize deep Cox neural network (Cox-NN) by updating model parameters using mini-batches of data. We show that SGD aims to optimize the average of mini-batch partial-likelihood, which is different from the standard partial-likelihood. This distinction requires developing new statistical properties for the global optimizer, namely, the mini-batch maximum partial-likelihood estimator (mb-MPLE). We establish that mb-MPLE for Cox-NN is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression with linear covariate effects, we further show that mb-MPLE is $\sqrt{n}$-consistent and asymptotically normal with asymptotic variance approaching the information lower bound as batch size increases, which is confirmed by simulation studies. Additionally, we offer practical guidance on using SGD, supported by theoretical analysis and numerical evidence. For Cox-NN, we demonstrate that the ratio of the learning rate to the batch size is critical in SGD dynamics, offering insight into hyperparameter tuning. For Cox regression, we characterize the iterative convergence of SGD, ensuring that the global optimizer, mb-MPLE, can be approximated with sufficiently many iterations. Finally, we demonstrate the effectiveness of mb-MPLE in a large-scale real-world application where the standard MPLE is intractable.
Authors: Darshan Thaker, Abhishek Goyal, Ren\'e Vidal
Abstract: Image restoration aims to recover high-quality images from degraded observations. When the degradation process is known, the recovery problem can be formulated as an inverse problem, and in a Bayesian context, the goal is to sample a clean reconstruction given the degraded observation. Recently, modern pretrained diffusion models have been used for image restoration by modifying their sampling procedure to account for the degradation process. However, these methods often rely on certain approximations that can lead to significant errors and compromised sample quality. In this paper, we provide the first rigorous analysis of this approximation error for linear inverse problems under distributional assumptions on the space of natural images, demonstrating cases where previous works can fail dramatically. Motivated by our theoretical insights, we propose a simple modification to existing diffusion-based restoration methods. Our approach introduces a time-varying low-pass filter in the frequency domain of the measurements, progressively incorporating higher frequencies during the restoration process. We develop an adaptive curriculum for this frequency schedule based on the underlying data distribution. Our method significantly improves performance on challenging image restoration tasks including motion deblurring and image dehazing.
Authors: Xianfeng Tan, Yuhan Li, Wenxiang Shang, Yubo Wu, Jian Wang, Xuanhong Chen, Yi Zhang, Ran Lin, Bingbing Ni
Abstract: Standard clothing asset generation involves restoring forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized structure sampling distributions and clothing semantic absence in complex scenarios. Existing models have limited spatial perception, often exhibiting structural hallucinations and texture distortion in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating knowledge from language models and external databases. RAGDiffusion consists of two processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a coarse-to-fine texture alignment that ensures fidelity in pattern and detail components within the diffusing. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and texture-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.
Authors: Yuangang Li, Jiaqi Li, Zhuo Xiao, Tiankai Yang, Yi Nian, Xiyang Hu, Yue Zhao
Abstract: Anomaly detection (AD) is an important machine learning task with applications in fraud detection, content moderation, and user behavior analysis. However, AD is relatively understudied in a natural language processing (NLP) context, limiting its effectiveness in detecting harmful content, phishing attempts, and spam reviews. We introduce NLP-ADBench, the most comprehensive NLP anomaly detection (NLP-AD) benchmark to date, which includes eight curated datasets and 19 state-of-the-art algorithms. These span 3 end-to-end methods and 16 two-step approaches that adapt classical, non-AD methods to language embeddings from BERT and OpenAI. Our empirical results show that no single model dominates across all datasets, indicating a need for automated model selection. Moreover, two-step methods with transformer-based embeddings consistently outperform specialized end-to-end approaches, with OpenAI embeddings outperforming those of BERT. We release NLP-ADBench at https://github.com/USC-FORTIS/NLP-ADBench, providing a unified framework for NLP-AD and supporting future investigations.
Authors: Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna
Abstract: Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework where models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines and surpassing fine-tuning on CC3M. Second, we also design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation where we train models to identify challenging cases by learning from synthetic data.
Authors: Fabian Paischer, Liu Yang, Linfeng Liu, Shuai Shao, Kaveh Hassani, Jiacheng Li, Ricky Chen, Zhang Gabriel Li, Xiaoli Gao, Wei Shao, Xue Feng, Nima Noorshams, Sem Park, Bo Long, Hamid Eghbalzadeh
Abstract: In sequential recommendation, models recommend items based on user's interaction history. To this end, current models usually incorporate information such as item descriptions and user intent or preferences. User preferences are usually not explicitly given in open-source datasets, and thus need to be approximated, for example via large language models (LLMs). Current approaches leverage approximated user preferences only during training and rely solely on the past interaction history for recommendations, limiting their ability to dynamically adapt to changing preferences, potentially reinforcing echo chambers. To address this issue, we propose a new paradigm, namely preference discerning, which explicitly conditions a generative recommendation model on user preferences in natural language within its context. To evaluate preference discerning, we introduce a novel benchmark that provides a holistic evaluation across various scenarios, including preference steering and sentiment following. Upon evaluating current state-of-the-art methods on our benchmark, we discover that their ability to dynamically adapt to evolving user preferences is limited. To address this, we propose a new method named Mender ($\textbf{M}$ultimodal Prefer$\textbf{en}$ce $\textbf{D}$iscern$\textbf{er}$), which achieves state-of-the-art performance in our benchmark. Our results show that Mender effectively adapts its recommendation guided by human preferences, even if not observed during training, paving the way toward more flexible recommendation models.
Authors: Bhanu Tokas, Rahul Nair, Hannah Kerner
Abstract: Most of the ML datasets we use today are biased. When we train models on these biased datasets, they often not only learn dataset biases but can also amplify them -- a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification between a protected attribute A (e.g., gender) and a task T (e.g., cooking). However, these metrics fail to measure biases when A is balanced with T. To measure bias amplification in balanced datasets, recent work proposed a predictability-based metric called leakage amplification. However, leakage amplification cannot identify the direction in which biases are amplified. In this work, we propose a new predictability-based metric called directional predictability amplification (DPA). DPA measures directional bias amplification, even for balanced datasets. Unlike leakage amplification, DPA is easier to interpret and less sensitive to attacker models (a hyperparameter in predictability-based metrics). Our experiments on tabular and image datasets show that DPA is an effective metric for measuring directional bias amplification. The code will be available soon.
Authors: Jing An, Jianfeng Lu
Abstract: The two-timescale gradient descent-ascent (GDA) is a canonical gradient algorithm designed to find Nash equilibria in min-max games. We analyze the two-timescale GDA by investigating the effects of learning rate ratios on convergence behavior in both finite-dimensional and mean-field settings. In particular, for finite-dimensional quadratic min-max games, we obtain long-time convergence in near quasi-static regimes through the hypocoercivity method. For mean-field GDA dynamics, we investigate convergence under a finite-scale ratio using a mixed synchronous-reflection coupling technique.
Authors: Zhimin Zhao
Abstract: Foundation models (FMs), particularly large language models (LLMs), have shown significant promise in various software engineering (SE) tasks, including code generation, debugging, and requirement refinement. Despite these advances, existing evaluation frameworks are insufficient for assessing model performance in iterative, context-rich workflows characteristic of SE activities. To address this limitation, we introduce \emph{SWE-Arena}, an interactive platform designed to evaluate FMs in SE tasks. SWE-Arena provides a transparent, open-source leaderboard, supports multi-round conversational workflows, and enables end-to-end model comparisons. The platform introduces novel metrics, including \emph{model consistency score} that measures the consistency of model outputs through self-play matches, and \emph{conversation efficiency index} that evaluates model performance while accounting for the number of interaction rounds required to reach conclusions. Moreover, SWE-Arena incorporates a new feature called \emph{RepoChat}, which automatically injects repository-related context (e.g., issues, commits, pull requests) into the conversation, further aligning evaluations with real-world development processes. This paper outlines the design and capabilities of SWE-Arena, emphasizing its potential to advance the evaluation and practical application of FMs in software engineering.
Authors: Filip Ekstr\"om Kelvinius, Oskar B. Andersson, Abhijith S. Parackal, Dong Qian, Rickard Armiento, Fredrik Lindsten
Abstract: Crystalline materials often exhibit a high level of symmetry. However, most generative models do not account for symmetry, but rather model each atom without any constraints on its position or element. We propose a generative model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based descriptions of crystals. This is enabled by considering a crystal structure representation that encodes all symmetry, and we design a novel neural network architecture which enables using this representation inside a discrete generative model framework. In addition to respecting symmetry by construction, the discrete nature of our model enables fast generation. We additionally present a new metric, Fr\'echet Wrenformer Distance, which captures the symmetry aspects of the materials generated, and we benchmark WyckoffDiff against recently proposed generative models for crystal generation. As a proof-of-concept study, we use WyckoffDiff to find new materials below the convex hull of thermodynamical stability.
Authors: Tatsuro Inaba, Go Kamoda, Kentaro Inui, Masaru Isonuma, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi
Abstract: This study explores how bilingual language models develop complex internal representations. We employ sparse autoencoders to analyze internal representations of bilingual language models with a focus on the effects of training steps, layers, and model sizes. Our analysis shows that language models first learn languages separately, and then gradually form bilingual alignments, particularly in the mid layers. We also found that this bilingual tendency is stronger in larger models. Building on these findings, we demonstrate the critical role of bilingual representations in model performance by employing a novel method that integrates decomposed representations from a fully trained model into a mid-training model. Our results provide insights into how language models acquire bilingual capabilities.
Authors: Piotr Bia{\l}as, Vaibhav Chahar, Piotr Korcyl, Tomasz Stebel, Mateusz Winiarski, Dawid Zapolski
Abstract: Autoregressive Neural Networks (ANN) have been recently proposed as a mechanism to improve the efficiency of Monte Carlo algorithms for several spin systems. The idea relies on the fact that the total probability of a configuration can be factorized into conditional probabilities of each spin, which in turn can be approximated by a neural network. Once trained, the ANNs can be used to sample configurations from the approximated probability distribution and to explicitly evaluate this probability for a given configuration. It has also been observed that such conditional probabilities give access to information-theoretic observables such as mutual information or entanglement entropy. In this paper, we describe the hierarchical autoregressive network (HAN) algorithm in three spatial dimensions and study its performance using the example of the Ising model. We compare HAN with three other autoregressive architectures and the classical Wolff cluster algorithm. Finally, we provide estimates of thermodynamic observables for the three-dimensional Ising model, such as entropy and free energy, in a range of temperatures across the phase transition.
Authors: Asifullah Khan, Laiba Asmatullah, Anza Malik, Shahzaib Khan, Hamna Asif
Abstract: Self-supervised learning is a machine learning approach that generates implicit labels by learning underlined patterns and extracting discriminative features from unlabeled data without manual labelling. Contrastive learning introduces the concept of "positive" and "negative" samples, where positive pairs (e.g., variation of the same image/object) are brought together in the embedding space, and negative pairs (e.g., views from different images/objects) are pushed farther away. This methodology has shown significant improvements in image understanding and image text analysis without much reliance on labeled data. In this paper, we comprehensively discuss the terminologies, recent developments and applications of contrastive learning with respect to text-image models. Specifically, we provide an overview of the approaches of contrastive learning in text-image models in recent years. Secondly, we categorize the approaches based on different model structures. Thirdly, we further introduce and discuss the latest advances of the techniques used in the process such as pretext tasks for both images and text, architectural structures, and key trends. Lastly, we discuss the recent state-of-art applications of self-supervised contrastive learning Text-Image based models.
Authors: Weronika {\L}ajewska, Momchil Hardalov, Laura Aina, Neha Anna John, Hang Su, Llu\'is M\`arquez
Abstract: Recent advancements in large language models (LLMs) have enabled their successful application to a broad range of tasks. However, in information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. Recently, various prompt compression techniques have been introduced to optimize the trade-off between reducing input length and retaining performance. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods. We focus on three key aspects, besides compression ratio: (i) downstream task performance, (ii) grounding in the input context, and (iii) information preservation. Using our framework, we analyze state-of-the-art soft and hard compression methods and show that some fail to preserve key details from the original prompt, limiting performance on complex tasks. By identifying these limitations, we are able to improve one soft prompting method by controlling compression granularity, achieving up to +23% in downstream performance, +8 BERTScore points in grounding, and 2.7x more entities preserved in compression. Ultimately, we find that the best effectiveness/compression rate trade-off is achieved with soft prompting combined with sequence-level training.The code is available at https://github.com/amazon-science/information-preservation-in-prompt-compression.
URLs: https://github.com/amazon-science/information-preservation-in-prompt-compression.
Authors: Qiyu Wei, Haowei Wang, Zirui Cao, Songhao Wang, Richard Allmendinger, Mauricio A \'Alvarez
Abstract: Bayesian optimization (BO) is an effective technique for black-box optimization. However, its applicability is typically limited to moderate-budget problems due to the cubic complexity of fitting the Gaussian process (GP) surrogate model. In large-budget scenarios, directly employing the standard GP model faces significant challenges in computational time and resource requirements. In this paper, we propose a novel approach, gradient-based sample selection Bayesian Optimization (GSSBO), to enhance the computational efficiency of BO. The GP model is constructed on a selected set of samples instead of the whole dataset. These samples are selected by leveraging gradient information to remove redundancy while preserving diversity and representativeness. We provide a theoretical analysis of the gradient-based sample selection strategy and obtain explicit sublinear regret bounds for our proposed framework. Extensive experiments on synthetic and real-world tasks demonstrate that our approach significantly reduces the computational cost of GP fitting in BO while maintaining optimization performance comparable to baseline methods.
Authors: Yumin Choi, Jinheon Baek, Sung Ju Hwang
Abstract: Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.
Authors: Xiangqi Wang, Yue Huang, Yanbo Wang, Xiaonan Luo, Kehan Guo, Yujun Zhou, Xiangliang Zhang
Abstract: LLMs often need effective configurations, like temperature and reasoning steps, to handle tasks requiring sophisticated reasoning and problem-solving, ranging from joke generation to mathematical reasoning. Existing prompting approaches usually adopt general-purpose, fixed configurations that work 'well enough' across tasks but seldom achieve task-specific optimality. To address this gap, we introduce AdaReasoner, an LLM-agnostic plugin designed for any LLM to automate adaptive reasoning configurations for tasks requiring different types of thinking. AdaReasoner is trained using a reinforcement learning (RL) framework, combining a factorized action space with a targeted exploration strategy, along with a pretrained reward model to optimize the policy model for reasoning configurations with only a few-shot guide. AdaReasoner is backed by theoretical guarantees and experiments of fast convergence and a sublinear policy gap. Across six different LLMs and a variety of reasoning tasks, it consistently outperforms standard baselines, preserves out-of-distribution robustness, and yield gains on knowledge-intensive tasks through tailored prompts.
Authors: Zhongyu Ouyang, Qianlong Wen, Chunhui Zhang, Yanfang Ye, Soroush Vosoughi
Abstract: Sequential recommendation systems aspire to profile users by interpreting their interaction histories, echoing how humans make decisions by weighing experience, relative preference strength, and situational relevance. Yet, existing large language model (LLM)-based recommenders often fall short of mimicking the flexible, context-aware decision strategies humans exhibit, neglecting the structured, dynamic, and context-aware mechanisms fundamental to human behaviors. To bridge this gap, we propose RecPO, a preference optimization framework that models structured feedback and contextual delay to emulate human-like prioritization in sequential recommendation. RecPO exploits adaptive reward margins based on inferred preference hierarchies and temporal signals, enabling the model to favor immediately relevant items and to distinguish between varying degrees of preference and aversion. Extensive experiments across five real-world datasets demonstrate that RecPO not only yields performance gains over state-of-the-art baselines, but also mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
Authors: David Tschan, Zhi Wang, Dominik Strebel, Jan Carmeliet, Yongling Zhao
Abstract: Urban heat islands (UHIs) are often accentuated during heat waves (HWs) and pose a public health risk. Mitigating UHIs requires urban planners to first estimate how urban heat is influenced by different land use types (LUTs) and drivers across scales - from synoptic-scale climatic background processes to small-scale urban- and scale-bridging features. This study proposes to classify these drivers into driving (D), urban (U), and local (L) features, respectively. To increase interpretability and enhance computation efficiency, a LUT-distinguishing machine learning approach is proposed as a fast emulator for Weather Research and Forecasting model (WRF) coupled to the Noah land surface model (LSM) to predict ground- (TSK) and 2-meter air temperature (T2). Using random forest regression (RFR) with extreme gradient boosting (XGB) trained on WRF output over Zurich, Switzerland, during heatwave (HW) periods in 2017 and 2019, this study proposes LUT-based (LB) models that categorize features by scales and practical controllability, allowing optional categorical weighting. This approach enables category-specific feature ranking and sensitivity estimation of T2 and TSK to most important small-scale drivers - most notably surface emissivity, albedo, and leaf area index (LAI). Models employing the LB framework are statistically significantly more accurate than models that do not, with higher performance when more HW data is included in training. With RFR-XGB robustly performing optimal with unit weights, the method substantially increase interpretability. Despite the needs to reduce uncertainties and test the method on other cities, the proposed approach offers urban planners a direct framework for feasibility-centered UHI mitigation assessment.
Authors: Marcel Hudiani
Abstract: We study the convergence rate for the last iterate of stochastic gradient descent (SGD) and stochastic heavy ball (SHB) in the parametric setting when the objective function $F$ is globally convex or non-convex whose gradient is $\gamma$-H\"{o}lder. Using only discrete Gronwall's inequality without Robbins-Siegmund theorem, we recover results for both SGD and SHB: $\min_{s\leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1})$ for non-convex objectives and $F(w_{\tau \wedge t}) - F_* = o(t^{2\gamma/(1+\gamma) \cdot \max(p-1,-2p+1)-\eps})$ for $\beta \in (0, 1)$, $\tau := \inf \{ t > 0 : F(w_t) = F_*\}$, and $\min_{s \leq t} F(w_s) - F_* = o(t^{p-1})$ for convex objectives $F$ whose minimum is $F_*$. In addition, we proved that SHB with constant momentum parameter $\beta \in (0, 1)$ attains a convergence rate of $F(w_t) - F_* = O(t^{\max(p-1,-2p+1)} \log^2 \frac{t}{\delta})$ with probability at least $1-\delta$ when $F$ is convex and $\gamma = 1$ and step size $\alpha_t = \Theta(t^{-p})$ with $p \in (\frac{1}{2}, 1)$.
Authors: Abdelrahman S. Abdelrahman, Shuvro Chowdhury, Flaviano Morone, Kerem Y. Camsari
Abstract: We introduce a generalized \textit{Probabilistic Approximate Optimization Algorithm (PAOA)}, a classical variational Monte Carlo framework that extends and formalizes prior work by Weitz \textit{et al.}~\cite{Combes_2023}, enabling parameterized and fast sampling on present-day Ising machines and probabilistic computers. PAOA operates by iteratively modifying the couplings of a network of binary stochastic units, guided by cost evaluations from independent samples. We establish a direct correspondence between derivative-free updates and the gradient of the full Markov flow over the exponentially large state space, showing that PAOA admits a principled variational formulation. Simulated annealing emerges as a limiting case under constrained parameterizations, and we implement this regime on an FPGA-based probabilistic computer with on-chip annealing to solve large 3D spin-glass problems. Benchmarking PAOA against QAOA on the canonical 26-spin Sherrington-Kirkpatrick model with matched parameters reveals superior performance for PAOA. We show that PAOA naturally extends simulated annealing by optimizing multiple temperature profiles, leading to improved performance over SA on heavy-tailed problems such as SK-L\'evy.
Authors: Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
Abstract: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers faces severe computational and memory bottlenecks with long sequences due to the quadratic complexity of softmax attention and the growing Key-Value (KV) cache that makes inference memory-bound by context length. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardwareaware algorithm that solves numerical instability in gated attention to accelerate training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model's performance, significantly outperforming previous methods by up to 9.4 - 24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.
Authors: Sybelle Goedicke-Fritz (Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany), Michelle Bous (Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany), Annika Engel (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany), Matthias Flotho (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany, Helmholtz Institute for Pharmaceutical Research Saarland), Pascal Hirsch (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany), Hannah Wittig (Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany), Dino Milanovic (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany), Dominik Mohr (Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany), Mathias Kaspar (Digital Medicine, University Hospital of Augsburg, Augsburg, Germany), Sogand Nemat (Department of Radiology, and Interventional Radiology, University Hospital of Saarland, Homburg, Germany), Dorothea Kerner (Department of Radiology, and Interventional Radiology, University Hospital of Saarland, Homburg, Germany), Arno B\"ucker (Department of Radiology, and Interventional Radiology, University Hospital of Saarland, Homburg, Germany), Andreas Keller (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany, Helmholtz Institute for Pharmaceutical Research Saarland, Pharma Science Hub), Sascha Meyer (Clinical Centre Karlsruhe, Franz-Lust Clinic for Paediatrics, Karlsruhe, Germany), Michael Zemlin (Department of General Pediatrics and Neonatology, Saarland University, Campus Homburg, Homburg/Saar, Germany), Philipp Flotho (Chair for Clinical Bioinformatics, Saarland Informatics Campus, Saarland University, Saarbr\"ucken, Germany, Helmholtz Institute for Pharmaceutical Research Saarland)
Abstract: Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome is crucial to avoid unnecessary toxicity in low risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants ($\leq$32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting and evaluated a CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model with progressive freezing, linear probing and CutMix achieved an AUROC of 0.78 $\pm$ 0.10, balanced accuracy of 0.69 $\pm$ 0.10, and an F1-score of 0.67 $\pm$ 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031) which confirms domain-specific pretraining to be important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 $\pm$ 0.11), confirming the need of learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
Authors: Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, Chenghao Liu
Abstract: Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisonTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1) vision-model-based filtering to identify high-quality sequences to stabilize pre-training and mitigate modality gap; (2) colorized multivariate conversion, encoding multivariate series as multi-subfigure RGB images to enhance cross-variate modeling; (3) multi-quantile forecasting, using parallel reconstruction heads to generate quantile forecasts without parametric assumptions. Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in GIFT-Eval benchmark which comprises 23 datasets across 7 domains. Our work demonstrates that with appropriate adaptation, vision models can effectively generalize to TSF, thus advancing the pursuit of universal TSFMs. Code is available at https://github.com/HALF111/VisionTSpp.
Authors: Keqiang Guo, Yuheng Zhong, Xin Tong, Jiangbin Lyu, Rui Zhang
Abstract: Accurately predicting beam-level reference signal received power (RSRP) is essential for beam management in dense multi-user wireless networks, yet challenging due to high measurement overhead and fast channel variations. This paper proposes Neural Beam Field (NBF), a hybrid neural-physical framework for efficient and interpretable spatial beam RSRP prediction. Central to our approach is the introduction of the Multi-path Conditional Power Profile (MCPP), a learnable physical intermediary representing the site-specific propagation environment. This approach decouples the environment from specific antenna/beam configurations, which helps the model learn site-specific multipath features and enhances its generalization capability. We adopt a decoupled ``blackbox-whitebox" design: a Transformer-based deep neural network (DNN) learns the MCPP from sparse user measurements and positions, while a physics-inspired module analytically infers beam RSRP statistics. To improve convergence and adaptivity, we further introduce a Pretrain-and-Calibrate (PaC) strategy that leverages ray-tracing priors for physics-grounded pretraining and then RSRP data for on-site calibration. Extensive simulation results demonstrate that NBF significantly outperforms conventional table-based channel knowledge maps (CKMs) and pure blackbox DNNs in prediction accuracy, training efficiency, and generalization, while maintaining a compact model size. The proposed framework offers a scalable and physically grounded solution for intelligent beam management in next-generation dense wireless networks.
Authors: Ke Ma, Jun Long, Hongxiao Fei, Liujie Hua, Zhen Dai, Yueyi Luo
Abstract: Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks. The source code is available at https://github.com/cockmake/ACD-CLIP.
Authors: Ze Liu, Xianquan Wang, Shuochen Liu, Jie Ma, Huibo Xu, Yupeng Han, Kai Zhang, Jun Zhou
Abstract: Implicit feedback is central to modern recommender systems but is inherently noisy, often impairing model training and degrading user experience. At scale, such noise can mislead learning processes, reducing both recommendation accuracy and platform value. Existing denoising strategies typically overlook the entity-specific nature of noise while introducing high computational costs and complex hyperparameter tuning. To address these challenges, we propose \textbf{EARD} (\textbf{E}ntity-\textbf{A}ware \textbf{R}eliability-\textbf{D}riven Denoising), a lightweight framework that shifts the focus from interaction-level signals to entity-level reliability. Motivated by the empirical observation that training loss correlates with noise, EARD quantifies user and item reliability via their average training losses as a proxy for reputation, and integrates these entity-level factors with interaction-level confidence. The framework is \textbf{model-agnostic}, \textbf{computationally efficient}, and requires \textbf{only two intuitive hyperparameters}. Extensive experiments across multiple datasets and backbone models demonstrate that EARD yields substantial improvements over state-of-the-art baselines (e.g., up to 27.01\% gain in NDCG@50), while incurring negligible additional computational cost. Comprehensive ablation studies and mechanism analyses further confirm EARD's robustness to hyperparameter choices and its practical scalability. These results highlight the importance of entity-aware reliability modeling for denoising implicit feedback and pave the way for more robust recommendation research.
Authors: Xiancheng Gao, Yufeng Shi, Wengang Zhou, Houqiang Li
Abstract: Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
Authors: Yushi Lin, Peng Yang
Abstract: Trade-based manipulation (TBM) undermines the fairness and stability of financial markets drastically. Spoofing, one of the most covert and deceptive TBM strategies, exhibits complex anomaly patterns across multilevel prices, while often being simplified as a single-level manipulation. These patterns are usually concealed within the rich, hierarchical information of the Limit Order Book (LOB), which is challenging to leverage due to high dimensionality and noise. To address this, we propose a representation learning framework combining a cascaded LOB representation architecture with supervised contrastive learning. Extensive experiments demonstrate that our framework consistently improves detection performance across diverse models, with Transformer-based architectures achieving state-of-the-art results. In addition, we conduct systematic analyses and ablation studies to investigate multilevel manipulation and the contributions of key components for detection, offering broader insights into representation learning and anomaly detection for complex time series data.
Authors: Eugene Kwek, Wenpeng Yin
Abstract: Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning methods are limited: width pruning often breaks the standard transformer layout, requiring custom inference code, while depth pruning can cause abrupt accuracy drops. Also, while many pruning approaches are effective against LLMs, they struggle to maintain performance on small language models (SLMs). In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/LM head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT inherits strengths of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab. vs. FFN pruning), competitive pruning times, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream performance, with substantial reductions in parameters, GPU memory, and latency.
Authors: Mehran Behjati, Rosdiadee Nordin, Nor Fadzilah Abdullah
Abstract: This paper presents a reinforcement learning (RL) based approach for path planning of cellular connected unmanned aerial vehicles (UAVs) operating beyond visual line of sight (BVLoS). The objective is to minimize travel distance while maximizing the quality of cellular link connectivity by considering real world aerial coverage constraints and employing an empirical aerial channel model. The proposed solution employs RL techniques to train an agent, using the quality of communication links between the UAV and base stations (BSs) as the reward function. Simulation results demonstrate the effectiveness of the proposed method in training the agent and generating feasible UAV path plans. The proposed approach addresses the challenges due to limitations in UAV cellular communications, highlighting the need for investigations and considerations in this area. The RL algorithm efficiently identifies optimal paths, ensuring maximum connectivity with ground BSs to ensure safe and reliable BVLoS flight operation. Moreover, the solution can be deployed as an offline path planning module that can be integrated into future ground control systems (GCS) for UAV operations, enhancing their capabilities and safety. The method holds potential for complex long range UAV applications, advancing the technology in the field of cellular connected UAV path planning.
Authors: Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli
Abstract: We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
Authors: Berker Demirel, Marco Fumero, Theofanis Karaletsos, Francesco Locatello
Abstract: Simulating in silico cellular responses to interventions is a promising direction to accelerate high-content image-based assays, critical for advancing drug discovery and gene editing. To support this, we introduce MorphGen, a state-of-the-art diffusion-based generative model for fluorescent microscopy that enables controllable generation across multiple cell types and perturbations. To capture biologically meaningful patterns consistent with known cellular morphologies, MorphGen is trained with an alignment loss to match its representations to the phenotypic embeddings of OpenPhenom, a state-of-the-art biological foundation model. Unlike prior approaches that compress multichannel stains into RGB images -- thus sacrificing organelle-specific detail -- MorphGen generates the complete set of fluorescent channels jointly, preserving per-organelle structures and enabling a fine-grained morphological analysis that is essential for biological interpretation. We demonstrate biological consistency with real images via CellProfiler features, and MorphGen attains an FID score over 35% lower than the prior state-of-the-art MorphoDiff, which only generates RGB images for a single cell type. Code is available at https://github.com/czi-ai/MorphGen.
Authors: Vaskar Chakma, Wooyeol Choi
Abstract: Current Cyber-Physical Systems (CPS) integrated with Digital Twin (DT) technology face critical limitations in achieving real-time performance for mission-critical industrial applications. Existing 5G-enabled systems suffer from latencies exceeding 10ms, which are inadequate for applications requiring sub-millisecond response times, such as autonomous industrial control and predictive maintenance. This research aims to develop and validate a 6G-enabled Digital Twin framework that achieves ultra-low latency communication and real-time synchronization between physical industrial assets and their digital counterparts, specifically targeting bearing fault detection as a critical industrial use case. The proposed framework integrates terahertz communications (0.1-1 THz), intelligent reflecting surfaces, and edge artificial intelligence within a five-layer architecture. Experimental validation was conducted using the Case Western Reserve University (CWRU) bearing dataset, implementing comprehensive feature extraction (15 time and frequency domain features) and Random Forest classification algorithms. The system performance was evaluated against traditional WiFi-6 and 5G networks across multiple metrics, including classification accuracy, end-to-end latency, and scalability. It achieved 97.7% fault classification accuracy with 0.8ms end-to-end latency, representing a 15.6x improvement over WiFi-6 (12.5ms) and 5.25x improvement over 5G (4.2ms) networks. The system demonstrated superior scalability with sub-linear processing time growth and maintained consistent performance across four bearing fault categories (normal, inner race, outer race, and ball faults) with macro-averaged F1-scores exceeding 97%.
Authors: Baptiste Ferrere (EDF R\&D PRISME, IMT, SINCLAIR AI Lab), Nicolas Bousquet (EDF R\&D PRISME, SINCLAIR AI Lab, LPSM), Fabrice Gamboa (IMT), Jean-Michel Loubes (IMT), Joseph Mur\'e (EDF R\&D PRISME)
Abstract: Explaining the behavior of predictive models with random inputs can be achieved through sub-models decomposition, where such sub-models have easier interpretable features. Arising from the uncertainty quantification community, recent results have demonstrated the existence and uniqueness of a generalized Hoeffding decomposition for such predictive models when the stochastic input variables are correlated, based on concepts of oblique projection onto L 2 subspaces. This article focuses on the case where the input variables have Bernoulli distributions and provides a complete description of this decomposition. We show that in this case the underlying L 2 subspaces are one-dimensional and that the functional decomposition is explicit. This leads to a complete interpretability framework and theoretically allows reverse engineering. Explicit indicators of the influence of inputs on the output prediction (exemplified by Sobol' indices and Shapley effects) can be explicitly derived. Illustrated by numerical experiments, this type of analysis proves useful for addressing decision-support problems, based on binary decision diagrams, Boolean networks or binary neural networks. The article outlines perspectives for exploring high-dimensional settings and, beyond the case of binary inputs, extending these findings to models with finite countable inputs.
Authors: MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Bing Liu, Yunzhong He, Afra Feyza Aky\"urek
Abstract: Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.
Authors: Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Mizanur Rahman, Amran Bhuiyan, Israt Jahan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang
Abstract: Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks.
Authors: Abdelhakim Benechehab, Gabriel Singer, Corentin L\'eger, Youssef Attia El Hili, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, Bal\'azs K\'egl
Abstract: Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work have highlighted its limitations, particularly in generalization and susceptibility to catastrophic forgetting compared to Reinforcement Learning techniques, such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner-level. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at https://github.com/abenechehab/nll_to_po .