Authors: Felix Michalak, Steven Abreu
Abstract: We demonstrate complete functional segregation in hybrid SSM-Transformer architectures: retrieval depends exclusively on self-attention layers. Across RecurrentGemma-2B/9B and Jamba-Mini-1.6, attention ablation causes catastrophic retrieval failure (0% accuracy), while SSM layers show no compensatory mechanisms even with improved prompting. Conversely, sparsifying attention to just 15% of heads maintains near-perfect retrieval while preserving 84% MMLU performance, suggesting self-attention specializes primarily for retrieval tasks. We identify precise mechanistic requirements for retrieval: needle tokens must be exposed during generation and sufficient context must be available during prefill or generation. This strict functional specialization challenges assumptions about redundancy in hybrid architectures and suggests these models operate as specialized modules rather than integrated systems, with immediate implications for architecture optimization and interpretability.
Authors: Iman Rahmani, Saman Yazdannik, Morteza Tayefi, Jafar Roshanian
Abstract: The performance of deep reinforcement learning agents is fundamentally constrained by their neural network architecture, a choice traditionally made through expensive hyperparameter searches and then fixed throughout training. This work investigates whether online, adaptive architecture optimization can escape this constraint and outperform static designs. We introduce NAS-DQN, an agent that integrates a learned neural architecture search controller directly into the DRL training loop, enabling dynamic network reconfiguration based on cumulative performance feedback. We evaluate NAS-DQN against three fixed-architecture baselines and a random search control on a continuous control task, conducting experiments over multiple random seeds. Our results demonstrate that NAS-DQN achieves superior final performance, sample efficiency, and policy stability while incurring negligible computational overhead. Critically, the learned search strategy substantially outperforms both undirected random architecture exploration and poorly-chosen fixed designs, indicating that intelligent, performance-guided search is the key mechanism driving success. These findings establish that architecture adaptation is not merely beneficial but necessary for optimal sample efficiency in online deep reinforcement learning, and suggest that the design of RL agents need not be a static offline choice but can instead be seamlessly integrated as a dynamic component of the learning process itself.
Authors: Junfeng Gong, Zhiyi Wei, Junying Chen, Cheng Liu, Huawei Li
Abstract: Despite significant evolution of CUDA programming and domain-specific libraries, effectively utilizing GPUs with massively parallel engines remains difficult. Large language models (LLMs) show strong potential in generating optimized CUDA code from sequential code. However, using LLMs in practice faces two major challenges: cloud-based APIs pose risks of code leakage, and local deployment is often computationally expensive and inefficient. These drawbacks have spurred interest in small language models (SLMs), which are more lightweight and privacy-friendly. Encouragingly, recent studies show that SLMs can achieve performance comparable to LLMs on specific tasks. While SLMs can match LLMs on domain-specific tasks, their limited reasoning abilities lead to suboptimal performance in complex CUDA generation according to our experiments. To bridge this gap, we propose ReGraphT, a training-free, retrieval-augmented generation framework that transfers LLM-level reasoning to smaller models. ReGraphT organizes CUDA optimization trajectories into a structured reasoning graph, modeling the combined CUDA optimizations as state transitions, and leverages Monte Carlo Graph Search (MCGS) for efficient exploration. We also present a CUDA-specific benchmark with difficulty tiers defined by reasoning complexity to evaluate models more comprehensively. Experiments show that ReGraphT outperforms HPC-specific fine-tuned models and other retrieval-augmented approaches, achieving an average 2.33X speedup on CUDAEval and ParEval. When paired with DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct, ReGraphT enables SLMs to approach LLM-level performance without the associated privacy risks or excessive computing overhead.
Authors: Mostafa Ameli, Van Anh Le, Sulthana Shams, Alexander Skabardonis
Abstract: The traffic assignment problem is essential for traffic flow analysis, traditionally solved using mathematical programs under the Equilibrium principle. These methods become computationally prohibitive for large-scale networks due to non-linear growth in complexity with the number of OD pairs. This study introduces a novel data-driven approach using deep neural networks, specifically leveraging the Transformer architecture, to predict equilibrium path flows directly. By focusing on path-level traffic distribution, the proposed model captures intricate correlations between OD pairs, offering a more detailed and flexible analysis compared to traditional link-level approaches. The Transformer-based model drastically reduces computation time, while adapting to changes in demand and network structure without the need for recalculation. Numerical experiments are conducted on the Manhattan-like synthetic network, the Sioux Falls network, and the Eastern-Massachusetts network. The results demonstrate that the proposed model is orders of magnitude faster than conventional optimization. It efficiently estimates path-level traffic flows in multi-class networks, reducing computational costs and improving prediction accuracy by capturing detailed trip and flow information. The model also adapts flexibly to varying demand and network conditions, supporting traffic management and enabling rapid `what-if' analyses for enhanced transportation planning and policy-making.
Authors: Shiqi Dai, Wei Dai, Jiaee Cheong, Paul Pu Liang
Abstract: Medical artificial intelligence systems have achieved remarkable diagnostic capabilities, yet they consistently exhibit performance disparities across demographic groups, causing real-world harm to underrepresented populations. While recent multimodal reasoning foundation models have advanced clinical diagnosis through integrated analysis of diverse medical data, reasoning trainings via reinforcement learning inherit and often amplify biases present in training datasets dominated by majority populations. We introduce Fairness-aware Group Relative Policy Optimization (FairGRPO), a hierarchical reinforcement learning approach that promotes equitable learning across heterogeneous clinical populations. FairGRPO employs adaptive importance weighting of advantages based on representation, task difficulty, and data source. To address the common issue of missing demographic labels in the clinical domain, we further employ unsupervised clustering, which automatically discovers latent demographic groups when labels are unavailable. Through comprehensive experiments across 7 clinical diagnostic datasets spanning 5 clinical modalities across X-ray, CT scan, dermoscropy, mammography and ultrasound, we demonstrate that FairGRPO reduces predictive parity by 27.2% against all vanilla and bias mitigated RL baselines, while improving F1 score by 12.49%. Furthermore, training dynamics analysis reveals that FairGRPO progressively improves fairness throughout optimization, while baseline RL methods exhibit deteriorating fairness as training progresses. Based on FairGRPO, we release FairMedGemma-4B, a fairness-aware clinical VLLM that achieves state-of-the-art performance while demonstrating significantly reduced disparities across demographic groups.
Authors: Filipe Ferreira de Oliveira, Matheus Becali Rocha, Renato A. Krohling
Abstract: In this paper, we propose an approach to support the diagnosis of urinary tract diseases, with a focus on bladder cancer, using SHAP (SHapley Additive exPlanations)-based feature selection to enhance the transparency and effectiveness of predictive models. Six binary classification scenarios were developed to distinguish bladder cancer from other urological and oncological conditions. The algorithms XGBoost, LightGBM, and CatBoost were employed, with hyperparameter optimization performed using Optuna and class balancing with the SMOTE technique. The selection of predictive variables was guided by importance values through SHAP-based feature selection while maintaining or even improving performance metrics such as balanced accuracy, precision, and specificity. The use of explainability techniques (SHAP) for feature selection proved to be an effective approach. The proposed methodology may contribute to the development of more transparent, reliable, and efficient clinical decision support systems, optimizing screening and early diagnosis of urinary tract diseases.
Authors: Trajan Murphy, Akshunna S. Dogra, Hanfeng Gu, Caleb Meredith, Mark Kon, Julio Enrique Castrillion-Candas
Abstract: ''Noisy'' datasets (regimes with low signal to noise ratios, small sample sizes, faulty data collection, etc) remain a key research frontier for classification methods with both theoretical and practical implications. We introduce FINDER, a rigorous framework for analyzing generic classification problems, with tailored algorithms for noisy datasets. FINDER incorporates fundamental stochastic analysis ideas into the feature learning and inference stages to optimally account for the randomness inherent to all empirical datasets. We construct ''stochastic features'' by first viewing empirical datasets as realizations from an underlying random field (without assumptions on its exact distribution) and then mapping them to appropriate Hilbert spaces. The Kosambi-Karhunen-Lo\'eve expansion (KLE) breaks these stochastic features into computable irreducible components, which allow classification over noisy datasets via an eigen-decomposition: data from different classes resides in distinct regions, identified by analyzing the spectrum of the associated operators. We validate FINDER on several challenging, data-deficient scientific domains, producing state of the art breakthroughs in: (i) Alzheimer's Disease stage classification, (ii) Remote sensing detection of deforestation. We end with a discussion on when FINDER is expected to outperform existing methods, its failure modes, and other limitations.
Authors: Egor Shulgin, Sultan AlRashed, Francesco Orabona, Peter Richt\'arik
Abstract: The Muon optimizer has rapidly emerged as a powerful, geometry-aware alternative to AdamW, demonstrating strong performance in large-scale training of neural networks. However, a critical theory-practice disconnect exists: Muon's efficiency relies on fast, approximate orthogonalization, yet all prior theoretical work analyzes an idealized, computationally intractable version assuming exact SVD-based updates. This work moves beyond the ideal by providing the first analysis of the inexact orthogonalized update at Muon's core. We develop our analysis within the general framework of Linear Minimization Oracle (LMO)-based optimization, introducing a realistic additive error model to capture the inexactness of practical approximation schemes. Our analysis yields explicit bounds that quantify performance degradation as a function of the LMO inexactness/error. We reveal a fundamental coupling between this inexactness and the optimal step size and momentum: lower oracle precision requires a smaller step size but larger momentum parameter. These findings elevate the approximation procedure (e.g., the number of Newton-Schulz steps) from an implementation detail to a critical parameter that must be co-tuned with the learning schedule. NanoGPT experiments directly confirm the predicted coupling, with optimal learning rates clearly shifting as approximation precision changes.
Authors: Xiang Li, Buxin Su, Chendi Wang, Qi Long, Weijie J. Su
Abstract: Differentially private (DP) decentralized Federated Learning (FL) allows local users to collaborate without sharing their data with a central server. However, accurately quantifying the privacy budget of private FL algorithms is challenging due to the co-existence of complex algorithmic components such as decentralized communication and local updates. This paper addresses privacy accounting for two decentralized FL algorithms within the $f$-differential privacy ($f$-DP) framework. We develop two new $f$-DP-based accounting methods tailored to decentralized settings: Pairwise Network $f$-DP (PN-$f$-DP), which quantifies privacy leakage between user pairs under random-walk communication, and Secret-based $f$-Local DP (Sec-$f$-LDP), which supports structured noise injection via shared secrets. By combining tools from $f$-DP theory and Markov chain concentration, our accounting framework captures privacy amplification arising from sparse communication, local iterations, and correlated noise. Experiments on synthetic and real datasets demonstrate that our methods yield consistently tighter $(\epsilon,\delta)$ bounds and improved utility compared to R\'enyi DP-based approaches, illustrating the benefits of $f$-DP in decentralized privacy accounting.
Authors: Matan Tsipory, Ran Levinstein, Itay Evron, Mark Kong, Deanna Needell, Daniel Soudry
Abstract: We analyze task orderings in continual learning for linear regression, assuming joint realizability of training data. We focus on orderings that greedily maximize dissimilarity between consecutive tasks, a concept briefly explored in prior work but still surrounded by open questions. Using tools from the Kaczmarz method literature, we formalize such orderings and develop geometric and algebraic intuitions around them. Empirically, we demonstrate that greedy orderings converge faster than random ones in terms of the average loss across tasks, both for linear regression with random data and for linear probing on CIFAR-100 classification tasks. Analytically, in a high-rank regression setting, we prove a loss bound for greedy orderings analogous to that of random ones. However, under general rank, we establish a repetition-dependent separation. Specifically, while prior work showed that for random orderings, with or without replacement, the average loss after $k$ iterations is bounded by $\mathcal{O}(1/\sqrt{k})$, we prove that single-pass greedy orderings may fail catastrophically, whereas those allowing repetition converge at rate $\mathcal{O}(1/\sqrt[3]{k})$. Overall, we reveal nuances within and between greedy and random orderings.
Authors: Shaocong Ma, Heng Huang
Abstract: In financial applications, reinforcement learning (RL) agents are commonly trained on historical data, where their actions do not influence prices. However, during deployment, these agents trade in live markets where their own transactions can shift asset prices, a phenomenon known as market impact. This mismatch between training and deployment environments can significantly degrade performance. Traditional robust RL approaches address this model misspecification by optimizing the worst-case performance over a set of uncertainties, but typically rely on symmetric structures that fail to capture the directional nature of market impact. To address this issue, we develop a novel class of elliptic uncertainty sets. We establish both implicit and explicit closed-form solutions for the worst-case uncertainty under these sets, enabling efficient and tractable robust policy evaluation. Experiments on single-asset and multi-asset trading tasks demonstrate that our method achieves superior Sharpe ratio and remains robust under increasing trade volumes, offering a more faithful and scalable approach to RL in financial markets.
Authors: Shaocong Ma, Heng Huang
Abstract: Zeroth-order optimization (ZOO) is an important framework for stochastic optimization when gradients are unavailable or expensive to compute. A potential limitation of existing ZOO methods is the bias inherent in most gradient estimators unless the perturbation stepsize vanishes. In this paper, we overcome this biasedness issue by proposing a novel family of unbiased gradient estimators based solely on function evaluations. By reformulating directional derivatives as a telescoping series and sampling from carefully designed distributions, we construct estimators that eliminate bias while maintaining favorable variance. We analyze their theoretical properties, derive optimal scaling distributions and perturbation stepsizes of four specific constructions, and prove that SGD using the proposed estimators achieves optimal complexity for smooth non-convex objectives. Experiments on synthetic tasks and language model fine-tuning confirm the superior accuracy and convergence of our approach compared to standard methods.
Authors: Shaocong Ma, Heng Huang
Abstract: In this paper, we explore the two-point zeroth-order gradient estimator and identify the distribution of random perturbations that minimizes the estimator's asymptotic variance as the perturbation stepsize tends to zero. We formulate it as a constrained functional optimization problem over the space of perturbation distributions. Our findings reveal that such desired perturbations can align directionally with the true gradient, instead of maintaining a fixed length. While existing research has largely focused on fixed-length perturbations, the potential advantages of directional alignment have been overlooked. To address this gap, we delve into the theoretical and empirical properties of the directionally aligned perturbation (DAP) scheme, which adaptively offers higher accuracy along critical directions. Additionally, we provide a convergence analysis for stochastic gradient descent using $\delta$-unbiased random perturbations, extending existing complexity bounds to a wider range of perturbations. Through empirical evaluations on both synthetic problems and practical tasks, we demonstrate that DAPs outperform traditional methods under specific conditions.
Authors: Hanbin Hong, Ashish Kundu, Ali Payani, Binghui Wang, Yuan Hong
Abstract: Randomized smoothing has become essential for achieving certified adversarial robustness in machine learning models. However, current methods primarily use isotropic noise distributions that are uniform across all data dimensions, such as image pixels, limiting the effectiveness of robustness certification by ignoring the heterogeneity of inputs and data dimensions. To address this limitation, we propose UCAN: a novel technique that \underline{U}niversally \underline{C}ertifies adversarial robustness with \underline{A}nisotropic \underline{N}oise. UCAN is designed to enhance any existing randomized smoothing method, transforming it from symmetric (isotropic) to asymmetric (anisotropic) noise distributions, thereby offering a more tailored defense against adversarial attacks. Our theoretical framework is versatile, supporting a wide array of noise distributions for certified robustness in different $\ell_p$-norms and applicable to any arbitrary classifier by guaranteeing the classifier's prediction over perturbed inputs with provable robustness bounds through tailored noise injection. Additionally, we develop a novel framework equipped with three exemplary noise parameter generators (NPGs) to optimally fine-tune the anisotropic noise parameters for different data dimensions, allowing for pursuing different levels of robustness enhancements in practice.Empirical evaluations underscore the significant leap in UCAN's performance over existing state-of-the-art methods, demonstrating up to $182.6\%$ improvement in certified accuracy at large certified radii on MNIST, CIFAR10, and ImageNet datasets.\footnote{Code is anonymously available at \href{https://github.com/youbin2014/UCAN/}{https://github.com/youbin2014/UCAN/}}
URLs: https://github.com/youbin2014/UCAN/, https://github.com/youbin2014/UCAN/
Authors: Renzhao Liang, Sizhe Xu, Chenggang Xie, Jingru Chen, Feiyang Ren, Shu Yang, Takahiro Yabe
Abstract: Time series forecasting plays a pivotal role in critical domains such as energy management and financial markets. Although deep learning-based approaches (e.g., MLP, RNN, Transformer) have achieved remarkable progress, the prevailing "long-sequence information gain hypothesis" exhibits inherent limitations. Through systematic experimentation, this study reveals a counterintuitive phenomenon: appropriately truncating historical data can paradoxically enhance prediction accuracy, indicating that existing models learn substantial redundant features (e.g., noise or irrelevant fluctuations) during training, thereby compromising effective signal extraction. Building upon information bottleneck theory, we propose an innovative solution termed Adaptive Masking Loss with Representation Consistency (AMRC), which features two core components: 1) Dynamic masking loss, which adaptively identified highly discriminative temporal segments to guide gradient descent during model training; 2) Representation consistency constraint, which stabilized the mapping relationships among inputs, labels, and predictions. Experimental results demonstrate that AMRC effectively suppresses redundant feature learning while significantly improving model performance. This work not only challenges conventional assumptions in temporal modeling but also provides novel theoretical insights and methodological breakthroughs for developing efficient and robust forecasting models.
Authors: Zachary Horvitz, Raghav Singhal, Hao Zou, Carles Domingo-Enrich, Zhou Yu, Rajesh Ranganath, Kathleen McKeown
Abstract: Masked diffusion language models (MDLMs) are trained to in-fill positions in randomly masked sequences, in contrast to next-token prediction models. Discussions around MDLMs focus on two benefits: (1) any-order decoding and 2) multi-token decoding. However, we observe that for math and coding tasks, any-order algorithms often underperform or behave similarly to left-to-right sampling, and standard multi-token decoding significantly degrades performance. At inference time, MDLMs compute the conditional distribution of all masked positions. A natural question is: How can we justify this additional compute when left-to-right one-token-at-a-time decoding is on par with any-order decoding algorithms? First, we propose reasoning-as-infilling. By using MDLMs to infill a reasoning template, we can structure outputs and distinguish between reasoning and answer tokens. In turn, this enables measuring answer uncertainty during reasoning, and early exits when the model converges on an answer. Next, given an answer, reasoning-as-infilling enables sampling from the MDLM posterior over reasoning traces conditioned on the answer, providing a new source of high-quality data for post-training. On GSM8k, we observe that fine-tuning LLaDA-8B Base on its posterior reasoning traces provides a performance boost on par with fine-tuning on human-written reasoning traces. Additionally, given an answer, reasoning-as-infilling provides a method for scoring the correctness of the reasoning process at intermediate steps. Second, we propose multi-token entropy decoding (MED), a simple adaptive sampler that minimizes the error incurred by decoding positions in parallel based on the conditional entropies of those positions. MED preserves performance across benchmarks and leads to 2.7x fewer steps. Our work demonstrates that the training and compute used by MDLMs unlock many new inference and post-training methods.
Authors: Curtis Lee Shull, Merrick Green
Abstract: Radio Frequency Identification (RFID) tracking may be a viable solution for defense assets that must be stored in accordance with security guidelines. However, poor sensor specificity (vulnerabilities include long range detection, spoofing, and counterfeiting) can lead to erroneous detection and operational security events. We present a supervised learning simulation with realistic Received Signal Strength Indicator (RSSI) data and Decision Tree classification in a Computer Assisted Design (CAD)-modeled floor plan that encapsulates some of the challenges encountered in defense storage. In this work, we focused on classifying 12 lab zones (LabZoneA-L) to perform location inference. The raw dataset had approximately 980,000 reads. Class frequencies were imbalanced, and class weights were calculated to account for class imbalance in this multi-class setting. The model, trained on stratified subsamples to 5,000 balanced observations, yielded an overall accuracy of 34.2% and F1-scores greater than 0.40 for multiple zones (Zones F, G, H, etc.). However, rare classes (most notably LabZoneC) were often misclassified, even with the use of class weights. An adjacency-aware confusion matrix was calculated to allow better interpretation of physically adjacent zones. These results suggest that RSSI-based decision trees can be applied in realistic simulations to enable zone-level anomaly detection or misplacement monitoring for defense supply logistics. Reliable classification performance in low-coverage and low-signal zones could be improved with better antenna placement or additional sensors and sensor fusion with other modalities.
Authors: Jiazheng Li, Yawei Wang, David Yan, Yijun Tian, Zhichao Xu, Huan Song, Panpan Xu, Lin Lee Cheong
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, enabling language agents to excel at single-turn tasks. However, their application to complex, multi-step, and long-horizon tasks remains challenging. While reinforcement learning (RL) offers a promising avenue for addressing these challenges, mainstream approaches typically rely solely on sparse, outcome-based rewards, a limitation that becomes especially problematic for group-based RL algorithms lacking critic models, such as Group Relative Policy Optimization (GRPO). In such methods, uniformly rewarding or penalizing all actions within a trajectory can lead to training instability and suboptimal policies, because beneficial and detrimental actions are often entangled across multi-step interactions. To address this challenge, we propose SALT, a novel and lightweight framework that provides a finer-grained advantage assignment, derived solely from outcome rewards. We achieve this by constructing a graph from trajectories of the same prompt, which allows us to quantify the quality of each step and assign advantages accordingly. Crucially, SALT is designed as a plug-and-play module that seamlessly integrates with existing group-based RL algorithms, requiring no modifications to the rollout procedure and introducing negligible computational overhead. Extensive experiments on the WebShop, ALFWorld, and AppWorld benchmarks with various model sizes demonstrate that SALT consistently improves performance. We also conduct a thorough analysis to validate the design choices behind SALT and offer actionable insights.
Authors: Vahid Jalili
Abstract: Since its 2009 genesis block, the Bitcoin network has processed \num{>1.08} billion (B) transactions representing \num{>8.72}B BTC, offering rich potential for machine learning (ML); yet, its pseudonymity and obscured flow of funds inherent in its \utxo-based design, have rendered this data largely inaccessible for ML research. Addressing this gap, we present an ML-compatible graph modeling the Bitcoin's economic topology by reconstructing the flow of funds. This temporal, heterogeneous graph encompasses complete transaction history up to block \cutoffHeight, consisting of \num{>2.4}B nodes and \num{>39.72}B edges. Additionally, we provide custom sampling methods yielding node and edge feature vectors of sampled communities, tools to load and analyze the Bitcoin graph data within specialized graph databases, and ready-to-use database snapshots. This comprehensive dataset and toolkit empower the ML community to tackle Bitcoin's intricate ecosystem at scale, driving progress in applications such as anomaly detection, address classification, market analysis, and large-scale graph ML benchmarking. Dataset and code available at \href{https://github.com/B1AAB/EBA}{github.com/b1aab/eba}
Authors: Marin Bilo\v{s}, Anderson Schneider, Yuriy Nevmyvaka
Abstract: Temporal point processes are powerful generative models for event sequences that capture complex dependencies in time-series data. They are commonly specified using autoregressive models that learn the distribution of the next event from the previous events. This makes sampling inherently sequential, limiting efficiency. In this paper, we propose a novel algorithm based on rejection sampling that enables exact sampling of multiple future values from existing TPP models, in parallel, and without requiring any architectural changes or retraining. Besides theoretical guarantees, our method demonstrates empirical speedups on real-world datasets, bridging the gap between expressive modeling and efficient parallel generation for large-scale TPP applications.
Authors: Yuwei Cheng, Zifeng Zhao, Haifeng Xu
Abstract: Online advertising platforms use automated auctions to connect advertisers with potential customers, requiring effective bidding strategies to maximize profits. Accurate ad impact estimation requires considering three key factors: delayed and long-term effects, cumulative ad impacts such as reinforcement or fatigue, and customer heterogeneity. However, these effects are often not jointly addressed in previous studies. To capture these factors, we model ad bidding as a Contextual Markov Decision Process (CMDP) with delayed Poisson rewards. For efficient estimation, we propose a two-stage maximum likelihood estimator combined with data-splitting strategies, ensuring controlled estimation error based on the first-stage estimator's (in)accuracy. Building on this, we design a reinforcement learning algorithm to derive efficient personalized bidding strategies. This approach achieves a near-optimal regret bound of $\tilde{O}{(dH^2\sqrt{T})}$, where $d$ is the contextual dimension, $H$ is the number of rounds, and $T$ is the number of customers. Our theoretical findings are validated by simulation experiments.
Authors: Hongyi Liu, Jiaji Huang, Zhen Jia, Youngsuk Park, Yu-Xiang Wang
Abstract: Speculative decoding is widely used in accelerating large language model (LLM) inference. In this work, we focus on the online draft model selection problem in speculative decoding. We design an algorithm that provably competes with the best draft model in hindsight for each query in terms of either the token acceptance probability or expected acceptance length. In particular, we show that we can accurately evaluate all draft models, instead of only the chosen model without incurring additional queries to the target model, which allows us to improve exponentially over the existing bandit-based approach as the number of draft models increases. Our approach is generically applicable with any speculative decoding methods (single draft, multi-drafts and draft-trees). Moreover, we design system-efficient versions of online learners and demonstrate that the overhead in computation and latency can be substantially reduced. We conduct extensive experiments on open-source LLMs and diverse datasets, demonstrating that our methods substantially outperform the state-of-the-art EAGLE3 and the BanditSpec baseline in a variety of domains where specialized domain-expert drafters are available, especially when long reasoning chains are required.
Authors: Yimeng Qiu, Feihuang Fang
Abstract: We study whether liquidity and volatility proxies of a core set of cryptoassets generate spillovers that forecast market-wide risk. Our empirical framework integrates three statistical layers: (A) interactions between core liquidity and returns, (B) principal-component relations linking liquidity and returns, and (C) volatility-factor projections that capture cross-sectional volatility crowding. The analysis is complemented by vector autoregression impulse responses and forecast error variance decompositions (see Granger 1969; Sims 1980), heterogeneous autoregressive models with exogenous regressors (HAR-X, Corsi 2009), and a leakage-safe machine learning protocol using temporal splits, early stopping, validation-only thresholding, and SHAP-based interpretation. Using daily data from 2021 to 2025 (1462 observations across 74 assets), we document statistically significant Granger-causal relationships across layers and moderate out-of-sample predictive accuracy. We report the most informative figures, including the pipeline overview, Layer A heatmap, Layer C robustness analysis, vector autoregression variance decompositions, and the test-set precision-recall curve. Full data and figure outputs are provided in the artifact repository.
Authors: Ram Dyuthi Sristi, Sowmya Manojna Narasimha, Jingya Huang, Alice Despatin, Simon Musall, Vikash Gilja, Gal Mishne
Abstract: Simultaneous recordings from thousands of neurons across multiple brain areas reveal rich mixtures of activity that are shared between regions and dynamics that are unique to each region. Existing alignment or multi-view methods neglect temporal structure, whereas dynamical latent variable models capture temporal dependencies but are usually restricted to a single area, assume linear read-outs, or conflate shared and private signals. We introduce the Coupled Transformer Autoencoder (CTAE) - a sequence model that addresses both (i) non-stationary, non-linear dynamics and (ii) separation of shared versus region-specific structure in a single framework. CTAE employs transformer encoders and decoders to capture long-range neural dynamics and explicitly partitions each region's latent space into orthogonal shared and private subspaces. We demonstrate the effectiveness of CTAE on two high-density electrophysiology datasets with simultaneous recordings from multiple regions, one from motor cortical areas and the other from sensory areas. CTAE extracts meaningful representations that better decode behavioral variables compared to existing approaches.
Authors: Bosong Huang, Ming Jin, Yuxuan Liang, Johan Barthelemy, Debo Cheng, Qingsong Wen, Chenghao Liu, Shirui Pan
Abstract: Explaining time series classification models is crucial, particularly in high-stakes applications such as healthcare and finance, where transparency and trust play a critical role. Although numerous time series classification methods have identified key subsequences, known as shapelets, as core features for achieving state-of-the-art performance and validating their pivotal role in classification outcomes, existing post-hoc time series explanation (PHTSE) methods primarily focus on timestep-level feature attribution. These explanation methods overlook the fundamental prior that classification outcomes are predominantly driven by key shapelets. To bridge this gap, we present ShapeX, an innovative framework that segments time series into meaningful shapelet-driven segments and employs Shapley values to assess their saliency. At the core of ShapeX lies the Shapelet Describe-and-Detect (SDD) framework, which effectively learns a diverse set of shapelets essential for classification. We further demonstrate that ShapeX produces explanations which reveal causal relationships instead of just correlations, owing to the atomicity properties of shapelets. Experimental results on both synthetic and real-world datasets demonstrate that ShapeX outperforms existing methods in identifying the most relevant subsequences, enhancing both the precision and causal fidelity of time series explanations.
Authors: Chang Yang, Ziyi Wang, Wangfeng Tan, Zhiting Tan, Changrui Ji, Zhiming Zhou
Abstract: Social media platforms have become important sources for identifying suicide risk, but automated detection systems face multiple challenges including severe class imbalance, temporal complexity in posting patterns, and the dual nature of risk levels as both ordinal and categorical. This paper proposes a hierarchical dual-head neural network based on MentalRoBERTa for suicide risk classification into four levels: indicator, ideation, behavior, and attempt. The model employs two complementary prediction heads operating on a shared sequence representation: a CORAL (Consistent Rank Logits) head that preserves ordinal relationships between risk levels, and a standard classification head that enables flexible categorical distinctions. A 3-layer Transformer encoder with 8-head multi-head attention models temporal dependencies across post sequences, while explicit time interval embeddings capture posting behavior dynamics. The model is trained with a combined loss function (0.5 CORAL + 0.3 Cross-Entropy + 0.2 Focal Loss) that simultaneously addresses ordinal structure preservation, overconfidence reduction, and class imbalance. To improve computational efficiency, we freeze the first 6 layers (50%) of MentalRoBERTa and employ mixed-precision training. The model is evaluated using 5-fold stratified cross-validation with macro F1 score as the primary metric.
Authors: Amartya Roy, Souvik Chakraborty
Abstract: Causal discovery remains a central challenge in machine learning, yet existing methods face a fundamental gap: algorithms like GES and GraN-DAG achieve strong empirical performance but lack finite-sample guarantees, while theoretically principled approaches fail to scale. We close this gap by introducing a game-theoretic reinforcement learning framework for causal discovery, where a DDQN agent directly competes against a strong baseline (GES or GraN-DAG), always warm-starting from the opponent's solution. This design yields three provable guarantees: the learned graph is never worse than the opponent, warm-starting strictly accelerates convergence, and most importantly, with high probability the algorithm selects the true best candidate graph. To the best of our knowledge, our result makes a first-of-its-kind progress in explaining such finite-sample guarantees in causal discovery: on synthetic SEMs (30 nodes), the observed error probability decays with n, tightly matching theory. On real-world benchmarks including Sachs, Asia, Alarm, Child, Hepar2, Dream, and Andes, our method consistently improves upon GES and GraN-DAG while remaining theoretically safe. Remarkably, it scales to large graphs such as Hepar2 (70 nodes), Dream (100 nodes), and Andes (220 nodes). Together, these results establish a new class of RL-based causal discovery algorithms that are simultaneously provably consistent, sample-efficient, and practically scalable, marking a decisive step toward unifying empirical performance with rigorous finite-sample theory.
Authors: Ayatullah Faruk Mollah
Abstract: Studies on various facets of pattern classification is often imperative while working with multi-dimensional samples pertaining to diverse application scenarios. In this notion, weighted dimension-based distance measure has been one of the vital considerations in pattern analysis as it reflects the degree of similarity between samples. Though it is often presumed to be settled with the pervasive use of Euclidean distance, plethora of issues often surface. In this paper, we present (a) a detail analysis on the impact of distance measure norms and weights of dimensions along with visualization, (b) a novel weighting scheme for each dimension, (c) incorporation of this dimensional weighting schema into a KNN classifier, and (d) pattern classification on a variety of synthetic as well as realistic datasets with the developed model. It has performed well across diverse experiments in comparison to the traditional KNN under the same experimental setups. Specifically, for gene expression datasets, it yields significant and consistent gain in classification accuracy (around 10%) in all cross-validation experiments with different values of k. As such datasets contain limited number of samples of high dimensions, meaningful selection of nearest neighbours is desirable, and this requirement is reasonably met by regulating the shape and size of the region enclosing the k number of reference samples with the developed weighting schema and appropriate norm. It, therefore, stands as an important generalization of KNN classifier powered by weighted Minkowski distance with the present weighting schema.
Authors: Gabriel Y. Arteaga, Marius Aasan, Rwiddhi Chakraborty, Martine Hjelkrem-Tan, Thalles Silva, Michael Kampffmeyer, Ad\'in Ram\'irez Rivera
Abstract: Prototypical self-supervised learning methods consistently suffer from partial prototype collapse, where multiple prototypes converge to nearly identical representations. This undermines their central purpose -- providing diverse and informative targets to guide encoders toward rich representations -- and has led practitioners to over-parameterize prototype sets or add ad-hoc regularizers, which mitigate symptoms rather than address the root cause. We empirically trace the collapse to the joint optimization of encoders and prototypes, which encourages a type of shortcut learning: early in training prototypes drift toward redundant representations that minimize loss without necessarily enhancing representation diversity. To break the joint optimization, we introduce a fully decoupled training strategy that learns prototypes and encoders under separate objectives. Concretely, we model prototypes as a Gaussian mixture updated with an online EM-style procedure, independent of the encoder's loss. This simple yet principled decoupling eliminates prototype collapse without explicit regularization and yields consistently diverse prototypes and stronger downstream performance.
Authors: Arian Prabowo, Flora D. Salim
Abstract: Timeseries foundation models (TSFMs) have multiplied, yet lightweight supervised baselines and even classical models often match them. We argue this gap stems from the naive importation of NLP or CV pipelines. In language and vision, large web-scale corpora densely capture human concepts i.e. there are countless images and text of apples. In contrast, timeseries data is built to complement the image and text modalities. There are no timeseries dataset that contains the concept apple. As a result, the scrape-everything-online paradigm fails for TS. We posit that progress demands a shift from opportunistic aggregation to principled design: constructing datasets that systematically span the space of invariance that preserve temporal semantics. To this end, we suggest that the ontology of timeseries invariances should be built based on first principles. Only by ensuring representational completeness through invariance coverage can TSFMs achieve the aligned structure necessary for generalisation, reasoning, and truly emergent behaviour.
Authors: Tingting Dan, Xinwei Huang, Jiaqi Ding, Yinggang Zheng, Guorong Wu
Abstract: Emerging neuroimaging evidence shows that pathological tau proteins build up along specific brain networks, suggesting that large-scale network architecture plays a key role in the progression of Alzheimer's disease (AD). However, how structural connectivity (SC) and functional connectivity (FC) interact to influence tau propagation remains unclear. Leveraging an unprecedented volume of longitudinal neuroimaging data, we examine SC-FC interactions through a multi-layer graph diffusion model. Beyond showing that connectome architecture constrains tau spread, our model reveals a regionally asymmetric contribution of SC and FC. Specifically, FC predominantly drives tau spread in subcortical areas, the insula, frontal and temporal cortices, whereas SC plays a larger role in occipital, parietal, and limbic regions. The relative dominance of SC versus FC shifts over the course of disease, with FC generally prevailing in early AD and SC becoming primary in later stages. Spatial patterns of SC- and FC-dominant regions strongly align with the regional expression of AD-associated genes involved in inflammation, apoptosis, and lysosomal function, including CHUK (IKK-alpha), TMEM106B, MCL1, NOTCH1, and TH. In parallel, other non-modifiable risk factors (e.g., APOE genotype, sex) and biological mechanisms (e.g., amyloid deposition) selectively reshape tau propagation by shifting dominant routes between anatomical and functional pathways in a region-specific manner. Findings are validated in an independent AD cohort.
Authors: Xiaoming Wu, Teng Liu, Xin Wang, Ming Yang, Jiguo Yu
Abstract: Differential privacy is widely employed in decentralized learning to safeguard sensitive data by introducing noise into model updates. However, existing approaches that use fixed-variance noise often degrade model performance and reduce training efficiency. To address these limitations, we propose a novel approach called decentralized learning with adaptive differential privacy via variance-reduced stochastic gradient push (ADP-VRSGP). This method dynamically adjusts both the noise variance and the learning rate using a stepwise-decaying schedule, which accelerates training and enhances final model performance while providing node-level personalized privacy guarantees. To counteract the slowed convergence caused by large-variance noise in early iterations, we introduce a progressive gradient fusion strategy that leverages historical gradients. Furthermore, ADP-VRSGP incorporates decentralized push-sum and aggregation techniques, making it particularly suitable for time-varying communication topologies. Through rigorous theoretical analysis, we demonstrate that ADP-VRSGP achieves robust convergence with an appropriate learning rate, significantly improving training stability and speed. Experimental results validate that our method outperforms existing baselines across multiple scenarios, highlighting its efficacy in addressing the challenges of privacy-preserving decentralized learning.
Authors: Tongkai Lu, Shuai Ma, Chongyang Tao
Abstract: Traveling Salesman Problem (TSP) is a classic NP-hard problem that has garnered significant attention from both academia and industry. While neural-based methods have shown promise for solving TSPs, they still face challenges in scaling to larger instances, particularly in memory constraints associated with global heatmaps, edge weights, or access matrices, as well as in generating high-quality initial solutions and insufficient global guidance for efficiently navigating vast search spaces. To address these challenges, we propose a Hyper Tour Guided Neighborhood Search (HyperNS) method for large-scale TSP instances. Inspired by the ``clustering first, route second" strategy, our approach initially divides the TSP instance into clusters using a sparse heatmap graph and abstracts them as supernodes, followed by the generation of a hyper tour to guide both the initialization and optimization processes. This method reduces the search space by focusing on edges relevant to the hyper tour, leading to more efficient and effective optimization. Experimental results on both synthetic and real-world datasets demonstrate that our approach outperforms existing neural-based methods, particularly in handling larger-scale instances, offering a significant reduction in the gap to the optimal solution.
Authors: Dian Yu, Yulai Zhao, Kishan Panaganti, Linfeng Song, Haitao Mi, Dong Yu
Abstract: We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains models in objective domains using binary correctness rewards, it overlooks that not all tasks are equally significant. RLEV extends this framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. Crucially, RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones. We demonstrate this behavior stems from value-weighted gradient amplification on end-of-sequence tokens. Ablation studies confirm the gain is causally linked to value alignment. RLEV remains robust under noisy value signals, such as difficulty-based labels, demonstrating that optimizing for an explicit utility function offers a practical path to aligning LLMs with human priorities.
Authors: Jane H. Lee, Baturay Saglam, Spyridon Pougkakiotis, Amin Karbasi, Dionysis Kalogerias
Abstract: Constrained optimization provides a common framework for dealing with conflicting objectives in reinforcement learning (RL). In most of these settings, the objectives (and constraints) are expressed though the expected accumulated reward. However, this formulation neglects risky or even possibly catastrophic events at the tails of the reward distribution, and is often insufficient for high-stakes applications in which the risk involved in outliers is critical. In this work, we propose a framework for risk-aware constrained RL, which exhibits per-stage robustness properties jointly in reward values and time using optimized certainty equivalents (OCEs). Our framework ensures an exact equivalent to the original constrained problem within a parameterized strong Lagrangian duality framework under appropriate constraint qualifications, and yields a simple algorithmic recipe which can be wrapped around standard RL solvers, such as PPO. Lastly, we establish the convergence of the proposed algorithm under common assumptions, and verify the risk-aware properties of our approach through several numerical experiments.
Authors: Max Hopkins, Russell Impagliazzo, Christopher Ye
Abstract: Replicability, introduced by (Impagliazzo et al. STOC '22), is the notion that algorithms should remain stable under a resampling of their inputs (given access to shared randomness). While a strong and interesting notion of stability, the cost of replicability can be prohibitive: there is no replicable algorithm, for instance, for tasks as simple as threshold learning (Bun et al. STOC '23). Given such strong impossibility results we ask: under what approximate notions of replicability is learning possible? In this work, we propose three natural relaxations of replicability in the context of PAC learning: (1) Pointwise: the learner must be consistent on any fixed input, but not across all inputs simultaneously, (2) Approximate: the learner must output hypotheses that classify most of the distribution consistently, (3) Semi: the algorithm is fully replicable, but may additionally use shared unlabeled samples. In all three cases, for constant replicability parameters, we obtain sample-optimal agnostic PAC learners: (1) and (2) are achievable for ``free" using $\Theta(d/\alpha^2)$ samples, while (3) requires $\Theta(d^2/\alpha^2)$ labeled samples.
Authors: Shumin Li
Abstract: The development of accessible screening tools for early cancer detection in dogs represents a significant challenge in veterinary medicine. Routine laboratory data offer a promising, low-cost source for such tools, but their utility is hampered by the non-specificity of individual biomarkers and the severe class imbalance inherent in screening populations. This study assesses the feasibility of cancer risk classification using the Golden Retriever Lifetime Study (GRLS) cohort under real-world constraints, including the grouping of diverse cancer types and the inclusion of post-diagnosis samples. A comprehensive benchmark evaluation was conducted, systematically comparing 126 analytical pipelines that comprised various machine learning models, feature selection methods, and data balancing techniques. Data were partitioned at the patient level to prevent leakage. The optimal model, a Logistic Regression classifier with class weighting and recursive feature elimination, demonstrated moderate ranking ability (AUROC = 0.815; 95% CI: 0.793-0.836) but poor clinical classification performance (F1-score = 0.25, Positive Predictive Value = 0.15). While a high Negative Predictive Value (0.98) was achieved, insufficient recall (0.79) precludes its use as a reliable rule-out test. Interpretability analysis with SHapley Additive exPlanations (SHAP) revealed that predictions were driven by non-specific features like age and markers of inflammation and anemia. It is concluded that while a statistically detectable cancer signal exists in routine lab data, it is too weak and confounded for clinically reliable discrimination from normal aging or other inflammatory conditions. This work establishes a critical performance ceiling for this data modality in isolation and underscores that meaningful progress in computational veterinary oncology will require integration of multi-modal data sources.
Authors: Ke Xing, Yanjie Dong, Xiaoyi Fan, Runhao Zeng, Victor C. M. Leung, M. Jamal Deen, Xiping Hu
Abstract: Personalized federated learning (PFL) addresses a critical challenge of collaboratively training customized models for clients with heterogeneous and scarce local data. Conventional federated learning, which relies on a single consensus model, proves inadequate under such data heterogeneity. Its standard aggregation method of weighting client updates heuristically or by data volume, operates under an equal-contribution assumption, failing to account for the actual utility and reliability of each client's update. This often results in suboptimal personalization and aggregation bias. To overcome these limitations, we introduce Contribution-Oriented PFL (CO-PFL), a novel algorithm that dynamically estimates each client's contribution for global aggregation. CO-PFL performs a joint assessment by analyzing both gradient direction discrepancies and prediction deviations, leveraging information from gradient and data subspaces. This dual-subspace analysis provides a principled and discriminative aggregation weight for each client, emphasizing high-quality updates. Furthermore, to bolster personalization adaptability and optimization stability, CO-PFL cohesively integrates a parameter-wise personalization mechanism with mask-aware momentum optimization. Our approach effectively mitigates aggregation bias, strengthens global coordination, and enhances local performance by facilitating the construction of tailored submodels with stable updates. Extensive experiments on four benchmark datasets (CIFAR10, CIFAR10C, CINIC10, and Mini-ImageNet) confirm that CO-PFL consistently surpasses state-of-the-art methods in in personalization accuracy, robustness, scalability and convergence stability.
Authors: Iv\'an Ojeda-Ruiz, Young Ju-Lee, Malcolm Dickens, Leonardo Cambisaca
Abstract: Recent research has focused on mitigating algorithmic bias in clustering by incorporating fairness constraints into algorithmic design. Notions such as disparate impact, community cohesion, and cost per population have been implemented to enforce equitable outcomes. Among these, group fairness (balance) ensures that each protected group is proportionally represented within every cluster. However, incorporating balance as a metric of fairness into spectral clustering algorithms has led to computational times that can be improved. This study aims to enhance the efficiency of spectral clustering algorithms by reformulating the constrained optimization problem using a new formulation derived from the Lagrangian method and the Sherman-Morrison-Woodbury (SMW) identity, resulting in the Fair-SMW algorithm. Fair-SMW employs three alternatives to the Laplacian matrix with different spectral gaps to generate multiple variations of Fair-SMW, achieving clustering solutions with comparable balance to existing algorithms while offering improved runtime performance. We present the results of Fair-SMW, evaluated using the Stochastic Block Model (SBM) to measure both runtime efficiency and balance across real-world network datasets, including LastFM, FacebookNet, Deezer, and German. We achieve an improvement in computation time that is twice as fast as the state-of-the-art, and also flexible enough to achieve twice as much balance.
Authors: Hao Wang, Baojun Ma
Abstract: In real-world time series forecasting tasks, category information plays a pivotal role in capturing inherent data patterns. This paper introduces QKCV (Query-Key-Category-Value) attention, an extension of the traditional QKV framework that incorporates a static categorical embedding C to emphasize category-specific information. As a versatile plug-in module, QKCV enhances the forecasting accuracy of attention-based models (e.g., Vanilla Transformer, Informer, PatchTST, TFT) across diverse real-world datasets. Furthermore, QKCV demonstrates remarkable adaptability in fine-tuning univariate time series foundation model by solely updating the static embedding C while preserving pretrained weights, thereby reducing computational overhead and achieving superior fine-tuning performance.
Authors: Insu Jeon, Minui Hong, Junhyeog Yun, Gunhee Kim
Abstract: Federated Learning (FL) aims to train a global inference model from remotely distributed clients, gaining popularity due to its benefit of improving data privacy. However, traditional FL often faces challenges in practical applications, including model overfitting and divergent local models due to limited and non-IID data among clients. To address these issues, we introduce a novel Bayesian meta-learning approach called meta-variational dropout (MetaVD). MetaVD learns to predict client-dependent dropout rates via a shared hypernetwork, enabling effective model personalization of FL algorithms in limited non-IID data settings. We also emphasize the posterior adaptation view of meta-learning and the posterior aggregation view of Bayesian FL via the conditional dropout posterior. We conducted extensive experiments on various sparse and non-IID FL datasets. MetaVD demonstrated excellent classification accuracy and uncertainty calibration performance, especially for out-of-distribution (OOD) clients. MetaVD compresses the local model parameters needed for each client, mitigating model overfitting and reducing communication costs. Code is available at https://github.com/insujeon/MetaVD.
Authors: Yago del Valle Inclan Redondo, Enrique Arriaga-Varela, Dmitry Lyamzin, Pablo Cervantes, Tiago Ramalho
Abstract: We introduce SpLIIF to generate implicit neural representations and enable arbitrary downscaling of weather variables. We train a model from sparse weather stations and topography over Japan and evaluate in- and out-of-distribution accuracy predicting temperature and wind, comparing it to both an interpolation baseline and CorrDiff. We find the model to be up to 50% better than both CorrDiff and the baseline at downscaling temperature, and around 10-20% better for wind.
Authors: Woohyeon Byeon, Giseung Park, Jongseong Chae, Amir Leshem, Youngchul Sung
Abstract: In this paper, we propose a provably convergent and practical framework for multi-objective reinforcement learning with max-min criterion. From a game-theoretic perspective, we reformulate max-min multi-objective reinforcement learning as a two-player zero-sum regularized continuous game and introduce an efficient algorithm based on mirror descent. Our approach simplifies the policy update while ensuring global last-iterate convergence. We provide a comprehensive theoretical analysis on our algorithm, including iteration complexity under both exact and approximate policy evaluations, as well as sample complexity bounds. To further enhance performance, we modify the proposed algorithm with adaptive regularization. Our experiments demonstrate the convergence behavior of the proposed algorithm in tabular settings, and our implementation for deep reinforcement learning significantly outperforms previous baselines in many MORL environments.
Authors: Teng Jiek See, Daokun Zhang, Mario Boley, David K. Chalmers
Abstract: Graph Neural Networks (GNNs) are the currently most effective methods for predicting molecular properties but there remains a need for more accurate models. GNN accuracy can be improved by increasing the model complexity but this also increases the computational cost and memory requirement during training and inference. In this study, we develop Layer-to-Layer Knowledge Mixing (LKM), a novel self-knowledge distillation method that increases the accuracy of state-of-the-art GNNs while adding negligible computational complexity during training and inference. By minimizing the mean absolute distance between pre-existing hidden embeddings of GNN layers, LKM efficiently aggregates multi-hop and multi-scale information, enabling improved representation of both local and global molecular features. We evaluated LKM using three diverse GNN architectures (DimeNet++, MXMNet, and PAMNet) using datasets of quantum chemical properties (QM9, MD17 and Chignolin). We found that the LKM method effectively reduces the mean absolute error of quantum chemical and biophysical property predictions by up to 9.8% (QM9), 45.3% (MD17 Energy), and 22.9% (Chignolin). This work demonstrates the potential of LKM to significantly improve the accuracy of GNNs for chemical property prediction without any substantial increase in training and inference cost.
Authors: Stephan Rabanser, Nicolas Papernot
Abstract: Selective classifiers improve model reliability by abstaining on inputs the model deems uncertain. However, few practical approaches achieve the gold-standard performance of a perfect-ordering oracle that accepts examples exactly in order of correctness. Our work formalizes this shortfall as the selective-classification gap and present the first finite-sample decomposition of this gap to five distinct sources of looseness: Bayes noise, approximation error, ranking error, statistical noise, and implementation- or shift-induced slack. Crucially, our analysis reveals that monotone post-hoc calibration -- often believed to strengthen selective classifiers -- has limited impact on closing this gap, since it rarely alters the model's underlying score ranking. Bridging the gap therefore requires scoring mechanisms that can effectively reorder predictions rather than merely rescale them. We validate our decomposition on synthetic two-moons data and on real-world vision and language benchmarks, isolating each error component through controlled experiments. Our results confirm that (i) Bayes noise and limited model capacity can account for substantial gaps, (ii) only richer, feature-aware calibrators meaningfully improve score ordering, and (iii) data shift introduces a separate slack that demands distributionally robust training. Together, our decomposition yields a quantitative error budget as well as actionable design guidelines that practitioners can use to build selective classifiers which approximate ideal oracle behavior more closely.
Authors: Zhiqin Yang, Yonggang Zhang, Chenxin Li, Yiu-ming Cheung, Bo Han, Yixuan Yuan
Abstract: Federated Learning (FL) confronts a significant challenge known as data heterogeneity, which impairs model performance and convergence. Existing methods have made notable progress in addressing this issue. However, improving performance in certain heterogeneity scenarios remains an overlooked question: \textit{How robust are these methods to deploy under diverse heterogeneity scenarios?} To answer this, we conduct comprehensive evaluations across varied heterogeneity scenarios, showing that most existing methods exhibit limited robustness. Meanwhile, insights from these experiments highlight that sharing statistical information can mitigate heterogeneity by enabling clients to update with a global perspective. Motivated by this, we propose \textbf{FedGPS} (\textbf{Fed}erated \textbf{G}oal-\textbf{P}ath \textbf{S}ynergy), a novel framework that seamlessly integrates statistical distribution and gradient information from others. Specifically, FedGPS statically modifies each client's learning objective to implicitly model the global data distribution using surrogate information, while dynamically adjusting local update directions with gradient information from other clients at each round. Extensive experiments show that FedGPS outperforms state-of-the-art methods across diverse heterogeneity scenarios, validating its effectiveness and robustness. The code is available at: https://github.com/CUHK-AIM-Group/FedGPS.
Authors: Thomas Rupf, Marco Bagatella, Marin Vlastelica, Andreas Krause
Abstract: Behavior Foundation Models (BFMs) are capable of retrieving high-performing policy for any reward function specified directly at test-time, commonly referred to as zero-shot reinforcement learning (RL). While this is a very efficient process in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of rewards, or significant labeling efforts. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test-time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks, and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead. Code is available at https://github.com/ThomasRupf/opti-bfm.
Authors: Ziqian Zhong, Aditi Raghunathan, Nicholas Carlini
Abstract: The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at https://github.com/safety-research/impossiblebench.
Authors: Udit Saxena
Abstract: Topological features capture global geometric structure in imaging data, but practical adoption in deep learning requires both computational efficiency and differentiability. We present optimized GPU kernels for the Euler Characteristic Curve (ECC) computation achieving 16-2000\"O speedups over prior GPU implementations on synthetic grids, and introduce a differentiable PyTorch layer enabling end-to-end learning. Our CUDA kernels, optimized for Ampere GPUs use 128B-coalesced access and hierarchical shared-memory accumulation. Our PyTorch layer learns thresholds in a single direction via a Differentiable Euler Characteristic Transform-style sigmoid relaxation. We discuss downstream relevance, including applications highlighted by prior ECC work, and outline batching/multi-GPU extensions to broaden adoption.
Authors: Tristan Cinquin, Geoff Pleiss, Agustinus Kristiadi
Abstract: While chain-of-thought prompting with Best-of-N (BoN) selection has become popular for mathematical reasoning in large language models (LLMs), its linear structure fails to capture the branching and exploratory nature of complex problem-solving. In this work, we propose an adaptive algorithm to maximize process reward model (PRM) scores over the intractable action space, and investigate whether PRM-guided tree search can improve mathematical reasoning by exploring multiple partial solution paths. Across $23$ diverse mathematical problems using Qwen2.5-Math-7B-Instruct with its associated PRM as a case study, we find that: (1) PRM-guided tree search shows no statistically significant improvements over BoN despite higher costs, (2) Monte Carlo tree search and beam search outperform other PRM-guided tree search methods, (3) PRMs poorly approximate state values and their reliability degrades with reasoning depth, and (4) PRMs generalize poorly out of distribution. This underperformance stems from tree search's greater reliance on unreliable PRM scores, suggesting different reward modeling is necessary before tree search can effectively enhance mathematical reasoning in LLMs.
Authors: Qitai Tan, Yiyun Chen, Mo Li, Ruiwen Gu, Yilin Su, Xiao-Ping Zhang
Abstract: Recent advances in deep learning have driven rapid progress in time series forecasting, yet many state-of-the-art models continue to struggle with robust performance in real-world applications, even when they achieve strong results on standard benchmark datasets. This persistent gap can be attributed to the black-box nature of deep learning architectures and the inherent limitations of current evaluation frameworks, which frequently lack the capacity to provide clear, quantitative insights into the specific strengths and weaknesses of different models, thereby complicating the selection of appropriate models for particular forecasting scenarios. To address these issues, we propose a synthetic data-driven evaluation paradigm, SynTSBench, that systematically assesses fundamental modeling capabilities of time series forecasting models through programmable feature configuration. Our framework isolates confounding factors and establishes an interpretable evaluation system with three core analytical dimensions: (1) temporal feature decomposition and capability mapping, which enables systematic evaluation of model capacities to learn specific pattern types; (2) robustness analysis under data irregularities, which quantifies noise tolerance thresholds and anomaly recovery capabilities; and (3) theoretical optimum benchmarking, which establishes performance boundaries for each pattern type-enabling direct comparison between model predictions and mathematical optima. Our experiments show that current deep learning models do not universally approach optimal baselines across all types of temporal features.The code is available at https://github.com/TanQitai/SynTSBench
Authors: Guangyu Dai, Siliang Tang, Yueting Zhuang
Abstract: In recent years, Pretrained Large Models(PLMs) researchers proposed large-small model collaboration frameworks, leveraged easily trainable small models to assist large models, aim to(1) significantly reduce computational resource consumption while maintaining comparable accuracy, and (2) enhance large model performance in specialized domain tasks. However, this collaborative paradigm suffers from issues such as significant accuracy degradation, exacerbated catastrophic forgetting, and amplified hallucination problems induced by small model knowledge. To address these challenges, we propose a KAN-based Collaborative Model (KCM) as an improved approach to large-small model collaboration. The KAN utilized in KCM represents an alternative neural network architecture distinct from conventional MLPs. Compared to MLPs, KAN offers superior visualizability and interpretability while mitigating catastrophic forgetting. We deployed KCM in large-small model collaborative systems across three scenarios: language, vision, and vision-language cross-modal tasks. The experimental results demonstrate that, compared with pure large model approaches, the large-small model collaboration framework utilizing KCM as the collaborative model significantly reduces the number of large model inference calls while maintaining near-identical task accuracy, thereby substantially lowering computational resource consumption. Concurrently, the KAN-based small collaborative model markedly mitigates catastrophic forgetting, leading to significant accuracy improvements for long-tail data. The results reveal that KCM demonstrates superior performance across all metrics compared to MLP-based small collaborative models (MCM).
Authors: Penghao Wang, Yuhao Zhou, Mengxuan Wu, Ziheng Qin, Bangyuan Zhu, Shengbin Huang, Xuanlei Zhao, Panpan Zhang, Xiaojiang Peng, Yuzhang Shang, Jianfei Yang, Zheng Zhu, Tianlong Chen, Zhangyang Wang, Kai Wang
Abstract: As large language models (LLMs) advance, the ultimate vision for their role in science is emerging: we could build an AI collaborator to effectively assist human beings throughout the entire scientific research process. We refer to this envisioned system as ResearchGPT. Given that scientific research progresses through multiple interdependent phases, achieving this vision requires rigorous benchmarks that evaluate the end-to-end workflow rather than isolated sub-tasks. To this end, we contribute CS-54k, a high-quality corpus of scientific Q&A pairs in computer science, built from 14k CC-licensed papers. It is constructed through a scalable, paper-grounded pipeline that combines retrieval-augmented generation (RAG) with multi-stage quality control to ensure factual grounding. From this unified corpus, we derive two complementary subsets: CS-4k, a carefully curated benchmark for evaluating AI's ability to assist scientific research, and CS-50k, a large-scale training dataset. Extensive experiments demonstrate that CS-4k stratifies state-of-the-art LLMs into distinct capability tiers. Open models trained on CS-50k with supervised training and reinforcement learning demonstrate substantial improvements. Even 7B-scale models, when properly trained, outperform many larger proprietary systems, such as GPT-4.1, GPT-4o, and Gemini 2.5 Pro. This indicates that making AI models better research assistants relies more on domain-aligned training with high-quality data than on pretraining scale or general benchmark performance. We release CS-4k and CS-50k in the hope of fostering AI systems as reliable collaborators in CS research.
Authors: Yang Qiu, Yixiong Zou, Jun Wang, Wei Liu, Xiangyu Fu, Ruixuan Li
Abstract: Out-of-distribution generalization under distributional shifts remains a critical challenge for graph neural networks. Existing methods generally adopt the Invariant Risk Minimization (IRM) framework, requiring costly environment annotations or heuristically generated synthetic splits. To circumvent these limitations, in this work, we aim to develop an IRM-free method for capturing causal subgraphs. We first identify that causal subgraphs exhibit substantially smaller distributional variations than non-causal components across diverse environments, which we formalize as the Invariant Distribution Criterion and theoretically prove in this paper. Building on this criterion, we systematically uncover the quantitative relationship between distributional shift and representation norm for identifying the causal subgraph, and investigate its underlying mechanisms in depth. Finally, we propose an IRM-free method by introducing a norm-guided invariant distribution objective for causal subgraph discovery and prediction. Extensive experiments on two widely used benchmarks demonstrate that our method consistently outperforms state-of-the-art methods in graph generalization.
Authors: Saraf Anzum Shreya, MD. Abu Ismail Siddique, Sharaf Tasnim
Abstract: Brain tumors are a challenging problem in neuro-oncology, where early and precise diagnosis is important for successful treatment. Deep learning-based brain tumor classification methods often rely on heavy data augmentation which can limit generalization and trust in clinical applications. In this paper, we propose a double-backbone network integrating VGG16 and Xception with a Frequency-Gated Attention (FGA) Block to capture complementary local and global features. Unlike previous studies, our model achieves state-of-the-art performance without augmentation which demonstrates robustness to variably sized and distributed datasets. For further transparency, Grad-CAM is integrated to visualize the tumor regions based on which the model is giving prediction, bridging the gap between model prediction and clinical interpretability. The proposed framework achieves 99.24\% accuracy on the 7K-DS dataset for the 4-class setting, along with 98.68\% and 99.85\% in the 3-class and 2-class settings, respectively. On the independent 3K-DS dataset, the model generalizes with 95.77\% accuracy, outperforming baseline and state-of-the-art methods. To further support clinical usability, we developed a graphical user interface (GUI) that provides real-time classification and Grad-CAM-based tumor localization. These findings suggest that augmentation-free, interpretable, and deployable deep learning models such as DB-FGA-Net hold strong potential for reliable clinical translation in brain tumor diagnosis.
Authors: Yuhang Wang
Abstract: Multivariate time series forecasting requires simultaneously modeling temporal patterns and cross-variate dependencies. Channel-independent methods such as PatchTST excel at temporal modeling but ignore variable correlations, while pure variate-attention approaches such as iTransformer sacrifice temporal encoding. We proposeInvDec (Inverted Decoder), a hybrid architecture that achieves principled separation between temporal encoding and variate-level decoding. InvDec combines a patch-based temporal encoder with an inverted decoder operating on the variate dimension through variate-wise self-attention. We introduce delayed variate embeddings that enrich variable-specific representations only after temporal encoding, preserving temporal feature integrity. An adaptive residual fusion mechanism dynamically balances temporal and variate information across datasets of varying dimensions. Instantiating InvDec with PatchTST yields InvDec-PatchTST. Extensive experiments on seven benchmarks demonstrate significant gains on high-dimensional datasets: 20.9% MSE reduction on Electricity (321 variables), 4.3% improvement on Weather, and 2.7% gain on Traffic compared to PatchTST, while maintaining competitive performance on low-dimensional ETT datasets. Ablation studies validate each component, and analysis reveals that InvDec's advantage grows with dataset dimensionality, confirming that cross-variate modeling becomes critical as the number of variables increases.
Authors: Fengyuan Yu, Yuyuan Li, Xiaohua Feng, Junjie Fang, Tao Wang, Chaochao Chen
Abstract: With the growing demand for safeguarding sensitive user information in recommender systems, recommendation attribute unlearning is receiving increasing attention. Existing studies predominantly focus on single-attribute unlearning. However, privacy protection requirements in the real world often involve multiple sensitive attributes and are dynamic. Existing single-attribute unlearning methods cannot meet these real-world requirements due to i) CH1: the inability to handle multiple unlearning requests simultaneously, and ii) CH2: the lack of efficient adaptability to dynamic unlearning needs. To address these challenges, we propose LEGO, a lightweight and efficient multiple-attribute unlearning framework. Specifically, we divide the multiple-attribute unlearning process into two steps: i) Embedding Calibration removes information related to a specific attribute from user embedding, and ii) Flexible Combination combines these embeddings into a single embedding, protecting all sensitive attributes. We frame the unlearning process as a mutual information minimization problem, providing LEGO a theoretical guarantee of simultaneous unlearning, thereby addressing CH1. With the two-step framework, where Embedding Calibration can be performed in parallel and Flexible Combination is flexible and efficient, we address CH2. Extensive experiments on three real-world datasets across three representative recommendation models demonstrate the effectiveness and efficiency of our proposed framework. Our code and appendix are available at https://github.com/anonymifish/lego-rec-multiple-attribute-unlearning.
URLs: https://github.com/anonymifish/lego-rec-multiple-attribute-unlearning.
Authors: Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Fabrice Jimenez, Thomas Oberlin
Abstract: Deep vision models are now mature enough to be integrated in industrial and possibly critical applications such as autonomous navigation. Yet, data collection and labeling to train such models requires too much efforts and costs for a single company or product. This drawback is more significant in critical applications, where training data must include all possible conditions including rare scenarios. In this perspective, generating synthetic images is an appealing solution, since it allows a cheap yet reliable covering of all the conditions and environments, if the impact of the synthetic-to-real distribution shift is mitigated. In this article, we consider the case of runway detection that is a critical part in autonomous landing systems developed by aircraft manufacturers. We propose an image generation approach based on a commercial flight simulator that complements a few annotated real images. By controlling the image generation and the integration of real and synthetic data, we show that standard object detection models can achieve accurate prediction. We also evaluate their robustness with respect to adverse conditions, in our case nighttime images, that were not represented in the real data, and show the interest of using a customized domain adaptation strategy.
Authors: Zhenghao Xu, Qin Lu, Qingru Zhang, Liang Qiu, Ilgee Hong, Changlong Yu, Wenlin Yao, Yao Liu, Haoming Jiang, Lihong Li, Hyokun Yun, Tuo Zhao
Abstract: Reward model (RM) plays a pivotal role in reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution (OOD) inputs. By contrast, strong LLM judges equipped with reasoning capabilities demonstrate superior generalization, even without additional training, but incur significantly higher inference costs, limiting their applicability in online RLHF. In this work, we propose an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. Our approach formulates advantage estimation in policy gradient (PG) methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing. Uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF.
Authors: Shuhei Aikawa, Aru Suzuki, Kei Yoshitake, Kanata Teshigawara, Akira Iwabuchi, Ken Kobayashi, Kazuhide Nakata
Abstract: This paper focuses on forecasting hierarchical time-series data, where each higher-level observation equals the sum of its corresponding lower-level time series. In such contexts, the forecast values should be coherent, meaning that the forecast value of each parent series exactly matches the sum of the forecast values of its child series. Existing hierarchical forecasting methods typically generate base forecasts independently for each series and then apply a reconciliation procedure to adjust them so that the resulting forecast values are coherent across the hierarchy. These methods generally derive an optimal reconciliation, using a covariance matrix of the forecast error. In practice, however, the true covariance matrix is unknown and has to be estimated from finite samples in advance. This gap between the true and estimated covariance matrix may degrade forecast performance. To address this issue, we propose a robust optimization framework for hierarchical reconciliation that accounts for uncertainty in the estimated covariance matrix. We first introduce an uncertainty set for the estimated covariance matrix and formulate a reconciliation problem that minimizes the worst-case expected squared error over this uncertainty set. We show that our problem can be cast as a semidefinite optimization problem. Numerical experiments demonstrate that the proposed robust reconciliation method achieved better forecast performance than existing hierarchical forecasting methods, which indicates the effectiveness of integrating uncertainty into the reconciliation process.
Authors: Baoqing Yue, Jinyuan Zhou, Zixi Wei, Jingtao Zhan, Qingyao Ai, Yiqun Liu
Abstract: Scaling laws aim to accurately predict model performance across different scales. Existing scaling-law studies almost exclusively rely on cross-entropy as the evaluation metric. However, cross-entropy provides only a partial view of performance: it measures the absolute probability assigned to the correct token, but ignores the relative ordering between correct and incorrect tokens. Yet, relative ordering is crucial for language models, such as in greedy-sampling scenario. To address this limitation, we investigate scaling from the perspective of relative ordering. We first propose the Relative-Based Probability (RBP) metric, which quantifies the probability that the correct token is ranked among the top predictions. Building on this metric, we establish the Relative-Based Scaling Law, which characterizes how RBP improves with increasing model size. Through extensive experiments on four datasets and four model families spanning five orders of magnitude, we demonstrate the robustness and accuracy of this law. Finally, we illustrate the broad application of this law with two examples, namely providing a deeper explanation of emergence phenomena and facilitating finding fundamental theories of scaling laws. In summary, the Relative-Based Scaling Law complements the cross-entropy perspective and contributes to a more complete understanding of scaling large language models. Thus, it offers valuable insights for both practical development and theoretical exploration.
Authors: Tom Maus, Asma Atamna, Tobias Glasmachers
Abstract: Autonomous control of multi-stage industrial processes requires both local specialization and global coordination. Reinforcement learning (RL) offers a promising approach, but its industrial adoption remains limited due to challenges such as reward design, modularity, and action space management. Many academic benchmarks differ markedly from industrial control problems, limiting their transferability to real-world applications. This study introduces an enhanced industry-inspired benchmark environment that combines tasks from two existing benchmarks, SortingEnv and ContainerGym, into a sequential recycling scenario with sorting and pressing operations. We evaluate two control strategies: a modular architecture with specialized agents and a monolithic agent governing the full system, while also analyzing the impact of action masking. Our experiments show that without action masking, agents struggle to learn effective policies, with the modular architecture performing better. When action masking is applied, both architectures improve substantially, and the performance gap narrows considerably. These results highlight the decisive role of action space constraints and suggest that the advantages of specialization diminish as action complexity is reduced. The proposed benchmark thus provides a valuable testbed for exploring practical and robust multi-agent RL solutions in industrial automation, while contributing to the ongoing debate on centralization versus specialization.
Authors: Aditya Gopalan, Sayak Ray Chowdhury, Debangshu Banerjee
Abstract: Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
Authors: Sishun Liu, Ke Deng, Xiuzhen Zhang, Yongli Ren, Yan Wang
Abstract: Marked Temporal Point Process (MTPP) has been well studied to model the event distribution in marked event streams, which can be used to predict the mark and arrival time of the next event. However, existing studies overlook that the distribution of event marks is highly imbalanced in many real-world applications, with some marks being frequent but others rare. The imbalance poses a significant challenge to the performance of the next event prediction, especially for events of rare marks. To address this issue, we propose a thresholding method, which learns thresholds to tune the mark probability normalized by the mark's prior probability to optimize mark prediction, rather than predicting the mark directly based on the mark probability as in existing studies. In conjunction with this method, we predict the mark first and then the time. In particular, we develop a novel neural MTPP model to support effective time sampling and estimation of mark probability without computationally expensive numerical improper integration. Extensive experiments on real-world datasets demonstrate the superior performance of our solution against various baselines for the next event mark and time prediction. The code is available at https://github.com/undes1red/IFNMTPP.
Authors: Xuran Li, Jingyi Wang
Abstract: Large language models (LLMs) are increasingly deployed in real-world systems, yet they can produce toxic or biased outputs that undermine safety and trust. Post-hoc model repair provides a practical remedy, but the high cost of parameter updates motivates selective use of repair data. Despite extensive prior work on data selection for model training, it remains unclear which sampling criteria are most effective and efficient when applied specifically to behavioral repair of large generative models. Our study presents a systematic analysis of sample prioritization strategies for LLM repair. We evaluate five representative selection methods, including random sampling, K-Center, gradient-norm-based selection(GraNd), stratified coverage (CCS), and a Semantic-Aware Prioritized Sampling (SAPS) approach we proposed. Repair effectiveness and trade-offs are assessed through toxicity reduction, perplexity on WikiText-2 and LAMBADA, and three composite metrics: the Repair Proximity Score (RPS), the Overall Performance Score (OPS), and the Repair Efficiency Score (RES). Experimental results show that SAPS achieves the best balance between detoxification, utility preservation, and efficiency, delivering comparable or superior repair outcomes with substantially less data. Random sampling remains effective for large or robust models, while high-overhead methods such as CCS and GraNd provide limited benefit. The optimal data proportion depends on model scale and repair method, indicating that sample selection should be regarded as a tunable component of repair pipelines. Overall, these findings establish selection-based repair as an efficient and scalable paradigm for maintaining LLM reliability.
Authors: Quannian Zhang, Michael R\"oder, Nikit Srivastava, N'Dah Jean Kouagou, Axel-Cyrille Ngonga Ngomo
Abstract: Evaluating competing systems in a comparable way, i.e., benchmarking them, is an undeniable pillar of the scientific method. However, system performance is often summarized via a small number of metrics. The analysis of the evaluation details and the derivation of insights for further development or use remains a tedious manual task with often biased results. Thus, this paper argues for a new type of benchmarking, which is dubbed explainable benchmarking. The aim of explainable benchmarking approaches is to automatically generate explanations for the performance of systems in a benchmark. We provide a first instantiation of this paradigm for knowledge-graph-based question answering systems. We compute explanations by using a novel concept learning approach developed for large knowledge graphs called PruneCEL. Our evaluation shows that PruneCEL outperforms state-of-the-art concept learners on the task of explainable benchmarking by up to 0.55 points F1 measure. A task-driven user study with 41 participants shows that in 80\% of the cases, the majority of participants can accurately predict the behavior of a system based on our explanations. Our code and data are available at https://github.com/dice-group/PruneCEL/tree/K-cap2025
Authors: Xuan Lin, Aocheng Ding, Tengfei Ma, Hua Liang, Zhe Quan
Abstract: Drug combinations offer therapeutic benefits but also carry the risk of adverse drug-drug interactions (DDIs), especially under complex molecular structures. Accurate DDI event prediction requires capturing fine-grained inter-drug relationships, which are critical for modeling metabolic mechanisms such as enzyme-mediated competition. However, existing approaches typically rely on isolated drug representations and fail to explicitly model atom-level cross-molecular interactions, limiting their effectiveness across diverse molecular complexities and DDI type distributions. To address these limitations, we propose MolBridge, a novel atom-level joint graph refinement framework for robust DDI event prediction. MolBridge constructs a joint graph that integrates atomic structures of drug pairs, enabling direct modeling of inter-drug associations. A central challenge in such joint graph settings is the potential loss of information caused by over-smoothing when modeling long-range atomic dependencies. To overcome this, we introduce a structure consistency module that iteratively refines node features while preserving the global structural context. This joint design allows MolBridge to effectively learn both local and global interaction outperforms state-of-the-art baselines, achieving superior performance across long-tail and inductive scenarios. patterns, yielding robust representations across both frequent and rare DDI types. Extensive experiments on two benchmark datasets show that MolBridge consistently. These results demonstrate the advantages of fine-grained graph refinement in improving the accuracy, robustness, and mechanistic interpretability of DDI event prediction.This work contributes to Web Mining and Content Analysis by developing graph-based methods for mining and analyzing drug-drug interaction networks.
Authors: Lawrence Clegg, John Cartlidge
Abstract: Intransitive player dominance, where player A beats B, B beats C, but C beats A, is common in competitive tennis. Yet, there are few known attempts to incorporate it within forecasting methods. We address this problem with a graph neural network approach that explicitly models these intransitive relationships through temporal directed graphs, with players as nodes and their historical match outcomes as directed edges. We find the bookmaker Pinnacle Sports poorly handles matches with high intransitive complexity and posit that our graph-based approach is uniquely positioned to capture relational dynamics in these scenarios. When selectively betting on higher intransitivity matchups with our model (65.7% accuracy, 0.215 Brier Score), we achieve significant positive returns of 3.26% ROI with Kelly staking over 1903 bets, suggesting a market inefficiency in handling intransitive matchups that our approach successfully exploits.
Authors: Tom\'a\v{s} Sou\v{c}ek, Sylvestre-Alvise Rebuffi, Pierre Fernandez, Nikola Jovanovi\'c, Hady Elsahar, Valeriu Lacatusu, Tuan Tran, Alexandre Mourachko
Abstract: Recent years have seen a surge in interest in digital content watermarking techniques, driven by the proliferation of generative models and increased legal pressure. With an ever-growing percentage of AI-generated content available online, watermarking plays an increasingly important role in ensuring content authenticity and attribution at scale. There have been many works assessing the robustness of watermarking to removal attacks, yet, watermark forging, the scenario when a watermark is stolen from genuine content and applied to malicious content, remains underexplored. In this work, we investigate watermark forging in the context of widely used post-hoc image watermarking. Our contributions are as follows. First, we introduce a preference model to assess whether an image is watermarked. The model is trained using a ranking loss on purely procedurally generated images without any need for real watermarks. Second, we demonstrate the model's capability to remove and forge watermarks by optimizing the input image through backpropagation. This technique requires only a single watermarked image and works without knowledge of the watermarking model, making our attack much simpler and more practical than attacks introduced in related work. Third, we evaluate our proposed method on a variety of post-hoc image watermarking models, demonstrating that our approach can effectively forge watermarks, questioning the security of current watermarking approaches. Our code and further resources are publicly available.
Authors: Rui Zhu, Song-Lin Lv, Zi-Kang Wang, Lan-Zhe Guo
Abstract: Exploiting unlabeled data through semi-supervised learning (SSL) or leveraging pre-trained models via fine-tuning are two prevailing paradigms for addressing label-scarce scenarios. Recently, growing attention has been given to combining fine-tuning of pre-trained vision-language models (VLMs) with SSL, forming the emerging paradigm of semi-supervised fine-tuning. However, existing methods often suffer from model bias and hyperparameter sensitivity, due to reliance on prediction consistency or pre-defined confidence thresholds. To address these limitations, we propose a simple yet effective plug-and-play methodology named $\underline{\textbf{Bi-Co}}$nsistency-$\underline{\textbf{G}}$uided Self-Training (Bi-CoG), which assigns high-quality and low-bias pseudo-labels, by simultaneously exploiting inter-model and intra-model consistency, along with an error-aware dynamic pseudo-label assignment strategy. Both theoretical analysis and extensive experiments over 14 datasets demonstrate the effectiveness of Bi-CoG, which consistently and significantly improves the performance of existing methods.
Authors: Fangjian Zhang, Xiaoyong Zhuge, Wenlan Wang, Haixia Xiao, Yuying Zhu, Siyang Cheng
Abstract: Artificial intelligence has advanced quantitative remote sensing, yet its effectiveness is constrained by imbalanced label distribution. This imbalance leads conventionally trained models to favor common samples, which in turn degrades retrieval performance for rare ones. Rainfall retrieval exemplifies this issue, with performance particularly compromised for heavy rain. This study proposes Hurdle-Inversion Model Debiasing Learning (IMDL) framework. Following a divide-and-conquer strategy, imbalance in the rain distribution is decomposed into two components: zero inflation, defined by the predominance of non-rain samples; and long tail, defined by the disproportionate abundance of light-rain samples relative to heavy-rain samples. A hurdle model is adopted to handle the zero inflation, while IMDL is proposed to address the long tail by transforming the learning object into an unbiased ideal inverse model. Comprehensive evaluation via statistical metrics and case studies investigating rainy weather in eastern China confirms Hurdle-IMDL's superiority over conventional, cost-sensitive, generative, and multi-task learning methods. Its key advancements include effective mitigation of systematic underestimation and a marked improvement in the retrieval of heavy-to-extreme rain. IMDL offers a generalizable approach for addressing imbalance in distributions of environmental variables, enabling enhanced retrieval of rare yet high-impact events.
Authors: Abdulmomen Ghalkha, Zhuojun Tian, Chaouki Ben Issaid, Mehdi Bennis
Abstract: Conventional multimodal alignment methods assume mutual redundancy across all modalities, an assumption that fails in real-world distributed scenarios. We propose SheafAlign, a sheaf-theoretic framework for decentralized multimodal alignment that replaces single-space alignment with multiple comparison spaces. This approach models pairwise modality relations through sheaf structures and leverages decentralized contrastive learning-based objectives for training. SheafAlign overcomes the limitations of prior methods by not requiring mutual redundancy among all modalities, preserving both shared and unique information. Experiments on multimodal sensing datasets show superior zero-shot generalization, cross-modal alignment, and robustness to missing modalities, with 50\% lower communication cost than state-of-the-art baselines.
Authors: Jacopo Di Ventura, Jan Felix Kleuker, Aske Plaat, Thomas Moerland
Abstract: Zero-shot reinforcement learning (RL) has emerged as a setting for developing general agents in an unsupervised manner, capable of solving downstream tasks without additional training or planning at test-time. Unlike conventional RL, which optimizes policies for a fixed reward, zero-shot RL requires agents to encode representations rich enough to support immediate adaptation to any objective, drawing parallels to vision and language foundation models. Despite growing interest, the field lacks a common analytical lens. We present the first unified framework for zero-shot RL. Our formulation introduces a consistent notation and taxonomy that organizes existing approaches and allows direct comparison between them. Central to our framework is the classification of algorithms into two families: direct representations, which learn end-to-end mappings from rewards to policies, and compositional representations, which decompose the representation leveraging the substructure of the value function. Within this framework, we highlight shared principles and key differences across methods, and we derive an extended bound for successor-feature methods, offering a new perspective on their performance in the zero-shot regime. By consolidating existing work under a common lens, our framework provides a principled foundation for future research in zero-shot RL and outlines a clear path toward developing more general agents.
Authors: Alexandre Benoit, Catherine Aitken, Yu He
Abstract: Graph rewiring has emerged as a key technique to alleviate over-squashing in Graph Neural Networks (GNNs) and Graph Transformers by modifying the graph topology to improve information flow. While effective, rewiring inherently alters the graph's structure, raising the risk of distorting important topology-dependent signals. Yet, despite the growing use of rewiring, little is known about which structural properties must be preserved to ensure both performance gains and structural fidelity. In this work, we provide the first systematic analysis of how rewiring affects a range of graph structural metrics, and how these changes relate to downstream task performance. We study seven diverse rewiring strategies and correlate changes in local and global graph properties with node classification accuracy. Our results reveal a consistent pattern: successful rewiring methods tend to preserve local structure while allowing for flexibility in global connectivity. These findings offer new insights into the design of effective rewiring strategies, bridging the gap between graph theory and practical GNN optimization.
Authors: Simon Schindler, Christoph Binder, Lukas L\"urzer, Stefan Huber
Abstract: Machine Learning Operations (MLOps) practices are increas- ingly adopted in industrial settings, yet their integration with Opera- tional Technology (OT) systems presents significant challenges. This pa- per analyzes the fundamental obstacles in combining MLOps with OT en- vironments and proposes a systematic approach to embed MLOps prac- tices into established OT reference models. We evaluate the suitability of the Reference Architectural Model for Industry 4.0 (RAMI 4.0) and the International Society of Automation Standard 95 (ISA-95) for MLOps integration and present a detailed mapping of MLOps lifecycle compo- nents to RAMI 4.0 exemplified by a real-world use case. Our findings demonstrate that while standard MLOps practices cannot be directly transplanted to OT environments, structured adaptation using existing reference models can provide a pathway for successful integration.
Authors: Alexandru Oarga, Yilun Du
Abstract: Generalization is a key challenge in machine learning, specifically in reasoning tasks, where models are expected to solve problems more complex than those encountered during training. Existing approaches typically train reasoning models in an end-to-end fashion, directly mapping input instances to solutions. While this allows models to learn useful heuristics from data, it often results in limited generalization beyond the training distribution. In this work, we propose a novel approach to reasoning generalization by learning energy landscapes over the solution spaces of smaller, more tractable subproblems. At test time, we construct a global energy landscape for a given problem by combining the energy functions of multiple subproblems. This compositional approach enables the incorporation of additional constraints during inference, allowing the construction of energy landscapes for problems of increasing difficulty. To improve the sample quality from this newly constructed energy landscape, we introduce Parallel Energy Minimization (PEM). We evaluate our approach on a wide set of reasoning problems. Our method outperforms existing state-of-the-art methods, demonstrating its ability to generalize to larger and more complex problems. Project website can be found at: https://alexoarga.github.io/compositional_reasoning/
Authors: Yuta Kawamoto, Hideaki Iiduka
Abstract: Stochastic gradient descent (SGD) is the workhorse of large-scale learning, yet classical analyses rely on assumptions that can be either too strong (bounded variance) or too coarse (uniform noise). The expected smoothness (ES) condition has emerged as a flexible alternative that ties the second moment of stochastic gradients to the objective value and the full gradient. This paper presents a self-contained convergence analysis of SGD under ES. We (i) refine ES with interpretations and sampling-dependent constants; (ii) derive bounds of the expectation of squared full gradient norm; and (iii) prove $O(1/K)$ rates with explicit residual errors for various step-size schedules. All proofs are given in full detail in the appendix. Our treatment unifies and extends recent threads (Khaled and Richt\'arik, 2020; Umeda and Iiduka, 2025).
Authors: Timur Galimzyanov, Olga Kolomyttseva, Egor Bogomolov
Abstract: We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (Voyager-3 family) consistently beat sparse retrievers, however requiring 100x larger latency. (3) Optimal chunk size scales with available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 + word splitting offers the best quality-latency trade-off. Thus, we provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.
Authors: Mirza Raquib, Niloy Das, Farida Siddiqi Prity, Arafath Al Fahim, Saydul Akbar Murad, Mohammad Amzad Hossain, MD Jiabul Hoque, Mohammad Ali Moni
Abstract: Breast cancer is considered the most critical and frequently diagnosed cancer in women worldwide, leading to an increase in cancer-related mortality. Early and accurate detection is crucial as it can help mitigate possible threats while improving survival rates. In terms of prediction, conventional diagnostic methods are often limited by variability, cost, and, most importantly, risk of misdiagnosis. To address these challenges, machine learning (ML) has emerged as a powerful tool for computer-aided diagnosis, with feature selection playing a vital role in improving model performance and interpretability. This research study proposes an integrated framework that incorporates customized Particle Swarm Optimization (PSO) for feature selection. This framework has been evaluated on a comprehensive set of 29 different models, spanning classical classifiers, ensemble techniques, neural networks, probabilistic algorithms, and instance-based algorithms. To ensure interpretability and clinical relevance, the study uses cross-validation in conjunction with explainable AI methods. Experimental evaluation showed that the proposed approach achieved a superior score of 99.1\% across all performance metrics, including accuracy and precision, while effectively reducing dimensionality and providing transparent, model-agnostic explanations. The results highlight the potential of combining swarm intelligence with explainable ML for robust, trustworthy, and clinically meaningful breast cancer diagnosis.
Authors: Yang Han, Pengyu Wang, Kai Yu, Xin Chen, Lu Chen
Abstract: Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While large-scale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation task. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and is faster by one order of magnitude than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness.
Authors: Aki Rehn, Linzh Zhao, Mikko A. Heikkil\"a, Antti Honkela
Abstract: Differentially private (DP) transfer learning, i.e., fine-tuning a pretrained model on private data, is the current state-of-the-art approach for training large models under privacy constraints. We focus on two key hyperparameters in this setting: the clipping bound $C$ and batch size $B$. We show a clear mismatch between the current theoretical understanding of how to choose an optimal $C$ (stronger privacy requires smaller $C$) and empirical outcomes (larger $C$ performs better under strong privacy), caused by changes in the gradient distributions. Assuming a limited compute budget (fixed epochs), we demonstrate that the existing heuristics for tuning $B$ do not work, while cumulative DP noise better explains whether smaller or larger batches perform better. We also highlight how the common practice of using a single $(C,B)$ setting across tasks can lead to suboptimal performance. We find that performance drops especially when moving between loose and tight privacy and between plentiful and limited compute, which we explain by analyzing clipping as a form of gradient re-weighting and examining cumulative DP noise.
Authors: Lukas Miklautz, Chengzhi Shi, Andrii Shkabrii, Theodoros Thirimachos Davarakis, Prudence Lam, Claudia Plant, Jennifer Dy, Stratis Ioannidis
Abstract: We introduce H-SPLID, a novel algorithm for learning salient feature representations through the explicit decomposition of salient and non-salient features into separate spaces. We show that H-SPLID promotes learning low-dimensional, task-relevant features. We prove that the expected prediction deviation under input perturbations is upper-bounded by the dimension of the salient subspace and the Hilbert-Schmidt Independence Criterion (HSIC) between inputs and representations. This establishes a link between robustness and latent representation compression in terms of the dimensionality and information preserved. Empirical evaluations on image classification tasks show that models trained with H-SPLID primarily rely on salient input components, as indicated by reduced sensitivity to perturbations affecting non-salient features, such as image backgrounds. Our code is available at https://github.com/neu-spiral/H-SPLID.
Authors: Mingxuan Liu, Yilin Ning, Haoyuan Wang, Chuan Hong, Matthew Engelhard, Danielle S. Bitterman, William G. La Cava, Nan Liu
Abstract: As machine learning models become increasingly integrated into healthcare, structural inequities and social biases embedded in clinical data can be perpetuated or even amplified by data-driven models. In survival analysis, censoring and time dynamics can further add complexity to fair model development. Additionally, algorithmic fairness approaches often overlook disparities in cross-group rankings, e.g., high-risk Black patients may be ranked below lower-risk White patients who do not experience the event of mortality. Such misranking can reinforce biological essentialism and undermine equitable care. We propose a Fairness-Aware Survival Modeling (FASM), designed to mitigate algorithmic bias regarding both intra-group and cross-group risk rankings over time. Using breast cancer prognosis as a representative case and applying FASM to SEER breast cancer data, we show that FASM substantially improves fairness while preserving discrimination performance comparable to fairness-unaware survival models. Time-stratified evaluations show that FASM maintains stable fairness over a 10-year horizon, with the greatest improvements observed during the mid-term of follow-up. Our approach enables the development of survival models that prioritize both accuracy and equity in clinical decision-making, advancing fairness as a core principle in clinical care.
Authors: Hyun Jong Yang, Hyunsoo Kim, Hyeonho Noh, Seungnyun Kim, Byonghyo Shim
Abstract: Large language models (LLMs) and large multimodal models (LMMs) have achieved unprecedented breakthrough, showcasing remarkable capabilities in natural language understanding, generation, and complex reasoning. This transformative potential has positioned them as key enablers for 6G autonomous communications among machines, vehicles, and humanoids. In this article, we provide an overview of task-oriented autonomous communications with LLMs/LMMs, focusing on multimodal sensing integration, adaptive reconfiguration, and prompt/fine-tuning strategies for wireless tasks. We demonstrate the framework through three case studies: LMM-based traffic control, LLM-based robot scheduling, and LMM-based environment-aware channel estimation. From experimental results, we show that the proposed LLM/LMM-aided autonomous systems significantly outperform conventional and discriminative deep learning (DL) model-based techniques, maintaining robustness under dynamic objectives, varying input parameters, and heterogeneous multimodal conditions where conventional static optimization degrades.
Authors: Fiza Hussain, Anson Bastos, Anjaly Parayil, Ayush Choure, Chetan Bansal, Rujia Wang, Saravan Rajmohan
Abstract: In this paper, we present DiRecGNN, an attention-enhanced entity recommendation framework for monitoring cloud services at Microsoft. We provide insights on the usefulness of this feature as perceived by the cloud service owners and lessons learned from deployment. Specifically, we introduce the problem of recommending the optimal subset of attributes (dimensions) that should be tracked by an automated watchdog (monitor) for cloud services. To begin, we construct the monitor heterogeneous graph at production-scale. The interaction dynamics of these entities are often characterized by limited structural and engagement information, resulting in inferior performance of state-of-the-art approaches. Moreover, traditional methods fail to capture the dependencies between entities spanning a long range due to their homophilic nature. Therefore, we propose an attention-enhanced entity ranking model inspired by transformer architectures. Our model utilizes a multi-head attention mechanism to focus on heterogeneous neighbors and their attributes, and further attends to paths sampled using random walks to capture long-range dependencies. We also employ multi-faceted loss functions to optimize for relevant recommendations while respecting the inherent sparsity of the data. Empirical evaluations demonstrate significant improvements over existing methods, with our model achieving a 43.1% increase in MRR. Furthermore, product teams who consumed these features perceive the feature as useful and rated it 4.5 out of 5.
Authors: Reuben Dorent, Polina Golland, William Wells III
Abstract: Mutual Information (MI) is a fundamental measure of statistical dependence widely used in representation learning. While direct optimization of MI via its definition as a Kullback-Leibler divergence (KLD) is often intractable, many recent methods have instead maximized alternative dependence measures, most notably, the Jensen-Shannon divergence (JSD) between joint and product of marginal distributions via discriminative losses. However, the connection between these surrogate objectives and MI remains poorly understood. In this work, we bridge this gap by deriving a new, tight, and tractable lower bound on KLD as a function of JSD in the general case. By specializing this bound to joint and marginal distributions, we demonstrate that maximizing the JSD-based information increases a guaranteed lower bound on mutual information. Furthermore, we revisit the practical implementation of JSD-based objectives and observe that minimizing the cross-entropy loss of a binary classifier trained to distinguish joint from marginal pairs recovers a known variational lower bound on the JSD. Extensive experiments demonstrate that our lower bound is tight when applied to MI estimation. We compared our lower bound to state-of-the-art neural estimators of variational lower bound across a range of established reference scenarios. Our lower bound estimator consistently provides a stable, low-variance estimate of a tight lower bound on MI. We also demonstrate its practical usefulness in the context of the Information Bottleneck framework. Taken together, our results provide new theoretical justifications and strong empirical evidence for using discriminative learning in MI-based representation learning.
Authors: Quan Li, Wenchao Yu, Suhang Wang, Minhua Lin, Lingwei Chen, Wei Cheng, Haifeng Chen
Abstract: Extreme events frequently occur in real-world time series and often carry significant practical implications. In domains such as climate and healthcare, these events, such as floods, heatwaves, or acute medical episodes, can lead to serious consequences. Accurate forecasting of such events is therefore of substantial importance. Most existing time series forecasting models are optimized for overall performance within the prediction window, but often struggle to accurately predict extreme events, such as high temperatures or heart rate spikes. The main challenges are data imbalance and the neglect of valuable information contained in intermediate events that precede extreme events. In this paper, we propose xTime, a novel framework for extreme event forecasting in time series. xTime leverages knowledge distillation to transfer information from models trained on lower-rarity events, thereby improving prediction performance on rarer ones. In addition, we introduce a mixture of experts (MoE) mechanism that dynamically selects and fuses outputs from expert models across different rarity levels, which further improves the forecasting performance for extreme events. Experiments on multiple datasets show that xTime achieves consistent improvements, with forecasting accuracy on extreme events improving from 3% to 78%.
Authors: Mariona Jaramillo-Civill, Luis Gonz\'alez-Gudi\~no, Tales Imbiriba, Pau Closas
Abstract: Global Navigation Satellite System (GNSS) signals are vulnerable to jamming, particularly in urban areas where multipath and shadowing distort received power. Previous data-driven approaches achieved reasonable localization but poorly reconstructed the received signal strength (RSS) field due to limited spatial context. We propose a hybrid Bayesian mixture-of-experts framework that fuses a physical path-loss (PL) model and a convolutional neural network (CNN) through log-linear pooling. The PL expert ensures physical consistency, while the CNN leverages building-height maps to capture urban propagation effects. Bayesian inference with Laplace approximation provides posterior uncertainty over both the jammer position and RSS field. Experiments on urban ray-tracing data show that localization accuracy improves and uncertainty decreases with more training points, while uncertainty concentrates near the jammer and along urban canyons where propagation is most sensitive.
Authors: Jinbin Bai, Yu Lei, Hecong Wu, Yuchen Zhu, Shufan Li, Yi Xin, Xiangtai Li, Molei Tao, Aditya Grover, Ming-Hsuan Yang
Abstract: This is not a typical survey of world models; it is a guide for those who want to build worlds. We do not aim to catalog every paper that has ever mentioned a ``world model". Instead, we follow one clear road: from early masked models that unified representation learning across modalities, to unified architectures that share a single paradigm, then to interactive generative models that close the action-perception loop, and finally to memory-augmented systems that sustain consistent worlds over time. We bypass loosely related branches to focus on the core: the generative heart, the interactive loop, and the memory system. We show that this is the most promising path towards true world models.
Authors: Subham Kumar, Prakrithi Shivaprakash, Koustav Rudra, Lekhansh Shukla, Animesh Mukherjee
Abstract: Determining the appropriate locus of care for addiction patients is one of the most critical clinical decisions that affects patient treatment outcomes and effective use of resources. With a lack of sufficient specialized treatment resources, such as inpatient beds or staff, there is an unmet need to develop an automated framework for the same. Current decision-making approaches suffer from severe class imbalances in addiction datasets. To address this limitation, we propose a novel graph neural network (GRACE) framework that formalizes locus of care prediction as a structured learning problem. Further, we perform extensive feature engineering and propose a new approach of obtaining an unbiased meta-graph to train a GNN to overcome the class imbalance problem. Experimental results in real-world data show an improvement of 11-35% in terms of the F1 score of the minority class over competitive baselines. The codes and note embeddings are available at https://anonymous.4open.science/r/GRACE-F8E1/.
Authors: Georgios Mentzelopoulos, Ioannis Asmanis, Konrad P. Kording, Eva L. Dyer, Kostas Daniilidis, Flavia Vitale
Abstract: Brain-computer interfaces (BCIs) promise to enable vital functions, such as speech and prosthetic control, for individuals with neuromotor impairments. Central to their success are neural decoders, models that map neural activity to intended behavior. Current learning-based decoding approaches fall into two classes: simple, causal models that lack generalization, or complex, non-causal models that generalize and scale offline but struggle in real-time settings. Both face a common challenge, their reliance on power-hungry artificial neural network backbones, which makes integration into real-world, resource-limited systems difficult. Spiking neural networks (SNNs) offer a promising alternative. Because they operate causally these models are suitable for real-time use, and their low energy demands make them ideal for battery-constrained environments. To this end, we introduce Spikachu: a scalable, causal, and energy-efficient neural decoding framework based on SNNs. Our approach processes binned spikes directly by projecting them into a shared latent space, where spiking modules, adapted to the timing of the input, extract relevant features; these latent representations are then integrated and decoded to generate behavioral predictions. We evaluate our approach on 113 recording sessions from 6 non-human primates, totaling 43 hours of recordings. Our method outperforms causal baselines when trained on single sessions using between 2.26 and 418.81 times less energy. Furthermore, we demonstrate that scaling up training to multiple sessions and subjects improves performance and enables few-shot transfer to unseen sessions, subjects, and tasks. Overall, Spikachu introduces a scalable, online-compatible neural decoding framework based on SNNs, whose performance is competitive relative to state-of-the-art models while consuming orders of magnitude less energy.
Authors: Haozhe Shan, Sun Minni, Lea Duncker
Abstract: The ability to continually learn, retain and deploy skills to accomplish goals is a key feature of intelligent and efficient behavior. However, the neural mechanisms facilitating the continual learning and flexible (re-)composition of skills remain elusive. Here, we study continual learning and the compositional reuse of learned computations in recurrent neural network (RNN) models using a novel two-system approach: one system that infers what computation to perform, and one that implements how to perform it. We focus on a set of compositional cognitive tasks commonly studied in neuroscience. To construct the what system, we first show that a large family of tasks can be systematically described by a probabilistic generative model, where compositionality stems from a shared underlying vocabulary of discrete task epochs. The shared epoch structure makes these tasks inherently compositional. We first show that this compositionality can be systematically described by a probabilistic generative model. Furthermore, We develop an unsupervised online learning approach that can learn this model on a single-trial basis, building its vocabulary incrementally as it is exposed to new tasks, and inferring the latent epoch structure as a time-varying computational context within a trial. We implement the how system as an RNN whose low-rank components are composed according to the context inferred by the what system. Contextual inference facilitates the creation, learning, and reuse of low-rank RNN components as new tasks are introduced sequentially, enabling continual learning without catastrophic forgetting. Using an example task set, we demonstrate the efficacy and competitive performance of this two-system learning framework, its potential for forward and backward transfer, as well as fast compositional generalization to unseen tasks.
Authors: Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Kimia Ghobadi
Abstract: In this study we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models on JHFRAT assessment data and additional electronic health record (EHR) variables. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost), improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labelling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.
Authors: Daniel Sorensen, Bappaditya Dey, Minjin Hwang, Sandip Halder
Abstract: Semiconductor manufacturing is an extremely complex and precision-driven process, characterized by thousands of interdependent parameters collected across diverse tools and process steps. Multi-variate time-series analysis has emerged as a critical field for real-time monitoring and fault detection in such environments. However, anomaly prediction in semiconductor fabrication presents several critical challenges, including high dimensionality of sensor data and severe class imbalance due to the rarity of true faults. Furthermore, the complex interdependencies between variables complicate both anomaly prediction and root-cause-analysis. This paper proposes two novel approaches to advance the field from anomaly detection to anomaly prediction, an essential step toward enabling real-time process correction and proactive fault prevention. The proposed anomaly prediction framework contains two main stages: (a) training a forecasting model on a dataset assumed to contain no anomalies, and (b) performing forecast on unseen time series data. The forecast is compared with the forecast of the trained signal. Deviations beyond a predefined threshold are flagged as anomalies. The two approaches differ in the forecasting model employed. The first assumes independence between variables by utilizing the N-BEATS model for univariate time series forecasting. The second lifts this assumption by utilizing a Graph Neural Network (GNN) to capture inter-variable relationships. Both models demonstrate strong forecasting performance up to a horizon of 20 time points and maintain stable anomaly prediction up to 50 time points. The GNN consistently outperforms the N-BEATS model while requiring significantly fewer trainable parameters and lower computational cost. These results position the GNN as promising solution for online anomaly forecasting to be deployed in manufacturing environments.
Authors: Jasmine Bayrooti, Sattar Vakili, Amanda Prorok, Carl Henrik Ek
Abstract: Thompson sampling (TS) is a powerful and widely used strategy for sequential decision-making, with applications ranging from Bayesian optimization to reinforcement learning (RL). Despite its success, the theoretical foundations of TS remain limited, particularly in settings with complex temporal structure such as RL. We address this gap by establishing no-regret guarantees for TS using models with Gaussian marginal distributions. Specifically, we consider TS in episodic RL with joint Gaussian process (GP) priors over rewards and transitions. We prove a regret bound of $\mathcal{\tilde{O}}(\sqrt{KH\Gamma(KH)})$ over $K$ episodes of horizon $H$, where $\Gamma(\cdot)$ captures the complexity of the GP model. Our analysis addresses several challenges, including the non-Gaussian nature of value functions and the recursive structure of Bellman updates, and extends classical tools such as the elliptical potential lemma to multi-output settings. This work advances the understanding of TS in RL and highlights how structural assumptions and model uncertainty shape its performance in finite-horizon Markov Decision Processes.
Authors: Yujia Zheng, Zhuokai Zhao, Zijian Li, Yaqi Xie, Mingze Gao, Lizhu Zhang, Kun Zhang
Abstract: Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints, most LLM-based multi-agent systems still rely solely on natural language, exchanging tokens or their embeddings. To go beyond language, we introduce a new paradigm, thought communication, which enables agents to interact directly mind-to-mind, akin to telepathy. To uncover these latent thoughts in a principled way, we formalize the process as a general latent variable model, where agent states are generated by an unknown function of underlying thoughts. We prove that, in a nonparametric setting without auxiliary information, both shared and private latent thoughts between any pair of agents can be identified. Moreover, the global structure of thought sharing, including which agents share which thoughts and how these relationships are structured, can also be recovered with theoretical guarantees. Guided by the established theory, we develop a framework that extracts latent thoughts from all agents prior to communication and assigns each agent the relevant thoughts, along with their sharing patterns. This paradigm naturally extends beyond LLMs to all modalities, as most observational data arise from hidden generative processes. Experiments on both synthetic and real-world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication. We hope this work illuminates the potential of leveraging the hidden world, as many challenges remain unsolvable through surface-level observation alone, regardless of compute or data scale.
Authors: Tsai Hor Chan, Feng Wu, Yihang Chen, Guosheng Yin, Lequan Yu
Abstract: Developing effective multimodal fusion approaches has become increasingly essential in many real-world scenarios, such as health care and finance. The key challenge is how to preserve the feature expressiveness in each modality while learning cross-modal interactions. Previous approaches primarily focus on the cross-modal alignment, while over-emphasis on the alignment of marginal distributions of modalities may impose excess regularization and obstruct meaningful representations within each modality. The Dirichlet process (DP) mixture model is a powerful Bayesian non-parametric method that can amplify the most prominent features by its richer-gets-richer property, which allocates increasing weights to them. Inspired by this unique characteristic of DP, we propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment. Specifically, we assume that each modality follows a mixture of multivariate Gaussian distributions and further adopt DP to calculate the mixture weights for all the components. This paradigm allows DP to dynamically allocate the contributions of features and select the most prominent ones, leveraging its richer-gets-richer property, thus facilitating multimodal feature fusion. Extensive experiments on several multimodal datasets demonstrate the superior performance of our model over other competitors. Ablation analysis further validates the effectiveness of DP in aligning modality distributions and its robustness to changes in key hyperparameters. Code is anonymously available at https://github.com/HKU-MedAI/DPMM.git
Authors: Jan Sobotka, Luca Baroni, J\'an Antol\'ik
Abstract: Decoding visual stimuli from neural population activity is crucial for understanding the brain and for applications in brain-machine interfaces. However, such biological data is often scarce, particularly in primates or humans, where high-throughput recording techniques, such as two-photon imaging, remain challenging or impossible to apply. This, in turn, poses a challenge for deep learning decoding techniques. To overcome this, we introduce MEIcoder, a biologically informed decoding method that leverages neuron-specific most exciting inputs (MEIs), a structural similarity index measure loss, and adversarial training. MEIcoder achieves state-of-the-art performance in reconstructing visual stimuli from single-cell activity in primary visual cortex (V1), especially excelling on small datasets with fewer recorded neurons. Using ablation studies, we demonstrate that MEIs are the main drivers of the performance, and in scaling experiments, we show that MEIcoder can reconstruct high-fidelity natural-looking images from as few as 1,000-2,500 neurons and less than 1,000 training data points. We also propose a unified benchmark with over 160,000 samples to foster future research. Our results demonstrate the feasibility of reliable decoding in early visual system and provide practical insights for neuroscience and neuroengineering applications.
Authors: Anna M\'esz\'aros, Patrik Reizinger, Ferenc Husz\'ar
Abstract: Chess is a canonical example of a task that requires rigorous reasoning and long-term planning. Modern decision Transformers - trained similarly to LLMs - are able to learn competent gameplay, but it is unclear to what extent they truly capture the rules of chess. To investigate this, we train a 270M parameter chess Transformer and test it on out-of-distribution scenarios, designed to reveal failures of systematic generalization. Our analysis shows that Transformers exhibit compositional generalization, as evidenced by strong rule extrapolation: they adhere to fundamental syntactic rules of the game by consistently choosing valid moves even in situations very different from the training data. Moreover, they also generate high-quality moves for OOD puzzles. In a more challenging test, we evaluate the models on variants including Chess960 (Fischer Random Chess) - a variant of chess where starting positions of pieces are randomized. We found that while the model exhibits basic strategy adaptation, they are inferior to symbolic AI algorithms that perform explicit search, but gap is smaller when playing against users on Lichess. Moreover, the training dynamics revealed that the model initially learns to move only its own pieces, suggesting an emergent compositional understanding of the game.
Authors: Liang Ye, Shengqin Chen, Jiazhu Dai
Abstract: The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method targeting latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: less than 10% poisoning rate can achieves 50% attack success rate, while 24% suffices for over 80% success rate, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models of text-guided graph generation, highlight the serious risks in models' applications such as drug discovery and underscore the need for robust defenses against the backdoor attack in such diffusion models.
Authors: Shiva Sreeram, Alaa Maalouf, Pratyusha Sharma, Daniela Rus
Abstract: Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM's weight matrices can boost downstream accuracy -- without any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of matrices needs to be inspected -- eliminating the layer-by-layer sweep, (ii) The gradient of each matrix's singular values pinpoints which matrices merit reduction, (iii) Increasing the factorization search space by allowing matrices rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) we discover that evaluating on just 100 samples rather than the full training data -- both for computing the indicative gradients and for measuring the final accuracy -- suffices to further reduce the search time; we explain that as adaptation to downstream tasks is dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets -- entirely without fine-tuning.
Authors: Anthony GX-Chen, Jatin Prakash, Jeff Guo, Rob Fergus, Rajesh Ranganath
Abstract: It is commonly believed that optimizing the reverse KL divergence results in "mode seeking", while optimizing forward KL results in "mass covering", with the latter being preferred if the goal is to sample from multiple diverse modes. We show -- mathematically and empirically -- that this intuition does not necessarily transfer well to doing reinforcement learning with reverse/forward KL regularization (e.g. as commonly used with language models). Instead, the choice of reverse/forward KL determines the family of optimal target distributions, parameterized by the regularization coefficient. Mode coverage depends primarily on other factors, such as regularization strength, and relative scales between rewards and reference probabilities. Further, we show commonly used settings such as low regularization strength and equal verifiable rewards tend to specify unimodal target distributions, meaning the optimization objective is, by construction, non-diverse. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm. It makes minimal changes to reward magnitudes, yet optimizes for a target distribution which puts high probability over all high-quality sampling modes. In experiments, this simple modification works to post-train both Large Language Models and Chemical Language Models to have higher solution quality and diversity, without any external signals of diversity, and works with both forward and reverse KL when using either naively fails.
Authors: Pierre Mergny, Lenka Zdeborov\'a
Abstract: We provide a rigorous random matrix theory analysis of spiked cross-covariance models where the signals across two high-dimensional data channels are partially aligned. These models are motivated by multi-modal learning and form the standard generative setting underlying Partial Least Squares (PLS), a widely used yet theoretically underdeveloped method. We show that the leading singular values of the sample cross-covariance matrix undergo a Baik-Ben Arous-Peche (BBP)-type phase transition, and we characterize the precise thresholds for the emergence of informative components. Our results yield the first sharp asymptotic description of the signal recovery capabilities of PLS in this setting, revealing a fundamental performance gap between PLS and the Bayes-optimal estimator. In particular, we identify the SNR and correlation regimes where PLS fails to recover any signal, despite detectability being possible in principle. These findings clarify the theoretical limits of PLS and provide guidance for the design of reliable multi-modal inference methods in high dimensions.
Authors: Aueaphum Aueawattthanaphisut, Thanyanee Srichaisak, Arissa Ieochai
Abstract: A sensor-fused wearable assistance prototype for upper-limb function (triceps brachii and extensor pollicis brevis) is presented. The device integrates surface electromyography (sEMG), an inertial measurement unit (IMU), and flex/force sensors on an M5StickC plus an ESP32-S3 compute hub. Signals are band-pass and notch filtered; features (RMS, MAV, zero-crossings, and 4-12 Hz tremor-band power) are computed in 250 ms windows and fed to an INT8 TensorFlow Lite Micro model. Control commands are bounded by a control-barrier-function safety envelope and delivered within game-based tasks with lightweight personalization. In a pilot technical feasibility evaluation with healthy volunteers (n = 12) performing three ADL-oriented tasks, tremor prominence decreased (Delta TI = -0.092, 95% CI [-0.102, -0.079]), range of motion increased (+12.65%, 95% CI [+8.43, +13.89]), repetitions rose (+2.99 min^-1, 95% CI [+2.61, +3.35]), and the EMG median-frequency slope became less negative (Delta = +0.100 Hz/min, 95% CI [+0.083, +0.127]). The sensing-to-assist loop ran at 100 Hz with 8.7 ms median on-device latency, 100% session completion, and 0 device-related adverse events. These results demonstrate technical feasibility of embedded, sensor-fused assistance for upper-limb function; formal patient studies under IRB oversight are planned.
Authors: Meghna Roy Chowdhury, Yi Ding, Shreyas Sen
Abstract: Electroencephalography (EEG) plays a crucial role in brain-computer interfaces (BCIs) and neurological diagnostics, but its real-world deployment faces challenges due to noise artifacts, missing data, and high annotation costs. We introduce SSL-SE-EEG, a framework that integrates Self-Supervised Learning (SSL) with Squeeze-and-Excitation Networks (SE-Nets) to enhance feature extraction, improve noise robustness, and reduce reliance on labeled data. Unlike conventional EEG processing techniques, SSL-SE-EEG} transforms EEG signals into structured 2D image representations, suitable for deep learning. Experimental validation on MindBigData, TUH-AB, SEED-IV and BCI-IV datasets demonstrates state-of-the-art accuracy (91% in MindBigData, 85% in TUH-AB), making it well-suited for real-time BCI applications. By enabling low-power, scalable EEG processing, SSL-SE-EEG presents a promising solution for biomedical signal analysis, neural engineering, and next-generation BCIs.
Authors: Ovishake Sen, Raghav Soni, Darpan Virmani, Akshar Parekh, Patrick Lehman, Sarthak Jena, Adithi Katikhaneni, Adam Khalifa, Baibhab Chatterjee
Abstract: Brain-computer interfaces (BCIs) offer a pathway to restore communication for individuals with severe motor or speech impairments. Imagined handwriting provides an intuitive paradigm for character-level neural decoding, bridging the gap between human intention and digital communication. While invasive approaches such as electrocorticography (ECoG) achieve high accuracy, their surgical risks limit widespread adoption. Non-invasive electroencephalography (EEG) offers safer and more scalable alternatives but suffers from low signal-to-noise ratio and spatial resolution, constraining its decoding precision. This work demonstrates that advanced machine learning combined with informative EEG feature extraction can overcome these barriers, enabling real-time, high-accuracy neural decoding on portable edge devices. A 32-channel EEG dataset was collected from fifteen participants performing imagined handwriting. Signals were preprocessed with bandpass filtering and artifact subspace reconstruction, followed by extraction of 85 time-, frequency-, and graphical-domain features. A hybrid architecture, EEdGeNet, integrates a Temporal Convolutional Network with a multilayer perceptron trained on the extracted features. When deployed on an NVIDIA Jetson TX2, the system achieved 89.83 percent accuracy with 914.18 ms per-character latency. Selecting only ten key features reduced latency by 4.5 times to 202.6 ms with less than 1 percent loss in accuracy. These results establish a pathway for accurate, low-latency, and fully portable non-invasive BCIs supporting real-time communication.
Authors: Shiqi He, Yue Cui, Xinyu Ma, Yaliang Li, Bolin Ding, Mosharaf Chowdhury
Abstract: Autonomous web agents powered by large language models (LLMs) show strong potential for performing goal-oriented tasks such as information retrieval, report generation, and online transactions. These agents mark a key step toward practical embodied reasoning in open web environments. However, existing approaches remain limited in reasoning depth and efficiency: vanilla linear methods fail at multi-step reasoning and lack effective backtracking, while other search strategies are coarse-grained and computationally costly. We introduce Branch-and-Browse, a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution. It (i) employs explicit subtask management with tree-structured exploration for controllable multi-branch reasoning, (ii) bootstraps exploration through efficient web state replay with background reasoning, and (iii) leverages a page action memory to share explored actions within and across sessions. On the WebArena benchmark, Branch-and-Browse achieves a task success rate of 35.8\% and reduces execution time by up to 40.4\% relative to state-of-the-art methods. These results demonstrate that Branch-and-Browse is a reliable and efficient framework for LLM-based web agents.
Authors: Yuanhe Zhang, Ilja Kuzborskij, Jason D. Lee, Chenlei Leng, Fanghui Liu
Abstract: Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning. To address this, we propose modeling CoT as a certain rule-based stochastic process over directed acyclic graphs (DAGs), where nodes represent intermediate derivation states and edges encode rule applications. Within this framework, we introduce logical closeness, a metric that quantifies how well a model's CoT trajectory (i.e., the LLM's final output) adheres to the DAG structure, providing evaluation beyond classical PASS@k metrics. Building on this, we introduce the DAG-MATH CoT format and construct a benchmark that guides LLMs to generate CoT trajectories in this format, thereby enabling the evaluation of their reasoning ability under our framework. Across standard mathematical reasoning datasets, our analysis uncovers statistically significant differences in reasoning fidelity among representative LLM families-even when PASS@k is comparable-highlighting gaps between final-answer accuracy and rule-consistent derivation. Our framework provides a balance between free-form CoT and formal proofs systems, offering actionable diagnostics for LLMs reasoning evaluation. Our benchmark and code are available at: https://github.com/YuanheZ/DAG-MATH-Formatted-CoT.
Authors: Elizabeth Cucuzzella, Tria McNeely, Kimberly Wood, Ann B. Lee
Abstract: Accurate tropical cyclone (TC) short-term intensity forecasting with a 24-hour lead time is essential for disaster mitigation in the Atlantic TC basin. Since most TCs evolve far from land-based observing networks, satellite imagery is critical to monitoring these storms; however, these complex and high-resolution spatial structures can be challenging to qualitatively interpret in real time by forecasters. Here we propose a concise, interpretable, and descriptive approach to quantify fine TC structures with a multi-resolution analysis (MRA) by the discrete wavelet transform, enabling data analysts to identify physically meaningful structural features that strongly correlate with rapid intensity change. Furthermore, deep-learning techniques can build on this MRA for short-term intensity guidance.
Authors: Amila Indika, Igor Molybog
Abstract: Numerous knowledge workers utilize spreadsheets in business, accounting, and finance. However, a lack of systematic documentation methods for spreadsheets hinders automation, collaboration, and knowledge transfer, which risks the loss of crucial institutional knowledge. This paper introduces Spreadsheet Operations Documentation (SOD), an AI task that involves generating human-readable explanations from spreadsheet operations. Many previous studies have utilized Large Language Models (LLMs) for generating spreadsheet manipulation code; however, translating that code into natural language for SOD is a less-explored area. To address this, we present a benchmark of 111 spreadsheet manipulation code snippets, each paired with a corresponding natural language summary. We evaluate five LLMs, GPT-4o, GPT-4o-mini, LLaMA-3.3-70B, Mixtral-8x7B, and Gemma2-9B, using BLEU, GLEU, ROUGE-L, and METEOR metrics. Our findings suggest that LLMs can generate accurate spreadsheet documentation, making SOD a feasible prerequisite step toward enhancing reproducibility, maintainability, and collaborative workflows in spreadsheets, although there are challenges that need to be addressed.
Authors: Md Ashad Alam, Md Amanullah
Abstract: Diabetes mellitus is a chronic metabolic disorder that necessitates novel therapeutic innovations due to its gradual progression and the onset of various metabolic complications. Research indicates that Ficus religiosa is a conventional medicinal plant that generates bioactive phytochemicals with potential antidiabetic properties. The investigation employs ecosystem-based computational approaches utilizing artificial intelligence to investigate and evaluate compounds derived from Ficus religiosa that exhibit antidiabetic properties. A comprehensive computational procedure incorporated machine learning methodologies, molecular docking techniques, and ADMET prediction systems to assess phytochemical efficacy against the significant antidiabetic enzyme dipeptidyl peptidase-4 (DPP-4). DeepBindGCN and the AutoDock software facilitated the investigation of binding interactions via deep learning technology. Flavonoids and alkaloids have emerged as attractive phytochemicals due to their strong binding interactions and advantageous pharmacological effects, as indicated by the study. The introduction of AI accelerated screening procedures and enhanced accuracy rates, demonstrating its efficacy in researching plant-based antidiabetic agents. The scientific foundation now facilitates future experimental validation of natural product therapies tailored for diabetic management.
Authors: Md Selim Reza, Sabrin Afroz, Mostafizer Rahman, Md Ashad Alam
Abstract: Multi-omics data integration is crucial for understanding complex diseases, yet limited sample sizes, noise, and heterogeneity often reduce predictive power. To address these challenges, we introduce Omics-GAN, a Generative Adversarial Network (GAN)-based framework designed to generate high-quality synthetic multi-omics profiles while preserving biological relationships. We evaluated Omics-GAN on three omics types (mRNA, miRNA, and DNA methylation) using the ROSMAP cohort for Alzheimer's disease (AD) and TCGA datasets for colon and liver cancer. A support vector machine (SVM) classifier with repeated 5-fold cross-validation demonstrated that synthetic datasets consistently improved prediction accuracy compared to original omics profiles. The AUC of SVM for mRNA improved from 0.72 to 0.74 in AD, and from 0.68 to 0.72 in liver cancer. Synthetic miRNA enhanced classification in colon cancer from 0.59 to 0.69, while synthetic methylation data improved performance in liver cancer from 0.64 to 0.71. Boxplot analyses confirmed that synthetic data preserved statistical distributions while reducing noise and outliers. Feature selection identified significant genes overlapping with original datasets and revealed additional candidates validated by GO and KEGG enrichment analyses. Finally, molecular docking highlighted potential drug repurposing candidates, including Nilotinib for AD, Atovaquone for liver cancer, and Tecovirimat for colon cancer. Omics-GAN enhances disease prediction, preserves biological fidelity, and accelerates biomarker and drug discovery, offering a scalable strategy for precision medicine applications.
Authors: Benedetta Tessa, Alejandro Moreo, Stefano Cresci, Tiziano Fagni, Fabrizio Sebastiani
Abstract: Accurately estimating how users respond to moderation interventions is paramount for developing effective and user-centred moderation strategies. However, this requires a clear understanding of which user characteristics are associated with different behavioural responses, which is the goal of this work. We investigate the informativeness of 753 socio-behavioural, linguistic, relational, and psychological features, in predicting the behavioural changes of 16.8K users affected by a major moderation intervention on Reddit. To reach this goal, we frame the problem in terms of "quantification", a task well-suited to estimating shifts in aggregate user behaviour. We then apply a greedy feature selection strategy with the double goal of (i) identifying the features that are most predictive of changes in user activity, toxicity, and participation diversity, and (ii) estimating their importance. Our results allow identifying a small set of features that are consistently informative across all tasks, and determining that many others are either task-specific or of limited utility altogether. We also find that predictive performance varies according to the task, with changes in activity and toxicity being easier to estimate than changes in diversity. Overall, our results pave the way for the development of accurate systems that predict user reactions to moderation interventions. Furthermore, our findings highlight the complexity of post-moderation user behaviour, and indicate that effective moderation should be tailored not only to user traits but also to the specific objective of the intervention.
Authors: T\'elio Cropsal, Roc\'io Mercado
Abstract: High-throughput phenotypic screens generate vast microscopy image datasets that push the limits of generative models due to their large dimensionality. Despite the growing popularity of general-purpose models trained on natural images for microscopy data analysis, their suitability in this domain has not been quantitatively demonstrated. We present the first systematic evaluation of Stable Diffusion's variational autoencoder (SD-VAE) for reconstructing Cell Painting images, assessing performance across a large dataset with diverse molecular perturbations and cell types. We find that SD-VAE reconstructions preserve phenotypic signals with minimal loss, supporting its use in microscopy workflows. To benchmark reconstruction quality, we compare pixel-level, embedding-based, latent-space, and retrieval-based metrics for a biologically informed evaluation. We show that general-purpose feature extractors like InceptionV3 match or surpass publicly available bespoke models in retrieval tasks, simplifying future pipelines. Our findings offer practical guidelines for evaluating generative models on microscopy data and support the use of off-the-shelf models in phenotypic drug discovery.
Authors: Jan Zelinka, Oliver Kost, Marek Hr\'uz
Abstract: We present a data generation framework designed to simulate spoofing attacks and randomly place attack scenarios worldwide. We apply deep neural network-based models for spoofing detection, utilizing Long Short-Term Memory networks and Transformer-inspired architectures. These models are specifically designed for online detection and are trained using the generated dataset. Our results demonstrate that deep learning models can accurately distinguish spoofed signals from genuine ones, achieving high detection performance. The best results are achieved by Transformer-inspired architectures with early fusion of the inputs resulting in an error rate of 0.16%.
Authors: Jackson Hassell, Dan Zhang, Hannah Kim, Tom Mitchell, Estevam Hruschka
Abstract: We investigate how agents built on pretrained large language models can learn target classification functions from labeled examples without parameter updates. While conventional approaches like fine-tuning are often costly, inflexible, and opaque, we propose a memory-augmented framework that leverages both labeled data and LLM-generated critiques. Our framework uses episodic memory to store instance-level critiques-capturing specific past experiences-and semantic memory to distill these into reusable, task-level guidance. Across a diverse set of tasks, incorporating critiques yields up to a 24.8 percent accuracy improvement over retrieval-based (RAG-style) baselines that rely only on labels. Through extensive empirical evaluation, we uncover distinct behavioral differences between OpenAI and opensource models, particularly in how they handle fact-oriented versus preference-based data. To interpret how models respond to different representations of supervision encoded in memory, we introduce a novel metric, suggestibility. This helps explain observed behaviors and illuminates how model characteristics and memory strategies jointly shape learning dynamics. Our findings highlight the promise of memory-driven, reflective learning for building more adaptive and interpretable LLM agents.
Authors: Joseph Meyer, Divyansha Lachi, Reza Mohammadi, Roshan Reddy Upendra, Eva L. Dyer, Mark Li, Tom Palczewski
Abstract: Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
Authors: Le Ren, Xiangjian Zeng, Qingqiang Wu, Ruoxuan Liang
Abstract: Lyric translation is a challenging task that requires balancing multiple musical constraints. Existing methods often rely on hand-crafted rules and sentence-level modeling, which restrict their ability to internalize musical-linguistic patterns and to generalize effectively at the paragraph level, where cross-line coherence and global rhyme are crucial. In this work, we propose LyriCAR, a novel framework for controllable lyric translation that operates in a fully unsupervised manner. LyriCAR introduces a difficulty-aware curriculum designer and an adaptive curriculum strategy, ensuring efficient allocation of training resources, accelerating convergence, and improving overall translation quality by guiding the model with increasingly complex challenges. Extensive experiments on the EN-ZH lyric translation task show that LyriCAR achieves state-of-the-art results across both standard translation metrics and multi-dimensional reward scores, surpassing strong baselines. Notably, the adaptive curriculum strategy reduces training steps by nearly 40% while maintaining superior performance. Code, data and model can be accessed at https://github.com/rle27/LyriCAR.
Authors: Marc Amor\'os-Trepat, Luis Medrano-Navarro, Qiang Liu, Luca Guastoni, Nils Thuerey
Abstract: The reconstruction of unsteady flow fields from limited measurements is a challenging and crucial task for many engineering applications. Machine learning models are gaining popularity in solving this problem due to their ability to learn complex patterns from data and generalize across diverse conditions. Among these, diffusion models have emerged as particularly powerful in generative tasks, producing high-quality samples by iteratively refining noisy inputs. In contrast to other methods, these generative models are capable of reconstructing the smallest scales of the fluid spectrum. In this work, we introduce a novel sampling method for diffusion models that enables the reconstruction of high-fidelity samples by guiding the reverse process using the available sparse data. Moreover, we enhance the reconstructions with available physics knowledge using a conflict-free update method during training. To evaluate the effectiveness of our method, we conduct experiments on 2 and 3-dimensional turbulent flow data. Our method consistently outperforms other diffusion-based methods in predicting the fluid's structure and in pixel-wise accuracy. This study underscores the remarkable potential of diffusion models in reconstructing flow field data, paving the way for their application in Computational Fluid Dynamics research.
Authors: Tushar Nayan (Florida International University), Ziqi Zhang (University of Illinois Urbana-Champaign), Ruimin Sun (Florida International University)
Abstract: With the increasing deployment of Large Language Models (LLMs) on mobile and edge platforms, securing them against model extraction attacks has become a pressing concern. However, protecting model privacy without sacrificing the performance benefits of untrusted AI accelerators, such as GPUs, presents a challenging trade-off. In this paper, we initiate the study of high-performance execution on LLMs and present SecureInfer, a hybrid framework that leverages a heterogeneous Trusted Execution Environments (TEEs)-GPU architecture to isolate privacy-critical components while offloading compute-intensive operations to untrusted accelerators. Building upon an outsourcing scheme, SecureInfer adopts an information-theoretic and threat-informed partitioning strategy: security-sensitive components, including non-linear layers, projection of attention head, FNN transformations, and LoRA adapters, are executed inside an SGX enclave, while other linear operations (matrix multiplication) are performed on the GPU after encryption and are securely restored within the enclave. We implement a prototype of SecureInfer using the LLaMA-2 model and evaluate it across performance and security metrics. Our results show that SecureInfer offers strong security guarantees with reasonable performance, offering a practical solution for secure on-device model inference.
Authors: Yixiao Wang, Zishan Shao, Ting Jiang, Aditya Devarakonda
Abstract: We present a novel enhanced cyclic coordinate descent (ECCD) framework for solving generalized linear models with elastic net constraints that reduces training time in comparison to existing state-of-the-art methods. We redesign the CD method by performing a Taylor expansion around the current iterate to avoid nonlinear operations arising in the gradient computation. By introducing this approximation, we are able to unroll the vector recurrences occurring in the CD method and reformulate the resulting computations into more efficient batched computations. We show empirically that the recurrence can be unrolled by a tunable integer parameter, $s$, such that $s > 1$ yields performance improvements without affecting convergence, whereas $s = 1$ yields the original CD method. A key advantage of ECCD is that it avoids the convergence delay and numerical instability exhibited by block coordinate descent. Finally, we implement our proposed method in C++ using Eigen to accelerate linear algebra computations. Comparison of our method against existing state-of-the-art solvers shows consistent performance improvements of $3\times$ in average for regularization path variant on diverse benchmark datasets. Our implementation is available at https://github.com/Yixiao-Wang-Stats/ECCD.
Authors: Kushan Choudhury, Shubhrodeep Roy, Ankur Chanda, Shubhajit Biswas, Somenath Kuiry
Abstract: Deep learning models, especially convolutional neural networks, have achieved impressive results in medical image classification. However, these models often produce overconfident predictions, which can undermine their reliability in critical healthcare settings. While traditional label smoothing offers a simple way to reduce such overconfidence, it fails to consider relationships between classes by treating all non-target classes equally. In this study, we explore the use of Online Label Smoothing (OLS), a dynamic approach that adjusts soft labels throughout training based on the model's own prediction patterns. We evaluate OLS on the large-scale RadImageNet dataset using three widely used architectures: ResNet-50, MobileNetV2, and VGG-19. Our results show that OLS consistently improves both Top-1 and Top-5 classification accuracy compared to standard training methods, including hard labels, conventional label smoothing, and teacher-free knowledge distillation. In addition to accuracy gains, OLS leads to more compact and well-separated feature embeddings, indicating improved representation learning. These findings suggest that OLS not only strengthens predictive performance but also enhances calibration, making it a practical and effective solution for developing trustworthy AI systems in the medical imaging domain.
Authors: Dena Firoozi, Anastasis Kratsios, Xuwei Yang
Abstract: Traditional mean-field game (MFG) solvers operate on an instance-by-instance basis, which becomes infeasible when many related problems must be solved (e.g., for seeking a robust description of the solution under perturbations of the dynamics or utilities, or in settings involving continuum-parameterized agents.). We overcome this by training neural operators (NOs) to learn the rules-to-equilibrium map from the problem data (``rules'': dynamics and cost functionals) of LQ MFGs defined on separable Hilbert spaces to the corresponding equilibrium strategy. Our main result is a statistical guarantee: an NO trained on a small number of randomly sampled rules reliably solves unseen LQ MFG variants, even in infinite-dimensional settings. The number of NO parameters needed remains controlled under appropriate rule sampling during training. Our guarantee follows from three results: (i) local-Lipschitz estimates for the highly nonlinear rules-to-equilibrium map; (ii) a universal approximation theorem using NOs with a prespecified Lipschitz regularity (unlike traditional NO results where the NO's Lipschitz constant can diverge as the approximation error vanishes); and (iii) new sample-complexity bounds for $L$-Lipschitz learners in infinite dimensions, directly applicable as the Lipschitz constants of our approximating NOs are controlled in (ii).
Authors: Liron Mor Yosef, Haim Avron
Abstract: Over a decade ago, it was demonstrated that quantum computing has the potential to revolutionize numerical linear algebra by enabling algorithms with complexity superior to what is classically achievable, e.g., the seminal HHL algorithm for solving linear systems. Efficient execution of such algorithms critically depends on representing inputs (matrices and vectors) as quantum circuits that encode or implement these inputs. For that task, two common circuit representations emerged in the literature: block encodings and state preparation circuits. In this paper, we systematically study encodings matrices in the form of block encodings and state preparation circuits. We examine methods for constructing these representations from matrices given in classical form, as well as quantum two-way conversions between circuit representations. Two key results we establish (among others) are: (a) a general method for efficiently constructing a block encoding of an arbitrary matrix given in classical form (entries stored in classical random access memory); and (b) low-overhead, bidirectional conversion algorithms between block encodings and state preparation circuits, showing that these models are essentially equivalent. From a technical perspective, two central components of our constructions are: (i) a special constant-depth multiplexer that simultaneously multiplexes all higher-order Pauli matrices of a given size, and (ii) an algorithm for performing a quantum conversion between a matrix's expansion in the standard basis and its expansion in the basis of higher-order Pauli matrices.
Authors: Thibault Vatter, Thomas Nagler
Abstract: Vine copulas offer flexible multivariate dependence modeling and have become widely used in machine learning, yet structure learning remains a key challenge. Early heuristics like the greedy algorithm of Dissmann are still considered the gold standard, but often suboptimal. We propose random search algorithms that improve structure selection and a statistical framework based on model confidence sets, which provides theoretical guarantees on selection probabilities and a powerful foundation for ensembling. Empirical results on several real-world data sets show that our methods consistently outperform state-of-the-art approaches.
Authors: Nafis Chowdhury, Moinul Haque, Anika Ahmed, Nazia Tasnim, Md. Istiak Hossain Shihab, Sajjadur Rahman, Farig Sadeque
Abstract: Recent progress in NLP research has demonstrated remarkable capabilities of large language models (LLMs) across a wide range of tasks. While recent multilingual benchmarks have advanced cultural evaluation for LLMs, critical gaps remain in capturing the nuances of low-resource cultures. Our work addresses these limitations through a Bengali Language Cultural Knowledge (BLanCK) dataset including folk traditions, culinary arts, and regional dialects. Our investigation of several multilingual language models shows that while these models perform well in non-cultural categories, they struggle significantly with cultural knowledge and performance improves substantially across all models when context is provided, emphasizing context-aware architectures and culturally curated training data.
Authors: Hashem Omrani, Raha Imanirad, Adam Diamant, Utkarsh Verma, Amol Verma, Fahad Razak
Abstract: We propose an approach for dynamic efficiency evaluation across multiple organizational dimensions using data envelopment analysis (DEA). The method generates both dimension-specific and aggregate efficiency scores, incorporates desirable and undesirable outputs, and is suitable for large-scale problem settings. Two regularized DEA models are introduced: a slack-based measure (SBM) and a linearized version of a nonlinear goal programming model (GP-SBM). While SBM estimates an aggregate efficiency score and then distributes it across dimensions, GP-SBM first estimates dimension-level efficiencies and then derives an aggregate score. Both models utilize a regularization parameter to enhance discriminatory power while also directly integrating both desirable and undesirable outputs. We demonstrate the computational efficiency and validity of our approach on multiple datasets and apply it to a case study of twelve hospitals in Ontario, Canada, evaluating three theoretically grounded dimensions of organizational effectiveness over a 24-month period from January 2018 to December 2019: technical efficiency, clinical efficiency, and patient experience. Our numerical results show that SBM and GP-SBM better capture correlations among input/output variables and outperform conventional benchmarking methods that separately evaluate dimensions before aggregation.
Authors: Antonio Norelli, Michael Bronstein
Abstract: A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present a simple and efficient protocol to achieve it. We show that even modest 8-billion-parameter open-source LLMs are sufficient to obtain high-quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.
Authors: Krishnakumar Balasubramanian, Sayan Banerjee, Philippe Rigollet
Abstract: We study stationary solutions of McKean-Vlasov equations on the circle. Our main contributions stem from observing an exact equivalence between solutions of the stationary McKean-Vlasov equation and an infinite-dimensional quadratic system of equations over Fourier coefficients, which allows explicit characterization of the stationary states in a sequence space rather than a function space. This framework provides a transparent description of local bifurcations, characterizing their periodicity, and resonance structures, while accommodating singular potentials. We derive analytic expressions that characterize the emergence, form and shape (supercritical, critical, subcritical or transcritical) of bifurcations involving possibly multiple Fourier modes and connect them with discontinuous phase transitions. We also characterize, under suitable assumptions, the detailed structure of the stationary bifurcating solutions that are accurate upto an arbitrary number of Fourier modes. At the global level, we establish regularity and concavity properties of the free energy landscape, proving existence, compactness, and coexistence of globally minimizing stationary measures, further identifying discontinuous phase transitions with points of non-differentiability of the minimum free energy map. As an application, we specialize the theory to the Noisy Mean-Field Transformer model, where we show how changing the inverse temperature parameter $\beta$ affects the geometry of the infinitely many bifurcations from the uniform measure. We also explain how increasing $\beta$ can lead to a rich class of approximate multi-mode stationary solutions which can be seen as `metastable states'. Further, a sharp transition from continuous to discontinuous (first-order) phase behavior is observed as $\beta$ increases.
Authors: Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
Abstract: This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BIOCAP (i.e., BIOCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.
Authors: Rishabh Dey, Michael Brocidiacono, Kushal Koirala, Alexander Tropsha, Konstantin I. Popov
Abstract: The implicit solvent approach offers a computationally efficient framework to model solvation effects in molecular simulations. However, its accuracy often falls short compared to explicit solvent models, limiting its use in precise thermodynamic calculations. Recent advancements in machine learning (ML) present an opportunity to overcome these limitations by leveraging neural networks to develop more precise implicit solvent potentials for diverse applications. A major drawback of current ML-based methods is their reliance on force-matching alone, which can lead to energy predictions that differ by an arbitrary constant and are therefore unsuitable for absolute free energy comparisons. Here, we introduce a novel methodology with a graph neural network (GNN)-based implicit solvent model, dubbed Lambda Solvation Neural Network (LSNN). In addition to force-matching, this network was trained to match the derivatives of alchemical variables, ensuring that solvation free energies can be meaningfully compared across chemical species.. Trained on a dataset of approximately 300,000 small molecules, LSNN achieves free energy predictions with accuracy comparable to explicit-solvent alchemical simulations, while offering a computational speedup and establishing a foundational framework for future applications in drug discovery.
Authors: Huawei Bai, Yifan Huang, Wenqi Shi, Ansheng You, Feifan Shao, Tengfei Han, Minghui Yu
Abstract: The training efficiency and scalability of language models on massive clusters currently remain a critical bottleneck. Mainstream approaches like ND parallelism are often cumbersome and complex, while flexible alternatives such as the Zero Redundancy Optimizer (ZeRO) are frequently hampered by communication overhead. In this paper, we propose Asynchronous Hierarchical Zero Parallelism (AsyncHZP), a novel asynchronous variant of ZeRO designed to achieve superior performance while maintaining simplicity and memory efficiency. Unlike traditional ZeRO, which employs over-fine-grained sharding that can lead to inefficient communication, AsyncHZP adaptively reshards parameters, gradients, and optimizer states across different replica groups. This strategy optimizes device memory utilization and significantly reduces communication overhead. In addition, we also design a multi-stream asynchronous scheduling method that executes parameter all-gather and gradient reduce-scatter operations in dedicated background threads, effectively overlapping communication with computation while incurring negligible memory fragmentation. Empirical evaluations on both Dense and Mixture-of-Experts (MoE) models confirm that AsyncHZP maintains robust stability at scale. It consistently outperforms classic ND parallelism, achieving state-of-the-art performance without complex strategic tuning, thereby simplifying the path to efficient large-scale training.
Authors: Somayajulu L. N. Dhulipala, Deep Ray, Nicholas Forman
Abstract: Simulating coupled PDE systems is computationally intensive, and prior efforts have largely focused on training surrogates on the joint (coupled) data, which requires a large amount of data. In the paper, we study compositional diffusion approaches where diffusion models are only trained on the decoupled PDE data and are composed at inference time to recover the coupled field. Specifically, we investigate whether the compositional strategy can be feasible under long time horizons involving a large number of time steps. In addition, we compare a baseline diffusion model with that trained using the v-parameterization strategy. We also introduce a symmetric compositional scheme for the coupled fields based on the Euler scheme. We evaluate on Reaction-Diffusion and modified Burgers with longer time grids, and benchmark against a Fourier Neural Operator trained on coupled data. Despite seeing only decoupled training data, the compositional diffusion models recover coupled trajectories with low error. v-parameterization can improve accuracy over a baseline diffusion model, while the neural operator surrogate remains strongest given that it is trained on the coupled data. These results show that compositional diffusion is a viable strategy towards efficient, long-horizon modeling of coupled PDEs.
Authors: Rahul Raja, Arpita Vats
Abstract: Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.
Authors: Zhenning Yang, Hui Guan, Victor Nicolet, Brandon Paulsen, Joey Dodds, Daniel Kroening, Ang Chen
Abstract: Cloud infrastructure is managed through a mix of interfaces -- traditionally, cloud consoles, command-line interfaces (CLI), and SDKs are the tools of choice. Recently, Infrastructure-as-Code/IaC frameworks (e.g., Terraform) have quickly gained popularity. Unlike conventional tools, IaC~frameworks encode the infrastructure in a "source-of-truth" configuration. They are capable of automatically carrying out modifications to the cloud -- deploying, updating, or destroying resources -- to bring the actual infrastructure into alignment with the IaC configuration. However, when IaC is used alongside consoles, CLIs, or SDKs, it loses visibility into external changes, causing infrastructure drift, where the configuration becomes outdated, and later IaC operations may undo valid updates or trigger errors. We present NSync, an automated system for IaC reconciliation that propagates out-of-band changes back into the IaC program. Our key insight is that infrastructure changes eventually all occur via cloud API invocations -- the lowest layer for cloud management operations. NSync gleans insights from API traces to detect drift (i.e., non-IaC changes) and reconcile it (i.e., update the IaC configuration to capture the changes). It employs an agentic architecture that leverages LLMs to infer high-level intents from noisy API sequences, synthesize targeted IaC updates using specialized tools, and continually improve through a self-evolving knowledge base of past reconciliations. We further introduce a novel evaluation pipeline for injecting realistic drifts into cloud infrastructure and assessing reconciliation performance. Experiments across five real-world Terraform projects and 372 drift scenarios show that NSync outperforms the baseline both in terms of accuracy (from 0.71 to 0.97 pass@3) and token efficiency (1.47$\times$ improvement).
Authors: Minseok Kang, Minhyeok Lee, Minjung Kim, Donghyeong Kim, Sangyoun Lee
Abstract: Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been progressed by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during crossmodal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models overly rely on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local semantics by routing the [EOS] token through a sentence-level path and clustering word tokens into phrase-level units for localized grounding. Our method introduces (1) tokenrole- aware cross modal interaction strategies that align video features with sentence-level and phrase-level semantics in a structurally disentangled manner, and (2) a joint modeling framework that not only improves global sentence-level alignment but also enhances finegrained temporal grounding by leveraging structured phrase-aware context. This design allows the model to capture both coarse and localized semantics, enabling more expressive and context-aware video grounding. DualGround achieves state-of-the-art performance on both Moment Retrieval and Highlight Detection tasks across QVHighlights and Charades- STA benchmarks, demonstrating the effectiveness of disentangled semantic modeling in video-language alignment.
Authors: Guowei Zhong, Junjie Li, Huaiyu Zhu, Ruohong Huan, Yun Pan
Abstract: In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CMC.
Authors: Sauptik Dhar, Naveen Ramakrishnan, Michelle Munson
Abstract: Large Vision Language models have seen huge application in several sports use-cases recently. Most of these works have been targeted towards a limited subset of popular sports like soccer, cricket, basketball etc; focusing on generative tasks like visual question answering, highlight generation. This work analyzes the applicability of the modern video foundation models (both encoder and decoder) for a very niche but hugely popular dance sports - breakdance. Our results show that Video Encoder models continue to outperform state-of-the-art Video Language Models for prediction tasks. We provide insights on how to choose the encoder model and provide a thorough analysis into the workings of a finetuned decoder model for breakdance video classification.
Authors: Wu Yichao, Wang Yirui, Ding Panpan, Wang Hailong, Zhu Bingqian, Liu Chun
Abstract: With the wide application of deep reinforcement learning (DRL) techniques in complex fields such as autonomous driving, intelligent manufacturing, and smart healthcare, how to improve its security and robustness in dynamic and changeable environments has become a core issue in current research. Especially in the face of adversarial attacks, DRL may suffer serious performance degradation or even make potentially dangerous decisions, so it is crucial to ensure their stability in security-sensitive scenarios. In this paper, we first introduce the basic framework of DRL and analyze the main security challenges faced in complex and changing environments. In addition, this paper proposes an adversarial attack classification framework based on perturbation type and attack target and reviews the mainstream adversarial attack methods against DRL in detail, including various attack methods such as perturbation state space, action space, reward function and model space. To effectively counter the attacks, this paper systematically summarizes various current robustness training strategies, including adversarial training, competitive training, robust learning, adversarial detection, defense distillation and other related defense techniques, we also discuss the advantages and shortcomings of these methods in improving the robustness of DRL. Finally, this paper looks into the future research direction of DRL in adversarial environments, emphasizing the research needs in terms of improving generalization, reducing computational complexity, and enhancing scalability and explainability, aiming to provide valuable references and directions for researchers.
Authors: Ajay Sridhar, Jennifer Pan, Satvik Sharma, Chelsea Finn
Abstract: Humans routinely rely on memory to perform tasks, yet most robot policies lack this capability; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminate subsampling of history leads to irrelevant or redundant information. We propose a hierarchical policy framework, where the high-level policy is trained to select and track previous relevant keyframes from its experience. The high-level policy uses selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. In our experiments, we finetune Qwen2.5-VL-7B-Instruct and $\pi_{0.5}$ as the high-level and low-level policies respectively, using demonstrations supplemented with minimal language annotations. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory. Videos and code can be found at https://jen-pan.github.io/memer/.
Authors: A. P. Kryukov, A. Yu. Razumov, A. P. Demichev, J. J. Dubenskaya, E. O. Gres, S. P. Polyakov, E. B. Postnikov, P. A. Volchugov, D. P. Zhurov
Abstract: The objective of this work is to develop a method for detecting rare gamma quanta against the background of charged particles in the fluxes from sources in the Universe with the help of the deep learning and normalizing flows based method designed for anomaly detection. It is shown that the suggested method has a potential for the gamma detection. The method was tested on model data from the TAIGA-IACT experiment. The obtained quantitative performance indicators are still inferior to other approaches, and therefore possible ways to improve the implementation of the method are proposed.
Authors: D. Kucharski, A. Gaska, T. Kowaluk, K. Stepien, M. Repalska, B. Gapinski, M. Wieczorowski, M. Nawotka, P. Sobecki, P. Sosinowski, J. Tomasik, A. Wojtowicz
Abstract: A reproducible deep learning framework is presented for surface metrology to predict surface texture parameters together with their reported standard uncertainties. Using a multi-instrument dataset spanning tactile and optical systems, measurement system type classification is addressed alongside coordinated regression of Ra, Rz, RONt and their uncertainty targets (Ra_uncert, Rz_uncert, RONt_uncert). Uncertainty is modelled via quantile and heteroscedastic heads with post-hoc conformal calibration to yield calibrated intervals. On a held-out set, high fidelity was achieved by single-target regressors (R2: Ra 0.9824, Rz 0.9847, RONt 0.9918), with two uncertainty targets also well modelled (Ra_uncert 0.9899, Rz_uncert 0.9955); RONt_uncert remained difficult (R2 0.4934). The classifier reached 92.85% accuracy and probability calibration was essentially unchanged after temperature scaling (ECE 0.00504 -> 0.00503 on the test split). Negative transfer was observed for naive multi-output trunks, with single-target models performing better. These results provide calibrated predictions suitable to inform instrument selection and acceptance decisions in metrological workflows.
Authors: Wei Cao, Shanshan Wang
Abstract: Expectile regression neural networks (ERNNs) are powerful tools for capturing heterogeneity and complex nonlinear structures in data. However, most existing research has primarily focused on fully observed data, with limited attention paid to scenarios involving censored observations. In this paper, we propose a data augmentation based ERNNs algorithm, termed DAERNN, for modeling heterogeneous censored data. The proposed DAERNN is fully data driven, requires minimal assumptions, and offers substantial flexibility. Simulation studies and real data applications demonstrate that DAERNN outperforms existing censored ERNNs methods and achieves predictive performance comparable to models trained on fully observed data. Moreover, the algorithm provides a unified framework for handling various censoring mechanisms without requiring explicit parametric model specification, thereby enhancing its applicability to practical censored data analysis.
Authors: Aritra Roy, Enrico Grisan, John Buckeridge, Chiara Gattinoni
Abstract: Since the advent of various pre-trained large language models, extracting structured knowledge from scientific text has experienced a revolutionary change compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that allow users to construct, validate, and visualise datasets from scientific literature extraction remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles for comprehensive database creation. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models, to extract highly complex compositions associated with ceramic piezoelectric materials and corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82. This framework provides a simple, user-friendly, readily-usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.
Authors: Andr\'as R\'acz, Tam\'as Borsos, Andr\'as Veres, Benedek Csala
Abstract: We present AttDet, a Transformer-inspired MIMO (Multiple Input Multiple Output) detection method that treats each transmit layer as a token and learns inter-stream interference via a lightweight self-attention mechanism. Queries and keys are derived directly from the estimated channel matrix, so attention scores quantify channel correlation. Values are initialized by matched-filter outputs and iteratively refined. The AttDet design combines model-based interpretability with data-driven flexibility. We demonstrate through link-level simulations under realistic 5G channel models and high-order, mixed QAM modulation and coding schemes, that AttDet can approach near-optimal BER/BLER (Bit Error Rate/Block Error Rate) performance while maintaining predictable, polynomial complexity.
Authors: Lucas Darius Konrad, Nikolas Kuschnig
Abstract: Small subsets of data with disproportionate influence on model outcomes can have dramatic impacts on conclusions, with a few data points sometimes overturning key findings. While recent work has developed methods to identify these \emph{most influential sets}, no formal theory exists to determine when their influence reflects genuine problems rather than natural sampling variation. We address this gap by developing a principled framework for assessing the statistical significance of most influential sets. Our theoretical results characterize the extreme value distributions of maximal influence and enable rigorous hypothesis tests for excessive influence, replacing current ad-hoc sensitivity checks. We demonstrate the practical value of our approach through applications across economics, biology, and machine learning benchmarks.
Authors: Xiaogang Jia, Qian Wang, Anrui Wang, Han A. Wang, Bal\'azs Gyenes, Emiliyan Gospodinov, Xinkai Jiang, Ge Li, Hongyi Zhou, Weiran Liao, Xi Huang, Maximilian Beck, Moritz Reuss, Rudolf Lioutikov, Gerhard Neumann
Abstract: Robotic manipulation systems benefit from complementary sensing modalities, where each provides unique environmental information. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks, which RGB methods lack geometric awareness, which hinders their precision and generalization. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points without downsampling. The resulting data type makes it easier to extract shape and spatial relationships from observations, and can be transformed between reference frames. Yet due to their structure in a regular grid, we enable the use of established computer vision techniques directly to 3D data. Using xLSTM as a backbone, our model efficiently fuses the point maps with RGB data for enhanced multi-modal perception. Through extensive experiments on the RoboCasa and CALVIN benchmarks and real robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks. The overview and demos are available on our project page: https://point-map.github.io/Point-Map/
Authors: Eulalie Boucher, Mihai Alexe, Peter Lean, Ewan Pinnington, Simon Lang, Patrick Laloyaux, Lorenzo Zampieri, Patricia de Rosnay, Niels Bormann, Anthony McNally
Abstract: Interactions between different components of the Earth System (e.g. ocean, atmosphere, land and cryosphere) are a crucial driver of global weather patterns. Modern Numerical Weather Prediction (NWP) systems typically run separate models of the different components, explicitly coupled across their interfaces to additionally model exchanges between the different components. Accurately representing these coupled interactions remains a major scientific and technical challenge of weather forecasting. GraphDOP is a graph-based machine learning model that learns to forecast weather directly from raw satellite and in-situ observations, without reliance on reanalysis products or traditional physics-based NWP models. GraphDOP simultaneously embeds information from diverse observation sources spanning the full Earth system into a shared latent space. This enables predictions that implicitly capture cross-domain interactions in a single model without the need for any explicit coupling. Here we present a selection of case studies which illustrate the capability of GraphDOP to forecast events where coupled processes play a particularly key role. These include rapid sea-ice freezing in the Arctic, mixing-induced ocean surface cooling during Hurricane Ian and the severe European heat wave of 2022. The results suggest that learning directly from Earth System observations can successfully characterise and propagate cross-component interactions, offering a promising path towards physically consistent end-to-end data-driven Earth System prediction with a single model.
Authors: David Stein, Bjoern Andres, Silvia Di Gregorio
Abstract: The higher-order correlation clustering problem for a graph $G$ and costs associated with cliques of $G$ consists in finding a clustering of $G$ so as to minimize the sum of the costs of those cliques whose nodes all belong to the same cluster. To tackle this NP-hard problem in practice, local search heuristics have been proposed and studied in the context of applications. Here, we establish partial optimality conditions for cubic correlation clustering, i.e., for the special case of at most 3-cliques. We define and implement algorithms for deciding these conditions and examine their effectiveness numerically, on two data sets.
Authors: Federico Lozano-Cuadra, Beatriz Soret, Marc Sanchez Net, Abhishek Cauligi, Federico Rossi
Abstract: We present a fully decentralized routing framework for multi-robot exploration missions operating under the constraints of a Lunar Delay-Tolerant Network (LDTN). In this setting, autonomous rovers must relay collected data to a lander under intermittent connectivity and unknown mobility patterns. We formulate the problem as a Partially Observable Markov Decision Problem (POMDP) and propose a Graph Attention-based Multi-Agent Reinforcement Learning (GAT-MARL) policy that performs Centralized Training, Decentralized Execution (CTDE). Our method relies only on local observations and does not require global topology updates or packet replication, unlike classical approaches such as shortest path and controlled flooding-based algorithms. Through Monte Carlo simulations in randomized exploration environments, GAT-MARL provides higher delivery rates, no duplications, and fewer packet losses, and is able to leverage short-term mobility forecasts; offering a scalable solution for future space robotic systems for planetary exploration, as demonstrated by successful generalization to larger rover teams.
Authors: Shehu AbdusSalam, Steven Abel, Deaglan Bartlett, Miguel Crispim Rom\~ao
Abstract: We demonstrate the efficacy of symbolic regression (SR) to probe models of particle physics Beyond the Standard Model (BSM), by considering the so-called Constrained Minimal Supersymmetric Standard Model (CMSSM). Like many incarnations of BSM physics this model has a number (four) of arbitrary parameters, which determine the experimental signals, and cosmological observables such as the dark matter relic density. We show that analysis of the phenomenology can be greatly accelerated by using symbolic expressions derived for the observables in terms of the input parameters. Here we focus on the Higgs mass, the cold dark matter relic density, and the contribution to the anomalous magnetic moment of the muon. We find that SR can produce remarkably accurate expressions. Using them we make global fits to derive the posterior probability densities of the CMSSM input parameters which are in good agreement with those performed using conventional methods. Moreover, we demonstrate a major advantage of SR which is the ability to make fits using differentiable methods rather than sampling methods. We also compare the method with neural network (NN) regression. SR produces more globally robust results, while NNs require data that is focussed on the promising regions in order to be equally performant.
Authors: Louis Mozart Kamdem Teyou, Luke Friedrichs, N'Dah Jean Kouagou, Caglar Demir, Yasir Mahmood, Stefan Heindorf, Axel-Cyrille Ngonga Ngomo
Abstract: Concept learning exploits background knowledge in the form of description logic axioms to learn explainable classification models from knowledge bases. Despite recent breakthroughs in neuro-symbolic concept learning, most approaches still cannot be deployed on real-world knowledge bases. This is due to their use of description logic reasoners, which are not robust against inconsistencies nor erroneous data. We address this challenge by presenting a novel neural reasoner dubbed EBR. Our reasoner relies on embeddings to approximate the results of a symbolic reasoner. We show that EBR solely requires retrieving instances for atomic concepts and existential restrictions to retrieve or approximate the set of instances of any concept in the description logic $\mathcal{SHOIQ}$. In our experiments, we compare EBR with state-of-the-art reasoners. Our results suggest that EBR is robust against missing and erroneous data in contrast to existing reasoners.
Authors: Touqeer Ahmad, Mohammadreza M. Kalan, Fran\c{c}ois Portier, Gilles Stupfler
Abstract: Synthetic oversampling of minority examples using SMOTE and its variants is a leading strategy for addressing imbalanced classification problems. Despite the success of this approach in practice, its theoretical foundations remain underexplored. We develop a theoretical framework to analyze the behavior of SMOTE and related methods when classifiers are trained on synthetic data. We first derive a uniform concentration bound on the discrepancy between the empirical risk over synthetic minority samples and the population risk on the true minority distribution. We then provide a nonparametric excess risk guarantee for kernel-based classifiers trained using such synthetic data. These results lead to practical guidelines for better parameter tuning of both SMOTE and the downstream learning algorithm. Numerical experiments are provided to illustrate and support the theoretical findings
Authors: Zhiyu Lin, Jingwen Yang, Jiale Zhao, Meng Liu, Sunzhu Li, Benyou Wang
Abstract: Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressive score (from 2.0 to 23.4 on a 100-point scale) of S2S models. Demos and codes are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
URLs: https://github.com/FreedomIntelligence/ExpressiveSpeech
Authors: Mohamed Seif, Malcolm Egan, Andrea J. Goldsmith, H. Vincent Poor
Abstract: AI-based sensing at wireless edge devices has the potential to significantly enhance Artificial Intelligence (AI) applications, particularly for vision and perception tasks such as in autonomous driving and environmental monitoring. AI systems rely both on efficient model learning and inference. In the inference phase, features extracted from sensing data are utilized for prediction tasks (e.g., classification or regression). In edge networks, sensors and model servers are often not co-located, which requires communication of features. As sensitive personal data can be reconstructed by an adversary, transformation of the features are required to reduce the risk of privacy violations. While differential privacy mechanisms provide a means of protecting finite datasets, protection of individual features has not been addressed. In this paper, we propose a novel framework for privacy-preserving AI-based sensing, where devices apply transformations of extracted features before transmission to a model server.
Authors: Guillermo Carbajal, Andr\'es Almansa, Pablo Mus\'e
Abstract: Motion blur caused by camera shake, particularly under large or rotational movements, remains a major challenge in image restoration. We propose a deep learning framework that jointly estimates the latent sharp image and the underlying camera motion trajectory from a single blurry image. Our method leverages the Projective Motion Blur Model (PMBM), implemented efficiently using a differentiable blur creation module compatible with modern networks. A neural network predicts a full 3D rotation trajectory, which guides a model-based restoration network trained end-to-end. This modular architecture provides interpretability by revealing the camera motion that produced the blur. Moreover, this trajectory enables the reconstruction of the sequence of sharp images that generated the observed blurry image. To further refine results, we optimize the trajectory post-inference via a reblur loss, improving consistency between the blurry input and the restored output. Extensive experiments show that our method achieves state-of-the-art performance on both synthetic and real datasets, particularly in cases with severe or spatially variant blur, where end-to-end deblurring networks struggle. Code and trained models are available at https://github.com/GuillermoCarbajal/Blur2Seq/
Authors: Yunyi Shen, Alexander Gagliano
Abstract: Self-supervised learning has become a central strategy for representation learning, but the majority of architectures used for encoding data have only been validated on regularly-sampled inputs such as images, audios. and videos. In many scientific domains, data instead arrive as long, irregular, and multimodal sequences. To extract semantic information from these data, we introduce the Diffusion Autoencoder with Perceivers (daep). daep tokenizes heterogeneous measurements, compresses them with a Perceiver encoder, and reconstructs them with a Perceiver-IO diffusion decoder, enabling scalable learning in diverse data settings. To benchmark the daep architecture, we adapt the masked autoencoder to a Perceiver encoder/decoder design, and establish a strong baseline (maep) in the same architectural family as daep. Across diverse spectroscopic and photometric astronomical datasets, daep achieves lower reconstruction errors, produces more discriminative latent spaces, and better preserves fine-scale structure than both VAE and maep baselines. These results establish daep as an effective framework for scientific domains where data arrives as irregular, heterogeneous sequences.
Authors: L. Elisa Celis, Lingxiao Huang, Milind Sohoni, Nisheeth K. Vishnoi
Abstract: Meritocratic systems, from admissions to hiring, aim to impartially reward skill and effort. Yet persistent disparities across race, gender, and class challenge this ideal. Some attribute these gaps to structural inequality; others to individual choice. We develop a game-theoretic model in which candidates from different socioeconomic groups differ in their perceived post-selection value--shaped by social context and, increasingly, by AI-powered tools offering personalized career or salary guidance. Each candidate strategically chooses effort, balancing its cost against expected reward; effort translates into observable merit, and selection is based solely on merit. We characterize the unique Nash equilibrium in the large-agent limit and derive explicit formulas showing how valuation disparities and institutional selectivity jointly determine effort, representation, social welfare, and utility. We further propose a cost-sensitive optimization framework that quantifies how modifying selectivity or perceived value can reduce disparities without compromising institutional goals. Our analysis reveals a perception-driven bias: when perceptions of post-selection value differ across groups, these differences translate into rational differences in effort, propagating disparities backward through otherwise "fair" selection processes. While the model is static, it captures one stage of a broader feedback cycle linking perceptions, incentives, and outcome--bridging rational-choice and structural explanations of inequality by showing how techno-social environments shape individual incentives in meritocratic systems.
Authors: Wenjun Cao
Abstract: Large Language Models are increasingly adopted as critical tools for accelerating innovation. This paper identifies and formalizes a systemic risk inherent in this paradigm: \textbf{Black Box Absorption}. We define this as the process by which the opaque internal architectures of LLM platforms, often operated by large-scale service providers, can internalize, generalize, and repurpose novel concepts contributed by users during interaction. This mechanism threatens to undermine the foundational principles of innovation economics by creating severe informational and structural asymmetries between individual creators and platform operators, thereby jeopardizing the long-term sustainability of the innovation ecosystem. To analyze this challenge, we introduce two core concepts: the idea unit, representing the transportable functional logic of an innovation, and idea safety, a multidimensional standard for its protection. This paper analyzes the mechanisms of absorption and proposes a concrete governance and engineering agenda to mitigate these risks, ensuring that creator contributions remain traceable, controllable, and equitable.
Authors: Jack Butler, Nikita Kozodoi, Zainab Afolabi, Brian Tyacke, Gaiar Baimuratov
Abstract: As Large Language Models (LLMs) continue to evolve, practitioners face increasing options for enhancing inference-time performance without model retraining, including budget tuning and multi-step techniques like self-reflection. While these methods improve output quality, they create complex trade-offs among accuracy, cost, and latency that remain poorly understood across different domains. This paper systematically compares self-reflection and budget tuning across mathematical reasoning and translation tasks. We evaluate prominent LLMs, including Anthropic Claude, Amazon Nova, and Mistral families, along with other models under varying reflection depths and compute budgets to derive Pareto optimal performance frontiers. Our analysis reveals substantial domain dependent variation in self-reflection effectiveness, with performance gains up to 220\% in mathematical reasoning. We further investigate how reflection round depth and feedback mechanism quality influence performance across model families. To validate our findings in a real-world setting, we deploy a self-reflection enhanced marketing content localisation system at Lounge by Zalando, where it shows market-dependent effectiveness, reinforcing the importance of domain specific evaluation when deploying these techniques. Our results provide actionable guidance for selecting optimal inference strategies given specific domains and resource constraints. We open source our self-reflection implementation for reproducibility at https://github.com/aws-samples/sample-genai-reflection-for-bedrock.
URLs: https://github.com/aws-samples/sample-genai-reflection-for-bedrock.
Authors: Jinhee Kim, Jae Jun An, Kang Eun Jeon, Jong Hwan Ko
Abstract: Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time up to 7.88x. Our code is released at https://github.com/a2jinhee/EMQNet_jk.
Authors: Kushal Chakrabarti, Nirmal Balachundhar
Abstract: Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity -- decorrelated parallel representations -- as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. Inspired by portfolio theory, where uncorrelated assets reduce risk by $\sqrt{P}$, we prove hallucination probability is bounded by representational correlation: $P(H) \leq f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$, which predicts that language models need an optimal amount of neurodiversity. To validate this, we introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces hallucinations by up to 25.6% (and 14.6% on average) without degrading general accuracy. Ablations show LoRA adapters and regularization act synergistically, causal interventions prove neurodiversity as the mediating factor and correlational analyses indicate scale: a 0.1% neural correlation increase is associated with a 3.8% hallucination increase. Finally, task-dependent optimality emerges: different tasks require different amounts of optimal neurodiversity. Together, our results highlight neural diversity as a third axis of scaling -- orthogonal to parameters and data -- to improve the reliability of language models at fixed budgets.
Authors: Ronghao Ni, Aidan Z. H. Yang, Min-Chien Hsu, Nuno Sabino, Limin Jia, Ruben Martins, Darion Cassel, Kevin Cheang
Abstract: Program analysis tools often produce large volumes of candidate vulnerability reports that require costly manual review, creating a practical challenge: how can security analysts prioritize the reports most likely to be true vulnerabilities? This paper investigates whether machine learning can be applied to prioritizing vulnerabilities reported by program analysis tools. We focus on Node.js packages and collect a benchmark of 1,883 Node.js packages, each containing one reported ACE or ACI vulnerability. We evaluate a variety of machine learning approaches, including classical models, graph neural networks (GNNs), large language models (LLMs), and hybrid models that combine GNN and LLMs, trained on data based on a dynamic program analysis tool's output. The top LLM achieves $F_{1} {=} 0.915$, while the best GNN and classical ML models reaching $F_{1} {=} 0.904$. At a less than 7% false-negative rate, the leading model eliminates 66.9% of benign packages from manual review, taking around 60 ms per package. If the best model is tuned to operate at a precision level of 0.8 (i.e., allowing 20% false positives amongst all warnings), our approach can detect 99.2% of exploitable taint flows while missing only 0.8%, demonstrating strong potential for real-world vulnerability triage.
Authors: Brandon Kaplowitz
Abstract: This paper demonstrates how reinforcement learning can explain two puzzling empirical patterns in household consumption behavior during economic downturns. I develop a model where agents use Q-learning with neural network approximation to make consumption-savings decisions under income uncertainty, departing from standard rational expectations assumptions. The model replicates two key findings from recent literature: (1) unemployed households with previously low liquid assets exhibit substantially higher marginal propensities to consume (MPCs) out of stimulus transfers compared to high-asset households (0.50 vs 0.34), even when neither group faces borrowing constraints, consistent with Ganong et al. (2024); and (2) households with more past unemployment experiences maintain persistently lower consumption levels after controlling for current economic conditions, a "scarring" effect documented by Malmendier and Shen (2024). Unlike existing explanations based on belief updating about income risk or ex-ante heterogeneity, the reinforcement learning mechanism generates both higher MPCs and lower consumption levels simultaneously through value function approximation errors that evolve with experience. Simulation results closely match the empirical estimates, suggesting that adaptive learning through reinforcement learning provides a unifying framework for understanding how past experiences shape current consumption behavior beyond what current economic conditions would predict.
Authors: Tianyi Xiong, Haonan Chen
Abstract: Accurate medium-range precipitation forecasting is crucial for hydrometeorological risk management and disaster mitigation, yet remains challenging for current numerical weather prediction (NWP) systems. Traditional ensemble systems such as the Global Ensemble Forecast System (GEFS) struggle to maintain high skill, especially for moderate and heavy rainfall at extended lead times. This study develops a deep learning-based ensemble framework for multi-step precipitation prediction through joint modeling of a comprehensive set of atmospheric variables. The model is trained on ERA5 reanalysis data at 0.25$^{\circ}$ spatial resolution, with precipitation labels from NASA's Integrated Multi-satellite Retrievals for Global Precipitation Measurement (GPM) constellation (IMERG), incorporating 57 input variables, including upper-air and surface predictors. The architecture employs a patch-based Swin Transformer backbone with periodic convolutions to handle longitudinal continuity and integrates time and noise embeddings through conditional layer normalization. A dual-branch decoder predicts total precipitation and other variables, with targeted freezing of encoder-decoder pathways for specialized training. Training minimizes a hybrid loss combining the Continuous Ranked Probability Score (CRPS) and weighted log1p mean squared error (log1pMSE), balancing probabilistic accuracy and magnitude fidelity. During inference, the model ingests real-time Global Forecast System (GFS) initial conditions to generate 15-day forecasts autoregressively. Evaluation against GEFS using IMERG data demonstrates higher Critical Success Index (CSI) scores at precipitation thresholds of 0.1 mm, 1 mm, 10 mm, and 20 mm, highlighting improved performance for moderate to heavy rainfall.
Authors: Huijie Zhang, Aliaksandr Siarohin, Willi Menapace, Michael Vasilkovsky, Sergey Tulyakov, Qing Qu, Ivan Skorokhodov
Abstract: MeanFlow has recently emerged as a powerful framework for few-step generative modeling trained from scratch, but its success is not yet fully understood. In this work, we show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Through gradient analysis, we find that these terms are strongly negatively correlated, causing optimization conflict and slow convergence. Motivated by these insights, we introduce $\alpha$-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting a curriculum strategy that smoothly anneals from trajectory flow matching to MeanFlow, $\alpha$-Flow disentangles the conflicting objectives, and achieves better convergence. When trained from scratch on class-conditional ImageNet-1K 256x256 with vanilla DiT backbones, $\alpha$-Flow consistently outperforms MeanFlow across scales and settings. Our largest $\alpha$-Flow-XL/2+ model achieves new state-of-the-art results using vanilla DiT backbones, with FID scores of 2.58 (1-NFE) and 2.15 (2-NFE).
Authors: Fares Fourati
Abstract: Recent work by \citet{hendrycks2025agidefinition} formalized \textit{Artificial General Intelligence} (AGI) as the arithmetic mean of proficiencies across cognitive domains derived from the Cattell--Horn--Carroll (CHC) model of human cognition. While elegant, this definition assumes \textit{compensability} -- that exceptional ability in some domains can offset failure in others. True general intelligence, however, should reflect \textit{coherent sufficiency}: balanced competence across all essential domains. We propose a coherence-aware measure of AGI based on the integral of generalized means over a continuum of compensability exponents. This formulation spans arithmetic, geometric, and harmonic regimes, and the resulting \textit{area under the curve} (AUC) quantifies robustness under varying compensability assumptions. Unlike the arithmetic mean, which rewards specialization, the AUC penalizes imbalance and captures inter-domain dependency. Applied to published CHC-based domain scores for GPT-4 and GPT-5, the coherence-adjusted AUC reveals that both systems remain far from general competence despite high arithmetic scores (e.g., GPT-5 at~24\%). Integrating the generalized mean thus yields a principled, interpretable, and stricter foundation for measuring genuine progress toward AGI.
Authors: Mutian He, Philip N. Garner
Abstract: Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
Authors: Juan Alejandro Pinto Castro, H\'ector J. Hort\'ua, Jorge Enrique Garc\'ia-Farieta, Roger Anderson Hurtado
Abstract: Deep learning has emerged as a transformative methodology in modern cosmology, providing powerful tools to extract meaningful physical information from complex astronomical datasets. This paper implements a novel Bayesian graph deep learning framework for estimating key cosmological parameters in a primordial magnetic field (PMF) cosmology directly from simulated Cosmic Microwave Background (CMB) maps. Our methodology utilizes DeepSphere, a spherical convolutional neural network architecture specifically designed to respect the spherical geometry of CMB data through HEALPix pixelization. To advance beyond deterministic point estimates and enable robust uncertainty quantification, we integrate Bayesian Neural Networks (BNNs) into the framework, capturing aleatoric and epistemic uncertainties that reflect the model confidence in its predictions. The proposed approach demonstrates exceptional performance, achieving $R^{2}$ scores exceeding 0.89 for the magnetic parameter estimation. We further obtain well-calibrated uncertainty estimates through post-hoc training techniques including Variance Scaling and GPNormal. This integrated DeepSphere-BNNs framework not only delivers accurate parameter estimation from CMB maps with PMF contributions but also provides reliable uncertainty quantification, providing the necessary tools for robust cosmological inference in the era of precision cosmology.
Authors: Yair Feldman, Yoav Artzi
Abstract: A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
Authors: Dean L Slack, G Thomas Hudson, Thomas Winterbottom, Noura Al Moubayed
Abstract: Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time; a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction, utilizing continuous pixel-space representations for video prediction. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter efficient, and interpretable approach.
Authors: Elie Aljalbout, Jiaxu Xing, Angel Romero, Iretiayo Akinola, Caelan Reed Garrett, Eric Heiden, Abhishek Gupta, Tucker Hermans, Yashraj Narang, Dieter Fox, Davide Scaramuzza, Fabio Ramos
Abstract: Machine learning has facilitated significant advancements across various robotics domains, including navigation, locomotion, and manipulation. Many such achievements have been driven by the extensive use of simulation as a critical tool for training and testing robotic systems prior to their deployment in real-world environments. However, simulations consist of abstractions and approximations that inevitably introduce discrepancies between simulated and real environments, known as the reality gap. These discrepancies significantly hinder the successful transfer of systems from simulation to the real world. Closing this gap remains one of the most pressing challenges in robotics. Recent advances in sim-to-real transfer have demonstrated promising results across various platforms, including locomotion, navigation, and manipulation. By leveraging techniques such as domain randomization, real-to-sim transfer, state and action abstractions, and sim-real co-training, many works have overcome the reality gap. However, challenges persist, and a deeper understanding of the reality gap's root causes and solutions is necessary. In this survey, we present a comprehensive overview of the sim-to-real landscape, highlighting the causes, solutions, and evaluation metrics for the reality gap and sim-to-real transfer.
Authors: Xueyan Zou, Jianglong Ye, Hao Zhang, Xiaoyu Xiang, Mingyu Ding, Zhaojing Yang, Yong Jae Lee, Zhuowen Tu, Sifei Liu, Xiaolong Wang
Abstract: With the rapid growth of research in AI and robotics now producing over 10,000 papers annually it has become increasingly difficult for researchers to stay up to date. Fast evolving trends, the rise of interdisciplinary work, and the need to explore domains beyond one's expertise all contribute to this challenge. To address these issues, we propose a generalizable pipeline capable of systematically analyzing any research area: identifying emerging trends, uncovering cross domain opportunities, and offering concrete starting points for new inquiry. In this work, we present Real Deep Research (RDR) a comprehensive framework applied to the domains of AI and robotics, with a particular focus on foundation models and robotics advancements. We also briefly extend our analysis to other areas of science. The main paper details the construction of the RDR pipeline, while the appendix provides extensive results across each analyzed topic. We hope this work sheds light for researchers working in the field of AI and beyond.
Authors: Mingmeng Geng, Thierry Poibeau
Abstract: With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.
Authors: Mateo Guaman Castro, Sidharth Rajagopal, Daniel Gorbatov, Matt Schmittle, Rohan Baijal, Octi Zhang, Rosario Scalise, Sidharth Talia, Emma Romig, Celso de Melo, Byron Boots, Abhishek Gupta
Abstract: A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot's physical constraints and capabilities in safe, low-cost simulation. We enabled this separation by carefully designing an interface that lets a high-level planner propose candidate paths directly in image space that the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and end-to-end learning methods. We also show that our hierarchical design enables cross-embodied navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3X higher success rates by rejecting physically infeasible plans. Website: https://vamos-vla.github.io/
Authors: Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot
Abstract: Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.
Authors: Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazar\'e, Maria Lomeli, Lucas Hosseini, Herv\'e J\'egou
Abstract: Vector databases typically manage large collections of embedding vectors. Currently, AI applications are growing rapidly, and so is the number of embeddings that need to be stored and indexed. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors. This paper describes the trade-off space of vector search and the design principles of Faiss in terms of structure, approach to optimization and interfacing. We benchmark key features of the library and discuss a few selected applications to highlight its broad applicability.
Authors: Neta Glazer, Aviv Navon, Aviv Shamsian, Ethan Fetaya
Abstract: One of the challenges in applying reinforcement learning in a complex real-world environment lies in providing the agent with a sufficiently detailed reward function. Any misalignment between the reward and the desired behavior can result in unwanted outcomes. This may lead to issues like "reward hacking" where the agent maximizes rewards by unintended behavior. In this work, we propose to disentangle the reward into two distinct parts. A simple task-specific reward, outlining the particulars of the task at hand, and an unknown common-sense reward, indicating the expected behavior of the agent within the environment. We then explore how this common-sense reward can be learned from expert demonstrations. We first show that inverse reinforcement learning, even when it succeeds in training an agent, does not learn a useful reward function. That is, training a new agent with the learned reward does not impair the desired behaviors. We then demonstrate that this problem can be solved by training simultaneously on multiple tasks. That is, multi-task inverse reinforcement learning can be applied to learn a useful reward function.
Authors: Benjamin Walker, Andrew D. McLeod, Tiexin Qin, Yichuan Cheng, Haoliang Li, Terry Lyons
Abstract: The vector field of a controlled differential equation (CDE) describes the relationship between a control path and the evolution of a solution path. Neural CDEs (NCDEs) treat time series data as observations from a control path, parameterise a CDE's vector field using a neural network, and use the solution path as a continuously evolving hidden state. As their formulation makes them robust to irregular sampling rates, NCDEs are a powerful approach for modelling real-world data. Building on neural rough differential equations (NRDEs), we introduce Log-NCDEs, a novel, effective, and efficient method for training NCDEs. The core component of Log-NCDEs is the Log-ODE method, a tool from the study of rough paths for approximating a CDE's solution. Log-NCDEs are shown to outperform NCDEs, NRDEs, the linear recurrent unit, S5, and MAMBA on a range of multivariate time series datasets with up to $50{,}000$ observations.
Authors: Vincent Davis, Emanuele Rossi, Vikash Singh
Abstract: The Bitcoin Lightning Network is a Layer 2 payment protocol that addresses Bitcoin's scalability by facilitating quick and cost effective transactions through payment channels. This research explores the feasibility of using machine learning models to interpolate channel balances within the network, which can be used for optimizing the network's pathfinding algorithms. While there has been much exploration in balance probing and multipath payment protocols, predicting channel balances using solely node and channel features remains an uncharted area. This paper evaluates the performance of several machine learning models against two heuristic baselines and investigates the predictive capabilities of various features. Our model performs favorably in experimental evaluation, outperforming by 10% against an equal split baseline where both edges are assigned half of the channel capacity.
Authors: Spencer Young, Riley Sinema, Cole Edgren, Andrew Hall, Nathan Dong, Porter Jenkins
Abstract: While significant progress has been made in specifying neural networks capable of representing uncertainty, deep networks still often suffer from overconfidence and misaligned predictive distributions. Existing approaches for measuring this misalignment are primarily developed under the framework of calibration, with common metrics such as Expected Calibration Error (ECE). However, calibration can only provide a strictly marginal assessment of probabilistic alignment. Consequently, calibration metrics such as ECE are $\textit{distribution-wise}$ measures and cannot diagnose the $\textit{point-wise}$ reliability of individual inputs, which is important for real-world decision-making. We propose a stronger condition, which we term $\textit{conditional congruence}$, for assessing probabilistic fit. We also introduce a metric, Conditional Congruence Error (CCE), that uses conditional kernel mean embeddings to estimate the distance, at any point, between the learned predictive distribution and the empirical, conditional distribution in a dataset. We perform several high dimensional regression tasks and show that CCE exhibits four critical properties: $\textit{correctness}$, $\textit{monotonicity}$, $\textit{reliability}$, and $\textit{robustness}$.
Authors: Shriram Chennakesavalu, Frank Hu, Sebastian Ibarraran, Grant M. Rotskoff
Abstract: Searching through chemical space is an exceptionally challenging problem because the number of possible molecules grows combinatorially with the number of atoms. Large, autoregressive models trained on databases of chemical compounds have yielded powerful generators, but we still lack robust strategies for generating molecules with desired properties. This molecular search problem closely resembles the "alignment" problem for large language models, though for many chemical tasks we have a specific and easily evaluable reward function. Here, we introduce an algorithm called energy rank alignment (ERA) that leverages an explicit reward function to produce a gradient-based objective that we use to optimize autoregressive policies. We show theoretically that this algorithm is closely related to proximal policy optimization (PPO) and direct preference optimization (DPO), but has a minimizer that converges to an ideal Gibbs-Boltzmann distribution with the reward playing the role of an energy function. Furthermore, this algorithm is highly scalable, does not require reinforcement learning, and performs well relative to DPO when the number of preference observations per pairing is small. We deploy this approach to align molecular transformers and protein language models to generate molecules and protein sequences, respectively, with externally specified properties and find that it does so robustly, searching through diverse parts of chemical space.
Authors: Rosario Messana, Rui Chen, Andrea Lodi, Alberto Ceselli
Abstract: We consider solving a combinatorial optimization problem with unknown knapsack constraints using a membership oracle for each unknown constraint such that, given a solution, the oracle determines whether the constraint is satisfied or not with absolute certainty. The goal of the decision maker is to find the best possible solution subject to a budget on the number of oracle calls. Inspired by active learning for binary classification based on Support Vector Machines (SVMs), we devise a framework to solve the problem by learning and exploiting surrogate linear constraints. The framework includes training linear separators on the labeled points and selecting new points to be labeled, which is achieved by applying a sampling strategy and solving a 0-1 integer linear program. Following the active learning literature, a natural choice would be SVM as a linear classifier and the information-based sampling strategy known as simple margin, for each unknown constraint. We improve on both sides: we propose an alternative sampling strategy based on mixed-integer quadratic programming and a linear separation method inspired by an algorithm for convex optimization in the oracle model. We conduct experiments on classical problems and variants inspired by realistic applications to show how different linear separation methods and sampling strategies influence the quality of the results in terms of several metrics including objective value, dual bound and running time.
Authors: Subhojyoti Mukherjee, Josiah P. Hanna, Qiaomin Xie, Robert Nowak
Abstract: We study learning to learn for the multi-task structured bandit problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure and an algorithm should exploit the shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure from data collected by a demonstrator on a set of training task instances. Our objective is to devise a training procedure such that the transformer will learn to outperform the demonstrator's learning algorithm on unseen test task instances. Prior work on pretraining decision transformers either requires privileged information like access to optimal arms or cannot outperform the demonstrator. Going beyond these approaches, we introduce a pre-training approach that trains a transformer network to learn a near-optimal policy in-context. This approach leverages the shared structure across tasks, does not require access to optimal actions, and can outperform the demonstrator. We validate these claims over a wide variety of structured bandit problems to show that our proposed solution is general and can quickly identify expected rewards on unseen test tasks to support effective exploration.
Authors: Difan Deng, Marius Lindauer
Abstract: The rapid development of time series forecasting research has brought many deep learning-based modules in this field. However, despite the increasing amount of new forecasting architectures, it is still unclear if we have leveraged the full potential of these existing modules within a properly designed architecture. In this work, we propose a novel hierarchical neural architecture search approach for time series forecasting tasks. With the design of a hierarchical search space, we incorporate many architecture types designed for forecasting tasks and allow for the efficient combination of different forecasting architecture modules. Results on long-term-time-series-forecasting tasks show that our approach can search for lightweight high-performing forecasting architectures across different forecasting tasks.
Authors: Ali Gorji, Andisheh Amrollahi, Andreas Krause
Abstract: SHAP (SHapley Additive exPlanations) values are a widely used method for local feature attribution in interpretable and explainable AI. We propose an efficient two-stage algorithm for computing SHAP values in both black-box setting and tree-based models. Motivated by spectral bias in real-world predictors, we first approximate models using compact Fourier representations, exactly for trees and approximately for black-box models. In the second stage, we introduce a closed-form formula for {\em exactly} computing SHAP values using the Fourier representation, that ``linearizes'' the computation into a simple summation and is amenable to parallelization. As the Fourier approximation is computed only once, our method enables amortized SHAP value computation, achieving significant speedups over existing methods and a tunable trade-off between efficiency and precision.
Authors: Luckeciano C. Melo, Alessandro Abate, Yarin Gal
Abstract: Machine Learning models in real-world applications must continuously learn new tasks to adapt to shifts in the data-generating distribution. Yet, for Continual Learning (CL), models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines the reliability of deployed systems. In the Bayesian CL literature, variational methods tackle this challenge by employing a learning objective that recursively updates the posterior distribution while constraining it to stay close to its previous estimate. Nonetheless, we argue that these methods may be ineffective due to compounding approximation errors over successive recursions. To mitigate this, we propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations, preventing individual errors from dominating future posterior updates and compounding over time. We reveal insightful connections between these objectives and Temporal-Difference methods, a popular learning mechanism in Reinforcement Learning and Neuroscience. Experiments on challenging CL benchmarks show that our approach effectively mitigates Catastrophic Forgetting, outperforming strong Variational CL methods.
Authors: Jacob L. Block, Sundararajan Srinivasan, Liam Collins, Aryan Mokhtari, Sanjay Shakkottai
Abstract: The power of foundation models (FMs) lies in their capacity to learn highly expressive representations that can be adapted to a broad spectrum of tasks. However, these pretrained models require additional training stages to become effective for downstream applications. In the multi-task setting, prior works have shown empirically that specific meta-learning approaches for preparing a model for future adaptation through parameter-efficient fine-tuning (PEFT) can outperform standard retraining methods, but the mechanism of the benefits of meta-learning has been largely unexplored. We introduce a framework for generic PEFT-based meta-learning to learn a model that can easily adapt to unseen tasks. For linear models using LoRA, we show that standard retraining is provably suboptimal for finding an adaptable set of parameters and provide strict performance guarantees for our proposed method. We verify these theoretical insights through experiments on synthetic data as well as real-data vision and language tasks. We observe significant performance benefits using a simple implementation of our proposed meta-learning scheme during retraining relative to the conventional approach.
Authors: Ming Gu, Zhuonan Zheng, Sheng Zhou, Meihan Liu, Jiawei Chen, Tanyu Qiao, Liangcheng Li, Jiajun Bu
Abstract: Graph Neural Networks (GNNs) have achieved great success but are often considered to be challenged by varying levels of homophily in graphs. Recent empirical studies have surprisingly shown that homophilic GNNs can perform well across datasets of different homophily levels with proper hyperparameter tuning, but the underlying theory and effective architectures remain unclear. To advance GNN universality across varying homophily, we theoretically revisit GNN message passing and uncover a novel smoothness-generalization dilemma, where increasing hops inevitably enhances smoothness at the cost of generalization. This dilemma hinders learning in higher-order homophilic neighborhoods and all heterophilic ones, where generalization is critical due to complex neighborhood class distributions that are sensitive to shifts induced by noise and sparsity. To address this, we introduce the Inceptive Graph Neural Network (IGNN) built on three simple yet effective design principles, which alleviate the dilemma by enabling distinct hop-wise generalization alongside improved overall generalization with adaptive smoothness. Benchmarking against 30 baselines demonstrates IGNN's superiority and reveals notable universality in certain homophilic GNN variants. Our code and datasets are available at https://github.com/galogm/IGNN.
Authors: Shyam Venkatasubramanian, Vahid Tarokh
Abstract: Accelerating model convergence in resource-constrained environments is essential for fast and efficient neural network training. This work presents learn2mix, a new training strategy that adaptively adjusts class proportions within batches, focusing on classes with higher error rates. Unlike classical training methods that use static class proportions, learn2mix continually adapts class proportions during training, leading to faster convergence. Empirical evaluations on benchmark datasets show that neural networks trained with learn2mix converge faster than those trained with existing approaches, achieving improved results for classification, regression, and reconstruction tasks under limited training resources and with imbalanced classes. Our empirical findings are supported by theoretical analysis.
Authors: Yibo Wen, Chenwei Xu, Jerry Yao-Chieh Hu, Kaize Ding, Han Liu
Abstract: We present a three-stage framework for training deep learning models specializing in antibody sequence-structure co-design. We first pre-train a language model using millions of antibody sequence data. Then, we employ the learned representations to guide the training of a diffusion model for joint optimization over both sequence and structure of antibodies. During the final alignment stage, we optimize the model to favor antibodies with low repulsion and high attraction to the antigen binding site, enhancing the rationality and functionality of the designs. To mitigate conflicting energy preferences, we extend AbDPO (Antibody Direct Preference Optimization) to guide the model toward Pareto optimality under multiple energy-based alignment objectives. Furthermore, we adopt an iterative learning paradigm with temperature scaling, enabling the model to benefit from diverse online datasets without requiring additional data. In practice, our proposed methods achieve high stability and efficiency in producing a better Pareto front of antibody designs compared to top samples generated by baselines and previous alignment techniques. Through extensive experiments, we showcase the superior performance of our methods in generating nature-like antibodies with high binding affinity.
Authors: Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao
Abstract: Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.
Authors: Zewen Liu, Juntong Ni, Max S. Y. Lau, Wei Jin
Abstract: Accurate epidemic forecasting is crucial for outbreak preparedness, but existing data-driven models are often brittle. Typically trained on a single pathogen, they struggle with data scarcity during new outbreaks and fail under distribution shifts caused by viral evolution or interventions. However, decades of surveillance data from diverse diseases offer an untapped source of transferable knowledge. To leverage the collective lessons from history, we propose CAPE, the first open-source pre-trained model for epidemic forecasting. Unlike existing time series foundation models that overlook epidemiological challenges, CAPE models epidemic dynamics as mixtures of latent population states, termed compartmental prototypes. It discovers a flexible dictionary of compartment prototypes directly from surveillance data, enabling each outbreak to be expressed as a time-varying mixture that links observed infections to latent population states. To promote robust generalization, CAPE combines self-supervised pre-training objectives with lightweight epidemic-aware regularizers that align the learned prototypes with epidemiological semantics. On a comprehensive benchmark spanning 17 diseases and 50+ regions, CAPE significantly outperforms strong baselines in zero-shot, few-shot, and full-shot forecasting. This work represents a principled step toward pre-trained epidemic models that are both transferable and epidemiologically grounded.
Authors: Awa Khouna, Julien Ferry, Thibaut Vidal
Abstract: The advent of Machine Learning as a Service (MLaaS) has heightened the trade-off between model explainability and security. In particular, explainability techniques, such as counterfactual explanations, inadvertently increase the risk of model extraction attacks, enabling unauthorized replication of proprietary models. In this paper, we formalize and characterize the risks and inherent complexity of model reconstruction, focusing on the "oracle'' queries required for faithfully inferring the underlying prediction function. We present the first formal analysis of model extraction attacks through the lens of competitive analysis, establishing a foundational framework to evaluate their efficiency. Focusing on models based on additive decision trees (e.g., decision trees, gradient boosting, and random forests), we introduce novel reconstruction algorithms that achieve provably perfect fidelity while demonstrating strong anytime performance. Our framework provides theoretical bounds on the query complexity for extracting tree-based model, offering new insights into the security vulnerabilities of their deployment.
Authors: Lingyi Wang, Rashed Shelim, Walid Saad, Naren Ramakrishnan
Abstract: Imagination in world models is crucial for enabling agents to learn long-horizon policy in a sample-efficient manner. Existing recurrent state-space model (RSSM)-based world models depend on single-step statistical inference to capture the environment dynamics, and, hence, they are unable to perform long-term imagination tasks due to the accumulation of prediction errors. Inspired by the dual-process theory of human cognition, we propose a novel dual-mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. DMWM is composed of two components: an RSSM-based System 1 (RSSM-S1) component that handles state transitions in an intuitive manner and a logic-integrated neural network-based System 2 (LINN-S2) component that guides the imagination process through hierarchical deep logical reasoning. The inter-system feedback mechanism is designed to ensure that the imagination process follows the logical rules of the real environment. The proposed framework is evaluated on benchmark tasks that require long-term planning from the DMControl suite. Extensive experimental results demonstrate that the proposed framework yields significant improvements in terms of logical coherence, trial efficiency, data efficiency and long-term imagination over the state-of-the-art world models.
Authors: Nic Rummel, Daniel A. Messenger, Stephen Becker, Vanja Dukic, David M. Bortz
Abstract: The Weak-form Estimation of Non-linear Dynamics (WENDy) framework is a recently developed approach for parameter estimation and inference of systems of ordinary differential equations (ODEs). Prior work demonstrated WENDy to be robust, computationally efficient, and accurate, but only works for ODEs which are linear-in-parameters. In this work, we derive a novel extension to accommodate systems of a more general class of ODEs that are nonlinear-in-parameters. Our new WENDy-MLE algorithm approximates a maximum likelihood estimator via local non-convex optimization methods. This is made possible by the availability of analytic expressions for the likelihood function and its first and second order derivatives. WENDy-MLE has better accuracy, a substantially larger domain of convergence, and is often faster than other weak form methods and the conventional output error least squares method. Moreover, we extend the framework to accommodate data corrupted by multiplicative log-normal noise. The WENDy.jl algorithm is efficiently implemented in Julia. In order to demonstrate the practical benefits of our approach, we present extensive numerical results comparing our method, other weak form methods, and output error least squares on a suite of benchmark systems of ODEs in terms of accuracy, precision, bias, and coverage.
Authors: Moritz Grillo, Christoph Hertrich, Georg Loho
Abstract: We contribute towards resolving the open question of how many hidden layers are required in ReLU networks for exactly representing all continuous and piecewise linear functions on $\mathbb{R}^d$. While the question has been resolved in special cases, the best known lower bound in general is still 2. We focus on neural networks that are compatible with certain polyhedral complexes, more precisely with the braid fan. For such neural networks, we prove a non-constant lower bound of $\Omega(\log\log d)$ hidden layers required to exactly represent the maximum of $d$ numbers. Additionally, under our assumption, we provide a combinatorial proof that 3 hidden layers are necessary to compute the maximum of 5 numbers; this had only been verified with an excessive computation so far. Finally, we show that a natural generalization of the best known upper bound to maxout networks is not tight, by demonstrating that a rank-3 maxout layer followed by a rank-2 maxout layer is sufficient to represent the maximum of 7 numbers.
Authors: Jaehyeong Jo, Sung Ju Hwang
Abstract: Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. However, diffusion models that directly work on discrete data space fail to fully exploit the power of iterative refinement, as the signals are lost during transitions between discrete states. Existing continuous diffusion models for discrete data underperform compared to discrete methods, and the lack of a clear connection between the two approaches hinders the development of effective diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on this analogy, introduce a simple diffusion process that generalizes existing discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry, along with a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. The code is available at https://github.com/harryjo97/RDLM.
Authors: Shenzhi Yang, Junbo Zhao, Sharon Li, Shouqing Yang, Dingyu Yang, Xiaofang Zhang, Haobo Wang
Abstract: Detecting out-of-distribution (OOD) nodes in the graph-based machine-learning field is challenging, particularly when in-distribution (ID) node multi-category labels are unavailable. Thus, we focus on feature space rather than label space and find that, ideally, during the optimization of known ID samples, unknown ID samples undergo more significant representation changes than OOD samples, even if the model is trained to fit random targets, which we called the Feature Resonance phenomenon. The rationale behind it is that even without gold labels, the local manifold may still exhibit smooth resonance. Based on this, we further develop a novel graph OOD framework, dubbed Resonance-based Separation and Learning (RSL), which comprises two core modules: (i) a more practical micro-level proxy of feature resonance that measures the movement of feature vectors in one training step. (ii) integrate with synthetic OOD nodes strategy to train an effective OOD classifier. Theoretically, we derive an error bound showing the superior separability of OOD nodes during the resonance period. Extensive experiments on a total of thirteen real-world graph datasets empirically demonstrate that RSL achieves state-of-the-art performance.
Authors: Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari
Abstract: Large-scale machine learning models deliver strong performance across a wide range of tasks but come with significant computational and resource constraints. To mitigate these challenges, local smaller models are often deployed alongside larger models, relying on routing and deferral mechanisms to offload complex tasks. However, existing approaches inadequately balance the capabilities of these models, often resulting in unnecessary deferrals or sub-optimal resource usage. In this work we introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model. Moreover, it incorporates a mechanism for managing the trade-off between model performance and deferral accuracy, and is broadly applicable across various tasks and domains without any architectural changes. We evaluate our method on encoder-only, decoder-only, and encoder-decoder architectures. Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.
Authors: Yeonjun In, Kanghoon Yoon, Sukwon Yun, Kibum Kim, Sungchul Kim, Chanyoung Park
Abstract: In real-world applications, node features in graphs often contain noise from various sources, leading to significant performance degradation in GNNs. Although several methods have been developed to enhance robustness, they rely on the unrealistic assumption that noise in node features is independent of the graph structure and node labels, thereby limiting their applicability. To this end, we introduce a more realistic noise scenario, dependency-aware noise on graphs (DANG), where noise in node features create a chain of noise dependencies that propagates to the graph structure and node labels. We propose a novel robust GNN, DA-GNN, which captures the causal relationships among variables in the data generating process (DGP) of DANG using variational inference. In addition, we present new benchmark datasets that simulate DANG in real-world applications, enabling more practical research on robust GNNs. Extensive experiments demonstrate that DA-GNN consistently outperforms existing baselines across various noise scenarios, including both DANG and conventional noise models commonly considered in this field. Our code is available at https://github.com/yeonjun-in/torch-DA-GNN.
Authors: Xi He, Max A. Little
Abstract: We present an axiomatic framework for analyzing the algorithmic properties of decision trees. This framework supports the classification of decision tree problems through structural and ancestral constraints within a rigorous mathematical foundation. The central focus of this paper is a special class of decision tree problems-which we term proper decision trees-due to their versatility and effectiveness. In terms of versatility, this class subsumes several well-known data structures, including binary space partitioning trees, K-D trees, and machine learning decision tree models. Regarding effectiveness, we prove that only proper decision trees can be uniquely characterized as K-permutations, whereas typical non-proper decision trees correspond to binary-labeled decision trees with substantially greater complexity. Using this formal characterization, we develop a generic algorithmic approach for solving optimal decision tree problems over arbitrary splitting rules and objective functions for proper decision trees. We constructively derive a generic dynamic programming recursion for solving these problems exactly. However, we show that memoization is generally impractical in terms of space complexity, as both datasets and subtrees must be stored. This result contradicts claims in the literature that suggest a trade-off between memoizing datasets and subtrees. Our framework further accommodates constraints such as tree depth and leaf size, and can be accelerated using techniques such as thinning. Finally, we extend our analysis to several non-proper decision trees, including the commonly studied decision tree over binary feature data, the binary search tree, and the tree structure arising in the matrix chain multiplication problem. We demonstrate how these problems can be solved by appropriately modifying or discarding certain axioms.
Authors: Khayrul Islam, Ryan F. Forelli, Jianzhong Han, Deven Bhadane, Jian Huang, Joshua C. Agar, Nhan Tran, Seda Ogrenci, Yaling Liu
Abstract: Precise cell classification is essential in biomedical diagnostics and therapeutic monitoring, particularly for identifying diverse cell types involved in various diseases. Traditional cell classification methods such as flow cytometry depend on molecular labeling which is often costly, time-intensive, and can alter cell integrity. To overcome these limitations, we present a label-free machine learning framework for cell classification, designed for real-time sorting applications using bright-field microscopy images. This approach leverages a teacher-student model architecture enhanced by knowledge distillation, achieving high efficiency and scalability across different cell types. Demonstrated through a use case of classifying lymphocyte subsets, our framework accurately classifies T4, T8, and B cell types with a dataset of 80,000 preprocessed images, accessible via an open-source Python package for easy adaptation. Our teacher model attained 98\% accuracy in differentiating T4 cells from B cells and 93\% accuracy in zero-shot classification between T8 and B cells. Remarkably, our student model operates with only 0.02\% of the teacher model's parameters, enabling field-programmable gate array (FPGA) deployment. Our FPGA-accelerated student model achieves an ultra-low inference latency of just 14.5~$\mu$s and a complete cell detection-to-sorting trigger time of 24.7~$\mu$s, delivering 12x and 40x improvements over the previous state-of-the-art real-time cell analysis algorithm in inference and total latency, respectively, while preserving accuracy comparable to the teacher model. This framework provides a scalable, cost-effective solution for lymphocyte classification, as well as a new SOTA real-time cell sorting implementation for rapid identification of subsets using in situ deep learning on off-the-shelf computing hardware.
Authors: Tan-Khiem Huynh, Malcolm Egan, Giovanni Neglia, Jean-Marie Gorce
Abstract: Federated learning (FL) is now recognized as a key framework for communication-efficient collaborative learning. Most theoretical and empirical studies, however, rely on the assumption that clients have access to pre-collected data sets, with limited investigation into scenarios where clients continuously collect data. In many real-world applications, particularly when data is generated by physical or biological processes, client data streams are often modeled by non-stationary Markov processes. Unlike standard i.i.d. sampling, the performance of FL with Markovian data streams remains poorly understood due to the statistical dependencies between client samples over time. In this paper, we investigate whether FL can still support collaborative learning with Markovian data streams. Specifically, we analyze the performance of Minibatch SGD, Local SGD, and a variant of Local SGD with momentum. We answer affirmatively under standard assumptions and smooth non-convex client objectives: the sample complexity is proportional to the inverse of the number of clients with a communication complexity comparable to the i.i.d. scenario. However, the sample complexity for Markovian data streams remains higher than for i.i.d. sampling.
Authors: J. Quetzalc\'oatl Toledo-Marin, Anindita Maiti, Geoffrey C. Fox, Roger G. Melko
Abstract: Deep generative models have become ubiquitous due to their ability to learn and sample from complex distributions. Despite the proliferation of various frameworks, the relationships among these models remain largely unexplored, a gap that hinders the development of a unified theory of AI learning. We address two central challenges: clarifying the connections between different deep generative models and deepening our understanding of their learning mechanisms. We focus on Restricted Boltzmann Machines (RBMs), known for their universal approximation capabilities for discrete distributions. By introducing a reciprocal space formulation, we reveal a connection between RBMs, diffusion processes, and coupled Bosons. We show that at initialization, the RBM operates at a saddle point, where the local curvature is determined by the singular values, whose distribution follows the Marcenko-Pastur law and exhibits rotational symmetry. During training, this rotational symmetry is broken due to hierarchical learning, where different degrees of freedom progressively capture features at multiple levels of abstraction. This leads to a symmetry breaking in the energy landscape, reminiscent of Landau theory. This symmetry breaking in the energy landscape is characterized by the singular values and the weight matrix eigenvector matrix. We derive the corresponding free energy in a mean-field approximation. We show that in the limit of infinite size RBM, the reciprocal variables are Gaussian distributed. Our findings indicate that in this regime, there will be some modes for which the diffusion process will not converge to the Boltzmann distribution. To illustrate our results, we trained replicas of RBMs with different hidden layer sizes using the MNIST dataset. Our findings bridge the gap between disparate generative frameworks and also shed light on the processes underpinning learning in generative models.
Authors: Seewon Choi, Alaia Solko-Breslin, Rajeev Alur, Eric Wong
Abstract: Many computational tasks benefit from being formulated as the composition of neural networks followed by a discrete symbolic program. The goal of neurosymbolic learning is to train the neural networks using end-to-end input-output labels of the composite. We introduce CTSketch, a novel, scalable neurosymbolic learning algorithm. CTSketch uses two techniques to improve the scalability of neurosymbolic inference: decompose the symbolic program into sub-programs and summarize each sub-program with a sketched tensor. This strategy allows us to approximate the output distribution of the program with simple tensor operations over the input distributions and the sketches. We provide theoretical insight into the maximum approximation error. Furthermore, we evaluate CTSketch on benchmarks from the neurosymbolic learning literature, including some designed for evaluating scalability. Our results show that CTSketch pushes neurosymbolic learning to new scales that were previously unattainable, with neural predictors obtaining high accuracy on tasks with one thousand inputs, despite supervision only on the final output.
Authors: Advait Gadhikar, Tom Jacobs, Chao Zhou, Rebekka Burkholz
Abstract: The performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training presents a major roadblock for efficient deep learning. According to the Lottery Ticket Hypothesis, PaI hinges on finding a problem specific parameter initialization. As we show, to this end, determining correct parameter signs is sufficient. Yet, they remain elusive to PaI. To address this issue, we propose Sign-In, which employs a dynamic reparameterization that provably induces sign flips. Such sign flips are complementary to the ones that dense-to-sparse training can accomplish, rendering Sign-In as an orthogonal method. While our experiments and theory suggest performance improvements of PaI, they also carve out the main open challenge to close the gap between PaI and dense-to-sparse training.
Authors: Jonah Ekelund, Savvas Raptis, Vicki Toy-Edens, Wenli Mo, Drew L. Turner, Ian J. Cohen, Stefano Markidis
Abstract: Analyzing multi-featured time series data is critical for space missions making efficient event detection, potentially onboard, essential for automatic analysis. However, limited onboard computational resources and data downlink constraints necessitate robust methods for identifying regions of interest in real time. This work presents an adaptive outlier detection algorithm based on the reconstruction error of Principal Component Analysis (PCA) for feature reduction, designed explicitly for space mission applications. The algorithm adapts dynamically to evolving data distributions by using Incremental PCA, enabling deployment without a predefined model for all possible conditions. A pre-scaling process normalizes each feature's magnitude while preserving relative variance within feature types. We demonstrate the algorithm's effectiveness in detecting space plasma events, such as distinct space environments, dayside and nightside transients phenomena, and transition layers through NASA's MMS mission observations. Additionally, we apply the method to NASA's THEMIS data, successfully identifying a dayside transient using onboard-available measurements.
Authors: Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness
Abstract: We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as learning rate) across changes in model depth, requiring practitioners to either re-tune these HPs as they scale up (expensive), or accept sub-optimal training when re-tuning is prohibitive. Even when they achieve HP transfer, we develop theory to show parameterizations may still exist in the lazy learning regime where layers learn only features close to their linearization, preventing effective use of depth and nonlinearity. Finally, we identify and adopt the parameterization we call CompleteP that achieves both depth-wise HP transfer and non-lazy learning in all layers. CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts. Moreover, CompleteP enables 12-34% compute efficiency improvements over the prior state-of-the-art. All experiments were run on Cerebras CS-3 systems. A minimal implementation is available at https://github.com/EleutherAI/nanoGPT-mup/tree/completep.
URLs: https://github.com/EleutherAI/nanoGPT-mup/tree/completep.
Authors: Stepan Tretiakov, Xingjian Li, Krishna Kumar
Abstract: Neural operators, particularly the Deep Operator Network (DeepONet), have shown promise in learning mappings between function spaces for solving differential equations. However, standard DeepONet requires input functions to be sampled at fixed locations, limiting its applicability when sensor configurations vary or inputs exist on irregular grids. We introduce the Set Operator Network (SetONet), which modifies DeepONet's branch network to process input functions as unordered sets of location-value pairs. By incorporating Deep Sets principles, SetONet ensures permutation invariance while maintaining the same parameter count as the baseline. On classical operator-learning benchmarks, SetONet achieves parity with DeepONet on fixed layouts while sustaining accuracy under variable sensor configurations or sensor drop-off - conditions for which standard DeepONet is not applicable. More significantly, SetONet natively handles problems where inputs are naturally represented as unstructured point clouds (such as point sources or density samples) rather than values on fixed grids, a capability standard DeepONet lacks. On heat conduction with point sources, advection-diffusion modeling chemical plumes, and optimal transport between density samples, SetONet learns operators end-to-end without rasterization or multi-stage pipelines. These problems feature inputs that are naturally discrete point sets (point sources or density samples) rather than functions on fixed grids. SetONet is a DeepONet-class architecture that addresses such problems with a lightweight design, significantly broadening the applicability of operator learning to problems with variable, incomplete, or unstructured input data.
Authors: Xuran Li, Jingyi Wang, Xiaohan Yuan, Peixin Zhang
Abstract: It is often desirable to remove (a.k.a. unlearn) a specific part of the training data from a trained neural network model. A typical application scenario is to protect the data holder's right to be forgotten, which has been promoted by many recent regulation rules. Existing unlearning methods involve training alternative models with remaining data, which may be costly and challenging to verify from the data holder or a thirdparty auditor's perspective. In this work, we provide a new angle and propose a novel unlearning approach by imposing carefully crafted "patch" on the original neural network to achieve targeted "forgetting" of the requested data to delete. Specifically, inspired by the research line of neural network repair, we propose to strategically seek a lightweight minimum "patch" for unlearning a given data point with certifiable guarantee. Furthermore, to unlearn a considerable amount of data points (or an entire class), we propose to iteratively select a small subset of representative data points to unlearn, which achieves the effect of unlearning the whole set. Extensive experiments on multiple categorical datasets demonstrates our approach's effectiveness, achieving measurable unlearning while preserving the model's performance and being competitive in efficiency and memory consumption compared to various baseline methods.
Authors: Yuanhang Yang, Chaozheng Wang, Jing Li
Abstract: Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, that reveals an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.
Authors: Kunwoong Kim, Jihu Lee, Sangchul Park, Yongdai Kim
Abstract: Algorithmic fairness in clustering aims to balance the proportions of instances assigned to each cluster with respect to a given sensitive attribute. While recently developed fair clustering algorithms optimize clustering objectives under specific fairness constraints, their inherent complexity or approximation often results in suboptimal clustering utility or numerical instability in practice. To resolve these limitations, we propose a new fair clustering algorithm based on a novel decomposition of the fair $K$-means clustering objective function. The proposed algorithm, called Fair Clustering via Alignment (FCA), operates by alternately (i) finding a joint probability distribution to align the data from different protected groups, and (ii) optimizing cluster centers in the aligned space. A key advantage of FCA is that it theoretically guarantees approximately optimal clustering utility for any given fairness level without complex constraints, thereby enabling high-utility fair clustering in practice. Experiments show that FCA outperforms existing methods by (i) attaining a superior trade-off between fairness level and clustering utility, and (ii) achieving near-perfect fairness without numerical instability.
Authors: Yizhou Liu, Ziming Liu, Jeff Gore
Abstract: The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law, that loss decreases as a power law with model size, remains unclear. We propose that representation superposition, meaning that LLMs represent more features than they have dimensions, can be a key contributor to loss and cause neural scaling. Based on Anthropic's toy model, we use weight decay to control the degree of superposition, allowing us to systematically study how loss scales with model size. When superposition is weak, the loss follows a power law only if data feature frequencies are power-law distributed. In contrast, under strong superposition, the loss generically scales inversely with model dimension across a broad class of frequency distributions, due to geometric overlaps between representation vectors. We confirmed that open-sourced LLMs operate in the strong superposition regime and have loss scaling like one over the model dimension, and that the Chinchilla scaling laws are also consistent with this behavior. Our results identify representation superposition as a central driver of neural scaling laws, providing insights into questions like when neural scaling laws can be improved and when they will break down.
Authors: Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, Mingyi Hong
Abstract: This paper investigates Reinforcement Learning (RL) approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents in long-horizon, multi-turn scenarios. Although RL algorithms such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) have been widely applied to train multi-turn LLM agents, they typically rely only on sparse outcome rewards and lack dense intermediate signals across multiple decision steps, limiting their performance on complex reasoning tasks. To bridge this gap, we present the first systematic study of \textit{turn-level reward design} for multi-turn RL algorithms and agent applications. By integrating turn-level rewards, we extend GRPO and PPO to their respective multi-turn variants, enabling fine-grained credit assignment. We conduct case studies on multi-turn reasoning-augmented search agents, where we carefully design two types of turn-level rewards: verifiable and LLM-as-judge. Our experiments on multi-turn search tasks demonstrate that incorporating well-designed turn-level rewards enables RL algorithms to significantly outperform baseline methods with trajectory-level rewards. Both training and validation reward curves illustrate that our method achieves \textit{greater stability}, \textit{faster convergence}, and \textit{higher accuracy}. Numerical results across diverse question-answering datasets further show that our approach consistently delivers highest answer correctness and 100\% format correctness.
Authors: Andr\'es Guzm\'an-Cordero, Felix Dangel, Gil Goldshlager, Marius Zeinhofer
Abstract: Natural gradient methods significantly accelerate the training of Physics-Informed Neural Networks (PINNs), but are often prohibitively costly. We introduce a suite of techniques to improve the accuracy and efficiency of energy natural gradient descent (ENGD) for PINNs. First, we leverage the Woodbury formula to dramatically reduce the computational complexity of ENGD. Second, we adapt the Subsampled Projected-Increment Natural Gradient Descent algorithm from the variational Monte Carlo literature to accelerate the convergence. Third, we explore the use of randomized algorithms to further reduce the computational cost in the case of large batch sizes. We find that randomization accelerates progress in the early stages of training for low-dimensional problems, and we identify key barriers to attaining acceleration in other scenarios. Our numerical experiments demonstrate that our methods outperform previous approaches, achieving the same $L^2$ error as the original ENGD up to $75\times$ faster.
Authors: Jiahan Zhang, Yaoyu Zhang, Tao Luo
Abstract: In this paper, we study the Karush-Kuhn-Tucker (KKT) points of the associated maximum-margin problem in homogeneous neural networks, including fully-connected and convolutional neural networks. In particular, We investigates the relationship between such KKT points across networks of different widths generated. We introduce and formalize the \textbf{KKT point embedding principle}, establishing that KKT points of a homogeneous network's max-margin problem ($P_{\Phi}$) can be embedded into the KKT points of a larger network's problem ($P_{\tilde{\Phi}}$) via specific linear isometric transformations. We rigorously prove this principle holds for neuron splitting in fully-connected networks and channel splitting in convolutional neural networks. Furthermore, we connect this static embedding to the dynamics of gradient flow training with smooth losses. We demonstrate that trajectories initiated from appropriately mapped points remain mapped throughout training and that the resulting $\omega$-limit sets of directions are correspondingly mapped, thereby preserving the alignment with KKT directions dynamically when directional convergence occurs. We conduct several experiments to justify that trajectories are preserved. Our findings offer insights into the effects of network width, parameter redundancy, and the structural connections between solutions found via optimization in homogeneous networks of varying sizes.
Authors: Jan Hagnberger, Daniel Musekamp, Mathias Niepert
Abstract: Solving time-dependent Partial Differential Equations (PDEs) using a densely discretized spatial domain is a fundamental problem in various scientific and engineering disciplines, including modeling climate phenomena and fluid dynamics. However, performing these computations directly in the physical space often incurs significant computational costs. To address this issue, several neural surrogate models have been developed that operate in a compressed latent space to solve the PDE. While these approaches reduce computational complexity, they often use Transformer-based attention mechanisms to handle irregularly sampled domains, resulting in increased memory consumption. In contrast, convolutional neural networks allow memory-efficient encoding and decoding but are limited to regular discretizations. Motivated by these considerations, we propose CALM-PDE, a model class that efficiently solves arbitrarily discretized PDEs in a compressed latent space. We introduce a novel continuous convolution-based encoder-decoder architecture that uses an epsilon-neighborhood-constrained kernel and learns to apply the convolution operator to adaptive and optimized query points. We demonstrate the effectiveness of CALM-PDE on a diverse set of PDEs with both regularly and irregularly sampled spatial domains. CALM-PDE is competitive with or outperforms existing baseline methods while offering significant improvements in memory and inference time efficiency compared to Transformer-based methods.
Authors: Nimrod Berman, Ilan Naiman, Moshe Eliasof, Hedi Zisling, Omri Azencot
Abstract: Diffusion-based generative models have demonstrated exceptional performance, yet their iterative sampling procedures remain computationally expensive. A prominent strategy to mitigate this cost is distillation, with offline distillation offering particular advantages in terms of efficiency, modularity, and flexibility. In this work, we identify two key observations that motivate a principled distillation framework: (1) while diffusion models have been viewed through the lens of dynamical systems theory, powerful and underexplored tools can be further leveraged; and (2) diffusion models inherently impose structured, semantically coherent trajectories in latent space. Building on these observations, we introduce the Koopman Distillation Model (KDM), a novel offline distillation approach grounded in Koopman theory - a classical framework for representing nonlinear dynamics linearly in a transformed space. KDM encodes noisy inputs into an embedded space where a learned linear operator propagates them forward, followed by a decoder that reconstructs clean samples. This enables single-step generation while preserving semantic fidelity. We provide theoretical justification for our approach: (1) under mild assumptions, the learned diffusion dynamics admit a finite-dimensional Koopman representation; and (2) proximity in the Koopman latent space correlates with semantic similarity in the generated outputs, allowing for effective trajectory alignment. KDM achieves highly competitive performance across standard offline distillation benchmarks.
Authors: Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri
Abstract: We introduce ${\rm C{\small LEVER}}$, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, ${\rm C{\small LEVER}}$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use ${\rm C{\small LEVER}}$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub(https://github.com/trishullab/clever) as well as HuggingFace(https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online(https://github.com/trishullab/clever-prover).
URLs: https://github.com/trishullab/clever), https://huggingface.co/datasets/amitayusht/clever)., https://github.com/trishullab/clever-prover).
Authors: Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
Abstract: Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
Authors: Qianyue Hao, Yiwen Song, Qingmin Liao, Jian Yuan, Yong Li
Abstract: Policy exploration is critical in reinforcement learning (RL), where existing approaches include greedy, Gaussian process, etc. However, these approaches utilize preset stochastic processes and are indiscriminately applied in all kinds of RL tasks without considering task-specific features that influence policy exploration. Moreover, during RL training, the evolution of such stochastic processes is rigid, which typically only incorporates a decay in the variance, failing to adjust flexibly according to the agent's real-time learning status. Inspired by the analyzing and reasoning capability of large language models (LLMs), we design LLM-Explorer to adaptively generate task-specific exploration strategies with LLMs, enhancing the policy exploration in RL. In our design, we sample the learning trajectory of the agent during the RL training in a given task and prompt the LLM to analyze the agent's current policy learning status and then generate a probability distribution for future policy exploration. Updating the probability distribution periodically, we derive a stochastic process specialized for the particular task and dynamically adjusted to adapt to the learning process. Our design is a plug-in module compatible with various widely applied RL algorithms, including the DQN series, DDPG, TD3, and any possible variants developed based on them. Through extensive experiments on the Atari and MuJoCo benchmarks, we demonstrate LLM-Explorer's capability to enhance RL policy exploration, achieving an average performance improvement up to 37.27%. Our code is open-source at https://github.com/tsinghua-fib-lab/LLM-Explorer for reproducibility.
Authors: Patrick Cheridito, Jean-Loup Dupret, Donatien Hainaut
Abstract: In this paper, we introduce a model-based deep-learning approach to solve finite-horizon continuous-time stochastic control problems with jumps. We iteratively train two neural networks: one to represent the optimal policy and the other to approximate the value function. Leveraging a continuous-time version of the dynamic programming principle, we derive two different training objectives based on the Hamilton-Jacobi-Bellman equation, ensuring that the networks capture the underlying stochastic dynamics. Empirical evaluations on different problems illustrate the accuracy and scalability of our approach, demonstrating its effectiveness in solving complex, high-dimensional stochastic control tasks.
Authors: Max Weltevrede, Moritz A. Zanger, Matthijs T. J. Spaan, Wendelin B\"ohmer
Abstract: In the zero-shot policy transfer setting in reinforcement learning, the goal is to train an agent on a fixed set of training environments so that it can generalise to similar, but unseen, testing environments. Previous work has shown that policy distillation after training can sometimes produce a policy that outperforms the original in the testing environments. However, it is not yet entirely clear why that is, or what data should be used to distil the policy. In this paper, we prove, under certain assumptions, a generalisation bound for policy distillation after training. The theory provides two practical insights: for improved generalisation, you should 1) train an ensemble of distilled policies, and 2) distil it on as much data from the training environments as possible. We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
Authors: Beier Luo, Shuoyuan Wang, Sharon Li, Hongxin Wei
Abstract: Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates PoLM's prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g. GPT-4o) by up to 15.08$\%$ on common benchmarks.
Authors: Baran Hashemi, Kurt Pasque, Chris Teska, Ruriko Yoshida
Abstract: Can algebraic geometry enhance the sharpness, robustness, and interpretability of modern neural reasoning models by equipping them with a mathematically grounded inductive bias? To answer this, we introduce Tropical Attention, an attention mechanism grounded in tropical geometry that lifts the attention kernel into tropical projective space, where reasoning is piecewise-linear and 1-Lipschitz, thus preserving the polyhedral decision structure inherent to combinatorial reasoning. We prove that Multi-Head Tropical Attention (MHTA) stacks universally approximate tropical circuits and realize tropical transitive closure through composition, achieving polynomial resource bounds without invoking recurrent mechanisms. These guarantees explain why the induced polyhedral decision boundaries remain sharp and scale-invariant, rather than smoothed by Softmax. Empirically, we show that Tropical Attention delivers stronger out-of-distribution generalization in both length and value, with high robustness against perturbative noise, and substantially faster inference with fewer parameters compared to Softmax-based and recurrent attention baselines. For the first time, we extend neural algorithmic reasoning beyond PTIME problems to NP-hard and NP-complete problems, paving the way toward sharper and more expressive Large Reasoning Models (LRMs) capable of tackling complex combinatorial challenges in phylogenetics, cryptography, particle physics, and mathematical discovery.
Authors: Kaicheng Zhang, Sinian Zhang, Doudou Zhou, Yidong Zhou
Abstract: Transfer learning is a powerful paradigm for leveraging knowledge from source domains to enhance learning in a target domain. However, traditional transfer learning approaches often focus on scalar or multivariate data within Euclidean spaces, limiting their applicability to complex data structures such as probability distributions. To address this limitation, we introduce a novel transfer learning framework for regression models whose outputs are probability distributions residing in the Wasserstein space. When the informative subset of transferable source domains is known, we propose an estimator with provable asymptotic convergence rates, quantifying the impact of domain similarity on transfer efficiency. For cases where the informative subset is unknown, we develop a data-driven transfer learning procedure designed to mitigate negative transfer. The proposed methods are supported by rigorous theoretical analysis and are validated through extensive simulations and real-world applications. The code is available at https://github.com/h7nian/WaTL
Authors: Hongshu Guo, Zeyuan Ma, Yining Ma, Xinglin Zhang, Wei-Neng Chen, Yue-Jiao Gong
Abstract: Designing effective black-box optimizers is hampered by limited problem-specific knowledge and manual control that spans months for almost every detail. In this paper, we present \textit{DesignX}, the first automated algorithm design framework that generates an effective optimizer specific to a given black-box optimization problem within seconds. Rooted in the first principles, we identify two key sub-tasks: 1) algorithm structure generation and 2) hyperparameter control. To enable systematic construction, a comprehensive modular algorithmic space is first built, embracing hundreds of algorithm components collected from decades of research. We then introduce a dual-agent reinforcement learning system that collaborates on structural and parametric design through a novel cooperative training objective, enabling large-scale meta-training across 10k diverse instances. Remarkably, through days of autonomous learning, the DesignX-generated optimizers continuously surpass human-crafted optimizers by orders of magnitude, either on synthetic testbed or on realistic optimization scenarios such as Protein-docking, AutoML and UAV path planning. Further in-depth analysis reveals DesignX's capability to discover non-trivial algorithm patterns beyond expert intuition, which, conversely, provides valuable design insights for the optimization community. We provide DesignX's Python project at~ https://github.com/MetaEvo/DesignX.
Authors: Shizheng Wen, Arsh Kumbhat, Levi Lingsch, Sepehr Mousavi, Yizhou Zhao, Praveen Chandrashekar, Siddhartha Mishra
Abstract: The very challenging task of learning solution operators of PDEs on arbitrary domains accurately and efficiently is of vital importance to engineering and industrial simulations. Despite the existence of many operator learning algorithms to approximate such PDEs, we find that accurate models are not necessarily computationally efficient and vice versa. We address this issue by proposing a geometry aware operator transformer (GAOT) for learning PDEs on arbitrary domains. GAOT combines novel multiscale attentional graph neural operator encoders and decoders, together with geometry embeddings and (vision) transformer processors to accurately map information about the domain and the inputs into a robust approximation of the PDE solution. Multiple innovations in the implementation of GAOT also ensure computational efficiency and scalability. We demonstrate this significant gain in both accuracy and efficiency of GAOT over several baselines on a large number of learning tasks from a diverse set of PDEs, including achieving state of the art performance on three large scale three-dimensional industrial CFD datasets.
Authors: James Oldfield, Shawn Im, Sharon Li, Mihalis A. Nicolaou, Ioannis Patras, Grigorios G Chrysos
Abstract: Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping--significantly increasing model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights--preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language--opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.
Authors: Mayank Jobanputra, Yana Veitsman, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn
Abstract: Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. [2024a]. We use a recently proposed framework for studying length generalization [Huang et al., 2025] to provide guarantees for each of our settings. Empirically, we observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears upon targeted fine-tuning if length-generalization is guaranteed by theory. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers. We validate our findings through practical experiments on real-world tasks demonstrating reliability risks. Our results highlight that pretraining selectively enhances certain transformer capabilities, but does not overcome fundamental length-generalization limits.
Authors: Haomiao Qiu, Miao Zhang, Ziyue Qiao, Liqiang Nie
Abstract: Continual Learning (CL) aims to enable models to continuously acquire new knowledge from a sequence of tasks with avoiding the forgetting of learned information. However, existing CL methods only rely on the parameters of the most recent task for inference, which makes them susceptible to catastrophic forgetting. Inspired by the recent success of model merging techniques, we propose \textbf{Perturb-and-Merge (P\&M)}, a novel continual learning framework that integrates model merging into the CL paradigm to mitigate forgetting. Specifically, after training on each task, P\&M constructs a new model by forming a convex combination of the previous model and the newly trained task-specific model. Through theoretical analysis, We minimize the total loss increase across all tasks and derive a closed-form solution for the merging coefficient under mild assumptions. To further improve the performance of the merged model, we observe that the degradation introduced during merging can be alleviated by a regularization term composed of the task vector and the Hessian matrix of the loss function. Interestingly, we show that this term can be efficiently approximated using second-order symmetric finite differences, and a stochastic perturbation strategy along the task vector direction is accordingly devised which incurs no additional forward or backward passes while providing an effective approximation of the regularization term. Finally, we combine P\&M with LoRA, a parameter-efficient fine-tuning method, to reduce memory overhead. Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets. The code is available at https://github.com/qhmiao/P-M-for-Continual-Learning.
Authors: Jacob L. Block, Aryan Mokhtari, Sanjay Shakkottai
Abstract: Machine unlearning algorithms aim to remove the influence of specific training samples, ideally recovering the model that would have resulted from training on the remaining data alone. We study unlearning in the overparameterized setting, where many models interpolate the data, and defining the solution as any loss minimizer over the retained set$\unicode{x2013}$as in prior work in the underparameterized setting$\unicode{x2013}$is inadequate, since the original model may already interpolate the retained data and satisfy this condition. In this regime, loss gradients vanish, rendering prior methods based on gradient perturbations ineffective, motivating both new unlearning definitions and algorithms. For this setting, we define the unlearning solution as the minimum-complexity interpolator over the retained data and propose a new algorithmic framework that only requires access to model gradients on the retained set at the original solution. We minimize a regularized objective over perturbations constrained to be orthogonal to these model gradients, a first-order relaxation of the interpolation condition. For different model classes, we provide exact and approximate unlearning guarantees and demonstrate that an implementation of our framework outperforms existing baselines across various unlearning experiments.
Authors: Declan Kutscher, David M. Chan, Yutong Bai, Trevor Darrell, Ritwik Gupta
Abstract: Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
Authors: Weiyi Wang, Junwei Deng, Yuzheng Hu, Shiyuan Zhang, Xirui Jiang, Runting Zhang, Han Zhao, Jiaqi W. Ma
Abstract: Data attribution methods, which quantify the influence of individual training data points on a machine learning model, have gained increasing popularity in data-centric applications in modern AI. Despite a recent surge of new methods developed in this space, the impact of hyperparameter tuning in these methods remains under-explored. In this work, we present the first large-scale empirical study to understand the hyperparameter sensitivity of common data attribution methods. Our results show that most methods are indeed sensitive to certain key hyperparameters. However, unlike typical machine learning algorithms -- whose hyperparameters can be tuned using computationally-cheap validation metrics -- evaluating data attribution performance often requires retraining models on subsets of training data, making such metrics prohibitively costly for hyperparameter tuning. This poses a critical open challenge for the practical application of data attribution methods. To address this challenge, we advocate for better theoretical understandings of hyperparameter behavior to inform efficient tuning strategies. As a case study, we provide a theoretical analysis of the regularization term that is critical in many variants of influence function methods. Building on this analysis, we propose a lightweight procedure for selecting the regularization value without model retraining, and validate its effectiveness across a range of standard data attribution benchmarks. Overall, our study identifies a fundamental yet overlooked challenge in the practical application of data attribution, and highlights the importance of careful discussion on hyperparameter selection in future method development.
Authors: Kazuki Irie, Morris Yau, Samuel J. Gershman
Abstract: We develop hybrid memory architectures for general-purpose sequence processing neural networks, that combine key-value memory using softmax attention (KV-memory) with fast weight memory through dynamic synaptic modulation (FW-memory) -- the core principles of quadratic and linear transformers, respectively. These two memory systems have complementary but individually limited properties: KV-memory offers precise retrieval but is constrained by quadratic complexity in sequence length, while FW-memory supports arbitrarily long sequences and enables more expressive computation but sacrifices precise recall. We propose and compare three methods to blend these two systems into a single memory system, differing in how and when input information is delivered to each system, to leverage the strengths of both. We conduct experiments on general language modeling and retrieval tasks by training 340M- and 1.3B-parameter models from scratch, as well as on synthetic algorithmic tasks designed to precisely illustrate the benefits of certain hybrid methods over others. We also evaluate our hybrid memory systems on reinforcement learning in partially observable environments. Overall, we demonstrate how a well-designed hybrid can overcome the limitations of its individual components, offering new insights into the design principle of neural memory systems.
Authors: Tim Walter, Hannah Markgraf, Jonathan K\"ulz, Matthias Althoff
Abstract: The deployment of autonomous robots in safety-critical applications requires safety guarantees. Provably safe reinforcement learning is an active field of research that aims to provide such guarantees using safeguards. These safeguards should be integrated during training to reduce the sim-to-real gap. While there are several approaches for safeguarding sampling-based reinforcement learning, analytic gradient-based reinforcement learning often achieves superior performance from fewer environment interactions. However, there is no safeguarding approach for this learning paradigm yet. Our work addresses this gap by developing the first effective safeguard for analytic gradient-based reinforcement learning. We analyse existing, differentiable safeguards, adapt them through modified mappings and gradient formulations, and integrate them into a state-of-the-art learning algorithm and a differentiable simulation. Using numerical experiments on three control tasks, we evaluate how different safeguards affect learning. The results demonstrate safeguarded training without compromising performance. Additional visuals are provided at \href{https://timwalter.github.io/safe-agb-rl.github.io}{timwalter.github.io/safe-agb-rl.github.io}.
Authors: Zixuan Xia, Aram Davtyan, Paolo Favaro
Abstract: We propose KOALA++, a scalable Kalman-based optimization algorithm that explicitly models structured gradient uncertainty in neural network training. Unlike second-order methods, which rely on expensive second order gradient calculation, our method directly estimates the parameter covariance matrix by recursively updating compact gradient covariance products. This design improves upon the original KOALA framework that assumed diagonal covariance by implicitly capturing richer uncertainty structure without storing the full covariance matrix and avoiding large matrix inversions. Across diverse tasks, including image classification and language modeling, KOALA++ achieves accuracy on par or better than state-of-the-art first- and second-order optimizers while maintaining the efficiency of first-order methods.
Authors: Tobias W\"urth, Niklas Freymuth, Gerhard Neumann, Luise K\"arger
Abstract: Graph-based learned simulators have emerged as a promising approach for simulating physical systems on unstructured meshes, offering speed and generalization across diverse geometries. However, they often struggle with capturing global phenomena, such as bending or long-range correlations usually occurring in solid mechanics, and suffer from error accumulation over long rollouts due to their reliance on local message passing and direct next-step prediction. We address these limitations by introducing the Rolling Diffusion-Batched Inference Network (ROBIN), a novel learned simulator that integrates two key innovations: (i) Rolling Diffusion-Batched Inference (ROBI), a parallelized inference scheme that amortizes the cost of diffusion-based refinement across physical time steps by overlapping denoising steps across a temporal window. (ii) A Hierarchical Graph Neural Network built on algebraic multigrid coarsening, enabling multiscale message passing across different mesh resolutions. This architecture, implemented via Algebraic-hierarchical Message Passing Networks, captures both fine-scale local dynamics and global structural effects critical for phenomena like beam bending or multi-body contact. We validate ROBIN on challenging 2D and 3D solid mechanics benchmarks involving geometric, material, and contact nonlinearities. ROBIN achieves state-of-the-art accuracy on all tasks, substantially outperforming existing next-step learned simulators while reducing inference time by up to an order of magnitude compared to standard diffusion simulators.
Authors: Chong You, Kan Wu, Zhipeng Jia, Lin Chen, Srinadh Bhojanapalli, Jiaxian Guo, Utku Evci, Jan Wassenberg, Praneeth Netrapalli, Jeremiah J. Willcock, Suvinay Subramanian, Felix Chern, Alek Andreev, Shreya Pathak, Felix Yu, Prateek Jain, David E. Culler, Henry M. Levy, Sanjiv Kumar
Abstract: The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interests in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity often degrade model quality, increase parameter count, complicate or slow down training. Sparse attention, the application of sparse activation to the attention mechanism, often faces similar challenges. This paper introduces the Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism while maintaining model quality, parameter count, and standard training procedures. Our method realizes sparsity via top-k masking for explicit control over sparsity level. Crucially, we introduce statistical top-k, a hardware-accelerator-friendly, linear-time approximate algorithm that avoids costly sorting and mitigates significant training slowdown from standard top-$k$ operators. Furthermore, Spark Transformer reallocates existing FFN parameters and attention key embeddings to form a low-cost predictor for identifying activated entries. This design not only mitigates quality loss from enforced sparsity, but also enhances wall-time benefit. Pretrained with the Gemma-2 recipe, Spark Transformer demonstrates competitive performance on standard benchmarks while exhibiting significant sparsity: only 8% of FFN neurons are activated, and each token attends to a maximum of 256 tokens. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
Authors: Hao Li, Bowen Deng, Chang Xu, Zhiyuan Feng, Viktor Schlegel, Yu-Hao Huang, Yizheng Sun, Jingyuan Sun, Kailai Yang, Yiyao Yu, Jiang Bian
Abstract: A unified foundation model for medical time series -- pretrained on open access and ethics board-approved medical corpora -- offers the potential to reduce annotation burdens, minimize model customization, and enable robust transfer across clinical institutions, modalities, and tasks, particularly in data-scarce or privacy-constrained environments. However, existing generalist time series foundation models struggle to handle medical time series data due to their inherent challenges, including irregular intervals, heterogeneous sampling rates, and frequent missing values. To address these challenges, we introduce MIRA, a unified foundation model specifically designed for medical time series forecasting. MIRA incorporates a Continuous-Time Rotary Positional Encoding that enables fine-grained modeling of variable time intervals, a frequency-specific mixture-of-experts layer that routes computation across latent frequency regimes to further promote temporal specialization, and a Continuous Dynamics Extrapolation Block based on Neural ODE that models the continuous trajectory of latent states, enabling accurate forecasting at arbitrary target timestamps. Pretrained on a large-scale and diverse medical corpus comprising over 454 billion time points collect from publicly available datasets, MIRA achieves reductions in forecasting errors by an average of 10% and 7% in out-of-distribution and in-distribution scenarios, respectively, when compared to other zero-shot and fine-tuned baselines. We also introduce a comprehensive benchmark spanning multiple downstream clinical tasks, establishing a foundation for future research in medical time series modeling.
Authors: Marton Havasi, Brian Karrer, Itai Gat, Ricky T. Q. Chen
Abstract: Autoregressive generative models naturally generate variable-length sequences, while non-autoregressive models struggle, often imposing rigid, token-wise structures. We propose Edit Flows, a non-autoregressive model that overcomes these limitations by defining a discrete flow over sequences through edit operations$\unicode{x2013}$insertions, deletions, and substitutions. By modeling these operations within a Continuous-time Markov Chain over the sequence space, Edit Flows enable flexible, position-relative generation that aligns more closely with the structure of sequence data. Our training method leverages an expanded state space with auxiliary variables, making the learning process efficient and tractable. Empirical results show that Edit Flows outperforms both autoregressive and mask models on image captioning and significantly outperforms the mask construction in text and code generation.
Authors: Boaz Lavon, Shahar Katz, Lior Wolf
Abstract: We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. Our method, Execution-Guided Classifier-Free Guidance (EG-CFG), dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions. EG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation. By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure. Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions. Our experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming and data science tasks. Our code is available at: https://github.com/boazlavon/eg_cfg
Authors: Pulkit Gopalani, Wei Hu
Abstract: Training Transformers on algorithmic tasks frequently demonstrates an intriguing abrupt learning phenomenon: an extended performance plateau followed by a sudden, sharp improvement. This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in their outputs. This output degeneracy is accompanied by internal representation collapse, where hidden states across different tokens become nearly parallel. We further identify the slow learning of optimal attention maps as a key bottleneck. Hidden progress in attention configuration during the plateau precedes the eventual rapid convergence, and directly intervening on attention significantly alters plateau duration and the severity of repetition bias and representational collapse. We validate that these identified phenomena-repetition bias and representation collapse-are not artifacts of toy setups but also manifest in the early pre-training stage of large language models like Pythia and OLMo.
Authors: An Luo, Xun Xian, Jin Du, Fangqiao Tian, Ganghua Wang, Ming Zhong, Shengchun Zhao, Xuan Bi, Zirui Liu, Jiawei Zhou, Jayanth Srinivasa, Ashish Kundu, Charles Fleming, Mingyi Hong, Jie Ding
Abstract: Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models' ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems. Our data and code are publicly available here: https://github.com/jeremyxianx/Assisted-DS
Authors: Prithvi Raj
Abstract: Learning an energy-based model (EBM) in the latent space of a top-down generative model offers a versatile framework for generation across multiple data modalities. However, it remains unclear how its interpretability can be used to guide model design, improve generative quality, and reduce training time. Moreover, the reliance on Langevin Monte Carlo (LMC) sampling presents challenges in efficiency and sampling multimodal latent distributions. In this work, we propose a novel adaptation of the Kolmogorov-Arnold representation theorem for generative modeling and introduce the Thermodynamic Kolmogorov-Arnold Model (T-KAM) to take advantage of structural and inductive biases. By constraining the prior to univariate relationships, T-KAM enables fast and exact inference via the inverse transform method. With the low dimensionality of the latent space and suitable inductive biases encoded, we demonstrate that importance sampling (IS) becomes a viable, unbiased, and highly efficient posterior sampler. For situations where IS fails, we investigate a novel strategy using population-based LMC, which decomposes posterior sampling into a sequence of annealed distributions to improve multimodal sampling. T-KAM elegantly balances common trade-offs in generative modeling, offering fast inference, interpretability, and stable training, while being naturally suited to upcoming Zettascale Computing Corp. hardware.
Authors: Ben Finkelshtein, \.Ismail \.Ilkan Ceylan, Michael Bronstein, Ron Levie
Abstract: Graph machine learning architectures are typically tailored to specific tasks on specific datasets, which hinders their broader applicability. This has led to a new quest in graph machine learning: how to build graph foundation models capable of generalizing across arbitrary graphs and features? In this work, we present a recipe for designing graph foundation models for node-level tasks from first principles. The key ingredient underpinning our study is a systematic investigation of the symmetries that a graph foundation model must respect. In a nutshell, we argue that label permutation-equivariance alongside feature permutation-invariance are necessary in addition to the common node permutation-equivariance on each local neighborhood of the graph. To this end, we first characterize the space of linear transformations that are equivariant to permutations of nodes and labels, and invariant to permutations of features. We then prove that the resulting network is a universal approximator on multisets that respect the aforementioned symmetries. Our recipe uses such layers on the multiset of features induced by the local neighborhood of the graph to obtain a class of graph foundation models for node property prediction. We validate our approach through extensive experiments on 29 real-world node classification datasets, demonstrating both strong zero-shot empirical performance and consistent improvement as the number of training graphs increases.
Authors: Shen Yuan, Yin Zheng, Taifeng Wang, Binbin Liu, Hongteng Xu
Abstract: Adapting large-scale foundation models in multi-task scenarios often suffers from task conflict and oblivion. To mitigate such issues, we propose a novel ''model MoE-ization'' strategy that leads to a conflict- and oblivion-resistant multi-task adaptation method. Given a weight matrix of a pre-trained model, our method applies SVD to it and introduces a learnable router to adjust its singular values based on tasks and samples. Accordingly, the weight matrix becomes a Mixture of Orthogonal Rank-one Experts (MoORE), in which each expert corresponds to the outer product of a left singular vector and the corresponding right one. We can improve the model capacity by imposing a learnable orthogonal transform on the right singular vectors. Unlike low-rank adaptation (LoRA) and its MoE-driven variants, MoORE guarantees the experts' orthogonality and maintains the column space of the original weight matrix. These two properties make the adapted model resistant to the conflicts among the new tasks and the oblivion of its original tasks, respectively. Experiments on various datasets demonstrate that MoORE outperforms existing multi-task adaptation methods consistently, showing its superiority in terms of conflict- and oblivion-resistance. The code of the experiments is available at https://github.com/DaShenZi721/MoORE.
Authors: Nikola Jovanovi\'c, Ismail Labiad, Tom\'a\v{s} Sou\v{c}ek, Martin Vechev, Pierre Fernandez
Abstract: Watermarking the outputs of generative models has emerged as a promising approach for tracking their provenance. Despite significant interest in autoregressive image generation models and their potential for misuse, no prior work has attempted to watermark their outputs at the token level. In this work, we present the first such approach by adapting language model watermarking techniques to this setting. We identify a key challenge: the lack of reverse cycle-consistency (RCC), wherein re-tokenizing generated image tokens significantly alters the token sequence, effectively erasing the watermark. To address this and to make our method robust to common image transformations, neural compression, and removal attacks, we introduce (i) a custom tokenizer-detokenizer finetuning procedure that improves RCC, and (ii) a complementary watermark synchronization layer. As our experiments demonstrate, our approach enables reliable and robust watermark detection with theoretically grounded p-values. Code and models are available at https://github.com/facebookresearch/wmar.
Authors: Abdellah Rahmani, Pascal Frossard
Abstract: Understanding causal relationships in multivariate time series is crucial in many scenarios, such as those dealing with financial or neurological data. Many such time series exhibit multiple regimes, i.e., consecutive temporal segments with a priori unknown boundaries, with each regime having its own causal structure. Inferring causal dependencies and regime shifts is critical for analyzing the underlying processes. However, causal structure learning in this setting is challenging due to (1) non-stationarity, i.e., each regime can have its own causal graph and mixing function, and (2) complex noise distributions, which may be nonGaussian or heteroscedastic. Existing causal discovery approaches cannot address these challenges, since generally assume stationarity or Gaussian noise with constant variance. Hence, we introduce FANTOM, a unified framework for causal discovery that handles non-stationary processes along with non-Gaussian and heteroscedastic noises. FANTOM simultaneously infers the number of regimes and their corresponding indices and learns each regime's Directed Acyclic Graph. It uses a Bayesian Expectation Maximization algorithm that maximizes the evidence lower bound of the data log-likelihood. On the theoretical side, we prove, under mild assumptions, that temporal heteroscedastic causal models, introduced in FANTOM's formulation, are identifiable in both stationary and non-stationary settings. In addition, extensive experiments on synthetic and real data show that FANTOM outperforms existing methods.
Authors: Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu, Fei Yu
Abstract: DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While it's a ''perfect'' reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomaly, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are continuously provided throughout the learning process, enabling smoother gradient updates and accelerating convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% the training steps, and furthermore, still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm significant mitigation of gradient issues with ReDit. Moreover, theoretical analyses are provided to further validate these advantages.
Authors: David Demitri Africa, Sara M. Kapoor, Theo Simon Sorg, Challenger Mishra
Abstract: Modular exponentiation is crucial to number theory and cryptography, yet remains largely unexplored from a mechanistic interpretability standpoint. We train a 4-layer encoder-decoder Transformer model to perform this operation and investigate the emergence of numerical reasoning during training. Utilizing principled sampling strategies, PCA-based embedding analysis, and activation patching, we examine how number-theoretic properties are encoded within the model. We find that reciprocal operand training leads to strong performance gains, with sudden generalization across related moduli. These synchronized accuracy surges reflect grokking-like dynamics, suggesting the model internalizes shared arithmetic structure. We also find a subgraph consisting entirely of attention heads in the final layer sufficient to achieve full performance on the task of regular exponentiation. These results suggest that transformer models learn modular arithmetic through specialized computational circuits, paving the way for more interpretable and efficient neural approaches to modular exponentiation.
Authors: Martin Marek, Sanae Lotfi, Aditya Somasundaram, Andrew Gordon Wilson, Micah Goldblum
Abstract: Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation, which trades off the number of optimizer steps for a proportional increase in batch size. While it is common to decrease the learning rate for smaller batch sizes, other hyperparameters are often held fixed. In this work, we revisit small batch sizes all the way down to batch size one, and we propose a rule for scaling Adam hyperparameters to small batch sizes. In particular, rather than holding the decay rate of the second moment fixed across batch sizes, we propose to hold its half-life fixed in terms of tokens. We find that small batch sizes (1) train stably, (2) are consistently more robust to hyperparameter choices, (3) achieve equal or better per-FLOP performance than larger batch sizes, and (4) notably enable stable language model training with vanilla SGD, even without momentum, despite storing no optimizer state. Building on these results, we provide practical recommendations for selecting a batch size and setting optimizer hyperparameters. We further recommend against gradient accumulation unless training on multiple devices with multiple model replicas. Finally, we show that a small batch size combined with an optimizer with a small state size can provide the performance benefits of full fine-tuning while maintaining a similar memory footprint to LoRA.
Authors: Zhipeng He, Alexander Stevens, Chun Ouyang, Johannes De Smedt, Alistair Barros, Catarina Moreira
Abstract: Adversarial attacks on tabular data present unique challenges due to the heterogeneous nature of mixed categorical and numerical features. Unlike images where pixel perturbations maintain visual similarity, tabular data lacks intuitive similarity metrics, making it difficult to define imperceptible modifications. Additionally, traditional gradient-based methods prioritise $\ell_p$-norm constraints, often producing adversarial examples that deviate from the original data distributions. To address this, we propose a latent-space perturbation framework using a mixed-input Variational Autoencoder (VAE) to generate statistically consistent adversarial examples. The proposed VAE integrates categorical embeddings and numerical features into a unified latent manifold, enabling perturbations that preserve statistical consistency. We introduce In-Distribution Success Rate (IDSR) to jointly evaluate attack effectiveness and distributional alignment. Evaluation across six publicly available datasets and three model architectures demonstrates that our method achieves substantially lower outlier rates and more consistent performance compared to traditional input-space attacks and other VAE-based methods adapted from image domain approaches, achieving substantially lower outlier rates and higher IDSR across six datasets and three model architectures. Our comprehensive analyses of hyperparameter sensitivity, sparsity control, and generative architecture demonstrate that the effectiveness of VAE-based attacks depends strongly on reconstruction quality and the availability of sufficient training data. When these conditions are met, the proposed framework achieves superior practical utility and stability compared with input-space methods. This work underscores the importance of maintaining on-manifold perturbations for generating realistic and robust adversarial examples in tabular domains.
Authors: Felix Kronenwett, Georg Maier, Thomas L\"angle
Abstract: Sensor-based sorting systems enable the physical separation of a material stream into two fractions. The sorting decision is based on the image data evaluation of the sensors used and is carried out using actuators. Various process parameters must be set depending on the properties of the material stream, the dimensioning of the system, and the required sorting accuracy. However, continuous verification and re-adjustment are necessary due to changing requirements and material stream compositions. In this paper, we introduce an approach for optimizing, recurrently monitoring and adjusting the process parameters of a sensor-based sorting system. Based on Bayesian Optimization, Gaussian process regression models are used as surrogate models to achieve specific requirements for system behavior with the uncertainties contained therein. This method minimizes the number of necessary experiments while simultaneously considering two possible optimization targets based on the requirements for both material output streams. In addition, uncertainties are considered during determining sorting accuracies in the model calculation. We evaluated the method with three example process parameters.
Authors: Haonan Yang, Jianchao Tang, Zhuo Li, Long Lan
Abstract: Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. Despite recent advances leveraging different decomposition operations and novel architectures based on CNN, MLP or Transformer, existing methods still struggle with static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, limiting their ability to model intricate temporal dependencies. To explicitly solve the mentioned three problems respectively, we propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with Multi-Scale Patch Decomposition block (EMPD), Triad Interaction Block (TIB) and Adaptive Scale Routing MoE block (ASR-MoE). Specifically, EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities, eliminating predefined scale constraints through input-adaptive patch adjustment. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer's decomposed representations. EMPD and TIB are jointly integrated into layers forming a multi-layer progressive cascade architecture, where coarse-grained representations from earlier layers adaptively guide fine-grained feature extraction in subsequent layers via gated pathways. And ASR-MoE dynamically fuses multi-scale predictions by leveraging specialized global and local experts with temporal-aware weighting. Comprehensive experiments on thirteen real-world benchmarks demonstrate that DMSC consistently maintains state-of-the-art (SOTA) performance and superior computational efficiency for TSF tasks. Code is available at https://github.com/1327679995/DMSC.
Authors: Daniel Beaglehole, David Holzm\"uller, Adityanarayanan Radhakrishnan, Mikhail Belkin
Abstract: Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation for modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, the best practice for these predictive tasks has been relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that compared to $31$ other methods, including recently introduced tabular foundation models (TabPFNv2) and GBDTs, xRFM achieves best performance across $100$ regression datasets and is competitive to the best methods across $200$ classification datasets outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
Authors: Andrea Fox, Francesco De Pellegrini, Eitan Altman
Abstract: In edge computing systems, autonomous agents must make fast local decisions while competing for shared resources. Existing MARL methods often resume to centralized critics or frequent communication, which fail under limited observability and communication constraints. We propose a decentralized framework in which each agent solves a constrained Markov decision process (CMDP), coordinating implicitly through a shared constraint vector. For the specific case of offloading, e.g., constraints prevent overloading shared server resources. Coordination constraints are updated infrequently and act as a lightweight coordination mechanism. They enable agents to align with global resource usage objectives but require little direct communication. Using safe reinforcement learning, agents learn policies that meet both local and global goals. We establish theoretical guarantees under mild assumptions and validate our approach experimentally, showing improved performance over centralized and independent baselines, especially in large-scale settings.
Authors: Sophia Sun, Rose Yu
Abstract: Conformal prediction has been explored as a general and efficient way to provide uncertainty quantification for time series. However, current methods struggle to handle time series data with change points - sudden shifts in the underlying data-generating process. In this paper, we propose a novel Conformal Prediction for Time-series with Change points (CPTC) algorithm, addressing this gap by integrating a model to predict the underlying state with online conformal prediction to model uncertainties in non-stationary time series. We prove CPTC's validity and improved adaptivity in the time series setting under minimum assumptions, and demonstrate CPTC's practical effectiveness on 6 synthetic and real-world datasets, showing improved validity and adaptivity compared to state-of-the-art baselines.
Authors: Bahareh Tolooshams, Ailsa Shen, Anima Anandkumar
Abstract: We frame the problem of unifying representations in neural models as one of sparse model recovery and introduce a framework that extends sparse autoencoders (SAEs) to lifted spaces and infinite-dimensional function spaces, enabling mechanistic interpretability of large neural operators (NO). While the Platonic Representation Hypothesis suggests that neural networks converge to similar representations across architectures, the representational properties of neural operators remain underexplored despite their growing importance in scientific computing. We compare the inference and training dynamics of SAEs, lifted-SAE, and SAE neural operators. We highlight how lifting and operator modules introduce beneficial inductive biases, enabling faster recovery, improved recovery of smooth concepts, and robust inference across varying resolutions, a property unique to neural operators.
Authors: Christo Mathew, Wentian Wang, Jacob Feldman, Lazaros K. Gallos, Paul B. Kantor, Vladimir Menkov, Hao Wang
Abstract: We investigate reinforcement learning in the Game Of Hidden Rules (GOHR) environment, a complex puzzle in which an agent must infer and execute hidden rules to clear a 6$\times$6 board by placing game pieces into buckets. We explore two state representation strategies, namely Feature-Centric (FC) and Object-Centric (OC), and employ a Transformer-based Advantage Actor-Critic (A2C) algorithm for training. The agent has access only to partial observations and must simultaneously infer the governing rule and learn the optimal policy through experience. We evaluate our models across multiple rule-based and trial-list-based experimental setups, analyzing transfer effects and the impact of representation on learning efficiency.
Authors: Bhavya Agrawalla, Michal Nauman, Khush Agrawal, Aviral Kumar
Abstract: A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
Authors: Manh Cuong Dao, The Hung Tran, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang
Abstract: This paper studies the black-box optimization task which aims to find the maxima of a black-box function using a static set of its observed input-output pairs. This is often achieved via learning and optimizing a surrogate function with that offline data. Alternatively, it can also be framed as an inverse modeling task that maps a desired performance to potential input candidates that achieve it. Both approaches are constrained by the limited amount of offline data. To mitigate this limitation, we introduce a new perspective that casts offline optimization as a distributional translation task. This is formulated as learning a probabilistic bridge transforming an implicit distribution of low-value inputs (i.e., offline data) into another distribution of high-value inputs (i.e., solution candidates). Such probabilistic bridge can be learned using low- and high-value inputs sampled from synthetic functions that resemble the target function. These synthetic functions are constructed as the mean posterior of multiple Gaussian processes fitted with different parameterizations on the offline data, alleviating the data bottleneck. The proposed approach is evaluated on an extensive benchmark comprising most recent methods, demonstrating significant improvement and establishing a new state-of-the-art performance. Our code is publicly available at https://github.com/cuong-dm/ROOT.
Authors: Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter
Abstract: Hallucinations are a common issue that undermine the reliability of large language models (LLMs). Recent studies have identified a specific subset of hallucinations, known as confabulations, which arise due to predictive uncertainty of LLMs. To detect confabulations, various methods for estimating predictive uncertainty in natural language generation (NLG) have been developed. These methods are typically evaluated by correlating uncertainty estimates with the correctness of generated text, with question-answering (QA) datasets serving as the standard benchmark. However, commonly used approximate correctness functions have substantial disagreement between each other and, consequently, in the ranking of the uncertainty estimation methods. This allows one to inflate the apparent performance of uncertainty estimation methods. We propose using several alternative risk indicators for risk correlation experiments that improve robustness of empirical assessment of UE algorithms for NLG. For QA tasks, we show that marginalizing over multiple LLM-as-a-judge variants leads to reducing the evaluation biases. Furthermore, we explore structured tasks as well as out of distribution and perturbation detection tasks which provide robust and controllable risk indicators. Finally, we propose to use an Elo rating of uncertainty estimation methods to give an objective summarization over extensive evaluation settings.
Authors: Sahil Joshi, Agniva Chowdhury, Amar Kanakamedala, Ekam Singh, Evan Tu, Anshumali Shrivastava
Abstract: Softmax Attention has a quadratic time complexity, which becomes prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention (an exact, GPU-optimized implementation of Softmax Attention) cannot complete a single forward-backward pass of a multi-head attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce RACE Attention, a kernel-inspired alternative to Softmax Attention that is linear in sequence length and embedding dimension. RACE Attention replaces the exponential kernel with a sharpened angular (cosine) similarity, and approximates attention outputs via randomized projections and soft Locality-Sensitive Hashing (LSH). Across language modeling, masked language modeling, and text classification, RACE Attention matches the accuracy of strong baselines while reducing runtime and memory. In a controlled scale test, it processes up to 12 million tokens during a single forward-backward pass on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU, well beyond the practical limits of the current state-of-the-art attention implementations. RACE Attention thus offers a practical, theoretically grounded mechanism for outrageously long context windows on today's hardware. We hope that it gets adopted in practice.
Authors: Rishabh Ranjan, Valter Hudovernik, Mark Znidar, Charilaos Kanatsoulis, Roshan Upendra, Mahmoud Mohammadi, Joe Meyer, Tom Palczewski, Carlos Guestrin, Jure Leskovec
Abstract: Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples. RT (i) tokenizes cells with table/column metadata, (ii) is pretrained via masked token prediction, and (iii) utilizes a novel Relational Attention mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 93% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experiments show that RT's zero-shot transfer harnesses task-table context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data.
Authors: Chen Wang, Zhaochun Li, Jionghao Bai, Yuzhi Zhang, Shisheng Cui, Zhou Zhao, Yue Wang
Abstract: Reinforcement fine-tuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLM), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
Authors: Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji
Abstract: Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains -- general knowledge understanding, scientific question answering, mathematical reasoning, and code generation -- demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
Authors: Zhi Yang, Changwu Huang, Ke Tang, Xin Yao
Abstract: While significant progress has been made in conventional fairness-aware machine learning (ML) and differentially private ML (DPML), the fairness of privacy protection across groups remains underexplored. Existing studies have proposed methods to assess group privacy risks, but these are based on the average-case privacy risks of data records. Such approaches may underestimate the group privacy risks, thereby potentially underestimating the disparity across group privacy risks. Moreover, the current method for assessing the worst-case privacy risks of data records is time-consuming, limiting their practical applicability. To address these limitations, we introduce a novel membership inference game that can efficiently audit the approximate worst-case privacy risks of data records. Experimental results demonstrate that our method provides a more stringent measurement of group privacy risks, yielding a reliable assessment of the disparity in group privacy risks. Furthermore, to promote privacy protection fairness in DPML, we enhance the standard DP-SGD algorithm with an adaptive group-specific gradient clipping strategy, inspired by the design of canaries in differential privacy auditing studies. Extensive experiments confirm that our algorithm effectively reduces the disparity in group privacy risks, thereby enhancing the fairness of privacy protection in DPML.
Authors: Masahiro Negishi, Hyunsoo Park, Kinga O. Mastej, Aron Walsh
Abstract: To address pressing scientific challenges such as climate change, increasingly sophisticated generative artificial intelligence models are being developed that can efficiently sample the large chemical space of possible functional materials. These models can quickly sample new chemical compositions paired with crystal structures. They are typically evaluated using uniqueness and novelty metrics, which depend on a chosen crystal distance function. However, the most prevalent distance function has four limitations: it fails to quantify the degree of similarity between compounds, cannot distinguish compositional difference and structural difference, lacks Lipschitz continuity against shifts in atomic coordinates, and results in a uniqueness metric that is not invariant against the permutation of generated samples. In this work, we propose using two continuous distance functions to evaluate uniqueness and novelty, which theoretically overcome these limitations. Our experiments show that these distances reveal insights missed by traditional distance functions, providing a more reliable basis for evaluating and comparing generative models for inorganic crystals.
Authors: Aditya Puttaparthi Tirumala
Abstract: Marketing Mix Modeling (MMM) is a statistical technique used to estimate the impact of marketing activities on business outcomes such as sales, revenue, or customer visits. Traditional MMM approaches often rely on linear regression or Bayesian hierarchical models that assume independence between marketing channels and struggle to capture complex temporal dynamics and non-linear saturation effects [@Chan2017; @Hanssens2005; @Ng2021Bayesian]. **DeepCausalMMM** is a Python package that addresses these limitations by combining deep learning, causal inference, and advanced marketing science. The package uses Gated Recurrent Units (GRUs) to automatically learn temporal patterns such as adstock (carryover effects) and lag, while simultaneously learning statistical dependencies and potential causal structures between marketing channels through Directed Acyclic Graph (DAG) learning [@Zheng2018NOTEARS; @Gong2024CausalMMM]. Additionally, it implements Hill equation-based saturation curves to model diminishing returns and optimize budget allocation. Key features include: (1) a data-driven design where hyperparameters and transformations (e.g., adstock decay, saturation curves) are learned or estimated from data with sensible defaults, rather than requiring fixed heuristics or manual specification, (2) multi-region modeling with both shared and region-specific parameters, (3) robust statistical methods including Huber loss and advanced regularization, (4) comprehensive response curve analysis for understanding channel saturation.
Authors: Jahidul Arafat, Fariha Tasmin, Sanjaya Poudel
Abstract: Multi-class wine classification presents fundamental trade-offs between model accuracy, feature dimensionality, and interpretability - critical factors for production deployment in analytical chemistry. This paper presents a comprehensive empirical study of One-vs-Rest logistic regression on the UCI Wine dataset (178 samples, 3 cultivars, 13 chemical features), comparing from-scratch gradient descent implementation against scikit-learn's optimized solvers and quantifying L1 regularization effects on feature sparsity. Manual gradient descent achieves 92.59 percent mean test accuracy with smooth convergence, validating theoretical foundations, though scikit-learn provides 24x training speedup and 98.15 percent accuracy. Class-specific analysis reveals distinct chemical signatures with heterogeneous patterns where color intensity varies dramatically (0.31 to 16.50) across cultivars. L1 regularization produces 54-69 percent feature reduction with only 4.63 percent accuracy decrease, demonstrating favorable interpretability-performance trade-offs. We propose an optimal 5-feature subset achieving 62 percent complexity reduction with estimated 92-94 percent accuracy, enabling cost-effective deployment with 80 dollars savings per sample and 56 percent time reduction. Statistical validation confirms robust generalization with sub-2ms prediction latency suitable for real-time quality control. Our findings provide actionable guidelines for practitioners balancing comprehensive chemical analysis against targeted feature measurement in resource-constrained environments.
Authors: Kexin Zheng, Lauriane Teyssier, Yinan Zheng, Yu Luo, Xianyuan Zhan
Abstract: The recent development of zero-shot reinforcement learning (RL) has opened a new avenue for learning pre-trained generalist policies that can adapt to arbitrary new tasks in a zero-shot manner. While the popular Forward-Backward representations (FB) and related methods have shown promise in zero-shot RL, we empirically found that their modeling lacks expressivity and that extrapolation errors caused by out-of-distribution (OOD) actions during offline learning sometimes lead to biased representations, ultimately resulting in suboptimal performance. To address these issues, we propose Behavior-REgularizEd Zero-shot RL with Expressivity enhancement (BREEZE), an upgraded FB-based framework that simultaneously enhances learning stability, policy extraction capability, and representation learning quality. BREEZE introduces behavioral regularization in zero-shot RL policy learning, transforming policy optimization into a stable in-sample learning paradigm. Additionally, BREEZE extracts the policy using a task-conditioned diffusion model, enabling the generation of high-quality and multimodal action distributions in zero-shot RL settings. Moreover, BREEZE employs expressive attention-based architectures for representation modeling to capture the complex relationships between environmental dynamics. Extensive experiments on ExORL and D4RL Kitchen demonstrate that BREEZE achieves the best or near-the-best performance while exhibiting superior robustness compared to prior offline zero-shot RL methods. The official implementation is available at: https://github.com/Whiterrrrr/BREEZE.
Authors: Longwei Wang, Ifrat Ikhtear Uddin, KC Santosh, Chaowei Zhang, Xiao Qin, Yang Zhou
Abstract: Adversarial examples reveal critical vulnerabilities in deep neural networks by exploiting their sensitivity to imperceptible input perturbations. While adversarial training remains the predominant defense strategy, it often incurs significant computational cost and may compromise clean-data accuracy. In this work, we investigate an architectural approach to adversarial robustness by embedding group-equivariant convolutions-specifically, rotation- and scale-equivariant layers-into standard convolutional neural networks (CNNs). These layers encode symmetry priors that align model behavior with structured transformations in the input space, promoting smoother decision boundaries and greater resilience to adversarial attacks. We propose and evaluate two symmetry-aware architectures: a parallel design that processes standard and equivariant features independently before fusion, and a cascaded design that applies equivariant operations sequentially. Theoretically, we demonstrate that such models reduce hypothesis space complexity, regularize gradients, and yield tighter certified robustness bounds under the CLEVER (Cross Lipschitz Extreme Value for nEtwork Robustness) framework. Empirically, our models consistently improve adversarial robustness and generalization across CIFAR-10, CIFAR-100, and CIFAR-10C under both FGSM and PGD attacks, without requiring adversarial training. These findings underscore the potential of symmetry-enforcing architectures as efficient and principled alternatives to data augmentation-based defenses.
Authors: Zhoutong Wu, Yuan Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, Zhouchen Lin
Abstract: Transformer models have driven breakthroughs across various language tasks by their strong capability to learn rich contextual representations. Scaling them to improve representation, however, often demands substantial memory and compute costs, such as the Key-Value (KV) cache used during auto-regressive decoding. Skip connections offer a promising way to improve representation without bloating resource usage, yet most prior works either improve expressivity while leaving KV costs unchanged, or reduce memory at the cost of weaker representation. In this work, we propose SkipV1Former, a Transformer variant that uses skip connections from the first layer's Value heads to strengthen model representation and reduce KV cache. Specifically, from the second block onward, each layer reuses half of its Value heads from the very first layer, while computing the other half as usual-cutting Value projections and V cache by nearly 50 \%. Theoretically, we show that routing uncompressed first-layer Values into deeper layers restores information lost to compression and accelerates the model's implicit mesa-optimization-a key pattern of Transformer in auto-regressive tasks. Empirically, across different model scales, SkipV1Former delivers consistent reductions of approximately 25 \% in KV cache while improving perplexity relative to standard Multi-Head Attention (MHA) Transformers and some advanced variants. Moreover, we propose a recipe for uptraining existing MHA Transformer checkpoints to SkipV1Former with only 10-15\% additional compute. Finally, SkipV1Former can seamlessly combine advanced methods like Group-Query Attention and Multi-Latent Attention to achieve further KV cache savings and performance improvement. When combined with YOCO, it cuts KV cache size by nearly 50 \% while still improving performance.
Authors: Pengxiang Cai, Zihao Gao, Jintai Chen
Abstract: Tabular prediction has traditionally relied on gradient-boosted decision trees and specialized deep learning models, which excel within tasks but provide limited interpretability and weak transfer across tables. Reasoning large language models (LLMs) promise cross-task adaptability with trans- parent reasoning traces, yet their potential has not been fully realized for tabular data. This paper presents TabR1, the first reasoning LLM for tabular prediction with multi-step reasoning. At its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior. By construct- ing multiple label-preserving permutations per sample and estimating advantages both within and across permutations, PRPO transforms sparse rewards into dense learning signals and improves generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, enhancing few-shot and zero-shot performance as well as interpretability. Comprehensive experiments demonstrate that TabR1 achieves performance comparable to strong baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1 approaches the performance of strong baselines under the 32-shot setting. Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
Authors: Shuofeng Zhang, Ard Louis
Abstract: A wide variety of generalization measures have been applied to deep neural networks (DNNs). Although obtaining tight bounds remains challenging, such measures are often assumed to reproduce qualitative generalization trends. In this position paper, we argue that many post-mortem generalization measures -- those computed on trained networks -- are \textbf{fragile}: small training modifications that barely affect the underlying DNN can substantially change a measure's value, trend, or scaling behavior. For example, minor hyperparameter changes, such as learning rate adjustments or switching between SGD variants can reverse the slope of a learning curve in widely used generalization measures like the path norm. We also identify subtler forms of fragility. For instance, the PAC-Bayes origin measure is regarded as one of the most reliable, and is indeed less sensitive to hyperparameter tweaks than many other measures. However, it completely fails to capture differences in data complexity across learning curves. This data fragility contrasts with the function-based marginal-likelihood PAC-Bayes bound, which does capture differences in data-complexity, including scaling behavior, in learning curves, but which is not a post-mortem measure. Beyond demonstrating that many bounds -- such as path, spectral and Frobenius norms, flatness proxies, and deterministic PAC-Bayes surrogates -- are fragile, this position paper also argues that developers of new measures should explicitly audit them for fragility.
Authors: Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, Kai Wang
Abstract: State-space models (SSMs) have emerged as efficient alternatives to Transformers for sequence modeling, offering superior scalability through recurrent structures. However, their training remains costly and the ecosystem around them is far less mature than that of Transformers. Moreover, the structural heterogeneity between SSMs and Transformers makes it challenging to efficiently distill knowledge from pretrained attention models. In this work, we propose Cross-architecture distillation via Attention Bridge (CAB), a novel data-efficient distillation framework that efficiently transfers attention knowledge from Transformer teachers to state-space student models. Unlike conventional knowledge distillation that transfers knowledge only at the output level, CAB enables token-level supervision via a lightweight bridge and flexible layer-wise alignment, improving both efficiency and transferability. We further introduce flexible layer-wise alignment strategies to accommodate architectural discrepancies between teacher and student. Extensive experiments across vision and language domains demonstrate that our method consistently improves the performance of state-space models, even under limited training data, outperforming both standard and cross-architecture distillation methods. Our findings suggest that attention-based knowledge can be efficiently transferred to recurrent models, enabling rapid utilization of Transformer expertise for building a stronger SSM community.
Authors: Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou
Abstract: In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library-linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
Authors: Demian Till, John Smeaton, Peter Haubrick, Gouse Saheb, Florian Graef, David Berman
Abstract: Recent work has demonstrated state-of-the-art results in large language model (LLM) hallucination detection and mitigation through consistency-based approaches which involve aggregating multiple responses sampled from a single LLM for a given prompt. These approaches help offset limitations stemming from the imperfect data on which LLMs are trained, which includes biases and under-representation of information required at deployment time among other limitations which can lead to hallucinations. We show that extending these single-model consistency methods to combine responses from multiple LLMs with different training data, training schemes and model architectures can result in substantial further improvements in hallucination detection and mitigation capabilities beyond their single-model consistency counterparts. We evaluate this "consortium consistency" approach across many model teams from a pool of 15 LLMs and explore under what conditions it is beneficial to team together different LLMs in this manner. Further, we show that these performance improvements often come with reduced inference costs, offsetting a significant drawback with single-model consistency methods.
Authors: Clara Mohri, Haim Kaplan, Tal Schuster, Yishay Mansour, Amir Globerson
Abstract: Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models- ranging from faster but less inaccurate, to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass, until finally the target model verifies tokens. We derive an expression for the expected latency of any such hierarchy and show that selecting the latency-optimal hierarchy can be done in polynomial time. Empirically, HSD gives up to 1.2x speed-up over the best single-draft baseline, demonstrating the practicality of our algorithm in reducing generation latency beyond previous techniques.
Authors: Aman Bilkhoo, Mehran Hosseini, Milad Kazemi, Nicola Paoletti
Abstract: Counterfactual explanations (CFXs) provide human-understandable justifications for model predictions, enabling actionable recourse and enhancing interpretability. To be reliable, CFXs must avoid regions of high predictive uncertainty, where explanations may be misleading or inapplicable. However, existing methods often neglect uncertainty or lack principled mechanisms for incorporating it with formal guarantees. We propose CONFEX, a novel method for generating uncertainty-aware counterfactual explanations using Conformal Prediction (CP) and Mixed-Integer Linear Programming (MILP). CONFEX explanations are designed to provide local coverage guarantees, addressing the issue that CFX generation violates exchangeability. To do so, we develop a novel localised CP procedure that enjoys an efficient MILP encoding by leveraging an offline tree-based partitioning of the input space. This way, CONFEX generates CFXs with rigorous guarantees on both predictive uncertainty and optimality. We evaluate CONFEX against state-of-the-art methods across diverse benchmarks and metrics, demonstrating that our uncertainty-aware approach yields robust and plausible explanations.
Authors: Jiacheng Liu, Xinyu Wang, Yuqi Lin, Zhikai Wang, Peiru Wang, Peiliang Cai, Qinming Zhou, Zhengan Yan, Zexuan Yan, Zhengyi Shi, Chang Zou, Yue Ma, Linfeng Zhang
Abstract: Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, \textbf{Diffusion Caching} offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from \textit{static reuse} to \textit{dynamic prediction}. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of \textit{Efficient Generative Intelligence}.
Authors: Cole Franks, Rafael Oliveira, Akshay Ramachandran, Michael Walter
Abstract: The matrix normal model, i.e., the family of Gaussian matrix-variate distributions whose covariance matrices are the Kronecker product of two lower dimensional factors, is frequently used to model matrix-variate data. The tensor normal model generalizes this family to Kronecker products of three or more factors. We study the estimation of the Kronecker factors of the covariance matrix in the matrix and tensor normal models. For the above models, we show that the maximum likelihood estimator (MLE) achieves nearly optimal nonasymptotic sample complexity and nearly tight error rates in the Fisher-Rao and Thompson metrics. In contrast to prior work, our results do not rely on the factors being well-conditioned or sparse, nor do we need to assume an accurate enough initial guess. For the matrix normal model, all our bounds are minimax optimal up to logarithmic factors, and for the tensor normal model our bounds for the largest factor and for overall covariance matrix are minimax optimal up to constant factors provided there are enough samples for any estimator to obtain constant Frobenius error. In the same regimes as our sample complexity bounds, we show that the flip-flop algorithm, a practical and widely used iterative procedure to compute the MLE, converges linearly with high probability. Our main technical insight is that, given enough samples, the negative log-likelihood function is strongly geodesically convex in the geometry on positive-definite matrices induced by the Fisher information metric. This strong convexity is determined by the expansion of certain random quantum channels.
Authors: Kirsten Fischer, David Dahmen, Moritz Helias
Abstract: Residual networks have significantly better trainability and thus performance than feed-forward networks at large depth. Introducing skip connections facilitates signal propagation to deeper layers. In addition, previous works found that adding a scaling parameter for the residual branch further improves generalization performance. While they empirically identified a particularly beneficial range of values for this scaling parameter, the associated performance improvement and its universality across network hyperparameters yet need to be understood. For feed-forward networks, finite-size theories have led to important insights with regard to signal propagation and hyperparameter tuning. We here derive a systematic finite-size field theory for residual networks to study signal propagation and its dependence on the scaling for the residual branch. We derive analytical expressions for the response function, a measure for the network's sensitivity to inputs, and show that for deep networks the empirically found values for the scaling parameter lie within the range of maximal sensitivity. Furthermore, we obtain an analytical expression for the optimal scaling parameter that depends only weakly on other network hyperparameters, such as the weight variance, thereby explaining its universality across hyperparameters. Overall, this work provides a theoretical framework to study ResNets at finite size.
Authors: Carlos Fern\'andez-Lor\'ia, Yanfang Hou, Foster Provost, Jennifer Hill
Abstract: Organizations increasingly rely on predictive models to decide who should be targeted for interventions, such as marketing campaigns, customer retention offers, or medical treatments. Yet these models are usually built to predict outcomes (e.g., likelihood of purchase or churn), not the actual impact of an intervention. As a result, the scores (predicted values) they produce are often imperfect guides for allocating resources. Causal effects can be estimated with randomized experiments, but experiments are costly, limited in scale, and tied to specific actions. We propose causal post-processing (CPP), a family of techniques that uses limited experimental data to refine the outputs of predictive models, so they better align with causal decision making. The CPP family spans approaches that trade off flexibility against data efficiency, unifying existing methods and motivating new ones. Through simulations and an empirical study in digital advertising, we show that CPP can improve intervention decisions, particularly when predictive models capture a useful but imperfect causal signal. Our results show how organizations can combine predictive modeling with experimental evidence to make more effective and scalable intervention decisions.
Authors: Kevin Luo, Yufan Li, Pragya Sur
Abstract: Two key tasks in high-dimensional regularized regression are tuning the regularization strength for accurate predictions and estimating the out-of-sample risk. It is known that the standard approach -- $k$-fold cross-validation -- is inconsistent in modern high-dimensional settings. While leave-one-out and generalized cross-validation remain consistent in some high-dimensional cases, they become inconsistent when samples are dependent or contain heavy-tailed covariates. As a first step towards modeling structured sample dependence and heavy tails, we use right-rotationally invariant covariate distributions -- a crucial concept from compressed sensing. In the proportional asymptotics regime where the number of features and samples grow comparably, which is known to better reflect the empirical behavior in moderately sized datasets, we introduce a new framework, ROTI-GCV, for reliably performing cross-validation under these challenging conditions. Along the way, we propose new estimators for the signal-to-noise ratio and noise variance. We conduct experiments that demonstrate the accuracy of our approach in a variety of synthetic and semi-synthetic settings.
Authors: Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li
Abstract: Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about $5\%$ safety neurons, and by only patching their activations we can restore over $90\%$ of the safety performance across various red-teaming benchmarks without influencing general ability. The finding of safety neurons also helps explain the ''alignment tax'' phenomenon by revealing that the key neurons for model safety and helpfulness significantly overlap, yet they require different activation patterns for the same neurons. Furthermore, we demonstrate an application of our findings in safeguarding LLMs by detecting unsafe outputs before generation. The source code is available at https://github.com/THU-KEG/SafetyNeuron.
Authors: Xing Liu, Fran\c{c}ois-Xavier Briol
Abstract: Goodness-of-fit testing is often criticized for its lack of practical relevance: since ``all models are wrong'', the null hypothesis that the data conform to our model is ultimately always rejected as the sample size grows. Despite this, probabilistic models are still used extensively, raising the more pertinent question of whether the model is \emph{good enough} for the task at hand. This question can be formalized as a robust goodness-of-fit testing problem by asking whether the data were generated from a distribution that is a mild perturbation of the model. In this paper, we show that existing kernel goodness-of-fit tests are not robust under common notions of robustness including both qualitative and quantitative robustness. We further show that robustification techniques using tilted kernels, while effective in the parameter estimation literature, are not sufficient to ensure both types of robustness in the testing setting. To address this, we propose the first robust kernel goodness-of-fit test, which resolves this open problem by using kernel Stein discrepancy (KSD) balls. This framework encompasses many well-known perturbation models, such as Huber's contamination and density-band models.
Authors: Deepak Dagar, Dinesh Kumar Vishwakarma
Abstract: Deepfakes, which employ GAN to produce highly realistic facial modification, are widely regarded as the prevailing method. Traditional CNN have been able to identify bogus media, but they struggle to perform well on different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require enough training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as an input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlation. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances in manipulations. Experiments were performed on different categories of FF++, such as DF, f2f, FS, and NT, together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments also conducted on FF++, DFDCPreview, and Celeb-DF dataset underwent several post-processing situations, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving a 98% accuracy in cross-domain scenarios. This demonstrates its ability to learn the shared distinguishing textural characteristics in the manipulated samples. These experiments provide evidence that the proposed model is capable of being applied to various situations and is resistant to many post-processing procedures.
Authors: Ray Congrui Yu, Sherry Wu, Jiang Gui
Abstract: Despite their immense success, deep convolutional neural networks (CNNs) can be difficult to optimize and costly to train due to hundreds of layers within the network depth. Conventional convolutional operations are fundamentally limited by their linear nature along with fixed activations, where many layers are needed to learn meaningful patterns in data. Because of the sheer size of these networks, this approach is simply computationally inefficient, and poses overfitting or gradient explosion risks, especially in small datasets. As a result, we introduce a "plug-in" module, called Residual Kolmogorov-Arnold Network (RKAN). Our module is highly compact, so it can be easily added into any stage (level) of traditional deep networks, where it learns to integrate supportive polynomial feature transformations to existing convolutional frameworks. RKAN offers consistent improvements over baseline models in different vision tasks and widely tested benchmarks, accomplishing cutting-edge performance on them.
Authors: Yan Li, Xiao Zhang, Mingyi Li, Guangwei Xu, Feng Chen, Yuan Yuan, Yifei Zou, Mengying Zhao, Jianbo Lu, Dongxiao Yu
Abstract: In this work, we study to release the potential of massive heterogeneous weak computing power to collaboratively train large-scale models on dispersed datasets. In order to improve both efficiency and accuracy in resource-adaptive collaborative learning, we take the first step to consider the \textit{unstructured pruning}, \textit{varying submodel architectures}, \textit{knowledge loss}, and \textit{straggler} challenges simultaneously. We propose a novel semi-asynchronous collaborative training framework, namely ${Co\text{-}S}^2{P}$, with data distribution-aware structured pruning and cross-block knowledge transfer mechanism to address the above concerns. Furthermore, we provide theoretical proof that ${Co\text{-}S}^2{P}$ can achieve asymptotic optimal convergence rate of $O(1/\sqrt{N^*EQ})$. Finally, we conduct extensive experiments on two types of tasks with a real-world hardware testbed including diverse IoT devices.The experimental results demonstrate that $Co\text{-}S^2P$ improves accuracy by up to 8.8\% and resource utilization by up to 1.2$\times$ compared to state-of-the-art methods, while reducing memory consumption by approximately 22\% and training time by about 24\% on all resource-limited devices.
Authors: Ameya Daigavane, Bodhi P. Vani, Darcy Davidson, Saeed Saremi, Joshua Rackers, Joseph Kleinhenz
Abstract: Conformational ensembles of protein structures are immensely important both for understanding protein function and drug discovery in novel modalities such as cryptic pockets. Current techniques for sampling ensembles such as molecular dynamics (MD) are computationally inefficient, while many recent machine learning methods do not transfer to systems outside their training data. We propose JAMUN which performs MD in a smoothed, noised space of all-atom 3D conformations of molecules by utilizing the framework of walk-jump sampling. JAMUN enables ensemble generation for small peptides at rates of an order of magnitude faster than traditional molecular dynamics. The physical priors in JAMUN enables transferability to systems outside of its training data, even to peptides that are longer than those originally trained on. Our model, code and weights are available at https://github.com/prescient-design/jamun.
Authors: G\'erard Ben Arous, C\'edric Gerbelot, Vanessa Piccolo
Abstract: We study the high-dimensional dynamics of online stochastic gradient descent (SGD) for the multi-spiked tensor model. This multi-index model arises from the tensor principal component analysis (PCA) problem with multiple spikes, where the goal is to estimate $r$ unknown signal vectors within the $N$-dimensional unit sphere through maximum likelihood estimation from noisy observations of a $p$-tensor. We determine the number of samples and the conditions on the signal-to-noise ratios (SNRs) required to efficiently recover the unknown spikes from natural random initializations. We show that full recovery of all spikes is possible provided a number of sample scaling as $N^{p-2}$, matching the algorithmic threshold identified in the rank-one case [Ben Arous, Gheissari, Jagannath 2020, 2021]. Our results are obtained through a detailed analysis of a low-dimensional system that describes the evolution of the correlations between the estimators and the spikes, while controlling the noise in the dynamics. We find that the spikes are recovered sequentially in a process we term "sequential elimination": once a correlation exceeds a critical threshold, all correlations sharing a row or column index become sufficiently small, allowing the next correlation to grow and become macroscopic. The order in which correlations become macroscopic depends on their initial values and the corresponding SNRs, leading to either exact recovery or recovery of a permutation of the spikes. In the matrix case, when $p=2$, if the SNRs are sufficiently separated, we achieve exact recovery of the spikes, whereas equal SNRs lead to recovery of the subspace spanned by them.
Authors: Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao
Abstract: Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100\% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99\% ASR on GPT-3.5 and 49\% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety. Our code is available at: https://github.com/SunChungEn/ADV-LLM
Authors: Ayush Mohanty, Jason Dekarske, Stephen K. Robinson, Sanjay Joshi, Nagi Gebraeel
Abstract: Robotic manipulators are critical in many applications but are known to degrade over time. This degradation is influenced by the nature of the tasks performed by the robot. Tasks with higher severity, such as handling heavy payloads, can accelerate the degradation process. One way this degradation is reflected is in the position accuracy of the robot's end-effector. In this paper, we present a prognostic modeling framework that predicts a robotic manipulator's Remaining Useful Life (RUL) while accounting for the effects of task severity. Our framework represents the robot's position accuracy as a Brownian motion process with a random drift parameter that is influenced by task severity. The dynamic nature of task severity is modeled using a continuous-time Markov chain (CTMC). To evaluate RUL, we discuss two approaches -- (1) a novel closed-form expression for Remaining Lifetime Distribution (RLD), and (2) Monte Carlo simulations, commonly used in prognostics literature. Theoretical results establish the equivalence between these RUL computation approaches. We validate our framework through experiments using two distinct physics-based simulators for planar and spatial robot fleets. Our findings show that robots in both fleets experience shorter RUL when handling a higher proportion of high-severity tasks.
Authors: Yuxin Chang, Alex Boyd, Cao Xiao, Taha Kass-Hout, Parminder Bhatia, Padhraic Smyth, Andrew Warrington
Abstract: Marked temporal point processes (MTPPs) model sequences of events occurring at irregular time intervals, with wide-ranging applications in fields such as healthcare, finance and social networks. We propose the state-space point process (S2P2) model, a novel and performant model that leverages techniques derived for modern deep state-space models (SSMs) to overcome limitations of existing MTPP models, while simultaneously imbuing strong inductive biases for continuous-time event sequences that other discrete sequence models (i.e., RNNs, transformers) do not capture. Inspired by the classical linear Hawkes processes, we propose an architecture that interleaves stochastic jump differential equations with nonlinearities to create a highly expressive intensity-based MTPP model, without the need for restrictive parametric assumptions for the intensity. Our approach enables efficient training and inference with a parallel scan, bringing linear complexity and sublinear scaling while retaining expressivity to MTPPs. Empirically, S2P2 achieves state-of-the-art predictive likelihoods across eight real-world datasets, delivering an average improvement of 33% over the best existing approaches.
Authors: Adrien Vacher, Omar Chehab, Anna Korba
Abstract: Even in low dimensions, sampling from multi-modal distributions is challenging. We provide the first sampling algorithm for a broad class of distributions -- including all Gaussian mixtures -- with a query complexity that is polynomial in the parameters governing multi-modality, assuming fixed dimension. Our sampling algorithm simulates a time-reversed diffusion process, using a self-normalized Monte Carlo estimator of the intermediate score functions. Unlike previous works, it avoids metastability, requires no prior knowledge of the mode locations, and relaxes the well-known log-smoothness assumption which excluded general Gaussian mixtures so far.
Authors: Zijun Gao, Yan Sun, Han Su
Abstract: Generative models have achieved remarkable success across a range of applications, yet their evaluation still lacks principled uncertainty quantification. In this paper, we develop a method for comparing how close different generative models are to the underlying distribution of test samples. Particularly, our approach employs the Kullback-Leibler (KL) divergence to measure the distance between a generative model and the unknown test distribution, as KL requires no tuning parameters such as the kernels used by RKHS-based distances, and is the only $f$-divergence that admits a crucial cancellation to enable the uncertainty quantification. Furthermore, we extend our method to comparing conditional generative models and leverage Edgeworth expansions to address limited-data settings. On simulated datasets with known ground truth, we show that our approach realizes effective coverage rates, and has higher power compared to kernel-based methods. When applied to generative models on image and text datasets, our procedure yields conclusions consistent with benchmark metrics but with statistical confidence.
Authors: Runxiong Wu, Dong Liu, Xueqin Wang, Andi Wang
Abstract: We consider primal-dual algorithms for general empirical risk minimization problems in distributed settings, focusing on two prominent classes of algorithms. The first class is the communication-efficient distributed dual coordinate ascent (CoCoA), derived from the coordinate ascent method for solving the dual problem. The second class is the alternating direction method of multipliers (ADMM), including consensus ADMM, proximal ADMM, and linearized ADMM. We demonstrate that both classes of algorithms can be transformed into a unified update form that involves only primal and dual variables. This discovery reveals key connections between the two classes of algorithms: CoCoA can be interpreted as a special case of proximal ADMM for solving the dual problem, while consensus ADMM is equivalent to a proximal ADMM algorithm. This discovery provides insight into how we can easily enable the ADMM variants to outperform the CoCoA variants by adjusting the augmented Lagrangian parameter. We further explore linearized versions of ADMM and analyze the effects of tuning parameters on these ADMM variants in the distributed setting. Extensive simulation studies and real-world data analysis support our theoretical findings.
Authors: Xinkai Jiang, Qihao Yuan, Enes Ulas Dincer, Hongyi Zhou, Ge Li, Xueyin Li, Xiaogang Jia, Timo Schnizer, Nicolas Schreiber, Weiran Liao, Julius Haag, Kailai Li, Gerhard Neumann, Rudolf Lioutikov
Abstract: This paper introduces IRIS, an Immersive Robot Interaction System leveraging Extended Reality (XR). Existing XR-based systems enable efficient data collection but are often challenging to reproduce and reuse due to their specificity to particular robots, objects, simulators, and environments. IRIS addresses these issues by supporting immersive interaction and data collection across diverse simulators and real-world scenarios. It visualizes arbitrary rigid and deformable objects, robots from simulation, and integrates real-time sensor-generated point clouds for real-world applications. Additionally, IRIS enhances collaborative capabilities by enabling multiple users to simultaneously interact within the same virtual scene. Extensive experiments demonstrate that IRIS offers efficient and intuitive data collection in both simulated and real-world settings.
Authors: Jose Blanchet, Yassine Hamoudi, Mario Szegedy, Guanyang Wang
Abstract: The mean of a random variable can be understood as a linear functional on the space of probability distributions. Quantum computing is known to provide a quadratic speedup over classical Monte Carlo methods for mean estimation. In this paper, we investigate whether a similar quadratic speedup is achievable for estimating non-linear functionals of probability distributions. We propose a quantum-inside-quantum Monte Carlo algorithm that achieves such a speedup for a broad class of non-linear estimation problems, including nested conditional expectations and stochastic optimization. Our algorithm improves upon the direct application of the quantum multilevel Monte Carlo algorithm introduced by An et al. (2021). The existing lower bound indicates that our algorithm is optimal up polylogarithmic factors. A key innovation of our approach is a new sequence of multilevel Monte Carlo approximations specifically designed for quantum computing, which is central to the algorithm's improved performance.
Authors: Hidde Fokkema, Tim van Erven, Sara Magliacane
Abstract: Machine learning is a vital part of many real-world systems, but several concerns remain about the lack of interpretability, explainability and robustness of black-box AI systems. Concept Bottleneck Models (CBM) address some of these challenges by learning interpretable concepts from high-dimensional data, e.g. images, which are used to predict labels. An important issue in CBMs are spurious correlation between concepts, which effectively lead to learning "wrong" concepts. Current mitigating strategies have strong assumptions, e.g., they assume that the concepts are statistically independent of each other, or require substantial interaction in terms of both interventions and labels provided by annotators. In this paper, we describe a framework that provides theoretical guarantees on the correctness of the learned concepts and on the number of required labels, without requiring any interventions. Our framework leverages causal representation learning (CRL) methods to learn latent causal variables from high-dimensional observations in a unsupervised way, and then learns to align these variables with interpretable concepts with few concept labels. We propose a linear and a non-parametric estimator for this mapping, providing a finite-sample high probability result in the linear case and an asymptotic consistency result for the non-parametric estimator. We evaluate our framework in synthetic and image benchmarks, showing that the learned concepts have less impurities and are often more accurate than other CBMs, even in settings with strong correlations between concepts.
Authors: Elizaveta Semenova, Alisa Sheinkman, Timothy James Hitge, Siobhan Mackenzie Hall, Jon Cockayne
Abstract: Surrogate models are widely used to approximate complex systems across science and engineering to reduce computational costs. Despite their widespread adoption, the field lacks standardisation across key stages of the modelling pipeline, including data sampling, model selection, evaluation, and downstream analysis. This fragmentation limits reproducibility and cross-domain utility -- a challenge further exacerbated by the rapid proliferation of AI-driven surrogate models. We argue for the urgent need to establish a structured reporting standard, the Surrogate Model Reporting Standard (SMRS), that systematically captures essential design and evaluation choices while remaining agnostic to implementation specifics. By promoting a standardised yet flexible framework, we aim to improve the reliability of surrogate modelling, foster interdisciplinary knowledge transfer, and, as a result, accelerate scientific progress in the AI era.
Authors: Anastasia N. Krouglova, Hayden R. Johnson, Basile Confavreux, Michael Deistler, Pedro J. Gon\c{c}alves
Abstract: Across many domains of science, stochastic models are an essential tool to understand the mechanisms underlying empirically observed data. Models can be of different levels of detail and accuracy, with models of high-fidelity (i.e., high accuracy) to the phenomena under study being often preferable. However, inferring parameters of high-fidelity models via simulation-based inference is challenging, especially when the simulator is computationally expensive. We introduce MF-(TS)NPE, a multifidelity approach to neural posterior estimation that uses transfer learning to leverage inexpensive low-fidelity simulations to efficiently infer parameters of high-fidelity simulators. MF-(TS)NPE applies the multifidelity scheme to both amortized and non-amortized neural posterior estimation. We further improve simulation efficiency by introducing A-MF-TSNPE, a sequential variant that uses an acquisition function targeting the predictive uncertainty of the density estimator to adaptively select high-fidelity parameters. On established benchmark and neuroscience tasks, our approaches require up to two orders of magnitude fewer high-fidelity simulations than current methods, while showing comparable performance. Overall, our approaches open new opportunities to perform efficient Bayesian inference on computationally expensive simulators.
Authors: Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen
Abstract: The ability to recognize patterns from examples and apply them to new ones is a primal ability for general intelligence, and is widely studied by psychology and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on few-shot (usually <10) setting and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs have brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations often focus on classification, and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition that asks LLM to predict output via input-output examples from underlying functions with diverse data format. Based on MIR-Bench, we study many novel problems for many-shot in-context reasoning, and acquired many insightful findings including scaling effect, robustness, inductive vs. transductive reasoning, retrieval Augmented Generation (RAG), coding for inductive reasoning, cross-domain generalizability, etc.
Authors: Haoran Ni, Martin Lotz
Abstract: Estimating Mutual Information (MI), a key measure of dependence of random quantities without specific modelling assumptions, is a challenging problem in high dimensions. We propose a novel mutual information estimator based on parametrizing conditional densities using normalizing flows, a deep generative model that has gained popularity in recent years. This estimator leverages a block autoregressive structure to achieve improved bias-variance trade-offs on standard benchmark tasks.
Authors: Jiahe Li, Xin Chen, Fanqi Shen, Junru Chen, Yuxin Liu, Daoze Zhang, Zhizhang Yuan, Fang Zhao, Meng Li, Yang Yang
Abstract: Neurological disorders pose major global health challenges, driving advances in brain signal analysis. Scalp electroencephalography (EEG) and intracranial EEG (iEEG) are widely used for diagnosis and monitoring. However, dataset heterogeneity and task variations hinder the development of robust deep learning solutions. This review systematically examines recent advances in deep learning approaches for EEG/iEEG-based neurological diagnostics, focusing on applications across 7 neurological conditions using 46 datasets. For each condition, we review representative methods and their quantitative results, integrating performance comparisons with analyses of data usage, model design, and task-specific adaptations, while highlighting the role of pre-trained multi-task models in achieving scalable, generalizable solutions. Finally, we propose a standardized benchmark to evaluate models across diverse datasets and improve reproducibility, emphasizing how recent innovations are transforming neurological diagnostics toward intelligent, adaptable healthcare systems.
Authors: Max Gutbrod, David Rauber, Danilo Weber Nunes, Christoph Palm
Abstract: The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The repository is available at https://github.com/remic-othr/OpenMIBOOD.
Authors: Jie Cheng, Gang Xiong, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Yisheng Lv, Fei-Yue Wang
Abstract: Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve comparable reasoning performance to verifiable reward-based methods within only 30% steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. We release our code and model weights at https://github.com/CJReinforce/PURE.
Authors: Yang Deng, Yaohui Liu, Rui Liang, Dafang Zhao, Donghua Xie, Ittetsu Taniguchi, Dan Wang
Abstract: The building thermodynamics model, which predicts real-time indoor temperature changes under potential HVAC (Heating, Ventilation, and Air Conditioning) control operations, is crucial for optimizing HVAC control in buildings. While pioneering studies have attempted to develop such models for various building environments, these models often require extensive data collection periods and rely heavily on expert knowledge, making the modeling process inefficient and limiting the reusability of the models. This paper explores a model ensemble perspective that utilizes existing developed models as base models to serve a target building environment, thereby providing accurate predictions while reducing the associated efforts. Given that building data streams are non-stationary and the number of base models may increase, we propose a Hierarchical Reinforcement Learning (HRL) approach to dynamically select and weight the base models. Our approach employs a two-tiered decision-making process: the high-level focuses on model selection, while the low-level determines the weights of the selected models. We thoroughly evaluate the proposed approach through offline experiments and an on-site case study, and the experimental results demonstrate the effectiveness of our method.
Authors: Soham Bonnerjee, Sayar Karmakar, Wei Biao Wu
Abstract: Federated Learning has gained traction in privacy-sensitive collaborative environments, with local SGD emerging as a key optimization method in decentralized settings. While its convergence properties are well-studied, asymptotic statistical guarantees beyond convergence remain limited. In this paper, we present two generalized Gaussian approximation results for local SGD and explore their implications. First, we prove a Berry-Esseen theorem for the final local SGD iterates, enabling valid multiplier bootstrap procedures. Second, motivated by robustness considerations, we introduce two distinct time-uniform Gaussian approximations for the entire trajectory of local SGD. The time-uniform approximations support Gaussian bootstrap-based tests for detecting adversarial attacks. Extensive simulations are provided to support our theoretical results.
Authors: Yi Zhang, Elynn Chen, Yujun Yan
Abstract: We study contextual dynamic pricing when a target market can leverage K auxiliary markets -- offline logs or concurrent streams -- whose mean utilities differ by a structured preference shift. We propose Cross-Market Transfer Dynamic Pricing (CM-TDP), the first algorithm that provably handles such model-shift transfer and delivers minimax-optimal regret for both linear and non-parametric utility models. For linear utilities of dimension d, where the difference between source- and target-task coefficients is $s_{0}$-sparse, CM-TDP attains regret $\tilde{O}((d*K^{-1}+s_{0})\log T)$. For nonlinear demand residing in a reproducing kernel Hilbert space with effective dimension $\alpha$, complexity $\beta$ and task-similarity parameter $H$, the regret becomes $\tilde{O}\!(K^{-2\alpha\beta/(2\alpha\beta+1)}T^{1/(2\alpha\beta+1)} + H^{2/(2\alpha+1)}T^{1/(2\alpha+1)})$, matching information-theoretic lower bounds up to logarithmic factors. The RKHS bound is the first of its kind for transfer pricing and is of independent interest. Extensive simulations show up to 50% lower cumulative regret and 5 times faster learning relative to single-market pricing baselines. By bridging transfer learning, robust aggregation, and revenue optimization, CM-TDP moves toward pricing systems that transfer faster, price smarter.
Authors: Daniel J. Korchinski, Dhruva Karkada, Yasaman Bahri, Matthieu Wyart
Abstract: Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/P(i)P(j)$, (ii) strengthens and then saturates as more eigenvectors of $M (i, j)$, which controls the dimension of the embeddings, are included, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
Authors: Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
Abstract: Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.
URLs: https://github.com/roboflow/rf100-vl, https://universe.roboflow.com/rf100-vl/.
Authors: Binh Duc Vu, Jan Kapar, Marvin Wright, David S. Watson
Abstract: We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble's constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.
Authors: Josef Dick, Seungchan Ko, Quoc Thong Le Gia, Kassem Mustapha, Sanghyeon Park
Abstract: Due to divergence instability, the accuracy of low-order conforming finite element methods for nearly incompressible elasticity equations deteriorates as the Lam\'e coefficient $\lambda\to\infty$, or equivalently as the Poisson ratio $\nu\to1/2$. This phenomenon, known as locking or non-robustness, remains not fully understood despite extensive investigation. In this work, we illustrate first that an analogous instability arises when applying the popular Physics-Informed Neural Networks (PINNs) to nearly incompressible elasticity problems, leading to significant loss of accuracy and convergence difficulties. Then, to overcome this challenge, we propose a robust decomposition-based PINN framework that reformulates the elasticity equations into balanced subsystems, thereby eliminating the ill-conditioning that causes locking. Our approach simultaneously solves the forward and inverse problems to recover both the decomposed field variables and the associated external conditions. We will also perform a convergence analysis to further enhance the reliability of the proposed approach. Moreover, through various numerical experiments, including constant, variable and parametric Lam\'e coefficients, we illustrate the efficiency of the proposed methodology.
Authors: Yi Ding, Ruqi Zhang
Abstract: Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
Authors: Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
Abstract: Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.
Authors: Yuanzhe Liu, Ryan Deng, Tim Kaler, Xuhao Chen, Charles E. Leiserson, Yao Ma, Jie Chen
Abstract: Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their varied performance occur in several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no one dominates others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other's successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation--banking--selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.
Authors: Zijie Xu, Tong Bu, Zecheng Hao, Jianhao Ding, Zhaofei Yu
Abstract: Spiking Neural Networks (SNNs) offer low-latency and energy-efficient decision making on neuromorphic hardware, making them attractive for Reinforcement Learning (RL) in resource-constrained edge devices. However, most RL algorithms for continuous control are designed for Artificial Neural Networks (ANNs), particularly the target network soft update mechanism, which conflicts with the discrete and non-differentiable dynamics of spiking neurons. We show that this mismatch destabilizes SNN training and degrades performance. To bridge the gap between discrete SNNs and continuous-control algorithms, we propose a novel proxy target framework. The proxy network introduces continuous and differentiable dynamics that enable smooth target updates, stabilizing the learning process. Since the proxy operates only during training, the deployed SNN remains fully energy-efficient with no additional inference overhead. Extensive experiments on continuous control benchmarks demonstrate that our framework consistently improves stability and achieves up to $32\%$ higher performance across various spiking neuron models. Notably, to the best of our knowledge, this is the first approach that enables SNNs with simple Leaky Integrate and Fire (LIF) neurons to surpass their ANN counterparts in continuous control. This work highlights the importance of SNN-tailored RL algorithms and paves the way for neuromorphic agents that combine high performance with low power consumption. Code is available at https://github.com/xuzijie32/Proxy-Target.
Authors: Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, Branislav Kveton
Abstract: LLM-as-a-judge is a framework where a large language model (LLM) evaluates the output of another LLM. While LLMs excel at producing qualitative textual evaluations, they often struggle to predict human preferences and numeric scores. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain using regression models. The models are trained to improve the score of the original judge using its rationale and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in practice. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
Authors: Haifeng Sun, Yu Xiong, Runze Wu, Kai Wang, Lan Zhang, Changjie Fan, Shaojie Tang, Xiang-Yang Li
Abstract: The burgeoning growth of the esports and multiplayer online gaming community has highlighted the critical importance of evaluating the Most Valuable Player (MVP). The establishment of an explainable and practical MVP evaluation method is very challenging. In our study, we specifically focus on play-by-play data, which records related events during the game, such as assists and points. We aim to address the challenges by introducing a new MVP evaluation framework, denoted as \oursys, which leverages Shapley values. This approach encompasses feature processing, win-loss model training, Shapley value allocation, and MVP ranking determination based on players' contributions. Additionally, we optimize our algorithm to align with expert voting results from the perspective of causality. Finally, we substantiated the efficacy of our method through validation using the NBA dataset and the Dunk City Dynasty dataset and implemented online deployment in the industry.
Authors: Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai
Abstract: We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on the MiniF2F benchmark, establishing a new state-of-the-art among methods using small language models (SLMs) with a much lower sample budget than previous approaches. We also present theoretical analyses and case studies that illustrate how these generated lemmas contribute to solving challenging problems. Our code is publicly available at: https://github.com/kAIto47802/Prover-Agent.
Authors: Brian Liu, Rahul Mazumder, Peter Radchenko
Abstract: Tree ensembles are non-parametric methods widely recognized for their accuracy and ability to capture complex interactions. While these models excel at prediction, they are difficult to interpret and may fail to uncover useful relationships in the data. We propose an estimator to extract compact sets of decision rules from tree ensembles. The extracted models are accurate and can be manually examined to reveal relationships between the predictors and the response. A key novelty of our estimator is the flexibility to jointly control the number of rules extracted and the interaction depth of each rule, which improves accuracy. We develop a tailored exact algorithm to efficiently solve optimization problems underlying our estimator and an approximate algorithm for computing regularization paths, sequences of solutions that correspond to varying model sizes. We also establish novel non-asymptotic prediction error bounds for our proposed approach, comparing it to an oracle that chooses the best data-dependent linear combination of the rules in the ensemble subject to the same complexity constraint as our estimator. The bounds illustrate that the large-sample predictive performance of our estimator is on par with that of the oracle. Through experiments, we demonstrate that our estimator outperforms existing algorithms for rule extraction.
Authors: Kay Giesecke, Enguerrand Horel, Chartsiri Jirachotkulthorn
Abstract: Machine learning has become a central tool across scientific, industrial, and policy domains. Algorithms now identify chemical properties, forecast disease risk, screen borrowers, and guide public interventions. Yet this predictive power often comes at the cost of transparency: we rarely know which input features truly drive a model's predictions. Without such understanding, researchers cannot draw reliable scientific conclusions, practitioners cannot ensure fairness or accountability, and policy makers cannot trust or govern model-based decisions. Despite its importance, existing tools for assessing feature influence are limited -- most lack statistical guarantees, and many require costly retraining or surrogate modeling, making them impractical for large modern models. We introduce AICO, a broadly applicable framework that turns model interpretability into an efficient statistical exercise. AICO asks, for any trained regression or classification model, whether each feature genuinely improves model performance. It does so by masking the feature's information and measuring the resulting change in performance. The method delivers exact, finite-sample inference -- exact feature p-values and confidence intervals -- without any retraining, surrogate modeling, or distributional assumptions, making it feasible for today's large-scale algorithms. In both controlled experiments and real applications -- from credit scoring to mortgage-behavior prediction -- AICO consistently pinpoints the variables that drive model behavior, providing a fast and reliable path toward transparent and trustworthy machine learning.
Authors: Saransh Gupta, Umesh Deshpande, Travis Janssen, Swami Sundararaman
Abstract: Parameter-efficient fine-tuning (PEFT) allows model builders to capture the task-specific parameters into adapters, which are a fraction of the size of the original base model. Popularity of PEFT technique for fine-tuning has led to the creation of a large number of adapters for popular Large Language Models (LLMs). However, existing frameworks fall short in supporting inference or fine-tuning with multiple adapters in the following ways. 1) For fine-tuning, each job needs to deploy its dedicated base model instance, which results in excessive GPU memory consumption and poor GPU utilization. 2) While popular inference platforms can serve multiple PEFT adapters, they do not allow independent resource management or mixing of different PEFT methods. 3) They cannot make effective use of heterogeneous accelerators. 4) They do not provide privacy to users who may not wish to expose their fine-tuned parameters to service providers. In Symbiosis, we address the above problems by enabling the as-a-service deployment of the base model. The base model layers can be shared across multiple inference or fine-tuning processes. Our split-execution technique decouples the execution of client-specific adapters and layers from the frozen base model layers offering them flexibility to manage their resources, to select their fine-tuning method, to achieve their performance goals. Our approach is transparent to models and works out-of-the-box for most models in the transformers library. We demonstrate the use of Symbiosis to simultaneously fine-tune 20 Gemma2-27B adapters on 8 GPUs.
Authors: Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon
Abstract: The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative despite inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been explored for domain adaptation, its utility in commercial settings remains under-examined. In this study, we validate the effectiveness of a DACP-based recipe across diverse foundation models and service domains, producing DACP-applied sLLMs (ixi-GEN). Through extensive experiments and real-world evaluations, we demonstrate that ixi-GEN models achieve substantial gains in target-domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.
Authors: Zhehui Huang, Guangyao Shi, Yuwei Wu, Vijay Kumar, Gaurav S. Sukhatme
Abstract: Multi-robot coordination has traditionally relied on a mission-specific and expert-driven pipeline, where natural language mission descriptions are manually translated by domain experts into mathematical formulation, algorithm design, and executable code. This conventional process is labor-intensive, inaccessible to non-experts, and inflexible to changes in mission requirements. Here, we propose LAN2CB (Language to Collective Behavior), a novel framework that leverages large language models (LLMs) to streamline and generalize the multi-robot coordination pipeline. LAN2CB transforms natural language (NL) mission descriptions into executable Python code for multi-robot systems through two core modules: (1) Mission Analysis, which parses mission descriptions into behavior trees, and (2) Code Generation, which leverages the behavior tree and a structured knowledge base to generate robot control code. We further introduce a dataset of natural language mission descriptions to support development and benchmarking. Experiments in both simulation and real-world environments demonstrate that LAN2CB enables robust and flexible multi-robot coordination from natural language, significantly reducing manual engineering effort and supporting broad generalization across diverse mission types. Website: https://sites.google.com/view/lan-cb
Authors: Guangyi He, Tobias Sutter, Lukas Gonon
Abstract: In this paper, we study the robustness of classical deep hedging strategies under distributional shifts by leveraging the concept of adversarial attacks. We first demonstrate that standard deep hedging models are highly vulnerable to small perturbations in the input distribution, resulting in significant performance degradation. Motivated by this, we propose an adversarial training framework tailored to increase the robustness of deep hedging strategies. Our approach extends pointwise adversarial attacks to the distributional setting and introduces a computationally tractable reformulation of the adversarial optimization problem over a Wasserstein ball. This enables the efficient training of hedging strategies that are resilient to distributional perturbations. Through extensive numerical experiments, we show that adversarially trained deep hedging strategies consistently outperform their classical counterparts in terms of out-of-sample performance and resilience to model misspecification. Additional results indicate that the robust strategies maintain reliable performance on real market data and remain effective during periods of market change. Our findings establish a practical and effective framework for robust deep hedging under realistic market uncertainties.
Authors: Baiyuan Chen, Shinji Ito, Masaaki Imaizumi
Abstract: Transformers have demonstrated exceptional performance across a wide range of domains. While their ability to perform reinforcement learning in-context has been established both theoretically and empirically, their behavior in non-stationary environments remains less understood. In this study, we address this gap by showing that transformers can achieve nearly optimal dynamic regret bounds in non-stationary settings. We prove that transformers are capable of approximating strategies used to handle non-stationary environments and can learn the approximator in the in-context learning setup. Our experiments further show that transformers can match or even outperform existing expert algorithms in such environments.
Authors: Tinglong Deng, Hang Tao, Xinxiang Wang, Yinyan Wang, Hanjiang Luo
Abstract: As underwater human activities are increasing, the demand for underwater communication service presents a significant challenge. Existing underwater diver communication methods face hurdles due to inherent disadvantages and complex underwater environments. To address this issue, we propose a scheme that utilizes maritime unmanned systems to assist divers with reliable and high-speed communication. Multiple AUVs are equipped with optical and acoustic multimodal communication devices as relay nodes, providing adaptive communication services based on changes in the diver's activity area. By using a multi-agent reinforcement learning (MARL) approach to control the cooperative movement of AUVs, high-speed and reliable data transmission between divers can be achieved. At the same time, utilizing the advantages of on-demand deployment and wide coverage of unmanned surface vehicles (USVs) as surface relay nodes to coordinate and forward information from AUVs, and controlling AUVs to adaptively select relay USV nodes for data transmission, high-quality communication between divers and surface platform can be achieved. Through simulation verification, the proposed scheme can effectively achieve reliable and high-speed communication for divers.
Authors: Shengbo Wang, Nian Si
Abstract: Learning and optimal control under robust Markov decision processes (MDPs) have received increasing attention, yet most existing theory, algorithms, and applications focus on finite-horizon or discounted models. Long-run average-reward formulations, while natural in many operations research and management contexts, remain underexplored. This is primarily because the dynamic programming foundations are technically challenging and only partially understood, with several fundamental questions remaining open. This paper steps toward a general framework for average-reward robust MDPs by analyzing the constant-gain setting. We study the average-reward robust control problem with possible information asymmetries between the controller and an S-rectangular adversary. Our analysis centers on the constant-gain robust Bellman equation, examining both the existence of solutions and their relationship to the optimal average reward. Specifically, we identify when solutions to the robust Bellman equation characterize the optimal average reward and stationary policies, and we provide one-sided weak communication conditions ensuring solutions' existence. These findings expand the dynamic programming theory for average-reward robust MDPs and lay a foundation for robust dynamic decision making under long-run average criteria in operational environments.
Authors: Karl Zhu, Dimitris Bertsimas
Abstract: Adaptive robust optimization (ARO) extends static robust optimization by allowing decisions to depend on the realized uncertainty - weakly dominating static solutions within the modeled uncertainty set. However, ARO makes previous constraints that were independent of uncertainty now dependent, making it vulnerable to additional infeasibilities when realizations fall outside the uncertainty set. This phenomenon of adaptive policies being brittle is analogous to overfitting in machine learning. To mitigate against this, we propose assigning constraint-specific uncertainty set sizes, with harder constraints given stronger probabilistic guarantees. Interpreted through the overfitting lens, this acts as regularization: tighter guarantees shrink adaptive coefficients to ensure stability, while looser ones preserve useful flexibility. This view motivates a principled approach to designing uncertainty sets that balances robustness and adaptivity.
Authors: Abdou Karim Kandji, Fr\'ed\'eric Precioso, Cheikh Ba, Samba Ndiaye, Augustin Ndione
Abstract: Intent classification models have made a significant progress in recent years. However, previous studies primarily focus on high-resource language datasets, which results in a gap for low-resource languages and for regions with high rates of illiteracy, where languages are more spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90\% of the population, while the national illiteracy rate remains at of 42\%. Wolof is actually spoken by more than 10 million people in West African region. To address these limitations, we introduce the Wolof Banking Speech Intent Classification Dataset (WolBanking77), for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. Experiments on various baselines are conducted in this work, including text and voice state-of-the-art models. The results are very promising on this current dataset. In addition, this paper presents an in-depth examination of the dataset's contents. We report baseline F1-scores and word error rates metrics respectively on NLP and ASR models trained on WolBanking77 dataset and also comparisons between models. Dataset and code available at: \href{https://github.com/abdoukarim/wolbanking77}{wolbanking77}.
Authors: Eloy Mosig, Andrea Agazzi, Dario Trevisan
Abstract: In this paper, we study the quantitative convergence of shallow neural networks trained via gradient descent to their associated Gaussian processes in the infinite-width limit. While previous work has established qualitative convergence under broad settings, precise, finite-width estimates remain limited, particularly during training. We provide explicit upper bounds on the quadratic Wasserstein distance between the network output and its Gaussian approximation at any training time $t \ge 0$, demonstrating polynomial decay with network width. Our results quantify how architectural parameters, such as width and input dimension, influence convergence, and how training dynamics affect the approximation error.
Authors: Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin
Abstract: Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
Authors: Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li
Abstract: Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.
Authors: Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong
Abstract: Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.
Authors: Theodore Jerome Tinker, Kenji Doya, Jun Tani
Abstract: Human infants acquire language and action gradually through development, achieving remarkable generalization capabilities from only a minimal number of learning examples. In contrast, recent large language models require exposure to billions of training tokens to achieve such generalization. What mechanisms underlie such efficient developmental learning in humans? This study addresses this question through simulation experiments in which robots learn to perform various actions corresponding to imperative sentences (e.g., \textit{push red cube}) via trials of self-guided exploration. Our approach integrates the active inference framework with reinforcement learning, enabling curiosity-driven developmental learning. The simulations yielded several important findings: i) Generalization is drastically improved as the number of compositional elements increases. ii) Curiosity-driven exploration combined with motor noise substantially outperforms learning without curiosity. iii) Rote pairing of sentences and actions occurs before the emergence of compositional generalization. iv) Simpler, prerequisite-like actions emerge earlier in development, while more complex actions involving these prerequisites develop later. These results shed light into possible mechanisms underlying efficient developmental learning in infants and provide computational parallels to findings in developmental psychology.
Authors: Jahidul Arafat, Fariha Tasmin, Sanjaya Poudel, Iftekhar Haider
Abstract: Medical AI faces challenges in privacy-preserving collaborative learning while ensuring fairness across heterogeneous healthcare institutions. Current federated learning approaches suffer from static architectures, slow convergence (45-73 rounds), fairness gaps marginalizing smaller institutions, and scalability constraints (15-client limit). We propose Adaptive Fair Federated Learning (AFFL) through three innovations: (1) Adaptive Knowledge Messengers dynamically scaling capacity based on heterogeneity and task complexity, (2) Fairness-Aware Distillation using influence-weighted aggregation, and (3) Curriculum-Guided Acceleration reducing rounds by 60-70%. Our theoretical analysis provides convergence guarantees with epsilon-fairness bounds, achieving O(T^{-1/2}) + O(H_max/T^{3/4}) rates. Projected results show 55-75% communication reduction, 56-68% fairness improvement, 34-46% energy savings, and 100+ institution support. The framework enables multi-modal integration across imaging, genomics, EHR, and sensor data while maintaining HIPAA/GDPR compliance. We propose MedFedBench benchmark suite for standardized evaluation across six healthcare dimensions: convergence efficiency, institutional fairness, privacy preservation, multi-modal integration, scalability, and clinical deployment readiness. Economic projections indicate 400-800% ROI for rural hospitals and 15-25% performance gains for academic centers. This work presents a seven-question research agenda, 24-month implementation roadmap, and pathways toward democratizing healthcare AI.
Authors: Imranur Rahman, Jill Marley, William Enck, Laurie Williams
Abstract: Developers consistently use version constraints to specify acceptable versions of the dependencies for their project. \emph{Pinning} dependencies can reduce the likelihood of breaking changes, but comes with a cost of manually managing the replacement of outdated and vulnerable dependencies. On the other hand, \emph{floating} can be used to automatically get bug fixes and security fixes, but comes with the risk of breaking changes. Security practitioners advocate \emph{pinning} dependencies to prevent against software supply chain attacks, e.g., malicious package updates. However, since \emph{pinning} is the tightest version constraint, \emph{pinning} is the most likely to result in outdated dependencies. Nevertheless, how the likelihood of becoming outdated or vulnerable dependencies changes across version constraint types is unknown. The goal of this study is to aid developers in making an informed dependency version constraint choice by empirically evaluating the likelihood of dependencies becoming outdated or vulnerable across version constraint types at scale. In this study, we first identify the trends in dependency version constraint usage and the patterns of version constraint type changes made by developers in the npm, PyPI, and Cargo ecosystems. We then modeled the dependency state transitions using survival analysis and estimated how the likelihood of becoming outdated or vulnerable changes when using \emph{pinning} as opposed to the rest of the version constraint types. We observe that among outdated and vulnerable dependencies, the most commonly used version constraint type is \emph{floating-minor}, with \emph{pinning} being the next most common. We also find that \emph{floating-major} is the least likely to result in outdated and \emph{floating-minor} is the least likely to result in vulnerable dependencies.
Authors: Siqi Zhu, David Zhang, Pedro Cisneros-Velarde, Jiaxuan You
Abstract: Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a mutual welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM's response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and mutual welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .
Authors: Simin Li, Zihao Mao, Hanxiao Li, Zonglei Jing, Zhuohang bian, Jun Guo, Li Wang, Zhuoran Han, Ruixiao Xu, Xin Yu, Chengdong Ma, Yuqing Ma, Bo An, Yaodong Yang, Weifeng Lv, Xianglong Liu
Abstract: In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions--a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also varies by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones. Code and results available at https://github.com/BUAA-TrustworthyMARL/adv_marl_benchmark .
URLs: https://github.com/BUAA-TrustworthyMARL/adv_marl_benchmark
Authors: Numan Zafar, Priyo Ranjan Kundu Prosun, Shafique Ahmad Chaudhry
Abstract: As virtual reality (VR) devices become increasingly integrated into everyday settings, a growing number of users without prior experience will engage with VR systems. Automatically detecting a user's familiarity with VR as an interaction medium enables real-time, adaptive training and interface adjustments, minimizing user frustration and improving task performance. In this study, we explore the automatic detection of VR familiarity by analyzing hand movement patterns during a passcode-based door-opening task, which is a well-known interaction in collaborative virtual environments such as meeting rooms, offices, and healthcare spaces. While novice users may lack prior VR experience, they are likely to be familiar with analogous real-world tasks involving keypad entry. We conducted a pilot study with 26 participants, evenly split between experienced and inexperienced VR users, who performed tasks using both controller-based and hand-tracking interactions. Our approach uses state-of-the-art deep classifiers for automatic VR familiarity detection, achieving the highest accuracies of 92.05% and 83.42% for hand-tracking and controller-based interactions, respectively. In the cross-device evaluation, where classifiers trained on controller data were tested using hand-tracking data, the model achieved an accuracy of 78.89%. The integration of both modalities in the mixed-device evaluation obtained an accuracy of 94.19%. Our results underline the promise of using hand movement biometrics for the real-time detection of user familiarity in critical VR applications, paving the way for personalized and adaptive VR experiences.
Authors: Volker Tresp, Hang Li, Federico Harjes, Yunpu Ma
Abstract: Although quantum systems are generally described by quantum state vectors, we show that in certain cases their measurement processes can be reformulated as probabilistic equations expressed in terms of probabilistic state vectors. These probabilistic representations can, in turn, be approximated by the neural network dynamics of the Tensor Brain (TB) model. The Tensor Brain is a recently proposed framework for modeling perception and memory in the brain, providing a biologically inspired mechanism for efficiently integrating generated symbolic representations into reasoning processes.
Authors: Zehao Ni, Yonghao He, Lingfeng Qian, Jilei Mao, Fa Fu, Wei Sui, Hu Su, Junran Peng, Zhipeng Wang, Bin He
Abstract: In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point clouds feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6% on par with DP3 64.0% and far higher than DP 34.8%, while in real-world tasks, it reaches 87.9%, outperforming both DP3 67.5% and DP 11.2% by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
Authors: Xinyue Xu (Daisy), Jieqiang Sun (Daisy), Jing (Daisy), Dai, Siyuan Chen, Lanjie Ma, Ke Sun, Bin Zhao, Jianbo Yuan, Sheng Yi, Haohua Zhu, Yiwen Lu
Abstract: We present DexCanvas, a large-scale hybrid real-synthetic human manipulation dataset containing 7,000 hours of dexterous hand-object interactions seeded from 70 hours of real human demonstrations, organized across 21 fundamental manipulation types based on the Cutkosky taxonomy. Each entry combines synchronized multi-view RGB-D, high-precision mocap with MANO hand parameters, and per-frame contact points with physically consistent force profiles. Our real-to-sim pipeline uses reinforcement learning to train policies that control an actuated MANO hand in physics simulation, reproducing human demonstrations while discovering the underlying contact forces that generate the observed object motion. DexCanvas is the first manipulation dataset to combine large-scale real demonstrations, systematic skill coverage based on established taxonomies, and physics-validated contact annotations. The dataset can facilitate research in robotic manipulation learning, contact-rich control, and skill transfer across different hand morphologies.
Authors: Jahidul Arafat, Sanjaya Poudel
Abstract: Nanopore sequencing enables real-time long-read DNA sequencing with reads exceeding 10 kilobases, but inherent error rates of 12-15 percent present significant computational challenges for read alignment. The critical seed chaining step must connect exact k-mer matches between reads and reference genomes while filtering spurious matches, yet state-of-the-art methods rely on fixed gap penalty functions unable to adapt to varying genomic contexts including tandem repeats and structural variants. This paper presents RawHash3, a hybrid framework combining graph neural networks with classical dynamic programming for adaptive seed chaining that maintains real-time performance while providing statistical guarantees. We formalize seed chaining as graph learning where seeds constitute nodes with 12-dimensional feature vectors and edges encode 8-dimensional spatial relationships including gap consistency. Our architecture employs three-layer EdgeConv GNN with confidence-based method selection that dynamically switches between learned guidance and algorithmic fallback. Comprehensive evaluation on 1,000 synthetic nanopore reads with 5,200 test seeds demonstrates RawHash3 achieves 99.94 percent precision and 40.07 percent recall, representing statistically significant 25.0 percent relative improvement over baseline with p less than 0.001. The system maintains median inference latency of 1.59ms meeting real-time constraints, while demonstrating superior robustness with 100 percent success rate under 20 percent label corruption versus baseline degradation to 30.3 percent. Cross-validation confirms stability establishing graph neural networks as viable approach for production genomics pipelines.
Authors: Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng
Abstract: Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.
Authors: Florent Foucaud, Harmender Gahlawat, Fionn Mc Inerney, Prafullkumar Tale
Abstract: The VC-dimension is a well-studied and fundamental complexity measure of a set system (or hypergraph) that is central to many areas of machine learning. We establish several new results on the complexity of computing the VC-dimension. In particular, given a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$, we prove that the naive $2^{\mathcal{O}(|\mathcal{V}|)}$-time algorithm is asymptotically tight under the Exponential Time Hypothesis (ETH). We then prove that the problem admits a $1$-additive fixed-parameter approximation algorithm when parameterized by the maximum degree of $\mathcal{H}$ and a fixed-parameter algorithm when parameterized by its dimension, and that these are essentially the only such exploitable structural parameters. Lastly, we consider a generalization of the problem, formulated using graphs, which captures the VC-dimension of both set systems and graphs. We design a $2^{\mathcal{O}(\rm{tw}\cdot \log \rm{tw})}\cdot |V|$-time algorithm for any graph $G=(V,E)$ of treewidth $\rm{tw}$ (which, for a set system, applies to the treewidth of its incidence graph). This is in contrast with closely related problems that require a double-exponential dependency on the treewidth (assuming the ETH).
Authors: Paula Cordero-Encinar, Andrew B. Duncan
Abstract: Recent advances such as self-consistency and test-time reinforcement learning (TTRL) improve the reliability of large language models (LLMs) without additional supervision, yet their underlying mechanisms and statistical guarantees remain poorly understood. We present a unified framework for certifiable inference in LLMs, showing that majority voting provides a statistical certificate of self-consistency: under mild assumptions, the aggregated answer coincides with the mode of the model's terminal distribution with high probability. We derive finite-sample and anytime-valid concentration bounds that quantify this confidence, and introduce the Martingale Majority Certificate (MMC), a sequential stopping rule that adaptively determines when sufficient samples have been drawn. We further prove that label-free post-training methods such as TTRL implicitly sharpen the answer distribution by exponentially tilting it toward its mode, thereby reducing the number of samples required for certification. Building on this insight, we propose new post-training objectives that explicitly optimise this trade-off between sharpness and bias. Together, these results explain and connect two central test-time scaling strategies, self-consistency and TTRL, within a single statistical framework for label-free, certifiable reliability in reasoning LLMs.
Authors: Anjie Liu, Jianhong Wang, Samuel Kaski, Jun Wang, Mengyue Yang
Abstract: Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when the global guidance from a human on the whole multi-agent system is impractical in a large-scale MARL. On the other hand, designing external mechanisms (e.g., intrinsic rewards and human feedback) to coordinate agents mostly relies on empirical studies, lacking a easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce the concept of MARL interaction paradigms, using MAIDs to analyze and visualize both unguided self-organization and global guidance mechanisms in MARL. Then, we design a new MARL interaction paradigm, referred to as the targeted intervention paradigm that is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In our implementation, we introduce a causal inference technique, referred to as Pre-Strategy Intervention (PSI), to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an MARL interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention, and verify the result of relevance graph analysis.
Authors: Hamsa Bastani, Osbert Bastani, Bryce McLaughlin
Abstract: There has been a surge of recent interest in automatically learning policies to target treatment decisions based on rich individual covariates. A common approach is to train a machine learning model to predict counterfactual outcomes, and then select the policy that optimizes the predicted objective value. In addition, practitioners also want confidence that the learned policy has better performance than the incumbent policy according to downstream policy evaluation. However, due to the winner's curse-an issue where the policy optimization procedure exploits prediction errors rather than finding actual improvements-predicted performance improvements are often not substantiated by downstream policy optimization. To address this challenge, we propose a novel strategy called inference-aware policy optimization, which modifies policy optimization to account for how the policy will be evaluated downstream. Specifically, it optimizes not only for the estimated objective value, but also for the chances that the policy will be statistically significantly better than the observational policy used to collect data. We mathematically characterize the Pareto frontier of policies according to the tradeoff of these two goals. Based on our characterization, we design a policy optimization algorithm that uses machine learning to predict counterfactual outcomes, and then plugs in these predictions to estimate the Pareto frontier; then, the decision-maker can select the policy that optimizes their desired tradeoff, after which policy evaluation can be performed on the test set as usual. Finally, we perform simulations to illustrate the effectiveness of our methodology.
Authors: Sion Weatherhead, Flora Salim, Aaron Belbasis
Abstract: Humans do not just find mistakes after the fact -- we often catch them mid-stream because 'reflection' is tied to the goal and its constraints. Today's large language models produce reasoning tokens and 'reflective' text, but is it functionally equivalent with human reflective reasoning? Prior work on closed-ended tasks -- with clear, external 'correctness' signals -- can make 'reflection' look effective while masking limits in self-correction. We therefore test eight frontier models on a simple, real-world task that is open-ended yet rule-constrained, with auditable success criteria: to produce valid scientific test items, then revise after considering their own critique. First-pass performance is poor (often zero valid items out of 4 required; mean $\approx$ 1), and reflection yields only modest gains (also $\approx$ 1). Crucially, the second attempt frequently repeats the same violation of constraint, indicating 'corrective gains' arise largely from chance production of a valid item rather than error detection and principled, constraint-sensitive repair. Performance before and after reflection deteriorates as open-endedness increases, and models marketed for 'reasoning' show no advantage. Our results suggest that current LLM 'reflection' lacks functional evidence of the active, goal-driven monitoring that helps humans respect constraints even on a first pass. Until such mechanisms are instantiated in the model itself, reliable performance requires external structure that enforces constraints. Our code is available at: https://github.com/cruiseresearchgroup/LLM_ReflectionTest
URLs: https://github.com/cruiseresearchgroup/LLM_ReflectionTest
Authors: Pascal Bergstr\"a{\ss}er, Ryan Cotterell, Anthony W. Lin
Abstract: We propose succinctness as a measure of the expressive power of a transformer in describing a concept. To this end, we prove that transformers are highly expressive in that they can represent formal languages substantially more succinctly than standard representations of formal languages like finite automata and Linear Temporal Logic (LTL) formulas. As a by-product of this expressivity, we show that verifying properties of transformers is provably intractable (i.e. EXPSPACE-complete).
Authors: M. H. I. Abdalla, Zhipin Wang, Christian Frey, Steffen Eger, Josif Grabocka
Abstract: Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.
Authors: Archana Warrier, Dat Nguyen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B. Tenenbaum, Sebastian Vollmer, Kevin Ellis, Zenna Tavares
Abstract: Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended$\unicode{x2014}$models should support many different tasks unknown ahead of time$\unicode{x2014}$and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and scaling compute improves performance only in some environments but not others. WorldTest provides a novel template$\unicode{x2014}$reward-free exploration, derived tests, and behavior-based scoring$\unicode{x2014}$to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.