Authors: Jiazheng Zhang, Wenqing Jing, Zizhuo Zhang, Zhiheng Xi, Shihan Dou, Rongxiang Weng, Jiahuan Li, Jingang Wang, MingXu Cai, Shibo Hong, Tao Gui, Qi Zhang
Abstract: Reward models (RMs) are essential for aligning large language models (LLMs) with human values. However, noisy preferences in human feedback often lead to reward misgeneralization, where RMs overfit to spurious patterns and provide misleading signals during policy optimization. We systematically analyze the training dynamics of preference pairs and identify that noisy examples are harder to fit and introduce instability. Empirical evidence shows that LLMs optimized using reward models trained on full noisy datasets perform worse than those trained on filtered, high-quality preferences. To address this, we propose Collaborative Reward Modeling (CRM), an online framework that enhances robustness by combining peer review and curriculum learning. Two reward models are trained in parallel and assess each other's data selections to filter out potential noise. Curriculum learning structures the preference data from easy to hard, ensuring synchronized training and stable feedback. Extensive experiments demonstrate that CRM improves generalization, with up to 9.94 points of accuracy gain on RewardBench under 40 percent label noise. CRM is also compatible with implicit-reward alignment methods, offering a practical and versatile strategy for robust alignment.
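As a concrete illustration of the peer-review step, here is a minimal sketch assuming linear reward models and a Bradley-Terry pairwise loss; the paper's actual reward models are LLM-based and its curriculum schedule is richer than the single round shown (all names, sizes, and thresholds below are illustrative).
```python
# Sketch of one CRM-style peer-review round with two toy linear reward models.
import numpy as np

rng = np.random.default_rng(0)

def pairwise_losses(w, chosen, rejected):
    # Bradley-Terry loss per preference pair under a linear reward r(x) = w @ x.
    margins = chosen @ w - rejected @ w
    return np.log1p(np.exp(-margins))

chosen = rng.normal(size=(1000, 16))    # features of preferred responses (toy)
rejected = rng.normal(size=(1000, 16))  # features of rejected responses (toy)
w_a, w_b = rng.normal(size=16), rng.normal(size=16)  # two peer reward models

loss_a = pairwise_losses(w_a, chosen, rejected)
loss_b = pairwise_losses(w_b, chosen, rejected)

k = 400                                  # curriculum batch size this round (easy first)
easy_a = np.argsort(loss_a)[:k]          # A's proposed "easy" pairs
# B peer-reviews A's picks: keep only pairs B also finds easy, dropping likely noise.
keep = easy_a[loss_b[easy_a] < np.quantile(loss_b, 0.5)]
# `keep` is the denoised batch A trains on this round; the roles then swap for B.
```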
Authors: Jiaxuan Liu, Zhenhua Ling
Abstract: Recent neural codec language models have made great progress in the field of text-to-speech (TTS), but controllable emotional TTS still faces many challenges. Traditional methods rely on predefined discrete emotion labels to control emotion categories and intensities, which cannot capture the complexity and continuity of human emotional perception and expression. The lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained emotion annotations often causes overfitting in synthesis models and impedes effective emotion control. To address these issues, we propose UDDETTS, a neural codec language model unifying discrete and dimensional emotions for controllable emotional TTS. This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values. Furthermore, a semi-supervised training strategy is designed to comprehensively utilize diverse speech datasets with different types of emotion annotations to train UDDETTS. Experiments show that UDDETTS achieves linear emotion control along the three dimensions of ADV space, and exhibits superior end-to-end emotional speech synthesis capabilities.
Authors: Md. Ehsanul Haque, Md. Saymon Hosen Polash, Md Al-Imran Sanjida Simla, Md Alomgir Hossain, Sarwar Jahan
Abstract: The rapid growth of Internet of Things (IoT) networks has sharply increased cyber risk, creating a need for effective intrusion detection systems (IDS) that can handle highly imbalanced datasets. Traditional machine learning models tend to struggle to identify attacks when normal traffic far outweighs attack traffic, which can result in a high rate of missed threats. For example, the dataset used in this study reveals a strong class imbalance, with 94,659 instances of the majority class and only 28 instances of the minority class, making it quite challenging to detect rare attacks accurately. We address these challenges with hybrid sampling techniques designed to improve detection accuracy on imbalanced IoT data. After applying these techniques, we evaluate the performance of several machine learning models such as Random Forest, Soft Voting, Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), and Logistic Regression with respect to the classification of cyber-attacks. The obtained results indicate that the Random Forest model achieved the best performance with a Kappa score of 0.9903, test accuracy of 0.9961, and AUC of 0.9994. Strong performance is also shown by the Soft Voting model, with an accuracy of 0.9952 and AUC of 0.9997, indicating the benefits of combining model predictions. Overall, this work demonstrates the value of hybrid sampling combined with robust model and feature selection for significantly improving IoT security against cyber-attacks, especially in highly imbalanced data environments.
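A minimal sketch of this kind of pipeline, assuming SMOTE oversampling combined with Tomek-link cleaning as the hybrid sampler (the study's exact sampling scheme may differ) and a synthetic stand-in for the imbalanced IoT dataset; requires scikit-learn and imbalanced-learn.
```python
# Hybrid resampling + Random Forest on a toy imbalanced classification task.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTETomek

# Toy stand-in for the highly imbalanced IoT dataset (94,659 vs. 28 instances).
X, y = make_classification(n_samples=20000, weights=[0.999], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Hybrid resampling is applied to the training split only, never to the test set.
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)
proba = clf.predict_proba(X_te)[:, 1]
print("kappa:", cohen_kappa_score(y_te, clf.predict(X_te)))
print("auc:  ", roc_auc_score(y_te, proba))
```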
Authors: Hector Pasten, Felipe Urrutia, Hector Jimenez, Cristian B. Calderon, Cristóbal Rojas, Alexander Kozachinskiy
Abstract: Understanding how Transformers work and how they process information is key to the theoretical and empirical advancement of these machines. In this work, we demonstrate the existence of two phenomena in Transformers, namely isolation and continuity. Both of these phenomena hinder Transformers from learning even simple pattern sequences. Isolation expresses that any learnable sequence must be isolated from any other learnable sequence, and hence some sequences cannot be learned by a single Transformer at the same time. Continuity entails that an attractor basin forms around a learned sequence, such that any sequence falling into that basin collapses towards the learned sequence. Here, we mathematically prove that these phenomena emerge in all Transformers that use compact positional encoding, and design rigorous experiments demonstrating that the theoretical limitations we identify occur at practical scale.
Authors: Patara Trirat, Jae-Gil Lee
Abstract: The growing use of smartphones and IoT devices necessitates efficient time-series analysis on resource-constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware-aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time-series analysis with edge deployment. Leveraging the problem-solving and reasoning capabilities of large language models (LLMs), we propose MONAQ, a novel framework that reformulates NAS into Multi-Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time-series inputs and hardware constraints, alongside an LLM agent-based multi-objective search to achieve deployment-ready models via code generation. By integrating numerical data, time-series images, and textual descriptions, MONAQ improves an LLM's understanding of time-series data. Experiments on fifteen datasets demonstrate that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.
Authors: Ben Adcock, Nick Huang
Abstract: We study the sample complexity of Bayesian recovery for solving inverse problems with general prior, forward operator and noise distributions. We consider posterior sampling according to an approximate prior $\mathcal{P}$, and establish sufficient conditions for stable and accurate recovery with high probability. Our main result is a non-asymptotic bound that shows that the sample complexity depends on (i) the intrinsic complexity of $\mathcal{P}$, quantified by its so-called approximate covering number, and (ii) concentration bounds for the forward operator and noise distributions. As a key application, we specialize to generative priors, where $\mathcal{P}$ is the pushforward of a latent distribution via a Deep Neural Network (DNN). We show that the sample complexity scales log-linearly with the latent dimension $k$, thus establishing the efficacy of DNN-based priors. Generalizing existing results on deterministic (i.e., non-Bayesian) recovery for the important problem of random sampling with an orthogonal matrix $U$, we show how the sample complexity is determined by the coherence of $U$ with respect to the support of $\mathcal{P}$. Hence, we establish that coherence plays a fundamental role in Bayesian recovery as well. Overall, our framework unifies and extends prior work, providing rigorous guarantees for the sample complexity of solving Bayesian inverse problems with arbitrary distributions.
Authors: Linjing You, Jiabao Lu, Xiayuan Huang, Xiangli Nie
Abstract: Test-Time Adaptation (TTA) aims to enhance the generalization of deep learning models when faced with test data that exhibits distribution shifts from the training data. In this context, only a pre-trained model and unlabeled test data are available, making it particularly relevant for privacy-sensitive applications. In practice, we observe that feature redundancy in embeddings tends to increase as domain shifts intensify in TTA. However, existing TTA methods often overlook this redundancy, which can hinder the model's adaptability to new data. To address this issue, we introduce Feature Redundancy Elimination for Test-time Adaptation (FRET), a novel perspective for TTA. A straightforward approach (S-FRET) is to directly minimize the feature redundancy score as an optimization objective to improve adaptation. Despite its simplicity and effectiveness, S-FRET struggles with label shifts, limiting its robustness in real-world scenarios. To mitigate this limitation, we further propose Graph-based FRET (G-FRET), which integrates a Graph Convolutional Network (GCN) with contrastive learning. This design not only reduces feature redundancy but also enhances feature discriminability in both the representation and prediction layers. Extensive experiments across multiple model architectures, tasks, and datasets demonstrate the effectiveness of S-FRET and show that G-FRET achieves state-of-the-art performance. Further analysis reveals that G-FRET enables the model to extract non-redundant and highly discriminative features during inference, thereby facilitating more robust test-time adaptation.
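A minimal sketch of the S-FRET idea, assuming the redundancy score is the mean absolute off-diagonal feature correlation (the paper's exact score may differ): the score is minimized directly as the test-time objective.
```python
# S-FRET-style test-time adaptation step: minimize feature redundancy.
import torch

def redundancy_score(feats: torch.Tensor) -> torch.Tensor:
    # feats: (batch, dim) embeddings from the pre-trained encoder.
    z = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
    corr = (z.T @ z) / feats.shape[0]               # (dim, dim) correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))  # zero out the diagonal
    return off_diag.abs().mean()                    # mean absolute cross-correlation

def tta_step(encoder, x, optimizer):
    # Typically only lightweight parameters (e.g., norm layers) are updated.
    loss = redundancy_score(encoder(x))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```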
Authors: Haoxiang You, Yilang Liu, Ian Abraham
Abstract: In this work, we propose a computationally efficient algorithm for visual policy learning that leverages differentiable simulation and first-order analytical policy gradients. Our approach decouples the rendering process from the computation graph, enabling seamless integration with existing differentiable simulation ecosystems without the need for specialized differentiable rendering software. This decoupling not only reduces computational and memory overhead but also effectively attenuates the policy gradient norm, leading to more stable and smoother optimization. We evaluate our method on standard visual control benchmarks using modern GPU-accelerated simulation. Experiments show that our approach significantly reduces wall-clock training time and consistently outperforms all baseline methods in terms of final returns. Notably, on complex tasks such as humanoid locomotion, our method achieves a $4\times$ improvement in final return, and successfully learns a humanoid running policy within 4 hours on a single GPU.
Authors: Wei Wang, Weidong Yang, Lei Wang, Guihua Wang, Ruibo Lei
Abstract: The rapid decline of Arctic sea ice resulting from anthropogenic climate change poses significant risks to indigenous communities, ecosystems, and the global climate system. This situation emphasizes the immediate necessity for precise seasonal sea ice forecasts. While dynamical models perform well for short-term forecasts, they encounter limitations in long-term forecasts and are computationally intensive. Deep learning models, while more computationally efficient, often have difficulty managing seasonal variations and uncertainties when dealing with complex sea ice dynamics. In this research, we introduce IceMamba, a deep learning architecture that integrates sophisticated attention mechanisms within the state space model. Through comparative analysis of 25 renowned forecast models, including dynamical, statistical, and deep learning approaches, our experimental results indicate that IceMamba delivers excellent seasonal forecasting capabilities for Pan-Arctic sea ice concentration. Specifically, IceMamba outperforms all tested models regarding average RMSE and anomaly correlation coefficient (ACC) and ranks second in Integrated Ice Edge Error (IIEE). This innovative approach enhances our ability to foresee and alleviate the effects of sea ice variability, offering essential insights for strategies aimed at climate adaptation.
Authors: Ioannis Pitsiorlas, Nour Jamoussi, Marios Kountouris
Abstract: This work introduces a novel methodology for assessing catastrophic forgetting (CF) in continual learning. We propose a new conformal prediction (CP)-based metric, termed the Conformal Prediction Confidence Factor (CPCF), to quantify and evaluate CF effectively. Our framework leverages adaptive CP to estimate forgetting by monitoring the model's confidence on previously learned tasks. This approach provides a dynamic and practical solution for monitoring and measuring CF of previous tasks as new ones are introduced, offering greater suitability for real-world applications. Experimental results on four benchmark datasets demonstrate a strong correlation between CPCF and the accuracy of previous tasks, validating the reliability and interpretability of the proposed metric. Our results highlight the potential of CPCF as a robust and effective tool for assessing and understanding CF in dynamic learning environments.
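To make the idea concrete, here is a minimal split-conformal sketch: calibrate a threshold on task-1 predictions, then track how task-1 prediction sets inflate after training on later tasks. CPCF itself is defined in the paper; the set-size ratio below is an illustrative proxy, and the Dirichlet draws stand in for model softmax outputs.
```python
# Conformal-prediction-based monitoring of forgetting (illustrative proxy).
import numpy as np

def conformal_threshold(probs, labels, alpha=0.1):
    # Split-conformal scores: 1 - softmax probability of the true class.
    scores = 1.0 - probs[np.arange(len(labels)), labels]
    level = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
    return np.quantile(scores, min(level, 1.0))

def avg_set_size(probs, q):
    # Prediction set: every class whose score (1 - prob) falls below q.
    return ((1.0 - probs) <= q).sum(axis=1).mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, 500)
probs_t1_before = rng.dirichlet(np.full(10, 0.3), 500)  # task-1 outputs, pre task-2
probs_t1_after = rng.dirichlet(np.full(10, 0.3), 500)   # task-1 outputs, post task-2

q = conformal_threshold(probs_t1_before, labels)
growth = avg_set_size(probs_t1_after, q) / avg_set_size(probs_t1_before, q)
# growth > 1: task-1 prediction sets inflate, signaling loss of confidence
# on the old task, i.e., catastrophic forgetting.
```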
Authors: Gabriele Santini, Francesco Paissan, Elisabetta Farella
Abstract: We propose a probabilistic framework for dynamic quantization of neural networks that allows for a computationally efficient input-adaptive rescaling of the quantization parameters. Our framework applies a probabilistic model to the network's pre-activations through a lightweight surrogate, enabling the adaptive adjustment of the quantization parameters on a per-input basis without significant memory overhead. We validate our approach on a set of popular computer vision tasks and models, observing only a negligible loss in performance. Our method strikes the best performance and computational overhead tradeoff compared to standard quantization strategies.
Authors: Alexia Atsidakou, Orestis Papadigenopoulos, Constantine Caramanis, Sujay Sanghavi, Sanjay Shakkottai
Abstract: We study the problem of Gaussian bandits with general side information, as first introduced by Wu, Szepesvari, and Gyorgy. In this setting, the play of an arm reveals information about other arms, according to an arbitrary a priori known side information matrix: each element of this matrix encodes the fidelity of the information that the ``row'' arm reveals about the ``column'' arm. In the case of Gaussian noise, this model subsumes standard bandits, full-feedback, and graph-structured feedback as special cases. In this work, we first construct an LP-based asymptotic instance-dependent lower bound on the regret. The LP optimizes the cost (regret) required to reliably estimate the suboptimality gap of each arm. This LP lower bound motivates our main contribution: the first known asymptotically optimal algorithm for this general setting.
Authors: Kutay Bölat, Tarek Alskaif, Peter Palensky, Simon Tindemans
Abstract: As the number of rooftop photovoltaic (PV) installations increases, aggregators and system operators are required to monitor and analyze these systems, raising the challenge of integration and management of large, spatially distributed time-series data that are both high-dimensional and affected by missing values. In this work, a probabilistic entity embedding-based clustering framework is proposed to address these problems. This method encodes each PV system's characteristic power generation patterns and uncertainty as a probability distribution, then groups systems by their statistical distances and agglomerative clustering. Applied to a multi-year residential PV dataset, it produces concise, uncertainty-aware cluster profiles that outperform a physics-based baseline in representativeness and robustness, and support reliable missing-value imputation. A systematic hyperparameter study further offers practical guidance for balancing model performance and robustness.
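A minimal sketch of distribution-based agglomerative clustering, assuming each PV system is summarized as a diagonal Gaussian over its daily profile and compared with the closed-form 2-Wasserstein distance (the paper's embedding and statistical distance may differ).
```python
# Cluster PV systems represented as diagonal Gaussians via hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def w2_diag_gauss(mu1, s1, mu2, s2):
    # Closed-form 2-Wasserstein distance between diagonal Gaussians.
    return np.sqrt(np.sum((mu1 - mu2) ** 2) + np.sum((s1 - s2) ** 2))

rng = np.random.default_rng(0)
mus = rng.normal(size=(50, 24))              # mean daily profile per system (24 h)
sds = rng.uniform(0.1, 1.0, size=(50, 24))   # per-hour uncertainty per system

n = len(mus)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = w2_diag_gauss(mus[i], sds[i], mus[j], sds[j])

# Agglomerative clustering on the statistical distance matrix.
labels = fcluster(linkage(squareform(D), method="average"), t=5, criterion="maxclust")
```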
Authors: Patryk Marszałek, Tomasz Kuśmierczyk, Witold Wydmański, Jacek Tabor, Marek Śmieja
Abstract: Clustering tabular data remains a significant open challenge in data analysis and machine learning. Unlike for image data, similarity between tabular records often varies across datasets, making the definition of clusters highly dataset-dependent. Furthermore, the absence of supervised signals complicates hyperparameter tuning in deep learning clustering methods, frequently resulting in unstable performance. To address these issues and reduce the need for per-dataset tuning, we adopt an emerging approach in deep learning: zero-shot learning. We propose ZEUS, a self-contained model capable of clustering new datasets without any additional training or fine-tuning. It operates by decomposing complex datasets into meaningful components that can then be clustered effectively. Thanks to pre-training on synthetic datasets generated from a latent-variable prior, it generalizes across various datasets without requiring user intervention. To the best of our knowledge, ZEUS is the first zero-shot method capable of generating embeddings for tabular data in a fully unsupervised manner. Experimental results demonstrate that it performs on par with or better than traditional clustering algorithms and recent deep learning-based methods, while being significantly faster and more user-friendly.
Authors: Sebestyén Kamp, Giovanni Stracquadanio, T. Ian Simpson
Abstract: We present GNN-Suite, a robust modular framework for constructing and benchmarking Graph Neural Network (GNN) architectures in computational biology. GNN-Suite standardises experimentation and reproducibility using the Nextflow workflow to evaluate GNN performance. We demonstrate its utility in identifying cancer-driver genes by constructing molecular networks from protein-protein interaction (PPI) data from STRING and BioGRID and annotating nodes with features from the PCAWG, PID, and COSMIC-CGC repositories. Our design enables fair comparisons among diverse GNN architectures including GAT, GAT3H, GCN, GCN2, GIN, GTN, HGCN, PHGCN, and GraphSAGE and a baseline Logistic Regression (LR) model. All GNNs were configured as standardised two-layer models and trained with uniform hyperparameters (dropout = 0.2; Adam optimiser with learning rate = 0.01; and an adjusted binary cross-entropy loss to address class imbalance) over an 80/20 train-test split for 300 epochs. Each model was evaluated over 10 independent runs with different random seeds to yield statistically robust performance metrics, with balanced accuracy (BACC) as the primary measure. Notably, GCN2 achieved the highest BACC (0.807 +/- 0.035) on a STRING-based network, although all GNN types outperformed the LR baseline, highlighting the advantage of network-based learning over feature-only approaches. Our results show that a common framework for implementing and evaluating GNN architectures aids in identifying not only the best model but also the most effective means of incorporating complementary data. By making GNN-Suite publicly available, we aim to foster reproducible research and promote improved benchmarking standards in computational biology. Future work will explore additional omics datasets and further refine network architectures to enhance predictive accuracy and interpretability in biomedical applications.
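For reference, a minimal sketch of the standardized two-layer configuration described above (dropout 0.2, Adam at learning rate 0.01, class-weighted BCE), using a hand-rolled GCN layer; the actual GNN-Suite models and Nextflow pipeline are richer, and the positive-class weight below is illustrative.
```python
# Two-layer GCN with the standardized hyperparameters from the benchmark setup.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x, a_norm):            # a_norm: normalized adjacency (N, N)
        return self.lin(a_norm @ x)          # aggregate neighbors, then project

class TwoLayerGCN(nn.Module):
    def __init__(self, d_in, d_hidden=64):
        super().__init__()
        self.g1, self.g2 = GCNLayer(d_in, d_hidden), GCNLayer(d_hidden, 1)
        self.drop = nn.Dropout(0.2)          # dropout = 0.2, as in the protocol

    def forward(self, x, a_norm):
        h = self.drop(torch.relu(self.g1(x, a_norm)))
        return self.g2(h, a_norm).squeeze(-1)  # per-node (gene) logits

model = TwoLayerGCN(d_in=16)
opt = torch.optim.Adam(model.parameters(), lr=0.01)   # Adam, lr = 0.01
# pos_weight re-balances the rare driver-gene class in the BCE loss (value illustrative).
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(10.0))
```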
Authors: Yihan Zhu, Gang Liu, Eric Inae, Tengfei Luo, Meng Jiang
Abstract: Polymers are large macromolecules composed of repeating structural units known as monomers and are widely applied in fields such as energy storage, construction, medicine, and aerospace. However, existing graph neural network methods, though effective for small molecules, only model the single unit of polymers and fail to produce consistent vector representations for the true polymer structure with varying numbers of units. To address this challenge, we introduce Graph Repetition Invariance (GRIN), a novel method to learn polymer representations that are invariant to the number of repeating units in their graph representations. GRIN integrates a graph-based maximum spanning tree alignment with repeat-unit augmentation to ensure structural consistency. We provide theoretical guarantees for repetition-invariance from both model and data perspectives, demonstrating that three repeating units are the minimal augmentation required for optimal invariant representation learning. GRIN outperforms state-of-the-art baselines on both homopolymer and copolymer benchmarks, learning stable, repetition-invariant representations that generalize effectively to polymer chains of unseen sizes.
Authors: Achmad Ginanjar, Xue Li, Priyanka Singh, Wen Hua
Abstract: Vertical Federated Learning (VFL) has revolutionised collaborative machine learning by enabling privacy-preserving model training across multiple parties. However, it remains vulnerable to information leakage during intermediate computation sharing. While Contrastive Federated Learning (CFL) was introduced to mitigate these privacy concerns through representation learning, it still faces challenges from gradient-based attacks. This paper presents a comprehensive experimental analysis of gradient-based attacks in CFL environments and evaluates random client selection as a defensive strategy. Through extensive experimentation, we demonstrate that random client selection proves particularly effective in defending against gradient attacks in the CFL network. Our findings provide valuable insights for implementing robust security measures in contrastive federated learning systems, contributing to the development of more secure collaborative learning frameworks.
Authors: Conor F. Hayes, Felipe Leno Da Silva, Jiachen Yang, T. Nathan Mundhenk, Chak Shing Lee, Jacob F. Pettit, Claudio Santiago, Sookyung Kim, Joanne T. Kim, Ignacio Aravena Solis, Ruben Glatt, Andre R. Goncalves, Alexander Ladd, Ahmet Can Solak, Thomas Desautels, Daniel Faissol, Brenden K. Petersen, Mikel Landajuela
Abstract: Deep Symbolic Optimization (DSO) is a novel computational framework that enables symbolic optimization for scientific discovery, particularly in applications involving the search for intricate symbolic structures. One notable example is equation discovery, which aims to automatically derive mathematical models expressed in symbolic form. In DSO, the discovery process is formulated as a sequential decision-making task. A generative neural network learns a probabilistic model over a vast space of candidate symbolic expressions, while reinforcement learning strategies guide the search toward the most promising regions. This approach integrates gradient-based optimization with evolutionary and local search techniques, and it incorporates in-situ constraints, domain-specific priors, and advanced policy optimization methods. The result is a robust framework capable of efficiently exploring extensive search spaces to identify interpretable and physically meaningful models. Extensive evaluations on benchmark problems have demonstrated that DSO achieves state-of-the-art performance in both accuracy and interpretability. In this chapter, we provide a comprehensive overview of the DSO framework and illustrate its transformative potential for automating symbolic optimization in scientific discovery.
Authors: Yueyang Yao, Jiajun Li, Xingyuan Dai, MengMeng Zhang, Xiaoyan Gong, Fei-Yue Wang, Yisheng Lv
Abstract: Time series forecasting is important for applications spanning energy markets, climate analysis, and traffic management. However, existing methods struggle to effectively integrate exogenous texts and align them with the probabilistic nature of large language models (LLMs). Current approaches either employ shallow text-time series fusion via basic prompts or rely on deterministic numerical decoding that conflicts with LLMs' token-generation paradigm, which limits contextual awareness and distribution modeling. To address these limitations, we propose CAPTime, a context-aware probabilistic multimodal time series forecasting method that leverages text-informed abstraction and autoregressive LLM decoding. Our method first encodes temporal patterns using a pretrained time series encoder, then aligns them with textual contexts via learnable interactions to produce joint multimodal representations. By combining a mixture of distribution experts with frozen LLMs, we enable context-aware probabilistic forecasting while preserving LLMs' inherent distribution modeling capabilities. Experiments on diverse time series forecasting tasks demonstrate the superior accuracy and generalization of CAPTime, particularly in multimodal scenarios. Additional analysis highlights its robustness in data-scarce scenarios through hybrid probabilistic decoding.
Authors: Tao Bai, Junzhuo Zhou, Zeyuan Deng, Peng Cao
Abstract: The composite current source (CCS) model has been adopted as an advanced timing model that represents the current behavior of cells for improved accuracy and better capability than traditional non-linear delay models (NLDM) to model complex dynamic effects and interactions under advanced process nodes. However, the high accuracy requirement, large amount of data, and extensive simulation cost pose severe challenges to CCS characterization. To address these challenges, we introduce a novel Gaussian Process Regression (GPR) model with active learning (AL) to establish the characterization framework efficiently and accurately. Our approach significantly outperforms conventional commercial tools as well as learning-based approaches, achieving an average absolute error of 2.05 ps and a relative error of 2.27% for current waveforms of 57 cells under 9 process, voltage, temperature (PVT) corners with the TSMC 22nm process. Additionally, our model drastically reduces runtime to 27% of that required by commercial tools and reduces storage by up to 19.5x.
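A minimal sketch of a GPR-with-active-learning loop, assuming a max-posterior-variance acquisition rule and a toy stand-in for the circuit simulator (the paper's acquisition strategy and characterization targets are more involved).
```python
# Active-learning loop: query the point where the GP posterior is most uncertain.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
pool = rng.uniform(0, 1, size=(500, 3))  # candidate (slew, load, PVT) points (toy)
simulate = lambda x: np.sin(4 * x[:, 0]) * x[:, 1] + 0.1 * x[:, 2]  # SPICE stand-in

idx = list(rng.choice(len(pool), 10, replace=False))  # initial labeled set
for _ in range(30):                                   # AL budget: 30 "simulations"
    gp = GaussianProcessRegressor(kernel=RBF()).fit(pool[idx], simulate(pool[idx]))
    _, std = gp.predict(pool, return_std=True)
    std[idx] = -np.inf                                # never re-pick labeled points
    idx.append(int(np.argmax(std)))                   # query the most uncertain point
```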
Authors: Ian Holmes, Min Chi
Abstract: Sparse and delayed reward functions pose a significant obstacle for real-world Reinforcement Learning (RL) applications. In this work, we propose Attention-based REward Shaping (ARES), a general and robust algorithm which uses a transformer's attention mechanism to generate shaped rewards and create a dense reward function for any environment. ARES requires a set of episodes and their final returns as input. It can be trained entirely offline and is able to generate meaningful shaped rewards even when using small datasets or episodes produced by agents taking random actions. ARES is compatible with any RL algorithm and can handle any level of reward sparsity. In our experiments, we focus on the most challenging case where rewards are fully delayed until the end of each episode. We evaluate ARES across a diverse range of environments, widely used RL algorithms, and baseline methods to assess the effectiveness of the shaped rewards it produces. Our results show that ARES can significantly improve learning in delayed reward settings, enabling RL agents to train in scenarios that would otherwise require impractical amounts of data or even be unlearnable. To our knowledge, ARES is the first approach that works fully offline, remains robust to extreme reward delays and low-quality data, and is not limited to goal-based tasks.
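One way to picture attention-based reward shaping is the sketch below (illustrative, not the paper's exact architecture): a small transformer is trained offline to regress each episode's delayed final return, and its attention weights over timesteps are reused to spread that return into dense per-step rewards.
```python
# Return redistribution via attention over episode timesteps (illustrative).
import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    def __init__(self, d_obs, d=64):
        super().__init__()
        self.embed = nn.Linear(d_obs, d)
        self.enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.query = nn.Parameter(torch.randn(d))   # learned pooling query
        self.head = nn.Linear(d, 1)

    def forward(self, obs):                          # obs: (B, T, d_obs)
        h = self.enc(self.embed(obs))                # (B, T, d)
        attn = torch.softmax(h @ self.query / h.shape[-1] ** 0.5, dim=1)  # (B, T)
        pooled = (attn.unsqueeze(-1) * h).sum(1)
        return self.head(pooled).squeeze(-1), attn   # predicted return, step weights

model = ReturnPredictor(d_obs=8)
obs = torch.randn(16, 50, 8)                         # batch of offline episodes (toy)
ret = torch.randn(16)                                # delayed final returns
pred, attn = model(obs)
loss = ((pred - ret) ** 2).mean()                    # train to regress the final return
shaped_rewards = attn * ret.unsqueeze(1)             # dense per-step shaped rewards
```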
Authors: Reilly Haskins, Benjamin Adams
Abstract: Knowledge distillation compresses a larger neural model (teacher) into smaller, faster student models by training them to match teacher outputs. However, the internal computational transformations that occur during this process remain poorly understood. We apply techniques from mechanistic interpretability to analyze how internal circuits, representations, and activation patterns differ between teacher and student. Focusing on GPT2-small and its distilled counterpart DistilGPT2, we find that student models reorganize, compress, and discard teacher components, often resulting in stronger reliance on fewer individual components. To quantify functional alignment beyond output similarity, we introduce an alignment metric based on influence-weighted component similarity, validated across multiple tasks. Our findings reveal that while knowledge distillation preserves broad functional behaviors, it also causes significant shifts in internal computation, with important implications for the robustness and generalization capacity of distilled models.
Authors: Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, Han Zhao
Abstract: Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. We open source our code at \href{https://github.com/uiuctml/MergeBench}{https://github.com/uiuctml/MergeBench}.
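As an example of the parameter arithmetic involved, here is a minimal sketch of task-arithmetic merging, one of the method families such a benchmark covers: build task vectors (finetuned minus base weights), average them, and add the result back to the base model. It assumes floating-point state dicts with identical keys.
```python
# Task-arithmetic model merging over PyTorch state dicts.
import torch

def merge_task_arithmetic(base_sd, finetuned_sds, alpha=0.5):
    # alpha scales the averaged task vector (a common tunable merging coefficient).
    merged = {}
    for k, w0 in base_sd.items():
        task_vecs = [sd[k] - w0 for sd in finetuned_sds]   # one vector per expert
        merged[k] = w0 + alpha * torch.stack(task_vecs).mean(0)
    return merged

# Usage: merged_sd = merge_task_arithmetic(base.state_dict(),
#                                          [m.state_dict() for m in experts])
```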
URLs: https://github.com/uiuctml/MergeBench
Authors: Ran Li, Hao Wang, Chengzhi Mao
Abstract: Efficient red-teaming methods to uncover vulnerabilities in Large Language Models (LLMs) are crucial. While recent attacks often use LLMs as optimizers, the discrete language space makes gradient-based methods struggle. We introduce LARGO (Latent Adversarial Reflection through Gradient Optimization), a novel latent self-reflection attack that reasserts the power of gradient-based optimization for generating fluent jailbreaking prompts. By operating within the LLM's continuous latent space, LARGO first optimizes an adversarial latent vector and then recursively calls the same LLM to decode the latent into natural language. This methodology yields a fast, effective, and transferable attack that produces fluent and stealthy prompts. On standard benchmarks like AdvBench and JailbreakBench, LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate. Our findings demonstrate a potent alternative to agentic LLM prompting, highlighting the efficacy of interpreting and attacking LLM internals through gradient optimization.
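A minimal sketch of gradient optimization in a model's continuous latent space (illustrative; LARGO additionally decodes the optimized latent back into natural language by recursively calling the same LLM): optimize a soft prompt's embeddings to raise the likelihood of a target continuation. The target string and prompt length below are arbitrary placeholders.
```python
# Optimize a continuous soft prompt by gradient descent against a frozen LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)                            # only the latent is trained

target_ids = tok(" Sure, here is", return_tensors="pt").input_ids  # placeholder target
target_emb = model.get_input_embeddings()(target_ids)

soft = torch.randn(1, 8, model.config.n_embd, requires_grad=True)  # latent prompt
opt = torch.optim.Adam([soft], lr=0.05)
for _ in range(100):
    inputs = torch.cat([soft, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Next-token prediction loss over the target span only.
    shift_logits = logits[:, soft.shape[1] - 1:-1]
    loss = torch.nn.functional.cross_entropy(
        shift_logits.reshape(-1, shift_logits.shape[-1]), target_ids.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```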
Authors: Hanyu Duan, Yi Yang, Ahmed Abbasi, Kar Yan Tam
Abstract: This paper introduces Ready2Unlearn, a learning-time optimization approach designed to facilitate future unlearning processes. Unlike the majority of existing unlearning efforts that focus on designing unlearning algorithms, which are typically implemented reactively when an unlearning request is made during the model deployment phase, Ready2Unlearn shifts the focus to the training phase, adopting a "forward-looking" perspective. Building upon well-established meta-learning principles, Ready2Unlearn proactively trains machine learning models with unlearning readiness, such that they are well prepared and can handle future unlearning requests in a more efficient and principled manner. Ready2Unlearn is model-agnostic and compatible with any gradient ascent-based machine unlearning algorithms. We evaluate the method on both vision and language tasks under various unlearning settings, including class-wise unlearning and random data unlearning. Experimental results show that by incorporating such preparedness at training time, Ready2Unlearn produces an unlearning-ready model state, which offers several key advantages when future unlearning is required, including reduced unlearning time, improved retention of overall model capability, and enhanced resistance to the inadvertent recovery of forgotten data. We hope this work could inspire future efforts to explore more proactive strategies for equipping machine learning models with built-in readiness towards more reliable and principled machine unlearning.
Authors: Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, Ting Wang
Abstract: This paper presents AutoRAN, the first automated, weak-to-strong jailbreak attack framework targeting large reasoning models (LRMs). At its core, AutoRAN leverages a weak, less-aligned reasoning model to simulate the target model's high-level reasoning structures, generates narrative prompts, and iteratively refines candidate prompts by incorporating the target model's intermediate reasoning steps. We evaluate AutoRAN against state-of-the-art LRMs including GPT-o3/o4-mini and Gemini-2.5-Flash across multiple benchmark datasets (AdvBench, HarmBench, and StrongReject). Results demonstrate that AutoRAN achieves remarkable success rates (approaching 100%) within one or a few turns across different LRMs, even when judged by a robustly aligned external model. This work reveals that leveraging weak reasoning models can effectively exploit the critical vulnerabilities of much more capable reasoning models, highlighting the need for improved safety measures specifically designed for reasoning-based models. The code for replicating AutoRAN and running records are available at: (https://github.com/JACKPURCELL/AutoRAN-public). (warning: this paper contains potentially harmful content generated by LRMs.)
Authors: Justin Sanders, Melih Yilmaz, Jacob H. Russell, Wout Bittremieux, William E. Fondrie, Nicholas M. Riley, Sewoong Oh, William Stafford Noble
Abstract: Mass spectrometry is the dominant technology in the field of proteomics, enabling high-throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise to improve the analysis of mass spectrometry data, with numerous purpose-built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre-train a spectrum encoder using de novo sequencing as a pre-training task. We then show that using these pre-trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi-task fine-tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for tandem mass spectrometry proteomics trained on de novo sequencing learns generalizable representations of spectra, improves performance on downstream tasks where training data is limited, and can ultimately enhance data acquisition and analysis in proteomics experiments.
Authors: Mengxuan Li, Ke Liu, Jialong Guo, Jiajun Bu, Hongwei Wang, Haishuai Wang
Abstract: Healthcare data frequently contain a substantial proportion of missing values, necessitating effective time series imputation to support downstream disease diagnosis tasks. However, existing imputation methods focus on discrete data points and are unable to effectively model sparse data, resulting in particularly poor performance when a substantial fraction of values is missing. In this paper, we propose a novel approach, ImputeINR, for time series imputation by employing implicit neural representations (INR) to learn continuous functions for time series. ImputeINR leverages the merits of INR in that the continuous functions are not coupled to sampling frequency and can be sampled at arbitrary resolution, allowing ImputeINR to generate fine-grained imputations even on extremely sparse observed values. Extensive experiments conducted on eight datasets with five ratios of masked values show the superior imputation performance of ImputeINR, especially for high missing ratios in time series data. Furthermore, we validate that applying ImputeINR to impute missing values in healthcare data enhances the performance of downstream disease diagnosis tasks. Code is available.
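A minimal sketch of INR-based imputation for a single series, assuming a Fourier-feature MLP as the continuous function (ImputeINR's actual architecture is more elaborate): fit the network to the observed (time, value) pairs, then query it at any timestamps, including the missing ones.
```python
# Fit an implicit neural representation to sparse observations, then resample.
import torch
import torch.nn as nn

class FourierINR(nn.Module):
    def __init__(self, n_freqs=16, hidden=64):
        super().__init__()
        self.freqs = 2 ** torch.arange(n_freqs).float()   # geometric frequency bank
        self.net = nn.Sequential(nn.Linear(2 * n_freqs, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, t):                        # t: (N, 1) timestamps in [0, 1]
        feats = torch.cat([torch.sin(t * self.freqs), torch.cos(t * self.freqs)], -1)
        return self.net(feats)

t_obs = torch.rand(40, 1)                        # sparse observed timestamps (toy)
y_obs = torch.sin(6 * t_obs) + 0.05 * torch.randn_like(t_obs)

inr = FourierINR()
opt = torch.optim.Adam(inr.parameters(), lr=1e-3)
for _ in range(2000):                            # fit the continuous function
    loss = ((inr(t_obs) - y_obs) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Because the function is continuous, it can be queried at any resolution.
y_imputed = inr(torch.linspace(0, 1, 200).unsqueeze(1))
```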
Authors: Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo
Abstract: Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.
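A minimal sketch of normalized sigmoid gating with a shared expert, in the spirit of DeepSeekMoE (simplified; production routers add top-k selection and load-balancing terms): sigmoid affinities are normalized to sum to one, and the shared expert is applied to every token unconditionally.
```python
# Shared expert + normalized sigmoid gating over routed experts.
import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    def __init__(self, d, n_experts=4):
        super().__init__()
        self.shared = nn.Linear(d, d)                # always-on shared expert
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
        self.router = nn.Linear(d, n_experts)

    def forward(self, x):                            # x: (batch, d)
        s = torch.sigmoid(self.router(x))            # per-expert sigmoid affinities
        g = s / s.sum(-1, keepdim=True)              # normalize gates to sum to 1
        out = self.shared(x)                         # shared expert contributes always
        for i, e in enumerate(self.experts):
            out = out + g[:, i:i + 1] * e(x)         # gated routed experts
        return out
```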
Authors: Thang Duong, Minglai Yang, Chicheng Zhang
Abstract: We investigate the use of a Large Language Model (LLM) in collecting high-quality data to warm-start Reinforcement Learning (RL) algorithms in classical Markov Decision Process (MDP) environments. In this work, we focus on using an LLM to generate an off-policy dataset that sufficiently covers state-actions visited by optimal policies, then later using an RL algorithm to explore the environment and improve the policy suggested by the LLM. Our algorithm, LORO, can both converge to an optimal policy and achieve high sample efficiency thanks to the LLM's good starting policy. On multiple OpenAI Gym environments, such as CartPole and Pendulum, we empirically demonstrate that LORO outperforms baseline algorithms such as pure LLM-based policies, pure RL, and a naive combination of the two, achieving up to $4 \times$ the cumulative rewards of the pure RL baseline.
Authors: Filippo Leveni, Luca Magri, Cesare Alippi, Giacomo Boracchi
Abstract: We focus on the problem of identifying samples in a set that do not conform to structured patterns represented by low-dimensional manifolds. An effective way to solve this problem is to embed data in a high dimensional space, called Preference Space, where anomalies can be identified as the most isolated points. In this work, we employ Locality Sensitive Hashing to avoid explicit computation of distances in high dimensions and thus improve Anomaly Detection efficiency. Specifically, we present an isolation-based anomaly detection technique designed to work in the Preference Space which achieves state-of-the-art performance at a lower computational cost. Code is publicly available at https://github.com/ineveLoppiliF/Hashing-for-Structure-based-Anomaly-Detection.
URLs: https://github.com/ineveLoppiliF/Hashing-for-Structure-based-Anomaly-Detection
Authors: Luca Magri, Filippo Leveni, Giacomo Boracchi
Abstract: We address the problem of recovering multiple structures of different classes in a dataset contaminated by noise and outliers. In particular, we consider geometric structures defined by a mixture of underlying parametric models (e.g. planes and cylinders, homographies and fundamental matrices), and we tackle the robust fitting problem by preference analysis and clustering. We present a new algorithm, termed MultiLink, that simultaneously deals with multiple classes of models. MultiLink combines on-the-fly model fitting and model selection in a novel linkage scheme that determines whether two clusters are to be merged. The resulting method features many practical advantages with respect to methods based on preference analysis, being faster, less sensitive to the inlier threshold, and able to compensate for limitations deriving from hypothesis sampling. Experiments on several public datasets demonstrate that MultiLink compares favourably with state-of-the-art alternatives, in both multi-class and single-class problems. Code is publicly made available for download.
Authors: Filippo Leveni, Luca Magri, Cesare Alippi, Giacomo Boracchi
Abstract: We address the problem of detecting anomalies as samples that do not conform to structured patterns represented by low-dimensional manifolds. To this end, we conceive a general anomaly detection framework called Preference Isolation Forest (PIF), that combines the benefits of adaptive isolation-based methods with the flexibility of preference embedding. The key intuition is to embed the data into a high-dimensional preference space by fitting low-dimensional manifolds, and to identify anomalies as isolated points. We propose three isolation approaches to identify anomalies: $i$) Voronoi-iForest, the most general solution, $ii$) RuzHash-iForest, that avoids explicit computation of distances via Local Sensitive Hashing, and $iii$) Sliding-PIF, that leverages a locality prior to improve efficiency and effectiveness.
Authors: Mathieu Alain, So Takao, Xiaowen Dong, Bastian Rieck, Emmanuel Noutahi
Abstract: Predicting the labels of graph-structured data is crucial in scientific applications and is often achieved using graph neural networks (GNNs). However, when data is scarce, GNNs suffer from overfitting, leading to poor performance. Recently, Gaussian processes (GPs) with graph-level inputs have been proposed as an alternative. In this work, we extend the Gaussian process framework to simplicial complexes (SCs), enabling the handling of edge-level attributes and attributes supported on higher-order simplices. We further augment the resulting SC representations by considering their Hodge decompositions, allowing us to account for homological information, such as the number of holes, in the SC. We demonstrate that our framework enhances the predictions across various applications, paving the way for GPs to be more widely used for graph and SC-level predictions.
Authors: Guoji Fu, Wee Sun Lee
Abstract: This paper studies the approximation and generalization abilities of score-based neural network generative models (SGMs) in estimating an unknown distribution $P_0$ from $n$ i.i.d. observations in $d$ dimensions. Assuming merely that $P_0$ is $\alpha$-sub-Gaussian, we prove that for any time step $t \in [t_0, n^{O(1)}]$, where $t_0 \geq O(\alpha^2n^{-2/d}\log n)$, there exists a deep ReLU neural network with width $\leq O(\log^3n)$ and depth $\leq O(n^{3/d}\log_2n)$ that can approximate the scores with $\tilde{O}(n^{-1})$ mean square error and achieve a nearly optimal rate of $\tilde{O}(n^{-1}t_0^{-d/2})$ for score estimation, as measured by the score matching loss. Our framework is universal and can be used to establish convergence rates for SGMs under milder assumptions than previous work. For example, assuming further that the target density function $p_0$ lies in Sobolev or Besov classes, with an appropriately early stopping strategy, we demonstrate that neural network-based SGMs can attain nearly minimax convergence rates up to logarithmic factors. Our analysis removes several crucial assumptions, such as Lipschitz continuity of the score function or a strictly positive lower bound on the target density.
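For reference, the score matching loss mentioned above has the standard time-integrated form (notation schematic; $s$ is the neural score estimate and $p_t$ the marginal density of the forward diffusion process at time $t$):
$$\mathcal{L}(s) \;=\; \int_{t_0}^{T} \mathbb{E}_{x_t \sim P_t}\left\| s(x_t, t) - \nabla_{x_t} \log p_t(x_t) \right\|_2^2 \, \mathrm{d}t .$$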
Authors: Donghyeon Ki, JunHyeok Oh, Seong-Woong Shim, Byung-Jun Lee
Abstract: Diffusion models have recently gained prominence in offline reinforcement learning due to their ability to effectively learn high-performing, generalizable policies from static datasets. Diffusion-based planners facilitate long-horizon decision-making by generating high-quality trajectories through iterative denoising, guided by return-maximizing objectives. However, existing guided sampling strategies such as Classifier Guidance, Classifier-Free Guidance, and Monte Carlo Sample Selection either produce suboptimal multi-modal actions, struggle with distributional drift, or incur prohibitive inference-time costs. To address these challenges, we propose Prior Guidance (PG), a novel guided sampling framework that replaces the standard Gaussian prior of a behavior-cloned diffusion model with a learnable distribution, optimized via a behavior-regularized objective. PG directly generates high-value trajectories without costly reward optimization of the diffusion model itself, and eliminates the need to sample multiple candidates at inference for sample selection. We present an efficient training strategy that applies behavior regularization in latent space, and empirically demonstrate that PG outperforms state-of-the-art diffusion policies and planners across diverse long-horizon offline RL benchmarks.
Authors: Alex Saad-Falcon, Brighton Ancelin, Justin Romberg
Abstract: This paper addresses the challenge of efficient principal component analysis (PCA) in high-dimensional spaces by analyzing a compressively sampled variant of Oja's algorithm with adaptive sensing. Traditional PCA methods incur substantial computational costs that scale poorly with data dimensionality, whereas subspace tracking algorithms like Oja's offer more efficient alternatives but typically require full-dimensional observations. We analyze a variant where, at each iteration, only two compressed measurements are taken: one in the direction of the current estimate and one in a random orthogonal direction. We prove that this adaptive sensing approach achieves global convergence in the presence of noise when tracking the leading eigenvector of a datastream with eigengap $\Delta=\lambda_1-\lambda_2$. Our theoretical analysis demonstrates that the algorithm experiences two phases: (1) a warmup phase requiring $O(\lambda_1\lambda_2d^2/\Delta^2)$ iterations to achieve a constant-level alignment with the true eigenvector, followed by (2) a local convergence phase where the sine alignment error decays at a rate of $O(\lambda_1\lambda_2d^2/\Delta^2 t)$ for iterations $t$. The guarantee aligns with existing minimax lower bounds with an added factor of $d$ due to the compressive sampling. This work provides the first convergence guarantees in adaptive sensing for subspace tracking with noise. Our proof technique is also considerably simpler than those in prior works. The results have important implications for applications where acquiring full-dimensional samples is challenging or costly.
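A minimal sketch of the two-measurement scheme: at each step, measure the sample along the current estimate and along a random orthogonal direction, and apply an Oja-style update built only from those two scalars (the update rule and step size below are illustrative, not the paper's exact algorithm).
```python
# Compressed Oja-style leading-eigenvector tracking from two measurements per step.
import numpy as np

rng = np.random.default_rng(0)
d = 50
v_true = np.eye(d)[0]                                   # true leading eigenvector
cov = np.diag([4.0] + [1.0] * (d - 1))                  # lambda1 = 4, lambda2 = 1

w = rng.normal(size=d); w /= np.linalg.norm(w)          # current estimate
for t in range(1, 20001):
    x = rng.multivariate_normal(np.zeros(d), cov)       # streaming sample
    r = rng.normal(size=d)
    u = r - (r @ w) * w                                 # random direction orthogonal to w
    u /= np.linalg.norm(u)
    y_w, y_u = x @ w, x @ u                             # the two compressed measurements
    # x x^T w is approximated from its projection onto span{w, u}.
    w = w + (1.0 / t) * y_w * (y_w * w + y_u * u)
    w /= np.linalg.norm(w)                              # keep the estimate on the sphere

print("alignment:", abs(w @ v_true))
```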
Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen
Abstract: Post-training of LLMs with RLHF, and subsequently preference optimization algorithms such as DPO, IPO, etc., made a big difference in improving human alignment. However, all such techniques can only work with a single (human) objective. In practice, human users have multiple objectives, such as helpfulness and harmlessness, and there is no natural way to aggregate them into a single objective. In this paper, we address the multi-objective preference-alignment problem, where a policy must optimize several, potentially conflicting, objectives. We introduce the Multi-Objective Preference Optimization (MOPO) algorithm, which frames alignment as a constrained KL-regularized optimization: the primary objective is maximized while secondary objectives are lower-bounded by tunable safety thresholds. Unlike prior work, MOPO operates directly on pairwise preference data, requires no point-wise reward assumption, and avoids heuristic prompt-context engineering. The method recovers policies on the Pareto front whenever the front is attainable; practically, it reduces to simple closed-form iterative updates suitable for large-scale training. On synthetic benchmarks with diverse canonical preference structures, we show that MOPO approximates the Pareto front. When fine-tuning a 1.3B-parameter language model on real-world human-preference datasets, MOPO attains higher rewards and yields policies that Pareto-dominate baselines; ablation studies confirm optimization stability and robustness to hyperparameters.
Authors: Yishuo Wang, Feng Zhou, Muping Zhou, Qicheng Meng, Zhijun Hu, Yi Wang
Abstract: This paper proposes CTP, a novel deep learning framework that integrates convolutional neural networks (CNN), Transformer architectures, and physics-informed neural networks (PINN) for ocean front prediction. Ocean fronts, as dynamic interfaces between distinct water masses, play critical roles in marine biogeochemical and physical processes. Existing methods such as LSTM, ConvLSTM, and AttentionConv often struggle to maintain spatial continuity and physical consistency over multi-step forecasts. CTP addresses these challenges by combining localized spatial encoding, long-range temporal attention, and physical constraint enforcement. Experimental results across the South China Sea (SCS) and Kuroshio (KUR) regions from 1993 to 2020 demonstrate that CTP achieves state-of-the-art (SOTA) performance in both single-step and multi-step predictions, significantly outperforming baseline models in accuracy, $F_1$ score, and temporal stability.
Authors: Muntasir Hoq, Ananya Rao, Reisha Jaishankar, Krish Piryani, Nithya Janapati, Jessica Vandenberg, Bradford Mott, Narges Norouzi, James Lester, Bita Akram
Abstract: In Computer Science (CS) education, understanding factors contributing to students' programming difficulties is crucial for effective learning support. By identifying specific issues students face, educators can provide targeted assistance to help them overcome obstacles and improve learning outcomes. While identifying sources of struggle, such as misconceptions, in real-time can be challenging in current educational practices, analyzing logical errors in students' code can offer valuable insights. This paper presents a scalable framework for automatically detecting logical errors in students' programming solutions. Our framework is based on an explainable Abstract Syntax Tree (AST) embedding model, the Subtree-based Attention Neural Network (SANN), that identifies the structural components of programs containing logical errors. We conducted a series of experiments to evaluate its effectiveness, and the results suggest that our framework can accurately capture students' logical errors and, more importantly, provide us with deeper insights into their learning processes, offering a valuable tool for enhancing programming education.
Authors: Xiao Han, Dayan Pan, Xiangyu Zhao, Xuyuan Hu, Zhaolin Deng, Xiangjie Kong, Guojiang Shen
Abstract: Spatiotemporal relationships are critical in data science, as many prediction and reasoning tasks require analysis across both spatial and temporal dimensions--for instance, navigating an unfamiliar city involves planning itineraries that sequence locations and timing cultural experiences. However, existing Question-Answering (QA) datasets lack sufficient spatiotemporal-sensitive questions, making them inadequate benchmarks for evaluating models' spatiotemporal reasoning capabilities. To address this gap, we introduce POI-QA, a novel spatiotemporal-sensitive QA dataset centered on Point of Interest (POI), constructed through three key steps: mining and aligning open-source vehicle trajectory data from GAIA with high-precision geographic POI data, rigorous manual validation of noisy spatiotemporal facts, and generating bilingual (Chinese/English) QA pairs that reflect human-understandable spatiotemporal reasoning tasks. Our dataset challenges models to parse complex spatiotemporal dependencies, and evaluations of state-of-the-art multilingual LLMs (e.g., Qwen2.5-7B, Llama3.1-8B) reveal stark limitations: even the top-performing model (Qwen2.5-7B fine-tuned with RAG+LoRA) achieves a top 10 Hit Ratio (HR@10) of only 0.41 on the easiest task, far below human performance at 0.56. This underscores persistent weaknesses in LLMs' ability to perform consistent spatiotemporal reasoning, while highlighting POI-QA as a robust benchmark to advance algorithms sensitive to spatiotemporal dynamics. The dataset is publicly available at https://www.kaggle.com/ds/7394666.
Authors: Congcong Zhu, Xiaoyan Xu, Jiayue Han, Jingrun Chen
Abstract: Auto-regressive partial differential equation (PDE) foundation models have shown great potential in handling time-dependent data. However, these models suffer from the shortcut problem deeply rooted in auto-regressive prediction, causing error accumulation. The challenge becomes particularly evident for out-of-distribution data, as the pretraining performance may approach random model initialization for downstream tasks with long-term dynamics. To deal with this problem, we propose physics-informed temporal alignment (PITA), a self-supervised learning framework inspired by inverse problem solving. Specifically, PITA aligns the physical dynamics discovered at different time steps on each given PDE trajectory by integrating physics-informed constraints into the self-supervision signal. The alignment is derived from observation data without relying on known physics priors, indicating strong generalization ability to the out-of-distribution data. Extensive experiments show that PITA significantly enhances the accuracy and robustness of existing foundation models on diverse time-dependent PDE data. The code is available at https://github.com/SCAILab-USTC/PITA.
Authors: Ozan Özdenizci, Elmar Rueckert, Robert Legenstein
Abstract: Lifelong learning algorithms enable models to incrementally acquire new knowledge without forgetting previously learned information. Contrarily, the field of machine unlearning focuses on explicitly forgetting certain previous knowledge from pretrained models when requested, in order to comply with data privacy regulations on the right-to-be-forgotten. Enabling efficient lifelong learning with the capability to selectively unlearn sensitive information from models presents a critical and largely unaddressed challenge with contradicting objectives. We address this problem from the perspective of simultaneously preventing catastrophic forgetting and allowing forward knowledge transfer during task-incremental learning, while ensuring exact task unlearning and minimizing memory requirements, based on a single neural network model to be adapted. Our proposed solution, privacy-aware lifelong learning (PALL), involves optimization of task-specific sparse subnetworks with parameter sharing within a single architecture. We additionally utilize an episodic memory rehearsal mechanism to facilitate exact unlearning without performance degradations. We empirically demonstrate the scalability of PALL across various architectures in image classification, and provide a state-of-the-art solution that uniquely integrates lifelong learning and privacy-aware unlearning mechanisms for responsible AI applications.
Authors: Kehan Long, Jorge Cortés, Nikolay Atanasov
Abstract: We study the problem of certifying the stability of closed-loop systems under control policies derived from optimal control or reinforcement learning (RL). Classical Lyapunov methods require a strict step-wise decrease in the Lyapunov function but such a certificate is difficult to construct for a learned control policy. The value function associated with an RL policy is a natural Lyapunov function candidate but it is not clear how it should be modified. To gain intuition, we first study the linear quadratic regulator (LQR) problem and make two key observations. First, a Lyapunov function can be obtained from the value function of an LQR policy by augmenting it with a residual term related to the system dynamics and stage cost. Second, the classical Lyapunov decrease requirement can be relaxed to a generalized Lyapunov condition requiring only decrease on average over multiple time steps. Using this intuition, we consider the nonlinear setting and formulate an approach to learn generalized Lyapunov functions by augmenting RL value functions with neural network residual terms. Our approach successfully certifies the stability of RL policies trained on Gymnasium and DeepMind Control benchmarks. We also extend our method to jointly train neural controllers and stability certificates using a multi-step Lyapunov loss, resulting in larger certified inner approximations of the region of attraction compared to the classical Lyapunov approach. Overall, our formulation enables stability certification for a broad class of systems with learned policies by making certificates easier to construct, thereby bridging classical control theory and modern learning-based methods.
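Schematically, the generalized condition relaxes the classical step-wise decrease $V(x_{t+1}) < V(x_t)$ to a decrease on average over an $N$-step window (notation illustrative):
$$V(x) > 0 \;\;\forall x \ne 0, \qquad \frac{1}{N}\sum_{k=1}^{N} V\!\left(x_{t+k}\right) \;<\; V\!\left(x_t\right) \;\;\forall t .$$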
Authors: Chenhui Xu, Dancheng Liu, Amir Nassereldine, Jinjun Xiong
Abstract: Physics-Informed Neural Networks (PINNs) often exhibit failure modes in which the PDE residual loss converges while the solution error stays large, a phenomenon traditionally blamed on local optima separated from the true solution by steep loss barriers. We challenge this understanding by demonstrating that the real culprit is insufficient arithmetic precision: with standard FP32, the L-BFGS optimizer prematurely satisfies its convergence test, freezing the network in a spurious failure phase. Simply upgrading to FP64 rescues optimization, enabling vanilla PINNs to solve PDEs without any failure modes. These results reframe PINN failure modes as precision-induced stalls rather than inescapable local minima and expose a three-stage training dynamic (unconverged, failure, success) whose boundaries shift with numerical precision. Our findings emphasize that rigorous arithmetic precision is the key to dependable PDE solving with neural networks.
Authors: Tianshuo Zhang, Gao Jia, Wenzhe Zhai, Rui Yann, Xianglei Xing
Abstract: Data steganography aims to conceal information within visual content, yet existing spatial- and frequency-domain approaches suffer from trade-offs between security, capacity, and perceptual quality. Recent advances in generative models, particularly diffusion models, offer new avenues for adaptive image synthesis, but integrating precise information embedding into the generative process remains challenging. We introduce Shackled Dancing Diffusion, or SD$^2$, a plug-and-play generative steganography method that combines bit-position locking with diffusion sampling injection to enable controllable information embedding within the generative trajectory. SD$^2$ leverages the expressive power of diffusion models to synthesize diverse carrier images while maintaining full message recovery with $100\%$ accuracy. Our method achieves a favorable balance between randomness and constraint, enhancing robustness against steganalysis without compromising image fidelity. Extensive experiments show that SD$^2$ substantially outperforms prior methods in security, embedding capacity, and stability. This algorithm offers new insights into controllable generation and opens promising directions for secure visual communication.
Authors: Qiuyu Zhu, Liang Zhang, Qianxiong Xu, Cheng Long, Jie Zhang
Abstract: Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to incorporate structured knowledge via graph retrieval as contextual input, enabling more accurate and context-aware reasoning. We observe that different queries may retrieve similar subgraphs as prompts, and thus we propose SubGCache, which aims to reduce inference latency by reusing computation across queries with similar structural prompts (i.e., subgraphs). Specifically, SubGCache clusters queries based on subgraph embeddings, constructs a representative subgraph for each cluster, and pre-computes the key-value (KV) cache of the representative subgraph. For each query whose retrieved subgraph falls within a cluster, it reuses the pre-computed KV cache of the cluster's representative subgraph rather than recomputing the KV tensors, saving computation. Experiments on two new datasets across multiple LLM backbones and graph-based RAG frameworks demonstrate that SubGCache consistently reduces inference latency with comparable and even improved generation quality, achieving up to 6.68$\times$ reduction in time-to-first-token (TTFT).
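A hedged sketch of the caching scheme follows; `prefill_kv` is a hypothetical stand-in for the LLM prefill pass, and picking the cluster member closest to the centroid is a simplification of constructing a representative subgraph.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_caches(subgraph_embeddings, prefill_kv, n_clusters=8):
    """Cluster query subgraph embeddings; precompute one KV cache per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(subgraph_embeddings)
    caches = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        center = km.cluster_centers_[c]
        rep = members[np.argmin(np.linalg.norm(
            subgraph_embeddings[members] - center, axis=1))]
        caches[c] = prefill_kv(rep)  # hypothetical: LLM prefill on subgraph `rep`
    return km, caches

# Serving: c = km.predict(new_embedding[None])[0]; reuse caches[c] as the prompt KV.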
Authors: Koki Iwai, Yusuke Kumagae, Yuki Koyama, Masahiro Hamasaki, Masataka Goto
Abstract: Preferential Bayesian optimization (PBO) is a variant of Bayesian optimization that observes relative preferences (e.g., pairwise comparisons) instead of direct objective values, making it especially suitable for human-in-the-loop scenarios. However, real-world optimization tasks often involve inequality constraints, which existing PBO methods have not yet addressed. To fill this gap, we propose constrained preferential Bayesian optimization (CPBO), an extension of PBO that incorporates inequality constraints for the first time. Specifically, we present a novel acquisition function for this purpose. Our technical evaluation shows that our CPBO method successfully identifies optimal solutions by focusing on exploring feasible regions. As a practical application, we also present a designer-in-the-loop system for banner ad design using CPBO, where the objective is the designer's subjective preference, and the constraint ensures a target predicted click-through rate. We conducted a user study with professional ad designers, demonstrating the potential benefits of our approach in guiding creative design under real-world constraints.
Authors: Vijay Prakash Dwivedi, Sri Jaladi, Yangyi Shen, Federico López, Charilaos I. Kanatsoulis, Rishi Puri, Matthias Fey, Jure Leskovec
Abstract: Relational Deep Learning (RDL) is a promising approach for building state-of-the-art predictive models on multi-table relational data by representing it as a heterogeneous temporal graph. However, commonly used Graph Neural Network models suffer from fundamental limitations in capturing complex structural patterns and long-range dependencies that are inherent in relational data. While Graph Transformers have emerged as powerful alternatives to GNNs on general graphs, applying them to relational entity graphs presents unique challenges: (i) Traditional positional encodings fail to generalize to massive, heterogeneous graphs; (ii) existing architectures cannot model the temporal dynamics and schema constraints of relational data; (iii) existing tokenization schemes lose critical structural information. Here we introduce the Relational Graph Transformer (RelGT), the first graph transformer architecture designed specifically for relational tables. RelGT employs a novel multi-element tokenization strategy that decomposes each node into five components (features, type, hop distance, time, and local structure), enabling efficient encoding of heterogeneity, temporality, and topology without expensive precomputation. Our architecture combines local attention over sampled subgraphs with global attention to learnable centroids, incorporating both local and database-wide representations. Across 21 tasks from the RelBench benchmark, RelGT consistently matches or outperforms GNN baselines by up to 18%, establishing Graph Transformers as a powerful architecture for Relational Deep Learning.
Authors: Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An
Abstract: Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to long-horizon LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) at the episode level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) at the step level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on two challenging agent benchmarks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals and achieves performance gains of over 12% on ALFWorld and over 9% on WebShop compared to the GRPO baseline, all while maintaining the same GPU memory overhead and identical LLM rollouts, and incurring little to no additional time cost.
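The two-level grouping can be sketched as follows (an illustrative reading of the abstract, not the authors' code): episode-level advantages normalize total returns within a trajectory group, while step-level advantages re-group step returns by a hashable anchor state recurring across trajectories.

```python
from collections import defaultdict
import numpy as np

def episode_advantages(returns):
    """Macro advantages: normalize total returns within a group of rollouts."""
    r = np.asarray(returns, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def step_advantages(steps):
    """Micro advantages. steps: list of (anchor_state_key, reward_to_go) pairs
    pooled across all trajectories in the group."""
    groups = defaultdict(list)
    for key, g in steps:
        groups[key].append(g)
    stats = {k: (np.mean(v), np.std(v) + 1e-8) for k, v in groups.items()}
    return [(g - stats[k][0]) / stats[k][1] for k, g in steps]
```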
Authors: Haozheng Luo, Chenghao Qiu, Yimin Wang, Shang Wu, Jiahao Yu, Han Liu, Binghui Wang, Yan Chen
Abstract: We propose the first unified adversarial attack benchmark for Genomic Foundation Models (GFMs), named GenoArmory. Unlike existing GFM benchmarks, GenoArmory offers the first comprehensive evaluation framework to systematically assess the vulnerability of GFMs to adversarial attacks. Methodologically, we evaluate the adversarial robustness of five state-of-the-art GFMs using four widely adopted attack algorithms and three defense strategies. Importantly, our benchmark provides an accessible and comprehensive framework to analyze GFM vulnerabilities with respect to model architecture, quantization schemes, and training datasets. Additionally, we introduce GenoAdv, a new adversarial sample dataset designed to improve GFM safety. Empirically, classification models exhibit greater robustness to adversarial perturbations compared to generative models, highlighting the impact of task type on model vulnerability. Moreover, adversarial attacks frequently target biologically significant genomic regions, suggesting that these models effectively capture meaningful sequence features.
Authors: Feiran You, Hongyang Du
Abstract: Heterogeneous Networks (HetNets) pose critical challenges for intelligent management due to diverse user requirements and time-varying wireless conditions. These factors introduce significant decision complexity, which limits the adaptability of existing Deep Reinforcement Learning (DRL) methods. In many DRL algorithms, especially those involving value-based or actor-critic structures, the critic component plays a key role in guiding policy learning by estimating value functions. However, conventional critic models often use shallow architectures that map observations directly to scalar estimates, limiting their ability to handle multi-task complexity. In contrast, recent progress in inference-time scaling of Large Language Models (LLMs) has shown that generating intermediate reasoning steps can significantly improve decision quality. Motivated by this, we propose ReaCritic, a large reasoning transformer-based critic model scaling scheme that brings reasoning ability into DRL. ReaCritic performs horizontal reasoning over parallel state-action inputs and vertical reasoning through deep transformer stacks. It is compatible with a broad range of value-based and actor-critic DRL algorithms and enhances generalization in dynamic wireless environments. Extensive experiments demonstrate that ReaCritic improves convergence speed and final performance across various HetNet settings and standard OpenAI Gym control tasks.
Authors: Wenjie Ou, Zhishuo Zhao, Dongyue Guo, Yi Lin
Abstract: Time series forecasting is critical across multiple domains, where time series data exhibits both local patterns and global dependencies. While Transformer-based methods effectively capture global dependencies, they often overlook short-term local variations in time series. Recent methods that adapt large language models (LLMs) into time series forecasting inherit this limitation by treating LLMs as black-box encoders, relying solely on the final-layer output and underutilizing hierarchical representations. To address this limitation, we propose Logo-LLM, a novel LLM-based framework that explicitly extracts and models multi-scale temporal features from different layers of a pre-trained LLM. Through empirical analysis, we show that shallow layers of LLMs capture local dynamics in time series, while deeper layers encode global trends. Moreover, Logo-LLM introduces lightweight Local-Mixer and Global-Mixer modules to align and integrate features with the temporal input across layers. Extensive experiments demonstrate that Logo-LLM achieves superior performance across diverse benchmarks, with strong generalization in few-shot and zero-shot settings while maintaining low computational overhead.
Authors: Kutalmış Coşkun, Ivo Kavisanczki, Amin Mirzaei, Tom Siegl, Bjarne C. Hiller, Stefan Lüdtke, Martin Becker
Abstract: In complex and low-data domains such as biomedical research, incorporating background knowledge (BK) graphs, such as protein-protein interaction (PPI) networks, into graph-based machine learning pipelines is a promising research direction. However, while BK is often assumed to improve model performance, its actual contribution and the impact of imperfect knowledge remain poorly understood. In this work, we investigate the role of BK in an important real-world task: cancer subtype classification. Surprisingly, we find that (i) state-of-the-art GNNs using BK perform no better than uninformed models like linear regression, and (ii) their performance remains largely unchanged even when the BK graph is heavily perturbed. To understand these unexpected results, we introduce an evaluation framework, which employs (i) a synthetic setting where the BK is clearly informative and (ii) a set of perturbations that simulate various imperfections in BK graphs. With this, we test the robustness of BK-aware models in both synthetic and real-world biomedical settings. Our findings reveal that careful alignment of GNN architectures and BK characteristics is necessary but holds the potential for significant performance improvements.
Authors: Wolfgang Rannetbauer, Simon Hubmer, Carina Hambrock, Ronny Ramlau
Abstract: The implementation of thermally sprayed components in steel manufacturing presents challenges for production and plant maintenance. While enhancing performance through specialized surface properties, these components may encounter difficulties in meeting modified requirements due to standardization in the refurbishment process. This article proposes updating the established coating process for thermally spray coated components for steel manufacturing (TCCSM) by integrating real-time data analytics and predictive quality management. Two essential components--the data aggregator and the quality predictor--are designed through continuous process monitoring and the application of data-driven methodologies to meet the dynamic demands of the evolving steel landscape. The quality predictor is powered by the simple and effective multiple kernel learning strategy with the goal of realizing predictive quality. The data aggregator, designed with sensors, flow meters, and intelligent data processing for the thermal spray coating process, is proposed to facilitate real-time analytics. The performance of this combination was verified using small-scale tests that enabled not only the accurate prediction of coating quality based on the collected data but also proactive notification to the operator as soon as significant deviations are identified.
Authors: Li Ju, Max Andersson, Stina Fredriksson, Edward Glöckner, Andreas Hellander, Ekta Vats, Prashant Singh
Abstract: Vision-language models (VLMs) as foundation models have significantly enhanced performance across a wide range of visual and textual tasks, without requiring large-scale training from scratch for downstream tasks. However, these deterministic VLMs fail to capture the inherent ambiguity and uncertainty in natural language and visual data. Recent probabilistic post-hoc adaptation methods address this by mapping deterministic embeddings onto probability distributions; however, existing approaches do not account for the asymmetric uncertainty structure of the modalities, and the constraint that meaningful deterministic embeddings reside on a unit hypersphere, potentially leading to suboptimal performance. In this paper, we address the asymmetric uncertainty structure inherent in textual and visual data, and propose AsymVLM to build probabilistic embeddings from pre-trained VLMs on the unit hypersphere, enabling uncertainty quantification. We validate the effectiveness of the probabilistic embeddings on established benchmarks, and present comprehensive ablation studies demonstrating the inherent nature of asymmetry in the uncertainty structure of textual and visual data.
Authors: Kihun Hong, Sejun Park, Ganguk Hwang
Abstract: Federated learning (FL) has attracted significant attention for enabling collaborative learning without exposing private data. Among the primary variants of FL, vertical federated learning (VFL) addresses feature-partitioned data held by multiple institutions, each holding complementary information for the same set of users. However, existing VFL methods often impose restrictive assumptions such as a small number of participating parties, fully aligned data, or only using labeled data. In this work, we reinterpret alignment gaps in VFL as missing data problems and propose a unified framework that accommodates both training and inference under arbitrary alignment and labeling scenarios, while supporting diverse missingness mechanisms. In the experiments on 168 configurations spanning four benchmark datasets, six training-time missingness patterns, and seven testing-time missingness patterns, our method outperforms all baselines in 160 cases with an average gap of 9.6 percentage points over the next-best competitors. To the best of our knowledge, this is the first VFL framework to jointly handle arbitrary data alignment, unlabeled data, and multi-party collaboration all at once.
Authors: Zhexiang Li, Haoyu Wang, Yutong Bao, David Woodruff
Abstract: Recent advances in transformer architectures deeply enhance long-context language modeling. Among them, HyperAttention achieves competitive efficiency by combining single-level LSH-based clustering with uniform residual sampling. However, such sampling can miss crucial keys, which in turn raises overall perplexity. In this paper, we propose a pre-scoring mechanism that assists HyperAttention in prioritizing significant keys. Specifically, we introduce three scoring methods: K-means clustering, K-median clustering, and leverage-score-based ranking (inspired by LevAttention) to filter keys effectively. We further replace HyperAttention's original uniform residual sampling entirely, relying exclusively on our pre-scoring mechanism. Experiments on ChatGLM2 with a 131k-token context reduce perplexity from 12 to 8.3, outperforming standard HyperAttention. Moreover, on the Vision Transformer (ViT), our method matches the accuracy of LevAttention and surpasses it under specific parameter settings. Although the method introduces computational overhead, its combination with HyperAttention remains 20 times faster than FlashAttention, providing a balanced trade-off between speed and modeling accuracy. Our results highlight the effectiveness of integrating pre-scoring into hierarchical attention mechanisms, significantly improving Transformer efficiency.
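One of the three scoring methods, K-means-based pre-scoring, might look like the following sketch; treating keys far from their assigned centroid as "significant" is one plausible reading, stated here as an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_key_scores(K, n_clusters=16):
    """Score each key by its distance to the assigned K-means centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(K)
    return np.linalg.norm(K - km.cluster_centers_[km.labels_], axis=1)

def select_keys(K, budget):
    """Keep the `budget` highest-scoring keys (direction of the rule is assumed)."""
    return np.argsort(kmeans_key_scores(K))[-budget:]
```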
Authors: Zhirui Fang, Kai Yang, Jian Tao, Jiafei Lyu, Lusong Li, Li Shen, Xiu Li
Abstract: Exploration remains a critical challenge in online reinforcement learning, as an agent must effectively explore unknown environments to achieve high returns. Currently, the main exploration algorithms are count-based methods and curiosity-based methods, with prediction-error methods being a prominent example of the latter. In this paper, we propose a novel method called Random Distribution Distillation (RDD), which samples the output of a target network from a normal distribution. RDD facilitates more extensive exploration by explicitly treating the difference between the prediction network and the target network as an intrinsic reward. Furthermore, by introducing randomness into the output of the target network for a given state and modeling it as a sample from a normal distribution, intrinsic rewards are bounded by two key components: a pseudo-count term ensuring proper exploration decay and a discrepancy term accounting for predictor convergence. We demonstrate that RDD effectively unifies both count-based and prediction-error approaches. It retains the advantages of prediction-error methods in high-dimensional spaces, while also implementing an intrinsic reward decay mode akin to pseudo-count methods. In the experiments, RDD is compared with more advanced methods in a series of environments. Both theoretical analysis and experimental results confirm the effectiveness of our approach in improving online exploration for reinforcement learning tasks.
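A minimal sketch of the intrinsic reward under the stated modeling choice (the target's output as a sample from a normal distribution) follows; the exact parameterization of the mean and standard deviation networks is an assumption.

```python
import torch

def rdd_intrinsic_reward(predictor, target_mu, target_sigma, state):
    """Bonus = squared error between the predictor and a *sampled* target output."""
    with torch.no_grad():
        mu = target_mu(state)
        sample = mu + target_sigma(state) * torch.randn_like(mu)
    return ((predictor(state) - sample) ** 2).mean(dim=-1)
```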
Authors: Jeroen Bollen, Jan Van den Bussche, Stijn Vansummeren, Jonni Virtema
Abstract: Graph Neural Networks (GNNs) are a class of machine-learning models that operate on graph-structured data. Their expressive power is intimately related to logics that are invariant under graded bisimilarity. Current proposals for recurrent GNNs either assume that the graph size is given to the model, or suffer from a lack of termination guarantees. In this paper, we propose a halting mechanism for recurrent GNNs. We prove that our halting model can express all node classifiers definable in graded modal mu-calculus, even for the standard GNN variant that is oblivious to the graph size. A recent breakthrough in the study of the expressivity of graded modal mu-calculus in the finite suggests that conversely, restricted to node classifiers definable in monadic second-order logic, recurrent GNNs can express only node classifiers definable in graded modal mu-calculus. To prove our main result, we develop a new approximate semantics for graded mu-calculus, which we believe to be of independent interest. We leverage this new semantics into a new model-checking algorithm, called the counting algorithm, which is oblivious to the graph size. In a final step we show that the counting algorithm can be implemented on a halting recurrent GNN.
Authors: Mélodie Monod, Alessandro Micheli, Samir Bhatt
Abstract: We introduce NeuralSurv, the first deep survival model to incorporate Bayesian uncertainty quantification. Our non-parametric, architecture-agnostic framework flexibly captures time-varying covariate-risk relationships in continuous time via a novel two-stage data-augmentation scheme, for which we establish theoretical guarantees. For efficient posterior inference, we introduce a mean-field variational algorithm with coordinate-ascent updates that scale linearly in model size. By locally linearizing the Bayesian neural network, we obtain full conjugacy and derive all coordinate updates in closed form. In experiments, NeuralSurv delivers superior calibration compared to state-of-the-art deep survival models, while matching or exceeding their discriminative performance across both synthetic benchmarks and real-world datasets. Our results demonstrate the value of Bayesian principles in data-scarce regimes by enhancing model calibration and providing robust, well-calibrated uncertainty estimates for the survival function.
Authors: Omobayode Fagbohungbe, Corey Lammie, Malte J. Rasch, Takashi Ando, Tayfun Gokmen, Vijay Narayanan
Abstract: Analog in-memory computing is a next-generation computing paradigm that promises fast, parallel, and energy-efficient deep learning training and transfer learning (TL). However, achieving this promise has remained elusive due to a lack of suitable training algorithms. Analog memory devices exhibit asymmetric and non-linear switching behavior in addition to device-to-device variation, meaning that most, if not all, of the current off-the-shelf training algorithms cannot achieve good training outcomes. Also, recently introduced algorithms have enjoyed limited attention, as they require bi-directionally switching devices of unrealistically high symmetry and precision and are highly sensitive. A new algorithm, chopped TTv2 (c-TTv2), has been introduced, which leverages the chopped technique to address many of the challenges mentioned above. In this paper, we assess the performance of the c-TTv2 algorithm for analog TL using a Swin-ViT model on a subset of the CIFAR100 dataset. We also investigate the robustness of our algorithm to changes in some device specifications, including weight transfer noise, symmetry point skew, and symmetry point variability.
Authors: Vladimír Boža, Vladimír Macko
Abstract: Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in the 1-bit-per-weight range, DBF is better than existing binarization approaches. In the 2-bit-per-weight range, DBF is competitive with the best quantization methods such as QuIP# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization's intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria. Code available at: https://github.com/usamec/double_binary
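The factorization form can be illustrated with a toy initialization; the SVD-based start and the placement of the scaling vectors are assumptions for illustration, not the released algorithm.

```python
import numpy as np

def dbf_init(W, mid_dim):
    """Crude SVD-based start for W ~ diag(a) B1 diag(m) B2 with sign matrices B1, B2."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = U[:, :mid_dim] * np.sqrt(s[:mid_dim])
    R = np.sqrt(s[:mid_dim])[:, None] * Vt[:mid_dim]
    B1, a = np.sign(L), np.abs(L).mean(axis=1)  # (out, mid) signs, per-row scales
    B2, m = np.sign(R), np.abs(R).mean(axis=1)  # (mid, in) signs, per-mid scales
    return a, B1, m, B2

def dbf_apply(x, a, B1, m, B2):
    """Binary matmuls reduce to additions/subtractions on suitable hardware."""
    return a * (B1 @ (m * (B2 @ x)))
```

Note how `mid_dim` is the knob the abstract mentions: enlarging the intermediate dimension trades compression for accuracy at a fine granularity.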
Authors: Pierre Clavier, Nathan Grinsztajn, Raphael Avalos, Yannis Flet-Berliac, Irem Ergun, Omar D. Domingues, Eugene Tarassov, Olivier Pietquin, Pierre H. Richemond, Florian Strub, Matthieu Geist
Abstract: The fine-tuning of pre-trained large language models (LLMs) using reinforcement learning (RL) is generally formulated as direct policy optimization. This approach was naturally favored as it efficiently improves a pretrained LLM, seen as an initial policy. Another RL paradigm, Q-learning methods, has received far less attention in the LLM community while demonstrating major success in various non-LLM RL tasks. In particular, Q-learning's effectiveness comes from its sample efficiency and ability to learn offline, which is particularly valuable given the high computational cost of sampling with LLMs. However, naively applying a Q-learning-style update to the model's logits is ineffective due to the specificity of LLMs. Our core contribution is to derive theoretically grounded loss functions from Bellman equations to adapt Q-learning methods to LLMs. To do so, we carefully adapt insights from the RL literature to account for LLM-specific characteristics, ensuring that the logits become reliable Q-value estimates. We then use this loss to build a practical algorithm, ShiQ (for Shifted-Q), that supports off-policy, token-wise learning while remaining simple to implement. Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback and BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings.
Authors: Guangqiang Li, M. Amine Atoui, Xiangshun Li
Abstract: Deep learning methods have shown promising performance in fault diagnosis for multimode processes. Most existing studies assume that the health state categories collected from different operating modes are identical. However, in real industrial scenarios, these categories typically exhibit only partial overlap. The incompleteness of the available data and the large distributional differences between operating modes pose a significant challenge to existing fault diagnosis methods. To address this problem, a novel fault diagnosis model named self-adaptive temporal-spatial attention network (TSA-SAN) is proposed. First, inter-mode mappings are constructed using healthy-category data to generate multimode samples. To enrich the diversity of the fault data, interpolation is performed between healthy and fault samples. Subsequently, the fault diagnosis model is trained using real and generated data. Self-adaptive instance normalization is established to suppress irrelevant information while retaining essential statistical features for diagnosis. In addition, a temporal-spatial attention mechanism is constructed to focus on key features, thus enhancing the generalization ability of the model. Extensive experiments demonstrate that the proposed model significantly outperforms state-of-the-art methods. The code will be available on GitHub at https://github.com/GuangqiangLi/TSA-SAN.
Authors: Oliver Schacht, Biwei Huang
Abstract: Kernel-based conditional independence (KCI) testing is a powerful nonparametric method commonly employed in causal discovery tasks. Despite its flexibility and statistical reliability, cubic computational complexity limits its application to large datasets. To address this computational bottleneck, we propose FastKCI, a scalable and parallelizable kernel-based conditional independence test that utilizes a mixture-of-experts approach inspired by embarrassingly parallel inference techniques for Gaussian processes. By partitioning the dataset based on a Gaussian mixture model over the conditioning variables, FastKCI conducts local KCI tests in parallel, aggregating the results using an importance-weighted sampling scheme. Experiments on synthetic datasets and benchmarks on real-world production data validate that FastKCI maintains the statistical power of the original KCI test while achieving substantial computational speedups. FastKCI thus represents a practical and efficient solution for conditional independence testing in causal inference on large-scale data.
Authors: Lang Feng, Jiahao Lin, Dong Xing, Li Zhang, De Ma, Gang Pan
Abstract: Population-population generalization is a challenging problem in multi-agent reinforcement learning (MARL), particularly when agents encounter unseen co-players. However, existing self-play-based methods are constrained by the limitation of inside-space generalization. In this study, we propose Bidirectional Distillation (BiDist), a novel mixed-play framework, to overcome this limitation in MARL. BiDist leverages knowledge distillation in two alternating directions: forward distillation, which emulates the historical policies' space and creates an implicit self-play, and reverse distillation, which systematically drives agents towards novel distributions outside the known policy space in a non-self-play manner. In addition, BiDist operates as a concise and efficient solution without the need for the complex and costly storage of past policies. We provide both theoretical analysis and empirical evidence to support BiDist's effectiveness. Our results highlight its remarkable generalization ability across a variety of cooperative, competitive, and social dilemma tasks, and reveal that BiDist significantly diversifies the policy distribution space. We also present comprehensive ablation studies to reinforce BiDist's effectiveness and key success factors. Source codes are available in the supplementary material.
Authors: Thanadej Rattanakornphan, Piyanon Charoenpoonpanich, Chainarong Amornbunchornvej
Abstract: Finding the most similar subsequences between two multidimensional time series has many applications, e.g., capturing dependencies in stock markets or discovering coordinated movement of baboons. Given a pattern occurring in one time series, one might wonder whether the same pattern occurs in another time series with some distortion and possibly a different length. Nevertheless, to the best of our knowledge, no efficient framework deals with this problem yet. In this work, we propose an algorithm that provides the exact solution for finding the most similar multidimensional subsequences between time series, where lengths may differ both between time series and between subsequences. The algorithm is built on theoretical guarantees of correctness and efficiency. Results on simulated datasets illustrate that our approach not only provides the correct solution but also requires only about a quarter of the running time of baseline approaches. On real-world datasets, it extracts the most similar subsequences even faster (up to 20 times faster than baseline methods) and provides insights into the stock market and the following relations in multidimensional time series of baboon movement. Our approach can be used for any time series. The code and datasets of this work are provided for public use.
Authors: Lin Zhu, Yijun Bian, Lei You
Abstract: Ensuring fairness in machine learning models is critical, particularly in high-stakes domains where biased decisions can lead to serious societal consequences. Existing preprocessing approaches generally lack transparent mechanisms for identifying which features or instances are responsible for unfairness, which obscures the rationale behind data modifications. We introduce FairSHAP, a novel preprocessing framework that leverages Shapley value attribution to improve both individual and group fairness. FairSHAP identifies fairness-critical instances in the training data using an interpretable measure of feature importance, and systematically modifies them through instance-level matching across sensitive groups. This process reduces discriminative risk, an individual fairness metric, while preserving data integrity and model accuracy. We demonstrate that FairSHAP significantly improves demographic parity and equality of opportunity across diverse tabular datasets, achieving fairness gains with minimal data perturbation and, in some cases, improved predictive performance. As a model-agnostic and transparent method, FairSHAP integrates seamlessly into existing machine learning pipelines and provides actionable insights into the sources of bias. Our code is available at https://github.com/youlei202/FairSHAP.
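A hedged sketch of the identification step follows, using the standard shap API; ranking instances by the attribution of the sensitive feature is a simplification of the paper's interpretable importance measure, not the released pipeline.

```python
import numpy as np
import shap

def fairness_critical_indices(model, X, sensitive_col, top_frac=0.1):
    """Flag instances whose prediction is most driven by the sensitive feature.
    Assumes a single-output model, so attribution values have shape (n, d)."""
    explainer = shap.Explainer(model, X)
    attributions = explainer(X).values
    influence = np.abs(attributions[:, sensitive_col])
    k = max(1, int(top_frac * len(X)))
    return np.argsort(influence)[-k:]
```

The flagged instances would then be matched and modified across sensitive groups in the preprocessing stage.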
Authors: Chenhong Zhou, Jie Chen, Zaifeng Yang, Ching Eng Png
Abstract: Physics-informed neural networks (PINNs) have emerged as a new learning paradigm for solving partial differential equations (PDEs) by enforcing the constraints of physical equations, boundary conditions (BCs), and initial conditions (ICs) in the loss function. Despite their successes, vanilla PINNs still suffer from poor accuracy and slow convergence due to the intractable multi-objective optimization issue. In this paper, we propose a novel Dual-Balanced PINN (DB-PINN), which dynamically adjusts loss weights by integrating inter-balancing and intra-balancing to alleviate two imbalance issues in PINNs. Inter-balancing aims to mitigate the gradient imbalance between the PDE residual loss and condition-fitting losses by determining an aggregated weight that offsets their gradient distribution discrepancies. Intra-balancing acts on condition-fitting losses to tackle the imbalance in fitting difficulty across diverse conditions. By evaluating fitting difficulty based on the loss records, intra-balancing allocates the aggregated weight proportionally to each condition loss according to its fitting difficulty level. We further introduce a robust weight update strategy to prevent abrupt spikes and arithmetic overflow in instantaneous weight values caused by large loss variances, enabling smooth weight updating and stable training. Extensive experiments demonstrate that DB-PINN achieves significantly superior performance to popular gradient-based weighting methods in terms of convergence speed and prediction accuracy. Our code and supplementary material are available at https://github.com/chenhong-zhou/DualBalanced-PINNs.
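A rough sketch of the weighting logic, paraphrased from the abstract (the exact formulas and the robust update rule are in the paper):

```python
import torch

def db_pinn_weights(pde_grad_norm, cond_grad_norms, cond_loss_hist, eps=1e-8):
    """Inter-balancing: aggregate weight offsetting the gradient gap between the
    PDE residual and condition losses. Intra-balancing: split that weight across
    conditions in proportion to fitting difficulty (here, running loss level)."""
    w_total = pde_grad_norm / (torch.stack(cond_grad_norms).mean() + eps)
    difficulty = torch.stack(cond_loss_hist)
    shares = difficulty / (difficulty.sum() + eps)
    return w_total * shares * len(cond_grad_norms)
```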
Authors: Enjun Du, Siyi Liu, Yongqi Zhang
Abstract: Foundation models have demonstrated remarkable capabilities across various domains, but developing analogous models for knowledge graphs presents unique challenges due to their dynamic nature and the need for cross-domain reasoning. To address these issues, we introduce GraphOracle, a relation-centric foundation model that unifies reasoning across knowledge graphs by converting them into Relation-Dependency Graphs (RDG), explicitly encoding compositional patterns with fewer edges than prior methods. A query-dependent attention mechanism is further developed to learn inductive representations for both relations and entities. Pre-training on diverse knowledge graphs, followed by minutes-level fine-tuning, enables effective generalization to unseen entities, relations, and entire graphs. Through comprehensive experiments on 31 diverse benchmarks spanning transductive, inductive, and cross-domain settings, we demonstrate consistent state-of-the-art performance with minimal adaptation, improving prediction performance by up to 35% compared to the strongest baselines.
Authors: Shokichi Takakura, Seng Pei Liew, Satoshi Hasegawa
Abstract: Federated learning is a distributed learning framework where clients collaboratively train a global model without sharing their raw data. FedAvg is a popular algorithm for federated learning, but it often suffers from slow convergence due to the heterogeneity of local datasets and anisotropy in the parameter space. In this work, we formalize the central server optimization procedure through the lens of mirror descent and propose a novel framework, called FedDuA, which adaptively selects the global learning rate based on both inter-client and coordinate-wise heterogeneity in the local updates. We prove that our proposed doubly adaptive step-size rule is minimax optimal and provide a convergence analysis for convex objectives. Although the proposed method does not require additional communication or computational cost on clients, extensive numerical experiments show that our proposed framework outperforms baselines in various settings and is robust to the choice of hyperparameters.
Authors: Simone Azeglio, Arianna Di Bernardo
Abstract: Recent advances in diffusion models have demonstrated their remarkable ability to capture complex image distributions, but the geometric properties of the learned data manifold remain poorly understood. We address this gap by introducing a score-based Riemannian metric that leverages the Stein score function from diffusion models to characterize the intrinsic geometry of the data manifold without requiring explicit parameterization. Our approach defines a metric tensor in the ambient space that stretches distances perpendicular to the manifold while preserving them along tangential directions, effectively creating a geometry where geodesics naturally follow the manifold's contours. We develop efficient algorithms for computing these geodesics and demonstrate their utility for both interpolation between data points and extrapolation beyond the observed data distribution. Through experiments on synthetic data with known geometry, Rotated MNIST, and complex natural images via Stable Diffusion, we show that our score-based geodesics capture meaningful transformations that respect the underlying data distribution. Our method consistently outperforms baseline approaches on perceptual metrics (LPIPS) and distribution-level metrics (FID, KID), producing smoother, more realistic image transitions. These results reveal the implicit geometric structure learned by diffusion models and provide a principled way to navigate the manifold of natural images through the lens of Riemannian geometry.
Authors: Feng Xiao, Xiaoying Tang, Jicong Fan
Abstract: Unsupervised anomaly detection is a critical task in many high-social-impact applications such as finance, healthcare, social media, and cybersecurity, where demographic attributes such as age, gender, race, and disease are used frequently. In these scenarios, bias in anomaly detection systems can lead to unfair treatment of different groups and even exacerbate social bias. In this work, we first thoroughly analyze the feasibility of, and necessary assumptions for, ensuring group fairness in unsupervised anomaly detection. Second, we propose a novel fairness-aware anomaly detection method, FairAD. From the normal training data, FairAD learns a projection that maps data of different demographic groups to a common target distribution that is simple and compact, and hence provides a reliable basis for estimating the density of the data. The density can be used directly to identify anomalies, while the common target distribution ensures fairness between different groups. Furthermore, we propose a threshold-free fairness metric that provides a global view of a model's fairness, eliminating dependence on manual threshold selection. Experiments on real-world benchmarks demonstrate that our method achieves an improved trade-off between detection accuracy and fairness under both balanced and skewed data across different groups.
Authors: Desong Zhang, Jia Hu, Geyong Min
Abstract: Spiking Neural Networks (SNNs) process information via discrete spikes, enabling them to operate at remarkably low energy levels. However, our experimental observations reveal a striking vulnerability when SNNs are trained using the mainstream method, direct encoding combined with backpropagation through time (BPTT): even a single backward pass on data drawn from a slightly different distribution can lead to catastrophic network collapse. Our theoretical analysis attributes this vulnerability to the repeated inputs inherent in direct encoding and the gradient accumulation characteristic of BPTT, which together produce an exceptionally large Hessian spectral radius. To address this challenge, we develop a hyperparameter-free method called Dominant Eigencomponent Projection (DEP). By orthogonally projecting gradients to precisely remove their dominant components, DEP effectively reduces the Hessian spectral radius, thereby preventing SNNs from settling into sharp minima. Extensive experiments demonstrate that DEP not only mitigates the vulnerability of SNNs to heterogeneous data poisoning, but also significantly enhances overall robustness compared to key baselines, providing strong support for safer and more reliable SNN deployment.
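The projection step can be sketched with power iteration over Hessian-vector products; treating the parameters as a single tensor and the fixed iteration count are simplifications, not the authors' exact estimator.

```python
import torch

def dominant_eigvec(loss, param, iters=10):
    """Power iteration for the dominant Hessian eigenvector via Hessian-vector
    products; `param` is a single parameter tensor for clarity."""
    g = torch.autograd.grad(loss, param, create_graph=True)[0]
    v = torch.randn_like(g)
    v /= v.norm()
    for _ in range(iters):
        hv = torch.autograd.grad((g * v).sum(), param, retain_graph=True)[0]
        v = hv / (hv.norm() + 1e-12)
    return v.detach()

def project_out(grad, v):
    """Remove the dominant eigencomponent from the gradient (orthogonal projection)."""
    return grad - (grad * v).sum() * v
```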
Authors: Om Roy, Yashar Moshfeghi, Keith Smith
Abstract: Graph neural networks have redefined how we model and predict on network data, but there is no consensus on how to choose the correct underlying graph structure on which to model signals. CoVariance Neural Networks (VNN) address this issue by using the sample covariance matrix as a Graph Shift Operator (GSO). Here, we improve on the performance of VNNs by constructing a density matrix, treating the sample covariance matrix as a quasi-Hamiltonian of the system in the space of random variables. Crucially, using this density matrix as the GSO allows components of the data to be extracted at different scales, enabling enhanced discriminability and performance. We show that this approach allows explicit control of the stability-discriminability trade-off of the network, provides enhanced robustness to noise compared to VNNs, and outperforms them in useful real-life applications where the underlying covariance matrix is informative. In particular, we show that our model can achieve strong performance in subject-independent Brain-Computer Interface EEG motor imagery classification, outperforming EEGNet while being faster. This shows how covariance density neural networks provide a basis for the notoriously difficult task of transferability of BCIs when evaluated on unseen individuals.
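One plausible construction of such a density-matrix GSO follows; the Gibbs-like form and the sign convention in the exponent are assumptions, with beta acting as the scale-selection parameter.

```python
import numpy as np
from scipy.linalg import expm

def density_matrix_gso(C, beta=1.0):
    """Gibbs-like density matrix from the sample covariance C (treated as a
    quasi-Hamiltonian); varying beta emphasizes components at different scales."""
    rho = expm(-beta * C)
    return rho / np.trace(rho)
```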
Authors: Ashok Arora, Neetesh Kumar
Abstract: In real-world reinforcement learning (RL) scenarios, agents often encounter partial observability, where incomplete or noisy information obscures the true state of the environment. Partially Observable Markov Decision Processes (POMDPs) are commonly used to model these environments, but effective performance requires memory mechanisms to utilise past observations. While recurrent networks have traditionally addressed this need, transformer-based models have recently shown improved sample efficiency in RL tasks. However, their application to POMDPs remains underdeveloped, and their real-world deployment is constrained by their high parameter count. This work introduces a novel bi-recurrent model architecture that improves sample efficiency and reduces the model parameter count in POMDP scenarios. The architecture replaces multiple feed-forward layers with a single layer of bi-directional recurrent units to better capture and utilize sequential dependencies and contextual information. This approach improves the model's ability to handle partial observability and increases sample efficiency, enabling effective learning from comparatively fewer interactions. To evaluate the performance of the proposed architecture, experiments were conducted on a total of 23 POMDP environments. The proposed model architecture outperforms existing transformer-based, attention-based, and recurrence-based methods by margins ranging from 87.39% to 482.04% on average across the 23 POMDP environments.
Authors: Boris Bonev, Max Rietmann, Andrea Paris, Alberto Carpentieri, Thorsten Kurth
Abstract: We introduce a generalized attention mechanism for spherical domains, enabling Transformer architectures to natively process data defined on the two-dimensional sphere - a critical need in fields such as atmospheric physics, cosmology, and robotics, where preserving spherical symmetries and topology is essential for physical accuracy. By integrating numerical quadrature weights into the attention mechanism, we obtain a geometrically faithful spherical attention that is approximately rotationally equivariant, providing strong inductive biases and leading to better performance than Cartesian approaches. To further enhance both scalability and model performance, we propose neighborhood attention on the sphere, which confines interactions to geodesic neighborhoods. This approach reduces computational complexity and introduces the additional inductive bias for locality, while retaining the symmetry properties of our method. We provide optimized CUDA kernels and memory-efficient implementations to ensure practical applicability. The method is validated on three diverse tasks: simulating shallow water equations on the rotating sphere, spherical image segmentation, and spherical depth estimation. Across all tasks, our spherical Transformers consistently outperform their planar counterparts, highlighting the advantage of geometric priors for learning on spherical domains.
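The core idea, folding quadrature weights into attention, can be sketched in a few lines; this dense version is an illustration that ignores the paper's neighborhood restriction and optimized CUDA kernels.

```python
import torch

def spherical_attention(q, k, v, w, scale):
    """q: (m, d); k, v: (n, d); w: (n,) quadrature weights of the grid points.
    Weighting the softmax kernel by w approximates the continuous integral
    over the sphere, which is what makes the operation geometrically faithful."""
    logits = (q @ k.T) * scale
    e = torch.exp(logits - logits.max(dim=-1, keepdim=True).values) * w
    return (e @ v) / e.sum(-1, keepdim=True)
```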
Authors: Haiqing Hao, Nikola Zubić, Weihua He, Zhipeng Sui, Davide Scaramuzza, Wenhui Wang
Abstract: Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML). While the recent asynchronous-to-synchronous (A2S) paradigm aims to bridge this gap by asynchronously encoding events into learned representations for ML pipelines, existing A2S approaches often sacrifice representation expressivity and generalizability compared to dense, synchronous methods. This paper introduces EVA (EVent Asynchronous representation learning), a novel A2S framework to generate highly expressive and generalizable event-by-event representations. Inspired by the analogy between events and language, EVA uniquely adapts advances from language modeling in linear attention and self-supervised learning for its construction. In demonstration, EVA outperforms prior A2S methods on recognition tasks (DVS128-Gesture and N-Cars), and represents the first A2S framework to successfully master demanding detection tasks, achieving a remarkable 47.7 mAP on the Gen1 dataset. These results underscore EVA's transformative potential for advancing real-time event-based vision applications.
Authors: Myeonghwan Ahn, Sungjoo Yoo
Abstract: The ever-growing scale of large language models (LLMs) is pushing for improved efficiency, favoring fully quantized training (FQT) over BF16. While FQT accelerates training, it faces consistency challenges and requires searching over an exponential number of cases, each needing over 200B tokens to ensure stability. Pseudo-quantization training (PQT) addresses the issues of FQT, although it is not well studied. We explore the practical implications of PQT in detail and propose a noise distribution $R$ that is floating-point (FP)-friendly, with ideal properties including stochastic precision annealing. As a result, the proposed method serves as an effective theoretical foundation for low-precision FP parameters through PQT, utilizing efficient fake quantization via an addition and subsequent FP casting. We demonstrate that Gaussian weight sampling is (1) scalable: it supports low-precision FP parameters down to FP6 and high-precision noise up to 9 bits with a BF16 operator. The proposed method is (2) efficient: it incurs computational overhead as low as 1.40% on the A100 GPU in terms of Llama2 training tokens per second and requires 2 bytes per parameter in GPU memory. We demonstrate that PQT with Gaussian weight sampling is (3) stable: closely following or even surpassing the performance of the BF16 baseline while pre-training GPT2 and Llama2 models with up to 1B parameters and 300B tokens.
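The fake-quantization step described, a noise addition followed by FP casting, might look like this sketch; the relative noise scale and using BF16 as a stand-in for an FP6-like target are assumptions for illustration.

```python
import torch

def gaussian_weight_sample(w_bf16, rel_noise_scale=2**-7):
    """Fake quantization as addition plus FP cast: perturb BF16 master weights
    with Gaussian noise scaled relative to the weight magnitude (scale rule
    assumed), then cast back. BF16 stands in for the low-precision FP target."""
    w32 = w_bf16.to(torch.float32)
    noise = torch.randn_like(w32) * rel_noise_scale * w32.abs()
    return (w32 + noise).to(torch.bfloat16)
```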
Authors: Francesco Madeddu, Lucia Testa, Gianluca De Carlo, Michele Pieroni, Andrea Mastropietro, Aris Anagnostopoulos, Paolo Tieri, Sergio Barbarossa
Abstract: The intrinsic complexity of human biology presents ongoing challenges to scientific understanding. Researchers collaborate across disciplines to expand our knowledge of the biological interactions that define human life. AI methodologies have emerged as powerful tools across scientific domains, particularly in computational biology, where graph data structures effectively model biological entities such as protein-protein interaction (PPI) networks and gene functional networks. Those networks are used as datasets for key network medicine tasks, such as gene-disease association prediction, drug repurposing, and polypharmacy side-effect studies. Reliable predictions from machine learning models require high-quality foundational data. In this work, we present a comprehensive multi-purpose biological knowledge graph constructed by integrating and refining multiple publicly available datasets. Building upon the Drug Repurposing Knowledge Graph (DRKG), we define a pipeline that a) cleans inconsistencies and redundancies present in DRKG, b) coalesces information from the main available public data sources, and c) enriches the graph nodes with expressive feature vectors such as molecular fingerprints and gene ontologies. Biologically and chemically relevant features improve the capacity of machine learning models to generate accurate and well-structured embedding spaces. The resulting resource represents a coherent and reliable biological knowledge graph that serves as a state-of-the-art platform to advance research in computational biology and precision medicine. Moreover, it offers the opportunity to benchmark graph-based machine learning and network medicine models on relevant tasks. We demonstrate the effectiveness of the proposed dataset by benchmarking it on drug repurposing, PPI prediction, and side-effect prediction, modeled as link prediction problems.
Authors: Zhenyi Zhang, Zihan Wang, Yuhao Sun, Tiejun Li, Peijie Zhou
Abstract: Modeling the dynamics from sparsely time-resolved snapshot data is crucial for understanding complex cellular processes and behavior. Existing methods leverage optimal transport, Schrödinger bridge theory, or their variants to simultaneously infer stochastic, unbalanced dynamics from snapshot data. However, these approaches remain limited in their ability to account for cell-cell interactions. This integration is essential in real-world scenarios since intercellular communications are fundamental life processes and can influence cell state-transition dynamics. To address this challenge, we formulate the Unbalanced Mean-Field Schrödinger Bridge (UMFSB) framework to model unbalanced stochastic interaction dynamics from snapshot data. Inspired by this framework, we further propose CytoBridge, a deep learning algorithm designed to approximate the UMFSB problem. By explicitly modeling cellular transitions, proliferation, and interactions through neural networks, CytoBridge offers the flexibility to learn these processes directly from data. The effectiveness of our method has been extensively validated using both synthetic gene regulatory data and real scRNA-seq datasets. Compared to existing methods, CytoBridge identifies growth, transition, and interaction patterns, eliminates false transitions, and reconstructs the developmental landscape with greater accuracy.
Authors: Hangyu Zhou, Aaron Gokaslan, Volodymyr Kuleshov, Bharath Hariharan
Abstract: From a multi-model compression perspective, model merging enables memory-efficient serving of multiple models fine-tuned from the same base, but suffers from degraded performance due to interference among their task-specific parameter adjustments (i.e., deltas). In this paper, we reformulate model merging as a compress-and-retrieve scheme, revealing that the task interference arises from the summation of irrelevant deltas during model retrieval. To address this issue, we use random orthogonal transformations to decorrelate these vectors into self-cancellation. We show that this approach drastically reduces interference, improving performance across both vision and language tasks. Since these transformations are fully defined by random seeds, adding new models requires no extra memory. Further, their data- and model-agnostic nature enables easy addition or removal of models with minimal compute overhead, supporting efficient and flexible multi-model serving.
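A compact sketch of the seed-defined decorrelation idea follows (illustrative, not the authors' code): each task delta is rotated by a seeded random orthogonal map before summation, and retrieval applies the transpose, so the other tasks' rotated deltas roughly self-cancel and only seeds need storing.

```python
import torch

def random_orthogonal(d, seed):
    """Seeded random orthogonal matrix via QR; sign-fixing makes Q unique per seed."""
    g = torch.Generator().manual_seed(seed)
    A = torch.randn(d, d, generator=g)
    Q, R = torch.linalg.qr(A)
    return Q * torch.sign(torch.diagonal(R))

def merge(deltas, seeds):
    """Store one merged tensor; each task delta is decorrelated by its own map."""
    return sum(random_orthogonal(d.shape[0], s) @ d for d, s in zip(deltas, seeds))

def retrieve(merged, seed):
    """Invert the task's map; interference from other deltas averages toward zero."""
    return random_orthogonal(merged.shape[0], seed).T @ merged
```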
Authors: Anders Gjølbye, Stefan Haufe, Lars Kai Hansen
Abstract: Suppressor variables can influence model predictions without being dependent on the target outcome and they pose a significant challenge for Explainable AI (XAI) methods. These variables may cause false-positive feature attributions, undermining the utility of explanations. Although effective remedies exist for linear models, their extension to non-linear models and to instance-based explanations has remained limited. We introduce PatternLocal, a novel XAI technique that addresses this gap. PatternLocal begins with a locally linear surrogate, e.g. LIME, KernelSHAP, or gradient-based methods, and transforms the resulting discriminative model weights into a generative representation, thereby suppressing the influence of suppressor variables while preserving local fidelity. In extensive hyperparameter optimization on the XAI-TRIS benchmark, PatternLocal consistently outperformed other XAI methods and reduced false-positive attributions when explaining non-linear tasks, thereby enabling more reliable and actionable insights.
Authors: Francisco Madaleno, Pernille Julie Viuff Sand, Francisco C. Pereira, Sergio Hernan Garrido Mejia
Abstract: We propose Bayesian Hierarchical Invariant Prediction (BHIP) reframing Invariant Causal Prediction (ICP) through the lens of Hierarchical Bayes. We leverage the hierarchical structure to explicitly test invariance of causal mechanisms under heterogeneous data, resulting in improved computational scalability for a larger number of predictors compared to ICP. Moreover, given its Bayesian nature BHIP enables the use of prior information. In this paper, we test two sparsity inducing priors: horseshoe and spike-and-slab, both of which allow us a more reliable identification of causal features. We test BHIP in synthetic and real-world data showing its potential as an alternative inference method to ICP.
Authors: Donghoon Lee, Tung M. Luu, Younghwan Lee, Chang D. Yoo
Abstract: Recent research highlights the potential of multimodal foundation models in tackling complex decision-making challenges. However, their large parameters make real-world deployment resource-intensive and often impractical for constrained systems. Reinforcement learning (RL) shows promise for task-specific agents but suffers from high sample complexity, limiting practical applications. To address these challenges, we introduce LVLM to Policy (LVLM2P), a novel framework that distills knowledge from large vision-language models (LVLM) into more efficient RL agents. Our approach leverages the LVLM as a teacher, providing instructional actions based on trajectories collected by the RL agent, which helps reduce less meaningful exploration in the early stages of learning, thereby significantly accelerating the agent's learning progress. Additionally, by leveraging the LVLM to suggest actions directly from visual observations, we eliminate the need for manual textual descriptors of the environment, enhancing applicability across diverse tasks. Experiments show that LVLM2P significantly enhances the sample efficiency of baseline RL algorithms.
Authors: Oskar Bohn Lassen, Serio Agriesti, Mohamed Eldafrawi, Daniele Gammelli, Guido Cantelmo, Guido Gentile, Francisco Camara Pereira
Abstract: The Traffic Assignment Problem is a fundamental, yet computationally expensive, task in transportation modeling, especially for large-scale networks. Traditional methods require iterative simulations to reach equilibrium, making real-time or large-scale scenario analysis challenging. In this paper, we propose a learning-based approach using Message-Passing Neural Networks as a metamodel to approximate the equilibrium flow of the Stochastic User Equilibrium assignment. Our model is designed to mimic the algorithmic structure used in conventional traffic simulators, allowing it to better capture the underlying process rather than just the data. We benchmark it against other conventional deep learning techniques and evaluate the model's robustness by testing its ability to predict traffic flows on input data outside the domain on which it was trained. This approach offers a promising solution for accelerating out-of-distribution scenario assessments, reducing computational costs in large-scale transportation planning, and enabling real-time decision-making.
Authors: Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang
Abstract: Driven by the relentless growth in model parameters, which renders full fine-tuning prohibitively expensive for large-scale deployment, parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for rapidly adapting large models to a wide range of downstream tasks. Among the PEFT family, orthogonal fine-tuning and its variants have demonstrated remarkable performance by preserving hyperspherical energy, which encodes pairwise angular similarity between neurons. However, these methods are inherently memory-inefficient due to the need to store intermediate activations from multiple full-dimensional sparse matrices. To address this limitation, we propose Memory-efficient Orthogonal Fine-Tuning (MOFT) with principal subspace adaptation. Specifically, we first establish a theoretical condition under which orthogonal transformations within a low-rank subspace preserve hyperspherical energy. Based on this insight, we constrain orthogonal fine-tuning to the principal subspace defined by the top-r components obtained through singular value decomposition and impose an additional constraint on the projection matrix to satisfy the preservation condition. To enhance MOFT's flexibility across tasks, we relax strict orthogonality by introducing two learnable scaling vectors. Extensive experiments on 37 diverse tasks and four models across NLP and CV demonstrate that MOFT consistently outperforms key baselines while significantly reducing the memory footprint of orthogonal fine-tuning.
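The subspace construction can be sketched in a few lines: take the top-r SVD factors of a frozen weight and learn an orthogonal rotation of that principal subspace via a Cayley parametrization. The layer below is our illustration under these assumptions; MOFT's exact preservation constraint and learnable scaling vectors are not reproduced here:

```python
import torch

def cayley(S):
    """Map a (learnable) matrix to an orthogonal one via its skew part:
    R = (I + A)^{-1} (I - A) with A = S - S^T."""
    A = S - S.T
    I = torch.eye(A.shape[0], device=A.device)
    return torch.linalg.solve(I + A, I - A)

class PrincipalSubspaceOrthogonalFT(torch.nn.Module):
    """Hypothetical sketch: rotate a frozen weight's top-r principal
    subspace with a learnable orthogonal map."""
    def __init__(self, W, r):
        super().__init__()
        U, s, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("U", U[:, :r])                    # top-r basis
        self.register_buffer("core", torch.diag(s[:r]) @ Vh[:r])
        self.register_buffer("residual", W - self.U @ self.core)
        self.S = torch.nn.Parameter(torch.zeros(r, r))         # r*r params only

    def forward(self, x):
        W_adapted = self.U @ cayley(self.S) @ self.core + self.residual
        return x @ W_adapted.T

layer = PrincipalSubspaceOrthogonalFT(torch.randn(64, 32), r=8)
print(layer(torch.randn(4, 32)).shape)  # torch.Size([4, 64])
```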
Authors: Wilson Wongso, Hao Xue, Flora D. Salim
Abstract: Understanding human mobility through Point-of-Interest (POI) recommendation is increasingly important for applications such as urban planning, personalized services, and generative agent simulation. However, progress in this field is hindered by two key challenges: the over-reliance on older datasets from 2012-2013 and the lack of reproducible, city-level check-in datasets that reflect diverse global regions. To address these gaps, we present Massive-STEPS (Massive Semantic Trajectories for Understanding POI Check-ins), a large-scale, publicly available benchmark dataset built upon the Semantic Trails dataset and enriched with semantic POI metadata. Massive-STEPS spans 12 geographically and culturally diverse cities and features more recent (2017-2018) and longer-duration (24 months) check-in data than prior datasets. We benchmarked a wide range of POI recommendation models on Massive-STEPS using both supervised and zero-shot approaches, and evaluated their performance across multiple urban contexts. By releasing Massive-STEPS, we aim to facilitate reproducible and equitable research in human mobility and POI recommendation. The dataset and benchmarking code are available at: https://github.com/cruiseresearchgroup/Massive-STEPS
Authors: Elliot L. Epstein, Apaar Sadhwani, Kay Giesecke
Abstract: In many financial prediction problems, the behavior of individual units (such as loans, bonds, or stocks) is influenced by observable unit-level factors and macroeconomic variables, as well as by latent cross-sectional effects. Traditional approaches attempt to capture these latent effects via handcrafted summary features. We propose a Set-Sequence model that eliminates the need for handcrafted features. The Set model first learns a shared cross-sectional summary at each period. The Sequence model then ingests the summary-augmented time series for each unit independently to predict its outcome. Both components are learned jointly over arbitrary sets sampled during training. Our approach harnesses the set nature of the cross-section and is computationally efficient, generating set summaries in linear time relative to the number of units. It is also flexible, allowing the use of existing sequence models and accommodating a variable number of units at inference. Empirical evaluations demonstrate that our Set-Sequence model significantly outperforms benchmarks on stock return prediction and mortgage behavior tasks. Code will be released.
Authors: Xvyuan Liu, Xiangfei Qiu, Xingjian Wu, Zhengyu Li, Chenjuan Guo, Jilin Hu, Bin Yang
Abstract: The forecasting of irregular multivariate time series (IMTS) is crucial in key areas such as healthcare, biomechanics, climate science, and astronomy. However, achieving accurate and practical predictions is challenging due to two main factors. First, the inherent irregularity and data missingness in irregular time series make modeling difficult. Second, most existing methods are typically complex and resource-intensive. In this study, we propose a general framework called APN to address these challenges. Specifically, we design a novel Time-Aware Patch Aggregation (TAPA) module that achieves adaptive patching. By learning dynamically adjustable patch boundaries and a time-aware weighted averaging strategy, TAPA transforms the original irregular sequences into high-quality, regularized representations in a channel-independent manner. Additionally, we use a simple query module to effectively integrate historical information while maintaining the model's efficiency. Finally, predictions are made by a shallow MLP. Experimental results on multiple real-world datasets show that APN outperforms existing state-of-the-art methods in both efficiency and accuracy.
Authors: Jeffrey Willette, Heejun Lee, Sung Ju Hwang
Abstract: The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36-percentage-point performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens, while only adding a small overhead. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
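One simple way to implement such a correction is per-channel moment matching between sparse and full attention outputs on a small calibration batch. This is an assumption-laden stand-in to convey the idea, not necessarily the paper's exact procedure:

```python
import torch

def fit_shift_correction(sparse_out, full_out):
    """Calibration step: fit per-channel affine parameters that map the
    sparse-attention output distribution onto the full-attention one."""
    mu_s, sd_s = sparse_out.mean(0), sparse_out.std(0) + 1e-6
    mu_f, sd_f = full_out.mean(0), full_out.std(0) + 1e-6
    scale = sd_f / sd_s
    bias = mu_f - scale * mu_s
    return scale, bias

# Calibration: run both attention variants once on a sample batch.
sparse_out = torch.randn(1024, 128) * 0.7 + 0.2   # stand-ins for real outputs
full_out = torch.randn(1024, 128)
scale, bias = fit_shift_correction(sparse_out, full_out)

# At inference, apply the cheap affine correction to every sparse output.
corrected = sparse_out * scale + bias
print(corrected.mean().item(), corrected.std().item())  # ~ full-attention moments
```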
Authors: Jingyang Li, Jiuqian Shang, Yang Chen
Abstract: Tensor completion is crucial in many scientific domains with missing data problems. Traditional low-rank tensor models, including CP, Tucker, and Tensor-Train, exploit low-dimensional structures to recover missing data. However, these methods often treat all tensor modes symmetrically, failing to capture the unique spatiotemporal patterns inherent in scientific data, where the temporal component exhibits both low-frequency stability and high-frequency variations. To address this, we propose a novel model, Fourier Low-rank and Sparse Tensor (FLoST), which decomposes the tensor along the temporal dimension using a Fourier transform. This approach captures low-frequency components with low-rank matrices and high-frequency fluctuations with sparsity, resulting in a hybrid structure that efficiently models both smooth and localized variations. Compared to the well-known tubal-rank model, which assumes low-rankness across all frequency components, FLoST requires significantly fewer parameters, making it computationally more efficient, particularly when the time dimension is large. Through theoretical analysis and empirical experiments, we demonstrate that FLoST outperforms existing tensor completion models in terms of both accuracy and computational efficiency, offering a more interpretable solution for spatiotemporal data reconstruction.
Authors: Shengjia Chang, Xianshuo Yue
Abstract: This study proposes a dynamically weighted ARIMA-RF-HW hybrid model integrating ARIMA for seasonality and trends, Random Forest for nonlinear features, and Holt-Winters smoothing for seasonal adjustment to improve China's pet population forecasting accuracy. Using 2005-2023 data with nine economic, social, and policy indicators (urban income, consumption, aging ratio, policy quantity, new veterinary drug approvals), data were preprocessed via Z-score normalization and missing value imputation. The results show that key drivers of pet populations include urban income (19.48% for cats, 17.15% for dogs), consumption (17.99% for cats), and policy quantity (13.33% for cats, 14.02% for dogs), with aging (12.81% for cats, 13.27% for dogs) and urbanization amplifying the demand for pets. Forecasts show steady cat growth and fluctuating dog numbers, reflecting cats' adaptability to urban environments. This research supports policymakers in optimizing pet health management and guides enterprises in developing differentiated services, advancing sustainable industry growth.
Authors: Francesco Marchetti, Edoardo Legnaro, Sabrina Guastavino
Abstract: In this paper, we introduce a threshold-based framework for multiclass classification that generalizes the standard argmax rule. This is done by replacing the probabilistic interpretation of softmax outputs with a geometric one on the multidimensional simplex, where the classification depends on a multidimensional threshold. This change of perspective enables, for any trained classification network, an a posteriori optimization of the classification score by means of threshold tuning, as is usually carried out in the binary setting. This allows a further refinement of the prediction capability of any network. Moreover, this multidimensional threshold-based setting makes it possible to define score-oriented losses, which are based on the interpretation of the threshold as a random variable. Our experiments show that the multidimensional threshold tuning yields consistent performance improvements across various networks and datasets, and that the proposed multiclass score-oriented losses are competitive with standard loss functions, resembling the advantages observed in the binary case.
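A minimal sketch of the generalized decision rule: classify relative to a tunable threshold vector on the simplex rather than its uniform center (illustrative only; the paper's geometric construction is richer):

```python
import numpy as np

def thresholded_decision(probs, tau):
    """Generalized argmax: decide relative to a threshold vector tau on the
    simplex (nonnegative, summing to 1). With tau uniform, this reduces to
    the standard argmax rule."""
    return np.argmax(probs - tau, axis=-1)

probs = np.array([[0.40, 0.35, 0.25],
                  [0.20, 0.45, 0.35]])

uniform = np.full(3, 1 / 3)
print(thresholded_decision(probs, uniform))   # plain argmax: [0, 1]

# Tuning tau a posteriori shifts decisions toward class 2, e.g., to boost
# recall on a rare class, mirroring binary threshold tuning.
tau = np.array([0.45, 0.40, 0.15])
print(thresholded_decision(probs, tau))       # [2, 2]
```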
Authors: Tom Siegl, Kutalmış Coşkun, Bjarne Hiller, Amin Mirzaei, Florian Lemmerich, Martin Becker
Abstract: Machine learning (ML) is increasingly employed in real-world applications like medicine or economics, thus, potentially affecting large populations. However, ML models often do not perform homogeneously across such populations resulting in subgroups of the population (e.g., sex=female AND marital_status=married) where the model underperforms or, conversely, is particularly accurate. Identifying and describing such subgroups can support practical decisions on which subpopulation a model is safe to deploy or where more training data is required. The potential of identifying and analyzing such subgroups has been recognized, however, an efficient and coherent framework for effective search is missing. Consequently, we introduce SubROC, an open-source, easy-to-use framework based on Exceptional Model Mining for reliably and efficiently finding strengths and weaknesses of classification models in the form of interpretable population subgroups. SubROC incorporates common evaluation measures (ROC and PR AUC), efficient search space pruning for fast exhaustive subgroup search, control for class imbalance, adjustment for redundant patterns, and significance testing. We illustrate the practical benefits of SubROC in case studies as well as in comparative analyses across multiple datasets.
Authors: Juan D. Guerra (Polytechnique Montréal, Mila - Quebec Artificial Intelligence Institute), Thomas Garbay (Polytechnique Montréal, Mila - Quebec Artificial Intelligence Institute), Guillaume Lajoie (Université de Montréal, Mila - Quebec Artificial Intelligence Institute), Marco Bonizzato (Polytechnique Montréal, Mila - Quebec Artificial Intelligence Institute)
Abstract: Hierarchical Gaussian Process (H-GP) models divide problems into different subtasks, allowing for different models to address each part, making them well-suited for problems with inherent hierarchical structure. However, typical H-GP models do not fully take advantage of this structure, only sending information up or down the hierarchy. This one-way coupling limits sample efficiency and slows convergence. We propose Bidirectional Information Flow (BIF), an efficient H-GP framework that establishes bidirectional information exchange between parent and child models in H-GPs for online training. BIF retains the modular structure of hierarchical models, where the parent combines subtask knowledge from child GPs, while introducing top-down feedback to continually refine child models during online learning. This mutual exchange improves sample efficiency, enables robust training, and allows modular reuse of learned subtask models. BIF outperforms conventional H-GP Bayesian Optimization methods, achieving up to 85% and 5x higher $R^2$ scores for the parent and child models, respectively, on synthetic and real-world neurostimulation optimization tasks.
Authors: Sohir Maskey, Raffaele Paolino, Fabian Jogl, Gitta Kutyniok, Johannes F. Lutzeyer
Abstract: Graph Neural Networks (GNNs) are powerful tools for learning on structured data, yet the relationship between their expressivity and predictive performance remains unclear. We introduce a family of premetrics that capture different degrees of structural similarity between graphs and relate these similarities to generalization, and consequently, the performance of expressive GNNs. By considering a setting where graph labels are correlated with structural features, we derive generalization bounds that depend on the distance between training and test graphs, model complexity, and training set size. These bounds reveal that more expressive GNNs may generalize worse unless their increased complexity is balanced by a sufficiently large training set or reduced distance between training and test graphs. Our findings relate expressivity and generalization, offering theoretical insights supported by empirical results.
Authors: Shudi Weng, Chao Ren, Ming Xiao, Mikael Skoglund
Abstract: Federated learning (FL) commonly involves clients with diverse communication and computational capabilities. Such heterogeneity can significantly distort the optimization dynamics and lead to objective inconsistency, where the global model converges to an incorrect stationary point potentially far from the pursued optimum. Despite its critical impact, the joint effect of communication and computation heterogeneity has remained largely unexplored, due to the intrinsic complexity of their interaction. In this paper, we reveal the fundamentally distinct mechanisms through which heterogeneous communication and computation drive inconsistency in FL. To the best of our knowledge, this is the first unified theoretical analysis of general heterogeneous FL, offering a principled understanding of how these two forms of heterogeneity jointly distort the optimization trajectory under arbitrary choices of local solvers. Motivated by these insights, we propose Federated Heterogeneity-Aware Client Sampling, FedACS, a universal method to eliminate all types of objective inconsistency. We theoretically prove that FedACS converges to the correct optimum at a rate of $O(1/\sqrt{R})$, even in dynamic heterogeneous environments. Extensive experiments across multiple datasets show that FedACS outperforms state-of-the-art and category-specific baselines by 4.3%-36%, while reducing communication costs by 22%-89% and computation loads by 14%-105%, respectively.
Authors: Xinyan Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu
Abstract: We propose the Fourier Adaptive Lite Diffusion Architecture (FALDA), a novel probabilistic framework for time series forecasting. First, we introduce the Diffusion Model for Residual Regression (DMRR) framework, which unifies diffusion-based probabilistic regression methods. Within this framework, FALDA leverages Fourier-based decomposition to incorporate a component-specific architecture, enabling tailored modeling of individual temporal components. A conditional diffusion model is utilized to estimate the future noise term, while our proposed lightweight denoiser, DEMA (Decomposition MLP with AdaLN), conditions on the historical noise term to enhance denoising performance. Through mathematical analysis and empirical validation, we demonstrate that FALDA effectively reduces epistemic uncertainty, allowing probabilistic learning to primarily focus on aleatoric uncertainty. Experiments on six real-world benchmarks demonstrate that FALDA consistently outperforms existing probabilistic forecasting approaches across most datasets for long-term time series forecasting while achieving enhanced computational efficiency without compromising accuracy. Notably, FALDA also achieves superior overall performance compared to state-of-the-art (SOTA) point forecasting approaches, with improvements of up to 9%.
Authors: Elsa Rizk, Kun Yuan, Ali H. Sayed
Abstract: Diffusion learning is a framework that endows edge devices with advanced intelligence. By processing and analyzing data locally and allowing each agent to communicate with its immediate neighbors, diffusion effectively protects the privacy of edge devices, enables real-time response, and reduces reliance on central servers. However, traditional diffusion learning relies on communication at every iteration, leading to communication overhead, especially with large learning models. Furthermore, the inherent volatility of edge devices, stemming from power outages or signal loss, poses challenges to reliable communication between neighboring agents. To mitigate these issues, this paper investigates an enhanced diffusion learning approach incorporating local updates and partial agent participation. Local updates will curtail communication frequency, while partial agent participation will allow for the inclusion of agents based on their availability. We prove that the resulting algorithm is stable in the mean-square error sense and provide a tight analysis of its Mean-Square-Deviation (MSD) performance. Various numerical experiments are conducted to illustrate our theoretical findings.
Authors: Lothar Heimbach, Sebastian Kaltenbach, Petr Karnakov, Francis J. Alexander, Petros Koumoutsakos
Abstract: Partial Differential Equations (PDEs) describe phenomena ranging from turbulence and epidemics to quantum mechanics and financial markets. Despite recent advances in computational science, solving such PDEs for real-world applications remains prohibitively expensive because of the necessity of resolving a broad range of spatiotemporal scales. In turn, practitioners often rely on coarse-grained approximations of the original PDEs, trading off accuracy for reduced computational resources. To mitigate the loss of detail inherent in such approximations, closure models are employed to represent unresolved spatiotemporal interactions. We present a framework for developing closure models for PDEs using synthetic data acquired through the method of manufactured solutions. These data are used in conjunction with reinforcement learning to provide closures for coarse-grained PDEs. We illustrate the efficacy of our method using the one-dimensional and two-dimensional Burgers' equations and the two-dimensional advection equation. Moreover, we demonstrate that closure models trained for inhomogeneous PDEs can be effectively generalized to homogeneous PDEs. The results demonstrate the potential for developing accurate and computationally efficient closure models for systems with scarce data.
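To make the data-generation step concrete, here is a small sketch of the method of manufactured solutions for the 1D viscous Burgers' equation: choose a closed-form field, substitute it into the PDE, and keep the residual as a forcing term under which the chosen field is an exact solution. The manufactured solution here is our example, not the paper's:

```python
import sympy as sp

x, t, nu = sp.symbols("x t nu")

# Manufactured solution: any smooth field we designate as the exact answer.
u_star = sp.sin(sp.pi * x) * sp.exp(-t)

# 1D viscous Burgers': u_t + u u_x - nu u_xx = f. Substituting u* yields
# the forcing f that makes u* an exact solution of the forced PDE.
f = sp.diff(u_star, t) + u_star * sp.diff(u_star, x) - nu * sp.diff(u_star, x, 2)
print(sp.simplify(f))

# A solver run with this forcing has the known exact solution u*, providing
# unlimited synthetic training data for the closure model.
```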
Authors: Emanuele Francazi, Francesco Pinto, Aurelien Lucchi, Marco Baity-Jesi
Abstract: Normalization layers, such as Batch Normalization and Layer Normalization, are central components in modern neural networks, widely adopted to improve training stability and generalization. While their practical effectiveness is well documented, a detailed theoretical understanding of how normalization affects model behavior, starting from initialization, remains an important open question. In this work, we investigate how both the presence and placement of normalization within hidden layers influence the statistical properties of network predictions before training begins. In particular, we study how these choices shape the distribution of class predictions at initialization, which can range from unbiased (Neutral) to highly concentrated (Prejudiced) toward a subset of classes. Our analysis shows that normalization placement induces systematic differences in the initial prediction behavior of neural networks, which in turn shape the dynamics of learning. By linking architectural choices to prediction statistics at initialization, our work provides a principled understanding of how normalization can influence early training behavior and offers guidance for more controlled and interpretable network design.
Authors: Pu Yang, J. A. Barria
Abstract: In this paper, an unsupervised Recurrent Wavelet Probabilistic Neural Network (RWPNN) is proposed, which aims at detecting anomalies in non-stationary environments by modelling the temporal features using a nonparametric density estimation network. The novel framework consists of two components, a Stacked Recurrent Encoder-Decoder (SREnc-Dec) module that captures temporal features in a latent space, and a Multi-Receptive-field Wavelet Probabilistic Network (MRWPN) that creates an ensemble probabilistic model to characterise the latent space. This formulation extends the standard wavelet probabilistic networks to wavelet deep probabilistic networks, which can handle higher data dimensionality. The MRWPN module can adapt to different rates of data variation in different datasets without imposing strong distribution assumptions, resulting in a more robust and accurate detection for Time Series Anomaly Detection (TSAD) tasks in the non-stationary environment. We carry out the assessment on 45 real-world time series datasets from various domains, verify the performance of RWPNN in TSAD tasks with several constraints, and show its ability to provide early warnings for anomalous events.
Authors: Jincheng Huang, Jie Xu, Xiaoshuang Shi, Ping Hu, Lei Feng, Xiaofeng Zhu
Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness on graph-based tasks. However, their predictive confidence is often miscalibrated, typically exhibiting under-confidence, which harms the reliability of their decisions. Existing calibration methods for GNNs normally introduce additional calibration components, which fail to capture the intrinsic relationship between the model and the prediction confidence, resulting in limited theoretical guarantees and increased computational overhead. To address this issue, we propose a simple yet efficient graph calibration method. We establish a unified theoretical framework revealing that model confidence is jointly governed by class-centroid-level and node-level calibration at the final layer. Based on this insight, we theoretically show that reducing the weight decay of the final-layer parameters alleviates GNN under-confidence by acting on the class-centroid level, while node-level calibration acts as a finer-grained complement to class-centroid level calibration, which encourages each test node to be closer to its predicted class centroid at the final-layer representations. Extensive experiments validate the superiority of our method.
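The final-layer intervention maps directly onto standard optimizer parameter groups; a sketch in PyTorch, with a plain MLP standing in for the GNN and hypothetical decay values:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 4),                     # final (classification) layer
)
final = model[-1]
body = [p for m in model[:-1] for p in m.parameters()]

# Lower weight decay on the final layer only: per the paper's analysis, this
# acts at the class-centroid level and counters GNN under-confidence.
optimizer = torch.optim.AdamW([
    {"params": body, "weight_decay": 5e-4},
    {"params": final.parameters(), "weight_decay": 1e-5},
], lr=1e-3)
```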
Authors: Andrew W. Rosemberg, Joaquim Dias Garcia, Russell Bent, Pascal Van Hentenryck
Abstract: Optimization proxies - machine learning models trained to approximate the solution mapping of parametric optimization problems in a single forward pass - offer dramatic reductions in inference time compared to traditional iterative solvers. This work investigates the integration of solver sensitivities into such end-to-end proxies via a Sobolev training paradigm and does so in two distinct settings: (i) fully supervised proxies, where exact solver outputs and sensitivities are available, and (ii) self-supervised proxies that rely only on the objective and constraint structure of the underlying optimization problem. By augmenting the standard training loss with directional derivative information extracted from the solver, the proxy aligns both its predicted solutions and local derivatives with those of the optimizer. Under Lipschitz continuity assumptions on the true solution mapping, matching first-order sensitivities is shown to yield uniform approximation error proportional to the training set covering radius. Empirically, different impacts are observed in each studied setting. On three large Alternating Current Optimal Power Flow benchmarks, supervised Sobolev training cuts mean squared error by up to 56 percent and the median worst-case constraint violation by up to 400 percent while keeping the optimality gap below 0.22 percent. For a mean-variance portfolio task trained without labeled solutions, self-supervised Sobolev training halves the average optimality gap in the medium-risk region (standard deviation above 10 percent of budget) and matches the baseline elsewhere. Together, these results highlight Sobolev training, whether supervised or self-supervised, as a path to fast, reliable surrogates for safety-critical, large-scale optimization workloads.
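The supervised variant's loss modification is compact; a sketch where solver solutions and their parameter sensitivities (Jacobians) are both available, with hypothetical shapes and weighting:

```python
import torch

def sobolev_loss(model, params, y_star, J_star, lam=1.0):
    """Match solver outputs and their parameter sensitivities.
    params: (batch, p) problem parameters; y_star: (batch, m) solver
    solutions; J_star: (batch, m, p) solver-provided dy*/dparams."""
    params = params.requires_grad_(True)
    y_hat = model(params)
    # Jacobian of the proxy via autograd, one output coordinate at a time
    # (rows are independent across the batch for a per-sample model).
    J_hat = torch.stack([
        torch.autograd.grad(y_hat[:, i].sum(), params, create_graph=True)[0]
        for i in range(y_hat.shape[1])
    ], dim=1)
    fit = torch.mean((y_hat - y_star) ** 2)       # standard regression term
    sens = torch.mean((J_hat - J_star) ** 2)      # first-order Sobolev term
    return fit + lam * sens

model = torch.nn.Sequential(
    torch.nn.Linear(3, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
loss = sobolev_loss(model, torch.randn(8, 3), torch.randn(8, 2),
                    torch.randn(8, 2, 3))
loss.backward()
```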
Authors: Andreas Roth, Thomas Liebig
Abstract: Most graph neural networks (GNNs) utilize approximations of the general graph convolution derived in the graph Fourier domain. While GNNs are typically applied in the multi-input multi-output (MIMO) case, the approximations are performed in the single-input single-output (SISO) case. In this work, we first derive the MIMO graph convolution through the convolution theorem and approximate it directly in the MIMO case. We find the key MIMO-specific property of the graph convolution to be operating on multiple computational graphs, or equivalently, applying distinct feature transformations for each pair of nodes. As a localized approximation, we introduce localized MIMO graph convolutions (LMGCs), which generalize many linear message-passing neural networks. For almost every choice of edge weights, we prove that LMGCs with a single computational graph are injective on multisets, and the resulting representations are linearly independent when more than one computational graph is used. Our experimental results confirm that an LMGC can combine the benefits of various methods.
Authors: Johannes Schwab, Bryan Kelly, Semyon Malamud, Teng Andrea Xu
Abstract: The performance of the data-dependent neural tangent kernel (NTK; Jacot et al. (2018)) associated with a trained deep neural network (DNN) often matches or exceeds that of the full network. This implies that DNN training via gradient descent implicitly performs kernel learning by optimizing the NTK. In this paper, we propose instead to optimize the NTK explicitly. Rather than minimizing empirical risk, we train the NTK to minimize its generalization error using the recently developed Kernel Alignment Risk Estimator (KARE; Jacot et al. (2020)). Our simulations and real data experiments show that NTKs trained with KARE consistently match or significantly outperform the original DNN and the DNN-induced NTK (the after-kernel). These results suggest that explicitly trained kernels can outperform traditional end-to-end DNN optimization in certain settings, challenging the conventional dominance of DNNs. We argue that explicit training of NTK is a form of over-parametrized feature learning.
Authors: Yuanzhao Zhang, William Gilpin
Abstract: Recently-developed time series foundation models for scientific machine learning exhibit emergent abilities to predict physical systems. These abilities include zero-shot forecasting, in which a model forecasts future states of a system given only a short trajectory as context. Here, we show that foundation models applied to physical systems can give accurate predictions, but that they fail to develop meaningful representations of the underlying physics. Instead, foundation models often forecast by context parroting, a simple zero-shot forecasting strategy that copies directly from the context. As a result, a naive direct context parroting model scores higher than state-of-the-art time-series foundation models on predicting a diverse range of dynamical systems, at a tiny fraction of the computational cost. We draw a parallel between context parroting and induction heads, which explains why large language models trained on text can be repurposed for time series forecasting. Our dynamical systems perspective also ties the scaling between forecast accuracy and context length to the fractal dimension of the attractor, providing insight into the previously observed in-context neural scaling laws. Context parroting thus serves as a simple but tough-to-beat baseline for future time-series foundation models and can help identify in-context learning strategies beyond parroting.
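A parroting baseline is almost trivially simple, which is the point. One natural variant matches the most recent window earlier in the context and replays what followed; the exact matching rule below is our illustration, not necessarily the authors':

```python
import numpy as np

def parrot_forecast(context, horizon, window=16):
    """Zero-shot 'forecast' by context parroting: find where the most
    recent window best matches earlier in the context, then copy the
    continuation that followed that match."""
    recent = context[-window:]
    n = len(context)
    best_idx, best_err = 0, np.inf
    for i in range(n - window - horizon):
        err = np.sum((context[i:i + window] - recent) ** 2)
        if err < best_err:
            best_idx, best_err = i, err
    return context[best_idx + window: best_idx + window + horizon]

t = np.linspace(0, 20 * np.pi, 2000)
context = np.sin(t[:1500])
pred = parrot_forecast(context, horizon=200)
truth = np.sin(t[1500:1700])
print("MSE:", float(np.mean((pred - truth) ** 2)))  # tiny for periodic systems
```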
Authors: Nero Z. Li, Xuehao Zhai, Zhichao Shi, Boshen Shi, Xuhui Jiang
Abstract: While Graph Contrastive Learning (GCL) has attracted considerable attention in the field of graph self-supervised learning, its performance heavily relies on data augmentations that are expected to generate semantically consistent positive pairs. Existing strategies typically resort to random perturbations or local structure preservation, yet lack explicit control over global structural consistency between augmented views. To address this limitation, we propose Fractal Graph Contrastive Learning (FractalGCL), a theory-driven framework that leverages fractal self-similarity to enforce global topological coherence. FractalGCL introduces two key innovations: a renormalisation-based augmentation that generates structurally aligned positive views via box coverings, and a fractal-dimension-aware contrastive loss that aligns graph embeddings according to their fractal dimensions. While combining the two innovations markedly boosts graph-representation quality, it also adds non-trivial computational overhead. To mitigate the computational overhead of fractal dimension estimation, we derive a one-shot estimator by proving that the dimension discrepancy between original and renormalised graphs converges weakly to a centred Gaussian distribution. This theoretical insight enables a reduction in dimension computation cost by an order of magnitude, cutting overall training time by approximately 61%. The experiments show that FractalGCL not only delivers state-of-the-art results on standard benchmarks but also outperforms traditional baselines on traffic networks by a remarkable average margin of about 7%. Codes are available at (https://anonymous.4open.science/r/FractalGCL-0511).
Authors: Zihang Jia, Zhen Zhang, Witold Pedrycz
Abstract: The density peaks clustering (DPC) algorithm has attracted considerable attention for its ability to detect arbitrarily shaped clusters based on a simple yet effective assumption. Recent advancements integrating granular-ball (GB) computing with DPC have led to the GB-based DPC (GBDPC) algorithm, which improves computational efficiency. However, GBDPC demonstrates limitations when handling complex clustering tasks, particularly those involving data with complex manifold structures or non-uniform density distributions. To overcome these challenges, this paper proposes the local GB quality peaks clustering (LGBQPC) algorithm, which offers comprehensive improvements to GBDPC in both GB generation and clustering processes based on the principle of justifiable granularity (POJG). Firstly, an improved GB generation method, termed GB-POJG+, is developed, which systematically refines the original GB-POJG in four key aspects: the objective function, termination criterion for GB division, definition of abnormal GB, and granularity level adaptation strategy. GB-POJG+ simplifies parameter configuration by requiring only a single penalty coefficient and ensures high-quality GB generation while maintaining the number of generated GBs within an acceptable range. In the clustering phase, two key innovations are introduced based on the GB k-nearest neighbor graph: relative GB quality for density estimation and geodesic distance for GB distance metric. These modifications substantially improve the performance of GBDPC on datasets with complex manifold structures or non-uniform density distributions. Extensive numerical experiments on 40 benchmark datasets, including both synthetic and publicly available datasets, validate the superior performance of the proposed LGBQPC algorithm.
Authors: Rares Cristian, Pavithra Harsha, Georgia Perakis, Brian Quanz
Abstract: End-to-end learning has become a widely applicable and studied problem in training predictive ML models to be aware of their impact on downstream decision-making tasks. These end-to-end models often outperform traditional methods that separate training from the optimization and only myopically focus on prediction error. However, the computational complexity of end-to-end frameworks poses a significant challenge, particularly for large-scale problems: when training an ML model using gradient descent, computing each gradient requires solving an expensive optimization problem. We present a meta-optimization method that learns efficient algorithms to approximate optimization problems, dramatically reducing the computational overhead of solving the decision problem, an aspect we leverage during training within the end-to-end framework. Our approach introduces a neural network architecture that near-optimally solves optimization problems while ensuring feasibility constraints through alternate projections. We prove exponential convergence, approximation guarantees, and generalization bounds for our learning method. This method offers superior computational efficiency, producing high-quality approximations faster and scaling better with problem size compared to existing techniques. Our approach applies to a wide range of optimization problems, including deterministic, single-stage, as well as two-stage stochastic optimization problems. We illustrate how our proposed method applies to (1) an electricity generation problem using real data from an electricity routing company coordinating the movement of electricity throughout 13 states, (2) a shortest path problem with a computer vision task of predicting edge costs from terrain maps, (3) a two-stage multi-warehouse cross-fulfillment newsvendor problem, as well as a variety of other newsvendor-like problems.
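The feasibility mechanism can be illustrated with plain alternating projections between an affine constraint set and the nonnegative orthant, applied to a raw prediction (a minimal sketch, not the paper's learned architecture):

```python
import numpy as np

def project_affine(x, A, b):
    """Orthogonal projection onto the affine set {x : Ax = b}."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def alternate_projections(x, A, b, n_iter=200):
    """Drive a raw prediction toward feasibility by alternating projections
    between the nonnegative orthant and the affine constraint set (POCS)."""
    for _ in range(n_iter):
        x = np.maximum(x, 0.0)         # projection onto x >= 0
        x = project_affine(x, A, b)    # projection onto Ax = b
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 10))
b = A @ np.abs(rng.standard_normal(10))  # built so a feasible point exists
x0 = rng.standard_normal(10)             # stand-in for a raw network output
x = alternate_projections(x0, A, b)
print("equality residual:", np.linalg.norm(A @ x - b), "min entry:", x.min())
```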
Authors: Jingwei Li, Jing Xu, Zifan Wang, Huishuai Zhang, Jingzhao Zhang
Abstract: One explanation for the strong generalization ability of neural networks is implicit bias. Yet, the definition and mechanism of implicit bias in non-linear contexts remain poorly understood. In this work, we propose to characterize implicit bias by the count of connected regions in the input space with the same predicted label. Compared with parameter-dependent metrics (e.g., norm or normalized margin), region count can be better adapted to nonlinear, overparameterized models, because it is determined by the function mapping and is invariant to reparametrization. Empirically, we find that small region counts align with geometrically simple decision boundaries and correlate well with good generalization performance. We also observe that good hyper-parameter choices such as larger learning rates and smaller batch sizes can induce small region counts. We further establish the theoretical connections and explain how larger learning rates can induce small region counts in neural networks.
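For 2D inputs, the region count is easy to approximate on a grid with connected-component labeling; a sketch with hypothetical decision functions:

```python
import numpy as np
from scipy import ndimage

def region_count(predict, xlim=(-3, 3), ylim=(-3, 3), res=400):
    """Approximate the number of connected same-label regions of a
    classifier over a 2D input grid."""
    xs = np.linspace(*xlim, res)
    ys = np.linspace(*ylim, res)
    X, Y = np.meshgrid(xs, ys)
    labels = predict(np.stack([X.ravel(), Y.ravel()], axis=1)).reshape(res, res)
    total = 0
    for c in np.unique(labels):
        _, n = ndimage.label(labels == c)   # connected components of class c
        total += n
    return total

# Hypothetical decision functions: one simple boundary vs. a checkerboard.
simple = lambda P: (P[:, 0] + P[:, 1] > 0).astype(int)
wiggly = lambda P: (np.sin(3 * P[:, 0]) * np.sin(3 * P[:, 1]) > 0).astype(int)
print(region_count(simple), region_count(wiggly))  # small vs. much larger
```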
Authors: Alejandro Moreo
Abstract: When the distribution of the data used to train a classifier differs from that of the test data, i.e., under dataset shift, well-established routines for calibrating the decision scores of the classifier, estimating the proportion of positives in a test sample, or estimating the accuracy of the classifier, become particularly challenging. This paper investigates the interconnections among three fundamental problems: calibration, quantification, and classifier accuracy prediction, under dataset shift conditions. Specifically, we prove their equivalence through mutual reduction, i.e., we show that access to an oracle for any one of these tasks enables the resolution of the other two. Based on these proofs, we propose new methods for each problem based on direct adaptations of well-established methods borrowed from the other disciplines. Our results show such methods are often competitive, and sometimes even surpass the performance of dedicated approaches from each discipline. The main goal of this paper is to foster cross-fertilization among these research areas, encouraging the development of unified approaches and promoting synergies across the fields.
Authors: Millend Roy, Vladimir Pyltsov, Yinbo Hu
Abstract: Accurate electricity load forecasting is essential for grid stability, resource optimization, and renewable energy integration. While transformer-based deep learning models like TimeGPT have gained traction in time-series forecasting, their effectiveness in long-term electricity load prediction remains uncertain. This study evaluates forecasting models ranging from classical regression techniques to advanced deep learning architectures using data from the ESD 2025 competition. The dataset includes two years of historical electricity load data, alongside temperature and global horizontal irradiance (GHI) across five sites, with a one-day-ahead forecasting horizon. Since actual test set load values remain undisclosed, leveraging predicted values would accumulate errors, making this a long-term forecasting challenge. We (i) employ Principal Component Analysis (PCA) for dimensionality reduction, (ii) frame the task as a regression problem, using temperature and GHI as covariates to predict the load for each hour, and (iii) stack 24 hourly models to generate yearly forecasts. Our results reveal that deep learning models, including TimeGPT, fail to consistently outperform simpler statistical and machine learning approaches due to the limited availability of training data and exogenous variables. In contrast, XGBoost, with minimal feature engineering, delivers the lowest error rates across all test cases while maintaining computational efficiency. This highlights the limitations of deep learning in long-term electricity forecasting and reinforces the importance of model selection based on dataset characteristics rather than complexity. Our study provides insights into practical forecasting applications and contributes to the ongoing discussion on the trade-offs between traditional and modern forecasting methods.
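The hour-wise stacking in (iii) amounts to fitting one regressor per hour of day; a compact sketch with hypothetical column names and synthetic placeholder data, using XGBoost as in the winning configuration:

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n_days = 730                                    # two years of hourly data
df = pd.DataFrame({
    "hour": np.tile(np.arange(24), n_days),
    "temperature": rng.standard_normal(24 * n_days),
    "ghi": np.abs(rng.standard_normal(24 * n_days)),
    "load": rng.random(24 * n_days) * 100,      # placeholder target
})

# One regressor per hour of day; the 24 models together cover a full day
# and are rolled forward to produce the yearly forecast.
models = {
    h: XGBRegressor(n_estimators=200).fit(
        part[["temperature", "ghi"]], part["load"])
    for h, part in df.groupby("hour")
}

def forecast(hour, temperature, ghi):
    return float(models[hour].predict(np.array([[temperature, ghi]]))[0])
```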
Authors: Dazhuo Qiu, Jinwen Chen, Arijit Khan, Yan Zhao, Francesco Bonchi
Abstract: Counterfactual learning is emerging as an important paradigm, rooted in causality, which promises to alleviate common issues of graph neural networks (GNNs), such as fairness and interpretability. However, as in many real-world application domains where conducting randomized controlled trials is impractical, one has to rely on available observational (factual) data to detect counterfactuals. In this paper, we introduce and tackle the problem of searching for counterfactual evidences for the GNN-based node classification task. A counterfactual evidence is a pair of nodes such that, although they exhibit great similarity both in their features and in their neighborhood subgraph structures, they are classified differently by the GNN. We develop effective and efficient search algorithms and a novel indexing solution that leverages both node features and structural information to identify counterfactual evidences, and generalizes beyond any specific GNN. Through various downstream applications, we demonstrate the potential of counterfactual evidences to enhance fairness and accuracy of GNNs.
Authors: Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić
Abstract: Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
Authors: Xiaotian Zhang, Yue Shang, Entao Yang, Ge Zhang
Abstract: Understanding neural networks' (NNs) generalizability remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs' generalizability. Here we propose an interpretation for grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find the memorization process resembles the rapid cooling of a liquid into a non-equilibrium glassy state at low temperature, and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs' Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy. Our experiments with transformers on arithmetic tasks suggest that there is no entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to the theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.
Authors: Ciaran Bench, Vivek Desai, Mohammad Moulaeifard, Nils Strodthoff, Philip Aston, Andrew Thompson
Abstract: Photoplethysmography (PPG) signals encode information about relative changes in blood volume that can be used to assess various aspects of cardiac health non-invasively, e.g., to detect atrial fibrillation (AF) or predict blood pressure (BP). Deep networks are well-equipped to handle the large quantities of data acquired from wearable measurement devices. However, they lack interpretability and are prone to overfitting, leaving considerable risk for poor performance on unseen data and misdiagnosis. Here, we describe the use of two scalable uncertainty quantification techniques: Monte Carlo Dropout and the recently proposed Improved Variational Online Newton. These techniques are used to assess the trustworthiness of models trained to perform AF classification and BP regression from raw PPG time series. We find that the choice of hyperparameters has a considerable effect on the predictive performance of the models and on the quality and composition of predicted uncertainties. For example, the stochasticity of the model parameter sampling determines the proportion of the total uncertainty that is aleatoric, and has varying effects on predictive performance and calibration quality dependent on the chosen uncertainty quantification technique and the chosen expression of uncertainty. We find significant discrepancy in the quality of uncertainties over the predicted classes, emphasising the need for a thorough evaluation protocol that assesses local and adaptive calibration. This work suggests that hyperparameters must be carefully tuned to balance predictive performance and calibration quality, and that the optimal parameterisation may vary depending on the chosen expression of uncertainty.
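For readers unfamiliar with the first technique, here is a generic sketch of Monte Carlo Dropout: dropout is left active at test time, and the spread of repeated stochastic forward passes summarizes predictive uncertainty. This is an illustration, not the paper's PPG models:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(100, 64), torch.nn.ReLU(), torch.nn.Dropout(0.2),
    torch.nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=100):
    """Monte Carlo Dropout: sample stochastic forward passes with dropout
    enabled and summarize the resulting predictive distribution."""
    model.train()  # keeps Dropout active; for models with BatchNorm, set
                   # only the Dropout modules to train mode instead
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(0), samples.std(0)  # predictive mean and spread

x = torch.randn(8, 100)  # stand-in for windowed PPG features
mean, std = mc_dropout_predict(model, x)
print(mean.shape, std.shape)  # torch.Size([8, 1]) twice
```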
Authors: Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai
Abstract: The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.
Authors: Adrian Robert Minut, Tommaso Mencattini, Andrea Santilli, Donato Crisostomi, Emanuele Rodolà
Abstract: Model merging allows combining the capabilities of existing models into a new one - post hoc, without additional training. This has made it increasingly popular thanks to its low cost and the availability of libraries that support merging on consumer GPUs. Recent work shows that pairing merging with evolutionary algorithms can boost performance, but no framework currently supports flexible experimentation with such strategies in language models. We introduce Mergenetic, an open-source library for evolutionary model merging. Mergenetic enables easy composition of merging methods and evolutionary algorithms while incorporating lightweight fitness estimators to reduce evaluation costs. We describe its design and demonstrate that Mergenetic produces competitive results across tasks and languages using modest hardware.
Authors: Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, Xin Liu
Abstract: We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.
Authors: Xinran Song, Tianyu Chen, Mingyuan Zhou
Abstract: Estimating individualized treatment effects from observational data is a central challenge in causal inference, largely due to covariate imbalance and confounding bias from non-randomized treatment assignment. While inverse probability weighting (IPW) is a well-established solution to this problem, its integration into modern deep learning frameworks remains limited. In this work, we propose Importance-Weighted Diffusion Distillation (IWDD), a novel generative framework that combines the pretraining of diffusion models with importance-weighted score distillation to enable accurate and fast causal estimation, including potential outcome prediction and treatment effect estimation. We demonstrate how IPW can be naturally incorporated into the distillation of pretrained diffusion models, and further introduce a randomization-based adjustment that eliminates the need to compute IPW explicitly, thereby simplifying computation and, more importantly, provably reducing the variance of gradient estimates. Empirical results show that IWDD achieves state-of-the-art out-of-sample prediction performance, with the highest win rates compared to other baselines, significantly improving causal estimation and supporting the development of individualized treatment strategies. We will release our PyTorch code for reproducibility and future research.
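For context, the classic IPW weights the paper builds on, sketched for a binary treatment with an estimated propensity model (a generic illustration, not IWDD itself):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.standard_normal((n, 3))
# Confounded treatment assignment: probability depends on covariates.
p_true = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
T = rng.binomial(1, p_true)

# Estimate propensities e(x) = P(T=1 | x) and form the IPW weights,
# which reweight the sample toward a pseudo-randomized population.
e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
w = T / e_hat + (1 - T) / (1 - e_hat)
print("weight range:", w.min().round(2), w.max().round(2))
```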
Authors: Wesley A Suttle, Vipul K Sharma, Brian M Sadler
Abstract: Classic multi-agent reinforcement learning (MARL) methods require that agents enjoy global state observability, preventing development of decentralized algorithms and limiting scalability. Recent work has shown that, under assumptions on decaying inter-agent influence, global observability can be replaced by local neighborhood observability at each agent, enabling decentralization and scalability. Real-world applications enjoying such decay properties remain underexplored, however, despite the fact that signal power decay, or signal attenuation, due to path loss is an intrinsic feature of many problems in wireless communications and radar networks. In this paper, we show that signal attenuation enables decentralization in MARL by considering the illustrative special case of performing power allocation for target detection in a radar network. To achieve this, we propose two new constrained multi-agent Markov decision process formulations of this power allocation problem, derive local neighborhood approximations for global value function and gradient estimates and establish corresponding error bounds, and develop decentralized saddle point policy gradient algorithms for solving the proposed problems. Our approach, though oriented towards the specific radar network problem we consider, provides a useful model for future extensions to additional problems in wireless communications and radar networks.
Authors: Zhaolan Huang, Emmanuel Baccelli
Abstract: AI spans from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are decisive to fit within an MCU's tiny memory budget e.g., 128kB of RAM. However, inference latency must remain small to fit real-time constraints. An approach to tackle this is patch-based fusion, which aims to optimize data flows across neural network layers. In this paper, we introduce msf-CNN, a novel technique that efficiently finds optimal fusion settings for convolutional neural networks (CNNs) by walking through the fusion solution space represented as a directed acyclic graph. Compared to previous work on CNN fusion for MCUs, msf-CNN identifies a wider set of solutions. We published an implementation of msf-CNN running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32). We show that msf-CNN can achieve inference using 50% less RAM compared to the prior art (MCUNetV2 and StreamNet). We thus demonstrate how msf-CNN offers additional flexibility for system designers.
Authors: Yuan-Zheng Lei, Yaobang Gong, Dianwei Chen, Yao Cheng, Xianfeng Terry Yang
Abstract: This study critically examines the performance of physics-informed machine learning (PIML) approaches for traffic flow modeling, defining the failure of a PIML model as the scenario where it underperforms both its purely data-driven and purely physics-based counterparts. We analyze the loss landscape by perturbing trained models along the principal eigenvectors of the Hessian matrix and evaluating corresponding loss values. Our results suggest that physics residuals in PIML do not inherently hinder optimization, contrary to a commonly assumed failure cause. Instead, successful parameter updates require both ML and physics gradients to form acute angles with the quasi-true gradient and lie within a conical region. Given inaccuracies in both the physics models and the training data, satisfying this condition is often difficult. Experiments reveal that physical residuals can degrade the performance of LWR- and ARZ-based PIML models, especially under highly physics-driven settings. Moreover, sparse sampling and the use of temporally averaged traffic data can produce misleadingly small physics residuals that fail to capture actual physical dynamics, contributing to model failure. We also identify the Courant-Friedrichs-Lewy (CFL) condition as a key indicator of dataset suitability for PIML, where successful applications consistently adhere to this criterion. Lastly, we observe that higher-order models like ARZ tend to have larger error lower bounds than lower-order models like LWR, which is consistent with the experimental findings of existing studies.
Authors: Nana Liu, Michele Minervini, Dhrumil Patel, Mark M. Wilde
Abstract: In quantum thermodynamics, a system is described by a Hamiltonian and a list of non-commuting charges representing conserved quantities like particle number or electric charge, and an important goal is to determine the system's minimum energy in the presence of these conserved charges. In optimization theory, a semi-definite program (SDP) involves a linear objective function optimized over the cone of positive semi-definite operators intersected with an affine space. These problems arise from differing motivations in the physics and optimization communities and are phrased using very different terminology, yet they are essentially identical mathematically. By adopting Jaynes' mindset motivated by quantum thermodynamics, we observe that minimizing free energy in the aforementioned thermodynamics problem, instead of energy, leads to an elegant solution in terms of a dual chemical potential maximization problem that is concave in the chemical potential parameters. As such, one can employ standard (stochastic) gradient ascent methods to find the optimal values of these parameters, and these methods are guaranteed to converge quickly. At low temperature, the minimum free energy provides an excellent approximation for the minimum energy. We then show how this Jaynes-inspired gradient-ascent approach can be used in both first- and second-order classical and hybrid quantum-classical algorithms for minimizing energy, and equivalently, how it can be used for solving SDPs, with guarantees on the runtimes of the algorithms. The approach discussed here is well grounded in quantum thermodynamics and, as such, provides physical motivation underpinning why algorithms published fifty years after Jaynes' seminal work, including the matrix multiplicative weights update method, the matrix exponentiated gradient update method, and their quantum algorithmic generalizations, perform well at solving SDPs.
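In symbols, the correspondence reads roughly as follows (standard forms with our notation; $T$ is the temperature and $S$ the von Neumann entropy):

```latex
% Primal: minimum energy subject to conserved (generally non-commuting) charges
E_{\min} \;=\; \min_{\rho \succeq 0,\ \mathrm{Tr}\,\rho = 1} \mathrm{Tr}[H\rho]
\quad \text{subject to} \quad \mathrm{Tr}[Q_i \rho] = q_i .

% Jaynes-style relaxation: minimize the free energy
% F(\rho) = \mathrm{Tr}[H\rho] - T S(\rho) instead; the minimizer is a Gibbs
% state, and the dual is concave in the chemical potentials \mu_i:
\max_{\mu}\ \Big[\ \textstyle\sum_i \mu_i q_i
  \;-\; T \log \mathrm{Tr}\, e^{-\left(H - \sum_i \mu_i Q_i\right)/T} \Big].
```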
Authors: Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo
Abstract: While the capabilities and utility of AI systems have advanced, rigorous norms for evaluating these systems have lagged. Grand claims, such as models achieving general reasoning capabilities, are supported with model performance on narrow benchmarks, like performance on graduate-level exam questions, which provide a limited and potentially misleading assessment. We provide a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence. For instance, our framework helps determine whether performance on a mathematical benchmark is an indication of the ability to solve problems on math tests or instead indicates a broader ability to reason. Our framework is well-suited for the contemporary paradigm in machine learning, where various stakeholders provide measurements and evaluations that downstream users use to validate their claims and decisions. At the same time, our framework also informs the construction of evaluations designed to speak to the validity of the relevant claims. By leveraging psychometrics' breakdown of validity, evaluations can prioritize the most critical facets for a given claim, improving empirical utility and decision-making efficacy. We illustrate our framework through detailed case studies of vision and language model evaluations, highlighting how explicitly considering validity strengthens the connection between evaluation evidence and the claims being made.
Authors: Adnan Ahmad, Bahareh Nakisa, Mohammad Naim Rastgoo
Abstract: Emotion recognition through physiological signals such as electroencephalogram (EEG) has become an essential aspect of affective computing and provides an objective way to capture human emotions. However, physiological data characterized by cross-subject variability and noisy labels hinder the performance of emotion recognition models. Existing domain adaptation and continual learning methods struggle to address these issues, especially under realistic conditions where data is continuously streamed and unlabeled. To overcome these limitations, we propose a novel bi-level self-supervised continual learning framework, SSOCL, based on a dynamic memory buffer. This bi-level architecture iteratively refines the dynamic buffer and pseudo-label assignments to effectively retain representative samples, enabling generalization from continuous, unlabeled physiological data streams for emotion recognition. The assigned pseudo-labels are subsequently leveraged for accurate emotion prediction. Key components of the framework, including a fast adaptation module and a cluster-mapping module, enable robust learning and effective handling of evolving data streams. Experimental validation on two mainstream EEG tasks demonstrates the framework's ability to adapt to continuous data streams while maintaining strong generalization across subjects, outperforming existing approaches.
Authors: Frederik Köhne, Anton Schiela
Abstract: Averaging, or smoothing, is a fundamental approach to obtain stable, de-noised estimates from noisy observations. In certain scenarios, observations made along trajectories of random dynamical systems are of particular interest. One popular smoothing technique for such a scenario is exponential moving averaging (EMA), which assigns observations a weight that decreases exponentially in their age, thus giving younger observations a larger weight. However, EMA fails to enjoy strong stochastic convergence properties, which stems from the fact that the weight assigned to the youngest observation is constant over time, preventing the noise in the averaged quantity from decreasing to zero. In this work, we consider an adaptation to EMA, which we call $p$-EMA, where the weights assigned to the last observations decrease to zero at a subharmonic rate. We provide stochastic convergence guarantees for this kind of averaging under mild assumptions on the autocorrelations of the underlying random dynamical system. We further discuss the implications of our results for a recently introduced adaptive step size control for Stochastic Gradient Descent (SGD), which uses $p$-EMA for averaging noisy observations.
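A minimal sketch contrasting classic EMA with the $p$-EMA idea, assuming the weight on the $t$-th observation follows $\alpha_t = t^{-p}$; the paper's exact schedule and admissible range of $p$ may differ.

```python
def ema(xs, alpha=0.1):
    """Classic EMA: constant weight alpha on the newest observation,
    so the noise in the average never decreases to zero."""
    avg = xs[0]
    for x in xs[1:]:
        avg = (1 - alpha) * avg + alpha * x
    return avg

def p_ema(xs, p=0.75):
    """p-EMA sketch: the newest-observation weight decays subharmonically,
    letting the noise in the averaged quantity shrink over time."""
    avg = 0.0
    for t, x in enumerate(xs, start=1):
        alpha_t = t ** (-p)   # assumed schedule
        avg = (1 - alpha_t) * avg + alpha_t * x
    return avg
```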
Authors: Jonathan García, Philipp Petersen
Abstract: We study classification problems using binary estimators where the decision boundary is described by horizon functions and where the data distribution satisfies a geometric margin condition. We establish upper and lower bounds for the minimax learning rate over broad function classes with bounded Kolmogorov entropy in Lebesgue norms. A key novelty of our work is the derivation of lower bounds on the worst-case learning rates under a geometric margin condition -- a setting that is almost universally satisfied in practice but remains theoretically challenging. Moreover, our results deal with the noiseless setting, where lower bounds are particularly hard to establish. We apply our general results to classification problems with decision boundaries belonging to several function classes: for Barron-regular functions, and for Hölder-continuous functions with strong margins, we identify optimal rates close to the fast learning rates of $\mathcal{O}(n^{-1})$ for $n \in \mathbb{N}$ samples. For merely convex decision boundaries, optimal rates near $\mathcal{O}(n^{-1/2})$ can be achieved in the strong-margin case.
Authors: Kirill Vasilevski, Benjamin Rombaut, Gopi Krishnan Rajbahadur, Gustavo A. Oliva, Keheliya Gallaba, Filipe R. Cogo, Jiahuei (Justina) Lin, Dayi Lin, Haoxiang Zhang, Bouyan Chen, Kishanthan Thangarajah, Ahmed E. Hassan, Zhen Ming (Jack) Jiang
Abstract: Foundation Models (FMs) such as Large Language Models (LLMs) are reshaping the software industry by enabling FMware, systems that integrate these FMs as core components. In this KDD 2025 tutorial, we present a comprehensive exploration of FMware that combines a curated catalogue of challenges with real-world production concerns. We first discuss the state of research and practice in building FMware. We further examine the difficulties in selecting suitable models, aligning high-quality domain-specific data, engineering robust prompts, and orchestrating autonomous agents. We then address the complex journey from impressive demos to production-ready systems by outlining issues in system testing, optimization, deployment, and integration with legacy software. Drawing on our industrial experience and recent research in the area, we provide actionable insights and a technology roadmap for overcoming these challenges. Attendees will gain practical strategies to enable the creation of trustworthy FMware in the evolving technology landscape.
Authors: Rebecca G. Hart, Omkar Sudhir Patil, Zachary I. Bell, Warren E. Dixon
Abstract: Deep Neural Networks (DNNs) are increasingly used in control applications due to their powerful function approximation capabilities. However, many existing formulations focus primarily on tracking error convergence, often neglecting the challenge of identifying the system dynamics using the DNN. This paper presents the first result on simultaneous trajectory tracking and online system identification using a DNN-based controller, without requiring persistent excitation. Two new concurrent learning adaptation laws are constructed for the weights of all the layers of the DNN, achieving convergence of the DNN's parameter estimates to a neighborhood of their ideal values, provided the DNN's Jacobian satisfies a finite-time excitation condition. A Lyapunov-based stability analysis is conducted to ensure convergence of the tracking error, weight estimation errors, and observer errors to a neighborhood of the origin. Simulations performed on a range of systems and trajectories, with the same initial and operating conditions, demonstrated 40.5% to 73.6% improvement in function approximation performance compared to the baseline, while maintaining a similar tracking error and control effort. Simulations evaluating function approximation capabilities on data points outside of the trajectory resulted in 58.88% and 74.75% improvement in function approximation compared to the baseline.
Authors: Sayed Mehedi Azim, Brian Corbett, Iman Dehzangi
Abstract: The hippocampus, a critical brain structure involved in memory processing and various neurodegenerative and psychiatric disorders, comprises three key subregions: the dentate gyrus (DG), Cornu Ammonis 1 (CA1), and Cornu Ammonis 3 (CA3). Accurate segmentation of these subregions from histological tissue images is essential for advancing our understanding of disease mechanisms, developmental dynamics, and therapeutic interventions. However, no existing methods address the automated segmentation of hippocampal subregions from tissue images, particularly from immunohistochemistry (IHC) images. To bridge this gap, we introduce a novel set of four comprehensive murine hippocampal IHC datasets featuring distinct staining modalities: cFos, NeuN, and multiplexed stains combining cFos, NeuN, and either ΔFosB or GAD67, capturing structural, neuronal activity, and plasticity-associated information. Additionally, we propose ROIsGAN, a region-guided U-Net-based generative adversarial network tailored for hippocampal subregion segmentation. By leveraging adversarial learning, ROIsGAN enhances boundary delineation and structural detail refinement through a novel region-guided discriminator loss combining Dice and binary cross-entropy loss. Evaluated across DG, CA1, and CA3 subregions, ROIsGAN consistently outperforms conventional segmentation models, achieving performance gains ranging from 1-10% in Dice score and up to 11% in Intersection over Union (IoU), particularly under challenging staining conditions. Our work establishes foundational datasets and methods for automated hippocampal segmentation, enabling scalable, high-precision analysis of tissue images in neuroscience research. Our generated datasets, proposed model as a standalone tool, and its corresponding source code are publicly available at: https://github.com/MehediAzim/ROIsGAN
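The abstract's discriminator loss combines Dice and binary cross-entropy; a hedged PyTorch sketch of such a combined term follows (the paper's exact weighting and how the term enters the region-guided discriminator are not specified here).

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits, target, w_dice=0.5, eps=1e-6):
    """Combined Dice + BCE segmentation loss on binary masks.
    logits: raw scores; target: {0,1} float tensor of the same shape."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    p = torch.sigmoid(logits)
    inter = (p * target).sum()
    dice = 1 - (2 * inter + eps) / (p.sum() + target.sum() + eps)
    return w_dice * dice + (1 - w_dice) * bce
```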
Authors: Trisha Das, Afrah Shafquat, Mandis Beigi, Jacob Aptekar, Jimeng Sun
Abstract: Clinical trials are vital for evaluation of safety and efficacy of new treatments. However, clinical trials are resource-intensive, time-consuming and expensive to conduct, where errors in trial design, reduced efficacy, and safety events can result in significant delays, financial losses, and damage to reputation. These risks underline the importance of informed and strategic decisions in trial design to mitigate these risks and improve the chances of a successful trial. Identifying similar historical trials is critical as these trials can provide an important reference for potential pitfalls and challenges including serious adverse events, dosage inaccuracies, recruitment difficulties, patient adherence issues, etc. Addressing these challenges in trial design can lead to development of more effective study protocols with optimized patient safety and trial efficiency. In this paper, we present a novel method to identify similar historical trials by summarizing clinical trial protocols and searching for similar trials based on a query trial's protocol. Our approach significantly outperforms all baselines, achieving up to a 78% improvement in recall@1 and a 53% improvement in precision@1 over the best baseline. We also show that our method outperforms all other baselines in partial trial similarity search and zero-shot patient-trial matching, highlighting its superior utility in these tasks.
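A sketch of the summarize-then-retrieve pattern the abstract describes, with an off-the-shelf sentence encoder standing in for the paper's (unspecified) summarization and embedding components.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in encoder

def top_k_similar(query_summary, corpus_summaries, k=5):
    """Rank historical trials by cosine similarity of protocol summaries."""
    q = model.encode([query_summary], normalize_embeddings=True)
    c = model.encode(corpus_summaries, normalize_embeddings=True)
    scores = (c @ q.T).ravel()      # cosine similarity via normalized dot
    return np.argsort(-scores)[:k]  # indices of the k most similar trials
```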
Authors: Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, Kevin Ellis
Abstract: Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) offer an alternative approach that learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge. We release our code and display the learned world models and videos of the agent's gameplay at https://topwasu.github.io/poe-world.
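An exponentially-weighted product of experts can be sketched as follows, with each expert a program (here, any callable) returning a probability for a candidate outcome; the expert programs and weights are assumptions for illustration.

```python
import math

def poe_predict(experts, weights, state, action, outcomes):
    """Product-of-experts sketch: P(o) is proportional to
    prod_i expert_i(o) ** w_i, normalized over candidate outcomes."""
    logs = [sum(w * math.log(max(e(state, action, o), 1e-9))
                for e, w in zip(experts, weights))
            for o in outcomes]
    m = max(logs)
    unnorm = [math.exp(l - m) for l in logs]  # stable softmax over log-scores
    z = sum(unnorm)
    return [u / z for u in unnorm]
```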
Authors: Jinke Li, Yue Wu, Xiaoyan Yang
Abstract: Thermal Infrared (TIR) technology involves the use of sensors to detect and measure infrared radiation emitted by objects, and it is widely utilized across a broad spectrum of applications. The advancements in object detection methods utilizing TIR images have sparked significant research interest. However, most traditional methods lack the capability to effectively extract and fuse local-global information, which is crucial for TIR-domain feature attention. In this study, we present a novel and efficient thermal infrared object detection framework, termed CRT-YOLO, which is based on centralized feature regulation, enabling the establishment of global-range interaction on TIR information. Our proposed model integrates efficient multi-scale attention (EMA) modules, which adeptly capture long-range dependencies while incurring minimal computational overhead. Additionally, it leverages the Centralized Feature Pyramid (CFP) network, which offers global regulation of TIR features. Extensive experiments conducted on two benchmark datasets demonstrate that our CRT-YOLO model significantly outperforms conventional methods for TIR image object detection. Furthermore, the ablation study provides compelling evidence of the effectiveness of our proposed modules, reinforcing the potential impact of our approach on advancing the field of thermal infrared object detection.
Authors: Achintha Wijesinghe, Weiwei Wang, Suchinthaka Wanninayaka, Songyang Zhang, Zhi Ding
Abstract: Recent advancements in generative artificial intelligence have introduced groundbreaking approaches to innovating next-generation semantic communication, which prioritizes conveying the meaning of a message rather than merely transmitting raw data. A fundamental challenge in semantic communication lies in accurately identifying and extracting the most critical semantic information while adapting to downstream tasks without degrading performance, particularly when the objective at the receiver may evolve over time. To enable flexible adaptation to multiple tasks at the receiver, this work introduces a novel semantic communication framework that jointly captures task-specific and contextual information to enhance downstream task performance. Through rigorous experiments on popular image datasets and computer vision tasks, our framework shows promising improvement compared to existing work, including superior performance in downstream tasks, better generalizability, ultra-high bandwidth efficiency, and low reconstruction latency.
Authors: Yuta Higuchi, Rikuto Nagai, Atsushi Okazaki, Masaki Ogura, Naoki Wakamiya
Abstract: As climate change increases the threat of weather-related disasters, research on weather control is gaining importance. The objective of weather control is to mitigate disaster risks by administering interventions with optimal timing, location, and intensity. However, the optimization process is highly challenging due to the vast scale and complexity of weather phenomena, which introduces two major challenges. First, obtaining accurate gradient information for optimization is difficult. Second, numerical weather prediction (NWP) models demand enormous computational resources, necessitating parameter optimization with minimal function evaluations. To address these challenges, this study proposes a method for designing weather interventions based on black-box optimization, which enables efficient exploration without requiring gradient information. The proposed method is evaluated in two distinct control scenarios: one-shot initial value intervention and sequential intervention based on model predictive control. Furthermore, a comparative analysis is conducted among four representative black-box optimization methods in terms of total rainfall reduction. Experimental results show that Bayesian optimization achieves higher control effectiveness than the others, particularly in high-dimensional search spaces. These findings suggest that Bayesian optimization is a highly effective approach for weather intervention computation.
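A hedged sketch of the black-box setup using Bayesian optimization via scikit-optimize; run_nwp is a toy stand-in for the expensive NWP simulation, and the parameterization (timing, latitude, longitude, intensity) is illustrative, not the paper's.

```python
from skopt import gp_minimize

def run_nwp(t, lat, lon, intensity):
    """Toy surrogate for an NWP run returning total rainfall (assumption)."""
    return (t - 3.0) ** 2 + (lat - 34.0) ** 2 + (lon - 135.0) ** 2 \
        + (intensity - 0.5) ** 2

def total_rainfall(params):
    return run_nwp(*params)

# Bayesian optimization: a Gaussian-process surrogate keeps the number of
# expensive simulator calls small and needs no gradient information.
res = gp_minimize(total_rainfall,
                  dimensions=[(0.0, 6.0), (30.0, 40.0),
                              (130.0, 140.0), (0.0, 1.0)],
                  n_calls=25, random_state=0)
print(res.x, res.fun)
```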
Authors: Ali Sartaz Khan, Tolulope Ogunremi, Ahmed Attia, Dorottya Demszky
Abstract: Speaker diarization, the process of identifying "who spoke when" in audio recordings, is essential for understanding classroom dynamics. However, classroom settings present distinct challenges, including poor recording quality, high levels of background noise, overlapping speech, and the difficulty of accurately capturing children's voices. This study investigates the effectiveness of multi-stage diarization models using Nvidia's NeMo diarization pipeline. We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models, including self-supervised transformer-based frame-wise VAD models. We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions. We conduct experiments using two datasets from English-speaking classrooms to separate teacher vs. student speech and to separate all speakers. Our results show that denoising significantly improves the Diarization Error Rate (DER) by reducing the rate of missed speech. Additionally, training on both denoised and noisy datasets leads to substantial performance gains in noisy conditions. The hybrid VAD model leads to further improvements in speech detection, achieving a DER as low as 17% in teacher-student experiments and 45% in all-speaker experiments. However, we also identified trade-offs between voice activity detection and speaker confusion. Overall, our study highlights the effectiveness of multi-stage diarization models and integrating ASR-based information for enhancing speaker diarization in noisy classroom environments.
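One way to realize the hybrid VAD the abstract mentions is to take the union of frame-level VAD segments and ASR word timestamps; a sketch, assuming words arrive as dicts with start/end times in seconds (the paper's exact fusion rule may differ).

```python
def hybrid_vad(frame_probs, hop_s, asr_words, thresh=0.5):
    """Union frame-level VAD decisions with ASR word-level timestamps,
    then merge overlapping segments into one speech-region list."""
    segs = [(i * hop_s, (i + 1) * hop_s)
            for i, p in enumerate(frame_probs) if p >= thresh]
    segs += [(w["start"], w["end"]) for w in asr_words]
    segs.sort()
    merged = []
    for s, e in segs:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```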
Authors: Luca Menicali, Andrew Grace, David H. Richter, Stefano Castruccio
Abstract: Fluid thermodynamics underpins atmospheric dynamics, climate science, industrial applications, and energy systems. However, direct numerical simulations (DNS) of such systems are computationally prohibitive. To address this, we present a novel physics-informed spatio-temporal surrogate model for Rayleigh-Bénard convection (RBC), a canonical example of convective fluid flow. Our approach combines convolutional neural networks for spatial feature extraction with an innovative recurrent architecture inspired by large language models, comprising a context builder and a sequence generator to capture temporal dynamics. Inference is penalized with respect to the governing partial differential equations to ensure physical interpretability. Given the sensitivity of turbulent convection to initial conditions, we quantify uncertainty using a conformal prediction framework. This model replicates key features of RBC dynamics while significantly reducing computational cost, offering a scalable alternative to DNS for long-term simulations.
Authors: Meghali Nandi, Arash Shaghaghi, Nazatul Haque Sultan, Gustavo Batista, Raymond K. Zhao, Sanjay Jha
Abstract: Federated Learning (FL) has emerged as a powerful paradigm for collaborative model training while keeping client data decentralized and private. However, it is vulnerable to Data Reconstruction Attacks (DRA) such as "LoKI" and "Robbing the Fed", where malicious models sent from the server to the client can reconstruct sensitive user data. To counter this, we introduce DRArmor, a novel defense mechanism that integrates Explainable AI with targeted detection and mitigation strategies for DRA. Unlike existing defenses that focus on the entire model, DRArmor identifies and addresses the root cause (i.e., malicious layers within the model that send gradients with malicious intent) by analyzing their contribution to the output and detecting inconsistencies in gradient values. Once these malicious layers are identified, DRArmor applies defense techniques such as noise injection, pixelation, and pruning to these layers rather than the whole model, minimizing the attack surface and preserving client data privacy. We evaluate DRArmor's performance against the advanced LoKI attack across diverse datasets, including MNIST, CIFAR-10, CIFAR-100, and ImageNet, in a 200-client FL setup. Our results demonstrate DRArmor's effectiveness in mitigating data leakage, achieving high True Positive and True Negative Rates of 0.910 and 0.890, respectively. Additionally, DRArmor maintains an average accuracy of 87%, effectively protecting client privacy without compromising model performance. Compared to existing defense mechanisms, DRArmor reduces the data leakage rate by 62.5% with datasets containing 500 samples per client.
Authors: Narayanan PP, Sarvesh Prasanth Venkatesan, Srinivas Kantha Reddy, Shishir Kolathaya
Abstract: Recent advancements in large-scale offline training have demonstrated the potential of generalist policy learning for complex robotic tasks. However, applying these principles to legged locomotion remains a challenge due to continuous dynamics and the need for real-time adaptation across diverse terrains and robot morphologies. In this work, we propose GROQLoco, a scalable, attention-based framework that learns a single generalist locomotion policy across multiple quadruped robots and terrains, relying solely on offline datasets. Our approach leverages expert demonstrations from two distinct locomotion behaviors - stair traversal (non-periodic gaits) and flat terrain traversal (periodic gaits) - collected across multiple quadruped robots, to train a generalist model that enables behavior fusion for both behaviors. Crucially, our framework operates directly on proprioceptive data from all robots without incorporating any robot-specific encodings. The policy is directly deployable on an Intel i7 NUC, producing low-latency control outputs without any test-time optimization. Our extensive experiments demonstrate strong zero-shot transfer across highly diverse quadruped robots and terrains, including hardware deployment on the Unitree Go1, a commercially available 12 kg robot. Notably, we evaluate challenging cross-robot training setups where different locomotion skills are unevenly distributed across robots, yet observe successful transfer of both flat walking and stair traversal behaviors to all robots at test time. We also show preliminary walking on Stoch 5, a 70 kg quadruped, on flat and outdoor terrains without requiring any fine-tuning. These results highlight the potential for robust generalist locomotion across diverse robots and terrains.
Authors: Yexiang Liu, Zekun Li, Zhi Fang, Nan Xu, Ran He, Tieniu Tan
Abstract: Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks. Experiment results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a method grounded in probability theory to quickly and accurately predict the scaling performance and select the best strategy under large sampling times, without extra resource-intensive inference in practice. It can serve as the test-time scaling law for majority voting. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance.
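The quantity being predicted can be estimated directly by Monte Carlo: given one query's answer distribution under a prompting strategy, simulate k samples and take the plurality answer. A sketch (the paper's closed-form predictor is not reproduced here).

```python
import random
from collections import Counter

def majority_vote_accuracy(answers, probs, correct, k, trials=20000, seed=0):
    """Estimate the probability that k-sample majority voting returns
    the correct answer, given per-answer sampling probabilities."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        votes = Counter(rng.choices(answers, probs, k=k))
        wins += max(votes, key=votes.get) == correct
    return wins / trials

# A strategy that is right only 40% of the time per sample, but whose
# errors are spread out, exceeds 40% accuracy under voting as k grows.
print(majority_vote_accuracy(["A", "B", "C"], [0.4, 0.3, 0.3], "A", k=31))
```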
Authors: Yacine Izza, Alexey Ignatiev, Joao Marques-Silva, Peter J. Stuckey
Abstract: Explainable Artificial Intelligence (XAI) is critical for attaining trust in the operation of AI systems. A key question of an AI system is "why was this decision made this way". Formal approaches to XAI use a formal model of the AI system to identify abductive explanations. While abductive explanations may be applicable to a large number of inputs sharing the same concrete values, more general explanations may be preferred for numeric inputs. So-called inflated abductive explanations give intervals for each feature ensuring that any input whose values fall within these intervals is still guaranteed to make the same prediction. Inflated explanations cover a larger portion of the input space, and hence are deemed more general explanations. But there can be many (inflated) abductive explanations for an instance. Which is the best? In this paper, we show how to find a most general abductive explanation for an AI decision. This explanation covers as much of the input space as possible, while still being a correct formal explanation of the model's behaviour. Given that we only want to give a human one explanation for a decision, the most general explanation gives us the explanation with the broadest applicability, and hence the one most likely to seem sensible. (The paper has been accepted at the IJCAI 2025 conference.)
Authors: Oskar Allerbo, Thomas B. Schön
Abstract: The success of unsupervised learning raises the question of whether supervised models, too, can be trained without using the information in the output $y$. In this paper, we demonstrate that this is indeed possible. The key step is to formulate the model as a smoother, i.e., of the form $\hat{f}=Sy$, and to construct the smoother matrix $S$ independently of $y$, e.g., by training on random labels. We present a simple model selection criterion based on the distribution of the out-of-sample predictions and show that, in contrast to cross-validation, this criterion can be used also without access to $y$. We demonstrate on real and synthetic data that $y$-free trained versions of linear and kernel ridge regression, smoothing splines, and neural networks perform similarly to their standard, $y$-based, versions and, most importantly, significantly better than random guessing.
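For kernel ridge regression the smoother form is explicit: the fitted values are $\hat{f} = Sy$ with $S = K(K+\lambda I)^{-1}$ built from the inputs alone. A minimal sketch; the RBF bandwidth and penalty values are illustrative.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def smoother_matrix(X, lam=0.1, gamma=1.0):
    """Kernel ridge smoother S with fitted values S @ y; note that S is
    constructed from X only, never touching the outputs y."""
    K = rbf_kernel(X, X, gamma)
    return K @ np.linalg.inv(K + lam * np.eye(len(X)))

# Usage: S = smoother_matrix(X); y_hat = S @ y  (y enters only at the end).
```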
Authors: Ye Kyaw Thu, Thazin Myint Oo
Abstract: This paper explores syllable sequence prediction in Abugida languages using Transformer-based models, focusing on six languages: Bengali, Hindi, Khmer, Lao, Myanmar, and Thai, from the Asian Language Treebank (ALT) dataset. We investigate the reconstruction of complete syllable sequences from various incomplete input types, including consonant sequences, vowel sequences, partial syllables (with random character deletions), and masked syllables (with fixed syllable deletions). Our experiments reveal that consonant sequences play a critical role in accurate syllable prediction, achieving high BLEU scores, while vowel sequences present a significantly greater challenge. The model demonstrates robust performance across tasks, particularly in handling partial and masked syllable reconstruction, with strong results for tasks involving consonant information and syllable masking. This study advances the understanding of sequence prediction for Abugida languages and provides practical insights for applications such as text prediction, spelling correction, and data augmentation in these scripts.
Authors: Harsh Parikh, Trang Quynh Nguyen, Elizabeth A. Stuart, Kara E. Rudolph, Caleb H. Miles
Abstract: Data integration approaches are increasingly used to enhance the efficiency and generalizability of studies. However, a key limitation of these methods is the assumption that outcome measures are identical across datasets -- an assumption that often does not hold in practice. Consider the following opioid use disorder (OUD) studies: the XBOT trial and the POAT study, both evaluating the effect of medications for OUD on withdrawal symptom severity (not the primary outcome of either trial). While XBOT measures withdrawal severity using the subjective opiate withdrawal scale, POAT uses the clinical opiate withdrawal scale. We analyze this realistic yet challenging setting where outcome measures differ across studies and where neither study records both types of outcomes. Our paper studies whether and when integrating studies with disparate outcome measures leads to efficiency gains. We introduce three sets of assumptions -- with varying degrees of strength -- linking both outcome measures. Our theoretical and empirical results highlight a cautionary tale: integration can improve asymptotic efficiency only under the strongest assumption linking the outcomes. However, misspecification of this assumption leads to bias. In contrast, a milder assumption may yield finite-sample efficiency gains, yet these benefits diminish as sample size increases. We illustrate these trade-offs via a case study integrating the XBOT and POAT datasets to estimate the comparative effect of two medications for opioid use disorder on withdrawal symptoms. By systematically varying the assumptions linking the SOW and COW scales, we show potential efficiency gains and the risks of bias. Our findings emphasize the need for careful assumption selection when fusing datasets with differing outcome measures, offering guidance for researchers navigating this common challenge in modern data integration.
Authors: Naqueeb Ahmad Warsi, Ayanava Dasgupta, Masahito Hayashi
Abstract: This work advances the theoretical understanding of quantum learning by establishing a new family of upper bounds on the expected generalization error of quantum learning algorithms, leveraging the framework introduced by Caro et al. (2024) and a new definition for the expected true loss. Our primary contribution is the derivation of these bounds in terms of quantum and classical Rényi divergences, utilizing a variational approach for evaluating quantum Rényi divergences, specifically the Petz and a newly introduced modified sandwich quantum Rényi divergence. Analytically and numerically, we demonstrate the superior performance of the bounds derived using the modified sandwich quantum Rényi divergence compared to those based on the Petz divergence. Furthermore, we provide probabilistic generalization error bounds using two distinct techniques: one based on the modified sandwich quantum Rényi divergence and classical Rényi divergence, and another employing smooth max Rényi divergence.
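For reference, the standard divergences the abstract builds on are, for distributions $P,Q$ and states $\rho,\sigma$ (the paper's modified sandwich divergence is a new object and is not reproduced here):

```latex
% Classical R\'enyi divergence:
D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log \sum_x P(x)^{\alpha} Q(x)^{1-\alpha}
% Petz quantum R\'enyi divergence:
\bar{D}_\alpha(\rho\|\sigma) = \frac{1}{\alpha-1}\log
  \operatorname{Tr}\bigl[\rho^{\alpha}\sigma^{1-\alpha}\bigr]
% Sandwiched quantum R\'enyi divergence:
\widetilde{D}_\alpha(\rho\|\sigma) = \frac{1}{\alpha-1}\log
  \operatorname{Tr}\Bigl[\bigl(\sigma^{\frac{1-\alpha}{2\alpha}}
  \rho\,\sigma^{\frac{1-\alpha}{2\alpha}}\bigr)^{\alpha}\Bigr]
```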
Authors: Maria Dziuba, Valentin Malykh
Abstract: Structured code comments in docstring format are essential for code comprehension and maintenance, but existing machine learning models for their generation perform poorly for Russian compared to English. To bridge this gap, we present StRuCom - the first large-scale dataset (153K examples) specifically designed for Russian code documentation. Unlike machine-translated English datasets that distort terminology (e.g., technical loanwords vs. literal translations) and docstring structures, StRuCom combines human-written comments from Russian GitHub repositories with synthetically generated ones, ensuring compliance with Python, Java, JavaScript, C#, and Go standards through automated validation. Fine-tuning Qwen2.5-Coder models (0.5B-7B) on StRuCom shows statistically significant improvements in chrF++ and BERTScore over baseline models.
Authors: Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Elisabeth Victoria Goessinger, Hanna Lindemann, Marie Bargiela, Marie Hofbauer, Omar Badri, Philipp Tschandl, Arash Koochek, Matthew Groh, Alexander A. Navarini, Marc Pouly
Abstract: Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (22%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and adopts typical ranking metrics mirroring real audit workflows. Benchmarking classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, and SelfClean, we find that, on CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and label-error detection remains an open challenge for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies and paves the way for more reliable data-centric artificial intelligence.
Authors: Adrian Kazakov (INAF-IAPS, Rome, Italy), Anna Milillo (INAF-IAPS, Rome, Italy), Alessandro Mura (INAF-IAPS, Rome, Italy), Stavro Ivanovski (INAF-Osservatorio Astronomico di Trieste, Trieste, Italy), Valeria Mangano (INAF-IAPS, Rome, Italy), Alessandro Aronica (INAF-IAPS, Rome, Italy), Elisabetta De Angelis (INAF-IAPS, Rome, Italy), Pier Paolo Di Bartolomeo (INAF-IAPS, Rome, Italy), Alessandro Brin (INAF-IAPS, Rome, Italy), Luca Colasanti (INAF-IAPS, Rome, Italy), Miguel Escalona-Moran (Augmented Intelligence Lab, Salceda de Caselas, Spain), Francesco Lazzarotto (INAF-Osservatorio Astronomico di Padova, Padova, Italy), Stefano Massetti (INAF-IAPS, Rome, Italy), Martina Moroni (INAF-IAPS, Rome, Italy), Raffaella Noschese (INAF-IAPS, Rome, Italy), Fabrizio Nuccilli (INAF-IAPS, Rome, Italy), Stefano Orsini (INAF-IAPS, Rome, Italy), Christina Plainaki (ASI - Italian Space Agency, Rome, Italy), Rosanna Rispoli (INAF-IAPS, Rome, Italy), Roberto Sordini (INAF-IAPS, Rome, Italy), Mirko Stumpo (INAF-IAPS, Rome, Italy), Nello Vertolli (INAF-IAPS, Rome, Italy)
Abstract: Surface information derived from exospheric measurements at planetary bodies complements surface mapping provided by dedicated imagers, offering critical insights into surface release processes, interactions within the planetary environment, space weathering, and planetary evolution. This study explores the feasibility of deriving Mercury's regolith elemental composition from in-situ measurements of its neutral exosphere using deep neural networks (DNNs). We present a supervised feed-forward DNN architecture - a multilayer perceptron (MLP) - that, starting from exospheric densities and proton precipitation fluxes, predicts the chemical elements of the surface regolith below. It serves as an estimator for the surface-exosphere interaction and the processes leading to exosphere formation. Because the DNN requires a comprehensive exospheric dataset not available from previous missions, this study uses simulated exosphere components and simulated drivers. Extensive training and testing campaigns demonstrate the MLP's ability to accurately predict and reconstruct surface composition maps from these simulated measurements. Although this initial version does not aim to reproduce Mercury's actual surface composition, it provides a proof of concept, showcasing the algorithm's robustness and capacity for handling complex datasets to create estimators for exospheric generation models. Moreover, our tests reveal substantial potential for further development, suggesting that this method could significantly enhance the analysis of complex surface-exosphere interactions and complement planetary exosphere models. This work anticipates applying the approach to data from the BepiColombo mission, specifically the SERENA package, whose nominal phase begins in 2027.
Authors: Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer
Abstract: Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.
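The core trick, BLEU as a drop-in reward given reference outputs, is nearly a one-liner; a sketch using sacrebleu (the GRPO training loop itself is not shown).

```python
import sacrebleu

def bleu_rewards(completions, references):
    """Score each sampled completion against its reference output; the
    0-100 BLEU value is rescaled to [0, 1] for use as an RL reward."""
    return [sacrebleu.sentence_bleu(c, [r]).score / 100.0
            for c, r in zip(completions, references)]

print(bleu_rewards(["the cat sat on the mat"], ["the cat sat on a mat"]))
```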
Authors: Yiran Yang, Rui Chen
Abstract: In this paper, we consider a score-based Integer Programming (IP) approach for solving the Bayesian Network Structure Learning (BNSL) problem. State-of-the-art BNSL IP formulations suffer from the exponentially large number of variables and constraints. A standard approach in IP to address such challenges is to employ row and column generation techniques, which dynamically generate rows and columns, while the complex pricing problem remains a computational bottleneck for BNSL. For the general class of $\ell_0$-penalized likelihood scores, we show how the pricing problem can be reformulated as a difference of submodular optimization problem, and how the Difference of Convex Algorithm (DCA) can be applied as an inexact method to efficiently solve the pricing problems. Empirically, we show that, for continuous Gaussian data, our row and column generation approach yields solutions with higher quality than state-of-the-art score-based approaches, especially when the graph density increases, and achieves comparable performance against benchmark constraint-based and hybrid approaches, even when the graph size increases.
Authors: Florinel-Alin Croitoru, Vlad Hondru, Marius Popescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah
Abstract: We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: https://huggingface.co/datasets/unibuc-cs/MAVOS-DD.
Authors: Jiabei Cheng, Zhen-Qun Yang, Jiannong Cao, Yu Yang, Xinzhe Zheng
Abstract: Timely prediction of students at high risk of dropout is critical for early intervention and improving educational outcomes. However, in offline educational settings, poor data quality, limited scale, and high heterogeneity often hinder the application of advanced machine learning models. Furthermore, while educational theories provide valuable insights into dropout phenomena, the lack of quantifiable metrics for key indicators limits their use in data-driven modeling. Through data analysis and a review of educational literature, we identified abrupt changes in student behavior as key early signals of dropout risk. To address this, we propose the Dual-Modal Multiscale Sliding Window (DMSW) Model, which integrates academic performance and behavioral data to dynamically capture behavior patterns using minimal data. The DMSW model improves prediction accuracy by 15% compared to traditional methods, enabling educators to identify high-risk students earlier, provide timely support, and foster a more inclusive learning environment. Our analysis highlights key behavior patterns, offering practical insights for preventive strategies and tailored support. These findings bridge the gap between theory and practice in dropout prediction, giving educators an innovative tool to enhance student retention and outcomes.
Authors: Makoto Yamada, Kian Ming A. Chai, Ayoub Rhim, Satoki Ishikawa, Mohammad Sabokrou, Yao-Hung Hubert Tsai
Abstract: Recent advances in self-supervised learning (SSL) have revolutionized computer vision through innovative architectures and learning objectives, yet they have not fully leveraged insights from biological visual processing systems. Recently, a brain-inspired SSL model named PhiNet was proposed; it is based on a ResNet backbone and operates on static image inputs with strong augmentation. In this paper, we introduce PhiNet v2, a novel Transformer-based architecture that processes temporal visual input (that is, sequences of images) without relying on strong augmentation. Our model leverages variational inference to learn robust visual representations from continuous input streams, similar to human visual processing. Through extensive experimentation, we demonstrate that PhiNet v2 achieves competitive performance compared to state-of-the-art vision foundation models, while maintaining the ability to learn from sequential input without strong data augmentation. This work represents a significant step toward more biologically plausible computer vision systems that process visual information in a manner more closely aligned with human cognitive processes.
Authors: Patrick Stöckermann, Henning Südfeld, Alessandro Immordino, Thomas Altenmüller, Marc Wegmann, Martin Gebser, Konstantin Schekotihin, Georg Seidel, Chew Wye Chan, Fei Fei Zhang
Abstract: Benchmark datasets are crucial for evaluating approaches to scheduling or dispatching in the semiconductor industry during the development and deployment phases. However, commonly used benchmark datasets like the Minifab or SMT2020 lack the complex details and constraints found in real-world scenarios. To mitigate this shortcoming, we compare open-source simulation models with a real industry dataset to evaluate how optimization methods scale with different levels of complexity. Specifically, we focus on Reinforcement Learning methods, performing optimization based on policy-gradient and Evolution Strategies. Our research provides insights into the effectiveness of these optimization methods and their applicability to realistic semiconductor frontend fab simulations. We show that our proposed Evolution Strategies-based method scales much better than a comparable policy-gradient-based approach. Moreover, we identify the selection and combination of relevant bottleneck tools to control by the agent as crucial for an efficient optimization. For the generalization across different loading scenarios and stochastic tool failure patterns, we achieve advantages when utilizing a diverse training dataset. While the overall approach is computationally expensive, it manages to scale well with the number of CPU cores used for training. For the real industry dataset, we achieve an improvement of up to 4% regarding tardiness and up to 1% regarding throughput. For the less complex open-source models Minifab and SMT2020, we observe double-digit percentage improvements in tardiness and single-digit percentage improvements in throughput through the use of Evolution Strategies.
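A minimal sketch of one Evolution Strategies update in the OpenAI-ES style, assuming a flat parameter vector and a scalar fitness (e.g., negative tardiness from a simulation run); the paper's exact ES variant may differ.

```python
import numpy as np

def es_step(theta, fitness, pop=32, sigma=0.1, lr=0.01, rng=None):
    """Perturb parameters with Gaussian noise, score each perturbation,
    and move along the reward-weighted average noise direction."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal((pop, theta.size))
    rewards = np.array([fitness(theta + sigma * e) for e in eps])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return theta + lr / (pop * sigma) * eps.T @ adv
```

The pop rollouts are independent, which is what makes this style of ES embarrassingly parallel and consistent with the abstract's observation that the approach scales with the number of CPU cores.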
Authors: William R. P. Denault
Abstract: Sparse linear regression is a fundamental tool in data analysis. However, traditional approaches often fall short when covariates exhibit structure or arise from heterogeneous sources. In biomedical applications, covariates may stem from distinct modalities or be structured according to an underlying graph. We introduce Neural Adaptive Shrinkage (Nash), a unified framework that integrates covariate-specific side information into sparse regression via neural networks. Nash adaptively modulates penalties on a per-covariate basis, learning to tailor regularization without cross-validation. We develop a variational inference algorithm for efficient training and establish connections to empirical Bayes regression. Experiments on real data demonstrate that Nash can improve accuracy and adaptability over existing methods.
Authors: Jacob Trauger, Ambuj Tewari
Abstract: Probabilistic next-token prediction trained using cross-entropy loss is the basis of most large language models. Given a sequence of previous values, next-token prediction assigns a probability to each possible next value in the vocabulary. There are many ways to use next-token prediction to output token sequences. This paper examines a few of these algorithms (greedy, lookahead, random sampling, and temperature-scaled random sampling) and studies their consistency with respect to various goals encoded as loss functions. Although consistency of surrogate losses with respect to a target loss function is a well-researched topic, we are the first to study it in the context of LLMs (to the best of our knowledge). We find that, so long as next-token prediction converges to its true probability distribution, random sampling is consistent with outputting sequences that mimic sampling from the true probability distribution. For the other goals, such as minimizing the 0-1 loss on the entire sequence, we show that no polynomial-time algorithm is optimal for all probability distributions, and that all decoding algorithms studied are optimal only for a subset of probability distributions. When analyzing these results, we see that there is a dichotomy created between the goals of information retrieval and creative generation for the decoding algorithms. This shows that choosing the correct decoding algorithm based on the desired goal is extremely important and many of the ones used are lacking theoretical grounding in numerous scenarios.
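For concreteness, a sketch of one of the decoding rules studied: temperature-scaled sampling over a single next-token distribution, which recovers greedy decoding as the temperature goes to zero.

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=np.random.default_rng(0)):
    """Temperature-scaled sampling from next-token logits; temperature=0
    is treated as greedy (argmax) decoding."""
    if temperature == 0:
        return int(np.argmax(logits))
    z = logits / temperature
    p = np.exp(z - z.max())     # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))
```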
Authors: Francesco Sovrano
Abstract: Generative AI systems can help spread information but also misinformation and biases, potentially undermining the UN Sustainable Development Goals (SDGs). Explainable AI (XAI) aims to reveal the inner workings of AI systems and expose misbehaviours or biases. However, current XAI tools, built for simpler models, struggle to handle the non-numerical nature of large language models (LLMs). This paper examines the effectiveness of global XAI methods, such as rule-extraction algorithms and SHAP, in detecting bias in LLMs. To do so, we first present a text-to-ordinal mapping strategy to convert non-numerical inputs/outputs into numerical features, enabling these tools to identify (some) misinformation-related biases in LLM-generated content. Then, we inject non-linear biases of varying complexity (univariate, conjunctive, and non-convex) into widespread LLMs like ChatGPT and Llama via system instructions, using global XAI methods to detect them. In this way, we find that RuleFit struggles with conjunctive and non-convex biases, while SHAP can approximate conjunctive biases but cannot express them as actionable rules. Hence, we introduce RuleSHAP, a global rule extraction algorithm combining SHAP and RuleFit to detect more non-univariate biases, improving injected bias detection over RuleFit by +94% (MRR@1) on average.
Authors: Chris Köcher, Alexander Kozachinskiy, Anthony Widjaja Lin, Marco Sälzer, Georg Zetzsche
Abstract: Positional Encodings (PEs) seem to be indispensable for ensuring expressiveness of transformers; without them attention transformers reduce to a bag-of-word model. NoPE-transformers (i.e. with No PEs) with unique hard attention mechanisms were very recently shown to only be able to express regular languages, i.e., with limited counting ability. This paper shows that, with average hard attention mechanisms, NoPE-transformers are still surprisingly expressive: they can express counting languages corresponding to nonnegative integer solutions to multivariate polynomial equations (i.e. Diophantine equations), reasoning about which is well-known to be undecidable. In fact, we provide a precise characterization of languages expressible by Average Hard Attention NoPE-Transformers (NoPE-AHATs): they correspond precisely to what we call \emph{semi-algebraic sets}, i.e., finite unions of sets of nonnegative integer solutions to systems of multivariate polynomial inequations. We obtain several interesting consequences of our characterization. Firstly, NoPE-transformers can express counting properties that are far more complex than established models like simplified counter machines and Petri nets, but cannot express a very simple counting property of PARITY. Secondly, the problem of analyzing NoPE-transformers is undecidable, e.g., whether a given NoPE transformer classifies all input strings in one class. To complement our results, we exhibit a counting language that is not expressible by average hard attention transformers even with arbitrary PEs but is expressible in the circuit complexity class TC$^0$, answering an open problem.
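An illustrative instance of the characterization (assuming the abstract's statement): a single polynomial constraint over letter counts yields a semi-algebraic Parikh image and is thus expressible, while PARITY is not.

```latex
L \;=\; \{\, w \in \{a,b\}^{*} : \#_a(w)^{2} = \#_b(w) \,\}
\quad\text{(semi-algebraic, hence NoPE-AHAT-expressible)},
\qquad
\mathrm{PARITY} \;=\; \{\, w : \#_a(w) \equiv 0 \ (\mathrm{mod}\ 2) \,\}
\quad\text{(not expressible).}
```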
Authors: Xihuai Wang, Ziyi Zhao, Siyu Ren, Shao Zhang, Song Li, Xiaoyu Li, Ziwen Wang, Lin Qiu, Guanglu Wan, Xuezhi Cao, Xunliang Cai, Weinan Zhang
Abstract: Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS system evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus dataset (ATT-Corpus) paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also fine-tune Qwen2-Audio-Instruct with human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in the ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).
URLs: https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4
Authors: Dongjun Kim, Junwoo Park, Chaehyeon Shin, Jaeheon Jung, Kyungho Shin, Seungheon Baek, Sanghyuk Heo, Woongrae Kim, Inchul Jeong, Joohwan Cho, Jongsun Park
Abstract: Analog/mixed-signal circuit design encounters significant challenges due to performance degradation from process, voltage, and temperature (PVT) variations. To achieve commercial-grade reliability, iterative manual design revisions and extensive statistical simulations are required. While several studies have aimed to automate variation-aware analog design to reduce time-to-market, the substantial mismatches in real-world wafers have not been thoroughly addressed. In this paper, we present GLOVA, an analog circuit sizing framework that effectively manages the impact of diverse random mismatches to improve robustness against PVT variations. In the proposed approach, risk-sensitive reinforcement learning is leveraged to account for the reliability bound affected by PVT variations, and an ensemble-based critic is introduced to achieve sample-efficient learning. For design verification, we also propose a $\mu$-$\sigma$ evaluation and simulation-reordering method to reduce the simulation costs of identifying failed designs. GLOVA supports verification through industrial-level PVT variation evaluation methods, including corner simulation as well as global and local Monte Carlo (MC) simulations. Compared to previous state-of-the-art variation-aware analog sizing frameworks, GLOVA achieves up to 80.5$\times$ improvement in sample efficiency and 76.0$\times$ reduction in time.
Authors: Chengyu Huang, Zhengxin Zhang, Claire Cardie
Abstract: While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experiment results demonstrate that HAPO effectively induces LLMs' concise reasoning abilities, producing length reductions of 33-59% with accuracy drops of only 2-5%.
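A hedged sketch of what a history-aware length reward could look like (not the paper's exact function): the history state tracks the shortest previously correct response, correct-and-shorter answers earn a bonus, and incorrect answers are not penalized for being short.

```python
def hapo_length_reward(correct, length, hist_min):
    """Toy history-aware reward. hist_min is the minimum length over
    previously generated correct responses for this problem (None if
    no correct response has been seen yet)."""
    if not correct:
        return 0.0                      # no extra penalty for short misses
    if hist_min is None or length < hist_min:
        return 1.5                      # correctness + new-best-length bonus
    return 1.0 - 0.5 * min(1.0, (length - hist_min) / hist_min)

# Training would update hist_min = min(hist_min, length) after each correct
# response, tightening the conciseness target over time.
```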
Authors: Zhangying Feng, Qianglong Chen, Ning Lu, Yongqian Li, Siqi Cheng, Shuangmu Peng, Duyu Tang, Shengcai Liu, Zhirui Zhang
Abstract: The development of reasoning capabilities represents a critical frontier in large language models (LLMs) research, where reinforcement learning (RL) and process reward models (PRMs) have emerged as predominant methodological frameworks. Contrary to conventional wisdom, empirical evidence from DeepSeek-R1 demonstrates that pure RL training focused on mathematical problem-solving can progressively enhance reasoning abilities without PRM integration, challenging the perceived necessity of process supervision. In this study, we conduct a systematic investigation of the relationship between RL training and PRM capabilities. Our findings demonstrate that problem-solving proficiency and process supervision capabilities represent complementary dimensions of reasoning that co-evolve synergistically during pure RL training. In particular, current PRMs underperform simple baselines like majority voting when applied to state-of-the-art models such as DeepSeek-R1 and QwQ-32B. To address this limitation, we propose Self-PRM, an introspective framework in which models autonomously evaluate and rerank their generated solutions through self-reward mechanisms. Although Self-PRM consistently improves benchmark accuracy (particularly with larger sample sizes), analysis exposes persistent challenges: the approach exhibits low precision (<10%) on difficult problems, frequently misclassifying flawed solutions as valid. These analyses underscore the need for continued RL scaling to improve reward alignment and introspective accuracy. Overall, our findings suggest that PRMs may not be essential for enhancing complex reasoning, as pure RL not only improves problem-solving skills but also inherently fosters robust PRM capabilities. We hope these findings provide actionable insights for building more reliable and self-aware complex reasoning models.
Authors: Derrick Gilchrist Edward Manoharan, Anubha Goel, Alexandros Iosifidis, Henri Hansen, Juho Kanniainen
Abstract: The spreading dynamics in social networks are often studied under the assumption that individuals' statuses, whether informed or infected, are fully observable. However, in many real-world situations, such statuses remain unobservable, which is crucial for determining an individual's potential to further spread the infection. While this final status is hidden, intermediate indicators such as symptoms of infection are observable and provide important insights into the spread process. We propose a partial observability-aware Machine Learning framework to learn the characteristics of the spreading model. We term the method Distribution Classification, which utilizes the power of classifiers to infer the underlying transmission dynamics. We evaluate our method on two types of synthetic networks and extend the study to a real-world insider trading network. Results show that the method performs well, especially on complex networks with high cyclic connectivity, supporting its utility in analyzing real-world spreading phenomena where direct observation of individual statuses is not possible.
Authors: Wenhao Qian, Zhenzhen Hu, Zijie Song, Jia Li
Abstract: Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces \textbf{C}oncept \textbf{D}rift \textbf{G}uided \textbf{L}ayerNorm \textbf{T}uning (\textbf{CDGLT}), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts the feature extraction and fusion of pre-trained language models to the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available: \href{https://github.com/Qianvenh/CDGLT}{https://github.com/Qianvenh/CDGLT}.
URLs: https://github.com/Qianvenh/CDGLT
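The Concept Drift mechanism rests on spherical linear interpolation between cross-modal CLIP embeddings. SLERP itself is standard and can be sketched as follows; the interpolation weight t=0.5 and the tensor names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation between two embedding vectors."""
    a_n, b_n = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    cos = (a_n * b_n).sum(dim=-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    omega = torch.acos(cos)                    # angle between the embeddings
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

# e.g. a drifted concept between CLIP image and text embeddings:
# drifted = slerp(image_emb, text_emb, t=0.5)   # t is a hypothetical setting
```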
Authors: Mingxing Peng, Yuting Xie, Xusen Guo, Ruoyu Yao, Hai Yang, Jun Ma
Abstract: Ensuring the safety and robustness of autonomous driving systems necessitates a comprehensive evaluation in safety-critical scenarios. However, these safety-critical scenarios are rare and difficult to collect from real-world driving data, posing significant challenges to effectively assessing the performance of autonomous vehicles. Existing methods typically suffer from limited controllability and a lack of user-friendliness, as they essentially require extensive expert knowledge. To address these challenges, we propose LD-Scene, a novel framework that integrates Large Language Models (LLMs) with Latent Diffusion Models (LDMs) for user-controllable adversarial scenario generation through natural language. Our approach comprises an LDM that captures realistic driving trajectory distributions and an LLM-based guidance module that translates user queries into adversarial loss functions, facilitating the generation of scenarios aligned with user queries. The guidance module integrates an LLM-based Chain-of-Thought (CoT) code generator and an LLM-based code debugger, enhancing the controllability and robustness in generating guidance functions. Extensive experiments conducted on the nuScenes dataset demonstrate that LD-Scene achieves state-of-the-art performance in generating realistic, diverse, and effective adversarial scenarios. Furthermore, our framework provides fine-grained control over adversarial behaviors, thereby facilitating more effective testing tailored to specific driving scenarios.
Authors: Giulia Bertazzini, Daniele Baracchi, Dasara Shullani, Isao Echizen, Alessandro Piva
Abstract: The remarkable ease of use of diffusion models for image generation has led to a proliferation of synthetic content online. While these models are often employed for legitimate purposes, they are also used to generate fake images that support misinformation and hate speech. Consequently, it is crucial to develop robust tools capable of detecting whether an image has been generated by such models. Many current detection methods, however, require large volumes of sample images for training. Unfortunately, due to the rapid evolution of the field, existing datasets often cover only a limited range of models and quickly become outdated. In this work, we introduce DRAGON, a comprehensive dataset comprising images from 25 diffusion models, spanning both recent advancements and older, well-established architectures. The dataset contains a broad variety of images representing diverse subjects. To enhance image realism, we propose a simple yet effective pipeline that leverages a large language model to expand input prompts, thereby generating more diverse and higher-quality outputs, as evidenced by improvements in standard quality metrics. The dataset is provided in multiple sizes (ranging from extra-small to extra-large) to accommodate different research scenarios. DRAGON is designed to support the forensic community in developing and evaluating detection and attribution techniques for synthetic content. Additionally, the dataset is accompanied by a dedicated test set, intended to serve as a benchmark for assessing the performance of newly developed methods.
Authors: Gabriele Iommazzo, David Mart\'inez-Rubio, Francisco Criado, Elias Wirth, Sebastian Pokutta
Abstract: We study the linear convergence of Frank-Wolfe algorithms over product polytopes. We analyze two condition numbers for the product polytope, namely the \emph{pyramidal width} and the \emph{vertex-facet distance}, based on the condition numbers of individual polytope components. As a result, for convex objectives that are $\mu$-Polyak-{\L}ojasiewicz, we show linear convergence rates quantified in terms of the resulting condition numbers. We apply our results to the problem of approximately finding a feasible point in a polytope intersection in high dimensions, and demonstrate the practical efficiency of our algorithms through empirical results.
Authors: Camille Couturier, Spyros Mastorakis, Haiying Shen, Saravan Rajmohan, Victor R\"uhle
Abstract: Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth. This paper introduces a novel semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, critical for real-time AI assistants.
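A cache of this kind can be pictured as a similarity-keyed store over query embeddings whose values are reusable context summaries. The class below is a minimal hypothetical sketch (the threshold and the linear scan are for clarity), not the paper's implementation.

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: reuse a stored summary when a new query
    embedding is sufficiently similar to a cached one."""

    def __init__(self, threshold: float = 0.9):
        self.keys, self.values, self.threshold = [], [], threshold

    def get(self, query_emb: np.ndarray):
        for k, v in zip(self.keys, self.values):
            sim = k @ query_emb / (np.linalg.norm(k) * np.linalg.norm(query_emb))
            if sim >= self.threshold:
                return v              # cache hit: reuse the contextual summary
        return None                   # cache miss: summarize, then put()

    def put(self, query_emb: np.ndarray, summary: str):
        self.keys.append(query_emb)
        self.values.append(summary)
```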
Authors: Fabian Falck, Teodora Pandeva, Kiarash Zahirnia, Rachel Lawrence, Richard Turner, Edward Meeds, Javier Zazo, Sushrut Karmalkar
Abstract: Diffusion models are state-of-the-art generative models on data modalities such as images, audio, proteins and materials. These modalities share the property of exponentially decaying variance and magnitude in the Fourier domain. Under the standard Denoising Diffusion Probabilistic Models (DDPM) forward process of additive white noise, this property results in high-frequency components being corrupted faster and earlier in terms of their Signal-to-Noise Ratio (SNR) than low-frequency ones. The reverse process then generates low-frequency information before high-frequency details. In this work, we study the inductive bias of the forward process of diffusion models in Fourier space. We theoretically analyse and empirically demonstrate that the faster noising of high-frequency components in DDPM results in violations of the normality assumption in the reverse process. Our experiments show that this leads to degraded generation quality of high-frequency components. We then study an alternate forward process in Fourier space which corrupts all frequencies at the same rate, removing the typical frequency hierarchy during generation, and demonstrate marked performance improvements on datasets where high frequencies are primary, while performing on par with DDPM on standard imaging benchmarks.
Authors: Yuejiang Wen, Paul D. Franzon
Abstract: Bayesian Optimization (BO) in high-dimensional spaces remains fundamentally limited by the curse of dimensionality and the rigidity of global low-dimensional assumptions. While Random EMbedding Bayesian Optimization (REMBO) mitigates this via linear projections into low-dimensional subspaces, it typically assumes a single global embedding and a stationary objective. In this work, we introduce Self-Adaptive embedding REMBO (SA-REMBO), a novel framework that generalizes REMBO to support multiple random Gaussian embeddings, each capturing a different local subspace structure of the high-dimensional objective. An index variable governs the embedding choice and is jointly modeled with the latent optimization variable via a product kernel in a Gaussian Process surrogate. This enables the optimizer to adaptively select embeddings conditioned on location, effectively capturing locally varying effective dimensionality, nonstationarity, and heteroscedasticity in the objective landscape. We theoretically analyze the expressiveness and stability of the index-conditioned product kernel and empirically demonstrate the advantage of our method across synthetic and real-world high-dimensional benchmarks, where traditional REMBO and other low-rank BO methods fail. Our results establish SA-REMBO as a powerful and flexible extension for scalable BO in complex, structured design spaces.
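The index-conditioned product kernel can be illustrated as an RBF kernel over the latent optimization variable multiplied by a simple categorical kernel over the embedding index. This is a hedged sketch of the construction; the paper's exact kernel and hyperparameters may differ.

```python
import numpy as np

def product_kernel(z1, i1, z2, i2, ls=1.0, tau=0.5):
    """k((z, i), (z', i')) = k_RBF(z, z') * k_index(i, i')."""
    k_z = np.exp(-np.sum((z1 - z2) ** 2) / (2 * ls ** 2))  # latent-variable kernel
    k_i = 1.0 if i1 == i2 else tau                         # categorical kernel over embeddings
    return k_z * k_i
```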
Authors: Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank R\"oder, Tianhe Yu, Zhanpeng He, K. R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro
Abstract: Meta-World is widely used for evaluating multi-task and meta-reinforcement learning agents, which are challenged to master diverse skills simultaneously. Since its introduction, however, there have been numerous undocumented changes that inhibit fair comparison of algorithms. This work strives to disambiguate these results from the literature, while also leveraging the past versions of Meta-World to provide insights into multi-task and meta-reinforcement learning benchmark design. Through this process we release a new open-source version of Meta-World (https://github.com/Farama-Foundation/Metaworld/) that has full reproducibility of past results, is more technically ergonomic, and gives users more control over the tasks that are included in a task set.
Authors: Ardian Selmonaj, Alessandro Antonucci, Adrian Schneider, Michael R\"uegsegger, Matthias Sommer
Abstract: Artificial intelligence (AI) is reshaping strategic planning, with Multi-Agent Reinforcement Learning (MARL) enabling coordination among autonomous agents in complex scenarios. However, its practical deployment in sensitive military contexts is constrained by the lack of explainability, which is an essential factor for trust, safety, and alignment with human strategies. This work reviews and assesses current advances in explainability methods for MARL with a focus on simulated air combat scenarios. We proceed by adapting various explainability techniques to different aerial combat scenarios to gain explanatory insights about the model behavior. By linking AI-generated tactics with human-understandable reasoning, we emphasize the need for transparency to ensure reliable deployment and meaningful human-machine interaction. By illuminating the crucial importance of explainability in advancing MARL for operational defense, our work supports not only strategic planning but also the training of military personnel with insightful and comprehensible analyses.
Authors: Chin-Yun Yu, Marco A. Mart\'inez-Ram\'irez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, Gy\"orgy Fazekas
Abstract: Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can lead to unrealistic or biased results. We address this pitfall by introducing a Gaussian prior derived from a vocal preset dataset, DiffVox, over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The proposed calibration reduces parameter mean squared error by up to 33% and matches the reference style better. Subjective evaluations with 16 participants confirm our method's superiority, especially in limited data regimes. This work demonstrates how incorporating prior knowledge at inference time enhances audio effects transfer, paving the way for more effective and realistic audio processing systems.
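Conceptually, the calibrated objective augments the style-embedding distance with a Gaussian negative log-prior over the effect parameters (a Mahalanobis distance to the preset distribution), which turns the optimisation into MAP estimation. Below is a minimal sketch under assumed names (`apply_fx`, `embed`, `lam`); it is not the authors' implementation.

```python
import torch

def map_objective(params, apply_fx, embed, ref_emb, mu, cov_inv, lam=1.0):
    """Style distance plus Gaussian negative log-prior (MAP estimation)."""
    styled = apply_fx(params)                    # process audio with these effect parameters
    style_dist = 1.0 - torch.cosine_similarity(embed(styled), ref_emb, dim=-1)
    diff = params - mu                           # prior fitted on a vocal preset dataset
    neg_log_prior = 0.5 * diff @ cov_inv @ diff  # Mahalanobis distance to the prior mean
    return style_dist + lam * neg_log_prior
```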
Authors: Donald Loveland, Mingxuan Ju, Tong Zhao, Neil Shah, Danai Koutra
Abstract: Collaborative filtering (CF) enables large-scale recommendation systems by encoding information from historical user-item interactions into dense ID-embedding tables. However, as embedding tables grow, closed-form solutions become impractical, often necessitating the use of mini-batch gradient descent for training. Despite extensive work on designing loss functions to train CF models, we argue that one core component of these pipelines is heavily overlooked: weight decay. Attaining high-performing models typically requires careful tuning of weight decay, regardless of loss, yet its necessity is not well understood. In this work, we question why weight decay is crucial in CF pipelines and how it impacts training. Through theoretical and empirical analysis, we surprisingly uncover that weight decay's primary function is to encode popularity information into the magnitudes of the embedding vectors. Moreover, we find that tuning weight decay acts as a coarse, non-linear knob to influence preference towards popular or unpopular items. Based on these findings, we propose PRISM (Popularity-awaRe Initialization Strategy for embedding Magnitudes), a straightforward yet effective solution to simplify the training of high-performing CF models. PRISM pre-encodes the popularity information typically learned through weight decay, eliminating its necessity. Our experiments show that PRISM improves performance by up to 4.77% and reduces training times by 38.48%, compared to state-of-the-art training strategies. Additionally, we parameterize PRISM to modulate the initialization strength, offering a cost-effective and meaningful strategy to mitigate popularity bias.
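A minimal sketch of the idea, assuming popularity enters as log interaction counts rescaled into embedding magnitudes; the names and the exact scaling are illustrative, not the paper's recipe.

```python
import torch

def prism_init(num_items: int, dim: int, counts: torch.Tensor) -> torch.Tensor:
    """Pre-encode popularity into embedding magnitudes at initialization."""
    emb = torch.randn(num_items, dim)
    emb = emb / emb.norm(dim=1, keepdim=True)   # unit-norm random directions
    mags = torch.log1p(counts.float())          # popularity -> magnitude (0 for unseen items)
    mags = mags / mags.mean().clamp_min(1e-8)   # keep the overall scale near 1
    return emb * mags.unsqueeze(1)
```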
Authors: Haowei Wang, Jingyi Wang, Zhongxiang Dai, Nai-Yuan Chiang, Szu Hui Ng, Cosmin G. Petra
Abstract: Constrained Bayesian optimization (CBO) methods have seen significant success in black-box optimization with constraints, and one of the most commonly used CBO methods is the constrained expected improvement (CEI) algorithm. CEI is a natural extension of the expected improvement (EI) when constraints are incorporated. However, the theoretical convergence rate of CEI has not been established. In this work, we study the convergence rate of CEI by analyzing its simple regret upper bound. First, we show that when the objective function $f$ and constraint function $c$ are assumed to each lie in a reproducing kernel Hilbert space (RKHS), CEI achieves the convergence rates of $\mathcal{O} \left(t^{-\frac{1}{2}}\log^{\frac{d+1}{2}}(t) \right) \ \text{and }\ \mathcal{O}\left(t^{\frac{-\nu}{2\nu+d}} \log^{\frac{\nu}{2\nu+d}}(t)\right)$ for the commonly used squared exponential and Mat\'{e}rn kernels, respectively. Second, we show that when $f$ and $c$ are assumed to be sampled from Gaussian processes (GPs), CEI achieves the same convergence rates with a high probability. Numerical experiments are performed to validate the theoretical analysis.
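For reference, the standard form of the CEI acquisition multiplies EI by the posterior probability of feasibility. With GP posterior means $\mu_f, \mu_c$, standard deviations $\sigma_f, \sigma_c$, constraint $c(x) \le 0$, and $f^*$ the best feasible value observed so far (minimization convention): $$\mathrm{CEI}(x) = \Big[(f^* - \mu_f(x))\,\Phi(z) + \sigma_f(x)\,\phi(z)\Big] \cdot \Phi\!\left(\frac{-\mu_c(x)}{\sigma_c(x)}\right), \qquad z = \frac{f^* - \mu_f(x)}{\sigma_f(x)},$$ where $\Phi$ and $\phi$ denote the standard normal CDF and PDF.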
Authors: Raja Gond, Nipun Kwatra, Ramachandran Ramjee
Abstract: Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLINK. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Further, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The computation of one subset is then overlapped with the communication of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce-RMSNorm kernel carefully leveraging Multimem instruction support available on NVIDIA Hopper GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch's computation, providing additional gains. Our evaluations demonstrate up to 29% latency gains and up to 26% throughput gains across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
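The core overlap can be pictured with asynchronous collectives: communicate one token subset while computing the other. The sketch below is heavily simplified; it assumes an initialised torch.distributed process group and omits the wave-aware split and the fused AllReduce-RMSNorm kernel that TokenWeave actually relies on.

```python
import torch
import torch.distributed as dist

def overlapped_layer(x: torch.Tensor, compute):
    """Split tokens into two subsets; overlap subset A's all-reduce with B's compute."""
    a, b = x.chunk(2, dim=0)                    # naive token split (not wave-aware)
    a = compute(a)
    handle = dist.all_reduce(a, async_op=True)  # communicate subset A ...
    b = compute(b)                              # ... while computing subset B
    handle.wait()
    dist.all_reduce(b)                          # subset B's communication
    return torch.cat([a, b], dim=0)
```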
Authors: Rajeeva Laxman Karandikar, Bhamidi Visweswara Rao, Mathukumalli Vidyasagar
Abstract: In this paper, we take a fresh look at stochastic approximation (SA) and Stochastic Gradient Descent (SGD). We derive new sufficient conditions for the convergence of SA. In particular, the "noise" or measurement error need not have a finite second moment, and under suitable conditions, not even a finite mean. By adapting this method of proof, we also derive sufficient conditions for the convergence of zero-order SGD, wherein the stochastic gradient is computed using only two function evaluations, and no gradient computations. The sufficient conditions derived here are the weakest to date, thus leading to a considerable expansion of the applicability of SA and SGD theory.
Authors: Chenyu Huang, Peng Ye, Shenghe Zheng, Xiaohui Wang, Lei Bai, Tao Chen, Wanli Ouyang
Abstract: Transformer-based models with the pretrain-finetune paradigm bring about significant progress, along with the heavy storage and deployment costs of finetuned models on multiple tasks. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights) through pruning or quantization. However, existing methods by default employ the pretrained model as the base model and compress the delta parameters for every task, which may cause significant performance degradation, especially when the compression rate is extremely high. To tackle this issue, we investigate the impact of different base models on the performance of delta compression and find that the pre-trained base model can hardly be optimal. To this end, we propose Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before performing delta compression. Specifically, we adjust two parameters, which respectively determine the magnitude of the base model shift and the overall scale of delta compression, to boost the compression performance on each task. Through low-cost learning of these two parameters, our DBMS can maintain most of the finetuned model's performance even under an extremely high compression ratio setting, significantly surpassing existing methods. Moreover, our DBMS is orthogonal to, and can be integrated with, a variety of other methods; it has been evaluated across different types of models, including language, vision transformer, and multi-modal models.
Authors: Simon Urbainczyk, Aretha L. Teckentrup, Jonas Latz
Abstract: Gaussian processes (GPs) have gained popularity as flexible machine learning models for regression and function approximation with an in-built method for uncertainty quantification. However, GPs suffer when the amount of training data is large or when the underlying function contains multi-scale features that are difficult to represent by a stationary kernel. To address the former, training of GPs with large-scale data is often performed through inducing point approximations (also known as sparse GP regression (GPR)), where the size of the covariance matrices in GPR is reduced considerably through a greedy search on the data set. To aid the latter, deep GPs have gained traction as hierarchical models that resolve multi-scale features by combining multiple GPs. Posterior inference in deep GPs requires a sampling-based or, more usually, a variational approximation. Variational approximations lead to large-scale stochastic, non-convex optimisation problems and the resulting approximation tends to represent uncertainty incorrectly. In this work, we combine variational learning with MCMC to develop a particle-based expectation-maximisation method to simultaneously find inducing points within the large-scale data (variationally) and accurately train the GPs (sampling-based). The result is a highly efficient and accurate methodology for deep GP training on large-scale data. We test our method on standard benchmark problems.
Authors: Ali Rabiee, Sima Ghafoori, MH Farhadi, Robert Beyer, Xiangyu Bai, David J Lin, Sarah Ostadabbas, Reza Abiri
Abstract: Current invasive assistive technologies are designed to infer high-dimensional motor control signals from severely paralyzed patients. However, they face significant challenges, including public acceptance, limited longevity, and barriers to commercialization. Meanwhile, noninvasive alternatives often rely on artifact-prone signals, require lengthy user training, and struggle to deliver robust high-dimensional control for dexterous tasks. To address these issues, this study introduces a novel human-centered multimodal AI approach as intelligent compensatory mechanisms for lost motor functions that could potentially enable patients with severe paralysis to control high-dimensional assistive devices, such as dexterous robotic arms, using limited and noninvasive inputs. In contrast to the current state-of-the-art (SoTA) noninvasive approaches, our context-aware, multimodal shared-autonomy framework integrates deep reinforcement learning algorithms to blend limited low-dimensional user input with real-time environmental perception, enabling adaptive, dynamic, and intelligent interpretation of human intent for complex dexterous manipulation tasks, such as pick-and-place. The results from our ARAS (Adaptive Reinforcement learning for Amplification of limited inputs in Shared autonomy) trained with synthetic users over 50,000 computer simulation episodes demonstrated the first successful implementation of the proposed closed-loop human-in-the-loop paradigm, outperforming the SoTA shared autonomy algorithms. Following a zero-shot sim-to-real transfer, ARAS was evaluated on 23 human subjects, demonstrating high accuracy in dynamic intent detection and smooth, stable 3D trajectory control for dexterous pick-and-place tasks. The ARAS user study achieved a high task success rate of 92.88%, with short completion times comparable to those of SoTA invasive assistive technologies.
Authors: Alistair Carson, Alec Wright, Stefan Bilbao
Abstract: Neural networks have become ubiquitous with guitar distortion effects modelling in recent years. Despite their ability to yield perceptually convincing models, they are susceptible to frequency aliasing when driven by high frequency and high gain inputs. Nonlinear activation functions create both the desired harmonic distortion and unwanted aliasing distortion as the bandwidth of the signal is expanded beyond the Nyquist frequency. Here, we present a method for reducing aliasing in neural models via a teacher-student fine tuning approach, where the teacher is a pre-trained model with its weights frozen, and the student is a copy of this with learnable parameters. The student is fine-tuned against an aliasing-free dataset generated by passing sinusoids through the original model and removing non-harmonic components from the output spectra. Our results show that this method significantly suppresses aliasing for both long short-term memory (LSTM) networks and temporal convolutional networks (TCNs). In the majority of our case studies, the reduction in aliasing was greater than that achieved by two-times oversampling. One side-effect of the proposed method is that harmonic distortion components are also affected. This adverse effect was found to be model-dependent, with the LSTM models giving the best balance between anti-aliasing and preserving the perceived similarity to an analog reference device.
Authors: Alexander Kim, Charlotte Botha
Abstract: For singers of all experience levels, one of the most daunting challenges in learning technical repertoire is navigating placement and vocal register in and around the passaggio (the passage between chest voice and head voice registers). Particularly in pop music, where a single artist may use a variety of timbres and textures to achieve a desired quality, it can be difficult to identify what vocal register within the vocal range a singer is using. This paper presents two methods for classifying vocal registers in an audio signal of male pop music through the analysis of textural features of mel-spectrogram images. Additionally, we discuss the practical integration of these models into vocal analysis tools and introduce concurrently developed software called AVRA, which stands for Automatic Vocal Register Analysis. Our proposed methods achieved consistent classification of vocal register through both Support Vector Machine (SVM) and Convolutional Neural Network (CNN) models, which supports the promise of more robust classification across more voice types and genres of singing.
Authors: Petr Kasalick\'y, Martin Spi\v{s}\'ak, Vojt\v{e}ch Van\v{c}ura, Daniel Bohun\v{e}k, Rodrigo Alves, Pavel Kord\'ik
Abstract: Industry-scale recommender systems face a core challenge: representing entities with high cardinality, such as users or items, using dense embeddings that must be accessible during both training and inference. However, as embedding sizes grow, memory constraints make storage and access increasingly difficult. We describe a lightweight, learnable embedding compression technique that projects dense embeddings into a high-dimensional, sparsely activated space. Designed for retrieval tasks, our method reduces memory requirements while preserving retrieval performance, enabling scalable deployment under strict resource constraints. Our results demonstrate that leveraging sparsity is a promising approach for improving the efficiency of large-scale recommenders. We release our code at https://github.com/recombee/CompresSAE.
Authors: Pouya Shaeri, Ariane Middel
Abstract: Modern neural networks often activate all neurons for every input, leading to unnecessary computation and inefficiency. We introduce Matrix-Interpolated Dropout Layer (MID-L), a novel module that dynamically selects and activates only the most informative neurons by interpolating between two transformation paths via a learned, input-dependent gating vector. Unlike conventional dropout or static sparsity methods, MID-L employs a differentiable Top-k masking strategy, enabling per-input adaptive computation while maintaining end-to-end differentiability. MID-L is model-agnostic and integrates seamlessly into existing architectures. Extensive experiments on six benchmarks, including MNIST, CIFAR-10, CIFAR-100, SVHN, UCI Adult, and IMDB, show that MID-L achieves an average reduction of up to 55\% in active neurons, 1.7$\times$ FLOPs savings, and maintains or exceeds baseline accuracy. We further validate the informativeness and selectivity of the learned neurons via Sliced Mutual Information (SMI) and observe improved robustness under overfitting and noisy data conditions. Additionally, MID-L demonstrates favorable inference latency and memory usage profiles, making it suitable for both research exploration and deployment on compute-constrained systems. These results position MID-L as a general-purpose, plug-and-play dynamic computation layer, bridging the gap between dropout regularization and efficient inference.
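One plausible reading of this mechanism, sketched in PyTorch: a learned gate scores neurons per input, a differentiable Top-k mask is built with the straight-through trick, and the mask interpolates between two linear transformation paths. This is an illustration under assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MIDLayer(nn.Module):
    """Sketch: input-dependent Top-k gating interpolating two paths."""

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.f1, self.f2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        self.k = k

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))                     # input-dependent gate scores
        idx = torch.topk(g, self.k, dim=-1).indices         # most informative neurons
        hard = torch.zeros_like(g).scatter(-1, idx, 1.0)    # hard Top-k mask
        mask = hard + g - g.detach()                        # straight-through estimator
        return mask * self.f1(x) + (1 - mask) * self.f2(x)  # interpolate the two paths
```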
Authors: Patryk Bartkowiak, Michal Podstawski
Abstract: This paper introduces a novel dataset and evaluation benchmark designed to assess and improve small language models deployable on edge devices, with a focus on user profiling from multi-session natural language interactions in smart home environments. At the core of the dataset are structured user profiles, each defined by a set of routines - context-triggered, repeatable patterns of behavior that govern how users interact with their home systems. Using these profiles as input, a large language model (LLM) generates corresponding interaction sessions that simulate realistic, diverse, and context-aware dialogues between users and their devices. The primary task supported by this dataset is profile reconstruction: inferring user routines and preferences solely from interaction history. To assess how well current models can perform this task under realistic conditions, we benchmarked several state-of-the-art compact language models and compared their performance against large foundation models. Our results show that while small models demonstrate some capability in reconstructing profiles, they still fall significantly short of large models in accurately capturing user behavior. This performance gap poses a major challenge, particularly because on-device processing offers critical advantages, such as preserving user privacy, minimizing latency, and enabling personalized experiences without reliance on the cloud. By providing a realistic, structured testbed for developing and evaluating behavioral modeling under these constraints, our dataset represents a key step toward enabling intelligent, privacy-respecting AI systems that learn and adapt directly on user-owned devices.
Authors: Rana Poureskandar, Shiva Razzagzadeh
Abstract: This study evaluated the performance of a YOLOv8-based segmentation model for detecting and segmenting wrinkles in facial images.
Authors: Utsav Rai, Haozheng Xu, Stamatia Giannarou
Abstract: Accurate pose estimation of surgical tools in Robot-assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker-based methods offer accuracy, they face challenges with occlusions, reflections, and tool-specific designs. Similarly, supervised learning methods require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero-shot pose estimation models remain unexplored in RMIS for pose estimation of surgical instruments, creating a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments, leveraging state-of-the-art zero-shot RGB-D models like FoundationPose and SAM-6D. We advanced these models by incorporating vision-based depth estimation using the RAFT-Stereo method, for robust depth estimation in reflective and textureless environments. Additionally, we enhanced SAM-6D by replacing its instance segmentation module, Segment Anything Model (SAM), with a fine-tuned Mask R-CNN, significantly boosting segmentation accuracy in occluded and complex conditions. Extensive validation reveals that our enhanced SAM-6D surpasses FoundationPose in zero-shot pose estimation of unseen surgical instruments, setting a new benchmark for zero-shot RGB-D pose estimation in RMIS. This work enhances the generalisability of pose estimation for unseen objects and pioneers the application of RGB-D zero-shot methods in RMIS.
Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Hoo-Chang Shin, Felipe Soares, Alexander Bukharin, Ellie Evans, Yi Dong, Oleksii Kuchaiev
Abstract: Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We demonstrate HelpSteer3-Preference can also be applied to train Generative RMs and how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): https://huggingface.co/datasets/nvidia/HelpSteer3#preference
Authors: Mingxuan Li, Junzhe Zhang, Elias Bareinboim
Abstract: A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where \emph{unobserved confounding} cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist.
Authors: Takuya Kanazawa
Abstract: The need to collect data via expensive measurements of black-box functions is prevalent across science, engineering and medicine. As an example, hyperparameter tuning of a large AI model is critical to its predictive performance but is generally time-consuming and unwieldy. Bayesian optimization (BO) is a collection of methods that aim to address this issue by means of Bayesian statistical inference. In this work, we put forward a BO scheme named BDC, which integrates BO with a statistical measure of association of two random variables called Distance Correlation. BDC balances exploration and exploitation automatically, and requires no manual hyperparameter tuning. We evaluate BDC on a range of benchmark tests and observe that it performs on par with popular BO methods such as the expected improvement and max-value entropy search. We also apply BDC to the optimization of sequential integral observations of an unknown terrain and confirm its utility.
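Distance correlation itself is a standard statistic; an empirical version for one-dimensional samples, via double-centred pairwise distance matrices, can be computed as below. How BDC folds this measure into its acquisition is specific to the paper and not reproduced here.

```python
import numpy as np

def distance_correlation(x: np.ndarray, y: np.ndarray) -> float:
    """Empirical distance correlation between two 1-D samples of equal length."""
    x, y = x.reshape(-1, 1), y.reshape(-1, 1)
    a, b = np.abs(x - x.T), np.abs(y - y.T)            # pairwise distance matrices
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()  # double centring
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = max((A * B).mean(), 0.0)                   # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return float(np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y + 1e-12)))
```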
Authors: Niclas Kannengie{\ss}er, Niklas Hasebrook, Felix Morsbach, Marc-Andr\'e Z\"oller, J\"org Franke, Marius Lindauer, Frank Hutter, Ali Sunyaev
Abstract: Programmatic hyperparameter optimization (HPO) methods, such as Bayesian optimization and evolutionary algorithms, are highly sample-efficient in identifying optimal hyperparameter configurations for machine learning (ML) models. However, practitioners frequently use less efficient methods, such as grid search, which can lead to under-optimized models. We suspect this behavior is driven by a range of practitioner-specific motives. Practitioner motives, however, still need to be clarified to enhance user-centered development of HPO tools. To uncover practitioner motives to use different HPO methods, we conducted 20 semi-structured interviews and an online survey with 49 ML experts. By presenting main goals (e.g., increase ML model understanding) and contextual factors affecting practitioners' selection of HPO methods (e.g., available computer resources), this study offers a conceptual foundation to better understand why practitioners use different HPO methods, supporting development of more user-centered and context-adaptive HPO tools in automated ML.
Authors: Arpan Dasgupta, Preeti Lamba, Ankita Kushwaha, Kiran Ravish, Siddhant Katyan, Shrutimoy Das, Pawan Kumar
Abstract: Extreme multi-label classification, or XMLC, is an active area of interest in machine learning. Compared to traditional multi-label classification, the number of labels here is extremely large, hence the name. Classical one-versus-all classification does not scale in this setting due to the large number of labels; the same is true for most other classifiers. Embedding labels and features into a lower-dimensional space is a common first step in many XMLC methods. A further issue is the existence of head and tail labels, where tail labels are those that occur in relatively few samples; tail labels create difficulties during embedding. The area has attracted a wide range of approaches, including bit compression motivated by compressed sensing, tree-based embeddings, deep-learning-based latent space embeddings (including the use of attention weights), linear-algebra-based embeddings such as SVD, clustering, and hashing, to name a few. The community has also developed a useful set of metrics to correctly evaluate predictions for head and tail labels.
Authors: Gerard Ben Arous, Reza Gheissari, Jiaoyang Huang, Aukosh Jagannath
Abstract: We rigorously study the relation between the training dynamics via stochastic gradient descent (SGD) and the spectra of empirical Hessian and gradient matrices. We prove that in two canonical classification tasks for multi-class high-dimensional mixtures and either one- or two-layer neural networks, both the SGD trajectory and emergent outlier eigenspaces of the Hessian and gradient matrices align with a common low-dimensional subspace. Moreover, in multi-layer settings this alignment occurs per layer, with the final layer's outlier eigenspace evolving over the course of training, and exhibiting rank deficiency when the SGD converges to sub-optimal classifiers. This establishes some of the rich predictions that have arisen from extensive numerical studies in the last decade about the spectra of Hessian and information matrices over the course of training in overparametrized networks.
Authors: Chengpiao Huang, Kaizheng Wang
Abstract: We develop a versatile framework for statistical learning in non-stationary environments. In each time period, our approach applies a stability principle to select a look-back window that maximizes the utilization of historical data while keeping the cumulative bias within an acceptable range relative to the stochastic error. Our theory showcases the adaptivity of this approach to unknown non-stationarity. We prove regret bounds that are minimax optimal up to logarithmic factors when the population losses are strongly convex, or Lipschitz only. At the heart of our analysis lie two novel components: a measure of similarity between functions and a segmentation technique for dividing the non-stationary data sequence into quasi-stationary pieces. We evaluate the practical performance of our approach through real-data experiments on electricity demand prediction and hospital nurse staffing.
Authors: Yang Zhao, Jiaxi Yang, Yiling Tao, Lixu Wang, Xiaoxiao Li, Dusit Niyato, H. Vincent Poor
Abstract: The increasing demand for privacy-preserving machine learning has spurred interest in federated unlearning, which enables the selective removal of data from models trained in federated systems. However, developing federated unlearning methods presents challenges, particularly in balancing three often conflicting objectives: privacy, accuracy, and efficiency. This paper provides a comprehensive analysis of existing federated unlearning approaches, examining their algorithmic efficiency, impact on model accuracy, and effectiveness in preserving privacy. We discuss key trade-offs among these dimensions and highlight their implications for practical applications across various domains. Additionally, we propose the OpenFederatedUnlearning framework, a unified benchmark for evaluating federated unlearning methods, incorporating classic baselines and diverse performance metrics. Our findings aim to guide practitioners in navigating the complex interplay of these objectives, offering insights to achieve effective and efficient federated unlearning. Finally, we outline directions for future research to further advance the state of federated unlearning techniques.
Authors: Chenguang Wang, Xuanhao Pan, Tianshu Yu
Abstract: Multi-task learning (MTL) aims to leverage shared information among tasks to improve learning efficiency and accuracy. However, MTL often struggles to effectively manage positive and negative transfer between tasks, which can hinder performance improvements. Task grouping addresses this challenge by organizing tasks into meaningful clusters, maximizing beneficial transfer while minimizing detrimental interactions. This paper introduces a principled approach to task grouping in MTL, advancing beyond existing methods by addressing key theoretical and practical limitations. Unlike prior studies, our method offers a theoretically grounded approach that does not depend on restrictive assumptions for constructing transfer gains. We also present a flexible mathematical programming formulation that accommodates a wide range of resource constraints, thereby enhancing its versatility. Experimental results across diverse domains, including computer vision datasets, combinatorial optimization benchmarks, and time series tasks, demonstrate the superiority of our method over extensive baselines, thereby validating its effectiveness and general applicability in MTL without sacrificing efficiency.
Authors: Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong
Abstract: Sequence parallelism (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods like linear attention, existing SP approaches do not take advantage of their right-product-first feature, resulting in sub-optimal communication efficiency and usability. In this paper, we introduce Linear Attention Sequence Parallelism (LASP), an efficient SP approach designed for linear attention-based transformer models. Specifically, we design an efficient point-to-point ring-style communication mechanism to leverage the right-product kernel trick of linear attention, which sharply decreases the communication overhead compared with existing SP methods. We enhance the computation efficiency of LASP by performing kernel fusion and intermediate state caching, making the implementation of LASP hardware-friendly on GPUs. Furthermore, we meticulously ensure the compatibility of sequence-level LASP with all types of batch-level data parallel methods, which is vital for distributed training on large clusters with very long sequences. We also discuss the generalization of LASP on other linear sequence modeling methods. Extensive experiments on linear attention-based models are conducted with varying sequence lengths from 2K to 4096K. LASP scales sequence length up to 4096K on 128 GPUs, which is 8$\times$ longer than existing SP methods. Code is available at: https://github.com/OpenNLPLab/LASP.
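The right-product-first property LASP exploits is easiest to see in single-head, non-causal linear attention: $(QK^\top)V$ costs $O(n^2 d)$, while the mathematically identical $Q(K^\top V)$ costs $O(nd^2)$, so only a $d \times d$ state, rather than an $n \times n$ score matrix, needs to be passed around. A minimal sketch with one common feature map:

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    return F.elu(x) + 1                    # a common positive feature map

def linear_attention(q, k, v):             # all shapes: (n, d)
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                           # (d, d) state: right product first
    z = q @ k.sum(dim=0, keepdim=True).T   # (n, 1) normalizer
    return (q @ kv) / z
```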
Authors: Wenwen Qiang, Jingyao Wang, Changwen Zheng, Hui Xiong, Gang Hua
Abstract: In this paper, we investigate what constitutes a good representation or model in self-supervised learning (SSL). We argue that a good representation should exhibit universality, characterized by three essential properties: discriminability, generalizability, and transferability. While these capabilities are implicitly desired in most SSL frameworks, existing methods lack an explicit modeling of universality, and its theoretical foundations remain underexplored. To address these gaps, we propose General SSL (GeSSL), a novel framework that explicitly models universality from three complementary dimensions: the optimization objective, the parameter update mechanism, and the learning paradigm. GeSSL integrates a bi-level optimization structure that jointly models task-specific adaptation and cross-task consistency, thereby capturing all three aspects of universality within a unified SSL objective. Furthermore, we derive a theoretical generalization bound, ensuring that the optimization process of GeSSL consistently leads to representations that generalize well to unseen tasks. Empirical results on multiple benchmark datasets demonstrate that GeSSL consistently achieves superior performance across diverse downstream tasks, validating its effectiveness in modeling universal representations.
Authors: Zhijian Xu, Hao Wang, Qiang Xu
Abstract: Traditional time series forecasting methods predominantly rely on historical data patterns, neglecting external interventions that significantly shape future dynamics. Through control-theoretic analysis, we show that the implicit "self-stimulation" assumption limits the accuracy of these forecasts. To overcome this limitation, we propose an Intervention-Aware Time Series Forecasting (IATSF) framework explicitly designed to incorporate external interventions. We particularly emphasize textual interventions due to their unique capability to represent qualitative or uncertain influences inadequately captured by conventional exogenous variables. We propose a leak-free benchmark composed of temporally synchronized textual intervention data across synthetic and real-world scenarios. To rigorously evaluate IATSF, we develop FIATS, a lightweight forecasting model that integrates textual interventions through Channel-Aware Adaptive Sensitivity Modeling (CASM) and Channel-Aware Parameter Sharing (CAPS) mechanisms, enabling the model to adjust its sensitivity to interventions and historical data in a channel-specific manner. Extensive empirical evaluations confirm that FIATS surpasses state-of-the-art methods, highlighting that forecasting improvements stem explicitly from modeling external interventions rather than increased model complexity alone.
Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen
Abstract: Reinforcement Learning with Human Feedback (RLHF) is at the core of fine-tuning methods for generative AI models for language and images. Such feedback is often sought as rank or preference feedback from human raters, as opposed to eliciting scores, since the latter tends to be noisy. On the other hand, RL theory and algorithms predominantly assume that a reward feedback is available. In particular, approaches for online learning that can be helpful in adaptive data collection via active learning cannot incorporate offline preference data. In this paper, we adopt a finite-armed linear bandit model as a prototypical model of online learning. We consider an offline preference dataset to be available, generated by an expert of unknown 'competence'. We propose warmPref-PS, a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback. We show that by modeling the 'competence' of the expert that generated it, we are able to use such a dataset most effectively. We support our claims with a novel theoretical analysis of its Bayesian regret, as well as an extensive empirical evaluation of an approximate loss function that optimizes over infinitely many arms and performs substantially better than baselines.
Authors: Omer Sahin Tas, Royden Wagner
Abstract: Transformer-based models generate hidden states that are difficult to interpret. In this work, we analyze hidden states and modify them at inference, with a focus on motion forecasting. We use linear probing to analyze whether interpretable features are embedded in hidden states. Our experiments reveal high probing accuracy, indicating latent space regularities with functionally important directions. Building on this, we use the directions between hidden states with opposing features to fit control vectors. At inference, we add our control vectors to hidden states and evaluate their impact on predictions. Remarkably, such modifications preserve the feasibility of predictions. We further refine our control vectors using sparse autoencoders (SAEs). This leads to more linear changes in predictions when scaling control vectors. Our approach enables mechanistic interpretation as well as zero-shot generalization to unseen dataset characteristics with negligible computational overhead.
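The recipe described here, fitting a direction from hidden states with opposing features and adding it at inference, can be sketched directly; the tensor names are assumptions.

```python
import torch

def fit_control_vector(h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    """Direction between mean hidden states of two opposing feature groups."""
    v = h_pos.mean(dim=0) - h_neg.mean(dim=0)
    return v / v.norm()

def steer(hidden: torch.Tensor, v: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add a scaled control vector to hidden states at inference."""
    return hidden + alpha * v
```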
Authors: Zeqin Yang, Weilin Chen, Ruichu Cai, Yuguang Yan, Zhifeng Hao, Zhipeng Yu, Zhichao Zou, Jixing Xu, Zhen Peng, Jiecheng Guo
Abstract: Long-term treatment effect estimation is a significant but challenging problem in many applications. Existing methods rely on ideal assumptions, such as no unobserved confounders or binary treatment, to estimate long-term average treatment effects. However, in numerous real-world applications, these assumptions could be violated, and average treatment effects are insufficient for personalized decision-making. In this paper, we address a more general problem of estimating long-term Heterogeneous Dose-Response Curve (HDRC) while accounting for unobserved confounders and continuous treatment. Specifically, to remove the unobserved confounders in the long-term observational data, we introduce an optimal transport weighting framework to align the long-term observational data to an auxiliary short-term experimental data. Furthermore, to accurately predict the heterogeneous effects of continuous treatment, we establish a generalization bound on counterfactual prediction error by leveraging the reweighted distribution induced by optimal transport. Finally, we develop a long-term HDRC estimator building upon the above theoretical foundations. Extensive experiments on synthetic and semi-synthetic datasets demonstrate the effectiveness of our approach.
Authors: Jeremy Carleton, Prathik Vijaykumar, Divyanshu Saxena, Dheeraj Narasimha, Srinivas Shakkottai, Aditya Akella
Abstract: We address the challenge of zeroth-order online convex optimization where the objective function's gradient exhibits sparsity, indicating that only a small number of dimensions possess non-zero gradients. Our aim is to leverage this sparsity to obtain useful estimates of the objective function's gradient even when the only information available is a limited number of function samples. Our motivation stems from the optimization of large-scale queueing networks that process time-sensitive jobs. Here, a job must be processed by potentially many queues in sequence to produce an output, and the service time at any queue is a function of the resources allocated to that queue. Since resources are costly, the end-to-end latency for jobs must be balanced with the overall cost of the resources used. While the number of queues is substantial, the latency function primarily reacts to resource changes in only a few, rendering the gradient sparse. We tackle this problem by introducing the Compressive Online Gradient Optimization (CONGO) framework, which allows compressive sensing methods previously applied to stochastic optimization to achieve regret bounds with an optimal dependence on the time horizon without the full problem dimension appearing in the bound. For specific algorithms, we reduce the samples required per gradient estimate to scale with the gradient's sparsity factor rather than its full dimensionality. Numerical simulations and real-world microservices benchmarks demonstrate CONGO's superiority over gradient descent approaches that do not account for sparsity.
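The compressive-sensing idea can be illustrated by recovering a sparse gradient from a handful of directional finite differences with an $\ell_1$ solver. The sketch below uses scikit-learn's Lasso and assumed constants; the paper's algorithms and their regret guarantees are more specific.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_grad_estimate(f, x, m=20, delta=1e-3, alpha=1e-4):
    """Estimate a sparse gradient of f at x from m directional samples."""
    d = x.size
    A = np.random.randn(m, d) / np.sqrt(m)    # random measurement directions
    y = np.array([(f(x + delta * a) - f(x)) / delta for a in A])  # y ~ A @ grad
    return Lasso(alpha=alpha, fit_intercept=False).fit(A, y).coef_
```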
Authors: Joseph Paillard, Angel Reyero Lobo, Vitaliy Kolodyazhniy, Bertrand Thirion, Denis A. Engemann
Abstract: Causal machine learning holds promise for estimating individual treatment effects from complex data. For successful real-world applications of machine learning methods, it is of paramount importance to obtain reliable insights into which variables drive heterogeneity in the response to treatment. We propose PermuCATE, an algorithm based on the Conditional Permutation Importance (CPI) method, for statistically rigorous global variable importance assessment in the estimation of the Conditional Average Treatment Effect (CATE). Theoretical analysis of the finite sample regime and empirical studies show that PermuCATE has lower variance than the Leave-One-Covariate-Out (LOCO) reference method and provides a reliable measure of variable importance. This property increases statistical power, which is crucial for causal inference in the limited-data regime common to biomedical applications. We empirically demonstrate the benefits of PermuCATE in simulated and real-world health datasets, including settings with up to hundreds of correlated variables.
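The CPI building block, permuting only the part of one covariate that is unpredictable from the others, can be sketched as follows; the linear conditional model and the squared-error discrepancy are simplifying assumptions, not the paper's exact estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def cpi_importance(cate_model, X, j, rng=np.random.default_rng(0)):
    """Importance of covariate j for a fitted CATE learner via conditional permutation."""
    tau_hat = cate_model.predict(X)                   # baseline CATE predictions
    X_minus = np.delete(X, j, axis=1)
    reg = LinearRegression().fit(X_minus, X[:, j])    # conditional model of x_j
    resid = X[:, j] - reg.predict(X_minus)
    X_perm = X.copy()
    X_perm[:, j] = reg.predict(X_minus) + rng.permutation(resid)  # shuffle residuals only
    return float(np.mean((cate_model.predict(X_perm) - tau_hat) ** 2))
```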
Authors: Bo Yue, Jian Li, Guiliang Liu
Abstract: Optimizing objective functions subject to constraints is fundamental in many real-world applications. However, these constraints are often not readily defined and must be inferred from expert agent behaviors, a problem known as Inverse Constraint Inference. Inverse Constrained Reinforcement Learning (ICRL) is a common solver for recovering feasible constraints in complex environments, relying on training samples collected from interactive environments. However, the efficacy and efficiency of current sampling strategies remain unclear. We propose a strategic exploration framework for sampling with guaranteed efficiency to bridge this gap. By defining the feasible cost set for ICRL problems, we analyze how estimation errors in transition dynamics and the expert policy influence the feasibility of inferred constraints. Based on this analysis, we introduce two exploratory algorithms to achieve efficient constraint inference via 1) dynamically reducing the bounded aggregate error of cost estimations or 2) strategically constraining the exploration policy around plausibly optimal ones. Both algorithms are theoretically grounded with tractable sample complexity, and their performance is validated empirically across various environments.
Authors: Yingxu Wang, Mengzhu Wang, Siwei Liu, Houcheng Su, Nan Yin, James Kwok
Abstract: Spiking Graph Networks (SGNs) have demonstrated significant potential in graph classification by emulating brain-inspired neural dynamics to achieve energy-efficient computation. However, existing SGNs are generally constrained to in-distribution scenarios and struggle with distribution shifts. In this paper, we first propose the domain adaptation problem in SGNs, and introduce a novel framework named Degree-Conscious Spiking Graph for Cross-Domain Adaptation (DeSGraDA). DeSGraDA enhances generalization across domains with three key components. First, we introduce the degree-conscious spiking representation module by adapting spike thresholds based on node degrees, enabling more expressive and structure-aware signal encoding. Then, we perform temporal distribution alignment by adversarially matching membrane potentials between domains, ensuring effective performance under domain shift while preserving energy efficiency. Additionally, we extract consistent predictions across two spaces to create reliable pseudo-labels, effectively leveraging unlabeled data to enhance graph classification performance. Furthermore, we establish the first generalization bound for spiking graph domain adaptation, providing theoretical insights into its adaptation performance. Extensive experiments on benchmark datasets validate that DeSGraDA consistently outperforms state-of-the-art methods in both classification accuracy and energy efficiency.
Authors: Zhenyu Tao, Wei Xu, Xiaohu You
Abstract: Optimization of user association in a densely deployed cellular network is usually challenging and even more complicated due to the dynamic nature of user mobility and fluctuation in user counts. While deep reinforcement learning (DRL) emerges as a promising solution, its application in practice is hindered by high trial-and-error costs in the real world and unsatisfactory physical network performance during training. Also, existing DRL-based user association methods are typically applicable only to scenarios with a fixed number of users due to convergence and compatibility challenges. To address these limitations, we introduce a large vision model (LVM)-enhanced digital twin (DT) for wireless networks and propose a parallel DT-driven DRL method for user association and load balancing in networks with dynamic user counts, distributions, and mobility patterns. To construct this LVM-enhanced DT for DRL training, we develop a zero-shot generative user mobility model, named Map2Traj, based on the diffusion model. Map2Traj estimates user trajectory patterns and spatial distributions solely from street maps. DRL models undergo training in the DT environment, avoiding direct interactions with physical networks. To enhance the generalization ability of DRL models for dynamic scenarios, a parallel DT framework is further established to alleviate strong correlation and non-stationarity in single-environment training and improve training efficiency. Numerical results show that the developed LVM-enhanced DT achieves training efficacy closely comparable to the real environment, and the proposed parallel DT framework even outperforms the single real-world environment in DRL training with nearly 20% gain in terms of cell-edge user performance.
Authors: Nicholas Gao, Eike Eberhard, Stephan G\"unnemann
Abstract: The accuracy of density functional theory hinges on the approximation of non-local contributions to the exchange-correlation (XC) functional. To date, machine-learned and human-designed approximations suffer from insufficient accuracy, limited scalability, or dependence on costly reference data. To address these issues, we introduce Equivariant Graph Exchange Correlation (EG-XC), a novel non-local XC functional based on equivariant graph neural networks (GNNs). Where previous works relied on semi-local functionals or fixed-size descriptors of the density, we compress the electron density into an SO(3)-equivariant nuclei-centered point cloud for efficient non-local atomic-range interactions. By applying an equivariant GNN on this point cloud, we capture molecular-range interactions in a scalable and accurate manner. To train EG-XC, we differentiate through a self-consistent field solver requiring only energy targets. In our empirical evaluation, we find EG-XC to accurately reconstruct `gold-standard' CCSD(T) energies on MD17. On out-of-distribution conformations of 3BPA, EG-XC reduces the relative MAE by 35% to 50%. Remarkably, EG-XC excels in data efficiency and molecular size extrapolation on QM9, matching force fields trained on 5 times more and larger molecules. On identical training sets, EG-XC yields on average 51% lower MAEs.
Authors: Wenlong Wang, Ivana Dusparic, Yucheng Shi, Ke Zhang, Vinny Cahill
Abstract: Model-based reinforcement learning (RL) offers a solution to the data inefficiency that plagues most model-free RL algorithms. However, learning a robust world model often requires complex and deep architectures, which are computationally expensive and challenging to train. Within the world model, sequence models play a critical role in accurate predictions, and various architectures have been explored, each with its own challenges. Currently, recurrent neural network (RNN)-based world models struggle with vanishing gradients and capturing long-term dependencies. Transformers, on the other hand, suffer from the quadratic memory and computational complexity of self-attention mechanisms, scaling as $O(n^2)$, where $n$ is the sequence length. To address these challenges, we propose a state space model (SSM)-based world model, Drama, built on Mamba, which achieves $O(n)$ memory and computational complexity while effectively capturing long-term dependencies and enabling efficient training with longer sequences. We also introduce a novel sampling method to mitigate the suboptimality caused by an incorrect world model in the early training stages. Combining these techniques, Drama achieves a normalised score on the Atari100k benchmark that is competitive with other state-of-the-art (SOTA) model-based RL algorithms, using only a 7 million-parameter world model. Drama is accessible and trainable on off-the-shelf hardware, such as a standard laptop. Our code is available at https://github.com/realwenlongwang/Drama.git.
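For intuition on the $O(n)$ claim, a diagonal linear state-space recurrence visits each timestep exactly once, in contrast to self-attention's pairwise token interactions. The sketch below is a generic SSM scan, not the Drama or Mamba implementation; the shapes and the diagonal parameterization are illustrative assumptions.

    import numpy as np

    def ssm_scan(x, A, B, C):
        # x: (n, d_in); A: (d_state,) diagonal transition; B: (d_state, d_in);
        # C: (d_out, d_state). One pass over the sequence -> O(n) time and memory.
        h = np.zeros(A.shape[0])
        ys = []
        for x_t in x:
            h = A * h + B @ x_t      # state update; no pairwise token interactions
            ys.append(C @ h)
        return np.stack(ys)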
Authors: Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng
Abstract: KV cache has become a de facto technique for the inference of large language models (LLMs), where tensors of shape (layer number, head number, sequence length, feature dimension) are introduced to cache historical information for self-attention. As the size of the model and data grows, the KV cache can quickly become a bottleneck within the system in both storage and memory transfer. To address this, prior studies usually focus on the first three axes of the cache tensors for compression. This paper supplements them, focusing on the feature dimension axis, by utilizing low-rank projection matrices to transform the cache features into spaces with reduced dimensions. We begin by investigating the canonical orthogonal projection method for data compression through principal component analysis (PCA). We identify an issue with PCA projection: significant performance degradation occurs at low compression rates. To bridge the gap, we propose to directly tune the orthogonal projection matrices with a distillation objective using an elaborate Matryoshka training strategy. After training, we adaptively search for the optimal compression rates for various layers and heads given varying compression budgets. Compared to previous works, our method can be easily applied to pre-trained LLMs and offers a smooth tradeoff between performance and compression rate. We empirically observe the high data efficiency of our training procedure and find that our method can sustain over 90% performance with an average KV cache compression rate of 60% (and up to 75% in certain extreme scenarios) for popular LLMs like LLaMA2-7B-base and Mistral-7B-v0.3-base.
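The PCA baseline the paper starts from can be sketched in a few lines: compute an orthogonal basis for the feature dimension of cached keys or values and store only the low-dimensional coordinates. The paper goes further by distilling tuned projections with a Matryoshka strategy; the snippet below illustrates only the canonical orthogonal projection on toy data.

    import numpy as np

    def pca_projection(feats, r):
        # feats: (n_tokens, d) cached key or value features; r: reduced dimension.
        # Returns an orthonormal basis P (d, r) for the top-r principal directions.
        centered = feats - feats.mean(axis=0)
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        return Vt[:r].T

    K = np.random.randn(1024, 128)       # toy cache slice
    P = pca_projection(K, r=48)
    K_small = K @ P                      # store (n, 48) instead of (n, 128)
    K_approx = K_small @ P.T             # approximate reconstruction at use time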
Authors: Zhi Zhang, Chris Chow, Yasi Zhang, Yanchao Sun, Haochen Zhang, Eric Hanchen Jiang, Han Liu, Furong Huang, Yuchen Cui, Oscar Hernan Madrid Padilla
Abstract: Lifelong reinforcement learning (RL) has been developed as a paradigm for extending single-task RL to more realistic, dynamic settings. In lifelong RL, the "life" of an RL agent is modeled as a stream of tasks drawn from a task distribution. We propose EPIC (Empirical PAC-Bayes that Improves Continuously), a novel algorithm designed for lifelong RL using PAC-Bayes theory. EPIC learns a shared policy distribution, referred to as the world policy, which enables rapid adaptation to new tasks while retaining valuable knowledge from previous experiences. Our theoretical analysis establishes a relationship between the algorithm's generalization performance and the number of prior tasks preserved in memory. We also derive the sample complexity of EPIC in terms of RL regret. Extensive experiments on a variety of environments demonstrate that EPIC significantly outperforms existing methods in lifelong RL, offering both theoretical guarantees and practical efficacy through the use of the world policy.
Authors: Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
Abstract: Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves in-depth study, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These results demonstrate that ReLU is a more efficient activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws point toward LLMs with greater activation sparsity and have important implications for making LLMs more efficient and interpretable.
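As a rough illustration of the quantity being tracked, the snippet below measures the raw activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) of a ReLU layer on a batch. The paper's PPL-$p\%$ metric is performance-aware and applies to any activation function, so this is only the naive counterpart.

    import numpy as np

    def activation_ratio(pre_acts):
        # pre_acts: (n_tokens, d_ff) pre-activation values of an MLP layer.
        # Fraction of entries that survive ReLU, i.e. 1 - sparsity ratio.
        return float((np.maximum(pre_acts, 0.0) > 0.0).mean())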
Authors: Yang Su, Na Yan, Yansha Deng, Mischa Dohler, Robert Schober
Abstract: Federated fine-tuning of pre-trained Large Language Models (LLMs) enables task-specific adaptation across diverse datasets while preserving privacy. However, challenges such as high computational and memory demands, heterogeneous client resources, bandwidth constraints, and ineffective global aggregation hinder its efficiency. To address these issues, we propose HAFLQ (Heterogeneous Adaptive Federated Low-Rank Adaptation Fine-tuned LLM with Quantization), a novel framework for efficient and scalable federated fine-tuning of LLMs in heterogeneous environments. To reduce memory and computation demands, we propose a salience-driven adaptive LLM quantization framework that evaluates the importance of transformer blocks using a salience metric and applies adaptive block-wise quantization accordingly. To handle heterogeneous computational capabilities, we propose an importance-based parameter truncation and freezing scheme. To address communication bottlenecks, we propose an importance-aware bandwidth-adaptive quantization method, which dynamically adjusts parameter precision based on importance and bandwidth constraints. To improve global model aggregation, we propose an adaptive rank-1 matrix-level aggregation strategy, which prevents information dilution and accelerates convergence by aggregating only updated rank-1 matrices from clients. Experimental results on the text classification task demonstrate that HAFLQ reduces memory usage by 31%, lowers communication cost by 49%, improves accuracy by 50%, and achieves faster convergence compared to the baseline method.
Authors: Xuanhao Pan, Chenguang Wang, Chaolong Ying, Ye Xue, Tianshu Yu
Abstract: The ``Heatmap + Monte Carlo Tree Search (MCTS)'' paradigm has recently emerged as a prominent framework for solving the Travelling Salesman Problem (TSP). While considerable effort has been devoted to enhancing heatmap sophistication through advanced learning models, this paper rigorously examines whether this emphasis is justified, critically assessing the relative impact of heatmap complexity versus MCTS configuration. Our extensive empirical analysis across diverse TSP scales, distributions, and benchmarks reveals two pivotal insights: 1) The configuration of MCTS strategies significantly influences solution quality, underscoring the importance of meticulous tuning to achieve optimal results and enabling valid comparisons among different heatmap methodologies. 2) A rudimentary, parameter-free heatmap based on the intrinsic $k$-nearest neighbor structure of TSP instances, when coupled with an optimally tuned MCTS, can match or surpass the performance of more sophisticated, learned heatmaps, demonstrating robust generalizability across problem scales and distribution shifts. To facilitate rigorous and fair evaluations in future research, we introduce a streamlined pipeline for standardized MCTS hyperparameter tuning. Collectively, these findings challenge the prevalent assumption that heatmap complexity is the primary determinant of performance, advocating instead for a balanced integration and comprehensive evaluation of both learning and search components within this paradigm. Our code is available at: https://github.com/LOGO-CUHKSZ/rethink_mcts_tsp.
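The "rudimentary, parameter-free heatmap" can be pictured as follows: support each city's outgoing edges only on its $k$ nearest neighbors, with uniform weight. The sketch below shows that construction; the tuned MCTS it would be coupled with, and the tuning pipeline, are the paper's contribution and are not reproduced here.

    import numpy as np

    def knn_heatmap(coords, k=10):
        # coords: (n, 2) city coordinates. Returns an (n, n) row-normalised
        # heatmap supported on each city's k nearest neighbors.
        n = len(coords)
        d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)          # exclude self-loops
        nearest = np.argsort(d, axis=1)[:, :k]
        H = np.zeros((n, n))
        for i in range(n):
            H[i, nearest[i]] = 1.0 / k       # uniform prior over the k neighbors
        return H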
Authors: Hu Wang, Congbo Ma, Ibrahim Almakky, Ian Reid, Gustavo Carneiro, Mohammad Yaqub
Abstract: Model-merging has emerged as a powerful approach in deep learning, capable of enhancing model performance without any training. However, the underlying mechanisms that explain its effectiveness remain largely unexplored. In this paper, we investigate this technique from three novel perspectives to empirically provide deeper insights into why and how weight-averaged model-merging~\cite{wortsman2022soups} works: (1) we examine the intrinsic patterns captured by the learned model weights, and we are the first to connect the structured patterns these weights encode with why weight-averaged model merging can work; (2) we investigate averaging on weights versus averaging on features, providing analyses from the view of diverse architecture comparisons on multiple datasets; and (3) we explore the impact of changing parameter magnitudes on model-merging prediction stability, revealing how weight averaging works as regularization by showing robustness across different parameter scales. The code is available at https://github.com/billhhh/Rethink-Merge.
Authors: Dan Hudson, Jurgen van den Hoogen, Martin Atzmueller
Abstract: State-of-the-art algorithms are reported to be almost perfect at distinguishing the vibrations arising from healthy and damaged machine bearings, according to benchmark datasets at least. However, what about their application to new data? In this paper, we confirm that neural networks for bearing fault detection can be crippled by incorrect hyperparameterisation, and also that the correct hyperparameter settings can change when transitioning to new data. The paper combines multiple methods to explain the behaviour of the hyperparameters of a wide-kernel convolutional neural network and how to set them. Since guidance already exists for generic hyperparameters like minibatch size, we focus on how to set architecture-specific hyperparameters such as the width of the convolutional kernels, a topic which might otherwise be obscure. We reflect different data properties by fusing information from seven different benchmark datasets, and our results show that the kernel size in the first layer in particular is sensitive to changes in the data. Looking deeper, we use manipulated copies of one dataset in an attempt to spot why the kernel size sometimes needs to change. The relevance of sampling rate is studied by using different levels of resampling, and spectral content is studied by increasingly filtering out high frequencies. We find that, contrary to speculation in earlier work, high-frequency noise is not the main reason why a wide kernel is preferable to a narrow kernel. Finally, we conclude by stating clear guidance on how to set the hyperparameters of our neural network architecture to work effectively on new data.
Authors: Yaomin Wang, Chaolong Ying, Xiaodong Luo, Tianshu Yu
Abstract: Inference in large-scale Markov Random Fields (MRFs) is a critical yet challenging task, traditionally approached through approximate methods like belief propagation and mean field, or exact methods such as the Toulbar2 solver. These strategies often fail to strike an optimal balance between efficiency and solution quality, particularly as the problem scale increases. This paper introduces NeuroLifting, a novel technique that leverages Graph Neural Networks (GNNs) to reparameterize decision variables in MRFs, facilitating the use of standard gradient descent optimization. By extending traditional lifting techniques into a non-parametric neural network framework, NeuroLifting benefits from the smooth loss landscape of neural networks, enabling efficient and parallelizable optimization. Empirical results demonstrate that, on moderate scales, NeuroLifting performs very close to the exact solver Toulbar2 in terms of solution quality, significantly surpassing existing approximate methods. Notably, on large-scale MRFs, NeuroLifting delivers superior solution quality against all baselines, as well as exhibiting linear computational complexity growth. This work presents a significant advancement in MRF inference, offering a scalable and effective solution for large-scale problems.
Authors: Yunfei Luo, Yuliang Chen, Asif Salekin, Tauhidur Rahman
Abstract: Time-series foundation models excel at tasks like forecasting across diverse data types by leveraging informative waveform representations. Wearable sensing data, however, pose unique challenges due to their variability in patterns and frequency bands, especially for healthcare-related outcomes. The main obstacle lies in crafting generalizable representations that adapt efficiently across heterogeneous sensing configurations and applications. To address this, we propose NormWear, the first multi-modal and ubiquitous foundation model designed to extract generalized and informative representations from wearable sensing data. Specifically, we design a channel-aware attention mechanism with a shared special liaison [CLS] token to detect signal patterns both within and across sensors. This helps the model extract more meaningful information by considering both the time series themselves and the relationships between input sensors, making it widely compatible with various sensor settings. NormWear is pretrained on a diverse set of physiological signals, including PPG, ECG, EEG, GSR, and IMU, from various public datasets. Our model shows exceptional generalizability across 11 public wearable sensing datasets, spanning 18 applications in mental health, body state inference, vital sign estimation, and disease risk evaluation. It consistently outperforms competitive baselines under zero-shot, partial-shot, and full-shot settings, indicating broad applicability in real-world health applications.
Authors: Taiyan Zhang, Renchi Yang, Yurui Lai, Mingyu Yan, Xiaochun Ye, Dongrui Fan
Abstract: Graph neural networks (GNNs) have become the preferred models for node classification in graph data due to their robust capabilities in integrating graph structures and attributes. However, these models heavily depend on a substantial amount of high-quality labeled data for training, which is often costly to obtain. With the rise of large language models (LLMs), a promising approach is to utilize their exceptional zero-shot capabilities and extensive knowledge for node labeling. Despite encouraging results, this approach either requires numerous queries to LLMs or suffers from reduced performance due to noisy labels generated by LLMs. To address these challenges, we introduce Locle, an active self-training framework that does Label-free node Classification with LLMs cost-Effectively. Locle iteratively identifies small sets of "critical" samples using GNNs and extracts informative pseudo-labels for them with both LLMs and GNNs, serving as additional supervision signals to enhance model training. Specifically, Locle comprises three key components: (i) an effective active node selection strategy for initial annotations; (ii) a careful sample selection scheme to identify "critical" nodes based on label disharmonicity and entropy; and (iii) a label refinement module that combines LLMs and GNNs with a rewired topology. Extensive experiments on five benchmark text-attributed graph datasets demonstrate that Locle significantly outperforms state-of-the-art methods under the same query budget to LLMs in terms of label-free node classification. Notably, on the DBLP dataset with 14.3k nodes, Locle achieves an 8.08% improvement in accuracy over the state-of-the-art at a cost of less than one cent. Our code is available at https://github.com/HKBU-LAGAS/Locle.
Authors: Ahmed M. Adly
Abstract: This paper introduces EXAdam ($\textbf{EX}$tended $\textbf{Adam}$), a novel optimization algorithm that builds upon the widely-used Adam optimizer. EXAdam incorporates two key enhancements: (1) new debiasing terms for improved moment estimation and (2) a gradient-based acceleration mechanism for increased responsiveness to the current loss landscape. These innovations work synergistically to address limitations of the original Adam algorithm, potentially offering improved convergence properties, an enhanced ability to escape saddle points, and greater robustness to hyperparameter choices, though these benefits require further investigation. We provide a theoretical analysis of EXAdam's components and their interactions, highlighting the algorithm's potential advantages in navigating complex optimization landscapes. Empirical evaluations demonstrate EXAdam's superiority over Adam, achieving 38.46% faster convergence and yielding improvements of 1.96%, 2.17%, and 1.17% in training, validation, and testing accuracies, respectively, when applied to a CNN trained on the CIFAR-10 dataset. While these results are promising, further empirical validation across diverse tasks is essential to fully gauge EXAdam's efficacy. Nevertheless, EXAdam represents a significant advancement in adaptive optimization techniques, with promising implications for a wide range of machine learning applications. This work aims to contribute to the ongoing development of more efficient, adaptive, and universally applicable optimization methods in the field of machine learning and artificial intelligence.
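Since the abstract does not spell out EXAdam's new debiasing terms or its gradient-based acceleration, the sketch below instead shows the standard Adam update they modify, with the classical bias-correction factors marked; it is the baseline, not EXAdam itself.

    import numpy as np

    def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # One standard Adam update; t is the 1-indexed step count.
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)        # classical bias correction (first moment)
        v_hat = v / (1 - b2 ** t)        # classical bias correction (second moment)
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v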
Authors: Ari Blondal, Shan Gao, Hamed Hatami, Pooya Hatami
Abstract: Two seminal papers--Alon, Livni, Malliaris, Moran (STOC 2019) and Bun, Livni, and Moran (FOCS 2020)--established the equivalence between online learnability and globally stable PAC learnability in binary classification. However, Chase, Chornomaz, Moran, and Yehudayoff (STOC 2024) recently showed that this equivalence does not hold in the agnostic setting. Specifically, they proved that in the agnostic setting, only finite hypothesis classes are globally stable learnable. Therefore, agnostic global stability is too restrictive to capture interesting hypothesis classes. To address this limitation, Chase et al. introduced two relaxations of agnostic global stability. In this paper, we characterize the classes that are learnable under their proposed relaxed conditions, resolving the two open problems raised in their work. First, we prove that in the setting where the stability parameter can depend on the excess error (the gap between the learner's error and the best achievable error by the hypothesis class), agnostic stability is fully characterized by the Littlestone dimension. Consequently, as in the realizable case, this form of learnability is equivalent to online learnability. As part of the proof of this theorem, we strengthen the celebrated result of Bun et al. by showing that classes with infinite Littlestone dimension are not stably PAC learnable, even if we allow the stability parameter to depend on the excess error. For the second relaxation proposed by Chase et al., we prove that only finite hypothesis classes are globally stable learnable, even if we restrict the agnostic setting to distributions with small population loss.
Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen
Abstract: We address the problem of best policy identification in preference-based reinforcement learning (PbRL), where learning occurs from noisy binary preferences over trajectory pairs rather than explicit numerical rewards. This approach is useful for post-training optimization of generative AI models during multi-turn user interactions, where preference feedback is more robust than handcrafted reward models. In this setting, learning is driven by both an offline preference dataset -- collected from a rater of unknown `competence' -- and online data collected with pure exploration. Since offline datasets may exhibit out-of-distribution (OOD) biases, principled online data collection is necessary. To address this, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling, that maintains independent posteriors over the true reward model and transition dynamics. We provide the first theoretical guarantees for PbRL in this setting, establishing an upper bound on the simple Bayesian regret of $\mathsf{PSPL}$. Since the exact algorithm can be computationally impractical, we also provide an approximate version that outperforms existing baselines.
Authors: Natalie Maus, Kyurae Kim, Yimeng Zeng, Haydn Thomas Jones, Fangping Wan, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Jacob R. Gardner
Abstract: In multi-objective black-box optimization, the goal is typically to find solutions that optimize a set of $T$ black-box objective functions, $f_1$, ..., $f_T$, simultaneously. Traditional approaches often seek a single Pareto-optimal set that balances trade-offs among all objectives. In this work, we consider a problem setting that departs from this paradigm: finding a small set of $K < T$ solutions that collectively "covers" the $T$ objectives. A set of solutions is defined as "covering" if, for each objective $f_1$, ..., $f_T$, there is at least one good solution in the set. A motivating example for this problem setting occurs in drug design. For example, we may have $T$ pathogens and aim to identify a set of $K < T$ antibiotics such that at least one antibiotic can be used to treat each pathogen. To address this problem, we propose Multi-Objective Coverage Bayesian Optimization (MOCOBO), a principled algorithm designed to efficiently find a covering set. We validate our approach through experiments on challenging high-dimensional tasks, including applications in peptide and molecular design, where MOCOBO is shown to find high-performing covering sets of solutions. The results show that the coverage of the $K < T$ solutions found by MOCOBO matches or nearly matches the coverage of $T$ solutions obtained by optimizing each objective individually. Furthermore, in in vitro experiments, the peptides found by MOCOBO exhibited high potency against drug-resistant pathogens, further demonstrating the potential of MOCOBO for drug discovery.
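The covering criterion is easy to state in code: evaluate each of the $K$ candidates on each of the $T$ objectives and credit every objective with its best achieved value. The sketch below shows that scalarization under the assumption that higher scores are better; MOCOBO's Bayesian-optimization machinery for searching over candidate sets is not shown.

    import numpy as np

    def coverage_score(scores):
        # scores: (K, T), scores[k, t] = f_t evaluated on candidate k (higher is
        # better). Each objective is credited with its best candidate's value;
        # summing gives a scalar set objective to maximise.
        return float(scores.max(axis=0).sum())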
Authors: Akiyoshi Tomihari, Issei Sato
Abstract: Transformers are challenging to optimize with SGD and typically require adaptive optimizers such as Adam. However, the reasons behind the superior performance of Adam over SGD remain unclear. In this study, we investigate the optimization of transformers by focusing on gradient heterogeneity, defined as the disparity in gradient norms among parameters. Our analysis shows that gradient heterogeneity hinders gradient-based optimization, including SGD, while sign-based optimization, a simplified variant of Adam, is less affected. We further examine gradient heterogeneity in transformers and show that it is influenced by the placement of layer normalization. Experimental results from fine-tuning transformers in both NLP and vision domains validate our theoretical analyses. This study provides insights into the optimization challenges of transformers and offers guidance for designing future optimization algorithms. Code is available at https://github.com/tom4649/gradient-heterogeneity.
Authors: Dibyajyoti Chakraborty, Arvind T. Mohan, Romit Maulik
Abstract: Forecasting multiscale chaotic dynamical systems with deep learning remains a formidable challenge due to the spectral bias of neural networks, which hinders the accurate representation of fine-scale structures in long-term predictions. This issue is exacerbated when models are deployed autoregressively, leading to compounding errors and instability. In this work, we introduce a novel approach to mitigate the spectral bias which we call the Binned Spectral Power (BSP) Loss. The BSP loss is a frequency-domain loss function that adaptively weighs errors in predicting both larger and smaller scales of the dataset. Unlike traditional losses that focus on pointwise misfits, our BSP loss explicitly penalizes deviations in the energy distribution across different scales, promoting stable and physically consistent predictions. We demonstrate that the BSP loss mitigates the well-known problem of spectral bias in deep learning. We further validate our approach for the data-driven high-dimensional time-series forecasting of a range of benchmark chaotic systems which are typically intractable due to spectral bias. Our results demonstrate that the BSP loss significantly improves the stability and spectral accuracy of neural forecasting models without requiring architectural modifications. By directly targeting spectral consistency, our approach paves the way for more robust deep learning models for long-term forecasting of chaotic dynamical systems.
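A minimal version of a binned spectral power loss can be written directly from the description: compare predicted and true spectral energy bin by bin, so that errors at small scales are penalized even when their pointwise contribution is tiny. The bin boundaries and log-power form below are illustrative assumptions, not the paper's exact weighting.

    import numpy as np

    def bsp_loss(pred, target, n_bins=8, eps=1e-12):
        # pred, target: (n,) signals. Penalise per-bin log spectral power mismatch.
        Pp = np.abs(np.fft.rfft(pred)) ** 2
        Pt = np.abs(np.fft.rfft(target)) ** 2
        loss = 0.0
        for idx in np.array_split(np.arange(len(Pp)), n_bins):
            loss += (np.log(Pp[idx].sum() + eps) - np.log(Pt[idx].sum() + eps)) ** 2
        return loss / n_bins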
Authors: Jack Geary, Henry Gouk
Abstract: We consider the problem of strategic classification, where a learner must build a model to classify agents based on features that have been strategically modified. Previous work in this area has concentrated on the case when the learner is restricted to deterministic classifiers. In contrast, we perform a theoretical analysis of an extension to this setting that allows the learner to produce a randomised classifier. We show that, under certain conditions, the optimal randomised classifier can achieve better accuracy than the optimal deterministic classifier, but under no conditions can it be worse. When a finite set of training data is available, we show that the excess risk of Strategic Empirical Risk Minimisation over the class of randomised classifiers is bounded in a similar manner as the deterministic case. In both the deterministic and randomised cases, the risk of the classifier produced by the learner converges to that of the corresponding optimal classifier as the volume of available training data grows. Moreover, this convergence happens at the same rate as in the i.i.d. case. Our findings are compared with previous theoretical work analysing the problem of strategic classification. We conclude that randomisation has the potential to alleviate some issues that could be faced in practice without introducing any substantial downsides.
Authors: Peiyan Hu, Xiaowei Qian, Wenhao Deng, Rui Wang, Haodong Feng, Ruiqi Feng, Tao Zhang, Long Wei, Yue Wang, Zhi-Ming Ma, Tailin Wu
Abstract: The application of deep learning for partial differential equation (PDE)-constrained control is gaining increasing attention. However, existing methods rarely consider safety requirements crucial in real-world applications. To address this limitation, we propose Safe Diffusion Models for PDE Control (SafeDiffCon), which introduce the uncertainty quantile as a measure of model uncertainty to achieve optimal control under safety constraints through both post-training and inference phases. Firstly, our approach post-trains a pre-trained diffusion model to generate control sequences that better satisfy safety constraints while achieving improved control objectives via a reweighted diffusion loss, which incorporates the uncertainty quantile estimated using conformal prediction. Secondly, during inference, the diffusion model dynamically adjusts both its generation process and parameters through iterative guidance and fine-tuning, conditioned on control targets while simultaneously integrating the estimated uncertainty quantile. We evaluate SafeDiffCon on three control tasks: the 1D Burgers' equation, a 2D incompressible fluid, and a controlled nuclear fusion problem. Results demonstrate that SafeDiffCon is the only method that satisfies all safety constraints, whereas other classical and deep learning baselines fail. Furthermore, while adhering to safety constraints, SafeDiffCon achieves the best control performance. The code can be found at https://github.com/AI4Science-WestlakeU/safediffcon.
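For readers unfamiliar with the conformal-prediction ingredient, the snippet below shows the standard finite-sample quantile computed from held-out calibration residuals, which is the kind of uncertainty quantile SafeDiffCon plugs into its reweighted loss and guidance; how the residuals are formed for PDE control is specific to the paper.

    import numpy as np

    def conformal_quantile(residuals, alpha=0.1):
        # residuals: calibration-set nonconformity scores, e.g. |prediction - truth|.
        # Returns the finite-sample-corrected (1 - alpha) quantile.
        n = len(residuals)
        q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        return float(np.quantile(residuals, q, method="higher"))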
Authors: Wangtao Sun, Haotian Xu, Huanxuan Liao, Xuanqing Yu, Zhongtao Jiang, Shizhu He, Jun Zhao, Kang Liu
Abstract: The interaction with Large Language Models (LLMs) through instructions has been extensively investigated in the research community. While instructions have been widely used as guidelines for task solving, this paper further observes that both instructions and parameters are compressions of task data. Therefore, they could be strongly correlated, and one can be learned to predict the other. This paper proposes a novel neural network framework, SHIP (\textbf{Sh}uttle between the \textbf{I}nstructions and the \textbf{P}arameters), to model and learn the mutual mappings between the instructions and the parameters of LLMs. We verify that SHIP can effectively map one of the instructions/parameters to the other by evaluating it on the tasks of instruction deduction and induction. The results show that SHIP performs better than existing baseline methods in terms of deductive capabilities while significantly surpassing them in inductive capabilities. Moreover, SHIP can effectively combine the two mapping processes to perform excellent inductive reasoning. The code and data for this paper are released at https://anonymous.4open.science/r/Shuttle-Between-Instructions-Parameters/.
Authors: Giacomo Lastrucci, Artur M. Schweidtmann
Abstract: Ensuring neural networks adhere to domain-specific constraints is crucial for addressing safety and ethical concerns while also enhancing inference accuracy. Despite the nonlinear nature of most real-world tasks, existing methods are predominantly limited to affine or convex constraints. We introduce ENFORCE, a neural network architecture that uses an adaptive projection module (AdaNP) to enforce nonlinear equality constraints in the predictions. We prove that our projection mapping is 1-Lipschitz, making it well-suited for stable training. We evaluate ENFORCE on an illustrative regression task and for learning solutions to high-dimensional optimization problems in an unsupervised setting. The predictions of our new architecture satisfy $N_C$ equality constraints that are nonlinear in both the inputs and outputs of the neural network, while maintaining scalability with a tractable computational complexity of $\mathcal{O}(N_C^3)$ at training and inference time.
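The source of the $\mathcal{O}(N_C^3)$ cost can be seen from one linearised projection step: enforcing $c(y)=0$ to first order requires solving an $N_C \times N_C$ system. The sketch below shows that minimum-norm correction; AdaNP's adaptive construction and the 1-Lipschitz guarantee are the paper's own and are not reproduced here.

    import numpy as np

    def project_onto_constraints(y, c, J):
        # y: (d,) raw prediction; c(y): (N_C,) constraint residuals; J: (N_C, d)
        # Jacobian of c at y. Returns the minimum-norm correction making the
        # linearised constraints c(y) + J @ delta = 0 hold.
        lam = np.linalg.solve(J @ J.T, c(y))   # N_C x N_C solve -> O(N_C^3)
        return y - J.T @ lam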
Authors: Chengyang Ying, Huayu Chen, Xinning Zhou, Zhongkai Hao, Hang Su, Jun Zhu
Abstract: Unsupervised reinforcement learning (URL) aims to pre-train agents by exploring diverse states or skills in reward-free environments, facilitating efficient adaptation to downstream tasks. As the agent cannot access extrinsic rewards during unsupervised exploration, existing methods design intrinsic rewards to model the explored data and encourage further exploration. However, the explored data are always heterogeneous, requiring powerful representation abilities for both intrinsic reward models and pre-trained policies. In this work, we propose the Exploratory Diffusion Model (ExDM), which leverages the strong expressive ability of diffusion models to fit the explored data, simultaneously boosting exploration and providing an efficient initialization for downstream tasks. Specifically, ExDM can accurately estimate the distribution of collected data in the replay buffer with the diffusion model and introduces the score-based intrinsic reward, encouraging the agent to explore less-visited states. After obtaining the pre-trained policies, ExDM enables rapid adaptation to downstream tasks. In detail, we provide theoretical analyses and practical algorithms for fine-tuning diffusion policies, addressing key challenges such as training instability and computational complexity caused by multi-step sampling. Extensive experiments demonstrate that ExDM outperforms existing SOTA baselines in efficient unsupervised exploration and fast fine-tuning on downstream tasks, especially in structurally complicated environments.
Authors: Mingkai Xu, Yongpeng Wu, Yuxuan Shi, Xiang-Gen Xia, Wenjun Zhang, Ping Zhang
Abstract: A latent denoising semantic communication (SemCom) framework is proposed for robust image transmission over noisy channels. By incorporating a learnable latent denoiser into the receiver, the received signals are preprocessed to effectively remove the channel noise and recover the semantic information, thereby enhancing the quality of the decoded images. Specifically, a latent denoising mapping is established by an iterative residual learning approach to improve the denoising efficiency while ensuring stable performance. Moreover, the channel signal-to-noise ratio (SNR) is utilized to estimate and predict the latent similarity score (SS) for conditional denoising, where the number of denoising steps is adapted based on the predicted SS sequence, further reducing the communication latency. Finally, simulations demonstrate that the proposed framework can effectively and efficiently remove the channel noise at various levels and reconstruct visually appealing images.
Authors: Pai Liu, Lingfeng Zhao, Shivangi Agarwal, Jinghan Liu, Audrey Huang, Philip Amortila, Nan Jiang
Abstract: Holdout validation and hyperparameter tuning from data are a long-standing problem in offline reinforcement learning (RL). A standard framework is to use off-policy evaluation (OPE) methods to evaluate and select the policies, but OPE either incurs exponential variance (e.g., importance sampling) or has hyperparameters of its own (e.g., FQE and model-based methods). In this work we focus on hyperparameter tuning for OPE itself, which is even more under-investigated. Concretely, we select among candidate value functions ("model-free") or dynamics ("model-based") to best assess the performance of a target policy. We develop: (1) new model-free and model-based selectors with theoretical guarantees, and (2) a new experimental protocol for empirically evaluating them. Compared to the model-free protocol in prior works, our new protocol allows for more stable generation and better control of candidate value functions in an optimization-free manner, and evaluation of model-free and model-based methods alike. We exemplify the protocol on Gym-Hopper, and find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.
Authors: Zhikai Wu, Sifan Wang, Shiyang Zhang, Sizhuang He, Min Zhu, Anran Jiao, Lu Lu, David van Dijk
Abstract: Operator learning for time-dependent partial differential equations (PDEs) has seen rapid progress in recent years, enabling efficient approximation of complex spatiotemporal dynamics. However, most existing methods rely on fixed time step sizes during rollout, which limits their ability to adapt to varying temporal complexity and often leads to error accumulation. To address this gap, we propose the Time-Adaptive Transformer with Neural Taylor Expansion (TANTE), a novel operator-learning framework that produces continuous-time predictions with adaptive step sizes. TANTE predicts future states by performing a Taylor expansion at the current state, where neural networks learn both the higher-order temporal derivatives and the local radius of convergence. This allows the model to dynamically adjust its rollout based on the local behavior of the solution, thereby reducing cumulative error and improving computational efficiency. We demonstrate the effectiveness of TANTE across a wide range of PDE benchmarks, achieving superior accuracy and adaptability compared to fixed-step baselines, delivering accuracy gains of 10-50% and speed-ups of 30-80% at inference.
Authors: Cong Wang, Jiahong Chen, Renjie Yang, Dong Li, Zhibin Wang, Zongyu Yang, Zhijun Wang, Yixiong Wei, Zhaoyang Liu, Chenshu Hu, Jing Li
Abstract: An integrated data analysis model based on Gaussian Process Regression is proposed for plasma electron density profile tomography in the HL-3 tokamak. The model combines line-integral measurements from the far-infrared laser interferometer with point measurements obtained via frequency-modulated continuous wave reflectometry. By employing Gaussian Process Regression, the model effectively incorporates point measurements into 2D profile reconstructions, while coordinate mapping integrates magnetic equilibrium information. The average relative error of the reconstructed profile obtained by the integrated data analysis model with normalized magnetic flux is as low as $3.60\times 10^{-4}$. Additionally, sensitivity tests were conducted on the grid resolution, the standard deviation of diagnostic data, and noise levels, providing a robust foundation for application to real experimental data.
Authors: Dongki Kim, Wonbin Lee, Sung Ju Hwang
Abstract: Understanding molecules is key to understanding organisms and driving advances in drug discovery, requiring interdisciplinary knowledge across chemistry and biology. Although large molecular language models have achieved notable success in task transfer, they often struggle to accurately analyze molecular features due to limited knowledge and reasoning capabilities. To address this issue, we present Mol-LLaMA, a large molecular language model that grasps the general knowledge centered on molecules and exhibits explainability and reasoning ability. To this end, we design key data types that encompass the fundamental molecular features, taking into account the essential abilities for molecular reasoning. Further, to improve molecular understanding, we propose a module that integrates complementary information from different molecular encoders, leveraging the distinct advantages of molecular representations. Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules and providing informative responses, implying its potential as a general-purpose assistant for molecular analysis. Our project page is at https://mol-llama.github.io/.
Authors: Akshay Kumar, Jarvis Haupt
Abstract: Recent works exploring the training dynamics of homogeneous neural network weights under gradient flow with small initialization have established that in the early stages of training, the weights remain small and near the origin, but converge in direction. Building on this, the current paper studies the gradient flow dynamics of homogeneous neural networks with locally Lipschitz gradients, after they escape the origin. Insights gained from this analysis are used to characterize the first saddle point encountered by gradient flow after escaping the origin. Also, it is shown that for homogeneous feed-forward neural networks, under certain conditions, the sparsity structure emerging among the weights before the escape is preserved after escaping the origin and until reaching the next saddle point.
Authors: Pranav Vaidhyanathan, Aristotelis Papatheodorou, Mark T. Mitchison, Natalia Ares, Ioannis Havoutis
Abstract: Scalable and generalizable physics-aware deep learning has long been considered a significant challenge with various applications across diverse domains ranging from robotics to molecular dynamics. Central to almost all physical systems are symplectic forms, the geometric backbone that underpins fundamental invariants like energy and momentum. In this work, we introduce a novel deep learning framework, MetaSym. In particular, MetaSym combines a strong symplectic inductive bias obtained from a symplectic encoder, and an autoregressive decoder with meta-attention. This principled design ensures that core physical invariants remain intact, while allowing flexible, data-efficient adaptation to system heterogeneities. We benchmark MetaSym with highly varied and realistic datasets, such as a high-dimensional spring-mesh system (Otness et al., 2021), an open quantum system with dissipation and measurement backaction, and robotics-inspired quadrotor dynamics. Our results demonstrate superior performance in modeling dynamics under few-shot adaptation, outperforming state-of-the-art baselines that use larger models.
Authors: Shlok Mehendale, Aditya Challa, Rahul Yedida, Sravan Danda, Santonu Sarkar, Snehanshu Saha
Abstract: Which principle underpins the design of an effective anomaly detection loss function? The answer lies in the Radon-Nikod\'ym theorem, a fundamental result in measure theory. The key insight from this article is: Multiplying the vanilla loss function with the Radon-Nikod\'ym derivative improves the performance across the board. We refer to this as RN-Loss. We prove this using the setting of PAC (Probably Approximately Correct) learnability. Depending on the context, the Radon-Nikod\'ym derivative takes different forms. In the simplest case of supervised anomaly detection, the Radon-Nikod\'ym derivative takes the form of a simple weighted loss. In the case of unsupervised anomaly detection (with distributional assumptions), the Radon-Nikod\'ym derivative takes the form of the popular cluster-based local outlier factor. We evaluate our algorithm on 96 datasets, including univariate and multivariate data from diverse domains, including healthcare, cybersecurity, and finance. We show that RN-derivative algorithms outperform state-of-the-art methods on 68% of multivariate datasets (based on F1 scores) and also achieve peak F1-scores on 72% of time-series (univariate) datasets.
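In the supervised case the abstract describes, the Radon-Nikod\'ym derivative reduces to per-class weights on the vanilla loss. The sketch below shows that simplest instance for binary cross-entropy; the weight values are left as inputs, since the appropriate density ratio depends on the data.

    import numpy as np

    def rn_weighted_bce(p, y, w_anomaly, w_normal=1.0, eps=1e-12):
        # p: predicted anomaly probabilities in (0, 1); y: {0, 1} labels.
        # Per-class weights play the role of the Radon-Nikodym derivative.
        w = np.where(y == 1, w_anomaly, w_normal)
        ll = y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)
        return float(np.mean(-w * ll))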
Authors: Weilin Chen, Ruichu Cai, Yuguang Yan, Zhifeng Hao, Jos\'e Miguel Hern\'andez-Lobato
Abstract: Long-term causal inference is an important but challenging problem across various scientific domains. To solve the latent confounding problem in long-term observational studies, existing methods leverage short-term experimental data. Ghassami et al. propose an approach based on the Conditional Additive Equi-Confounding Bias (CAECB) assumption, which asserts that the confounding bias in the short-term outcome is equal to that in the long-term outcome, so that the long-term confounding bias and the causal effects can be identified. While effective in certain cases, this assumption is limited to scenarios where there is only one short-term outcome with the same scale as the long-term outcome. In this paper, we introduce a novel assumption that extends the CAECB assumption to accommodate temporal short-term outcomes. Our proposed assumption states a functional relationship between sequential confounding biases across temporal short-term outcomes, under which we theoretically establish the identification of long-term causal effects. Based on the identification result, we develop an estimator and conduct a theoretical analysis of its asymptotic properties. Extensive experiments validate our theoretical results and demonstrate the effectiveness of the proposed method.
Authors: Zhangdie Yuan, Zifeng Ding, Andreas Vlachos
Abstract: Forecasting is an important task in many domains, such as technology and economics. However, existing forecasting benchmarks largely lack comprehensive confidence assessment, focus on limited question types, and often consist of artificial questions that do not align with real-world human forecasting needs. To address these gaps, we introduce FOReCAst (Future Outcome Reasoning and Confidence Assessment), a benchmark that evaluates models' ability to make predictions and their confidence in them. FOReCAst spans diverse forecasting scenarios involving Boolean questions, timeframe prediction, and quantity estimation, enabling a comprehensive evaluation of both prediction accuracy and confidence calibration for real-world applications.
Authors: Yongwoo Kim, Sungmin Cha, Donghyun Kim
Abstract: Machine unlearning is a process to remove specific data points from a trained model while maintaining the performance on retain data, addressing privacy or legal requirements. Despite its importance, existing unlearning evaluations tend to focus on logit-based metrics (i.e., accuracy) under small-scale scenarios. We observe that this could lead to a false sense of security in unlearning approaches under real-world scenarios. In this paper, we conduct a new comprehensive evaluation that employs representation-based evaluations of the unlearned model under large-scale scenarios to verify whether the unlearning approaches genuinely eliminate the targeted forget data from the model's representation perspective. Our analysis reveals that current state-of-the-art unlearning approaches either completely degrade the representational quality of the unlearned model or merely modify the classifier (i.e., the last layer), thereby achieving superior logit-based evaluation metrics while maintaining significant representational similarity to the original model. Furthermore, we introduce a rigorous unlearning evaluation setup, in which the forgetting classes exhibit semantic similarity to downstream task classes, necessitating that feature representations diverge significantly from those of the original model, thus enabling a more rigorous evaluation from a representation perspective. We hope our benchmark serves as a standardized protocol for evaluating unlearning algorithms under realistic conditions.
Authors: Guillaume Qu\'etant, Pavlo Molchanov, Slava Voloshynovskiy
Abstract: We present a semi-supervised fine-tuning framework for foundation models that utilises mutual information decomposition to address the challenges of training for a limited amount of labelled data. Our approach derives two distinct lower bounds: i) for the downstream task space, such as classification, optimised using conditional and marginal cross-entropy alongside Kullback-Leibler divergence, and ii) for the latent space representation, regularised and aligned using a contrastive-like decomposition. This fine-tuning strategy retains the pre-trained structure of the foundation model, modifying only a specialised projector module comprising a small transformer and a token aggregation technique. Experiments on several datasets demonstrate significant improvements in classification tasks under extremely low-labelled conditions by effectively leveraging unlabelled data.
Authors: Rohan Bhatnagar, Ling Liang, Krish Patel, Haizhao Yang
Abstract: Motivated by the remarkable success of artificial intelligence (AI) across diverse fields, the application of AI to solve scientific problems, often formulated as partial differential equations (PDEs), has garnered increasing attention. While most existing research concentrates on theoretical properties (such as well-posedness, regularity, and continuity) of the solutions, alongside direct AI-driven methods for solving PDEs, the challenge of uncovering symbolic relationships within these equations remains largely unexplored. In this paper, we propose leveraging large language models (LLMs) to learn such symbolic relationships. Our results demonstrate that LLMs can effectively predict the operators involved in PDE solutions by utilizing the symbolic information in the PDEs both theoretically and numerically. Furthermore, we show that discovering these symbolic relationships can substantially improve both the efficiency and accuracy of symbolic machine learning for finding analytical approximation of PDE solutions, delivering a fully interpretable solution pipeline. This work opens new avenues for understanding the symbolic structure of scientific problems and advancing their solution processes.
Authors: Daniel Racz, Mihaly Petreczky, Balint Daroczy
Abstract: We derive a finite-sample probabilistic bound on the parameter estimation error of a system identification algorithm for Linear Switched Systems. The algorithm estimates Markov parameters from a single trajectory and applies a variant of the Ho-Kalman algorithm to recover the system matrices. Our bound guarantees statistical consistency under the assumption that the true system exhibits quadratic stability. The proof leverages the theory of weakly dependent processes. To the best of our knowledge, this is the first finite-sample bound for this algorithm in the single-trajectory setting.
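For context, the classical (non-switched) Ho-Kalman step the algorithm adapts takes a Hankel matrix of estimated Markov parameters and factors it by SVD to read off the system matrices up to similarity. The sketch below shows that skeleton; the switched-system variant and the single-trajectory Markov-parameter estimation are the paper's contribution.

    import numpy as np

    def ho_kalman(markov, n):
        # markov: list of 2k Markov parameters [CB, CAB, CA^2B, ...], each (p, m);
        # n: assumed state dimension. Returns (A, B, C) up to similarity.
        p, m = markov[0].shape
        k = len(markov) // 2
        H = np.block([[markov[i + j] for j in range(k)] for i in range(k)])
        U, s, Vt = np.linalg.svd(H)
        Obs = U[:, :n] * np.sqrt(s[:n])          # extended observability matrix
        Ctr = np.sqrt(s[:n])[:, None] * Vt[:n]   # extended controllability matrix
        C, B = Obs[:p], Ctr[:, :m]
        A = np.linalg.pinv(Obs[:-p]) @ Obs[p:]   # shift-invariance of Obs
        return A, B, C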
Authors: Da Ma, Gonghu Shang, Zhi Chen, Libo Qin, Yijie Luo, Lei Pan, Shuai Fan, Lu Chen, Kai Yu
Abstract: Instruction tuning improves the ability of large language models (LLMs) to follow diverse human instructions, but achieving strong performance on specific target tasks remains challenging. A critical bottleneck is selecting the most relevant data to maximize task-specific performance. Existing data selection approaches include unstable influence-based methods and more stable distribution alignment methods, the latter of which critically rely on the underlying sample representation. In practice, most distribution alignment methods, from shallow features (e.g., BM25) to neural embeddings (e.g., BGE, LLM2Vec), may fail to capture how the model internally processes samples. To bridge this gap, we adopt a model-centric strategy in which each sample is represented by its neuronal activation pattern in the model, directly reflecting internal computation. However, directly using raw neuron activations leads to spurious similarity between unrelated samples due to neuron polysemanticity, where a single neuron may respond to multiple, unrelated concepts. To address this, we employ sparse autoencoders to disentangle polysemantic activations into sparse, monosemantic representations, and introduce a dedicated similarity metric for this space to better identify task-relevant data. Comprehensive experiments across multiple instruction datasets, models, tasks, and selection ratios show that our approach consistently outperforms existing data selection baselines in both stability and task-specific performance.
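The representation pipeline can be sketched as: encode each sample's activation pattern with a sparse autoencoder's encoder, then compare samples in the sparse feature space. In the snippet below, the encoder weights are placeholders and plain cosine similarity stands in for the paper's dedicated metric.

    import numpy as np

    def sae_encode(acts, W_enc, b_enc):
        # acts: (d,) raw neuron activations -> sparse, non-negative features.
        return np.maximum(acts @ W_enc + b_enc, 0.0)

    def sparse_similarity(a, b, W_enc, b_enc, eps=1e-12):
        # Cosine similarity in the disentangled feature space (a stand-in for
        # the dedicated metric introduced in the paper).
        za, zb = sae_encode(a, W_enc, b_enc), sae_encode(b, W_enc, b_enc)
        return float(za @ zb / (np.linalg.norm(za) * np.linalg.norm(zb) + eps))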
Authors: Jianping Ye, Michel Wedel, Kunpeng Zhang
Abstract: Text-to-Image Diffusion models excel at generating images from text prompts but often exhibit suboptimal alignment with content semantics, aesthetics, and human preferences. To address these limitations, this study proposes a novel parameter-efficient framework, Indirect Prompt Gradient Optimization (IPGO), for prompt-level diffusion model fine-tuning. IPGO enhances prompt embeddings by injecting continuously differentiable embeddings at the beginning and end of the prompt embeddings, leveraging low-rank structures with the flexibility and nonlinearity from rotations. This approach enables gradient-based optimization of injected embeddings under range, orthonormality, and conformity constraints, effectively narrowing the search space, promoting a stable solution, and ensuring alignment between the injected embeddings and the original prompt embeddings. Its extension IPGO+ adds a parameter-free cross-attention mechanism on the prompt embedding to enforce dependencies between the original prompt and the inserted embeddings. We conduct extensive evaluations through prompt-wise (IPGO) and prompt-batch (IPGO+) training using three reward models of image aesthetics, image-text alignment, and human preferences across three datasets of varying complexity. The results show that IPGO consistently outperforms SOTA benchmarks, including stable diffusion v1.5 with raw prompts, text-embedding-based methods (TextCraftor), training-based methods (DRaFT and DDPO), and training-free methods (DPO-Diffusion, Promptist, and ChatGPT-4o). Specifically, IPGO achieves a win-rate exceeding 99% in prompt-wise learning, and IPGO+ achieves comparable, and often better, performance against current SOTAs (a 75% win rate) in prompt-batch learning. Moreover, we illustrate IPGO's generalizability and its capability to significantly enhance image quality while requiring minimal data and resources.
Authors: Qian Zuo, Fengxiang He
Abstract: This paper studies constrained Markov decision processes (CMDPs) with constraints against stochastic thresholds, aiming at the safety of reinforcement learning in unknown and uncertain environments. We leverage a Growing-Window estimator sampling from interactions with the uncertain and dynamic environment to estimate the thresholds, based on which we design Stochastic Pessimistic-Optimistic Thresholding (SPOT), a novel model-based primal-dual algorithm for multiple constraints against stochastic thresholds. SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings. We prove that our algorithm achieves sublinear regret and constraint violation; i.e., a reward regret of $\tilde{\mathcal{O}}(\sqrt{T})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{T})$ constraint violation over $T$ episodes. The theoretical guarantees show that our algorithm achieves performance comparable to that of an approach relying on fixed and clear thresholds. To the best of our knowledge, SPOT is the first reinforcement learning algorithm that realises theoretically guaranteed performance in an uncertain environment where even thresholds are unknown.
Authors: Christopher M. Ackerman, Nina Panickssery
Abstract: Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a "fake" assistant responding inappropriately before the final request. With enough examples, the model's in-context learning abilities override its safety training, and it responds as if it were the "fake" assistant. In this work, we probe the effectiveness of different fine-tuning and input sanitization approaches on mitigating MSJ attacks, alone and in combination. We find incremental mitigation effectiveness for each, and show that the combined techniques significantly reduce the effectiveness of MSJ attacks, while retaining model performance in benign in-context learning and conversational tasks. We suggest that our approach could meaningfully ameliorate this vulnerability if incorporated into model safety post-training.
Authors: Yamato Arai, Yuma Ichikawa
Abstract: Layer-wise post-training quantization (PTQ) is a promising technique for compressing large language models (LLMs), due to its simplicity and effectiveness without requiring retraining. However, recent progress in this area is saturating, underscoring the need to revisit its core limitations and explore further improvements. We address this challenge by identifying a key limitation of existing layer-wise PTQ methods: the growth of quantization errors across layers significantly degrades performance, particularly in low-bit regimes. To address this fundamental issue, we propose Quantization Error Propagation (QEP), a general, lightweight, and scalable framework that enhances layer-wise PTQ by explicitly propagating quantization errors and compensating for accumulated errors. QEP also offers a tunable propagation mechanism that prevents overfitting and controls computational overhead, enabling the framework to adapt to various architectures and resource budgets. Extensive experiments on several LLMs demonstrate that QEP-enhanced layer-wise PTQ achieves substantially higher accuracy than existing methods. Notably, the gains are most pronounced in the extremely low-bit quantization regime.
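A hedged sketch of the propagation idea follows: each layer is calibrated on inputs that carry the quantization error of the layers already quantized, with a blending factor mimicking QEP's tunable propagation strength. The uniform quantizer and the toy ReLU network are stand-ins, not the paper's procedure.

    import torch

    def quantize(w, bits=4):
        # Plain symmetric uniform quantizer, standing in for any
        # layer-wise PTQ routine.
        scale = w.abs().max() / (2 ** (bits - 1) - 1)
        return torch.round(w / scale) * scale

    def ptq_with_error_propagation(weights, x_calib, alpha=1.0):
        # alpha = 0: calibrate on clean activations (standard layer-wise PTQ);
        # alpha = 1: calibrate on the fully propagated quantized path.
        x_fp, x_q, q_weights = x_calib, x_calib, []
        for W in weights:
            calib_in = alpha * x_q + (1 - alpha) * x_fp
            Wq = quantize(W)  # a real method would calibrate Wq on calib_in
            q_weights.append(Wq)
            x_fp = torch.relu(x_fp @ W.T)      # full-precision reference path
            x_q = torch.relu(calib_in @ Wq.T)  # quantized path carrying errors
        return q_weights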
Authors: Yihang Lu, Yangyang Xu, Qitao Qing, Xianwei Meng
Abstract: Recent deep learning models for Long-term Time Series Forecasting (LTSF) often emphasize complex, handcrafted designs, while simpler architectures like linear models or MLPs have often outperformed these intricate solutions. In this paper, we revisit and organize the core ideas behind several key techniques, such as redundancy reduction and multi-scale modeling, which are frequently employed in advanced LTSF models. Our goal is to streamline these ideas for more efficient deep learning utilization. To this end, we introduce TimeCapsule, a model built around the principle of high-dimensional information compression that unifies these techniques in a generalized yet simplified framework. Specifically, we model time series as a 3D tensor, incorporating temporal, variate, and level dimensions, and leverage tensor mode products to capture multi-mode dependencies while achieving dimensionality compression. We propose an internal forecast within the compressed representation domain, supported by the Joint-Embedding Predictive Architecture (JEPA), to monitor the learning of predictive representations. Extensive experiments on challenging benchmarks demonstrate the versatility of our method, showing that TimeCapsule can achieve state-of-the-art performance.
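For readers unfamiliar with mode products, the toy snippet below contracts a matrix against the temporal mode of a (time, variate, level) tensor, compressing that dimension; the shapes are illustrative, not the paper's configuration.

    import torch

    def mode1_product(X, U):
        # Contract the first (temporal) mode of X with U:
        # (T, V, L) x (C, T) -> (C, V, L).
        return torch.einsum('tvl,ct->cvl', X, U)

    X = torch.randn(96, 7, 3)   # toy series: 96 steps, 7 variates, 3 levels
    U = torch.randn(16, 96)     # learnable compression along the time mode
    Z = mode1_product(X, U)     # compressed representation: (16, 7, 3)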
Authors: Yunhui Liu, Jiashun Cheng, Yiqing Lin, Qizhuo Xie, Jia Li, Fugee Tsung, Hongzhi Yin, Tao Zheng, Jianhua Zhao, Tieke He
Abstract: Graph anomaly detection (GAD) has garnered increasing attention in recent years, yet remains challenging due to two key factors: (1) label scarcity stemming from the high cost of annotations and (2) homophily disparity at node and class levels. In this paper, we introduce Anomaly-Aware Pre-Training and Fine-Tuning (APF), a targeted and effective framework to mitigate the above challenges in GAD. In the pre-training stage, APF incorporates node-specific subgraphs selected via the Rayleigh Quotient, a label-free anomaly metric, into the learning objective to enhance anomaly awareness. It further introduces two learnable spectral polynomial filters to jointly learn dual representations that capture both general semantics and subtle anomaly cues. During fine-tuning, a gated fusion mechanism adaptively integrates pre-trained representations across nodes and dimensions, while an anomaly-aware regularization loss encourages abnormal nodes to preserve more anomaly-relevant information. Furthermore, we theoretically show that APF tends to achieve linear separability under mild conditions. Comprehensive experiments on 10 benchmark datasets validate the superior performance of APF in comparison to state-of-the-art baselines.
Authors: Christopher B\"ulte, Yusuf Sale, Timo L\"ohr, Paul Hofman, Gitta Kutyniok, Eyke H\"ullermeier
Abstract: Uncertainty quantification (UQ) is crucial in machine learning, yet most (axiomatic) studies of uncertainty measures focus on classification, leaving a gap in regression settings with limited formal justification and evaluations. In this work, we introduce a set of axioms to rigorously assess measures of aleatoric, epistemic, and total uncertainty in supervised regression. By utilizing a predictive exponential family, we can generalize commonly used approaches for uncertainty representation and corresponding uncertainty measures. More specifically, we analyze the widely used entropy- and variance-based measures regarding limitations and challenges. Our findings provide a principled foundation for uncertainty quantification in regression, offering theoretical insights and practical guidelines for reliable uncertainty assessment.
Authors: Filipp Nikitin, Ian Dunn, David Ryan Koes, Olexandr Isayev
Abstract: Deep generative models have shown significant promise in generating valid 3D molecular structures, with the GEOM-Drugs dataset serving as a key benchmark. However, current evaluation protocols suffer from critical flaws, including incorrect valency definitions, bugs in bond order calculations, and reliance on force fields inconsistent with the reference data. In this work, we revisit GEOM-Drugs and propose a corrected evaluation framework: we identify and fix issues in data preprocessing, construct chemically accurate valency tables, and introduce a GFN2-xTB-based geometry and energy benchmark. We retrain and re-evaluate several leading models under this framework, providing updated performance metrics and practical recommendations for future benchmarking. Our results underscore the need for chemically rigorous evaluation practices in 3D molecular generation. Our recommended evaluation methods and GEOM-Drugs processing scripts are available at https://github.com/isayevlab/geom-drugs-3dgen-evaluation.
URLs: https://github.com/isayevlab/geom-drugs-3dgen-evaluation.
Authors: Pinaki Mohanty, Riddhiman Bhattacharya, Ruqi Zhang
Abstract: Sampling from flat modes in discrete spaces is a crucial yet underexplored problem. Flat modes represent robust solutions and have broad applications in combinatorial optimization and discrete generative modeling. However, existing sampling algorithms often overlook the mode volume and struggle to capture flat modes effectively. To address this limitation, we propose \emph{Entropic Discrete Langevin Proposal} (EDLP), which incorporates local entropy into the sampling process through a continuous auxiliary variable under a joint distribution. The local entropy term guides the discrete sampler toward flat modes with a small overhead. We provide non-asymptotic convergence guarantees for EDLP in locally log-concave discrete distributions. Empirically, our method consistently outperforms traditional approaches across tasks that require sampling from flat basins, including Bernoulli distributions, restricted Boltzmann machines, combinatorial optimization, and binary neural networks.
Authors: Soumen Kumar Mondal, Akshit Varmora, Prateek Chanda, Ganesh Ramakrishnan
Abstract: We propose FairPO, a novel framework designed to promote fairness in multi-label classification by directly optimizing preference signals with a group robustness perspective. In our framework, the set of labels is partitioned into privileged and non-privileged groups, and a preference-based loss inspired by Direct Preference Optimization (DPO) is employed to more effectively differentiate true positive labels from confusing negatives within the privileged group, while preserving baseline classification performance for non-privileged labels. By framing the learning problem as a robust optimization over groups, our approach dynamically adjusts the training emphasis toward groups with poorer performance, thereby mitigating bias and ensuring a fairer treatment across diverse label categories. In addition, we outline plans to extend this approach by investigating alternative loss formulations such as Simple Preference Optimization (SimPO) and Contrastive Preference Optimization (CPO) to exploit reference-free reward formulations and contrastive training signals. Furthermore, we plan to extend FairPO with multilabel generation capabilities, enabling the model to dynamically generate diverse and coherent label sets for ambiguous inputs.
Authors: Yoel Zimmermann, Adib Bazgir, Alexander Al-Feghali, Mehrad Ansari, Joshua Bocarsly, L. Catherine Brinson, Yuan Chiang, Defne Circi, Min-Hsueh Chiu, Nathan Daelman, Matthew L. Evans, Abhijeet S. Gangan, Janine George, Hassan Harb, Ghazal Khalighinejad, Sartaaj Takrim Khan, Sascha Klawohn, Magdalena Lederbauer, Soroush Mahjoubi, Bernadette Mohr, Seyed Mohamad Moosavi, Aakash Naik, Aleyna Beste Ozhan, Dieter Plessers, Aritra Roy, Fabian Sch\"oppach, Philippe Schwaller, Carla Terboven, Katharina Ueltzen, Yue Wu, Shang Zhu, Jan Janssen, Calvin Li, Ian Foster, Ben Blaiszik
Abstract: Large Language Models (LLMs) are reshaping many aspects of materials science and chemistry research, enabling advances in molecular property prediction, materials design, scientific automation, knowledge extraction, and more. Recent developments demonstrate that the latest class of models is able to integrate structured and unstructured data, assist in hypothesis generation, and streamline research workflows. To explore the frontier of LLM capabilities across the research lifecycle, we review applications of LLMs through 34 projects developed during the second annual Large Language Model Hackathon for Applications in Materials Science and Chemistry, a global hybrid event. These projects spanned seven key research areas: (1) molecular and material property prediction, (2) molecular and material design, (3) automation and novel interfaces, (4) scientific communication and education, (5) research data management and automation, (6) hypothesis generation and evaluation, and (7) knowledge extraction and reasoning from the scientific literature. Collectively, these applications illustrate how LLMs serve as versatile predictive models, platforms for rapid prototyping of domain-specific tools, and much more. In particular, improvements in both open-source and proprietary LLM performance through the addition of reasoning, additional training data, and new techniques have expanded their effectiveness, particularly in low-data environments and interdisciplinary research. As LLMs continue to improve, their integration into scientific workflows presents both new opportunities and new challenges, requiring ongoing exploration, continued refinement, and further research to address reliability, interpretability, and reproducibility.
Authors: Johannes Schneider
Abstract: We ask: Does focusing on classes predicted as likely improve model predictions? We aim for an affirmative answer by proposing two novel test-time fine-tuning methods to improve uncertain model predictions. Instead of greedily selecting the most likely class, we introduce an additional step, \emph{focus on the likely classes}, to refine predictions. By applying a theoretically motivated single gradient descent step with a large learning rate, we refine predictions when an initial forward pass indicates high uncertainty. This aligns predictions more closely with the ideal of assigning zero probability to less plausible outcomes. The experimental evaluation demonstrates accuracy gains for one of our methods, which emphasizes shared features among likely classes, across diverse text and image domain models.
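A minimal sketch of the test-time step, assuming a single input (batch of one) and a top-k notion of "likely classes"; the uncertainty trigger, target construction, and hyperparameters are illustrative rather than the paper's exact procedure.

    import torch
    import torch.nn.functional as F

    def focus_on_likely(model, x, k=3, lr=1.0, tau=1.0):
        logits = model(x)                    # shape (1, n_classes)
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.log()).sum()
        if entropy.item() < tau:             # confident enough: keep as-is
            return logits
        # Build a soft target spread uniformly over the k likely classes.
        topk = probs.topk(k, dim=-1).indices
        target = torch.zeros_like(probs).scatter(-1, topk, 1.0 / k)
        F.cross_entropy(logits, target).backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p -= lr * p.grad         # one large-learning-rate step
        model.zero_grad()
        return model(x)                      # refined prediction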
Authors: Milad Khademi Nori, Il-Min Kim, Guanghui Wang
Abstract: In class-incremental learning (CIL), effective incremental learning strategies are essential to mitigate task confusion and catastrophic forgetting, especially as the number of tasks $t$ increases. Current exemplar replay strategies impose $\mathcal{O}(t)$ memory/compute complexities. We propose an autoencoder-based hybrid replay (AHR) strategy that leverages our new hybrid autoencoder (HAE) as a compressor to alleviate the requirement for large memory, achieving $\mathcal{O}(0.1t)$ memory in the worst case with $\mathcal{O}(t)$ computing complexity while maintaining state-of-the-art performance. The decoder later recovers the exemplar data stored in the latent space, rather than in raw format. Additionally, HAE is designed for both discriminative and generative modeling, enabling classification and replay capabilities, respectively. HAE adopts the charged particle system energy minimization equations and repulsive force algorithm for the incremental embedding and distribution of new class centroids in its latent space. Our results demonstrate that AHR consistently outperforms recent baselines across multiple benchmarks while operating with the same memory/compute budgets. The source code is included in the supplementary material and will be open-sourced upon publication.
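The latent-replay mechanic can be sketched as follows, with encoder and decoder standing in for the paper's hybrid autoencoder; storage budgets and eviction policies are omitted.

    import torch

    class LatentReplayBuffer:
        # Store encoder latents instead of raw exemplars, and decode them
        # on demand when rehearsing earlier tasks.
        def __init__(self, encoder, decoder):
            self.encoder, self.decoder = encoder, decoder
            self.latents, self.labels = [], []

        def add(self, x, y):
            with torch.no_grad():
                self.latents.append(self.encoder(x))  # compact latent code
            self.labels.append(y)

        def replay(self):
            z = torch.cat(self.latents)
            with torch.no_grad():
                x_rec = self.decoder(z)               # reconstructed exemplars
            return x_rec, torch.cat(self.labels)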
Authors: Neil De La Fuente, Maria Pilligua, Daniel Vidal, Albin Soutiff, Cecilia Curreli, Daniel Cremers, Andrey Barsky
Abstract: Continual learning (CL) aims to learn a sequence of tasks without forgetting prior knowledge, but gradient updates for a new task often overwrite the weights learned earlier, causing catastrophic forgetting (CF). We propose Prototype-Augmented Hypernetworks (PAH), a framework where a single hypernetwork, conditioned on learnable task prototypes, dynamically generates task-specific classifier heads on demand. To mitigate forgetting, PAH combines cross-entropy with dual distillation losses, one to align logits and another to align prototypes, ensuring stable feature representations across tasks. Evaluations on Split-CIFAR100 and TinyImageNet demonstrate that PAH achieves state-of-the-art performance, reaching 74.5% and 63.7% accuracy with only 1.7% and 4.4% forgetting, respectively, surpassing prior methods without storing samples or heads.
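A minimal sketch of the central mechanic, a hypernetwork that maps a learnable task prototype to the weights of a task-specific classifier head; the dual distillation losses and training loop are omitted, and the shapes are illustrative.

    import torch
    import torch.nn as nn

    class PrototypeHypernet(nn.Module):
        def __init__(self, proto_dim, feat_dim, n_classes):
            super().__init__()
            self.gen = nn.Linear(proto_dim, feat_dim * n_classes)
            self.feat_dim, self.n_classes = feat_dim, n_classes

        def forward(self, prototype, features):
            # Generate a classifier head from the task prototype on demand.
            W = self.gen(prototype).view(self.n_classes, self.feat_dim)
            return features @ W.T            # task-specific logits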
Authors: Yanxin Liu, Yunqi Zhang
Abstract: Currently, the scaling law of Large Language Models (LLMs) faces challenges and bottlenecks. This paper posits that noise effects, stemming from changes in the signal-to-noise ratio under diminishing marginal returns, are the root cause of these issues. To control this noise, we investigated the differences between stronger- and weaker-performing models, introducing the concept of "relative overfitting." Based on their complementary strengths, we propose an application framework, Accept-Reject (AR). In Natural Language Processing (NLP), we use LLMs and Small Language Models (SLMs) as the medium for discussion. This framework enables SLMs to exert a universal positive influence on LLM decision outputs, rather than the intuitively expected negative influence. We validated our approach using self-built models based on mainstream architectures and pre-trained mainstream models across multiple datasets, including basic language modeling, long-context tasks, subject examination, and question-answering (QA) benchmarks. The results demonstrate that through our structure, compared to increasing the LLM's parameters, we can achieve better performance improvements with significantly lower parameter and computational costs in many scenarios. These improvements are universal, stable, and effective. Furthermore, we explore the potential of "relative overfitting" and the AR framework in other machine learning domains, such as computer vision (CV) and AI for science. We hope the proposed approach can help scaling laws overcome existing bottlenecks.
Authors: Boshi Gao, Qingjian Ni, Fanbo Ju, Yu Chen, Ziqi Zhao
Abstract: Long-term time series forecasting (LTSF) offers broad utility in practical settings like energy consumption and weather prediction. Accurately predicting long-term changes, however, is demanding due to the intricate temporal patterns and inherent multi-scale variations within time series. This work confronts key issues in LTSF, including the suboptimal use of multi-granularity information, the neglect of channel-specific attributes, and the unique nature of trend and seasonal components, by introducing MDMixer, a proficient MLP-based forecasting framework. Our method adeptly disentangles complex temporal dynamics using clear, concurrent predictions across various scales. These multi-scale forecasts are then skillfully integrated through a system that dynamically assigns importance to information from different granularities, sensitive to individual channel characteristics. To manage the specific features of temporal patterns, a two-pronged structure is utilized to model trend and seasonal elements independently. Experimental results on eight LTSF benchmarks demonstrate that MDMixer improves average MAE performance by 4.64% compared to the recent state-of-the-art MLP-based method (TimeMixer), while achieving an effective balance between training efficiency and model interpretability.
Authors: Hang Gao, Wenxuan Huang, Fengge Wu, Junsuo Zhao, Changwen Zheng, Huaping Liu
Abstract: The use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs), has shown significant potential in graph representation learning. However, the fundamental properties of this approach remain underexplored. To address this gap, we conduct a more in-depth analysis based on the interchange intervention method. First, we construct a synthetic graph dataset with controllable causal relationships, enabling precise manipulation of semantic relationships and causal modeling to provide data for analysis. Using this dataset, we conduct interchange interventions to examine the deeper properties of LLM enhancers and GNNs, uncovering their underlying logic and internal mechanisms. Building on the analytical results, we design a plug-and-play optimization module to improve the information transfer between LLM enhancers and GNNs. Experiments across multiple datasets and models validate the proposed module.
Authors: Takashi Nicholas Maeda, Shohei Shimizu, Hidetoshi Matsui
Abstract: This paper proposes a causal discovery method for mixed bivariate data consisting of one continuous and one discrete variable. Existing constraint-based approaches are ineffective in the bivariate setting, as they rely on conditional independence tests that are not suited to bivariate data. Score-based methods either impose strong distributional assumptions or face challenges in fairly comparing causal directions between variables of different types, due to differences in their information content. We introduce a novel approach that determines causal direction by analyzing the monotonicity of the conditional density ratio of the continuous variable, conditioned on different values of the discrete variable. Our theoretical analysis shows that the conditional density ratio exhibits monotonicity when the continuous variable causes the discrete variable, but not in the reverse direction. This property provides a principled basis for comparing causal directions between variables of different types, free from strong distributional assumptions and bias arising from differences in their information content. We demonstrate its effectiveness through experiments on both synthetic and real-world datasets, showing superior accuracy compared to existing methods.
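The decision rule can be sketched with kernel density estimates: if the conditional density ratio of the continuous variable is monotone across the two discrete values, the continuous-to-discrete causal direction is favored. The KDE choice and evaluation grid below are illustrative.

    import numpy as np
    from scipy.stats import gaussian_kde

    def density_ratio_is_monotone(x_cont, y_disc, grid_size=100):
        # Estimate p(x | y=1) / p(x | y=0) and check monotonicity in x,
        # the signature expected when the continuous variable is the cause.
        kde0 = gaussian_kde(x_cont[y_disc == 0])
        kde1 = gaussian_kde(x_cont[y_disc == 1])
        grid = np.linspace(x_cont.min(), x_cont.max(), grid_size)
        ratio = kde1(grid) / np.maximum(kde0(grid), 1e-12)
        diffs = np.diff(ratio)
        return bool(np.all(diffs >= 0) or np.all(diffs <= 0))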
Authors: Julian B\"uchel, Iason Chalas, Giovanni Acampa, An Chen, Omobayode Fagbohungbe, Sidney Tsai, Kaoutar El Maghraoui, Manuel Le Gallo, Abbas Rahimi, Abu Sebastian
Abstract: Analog in-memory computing (AIMC) is a promising compute paradigm to improve speed and power efficiency of neural network inference beyond the limits of conventional von Neumann-based architectures. However, AIMC introduces fundamental challenges such as noisy computations and strict constraints on input and output quantization. Because of these constraints and imprecisions, off-the-shelf LLMs are not able to achieve 4-bit-level performance when deployed on AIMC-based hardware. While researchers previously investigated recovering this accuracy gap on small, mostly vision-based models, a generic method applicable to LLMs pre-trained on trillions of tokens does not yet exist. In this work, we introduce a general and scalable method to robustly adapt LLMs for execution on noisy, low-precision analog hardware. Our approach enables state-of-the-art models, including Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct, to retain performance comparable to 4-bit weight, 8-bit activation baselines, despite the presence of analog noise and quantization constraints. Additionally, we show that as a byproduct of our training methodology, analog foundation models can be quantized for inference on low-precision digital hardware. Finally, we show that our models also benefit from test-time compute scaling, showing better scaling behavior than models trained with 4-bit weight and 8-bit static input quantization. Our work bridges the gap between high-capacity LLMs and efficient analog hardware, offering a path toward energy-efficient foundation models. Code is available at https://github.com/IBM/analog-foundation-models.
Authors: George Dimitriadis, Spyridon Samothrakis
Abstract: Out-of-distribution (OOD) generalisation is considered a hallmark of human and animal intelligence. To achieve OOD generalisation through composition, a system must discover the environment-invariant properties of experienced input-output mappings and transfer them to novel inputs. This can be realised if an intelligent system can identify appropriate, task-invariant, and composable input features, as well as the composition methods, thus allowing it to act based not on the interpolation between learnt data points but on the task-invariant composition of those features. We propose that in order to confirm that an algorithm does indeed learn compositional structures from data, it is not enough to just test on an OOD setup; one also needs to confirm that the features identified are indeed compositional. We showcase this by exploring two tasks with clearly defined OOD metrics that are not OOD solvable by three commonly used neural networks: a Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), and a Transformer. In addition, we develop two novel network architectures imbued with biases that allow them to be successful in OOD scenarios. We show that even with correct biases and almost perfect OOD performance, an algorithm can still fail to learn the correct features for compositional generalisation.
Authors: Aditya Panangat
Abstract: Over the past decade, the use of machine learning has increased exponentially. Models are far more complex than ever before, growing to gargantuan sizes and housing millions of weights. Unfortunately, the fact that large models have become the state of the art means that it often costs millions of dollars to train and operate them. These expenses not only hurt companies but also bar non-wealthy individuals from contributing to new developments and force consumers to pay greater prices for AI. Current methods used to prune models, such as iterative magnitude pruning, preserve accuracy well but require an iterative training sequence that is extremely taxing, both computationally and environmentally. To solve this problem, BINGO is introduced. During the training pass, BINGO studies specific subsets of a neural network one at a time to gauge how significant a role each weight plays in the network's accuracy. By the time training is done, BINGO has generated a significance score for each weight, allowing insignificant weights to be pruned in one shot. BINGO provides an accuracy-preserving pruning technique that is less computationally intensive than current methods, allowing for a world where AI growth does not have to mean model growth.
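A hedged sketch of one-shot pruning from training-time significance scores follows. The |weight x gradient| score is a common saliency proxy used here only as a stand-in, since the abstract does not specify BINGO's actual scoring rule.

    import torch

    def train_with_scores(model, loader, loss_fn, opt, epochs=1):
        # Accumulate a per-weight significance score during the normal
        # training pass (saliency proxy: |weight * gradient|).
        scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                for n, p in model.named_parameters():
                    if p.grad is not None:
                        scores[n] += (p * p.grad).abs().detach()
                opt.step()
        return scores

    def one_shot_prune(model, scores, sparsity=0.9):
        # Zero out the globally least significant weights in a single shot.
        flat = torch.cat([s.flatten() for s in scores.values()])
        thresh = flat.kthvalue(int(sparsity * flat.numel())).values
        for n, p in model.named_parameters():
            p.data[scores[n] <= thresh] = 0.0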
Authors: Saikat Barua, Mostafizur Rahman, Shehenaz Khaled, Md Jafor Sadek, Rafiul Islam, Shahnewaz Siddique
Abstract: The emergence of hybrid quantum-classical machine learning (HQML) models opens new horizons of computational intelligence, but their fundamental complexity frequently leads to black-box behavior that undermines transparency and reliability in their application. Although XAI for quantum systems is still in its infancy, a major research gap is evident in robust global and local explainability approaches designed for HQML architectures that employ quantized feature encoding followed by classical learning. This gap is the focus of this work, which introduces QuXAI, a framework based upon Q-MEDLEY, an explainer for feature importance in these hybrid systems. Our pipeline entails creating HQML models that incorporate quantum feature maps, applying Q-MEDLEY, which combines feature-based inferences while preserving the quantum transformation stage, and visualizing the resulting attributions. Our results show that Q-MEDLEY identifies influential classical aspects in HQML models, separates them from noise, and competes well against established XAI techniques in classical validation settings. Ablation studies further highlight the strengths of the composite structure used in Q-MEDLEY. The implications of this work are critically important, as it provides a route to improving the interpretability and reliability of HQML models, thus promoting greater confidence and enabling safer and more responsible use of quantum-enhanced AI technology. Our code and experiments are open-sourced at: https://github.com/GitsSaikat/QuXAI
Authors: Victor Chernozhukov, Whitney Newey, Vira Semenova
Abstract: This paper introduces metrics for welfare analysis in dynamic models. We develop estimation and inference for these parameters even in the presence of a high-dimensional state space. Examples of welfare metrics include average welfare, average marginal welfare effects, and welfare decompositions into direct and indirect effects similar to Oaxaca (1973) and Blinder (1973). We derive dual and doubly robust representations of welfare metrics that facilitate debiased inference. For average welfare, the value function does not have to be estimated. In general, debiasing can be applied to any estimator of the value function, including neural nets, random forests, Lasso, boosting, and other high-dimensional methods. In particular, we derive Lasso and Neural Network estimators of the value function and associated dynamic dual representation and establish associated mean square convergence rates for these functions. Debiasing is automatic in the sense that it only requires knowledge of the welfare metric of interest, not the form of bias correction. The proposed methods are applied to estimate a dynamic behavioral model of teacher absenteeism in \cite{DHR} and associated average teacher welfare.
Authors: Omar Fawzi, Aadil Oufkir, Daniel Stilck Fran\c{c}a
Abstract: Understanding the noise affecting a quantum device is of fundamental importance for scaling quantum technologies. A particularly important class of noise models is that of Pauli channels, as randomized compiling techniques can effectively bring any quantum channel to this form and are significantly more structured than general quantum channels. In this paper, we show fundamental lower bounds on the sample complexity for learning Pauli channels in diamond norm. We consider strategies that may not use auxiliary systems entangled with the input to the unknown channel and have to perform a measurement before reusing the channel. For non-adaptive algorithms, we show a lower bound of $\Omega(2^{3n}\varepsilon^{-2})$ to learn an $n$-qubit Pauli channel. In particular, this shows that the recently introduced learning procedure by Flammia and Wallman is essentially optimal. In the adaptive setting, we show a lower bound of $\Omega(2^{2.5n}\varepsilon^{-2})$ for $\varepsilon=\mathcal{O}(2^{-n})$, and a lower bound of $\Omega(2^{2n}\varepsilon^{-2} )$ for any $\varepsilon> 0$. This last lower bound holds even in a stronger model where in each step, before performing the measurement, the unknown channel may be used arbitrarily many times sequentially interspersed with unital operations.
Authors: Ben Chugg, Santiago Cortes-Gomez, Bryan Wilder, Aaditya Ramdas
Abstract: We provide practical, efficient, and nonparametric methods for auditing the fairness of deployed classification and regression models. Whereas previous work relies on a fixed sample size, our methods are sequential and allow for the continuous monitoring of incoming data, making them highly amenable to tracking the fairness of real-world systems. We also allow the data to be collected by a probabilistic policy as opposed to sampled uniformly from the population. This enables auditing to be conducted on data gathered for another purpose. Moreover, this policy may change over time and different policies may be used on different subpopulations. Finally, our methods can handle distribution shift resulting from either changes to the model or changes in the underlying population. Our approach is based on recent progress in anytime-valid inference and game-theoretic statistics, the "testing by betting" framework in particular. These connections ensure that our methods are interpretable, fast, and easy to implement. We demonstrate the efficacy of our approach on three benchmark fairness datasets.
Authors: Yahya Saleh, \'Alvaro Fern\'andez Corral, Emil Vogt, Armin Iske, Jochen K\"upper, Andrey Yachmenev
Abstract: Calculations of highly excited and delocalized molecular vibrational states are computationally challenging tasks whose difficulty strongly depends on the choice of coordinates for describing vibrational motions. We introduce a new method that leverages normalizing flows -- parametrized invertible functions -- to learn optimal vibrational coordinates that satisfy the variational principle. This approach produces coordinates tailored to the vibrational problem at hand, significantly increasing the accuracy and enhancing basis-set convergence of the calculated energy spectrum. The efficiency of the method is demonstrated in calculations of the 100 lowest excited vibrational states of H$_2$S, H$_2$CO, and HCN/HNC. The method effectively captures the essential vibrational behavior of molecules by enhancing the separability of the Hamiltonian and hence allows for an effective assignment of approximate quantum numbers. We demonstrate that the optimized coordinates are transferable across different levels of basis-set truncation, enabling a cost-efficient protocol for computing vibrational spectra of high-dimensional systems.
Authors: Oskar Allerbo
Abstract: We investigate changing the bandwidth of a translational-invariant kernel during training when solving kernel regression with gradient descent. We present a theoretical bound on the out-of-sample generalization error that advocates for decreasing the bandwidth (and thus increasing the model complexity) during training. We further use the bound to show that kernel regression exhibits a double descent behavior when the model complexity is expressed as the minimum allowed bandwidth during training. Decreasing the bandwidth all the way to zero results in benign overfitting, and also circumvents the need for model selection. We demonstrate the double descent behavior on real and synthetic data and also demonstrate that kernel regression with a decreasing bandwidth outperforms that of a constant bandwidth, selected by cross-validation or marginal likelihood maximization. We finally apply our findings to neural networks, demonstrating that by modifying the neural tangent kernel (NTK) during training, making the NTK behave as if its bandwidth were decreasing to zero, we can make the network overfit more benignly, and converge in fewer iterations.
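The procedure can be sketched as gradient descent on kernel regression coefficients while the RBF bandwidth shrinks over iterations; the linear decay schedule below is illustrative, not the paper's prescription.

    import numpy as np

    def kernel_gd_decreasing_bw(X, y, T=500, lr=0.1, bw0=2.0, bw_min=0.05):
        # X: 1-D array of inputs, y: targets. Gradient descent on the
        # squared loss of f(x) = sum_j alpha_j k(x, x_j), with a
        # bandwidth that decreases as training proceeds.
        n = len(X)
        alpha = np.zeros(n)
        for t in range(T):
            bw = max(bw0 * (1 - t / T), bw_min)   # decreasing bandwidth
            K = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * bw ** 2))
            resid = K @ alpha - y
            alpha -= lr * (K @ resid) / n         # gradient step
        return alpha, bw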
Authors: Cristina Aggazzotti, Nicholas Andrews, Elizabeth Allyn Smith
Abstract: Authorship verification is the task of determining if two distinct writing samples share the same author and is typically concerned with the attribution of written text. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. The main challenge is that many stylistic features, such as punctuation and capitalization, are not informative in this setting. On the other hand, transcribed speech exhibits other patterns, such as filler words and backchannels (e.g., 'um', 'uh-huh'), which may be characteristic of different speakers. We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts. To limit spurious associations of speakers with topic, we employ both conversation prompts and speakers participating in the same conversation to construct verification trials of varying difficulties. We establish the state of the art on this new benchmark by comparing a suite of neural and non-neural baselines, finding that although written text attribution models achieve surprisingly good performance in certain settings, they perform markedly worse as conversational topic is increasingly controlled. We present analyses of the impact of transcription style on performance as well as the ability of fine-tuning on speech transcripts to improve performance.
Authors: Ainesh Bakshi, Vincent Cohen-Addad, Samuel B. Hopkins, Rajesh Jayaram, Silvio Lattanzi
Abstract: Metric embeddings are a widely used method in algorithm design, where generally a "complex" metric is embedded into a simpler, lower-dimensional one. Historically, the theoretical computer science community has focused on bi-Lipschitz embeddings, which guarantee that every pairwise distance is approximately preserved. In contrast, alternative embedding objectives that are commonly used in practice avoid bi-Lipschitz distortion; yet these approaches have received comparatively less study in theory. In this paper, we focus on Multi-dimensional Scaling (MDS), where we are given a set of non-negative dissimilarities $\{d_{i,j}\}_{i,j\in [n]}$ over $n$ points, and the goal is to find an embedding $\{x_1,\dots,x_n\} \subset R^k$ that minimizes $$\textrm{OPT}=\min_{x}\mathbb{E}_{i,j\in [n]}\left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2.$$ Despite its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine et al. (arXiv:2109.11505) gave the first approximation algorithm with provable guarantees for this objective, which achieves an embedding in constant dimensional Euclidean space with cost $\textrm{OPT} +\epsilon$ in $n^2\cdot 2^{\textrm{poly}(\Delta/\epsilon)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. For metrics that admit low-cost embeddings, $\Delta$ scales polynomially in $n$. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for constant dimensional Euclidean space, we achieve a solution with cost $O(\log \Delta)\cdot \textrm{OPT}^{\Omega(1)}+\epsilon$ in time $n^{O(1)} \cdot 2^{\text{poly}(\log(\Delta)/\epsilon)}$. Our algorithms are based on a novel geometry-aware analysis of a conditional rounding of the Sherali-Adams LP Hierarchy, allowing us to avoid exponential dependency on the aspect ratio, which would typically result from this rounding.
Authors: Ali Rabiee, Sima Ghafoori, Anna Cetera, Reza Abiri
Abstract: This research aims to decode hand grasps from Electroencephalograms (EEGs) for dexterous neuroprosthetic development and Brain-Computer Interface (BCI) applications, especially for patients with motor disorders. In particular, it focuses on distinguishing two complex natural grasps, power and precision, in addition to a neutral no-movement condition, using a new EEG-based BCI platform and wavelet signal processing. Wavelet analysis involved generating time-frequency and topographic maps from wavelet power coefficients. Then, by using machine learning techniques with novel wavelet features, we achieved high average accuracies: 85.16% for multiclass, 95.37% for No-Movement vs Power, 95.40% for No-Movement vs Precision, and 88.07% for Power vs Precision, demonstrating the effectiveness of these features in EEG-based grasp differentiation. In contrast to previous studies, a critical part of our study was permutation feature importance analysis, which highlighted key features for grasp classification. It revealed that the most crucial brain activities during grasping occur in the motor cortex, within the alpha and beta frequency bands. These insights demonstrate the potential of wavelet features in real-time neuroprosthetic technology and BCI applications.
Authors: Lucas de Lara (UT3, IMT), Mathis Deronzier (UT3, IMT), Alberto Gonz\'alez-Sanz (UT3, IMT), Virgile Foy (UT3, IMT)
Abstract: The push-forward operation enables one to redistribute a probability measure through a deterministic map. It plays a key role in statistics and optimization: many learning problems (notably from optimal transport, generative modeling, and algorithmic fairness) include constraints or penalties framed as push-forward conditions on the model. However, the literature lacks general theoretical insights on the (non)convexity of such constraints and its consequences on the associated learning problems. This paper aims at filling this gap. In the first part, we provide a range of sufficient and necessary conditions for the (non)convexity of two sets of functions: the maps transporting one probability measure to another and the maps inducing equal output distributions across distinct probability measures. This highlights that for most probability measures, these push-forward constraints are not convex. In the second part, we show how this result implies critical limitations on the design of convex optimization problems for learning generative models or groupwise fair predictors. This work will hopefully help researchers and practitioners have a better understanding of the critical impact of push-forward conditions onto convexity.
Authors: Hanyin Wang, Chufan Gao, Bolun Liu, Qiping Xu, Guleid Hussein, Mohamad El Labban, Kingsley Iheasirim, Hariprasad Korsapati, Chuck Outcalt, Jimeng Sun
Abstract: Proprietary Large Language Models (LLMs) such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally-hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pre-training, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduced a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (90.4%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across all three criteria: real-world readiness, completeness, and accuracy. In the more challenging "Assessment and Plan" section, LLaMA-Clinic scored higher (4.2/5) in real-world readiness than physician-authored notes (4.1/5). We highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a best-practice note format, rather than relying on LLMs to determine this for clinical practice.
Authors: Yuang Zhao, Zhaocheng Du, Qinglin Jia, Linxuan Zhang, Zhenhua Dong, Ruiming Tang
Abstract: With the increase in the business scale and number of domains in online advertising, multi-domain ad recommendation has become a mainstream solution in the industry. The core of multi-domain recommendation is effectively modeling the commonalities and distinctions among domains. Existing works are dedicated to designing model architectures for implicit multi-domain modeling while overlooking an in-depth investigation from a more fundamental perspective of feature distributions. This paper focuses on features with significant differences across various domains in both distributions and effects on model predictions. We refer to these features as domain-sensitive features, which serve as carriers of domain distinctions and are crucial for multi-domain modeling. Experiments demonstrate that existing multi-domain modeling methods may neglect domain-sensitive features, indicating insufficient learning of domain distinctions. To avoid this neglect, we propose a domain-sensitive feature attribution method to identify features that best reflect domain distinctions from the feature set. Further, we design a memory architecture that extracts domain-specific information from domain-sensitive features for the model to retrieve and integrate, thereby enhancing the awareness of domain distinctions. Extensive offline and online experiments demonstrate the superiority of our method in capturing domain distinctions and improving multi-domain recommendation performance.
Authors: Miko{\l}aj Ma{\l}ki\'nski, Jacek Ma\'ndziuk
Abstract: We study generalization and knowledge reuse capabilities of deep neural networks in the domain of abstract visual reasoning (AVR), employing Raven's Progressive Matrices (RPMs), a recognized benchmark task for assessing AVR abilities. Two knowledge transfer scenarios referring to the I-RAVEN dataset are investigated. Firstly, inspired by the generalization assessment capabilities of the PGM dataset and the popularity of I-RAVEN, we introduce Attributeless-I-RAVEN (A-I-RAVEN), a benchmark with 10 generalization regimes that allow systematic testing of how abstract rules generalize to held-out attributes at various levels of complexity (primary and extended regimes). In contrast to PGM, A-I-RAVEN features compositionality, a variety of figure configurations, and does not require substantial computational resources. Secondly, we construct I-RAVEN-Mesh, a dataset that enriches RPMs with a novel component structure comprising line-based patterns, facilitating assessment of progressive knowledge acquisition in a transfer learning setting. We evaluate 13 strong models from the AVR literature on the introduced datasets, revealing their specific shortcomings in generalization and knowledge transfer.
Authors: Paul Fuchs, Stephan Thaler, Sebastien R\"ocken, Julija Zavadlav
Abstract: Neural Networks (NNs) are effective models for refining the accuracy of molecular dynamics, opening up new fields of application. Typically trained bottom-up, atomistic NN potential models can reach first-principle accuracy, while coarse-grained implicit solvent NN potentials surpass classical continuum solvent models. However, overcoming the limitations of costly generation of accurate reference data and data inefficiency of common bottom-up training demands efficient incorporation of data from many sources. This paper introduces the framework chemtrain to learn sophisticated NN potential models through customizable training routines and advanced training algorithms. These routines can combine multiple top-down and bottom-up algorithms, e.g., to incorporate both experimental and simulation data or pre-train potentials with less costly algorithms. chemtrain provides an object-oriented high-level interface to simplify the creation of custom routines. On the lower level, chemtrain relies on JAX to compute gradients and scale the computations to use available resources. We demonstrate the simplicity and importance of combining multiple algorithms in the examples of parametrizing an all-atomistic model of titanium and a coarse-grained implicit solvent model of alanine dipeptide.
Authors: William E. R. de Amorim, Scott A. Sisson, T. Rodrigues, David J. Nott, Guilherme S. Rodrigues
Abstract: Positional Encoder Graph Neural Networks (PE-GNNs) are among the most effective models for learning from continuous spatial data. However, their predictive distributions are often poorly calibrated, limiting their utility in applications that require reliable uncertainty quantification. We propose the Positional Encoder Graph Quantile Neural Network (PE-GQNN), a novel framework that combines PE-GNNs with Quantile Neural Networks, partially monotonic neural blocks, and post-hoc recalibration techniques. The PE-GQNN enables flexible and robust conditional density estimation with minimal assumptions about the target distribution, and it extends naturally to tasks beyond spatial data. Empirical results on benchmark datasets show that the PE-GQNN outperforms existing methods in both predictive accuracy and uncertainty quantification, without incurring additional computational cost. We also provide theoretical insights and identify important special cases arising from our formulation, including the PE-GNN.
Authors: Shojiro Yamabe, Futa Waseda, Tsubasa Takahashi, Koki Wataoka
Abstract: Protecting the intellectual property of Large Language Models (LLMs) has become increasingly critical due to the high cost of training. Model merging, which integrates multiple expert models into a single multi-task model, introduces a novel risk of unauthorized use of LLMs due to its efficient merging process. While fingerprinting techniques have been proposed for verifying model ownership, their resistance to model merging remains unexplored. To address this gap, we propose a novel fingerprinting method, MergePrint, which embeds robust fingerprints capable of surviving model merging. MergePrint enables black-box ownership verification, where owners only need to check if a model produces target outputs for specific fingerprint inputs, without accessing model weights or intermediate outputs. By optimizing against a pseudo-merged model that simulates merged behavior, MergePrint ensures fingerprints that remain detectable after merging. Additionally, to minimize performance degradation, we pre-optimize the fingerprint inputs. MergePrint pioneers a practical solution for black-box ownership verification, protecting LLMs from misappropriation via merging, while also excelling in resistance to broader model theft threats.
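In practice, black-box verification reduces to a query-and-match loop, sketched below; the fingerprint pairs and the text-generation interface are assumed to be supplied by the model owner.

    def verify_ownership(generate, fingerprint_pairs):
        # `generate` is any text-in/text-out interface to the suspect model;
        # `fingerprint_pairs` holds (prompt, target output) fingerprints.
        hits = sum(
            target in generate(prompt)
            for prompt, target in fingerprint_pairs
        )
        return hits / len(fingerprint_pairs)  # fingerprint detection rate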
Authors: Peifan Jiang, Xuben Wang, Shuang Wang, Fei Deng, Kunpeng Wang, Bin Wang, Yuhan Yang
Abstract: Magnetotelluric (MT) deep learning (DL) inversion methods based on jointly data-driven and physics-driven approaches have become a hot topic in recent years. When mapping observation data (or forward modeling data) to the resistivity model using neural networks (NNs), incorporating the error (loss) term of the inversion resistivity's forward modeling response, which introduces physical information about electromagnetic field propagation, can significantly enhance the inversion accuracy. To efficiently achieve data-physics dual-driven MT deep learning inversion for large-scale 3-D MT data, we propose using DL forward modeling networks to compute this portion of the loss. This approach introduces pseudo-physical information through the forward modeling of NN simulation, further guiding the inversion network fitting. Specifically, we first pre-train the forward modeling networks as fixed forward modeling operators, then transfer and integrate them into the inversion network training, and finally optimize the inversion network by minimizing the multinomial loss. Experimental results indicate that despite some simulation errors in DL forward modeling, the introduced pseudo-physical information still enhances inversion accuracy and significantly mitigates the overfitting problem during training. Additionally, we propose a new input mode that involves masking and adding noise to the data, simulating the field data environment of 3-D MT inversion, thereby making the method more flexible and effective for practical applications.
Authors: Jenelle Feather, David Lipshutz, Sarah E. Harvey, Alex H. Williams, Eero P. Simoncelli
Abstract: Image representations (artificial or biological) are often compared in terms of their global geometric structure; however, representations with similar global structure can have strikingly different local geometries. Here, we propose a framework for comparing a set of image representations in terms of their local geometries. We quantify the local geometry of a representation using the Fisher information matrix, a standard statistical tool for characterizing the sensitivity to local stimulus distortions, and use this as a substrate for a metric on the local geometry in the vicinity of a base image. This metric may then be used to optimally differentiate a set of models, by finding a pair of "principal distortions" that maximize the variance of the models under this metric. As an example, we use this framework to compare a set of simple models of the early visual system, identifying a novel set of image distortions that allow immediate comparison of the models by visual inspection. In a second example, we apply our method to a set of deep neural network models and reveal differences in the local geometry that arise due to architecture and training types. These examples demonstrate how our framework can be used to probe for informative differences in local sensitivities between complex models, and suggest how it could be used to compare model representations with human perception.
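Under an additive Gaussian noise assumption, the Fisher information matrix of a deterministic response f at input x reduces to J^T J, with J the Jacobian of f, which suggests the following sketch; the random-probe comparison is a simplification of the paper's principal-distortion optimization.

    import torch
    from torch.func import jacrev

    def local_metric_gap(f, g, x, n_probe=8):
        # Compare how two models' Fisher metrics (J^T J under Gaussian
        # noise) stretch random unit-norm distortions around image x.
        Jf = jacrev(f)(x).reshape(-1, x.numel())
        Jg = jacrev(g)(x).reshape(-1, x.numel())
        gaps = []
        for _ in range(n_probe):
            e = torch.randn(x.numel())
            e = e / e.norm()                 # unit-norm probe distortion
            gaps.append(((Jf @ e).norm() - (Jg @ e).norm()).abs())
        return torch.stack(gaps).mean()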
Authors: Matthieu Lin, Jenny Sheng, Andrew Zhao, Shenzhi Wang, Yang Yue, Victor Shea Jay Huang, Huan Liu, Jun Liu, Gao Huang, Yong-Jin Liu
Abstract: This survey organizes the intricate literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools. We view scaffolded LMs as semi-parametric models wherein we train non-parametric variables, including the prompt, tools, and scaffold's code. In particular, they interpret instructions, use tools, and receive feedback all in language. Recent works use an LM as an optimizer to interpret language supervision and update non-parametric variables according to intricate objectives. In this survey, we refer to this paradigm as training of scaffolded LMs with language supervision. A key feature of non-parametric training is the ability to learn from language. Parametric training excels in learning from demonstration (supervised learning), exploration (reinforcement learning), or observations (unsupervised learning), using well-defined loss functions. Language-based optimization enables rich, interpretable, and expressive objectives, while mitigating issues like catastrophic forgetting and supporting compatibility with closed-source models. Furthermore, agents are increasingly deployed as co-workers in real-world applications such as Copilot in Office tools or software development. In these mixed-autonomy settings, where control and decision-making are shared between human and AI, users point out errors or suggest corrections. Accordingly, we discuss agents that continuously improve by learning from this real-time, language-based feedback and refer to this setting as streaming learning from language supervision.
Authors: Hadi Vafaii, Dekel Galor, Jacob L. Yates
Abstract: Inference in both brains and machines can be formalized by optimizing a shared objective: maximizing the evidence lower bound (ELBO) in machine learning, or minimizing variational free energy (F) in neuroscience (ELBO = -F). While this equivalence suggests a unifying framework, it leaves open how inference is implemented in neural systems. Here, we show that online natural gradient descent on F, under Poisson assumptions, leads to a recurrent spiking neural network that performs variational inference via membrane potential dynamics. The resulting model -- the iterative Poisson variational autoencoder (iP-VAE) -- replaces the encoder network with local updates derived from natural gradient descent on F. Theoretically, iP-VAE yields a number of desirable features such as emergent normalization via lateral competition, and hardware-efficient integer spike count representations. Empirically, iP-VAE outperforms both standard VAEs and Gaussian-based predictive coding models in sparsity, reconstruction, and biological plausibility. iP-VAE also exhibits strong generalization to out-of-distribution inputs, exceeding hybrid iterative-amortized VAEs. These results demonstrate how deriving inference algorithms from first principles can yield concrete architectures that are simultaneously biologically plausible and empirically effective.
Authors: Philip A. LeMaitre, T. Rick Perche, Marius Krumm, Hans J. Briegel
Abstract: We present an explicit construction of a relativistic quantum computing architecture using a variational quantum circuit approach that is shown to allow for universal quantum computing. The variational quantum circuit consists of tunable single-qubit rotations and entangling gates that are implemented successively. The single qubit rotations are parameterized by the proper time intervals of the qubits' trajectories and can be tuned by varying their relativistic motion in spacetime. The entangling layer is mediated by a relativistic quantum field instead of through direct coupling between the qubits. Within this setting, we give a prescription for how to use quantum field-mediated entanglement and manipulation of the relativistic motion of qubits to obtain a universal gate set, for which compact non-perturbative expressions that are valid for general spacetimes are also obtained. We also derive a lower bound on the channel fidelity that shows the existence of parameter regimes in which all entangling operations are effectively unitary, despite the noise generated from the presence of a mediating quantum field. Finally, we consider an explicit implementation of the quantum Fourier transform with relativistic qubits.
Authors: Samuel J. Bell, Mariano Coria Meglioli, Megan Richards, Eduardo S\'anchez, Christophe Ropers, Skyler Wang, Adina Williams, Levent Sagun, Marta R. Costa-juss\`a
Abstract: Text toxicity detection systems exhibit significant biases, producing disproportionate rates of false positives on samples mentioning demographic groups. But what about toxicity detection in speech? To investigate the extent to which text-based biases are mitigated by speech-based systems, we produce a set of high-quality group annotations for the multilingual MuTox dataset, and then leverage these annotations to systematically compare speech- and text-based toxicity classifiers. Our findings indicate that access to speech data during inference supports reduced bias against group mentions, particularly for ambiguous and disagreement-inducing samples. Our results also suggest that improving classifiers, rather than transcription pipelines, is more helpful for reducing group bias. We publicly release our annotations and provide recommendations for future toxicity dataset construction.
Authors: Haoran Ding, No\'emie Jaquier, Jan Peters, Leonel Rozo
Abstract: Diffusion-based visuomotor policies excel at learning complex robotic tasks by effectively combining visual data with high-dimensional, multi-modal action distributions. However, diffusion models often suffer from slow inference due to costly denoising processes or require complex sequential training arising from recent distilling approaches. This paper introduces Riemannian Flow Matching Policy (RFMP), a model that inherits the easy training and fast inference capabilities of flow matching (FM). Moreover, RFMP inherently incorporates geometric constraints commonly found in realistic robotic applications, as the robot state resides on a Riemannian manifold. To enhance the robustness of RFMP, we propose Stable RFMP (SRFMP), which leverages LaSalle's invariance principle to equip the dynamics of FM with stability to the support of a target Riemannian distribution. Rigorous evaluation on eight simulated and real-world tasks shows that RFMP successfully learns and synthesizes complex sensorimotor policies on Euclidean and Riemannian spaces with efficient training and inference phases, outperforming Diffusion Policies and Consistency Policies.
Authors: Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
Abstract: Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between latency, model size, and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves a 3.2$\times$ improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVA-OneVision at the highest resolution (1152$\times$1152), FastVLM achieves better performance on key benchmarks like SeedBench, MMMU, and DocVQA, using the same 0.5B LLM, but with 85$\times$ faster TTFT and a vision encoder that is 3.4$\times$ smaller. Code and models are available at https://github.com/apple/ml-fastvlm.
Authors: Tara Radvand, Mojtaba Abdolmaleki, Mohamed Mostagir, Ambuj Tewari
Abstract: Verifying the provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, and firms. This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In addition, many institutions utilize in-house LLMs and want to ensure that external, non-sanctioned LLMs do not produce content within the institution. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by a particular LLM or not? We model LLM-generated text as a sequential stochastic process with complete dependence on history. We then design zero-shot statistical tests to (i) distinguish between text generated by two different known sets of LLMs $A$ (non-sanctioned) and $B$ (in-house), and (ii) identify whether text was generated by a known LLM or generated by any unknown model, e.g., a human or some other language generation process. We prove that the type I and type II errors of our test decrease exponentially with the length of the text. To this end, we show that if $B$ generates the text, then except with an exponentially small probability in string length, the log-perplexity of the string under $A$ converges to the average cross-entropy of $B$ and $A$. We then present experiments using LLMs with white-box access to support our theoretical results and empirically examine the robustness of our results to black-box settings and adversarial attacks. In the black-box setting, our method achieves an average TPR of 82.5\% at a fixed FPR of 5\%. Under adversarial perturbations, our minimum TPR is 48.6\% at the same FPR threshold. Both results outperform all non-commercial baselines. See https://github.com/TaraRadvand74/llm-text-detection for code, data, and an online demo of the project.
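The core statistic can be sketched with standard causal-LM tooling: score a candidate text's log-perplexity under model $A$ and compare it against a calibrated threshold. The model name and threshold below are placeholders, and the paper's actual tests are more refined than this illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_perplexity(text, model, tokenizer):
    """Average per-token negative log-likelihood of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # HF returns mean cross-entropy
    return loss.item()

# Placeholder stand-in for model A (the non-sanctioned set).
tok = AutoTokenizer.from_pretrained("gpt2")
model_a = AutoModelForCausalLM.from_pretrained("gpt2")

# If B generated the text, its log-perplexity under A concentrates around
# the cross-entropy of B and A, which typically exceeds A's own entropy;
# thresholding therefore separates the two hypotheses, with errors
# shrinking exponentially in text length.
stat = log_perplexity("text whose provenance is in question", model_a, tok)
threshold = 4.0   # placeholder; calibrated on held-out samples in practice
print("likely not generated by A" if stat > threshold else "consistent with A")
```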
Authors: Riqiang Gao, Mamadou Diallo, Han Liu, Anthony Magliari, Jonathan Sackett, Wilko Verbakel, Sandra Meyers, Rafe Mcbeth, Masoud Zarepisheh, Simon Arberet, Martin Kraus, Florin C. Ghesu, Ali Kamen
Abstract: Radiotherapy (RT) planning is complex, subjective, and time-intensive. Advances in artificial intelligence (AI) promise to improve its precision and efficiency, but progress is often limited by the scarcity of large, standardized datasets. To address this, we introduce the Automated Iterative RT Planning (AIRTP) system, a scalable solution designed to generate substantial volumes of consistently high-quality treatment plans, overcoming a key obstacle in the advancement of AI-driven RT planning. Our AIRTP pipeline adheres to clinical guidelines and automates essential steps, including organ-at-risk (OAR) contouring, helper structure creation, beam setup, optimization, and plan quality improvement, using AI integrated with RT planning software such as Varian Eclipse. Furthermore, we propose a novel approach for determining optimization parameters to reproduce 3D dose distributions, i.e., a method to convert dose predictions into deliverable treatment plans constrained by machine limitations. A comparative analysis of plan quality reveals that our automated pipeline produces treatment plans of quality comparable to those generated manually, which traditionally require several hours of labor per plan. Committed to public research, the first data release of our AIRTP pipeline includes nine cohorts covering head-and-neck and lung cancer sites to support an AAPM 2025 challenge. To the best of our knowledge, this dataset features more than 10 times the number of plans of the largest existing well-curated public dataset. Repo: https://github.com/RiqiangGao/GDP-HMM_AAPMChallenge.
Authors: Aayush Mishra, Daniel Habermann, Marvin Schmitt, Stefan T. Radev, Paul-Christian B\"urkner
Abstract: Amortized Bayesian inference (ABI) with neural networks can solve probabilistic inverse problems orders of magnitude faster than classical methods. However, ABI is not yet sufficiently robust for widespread and safe application. When performing inference on observations outside the scope of the simulated training data, posterior approximations are likely to become highly biased, which cannot be corrected by additional simulations due to the bad pre-asymptotic behavior of current neural posterior estimators. In this paper, we propose a semi-supervised approach that enables training not only on labeled simulated data generated from the model, but also on \textit{unlabeled} data originating from any source, including real data. To achieve this, we leverage Bayesian self-consistency properties that can be transformed into strictly proper losses that do not require knowledge of ground-truth parameters. We test our approach on several real-world case studies, including applications to high-dimensional time-series and image data. Our results show that semi-supervised learning with unlabeled data drastically improves the robustness of ABI in the out-of-simulation regime. Notably, inference remains accurate even when evaluated on observations far away from the labeled and unlabeled data seen during training.
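A minimal sketch of one way such a self-consistency loss can be formed, assuming a tractable joint density $p(y, \theta)$: since $\log p(y|\theta) + \log p(\theta) - \log q(\theta|y)$ equals $\log p(y)$ for every $\theta$ when $q$ is the exact posterior, its variance over $\theta$ draws is a label-free training signal. The paper's losses may differ in exact form.

```python
import torch

def self_consistency_loss(y, theta_samples, log_lik, log_prior, log_q):
    """For the exact posterior, log p(y|th) + log p(th) - log q(th|y) is
    constant (= log p(y)) across th, so its variance over draws penalizes
    inconsistency of q -- without requiring ground-truth parameters,
    which is what enables training on unlabeled data."""
    vals = torch.stack([
        log_lik(y, th) + log_prior(th) - log_q(th, y)
        for th in theta_samples
    ])
    return vals.var()
```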
Authors: Oussama Zekri, Nicolas Boull\'e
Abstract: Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.
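As a rough illustration of fine-tuning on non-differentiable rewards, here is a REINFORCE-style update under an assumed sampler interface (`model.sample`, `model.log_prob`); SEPO's actual estimator is derived from the score-entropy framework and differs in its exact form.

```python
import torch

def reinforce_step(model, optimizer, reward_fn, batch_size=8):
    """One policy-gradient update on a non-differentiable reward.
    `model.sample` / `model.log_prob` are assumed interfaces for a
    discrete diffusion sampler, shown only to convey the structure."""
    with torch.no_grad():
        seqs = model.sample(batch_size)
    rewards = torch.tensor([float(reward_fn(s)) for s in seqs])
    advantages = rewards - rewards.mean()          # mean baseline
    log_probs = torch.stack([model.log_prob(s) for s in seqs])
    loss = -(advantages * log_probs).mean()        # score-function estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```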
Authors: Zhiqiang Que, Jose G. F. Coutinho, Ce Guo, Hongxiang Fan, Wayne Luk
Abstract: This paper presents a unified framework for codifying and automating optimization strategies to efficiently deploy deep neural networks (DNNs) on resource-constrained hardware, such as FPGAs, while maintaining high performance, accuracy, and resource efficiency. Deploying DNNs on such platforms involves addressing the significant challenge of balancing performance, resource usage (e.g., DSPs and LUTs), and inference accuracy, which often requires extensive manual effort and domain expertise. Our novel approach addresses two core issues: (i)~encoding custom optimization strategies and (ii)~enabling cross-stage optimization search. In particular, our proposed framework seamlessly integrates programmatic DNN optimization techniques with high-level synthesis (HLS)-based metaprogramming, leveraging advanced design space exploration (DSE) strategies like Bayesian optimization to automate both top-down and bottom-up design flows. Hence, we reduce the need for manual intervention and domain expertise. In addition, the framework introduces customizable optimization, transformation, and control blocks to enhance DNN accelerator performance and resource efficiency. Experimental results demonstrate up to a 92\% DSP and 89\% LUT usage reduction for select networks, while preserving accuracy, along with a 15.6-fold reduction in optimization time compared to grid search. These results highlight the potential for automating the generation of resource-efficient DNN accelerator designs with minimal effort.
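A hedged sketch of Bayesian-optimization-driven design space exploration of this kind, with a dummy `evaluate_design` standing in for the expensive HLS synthesis-and-profiling oracle; the parameter names and cost weighting are illustrative, not the framework's API.

```python
from skopt import gp_minimize
from skopt.space import Integer

def evaluate_design(unroll, bits):
    """Stand-in for an HLS synthesis + profiling run (the costly step)."""
    latency = 1000.0 / unroll + 5.0 * bits
    luts = 200 * unroll + 50 * bits
    dsps = 2 * unroll
    return latency, luts, dsps

space = [Integer(1, 64, name="unroll"), Integer(4, 16, name="bits")]

def objective(params):
    unroll, bits = params
    latency, luts, dsps = evaluate_design(unroll, bits)
    # Illustrative scalarization of latency vs. resource usage.
    return latency + 1e-4 * luts + 1e-2 * dsps

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best (unroll, bits):", result.x)
```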
Authors: Runyao Yu, Yuchen Tao, Fabian Leimgruber, Tara Esterl, Jochen L. Cremer
Abstract: Accurate and reliable probabilistic prediction of intraday electricity prices is essential to manage market uncertainties and support robust trading strategies. However, current methods rely heavily on domain feature extraction and fail to capture the dynamics between buy and sell orders, limiting the ability to form rich representations of the orderbook. Furthermore, these methods often require training separate models for different quantiles and introduce additional procedures, such as post-hoc quantile sorting or loss-based penalties, to address the quantile crossing issue, where predicted upper quantiles fall below lower ones. These steps are either decoupled from model training or introduce extra tuning complexity. To address these challenges, we propose an encoding method called OrderFusion and design a hierarchical multi-quantile head. OrderFusion encodes the orderbook into a 2.5D representation and employs a tailored jump cross-attention to model buy-sell dynamics without the need for domain feature extraction. The multi-quantile head anchors on the median quantile and hierarchically estimates other quantiles through constrained residuals, ensuring monotonicity without post-processing or additional tuning. We conduct extensive experiments and ablation studies on three key price indices (ID1, ID2, and ID3) using three years of orderbook data from the German and Austrian markets. The results demonstrate that our approach provides an accurate, reliable, and unified end-to-end framework for probabilistic intraday price prediction.
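The hierarchical multi-quantile head can be sketched directly: anchor a median output and derive the remaining quantiles through cumulative non-negative residuals, which guarantees monotonicity by construction. Layer shapes and the quantile grid below are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalQuantileHead(nn.Module):
    """Monotone multi-quantile head: a median anchor plus non-negative
    cumulative residuals above it and non-positive ones below it."""

    def __init__(self, d_model, n_upper=4, n_lower=4):
        super().__init__()
        self.median = nn.Linear(d_model, 1)
        self.upper = nn.Linear(d_model, n_upper)  # residuals above median
        self.lower = nn.Linear(d_model, n_lower)  # residuals below median

    def forward(self, h):
        med = self.median(h)
        up = med + torch.cumsum(F.softplus(self.upper(h)), dim=-1)
        low = med - torch.cumsum(F.softplus(self.lower(h)), dim=-1)
        # Sorted low -> high; crossing is impossible by construction.
        return torch.cat([low.flip(-1), med, up], dim=-1)
```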
Authors: Jinluan Yang, Dingnan Jin, Anke Tang, Li Shen, Didi Zhu, Zhengyu Chen, Ziyu Zhao, Daixin Wang, Qing Cui, Zhiqiang Zhang, Jun Zhou, Fei Wu, Kun Kuang
Abstract: Achieving balanced alignment of large language models (LLMs) in terms of Helpfulness, Honesty, and Harmlessness (3H optimization) constitutes a cornerstone of responsible AI. Existing methods like data mixture strategies face limitations, including heavy reliance on expert knowledge and conflicting optimization signals. While model merging offers parameter-level conflict-resolution strategies through integrating specialized models' parameters, its potential for 3H optimization remains underexplored. This paper systematically compares the effectiveness of model merging and data mixture methods in constructing 3H-aligned LLMs for the first time, revealing previously overlooked collaborative and conflict relationships among the 3H dimensions and discussing the advantages and drawbacks of data mixture (\textit{data-level}) and model merging (\textit{parameter-level}) methods in mitigating the conflict for balanced 3H optimization. Specifically, we propose a novel \textbf{R}eweighting \textbf{E}nhanced task \textbf{S}ingular \textbf{M}erging method, \textbf{RESM}, through outlier weighting and sparsity-aware rank selection strategies to address the challenges of preference noise accumulation and layer sparsity adaptation inherent in 3H-aligned LLM merging. Extensive evaluations verify the effectiveness and robustness of RESM compared to previous data mixture (2\%-5\% gain) and model merging (1\%-3\% gain) methods in achieving balanced LLM alignment. We release our models through \href{https://huggingface.co/Jinluan}{3H\_Merging} for further investigations.
Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu
Abstract: Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.
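A minimal sketch of the MEAP data preparation: corrupt a small fraction of the inputs with a mask token while leaving the next-token targets untouched, then train with the ordinary causal-LM loss. The masking ratio below is illustrative, not necessarily the paper's value.

```python
import torch

def meap_batch(input_ids, mask_token_id, mask_prob=0.15):
    """Mask a fraction of input tokens; targets stay the original tokens,
    so the loss remains standard next-token prediction."""
    corrupted = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob
    corrupted[mask] = mask_token_id
    # Standard causal shift: predict original token t+1 from corrupted prefix.
    return corrupted[:, :-1], input_ids[:, 1:]

# Training then uses the usual cross-entropy:
# logits = model(inputs)                       # (B, T-1, V)
# loss = F.cross_entropy(logits.transpose(1, 2), targets)
```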
Authors: Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen
Abstract: The ability to recognize patterns from examples and apply them to new ones is fundamental to general intelligence, and is widely studied by psychology and AI researchers. Many benchmarks have been proposed to measure this ability for Large Language Models (LLMs); however, they focus on the few-shot (usually <10) setting and lack evaluation of aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations often focus on classification, and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To address the issues from both worlds, we propose MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition that asks LLMs to predict outputs via input-output examples from underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for many-shot in-context reasoning and acquire many insightful findings, including scaling effects, robustness, inductive vs. transductive reasoning, Retrieval-Augmented Generation (RAG), coding for inductive reasoning, cross-domain generalizability, etc.
Authors: Kaikai Zhao, Zhaoxiang Liu, Xuejiao Lei, Jiaojiao Zhao, Zhenhong Long, Zipeng Wang, Ning Wang, Meijuan An, Qingliang Meng, Peijun Yang, Minjie Hua, Chaoyang Ma, Wen Liu, Kai Wang, Shiguo Lian
Abstract: DeepSeek-R1, known for its low training cost and exceptional reasoning capabilities, has achieved state-of-the-art performance on various benchmarks. However, detailed evaluations of the DeepSeek series models from the perspective of real-world applications are lacking, making it challenging for users to select the most suitable DeepSeek model for their specific needs. To address this gap, we present the first comprehensive evaluation of DeepSeek and its related models (including DeepSeek-V3, DeepSeek-R1, the DeepSeek-R1-Distill-Qwen series, the DeepSeek-R1-Distill-Llama series, their corresponding 4-bit quantized models, and the reasoning model QwQ-32B) using our enhanced A-Eval benchmark, A-Eval-2.0. Our systematic analysis reveals several key insights: (1) Given identical model architectures and training data, larger-parameter models demonstrate superior performance, aligning with the scaling law. However, smaller models may achieve enhanced capabilities when employing optimized training strategies and higher-quality data; (2) Reasoning-enhanced models show significant performance gains in logical reasoning tasks but may underperform in text understanding and generation tasks; (3) As data difficulty increases, distillation or reasoning enhancements yield higher performance gains for the models. Interestingly, reasoning enhancements can even have a negative impact on simpler problems; (4) Quantization impacts different capabilities unevenly, with a significant drop in logical reasoning and minimal impact on text generation. Based on these results and findings, we design a model selection handbook that enables users to select the most cost-effective models without extra effort.
Authors: Joan Serr\`a, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji
Abstract: Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the nature of available ground truth, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require matching them at the segment level (e.g., 20s chunks). In addition, existing approaches resort to classification and triplet losses, disregarding more recent losses that could bring meaningful improvements. In this paper, we propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. The former is based on pairwise segment distance reductions, while the latter modifies an existing loss following decoupling, hyper-parameter, and geometric considerations. With these two elements, we not only achieve state-of-the-art results in the standard track-level evaluation, but also obtain a breakthrough performance in a segment-level evaluation. We believe that, due to the generality of the challenges addressed here, the proposed methods may find utility in domains beyond audio or musical version matching.
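For context, a decoupled InfoNCE-style contrastive loss (in the spirit of the decoupling consideration mentioned above) removes the positive pair from the normalizing denominator. This generic sketch is not the paper's exact loss, which adds further hyper-parameter and geometric modifications.

```python
import torch
import torch.nn.functional as F

def decoupled_contrastive_loss(z1, z2, tau=0.1):
    """InfoNCE with the positive removed from the denominator (one
    direction only, for brevity). z1[i] and z2[i] are embeddings of two
    segments of the same version; other rows act as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / tau                  # (n, n) scaled similarities
    pos = sim.diag()
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1)
    return (neg - pos).mean()
```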
Authors: Chengpiao Huang, Yuhang Wu, Kaizheng Wang
Abstract: We investigate the use of large language models (LLMs) to simulate human responses to survey questions, and perform uncertainty quantification to gain reliable insights. Our approach converts imperfect LLM-simulated responses into confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. A key innovation lies in determining the optimal number of simulated responses: too many produce overly narrow confidence sets with poor coverage, while too few yield excessively loose estimates. To resolve this, our method adaptively selects the simulation sample size, ensuring valid average-case coverage guarantees. It is broadly applicable to any LLM, irrespective of its fidelity, and any procedure for constructing confidence sets. Additionally, the selected sample size quantifies the degree of misalignment between the LLM and the target human population. We illustrate our method on real datasets and LLMs.
Authors: Albert Wilcox, Mohamed Ghanem, Masoud Moghani, Pierre Barroso, Benjamin Joffe, Animesh Garg
Abstract: Imitation Learning can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle to observations outside the training distribution. 3D scene representations that incorporate observations from calibrated RGBD cameras have been proposed as a way to mitigate this, but in our evaluations with unseen embodiments and camera viewpoints they show only modest improvement. To address these challenges, we propose Adapt3R, a general-purpose 3D observation encoder that synthesizes data from calibrated RGBD cameras into a vector that can be used as conditioning for arbitrary IL algorithms. The key idea is to use a pretrained 2D backbone to extract semantic information, using 3D only as a medium to localize this information with respect to the end-effector. We show across 93 simulated and 6 real tasks that when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms' learning capacity while enabling zero-shot transfer to novel embodiments and camera poses.
Authors: Abtin Pourhadi, Paul Swoboda
Abstract: We present a new state-of-the-art approach for sparse keypoint matching between pairs of images. Our method is a fully deep-learning-based approach that combines a visual backbone with a SplineCNN graph neural network for feature processing and a normalized transformer decoder for decoding keypoint correspondences, together with the Sinkhorn algorithm. Our method is trained using a contrastive and a hyperspherical loss for better feature representations, and we additionally use data augmentation during training. This comparatively simple architecture, combining extensive normalization and advanced losses, outperforms the current state-of-the-art approaches BBGM, ASAR, COMMON, and GMTR on the PascalVOC and SPair-71k datasets by $5.1\%$ and $2.2\%$ respectively, while training for at least $1.7\times$ fewer epochs.
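The Sinkhorn step at the end of the pipeline can be sketched in a few lines: alternating row and column normalization in log-space turns the decoder's score matrix into an approximately doubly stochastic soft assignment between keypoints.

```python
import torch

def sinkhorn(scores, n_iters=10):
    """Sinkhorn normalization: alternately normalize rows and columns in
    log-space so the score matrix converges toward a doubly stochastic
    (soft permutation) matrix suitable for keypoint assignment."""
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # cols
    return log_p.exp()

# Example: a square score matrix for 5 keypoints per image.
assignment = sinkhorn(torch.randn(5, 5))
print(assignment.sum(0), assignment.sum(1))  # both close to ones
```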
Authors: Hedong Yan
Abstract: To reduce the cost of experimental evaluation for agents, we introduce a computational theory of evaluation for mini agents: building evaluation models to accelerate evaluation procedures. We prove upper bounds on the generalized error and the generalized causal effect error of given evaluation models for infinitely many agents. We also prove the efficiency and consistency of the causal effect, estimated by prediction, from deployed agents to the evaluation metric. To learn evaluation models, we propose a meta-learner to handle the heterogeneous agent space problem. Compared with existing evaluation approaches, our (conditional) evaluation model reduces evaluation errors by 24.1\% to 99.0\% across 12 scenes, including individual medicine, scientific simulation, social experiments, business activities, and quantum trade. The evaluation time is reduced by 3 to 7 orders of magnitude per subject compared with experiments or simulations.
Authors: Yuichi Kamata, Quoc Hoan Tran, Yasuhiro Endo, Hirotaka Oshima
Abstract: The Transformer model, renowned for its powerful attention mechanism, has achieved state-of-the-art performance in various artificial intelligence tasks but faces challenges such as high computational cost and memory usage. Researchers are exploring quantum computing to enhance the Transformer's design, though it still shows limited success with classical data. With a growing focus on leveraging quantum machine learning for quantum data, particularly in quantum chemistry, we propose the Molecular Quantum Transformer (MQT) for modeling interactions in molecular quantum systems. By utilizing quantum circuits to implement the attention mechanism on the molecular configurations, MQT can efficiently calculate ground-state energies for all configurations. Numerical demonstrations show that in calculating ground-state energies for H2, LiH, BeH2, and H4, MQT outperforms the classical Transformer, highlighting the promise of quantum effects in Transformer structures. Furthermore, its pretraining capability on diverse molecular data facilitates the efficient learning of new molecules, extending its applicability to complex molecular systems with minimal additional effort. Our method offers an alternative to existing quantum algorithms for estimating ground-state energies, opening new avenues in quantum chemistry and materials science.
Authors: Hyeongju Kim, Jinhyeok Yang, Yechan Yu, Seunghun Ji, Jacob Morton, Frederik Bous, Joon Byun, Juheon Lee
Abstract: We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment. Experimental results demonstrate that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared to contemporary TTS models. Audio samples demonstrating the capabilities of SupertonicTTS are available at: https://supertonictts.github.io/.
Authors: Valentin Boussot, C\'edric H\'emon, Jean-Claude Nunes, Jason Dowling, Simon Rouz\'e, Caroline Lafond, Ana\"is Barateau, Jean-Louis Dillenseger
Abstract: Image registration is fundamental in medical imaging, enabling precise alignment of anatomical structures for diagnosis, treatment planning, image-guided interventions, and longitudinal monitoring. This work introduces IMPACT (Image Metric with Pretrained model-Agnostic Comparison for Transmodality registration), a novel similarity metric designed for robust multimodal image registration. Rather than relying on raw intensities, handcrafted descriptors, or task-specific training, IMPACT defines a semantic similarity measure based on the comparison of deep features extracted from large-scale pretrained segmentation models. By leveraging representations from models such as TotalSegmentator, Segment Anything (SAM), and other foundation networks, IMPACT provides a task-agnostic, training-free solution that generalizes across imaging modalities. These features, originally trained for segmentation, offer strong spatial correspondence and semantic alignment capabilities, making them naturally suited for registration. The method integrates seamlessly into both algorithmic (Elastix) and learning-based (VoxelMorph) frameworks, leveraging the strengths of each. IMPACT was evaluated on five challenging 3D registration tasks involving thoracic CT/CBCT and pelvic MR/CT datasets. Quantitative metrics, including Target Registration Error and Dice Similarity Coefficient, demonstrated consistent improvements in anatomical alignment over baseline methods. Qualitative analyses further highlighted the robustness of the proposed metric in the presence of noise, artifacts, and modality variations. With its versatility, efficiency, and strong performance across diverse tasks, IMPACT offers a powerful solution for advancing multimodal image registration in both clinical and research settings.
Authors: Satvik Golechha, Adri\`a Garriga-Alonso
Abstract: Prior studies on deception in language-based AI agents typically assess whether the agent produces a false statement about a topic, or makes a binary choice prompted by a goal, rather than allowing open-ended deceptive behavior to emerge in pursuit of a longer-term goal. To fix this, we introduce $\textit{Among Us}$, a sandbox social deception game where LLM-agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, $\textit{Among Us}$ can be expected to last much longer, because it is a multi-player game far from equilibrium. Using the sandbox, we evaluate $18$ proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than detecting it. We evaluate the effectiveness of methods to detect lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). We find that probes trained on a dataset of ``pretend you're a dishonest model: $\dots$'' generalize extremely well out-of-distribution, consistently obtaining AUROCs over 95% even when evaluated just on the deceptive statement, without the chain of thought. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less. We hope our open-sourced sandbox, game logs, and probes serve to anticipate and mitigate deceptive behavior and capabilities in language-based agents.
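Training an activation probe of the kind described is straightforward with standard tooling; the sketch below uses synthetic stand-in activations, whereas the actual probes would be fit on activations cached from the game logs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Stand-in data: rows are residual-stream activation vectors at a fixed
# layer; labels mark deceptive (1) vs. honest (0) statements.
X = rng.normal(size=(2000, 512))
y = rng.integers(0, 2, size=2000)
X[y == 1] += 0.3                      # toy separation for illustration

probe = LogisticRegression(max_iter=1000, C=0.1).fit(X[:1500], y[:1500])
auroc = roc_auc_score(y[1500:], probe.predict_proba(X[1500:])[:, 1])
print(f"deception-probe AUROC: {auroc:.3f}")
```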
Authors: Qiyu Wei, Haowei Wang, Zirui Cao, Songhao Wang, Richard Allmendinger, Mauricio A \'Alvarez
Abstract: Bayesian optimization (BO) is an effective technique for black-box optimization. However, its applicability is typically limited to moderate-budget problems due to the cubic complexity in computing the Gaussian process (GP) surrogate model. In large-budget scenarios, directly employing the standard GP model faces significant challenges in computational time and resource requirements. In this paper, we propose a novel approach, gradient-based sample selection Bayesian Optimization (GSSBO), to enhance the computational efficiency of BO. The GP model is constructed on a selected set of samples instead of the whole dataset. These samples are selected by leveraging gradient information to maintain diversity and representation. We provide a theoretical analysis of the gradient-based sample selection strategy and obtain explicit sublinear regret bounds for our proposed framework. Extensive experiments on synthetic and real-world tasks demonstrate that our approach significantly reduces the computational cost of GP fitting in BO while maintaining optimization performance comparable to baseline methods.
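The gradient-based selection can be illustrated with a generic diversity heuristic: greedily pick samples whose gradient vectors are far from those already chosen. The gradients are assumed precomputed, and this is not GSSBO's exact criterion, which carries the regret guarantees described above.

```python
import numpy as np

def select_diverse_by_gradients(X, grads, k):
    """Greedy farthest-point selection in gradient space: start from the
    largest-gradient sample, then repeatedly add the sample whose gradient
    vector is farthest from all chosen ones."""
    chosen = [int(np.linalg.norm(grads, axis=1).argmax())]
    while len(chosen) < k:
        dists = np.min(
            [np.linalg.norm(grads - grads[c], axis=1) for c in chosen],
            axis=0,
        )
        dists[chosen] = -np.inf        # never re-pick a chosen sample
        chosen.append(int(dists.argmax()))
    return X[chosen], chosen
```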
Authors: Suhas BN, Andrew M. Sherrill, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah
Abstract: The advancement of AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing. We incorporated diverse demographic profiles (ages 18-80, M=49.3, 49.4% male, 44.4% female, 6.2% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods. Analysis reveals realistic distributions of trauma types (witnessing violence 10.6%, bullying 10.2%) and symptoms (nightmares 23.4%, substance abuse 20.8%). Clinical experts validated the dataset's therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity. We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses. This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.
Authors: Yanhao 'Max' Wei, Zhenling Jiang
Abstract: We explore pretraining estimators for structural econometric models. The estimator is "pretrained" in the sense that the bulk of the computational cost and researcher effort occur during the construction of the estimator. Subsequent applications of the estimator to different datasets require little computational cost or researcher effort. The estimation leverages a neural net to recognize the structural model's parameters from data patterns. As an initial trial, this paper builds a pretrained estimator for a sequential search model that is known to be difficult to estimate. We evaluate the pretrained estimator on 12 real datasets. The estimation takes seconds to run and shows high accuracy. We provide the estimator at pnnehome.github.io. More generally, pretrained, off-the-shelf estimators can make structural models more accessible to researchers and practitioners.
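A toy version of the pretraining recipe, with a stand-in linear simulator in place of the sequential search model: simulate (parameter, data-summary) pairs once, fit a net to invert the mapping, and thereafter estimation on a new dataset is a single forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
theta = torch.randn(50_000, 3)                         # prior draws
W = torch.randn(3, 32)                                 # toy simulator
summaries = theta @ W + 0.1 * torch.randn(50_000, 32)  # simulated summaries

net = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 3))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(500):                                # the costly phase
    idx = torch.randint(0, 50_000, (256,))
    loss = ((net(summaries[idx]) - theta[idx]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Applying the pretrained estimator to new data is near-instant:
# theta_hat = net(summary_of_new_dataset)
```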
Authors: Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber
Abstract: Synthetic video generation has gained significant attention for its realism and broad applications, but remains prone to violations of common sense and physical laws. This highlights the need for reliable abnormality detectors that understand such principles and are robust to hallucinations. To address this, we introduce VideoHallu, a benchmark of over 3,000 video QA pairs built from synthetic videos generated by models like Veo2, Sora, and Kling, paired with expert-crafted counterintuitive QA to evaluate the critical thinking abilities of Multi-modal Large Language Models (MLLMs) on abnormalities that are perceptually obvious to humans but often hallucinated due to language priors. VideoHallu evaluates MLLMs' abnormality detection abilities with examples across alignment, consistency, commonsense, and physics. We benchmark SOTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen2.5-VL, Video-R1, and VideoChat-R1. We observe that these models perform well on many real-world benchmarks like MVBench and MovieChat, but still struggle with basic physics-based and commonsense reasoning in synthetic videos. We further show that post-training with Group Relative Policy Optimization (GRPO), using curriculum learning on datasets combining video QA with counterintuitive commonsense and physics reasoning over real and synthetic videos, improves MLLMs' abnormality detection and critical thinking, demonstrating the value of targeted training for improving their understanding of commonsense and physical laws.
Authors: Dhruvesh Patel, Aishwarya Sahoo, Avinash Amballa, Tahira Naseem, Tim G. J. Rudner, Andrew McCallum
Abstract: Autoregressive models (ARMs), which predict subsequent tokens one-by-one ``from left to right,'' have achieved significant success across a wide range of sequence generation tasks. However, they struggle to accurately represent sequences that require satisfying sophisticated constraints or whose sequential dependencies are better addressed by out-of-order generation. Masked Diffusion Models (MDMs) address some of these limitations, but the process of unmasking multiple tokens simultaneously in MDMs can introduce incoherences, and MDMs cannot handle arbitrary infilling constraints when the number of tokens to be filled in is not known in advance. In this work, we introduce Insertion Language Models (ILMs), which learn to insert tokens at arbitrary positions in a sequence -- that is, they jointly select both the position and the vocabulary element to be inserted. By inserting tokens one at a time, ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences where token dependencies do not follow a left-to-right sequential structure. To train ILMs, we propose a tailored network parameterization and use a simple denoising objective. Our empirical evaluation demonstrates that ILMs outperform both ARMs and MDMs on common planning tasks. Furthermore, we show that ILMs outperform MDMs and perform on par with ARMs in an unconditional text generation task while offering greater flexibility than MDMs in arbitrary-length text infilling.
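Generation with an ILM can be sketched as repeated joint sampling over (position, token) pairs; the `model` interface below, returning insertion logits of shape (len(seq)+1, vocab_size) plus a stop flag, is an assumption for illustration, not the paper's parameterization.

```python
import torch

@torch.no_grad()
def ilm_generate(model, seq=(), max_steps=64):
    """Grow a sequence by repeated insertion: sampling one flat index from
    the (gaps x vocab) logit table jointly picks where and what to insert."""
    seq = list(seq)
    for _ in range(max_steps):
        logits, stop = model(seq)      # logits: (len(seq)+1, vocab_size)
        if stop:
            break
        flat = torch.distributions.Categorical(logits=logits.flatten()).sample()
        pos, tok = divmod(int(flat), logits.size(-1))
        seq.insert(pos, tok)
    return seq
```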
Authors: Moritz Vandenhirtz, Julia E. Vogt
Abstract: Understanding the decision-making process of machine learning models provides valuable insights into the task, the data, and the reasons behind a model's failures. In this work, we propose a method that performs inherently interpretable predictions through the instance-wise sparsification of input images. To align the sparsification with human perception, we learn the masking in the space of semantically meaningful pixel regions rather than at the pixel level. Additionally, we introduce an explicit way to dynamically determine the required level of sparsity for each instance. We show empirically on semi-synthetic and natural image datasets that our inherently interpretable classifier produces more meaningful, human-understandable predictions than state-of-the-art benchmarks.
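A sketch of region-level masking, using SLIC superpixels as a stand-in for the semantically meaningful regions and a placeholder `score_fn` in place of the learned masking model and its dynamically chosen sparsity level.

```python
import numpy as np
from skimage.segmentation import slic

def region_masked_input(image, score_fn, k=10, n_segments=100):
    """Keep only the k highest-scoring regions of an image, zeroing the
    rest, so the classifier sees a region-level sparsified input."""
    segments = slic(image, n_segments=n_segments, start_label=0)
    labels = np.unique(segments)
    scores = np.array([score_fn(image, segments == lbl) for lbl in labels])
    keep = labels[np.argsort(scores)[-k:]]
    mask = np.isin(segments, keep)
    return image * mask[..., None]
```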
Authors: Haichen Hu, David Simchi-Levi, Navid Azizan
Abstract: Contextual online decision-making problems with constraints appear in various real-world applications, such as personalized recommendation with resource limits and dynamic pricing with fairness constraints. In this paper, we investigate a general formulation of sequential decision-making with stage-wise feasibility constraints, where at each round, the learner must select an action based on observed context while ensuring a problem-specific feasibility criterion. We propose a unified algorithmic framework that captures many existing constrained learning problems, including constrained bandits, stream active learning, online hypothesis testing, and model calibration. Central to our approach is the concept of upper counterfactual confidence bound, which enables the design of practically efficient online algorithms using any offline conditional density estimation oracle. Technically, to handle feasibility constraints, we introduce a generalized notion of the eluder dimension, extending it from the classical setting based on square loss to a broader class of metric-like probability divergences, which could capture the complexity of various density function classes and characterize the loss incurred due to feasibility constraint uncertainty. Our result offers a principled foundation for constrained sequential decision-making in both theory and practice.
Authors: Michael Benigni, Maurizio Ferrari Dacrema, Dietmar Jannach
Abstract: Countless new machine learning models are published every year and are reported to significantly advance the state-of-the-art in \emph{top-n} recommendation. However, earlier reproducibility studies indicate that progress in this area may be quite limited. Specifically, various widespread methodological issues, e.g., comparisons with untuned baseline models, have led to an \emph{illusion of progress}. In this work, our goal is to examine whether these problems persist in today's research. To this end, we aim to reproduce the latest advancements reported from applying modern Denoising Diffusion Probabilistic Models to recommender systems, focusing on four models published at the top-ranked SIGIR conference in 2023 and 2024. Our findings are concerning, revealing persistent methodological problems. Alarmingly, through experiments, we find that the latest recommendation techniques based on diffusion models, despite their computational complexity and substantial carbon footprint, are consistently outperformed by simpler existing models. Furthermore, we identify key mismatches between the characteristics of diffusion models and those of the traditional \emph{top-n} recommendation task, raising doubts about their suitability for recommendation. We also note that, in the papers we analyze, the generative capabilities of these models are constrained to a minimum. Overall, our results and continued methodological issues call for greater scientific rigor and a disruptive change in the research and publication culture in this area.
Authors: Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Abstract: Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose $\textit{Diversity-aware Reward Adjustment}$ (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR.~GRPO, resulting in $\textit{DRA-GRPO}$ and $\textit{DGA-DR.~GRPO}$. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at https://github.com/xiwenc1/DRA-GRPO.
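The reward adjustment can be illustrated with a simpler redundancy measure: scale each completion's reward down by its mean cosine similarity to the rest of its group. The paper uses Submodular Mutual Information instead, so this is an approximation of the idea, not DRA itself.

```python
import torch
import torch.nn.functional as F

def diversity_adjusted_rewards(rewards, embeddings, alpha=0.5):
    """Downweight rewards of semantically redundant completions and
    amplify diverse ones. Redundancy here is mean pairwise cosine
    similarity to the other sampled completions in the group."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t()                              # (n, n) similarities
    n = sim.size(0)
    redundancy = (sim.sum(1) - 1.0) / (n - 1)    # exclude self-similarity
    return rewards * (1.0 - alpha * redundancy)
```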
Authors: Dmitry Rybin, Yushun Zhang, Zhi-Quan Luo
Abstract: We present RXTX, a new algorithm for computing the product of a matrix with its transpose, $XX^{t}$ for $X\in \mathbb{R}^{n\times m}$. RXTX uses $5\%$ fewer multiplications and $5\%$ fewer operations (additions and multiplications) than state-of-the-art algorithms. Notably, the acceleration holds not only asymptotically for large matrices with $n \rightarrow \infty$, but also for small matrices, including $n = 4$. The algorithm was discovered by combining machine-learning-based search methods with combinatorial optimization.
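For intuition on why structure reduces operation counts: $XX^{t}$ is symmetric, so even a naive algorithm can skip nearly half of its $n^2 m$ multiplications by filling only one triangle, as below; RXTX goes much further with a recursive, Strassen-like scheme found by ML-guided search.

```python
import numpy as np

def gram_upper(X):
    """Compute G = X @ X.T touching each symmetric pair once: about
    n(n+1)/2 * m multiplications instead of the naive n^2 * m."""
    n = X.shape[0]
    G = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            G[i, j] = G[j, i] = X[i] @ X[j]
    return G

X = np.random.rand(4, 6)
assert np.allclose(gram_upper(X), X @ X.T)
```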
Authors: Ruichen Yang, Gy\"orgy M. L\'evay, Christopher L. Hunt, D\'aniel Czeiner, Megan C. Hodgson, Damini Agarwal, Rahul R. Kaliki, Nitish V. Thakor
Abstract: State-of-the-art upper limb myoelectric prostheses often use pattern recognition (PR) control systems that translate electromyography (EMG) signals into desired movements. As prosthesis movement complexity increases, users often struggle to produce sufficiently distinct EMG patterns for reliable classification. Existing training typically involves heuristic, trial-and-error user adjustments to static decoder boundaries. Goal: We introduce the Reviewer, a 3D visual interface projecting EMG signals directly into the decoder's classification space, providing intuitive, real-time insight into PR algorithm behavior. This structured feedback reduces cognitive load and fosters mutual, data-driven adaptation between user-generated EMG patterns and decoder boundaries. Methods: A 10-session study with 12 able-bodied participants compared PR performance after motor-based training and updating using the Reviewer versus conventional virtual arm visualization. Performance was assessed using a Fitts' law task involving cursor aperture and orientation control. Results: Participants trained with the Reviewer achieved higher completion rates, reduced overshoot, and improved path efficiency and throughput compared to the standard visualization group. Significance: The Reviewer introduces decoder-informed motor training, facilitating immediate and consistent PR-based myoelectric control improvements. By iteratively refining control through real-time feedback, this approach reduces reliance on trial-and-error recalibration, enabling a more adaptive, self-correcting training framework. Conclusion: The 3D visual feedback significantly improves PR control in novice operators through structured training, enabling feedback-driven adaptation and reducing reliance on extensive heuristic adjustments.