Authors: Zhaoyang Chen, Cody Fleming
Abstract: Classifier-free guidance has shown strong potential in diffusion-based reinforcement learning. However, existing methods rely on joint training of the guidance module and the diffusion model, which can be suboptimal during the early stages, when the guidance is inaccurate and provides noisy learning signals. In offline RL, guidance depends solely on offline data (observations, actions, and rewards) and is independent of the policy module's behavior, suggesting that joint training is not required. This paper proposes modular training methods that decouple the guidance module from the diffusion model, based on three key findings. (1) Guidance Necessity: we explore how the effectiveness of guidance varies with the training stage and algorithm choice, uncovering the roles of guidance and diffusion; the lack of good guidance in the early stage presents an opportunity for optimization. (2) Guidance-First Diffusion Training: we introduce a method in which the guidance module is first trained independently as a value estimator, then frozen to guide the diffusion model using classifier-free reward guidance. This modularization reduces memory usage, improves computational efficiency, and enhances both sample efficiency and final performance. (3) Cross-Module Transferability: applying two independently trained guidance models, one during training and the other during inference, can significantly reduce normalized-score variance (e.g., reducing IQR by 86%). We show that guidance modules trained with one algorithm (e.g., IDQL) can be directly reused with another (e.g., DQL) with no additional training required, demonstrating baseline-level performance as well as strong modularity and transferability. We provide theoretical justification and empirical validation on bullet D4RL benchmarks. Our findings suggest a new paradigm for offline RL: modular, reusable, and composable training pipelines.
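As a point of reference for the guidance mechanism the paper modularizes, a minimal sketch of the classifier-free guidance combination step; model, cond, and the guidance weight w are illustrative placeholders rather than the paper's actual interface, and the guidance-first value-estimator training is not reproduced here:

    def cfg_denoise(model, x_t, t, cond, w=1.5):
        # Classifier-free guidance: blend conditional and unconditional noise
        # predictions; w > 0 strengthens the conditioning (e.g., reward) signal.
        eps_cond = model(x_t, t, cond)    # conditioned on a reward/return label
        eps_uncond = model(x_t, t, None)  # condition dropped (null token)
        return eps_uncond + w * (eps_cond - eps_uncond)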
Authors: Yu Zheng
Abstract: The proliferation of artificial intelligence has enabled a diversity of applications that bridge the gap between digital and physical worlds. As physical environments are too complex to model through a single information acquisition approach, it is crucial to fuse multimodal data generated by different sources, such as sensors, devices, systems, and people, to solve a problem in the real world. Unfortunately, it is neither applicable nor sustainable to deploy new resources to collect original data from scratch for every problem. Thus, when data is inadequate in the problem's domain, it is vital to fuse knowledge from multimodal data that is already available in other domains. We call this cross-domain knowledge fusion. Existing research focuses on fusing multimodal data in a single domain, supposing the knowledge from different datasets is intrinsically aligned; however, this assumption may not hold in the scenarios of cross-domain knowledge fusion. In this paper, we formally define the cross-domain multimodal data fusion problem, discussing its unique challenges, differences, and advantages beyond data fusion in a single domain. We propose a four-layer framework, consisting of Domains, Links, Models, and Data layers, answering three key questions: "what to fuse", "why it can be fused", and "how to fuse". The Domains Layer selects relevant data from different domains for a given problem. The Links Layer reveals the philosophy of knowledge alignment beyond specific model structures. The Models Layer provides two knowledge fusion paradigms based on the fundamental mechanisms for processing data. The Data Layer turns data of different structures, resolutions, scales, and distributions into a consistent representation that can be fed into an AI model. With this framework, we can design end-to-end solutions that fuse cross-domain multimodal data effectively for solving real-world problems.
Authors: Jiahao Qin, Bei Peng, Feng Liu, Guangliang Cheng, Lu Zong
Abstract: Deep learning models frequently encounter feature uncertainty in diverse learning scenarios, significantly impacting their performance and reliability. This challenge is particularly complex in multi-modal scenarios, where models must integrate information from different sources with inherent uncertainties. We propose Dynamic Uncertainty-Aware Learning (DUAL), a unified framework that effectively handles feature uncertainty in both single-modal and multi-modal scenarios. DUAL introduces three key innovations: Dynamic Feature Uncertainty Modeling, which continuously refines uncertainty estimates through joint consideration of feature characteristics and learning dynamics; Adaptive Distribution-Aware Modulation, which maintains balanced feature distributions through dynamic sample influence adjustment; and Uncertainty-aware Cross-Modal Relationship Learning, which explicitly models uncertainties in cross-modal interactions. Through extensive experiments, we demonstrate DUAL's effectiveness across multiple domains: in computer vision tasks, it achieves substantial improvements of 7.1% accuracy on CIFAR-10, 6.5% accuracy on CIFAR-100, and 2.3% accuracy on Tiny-ImageNet; in multi-modal learning, it demonstrates consistent gains of 4.1% accuracy on CMU-MOSEI and 2.8% accuracy on CMU-MOSI for sentiment analysis, while achieving 1.4% accuracy improvements on MISR. The code will be available on GitHub soon.
Authors: Lesley Wheat, Martin v. Mohrenschildt, Saeid Habibi
Abstract: The Bayes Error Rate (BER) is the fundamental limit on the achievable generalizable classification accuracy of any machine learning model due to inherent uncertainty within the data. BER estimators offer insight into the difficulty of any classification problem and set expectations for optimal classification performance. In order to be useful, the estimators must also be accurate with a limited number of samples on multivariate problems with unknown class distributions. To determine which estimators meet the minimum requirements for "usefulness", an in-depth examination of their accuracy is conducted using Monte Carlo simulations with synthetic data in order to obtain their confidence bounds for binary classification. To examine the usability of the estimators on real-world applications, new test scenarios are introduced upon which 2500 Monte Carlo simulations per scenario are run over a wide range of BER values. In a comparison of k-Nearest Neighbor (kNN), Generalized Henze-Penrose (GHP) divergence, and Kernel Density Estimation (KDE) techniques, results show that kNN is overwhelmingly the most accurate non-parametric estimator. To reach the target of an under-5-percent range for the 95 percent confidence bounds, the minimum number of required samples per class is 1000. As more features are added, more samples are needed, such that 2500 samples per class are required at only 4 features. Other estimators do become more accurate than kNN as more features are added, but consistently fail to meet the target range.
Authors: Shriyank Somvanshi, Anannya Ghosh Tusti, Mahmuda Sultana Mimi, Md Monzurul Islam, Sazzad Bin Bashar Polock, Anandi Dutta, Subasish Das
Abstract: The increasing presence of automated vehicles (AVs) presents new challenges for crash classification and safety analysis. Accurately identifying the SAE automation level involved in each crash is essential to understanding crash dynamics and system accountability. However, existing approaches often overlook automation-specific factors and lack the model sophistication needed to capture distinctions between different SAE levels. To address this gap, this study evaluates the performance of three advanced tabular deep learning models, MambaAttention, TabPFN, and TabTransformer, for classifying SAE automation levels using structured crash data from Texas (2024), covering 4,649 cases categorized as Assisted Driving (SAE Level 1), Partial Automation (SAE Level 2), and Advanced Automation (SAE Levels 3-5 combined). Following class balancing using SMOTEENN, the models were trained and evaluated on a unified dataset of 7,300 records. MambaAttention demonstrated the highest overall performance (F1-scores: 88% for SAE 1, 97% for SAE 2, and 99% for SAE 3-5), while TabPFN excelled in zero-shot inference with high robustness for rare crash categories. In contrast, TabTransformer underperformed, particularly in detecting Partial Automation crashes (F1-score: 55%), suggesting challenges in modeling shared human-system control dynamics. These results highlight the capability of deep learning models tailored for tabular data to enhance the accuracy and efficiency of automation-level classification. Integrating such models into crash analysis frameworks can support policy development, AV safety evaluation, and regulatory decisions, especially in distinguishing high-risk conditions for mid- and high-level automation technologies.
Authors: Mira Nuthakki
Abstract: Traffic congestion and collisions represent significant economic, environmental, and social challenges worldwide. Traditional traffic management approaches have shown limited success in addressing these complex, dynamic problems. To address the current research gaps, three potential tools are developed: a comprehensive 3D city-wide simulation environment that integrates both macroscopic and microscopic traffic dynamics; a collision model; and a reinforcement learning framework with custom reward functions prioritizing safety over efficiency. A Unity game engine-based simulation is used for direct collision modeling. A custom-reward reinforcement learning method, a proximal policy optimization (PPO) model, yields substantial improvements over baseline results, achieving more than threefold reductions in serious collisions, vehicle-vehicle collisions, and total distance travelled relative to baseline values. The model also improves fuel efficiency by 39% and reduces carbon emissions by 88%. Results establish feasibility for city-wide 3D traffic simulation applications incorporating the vision-zero safety principles of the Department of Transportation, including physics-informed, adaptable, realistic collision modeling, as well as appropriate reward modeling for real-world traffic signal control towards reducing collisions, optimizing traffic flow, and reducing greenhouse emissions.
Authors: Oluwaseyi Giwa
Abstract: Dynamic causal discovery in wireless networks is essential due to evolving interference, fading, and mobility, which complicate traditional static causal models. This paper addresses causal inference challenges in dynamic fading wireless environments by proposing a sequential regression-based algorithm with a novel application of the NOTEARS acyclicity constraint, enabling efficient online updates. We derive theoretical lower and upper bounds on the detection delay required to identify structural changes, explicitly quantifying their dependence on network size, noise variance, and fading severity. Monte Carlo simulations validate these theoretical results, demonstrating linear increases in detection delay with network size, quadratic growth with noise variance, and inverse-square dependence on the magnitude of structural changes. Our findings provide rigorous theoretical insights and practical guidelines for designing robust online causal inference mechanisms to maintain network reliability under nonstationary wireless conditions.
Authors: Vignav Ramesh, Morteza Mardani
Abstract: The iterative and stochastic nature of diffusion models enables test-time scaling, whereby spending additional compute during denoising generates higher-fidelity samples. Increasing the number of denoising steps is the primary scaling axis, but this yields quickly diminishing returns. Instead, optimizing the noise trajectory (the sequence of injected noise vectors) is promising, as the specific noise realizations critically affect sample quality; but this is challenging due to a high-dimensional search space, complex noise-outcome interactions, and costly trajectory evaluations. We address this by first casting diffusion as a Markov Decision Process (MDP) with a terminal reward, showing tree-search methods such as Monte Carlo tree search (MCTS) to be meaningful but impractical. To balance performance and efficiency, we then resort to a relaxation of the MDP, where we view denoising as a sequence of independent contextual bandits. This allows us to introduce an $\epsilon$-greedy search algorithm that globally explores at extreme timesteps and locally exploits during the intermediate steps where de-mixing occurs. Experiments on EDM and Stable Diffusion reveal state-of-the-art scores for class-conditioned/text-to-image generation, exceeding baselines by up to $164\%$ and matching/exceeding MCTS performance. To our knowledge, this is the first practical method for test-time noise trajectory optimization of arbitrary (non-differentiable) rewards.
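A minimal sketch of the bandit view described above, under stated assumptions: step_fn applies one denoiser update with an injected noise vector z, reward_fn scores an intermediate sample, and a fixed eps stands in for the paper's timestep-dependent schedule (global exploration at extreme timesteps, local exploitation in between):

    import numpy as np

    def eps_greedy_step(step_fn, reward_fn, x, t, n_cand=4, eps=0.2, seed=None):
        # Treat one denoising step as a contextual bandit: draw candidate
        # noise vectors, then explore (random pick) or exploit (best reward).
        rng = np.random.default_rng(seed)
        cands = [rng.standard_normal(x.shape) for _ in range(n_cand)]
        outs = [step_fn(x, t, z) for z in cands]
        if rng.random() < eps:
            i = int(rng.integers(n_cand))                      # explore
        else:
            i = int(np.argmax([reward_fn(o) for o in outs]))   # exploit
        return outs[i]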
Authors: Bin Wang, Yongqi Han, Minbo Ma, Tianrui Li, Junbo Zhang, Feng Hong, Yanwei Yu
Abstract: Deep learning-based approaches have demonstrated significant advancements in time series forecasting. Despite these ongoing developments, the complex dynamics of time series make it challenging to establish a rule of thumb for designing a golden model architecture. In this study, we argue that refining existing advanced models through a universal calibrating strategy can deliver substantial benefits with minimal resource costs, as opposed to elaborating and training a new model from scratch. We first identify a multi-target learning conflict in the calibrating process, which arises when optimizing variables across time steps and leads to the underutilization of the model's learning capabilities. To address this issue, we propose an innovative calibrating strategy called Socket+Plug (SoP). This approach retains an exclusive optimizer and early-stopping monitor for each predicted target within each Plug, while keeping the fully trained Socket backbone frozen. The model-agnostic nature of SoP allows it to directly calibrate the performance of any trained deep forecasting model, regardless of its specific architecture. Extensive experiments on various time series benchmarks and a spatio-temporal meteorological ERA5 dataset demonstrate the effectiveness of SoP, achieving up to a 22% improvement even when employing a simple MLP as the Plug (highlighted in Figure 1).
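A minimal sketch of the Socket+Plug idea as described in the abstract, assuming a backbone output of shape (batch, n_targets, dim) and a linear Plug; the paper's Plug architecture and the per-Plug early-stopping monitor are not reproduced:

    import torch
    import torch.nn as nn

    def build_plugs(socket, n_targets, dim, lr=1e-3):
        # Freeze the fully trained backbone (Socket) and give each predicted
        # target its own small Plug with a dedicated optimizer.
        for p in socket.parameters():
            p.requires_grad_(False)
        plugs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_targets))
        opts = [torch.optim.Adam(pl.parameters(), lr=lr) for pl in plugs]
        return plugs, opts

    def calibrated_forecast(socket, plugs, x):
        base = socket(x)  # assumed shape: (batch, n_targets, dim)
        return torch.stack([pl(base[:, i]) for i, pl in enumerate(plugs)], dim=1)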
Authors: Nadav Timor, Jonathan Mamou, Oren Pereg, Hongyang Zhang, David Harel
Abstract: Speculative decoding relies on fast and accurate drafters. Recent state-of-the-art language models employ larger and larger vocabularies, which significantly slows down drafters. One promising approach to boost the efficiency of speculative decoding is to use drafters with smaller vocabularies. However, existing sampling methods cannot draw out-of-vocabulary tokens, creating a tradeoff between drafters' vocabulary size and acceptance rates. This paper introduces Redistributing Drafter Kernels (RDK), the first out-of-vocabulary sampler that effectively recovers acceptance rates by virtually restoring pruned target tokens. RDK leverages token-affinity priors to reallocate drafter mass towards high-overlap regions. We prove mathematically that RDK can achieve higher acceptance rates than vanilla and state-of-the-art samplers. We provide an efficient first-order approximation of RDK and prove that it reduces redistribution times from $O(N^2)$ to $O(N)$, enabling lightweight implementations for large vocabularies. Our experiments demonstrate that this linear-time RDK significantly boosts acceptance rates even after extreme pruning (removing more than 75% of the drafter's vocabulary), where existing samplers fail. RDK opens the door to extremely pruned drafters, which were previously impractical.
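For context, a sketch of the standard speculative-sampling acceptance test whose rate RDK aims to recover; p_target and q_draft are per-token probability arrays, and the function is illustrative rather than RDK itself:

    import numpy as np

    def accept_draft_token(tok, p_target, q_draft, rng=None):
        # A drafted token is kept with probability min(1, p_target/q_draft).
        # A pruned-vocabulary drafter can never propose target tokens outside
        # its vocabulary, which caps the acceptance rate; RDK's redistribution
        # of drafter mass aims to close exactly this gap.
        rng = np.random.default_rng() if rng is None else rng
        return rng.random() < min(1.0, p_target[tok] / max(q_draft[tok], 1e-12))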
Authors: Md Nahid Hasan Shuvo, Moinul Hossain
Abstract: Federated Learning (FL) is increasingly adopted as a decentralized machine learning paradigm due to its capability to preserve data privacy by training models without centralizing user data. However, FL is susceptible to indirect privacy breaches via network traffic analysis, an area not explored in existing research. The primary objective of this research is to study the feasibility of fingerprinting deep learning models deployed within FL environments by analyzing their network-layer traffic information. In this paper, we conduct an experimental evaluation using various deep learning architectures (i.e., CNN, RNN) within a federated learning testbed. We utilize machine learning algorithms, including Support Vector Machines (SVM), Random Forest, and Gradient Boosting, to fingerprint unique patterns within the traffic data. Our experiments show high fingerprinting accuracy, achieving 100% accuracy using Random Forest and around 95.7% accuracy using SVM and Gradient Boosting classifiers. This analysis suggests that specific architectures can be identified from subsections of the network traffic. Hence, if an adversary learns the underlying DL architecture, they can exploit that information to conduct targeted attacks. These findings reveal a notable security vulnerability in FL systems and the necessity of strengthening defenses at the network level.
Authors: Qiusheng Huang, Yuan Niu, Xiaohui Zhong, Anboyu Guo, Lei Chen, Dianjun Zhang, Xuefeng Zhang, Hao Li
Abstract: Accurate, high-resolution ocean forecasting is crucial for maritime operations and environmental monitoring. While traditional numerical models are capable of producing sub-daily, eddy-resolving forecasts, they are computationally intensive and face challenges in maintaining accuracy at fine spatial and temporal scales. In contrast, recent data-driven approaches offer improved computational efficiency and emerging potential, yet typically operate at daily resolution and struggle with sub-daily predictions due to error accumulation over time. We introduce FuXi-Ocean, the first data-driven global ocean forecasting model achieving six-hourly predictions at eddy-resolving 1/12° spatial resolution, reaching depths of up to 1500 meters. The model architecture integrates a context-aware feature extraction module with a predictive network employing stacked attention blocks. The core innovation is the Mixture-of-Time (MoT) module, which adaptively integrates predictions from multiple temporal contexts by learning variable-specific reliability, mitigating cumulative errors in sequential forecasting. Through comprehensive experimental evaluation, FuXi-Ocean demonstrates superior skill in predicting key variables, including temperature, salinity, and currents, across multiple depths.
Authors: Waël Doulazmi, Auguste Lehuger, Marin Toromanoff, Valentin Charraut, Thibault Buhet, Fabien Moutarde
Abstract: Reinforcement Learning's high sensitivity to hyperparameters is a source of instability and inefficiency, creating significant challenges for practitioners. Hyperparameter Optimization (HPO) algorithms have been developed to address this issue; among them, Population-Based Training (PBT) stands out for its ability to generate hyperparameter schedules instead of fixed configurations. PBT trains a population of agents, each with its own hyperparameters, frequently ranking them and replacing the worst performers with mutations of the best agents. These intermediate selection steps can cause PBT to focus on short-term improvements, leading it to get stuck in local optima and eventually fall behind vanilla Random Search over longer timescales. This paper studies how this greediness issue is connected to the choice of evolution frequency, the rate at which the selection is performed. We propose Multiple-Frequencies Population-Based Training (MF-PBT), a novel HPO algorithm that addresses greediness by employing sub-populations, each evolving at a distinct frequency. MF-PBT introduces a migration process to transfer information between sub-populations, with an asymmetric design to balance short- and long-term optimization. Extensive experiments on the Brax suite demonstrate that MF-PBT improves sample efficiency and long-term performance, even without actually tuning hyperparameters.
Authors: Abdelrahman Sayed Sayed, Pierre-Jean Meyer, Mohamed Ghazel
Abstract: A neural ordinary differential equation (neural ODE) is a machine learning model that is commonly described as a continuous-depth generalization of a residual network (ResNet) with a single residual block; conversely, the ResNet can be seen as the Euler discretization of the neural ODE. These two models are therefore strongly related, in that the behaviors of either model are considered to approximate those of the other. In this work, we establish a more formal relationship between these two models by bounding the approximation error between two such related models. The obtained error bound then allows us to use one of the models as a verification proxy for the other, without running the verification tools twice: if the reachable output set expanded by the error bound satisfies a safety property on one of the models, this safety property is then guaranteed to be satisfied on the other model as well. This feature is fully reversible, and the initial safety verification can be run interchangeably on either of the two models. This novel approach is illustrated on a numerical example of a fixed-point attractor system modeled as a neural ODE.
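Concretely, writing the neural ODE as $\dot{x}(t) = f_\theta(x(t))$, a forward Euler step of size $h$ gives $x_{k+1} = x_k + h\,f_\theta(x_k)$; with $h = 1$ this is exactly the update $x_{k+1} = x_k + f_\theta(x_k)$ of a single ResNet residual block, and it is the error of this correspondence that the paper bounds.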
Authors: Selcuk Gurses, Aozhong Zhang, Yanxia Deng, Xun Dong, Xin Li, Naigang Wang, Penghang Yin, Zi Yang
Abstract: Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Code is available at https://github.com/ziyangjoy/DiaBlo.
Authors: Kaiwen Duan, Hongwei Yao, Yufei Chen, Ziyun Li, Tong Qiao, Zhan Qin, Cong Wang
Abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning text-to-image (T2I) models with human preferences. However, RLHF's feedback mechanism also opens new pathways for adversaries. This paper demonstrates the feasibility of hijacking T2I models by poisoning a small fraction of preference data with natural-appearing examples. Specifically, we propose BadReward, a stealthy clean-label poisoning attack targeting the reward model in multi-modal RLHF. BadReward operates by inducing feature collisions between visually contradicted preference data instances, thereby corrupting the reward model and indirectly compromising the T2I model's integrity. Unlike existing alignment poisoning techniques focused on single (text) modality, BadReward is independent of the preference annotation process, enhancing its stealth and practical threat. Extensive experiments on popular T2I models show that BadReward can consistently guide the generation towards improper outputs, such as biased or violent imagery, for targeted concepts. Our findings underscore the amplified threat landscape for RLHF in multi-modal systems, highlighting the urgent need for robust defenses. Disclaimer. This paper contains uncensored toxic content that might be offensive or disturbing to the readers.
Authors: Shahbaz Rezaei, Avishai Halev, Xin Liu
Abstract: A prevailing approach to explaining time series models is to generate attributions in the time domain. A recent development in time series XAI is the concept of explanation spaces, where any model trained in the time domain can be interpreted with any existing XAI method in alternative domains, such as frequency. The prevailing approach is to present XAI attributions either in the time domain or in the domain where the attribution is most sparse. In this paper, we demonstrate that in certain cases, XAI methods can generate attributions that highlight fundamentally different features in the time and frequency domains that are not direct counterparts of one another. This suggests that attributions from both domains should be presented to achieve a more comprehensive interpretation, demonstrating the necessity of multi-domain explanation. To quantify when such cases arise, we introduce the uncertainty principle (UP), originally developed in quantum mechanics and later studied in harmonic analysis and signal processing, to the XAI literature. This principle establishes a lower bound on how much a signal can be simultaneously localized in both the time and frequency domains. By leveraging this concept, we assess whether attributions in the time and frequency domains violate this bound, indicating that they emphasize distinct features. In other words, UP provides a sufficient condition that the time and frequency domain explanations do not match and, hence, should both be presented to the end user. We validate the effectiveness of this approach across various deep learning models, XAI methods, and a wide range of classification and forecasting datasets. The frequent occurrence of UP violations across various datasets and XAI methods highlights the limitations of existing approaches that focus solely on time-domain explanations. This underscores the need for multi-domain explanations as a new paradigm.
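For reference, one standard statement of the bound: for a unit-energy signal with centered time spread $\sigma_t$ and frequency spread $\sigma_\omega$, the Heisenberg-Gabor inequality gives $\sigma_t \, \sigma_\omega \ge \tfrac{1}{2}$, so no signal can be arbitrarily concentrated in both domains at once; the UP-violation test sketched in the abstract checks time- and frequency-domain attributions against a bound of this form.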
Authors: James Bagrow, Josh Bongard
Abstract: Kolmogorov-Arnold Networks (KANs) uniquely combine high accuracy with interpretability, making them valuable for scientific modeling. However, it is unclear a priori how deep a network needs to be for any given task, and deeper KANs can be difficult to optimize. Here we introduce multi-exit KANs, where each layer includes its own prediction branch, enabling the network to make accurate predictions at multiple depths simultaneously. This architecture provides deep supervision that improves training while discovering the right level of model complexity for each task. Multi-exit KANs consistently outperform standard, single-exit versions on synthetic functions, dynamical systems, and real-world datasets. Remarkably, the best predictions often come from earlier, simpler exits, revealing that these networks naturally identify smaller, more parsimonious and interpretable models without sacrificing accuracy. To automate this discovery, we develop a differentiable "learning to exit" algorithm that balances contributions from exits during training. Our approach offers scientists a practical way to achieve both high performance and interpretability, addressing a fundamental challenge in machine learning for scientific discovery.
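A minimal sketch of the deep-supervision objective implied above, assuming mean-squared-error exits and a learnable softmax weighting as a stand-in for the paper's learning-to-exit algorithm:

    import torch
    import torch.nn.functional as F

    def multi_exit_loss(exit_preds, target, exit_logits):
        # Every exit gets its own loss; a learnable softmax over exits
        # balances their contributions during training.
        w = torch.softmax(exit_logits, dim=0)  # one learnable weight per exit
        losses = torch.stack([F.mse_loss(p, target) for p in exit_preds])
        return (w * losses).sum()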
Authors: Kristen Goebel, William Solow, Paola Pesantez-Cabrera, Markus Keller, Alan Fern
Abstract: This paper introduces a novel approach to budgeted online active learning from finite-horizon data streams with extremely limited labeling budgets. In agricultural applications, such streams might include daily weather data over a growing season, and labels require costly measurements of weather-dependent plant characteristics. Our method integrates two key sources of prior information: a collection of preexisting expert predictors and episodic behavioral knowledge of the experts based on unlabeled data streams. Unlike previous research on online active learning with experts, our work simultaneously considers query budgets, finite horizons, and episodic knowledge, enabling effective learning in applications with severely limited labeling capacity. We demonstrate the utility of our approach through experiments on various prediction problems derived from both a realistic agricultural crop simulator and real-world data from multiple grape cultivars. The results show that our method significantly outperforms baseline expert predictions, uniform query selection, and existing approaches that consider budgets and limited horizons but neglect episodic knowledge, even under highly constrained labeling budgets.
Authors: Jack Bell, Luigi Quarantiello, Eric Nuertey Coleman, Lanpei Li, Malio Li, Mauro Madeddu, Elia Piccoli, Vincenzo Lomonaco
Abstract: Continual learning--the ability to acquire, retain, and refine knowledge over time--has always been fundamental to intelligence, both human and artificial. Historically, different AI paradigms have acknowledged this need, albeit with varying priorities: early expert and production systems focused on incremental knowledge consolidation, while reinforcement learning emphasised dynamic adaptation. With the rise of deep learning, deep continual learning has primarily focused on learning robust and reusable representations over time to solve sequences of increasingly complex tasks. However, the emergence of Large Language Models (LLMs) and foundation models has raised the question: Do we still need continual learning when centralised, monolithic models can tackle diverse tasks with access to internet-scale knowledge? We argue that continual learning remains essential for three key reasons: (i) continual pre-training is still necessary to ensure foundation models remain up to date, mitigating knowledge staleness and distribution shifts while integrating new information; (ii) continual fine-tuning enables models to specialise and personalise, adapting to domain-specific tasks, user preferences, and real-world constraints without full retraining, avoiding the need for computationally expensive long context-windows; (iii) continual compositionality offers a scalable and modular approach to intelligence, enabling the orchestration of foundation models and agents to be dynamically composed, recombined, and adapted. While continual pre-training and fine-tuning are explored as niche research directions, we argue it is continual compositionality that will mark the rebirth of continual learning. The future of AI will not be defined by a single static model but by an ecosystem of continually evolving and interacting models, making continual learning more relevant than ever.
Authors: Ethan Che, Hakan Ceylan, James McInerney, Nathan Kallus
Abstract: Modern recommendation systems rely on exploration to learn user preferences for new items, typically implementing uniform exploration policies (e.g., epsilon-greedy) due to their simplicity and compatibility with machine learning (ML) personalization models. Within these systems, a crucial consideration is the rate of exploration: what fraction of user traffic should receive random item recommendations, and how this should evolve over time. While various heuristics exist for navigating the resulting exploration-exploitation tradeoff, selecting optimal exploration rates is complicated by practical constraints including batched updates, time-varying user traffic, short time horizons, and minimum exploration requirements. In this work, we propose a principled framework for determining the exploration schedule based on directly minimizing Bayesian regret through stochastic gradient descent (SGD), allowing for dynamic exploration rate adjustment via Model-Predictive Control (MPC). Through extensive experiments with recommendation datasets, we demonstrate that variations in the batch size across periods significantly influence the optimal exploration strategy. Our optimization methods automatically calibrate exploration to the specific problem setting, consistently matching or outperforming the best heuristic for each setting.
Authors: Juan Sebastian Rojas, Chi-Guhn Lee
Abstract: To date, distributional reinforcement learning (distributional RL) methods have exclusively focused on the discounted setting, where an agent aims to optimize a potentially discounted sum of rewards over time. In this work, we extend distributional RL to the average-reward setting, where an agent aims to optimize the reward received per time-step. In particular, we utilize a quantile-based approach to develop the first set of algorithms that can successfully learn and/or optimize the long-run per-step reward distribution, as well as the differential return distribution, of an average-reward MDP. We derive provably convergent tabular algorithms for both prediction and control, as well as a broader family of algorithms with appealing scaling properties. Empirically, we find that these algorithms consistently yield competitive performance when compared to their non-distributional equivalents, while also capturing rich information about the long-run reward and return distributions.
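As background, a sketch of the standard quantile-regression (pinball) loss that quantile-based distributional RL builds on; the average-reward-specific updates (per-step reward and differential-return distributions) are the paper's contribution and are not reproduced here:

    import torch

    def quantile_loss(pred_q, targets, taus):
        # pred_q (N,): estimated tau-quantiles of a reward/return distribution.
        # targets (M,): sampled targets; taus (N,): fractions, e.g. (2i+1)/(2N).
        u = targets.unsqueeze(0) - pred_q.unsqueeze(1)  # (N, M) signed errors
        return (torch.where(u >= 0, taus.unsqueeze(1),
                            taus.unsqueeze(1) - 1.0) * u).mean()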
Authors: Yide Ran, Wentao Guo, Jingwei Sun, Yanzhou Pan, Xiaodong Yu, Hao Wang, Jianwen Xie, Yiran Chen, Denghui Zhang, Zhaozhuo Xu
Abstract: Federated Learning enables collaborative fine-tuning of Large Language Models (LLMs) across decentralized Non-Independent and Identically Distributed (Non-IID) clients, but such models' massive parameter sizes lead to significant memory and communication challenges. This work introduces Meerkat, a sparse zeroth-order optimization (ZO) method designed for federated LLM fine-tuning. By limiting fine-tuning to a transferable, static, extremely sparse subset of parameters, Meerkat achieves remarkable communication efficiency, enabling cost-effective high-frequency synchronization. With theoretical analysis and experiments, we show that this high-frequency communication effectively mitigates Non-IID data challenges and leads to superior performance compared to full-parameter ZO. Furthermore, experimental results show that Meerkat outperforms existing sparsity baselines at the same communication frequency. To further handle Non-IID drift, Meerkat leverages traceable local updates and forms a virtual path for each client. This virtual path mechanism reveals the GradIP phenomenon: the inner products between the LLM pre-training gradients maintained by the server and the client gradients estimated via ZO converge for extreme Non-IID clients but oscillate for IID ones. This distinct behavior provides a signal for identifying clients with extreme data heterogeneity. Using this signal, we propose Meerkat-vp, which analyzes GradIP trajectories to identify extreme Non-IID clients and applies early stopping to enhance aggregated model quality. Experiments confirm that Meerkat and Meerkat-vp significantly improve the efficiency and effectiveness of ZO federated LLM fine-tuning.
Authors: Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
Abstract: Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been made towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we close this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. When employing our robust CLIP encoders in multimodal retrieval tasks, we improve the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.
Authors: Divya Shyamal, Jiaqi Zhang, Caroline Uhler
Abstract: A combinatorial intervention, consisting of multiple treatments applied to a single unit with potentially interactive effects, has substantial applications in fields such as biomedicine, engineering, and beyond. Given $p$ possible treatments, conducting all possible $2^p$ combinatorial interventions can be laborious and quickly becomes infeasible as $p$ increases. Here we introduce probabilistic factorial experimental design, a framework formalizing how scientists perform lab experiments. In this framework, the experimenter selects a dosage for each possible treatment and applies it to a group of units. Each unit independently receives a random combination of treatments, sampled from a product Bernoulli distribution determined by the dosages. Additionally, the experimenter can carry out such experiments over multiple rounds, adapting the design in an active manner. We address the optimal experimental design problem within an intervention model that imposes bounded-degree interactions between treatments. In the passive setting, we provide a closed-form solution for the near-optimal design. Our results prove that a dosage of $\tfrac{1}{2}$ for each treatment is optimal up to a factor of $1+O(\tfrac{\ln(n)}{n})$ for estimating any $k$-way interaction model, regardless of $k$, and imply that $O\big(kp^{3k}\ln(p)\big)$ observations are required to accurately estimate this model. For the multi-round setting, we provide a near-optimal acquisition function that can be numerically optimized. We also explore several extensions of the design problem and finally validate our findings through simulations.
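A minimal sketch of the sampling scheme the abstract describes: each unit draws its treatment combination from a product Bernoulli distribution set by the dosages, with the near-optimal dosage of 1/2 per treatment shown as usage:

    import numpy as np

    def sample_factorial_batch(dosages, n_units, seed=0):
        # Each unit independently receives treatment j with prob. dosages[j].
        rng = np.random.default_rng(seed)
        p = np.asarray(dosages, dtype=float)       # shape: (num_treatments,)
        return rng.random((n_units, p.size)) < p   # boolean assignment matrix

    # Near-optimal passive design from the abstract: dosage 1/2 per treatment.
    assignments = sample_factorial_batch([0.5] * 6, n_units=100)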
Authors: Leonid Ryvkin
Abstract: This note is a survey of various results on the capabilities of unique hard attention transformer encoders (UHATs) to recognize formal languages. We distinguish between masked vs. non-masked, finite vs. infinite image, and general vs. bilinear attention score functions. We recall some relations between these models, as well as a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity.
Authors: Haley Dozier, Althea Henslee, Ashley Abraham, Andrew Strelzoff, Mark Chappell
Abstract: The use of machine learning (ML) techniques has allowed rapid advancements in many scientific and engineering fields. One such problem is surface soil taxonomy, a research area previously hindered by reliance on human-derived classifications, which mostly divide a dataset based on historical understandings of that data rather than on data-driven, statistically observable similarities. Using an ML-based taxonomy allows soil researchers to move beyond the limitations of human visualization and to create classifications of high-dimension datasets with a much higher level of specificity than is possible with hand-drawn taxonomies. Furthermore, this pipeline allows for the possibility of producing both highly accurate and flexible soil taxonomies, with classes built to fit a specific application. The machine learning pipeline outlined in this work combines product quantization with the systematic evaluation of parameters and output to get the best available results, rather than accepting sub-optimal results from either default settings or best-guess settings.
Authors: Aref Ghoreishee, Abhishek Mishra, John Walsh, Anup Das, Nagarajan Kandasamy
Abstract: We propose a new ternary spiking neuron model to improve the representation capacity of binary spiking neurons in deep Q-learning. Although a ternary neuron model has recently been introduced to overcome the limited representation capacity offered by binary spiking neurons, we show that its performance is worse than that of binary models in deep Q-learning tasks. Through mathematical and empirical analysis, we identify gradient estimation bias during the training process as a potential underlying cause. We propose a novel ternary spiking neuron model that mitigates this issue by reducing the estimation bias. We use the proposed ternary spiking neuron as the fundamental computing unit in a deep spiking Q-learning network (DSQN) and evaluate the network's performance on seven Atari games from the Gym environment. Results show that the proposed ternary spiking neuron mitigates the drastic performance degradation of ternary neurons in Q-learning tasks and improves the network performance compared to the existing binary neurons, making DSQN a more practical solution for on-board autonomous decision-making tasks.
Authors: Walter Mayor, Johan Obando-Ceron, Aaron Courville, Pablo Samuel Castro
Abstract: The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyper-parameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.
Authors: Melissa Dutz, Han Shao, Avrim Blum, Aloni Cohen
Abstract: Strategic litigation involves bringing a legal case to court with the goal of having a broader impact beyond resolving the case itself: for example, creating precedent which will influence future rulings. In this paper, we explore strategic litigation from the perspective of machine learning theory. We consider an abstract model of a common-law legal system where a lower court decides new cases by applying a decision rule learned from a higher court's past rulings. In this model, we explore the power of a strategic litigator, who strategically brings cases to the higher court to influence the learned decision rule, thereby affecting future cases. We explore questions including: What impact can a strategic litigator have? Which cases should a strategic litigator bring to court? Does it ever make sense for a strategic litigator to bring a case when they are sure the court will rule against them?
Authors: Joonseong Kang, Soojeong Lee, Subeen Park, Sumin Park, Taero Kim, Jihee Kim, Ryunyi Lee, Kyungwoo Song
Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks without parameter updates by conditioning on a few demonstrations provided in the prompt. Despite its success, ICL suffers from several limitations, including sensitivity to demonstration order, context length constraints, and computational inefficiency. To address these challenges, task vector-based approaches compress task information into a single vector. However, these methods typically construct task vectors from fixed sets of demonstrations and reuse them across input queries, without conditioning on the specific input. This limitation can lead models to struggle with effective adaptation when the input query is not well aligned with the underlying demonstrations, consequently degrading their generalization performance on unseen tasks. To overcome this limitation, we propose Adaptive Task Vectors (ATV), a simple and effective framework that dynamically generates task vectors conditioned on each input query. ATV employs a small language model to generate task vectors, which are then transformed to match the target LLM's architecture and applied to guide its output generation. In contrast to ICL and previous vector-based approaches, which rely on fixed demonstration sets and their corresponding vectors, ATV dynamically generates task vectors tailored to each specific input query and task. Consequently, ATV demonstrates strong performance and generalization capabilities, even for unseen tasks. Furthermore, we provide a theoretical analysis indicating that ATV is expressively equivalent to LoRA under equal rank budgets and more expressive than Prefix-Tuning, thereby offering formal support for its representational advantage.
Authors: Yue Gong, Raul Castro Fernandez
Abstract: As hypothesis generation becomes increasingly automated, a new bottleneck has emerged: hypothesis assessment. Modern systems can surface thousands of statistical relationships (correlations, trends, causal links) but offer little guidance on which ones are novel, non-trivial, or worthy of expert attention. In this work, we study the complementary problem to hypothesis generation: automatic hypothesis assessment. Specifically, we ask: given a large set of statistical relationships, can we automatically assess which ones are novel and worth further exploration? We focus on correlations, as they are a common entry point in exploratory data analysis and often serve as the basis for forming deeper scientific or causal hypotheses. To support automatic assessment, we propose to leverage the vast knowledge encoded in LLMs' weights to derive a prior distribution over the correlation value of a variable pair. If an LLM's prior expects the observed correlation value, then that correlation is not surprising, and vice versa. We propose the Logit-based Calibrated Prior, an LLM-elicited correlation prior that transforms the model's raw output logits into a calibrated, continuous predictive distribution over correlation values. We evaluate the prior on a benchmark of 2,096 real-world variable pairs, where it achieves a sign accuracy of 78.8%, a mean absolute error of 0.26, and 95% credible interval coverage of 89.2% in predicting the Pearson correlation coefficient. It also outperforms a fine-tuned RoBERTa classifier in binary correlation prediction and achieves higher precision@K in hypothesis ranking. We further show that the prior generalizes to correlations not seen during LLM pretraining, reflecting context-sensitive reasoning rather than memorization.
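A minimal sketch of how a logit-based prior of this kind could be materialized, assuming the LLM emits logits over discretized correlation bins and that a temperature is fit on held-out pairs; the paper's exact elicitation and calibration details are not reproduced:

    import numpy as np

    def calibrated_correlation_prior(bin_logits, temperature=1.0):
        # Softmax over correlation bins -> predictive distribution on [-1, 1].
        logits = np.asarray(bin_logits, dtype=float) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        centers = np.linspace(-1.0, 1.0, len(probs))  # bin midpoints
        mean = float(probs @ centers)                 # point estimate
        cdf = np.cumsum(probs)
        lo = float(centers[np.searchsorted(cdf, 0.025)])
        hi = float(centers[np.searchsorted(cdf, 0.975)])
        return mean, (lo, hi)                         # mean + 95% credible interval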
Authors: Mahesh Godavarti
Abstract: We present an empirical validation of the directional non-commutative monoidal embedding framework recently introduced in prior work (Godavarti, 2025). This framework defines learnable compositional embeddings using distinct non-commutative operators per dimension (axis) that satisfy an interchange law, generalizing classical one-dimensional transforms. Our primary goal is to verify that this framework can effectively model real data by applying it to a controlled, well-understood task: image classification on the MNIST dataset (LeCun et al., 1998). A central hypothesis for why the proposed monoidal embedding works well is that it generalizes the Discrete Fourier Transform (DFT) (Oppenheim et al., 1999) by learning task-specific frequency components instead of using fixed basis frequencies. We test this hypothesis by comparing learned monoidal embeddings against fixed DFT-based embeddings on MNIST. The results show that as the embedding dimensionality decreases (e.g., from 32 to 8 to 2), the performance gap between the learned monoidal embeddings and fixed DFT-based embeddings on MNIST grows increasingly large. This comparison is used as an analytic tool to explain why the framework performs well: the learnable embeddings can capture the most discriminative spectral components for the task. Overall, our experiments confirm that directional non-commutative monoidal embeddings are highly effective for representing image data, offering a compact learned representation that retains high task performance. The code used in this work is available at https://github.com/mahesh-godavarti/directional_composition_mnist.
Authors: Yifeng Xiao, Yurong Xu, Ning Yan, Masood Mortazavi, Pierluigi Nuzzo
Abstract: Simulation-based design space exploration (DSE) aims to efficiently optimize high-dimensional structured designs under complex constraints and expensive evaluation costs. Existing approaches, including heuristic and multi-step reinforcement learning (RL) methods, struggle to balance sampling efficiency and constraint satisfaction due to sparse, delayed feedback and large hybrid action spaces. In this paper, we introduce CORE, a constraint-aware, one-step RL method for simulation-guided DSE. In CORE, the policy agent learns to sample design configurations by defining a structured distribution over them that incorporates dependencies via a scaling-graph-based decoder, and by shaping rewards to penalize invalid designs based on the feedback obtained from simulation. CORE updates the policy using a surrogate objective that compares the rewards of designs within a sampled batch, without learning a value function. This critic-free formulation enables efficient learning by encouraging the selection of higher-reward designs. We instantiate CORE for hardware-mapping co-design of neural network accelerators, demonstrating that it significantly improves sample efficiency and achieves better accelerator configurations compared to state-of-the-art baselines. Our approach is general and applicable to a broad class of discrete-continuous constrained design problems.
Authors: Daniel Campa, Mehdi Saeedi, Ian Colbert, Srinjoy Das
Abstract: Navigation path traces play a crucial role in video game design, serving as a vital resource for both enhancing player engagement and fine-tuning non-playable character behavior. Generating such paths with human-like realism can enrich the overall gaming experience, and evaluating path traces can provide game designers insights into player interactions. Despite the impressive recent advancements in deep learning-based generative modeling, the video game industry hesitates to adopt such models for path generation, often citing their complex training requirements and interpretability challenges. To address these problems, we propose a novel path generation and evaluation approach that is grounded in principled nonparametric statistics and provides precise control while offering interpretable insights. Our path generation method fuses two statistical techniques: (1) nonparametric model-free transformations that capture statistical characteristics of path traces through time; and (2) copula models that capture statistical dependencies in space. For path evaluation, we adapt a nonparametric three-sample hypothesis test designed to determine if the generated paths are overfit (mimicking the original data too closely) or underfit (diverging too far from it). We demonstrate the precision and reliability of our proposed methods with empirical analysis on two existing gaming benchmarks to showcase controlled generation of diverse navigation paths. Notably, our novel path generator can be fine-tuned with user controllable parameters to create navigation paths that exhibit varying levels of human-likeness in contrast to those produced by neural network-based agents. The code is available at https://github.com/daniel-campa/mf-copula.
Authors: Daniel Ovalle, Lorenz T. Biegler, Ignacio E. Grossmann, Carl D. Laird, Mateo Dulce Rubio
Abstract: We propose Conformal Mixed-Integer Constraint Learning (C-MICL), a novel framework that provides probabilistic feasibility guarantees for data-driven constraints in optimization problems. While standard Mixed-Integer Constraint Learning methods often violate the true constraints due to model error or data limitations, our C-MICL approach leverages conformal prediction to ensure feasible solutions are ground-truth feasible. This guarantee holds with probability at least $1{-}\alpha$, under a conditional independence assumption. The proposed framework supports both regression and classification tasks without requiring access to the true constraint function, while avoiding the scalability issues associated with ensemble-based heuristics. Experiments on real-world applications demonstrate that C-MICL consistently achieves target feasibility rates, maintains competitive objective performance, and significantly reduces computational cost compared to existing methods. Our work bridges mathematical optimization and machine learning, offering a principled approach to incorporate uncertainty-aware constraints into decision-making with rigorous statistical guarantees.
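For intuition, a sketch of the core split-conformal step such a guarantee rests on; C-MICL's actual construction (its conditional independence assumption and classification variant) is richer than this regression sketch:

    import numpy as np

    def conformal_margin(cal_residuals, alpha=0.05):
        # With calibration residuals |y - f(x)|, the ceil((n+1)(1-alpha))-th
        # smallest residual q satisfies P(|y_new - f(x_new)| <= q) >= 1 - alpha
        # under exchangeability.
        r = np.sort(np.asarray(cal_residuals, dtype=float))
        n = r.size
        k = int(np.ceil((n + 1) * (1 - alpha)))
        return r[min(k, n) - 1]

    # A learned constraint g(x) <= 0 can then be tightened to g(x) <= -q, so
    # solutions declared feasible are ground-truth feasible w.p. >= 1 - alpha.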
Authors: Yongxiang Tang, Yanhua Cheng, Xiaocheng Liu, Chenchen Jiao, Yanxiang Zeng, Ning Luo, Pengjia Yuan, Xialong Liu, Peng Jiang
Abstract: In many machine learning tasks, it is often necessary for the relationship between input and output variables to be monotonic, including both strictly monotonic and implicitly monotonic relationships. Traditional methods for maintaining monotonicity mainly rely on construction or regularization techniques, whereas this paper shows that the issue of strict monotonic probability can be viewed as a partial order between an observable revenue variable and a latent cost variable. This perspective enables us to reformulate the monotonicity challenge into modeling the latent cost variable. To tackle this, we introduce a generative network for the latent cost variable, termed the Generative Cost Model (GCM), which inherently addresses the strict monotonic problem, and propose the Implicit Generative Cost Model (IGCM) to address the implicit monotonic problem. We further validate our approach with a numerical simulation of quantile regression and conduct multiple experiments on public datasets, showing that our method significantly outperforms existing monotonic modeling techniques. The code for our experiments can be found at https://github.com/tyxaaron/GCM.
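One way to read the construction: if $C$ is a latent cost with CDF $F_C$ and $r(x)$ is the observable revenue, then modeling the outcome probability as $P(C \le r(x)) = F_C(r(x))$ makes it non-decreasing in $r(x)$ by construction, since every CDF is non-decreasing; learning a generative model of $C$ therefore yields strict monotonicity without explicit constraints or regularization.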
Authors: Wang WeiQuan, Riaz-ul-Haque Mian
Abstract: In semiconductor manufacturing, testing costs remain significantly high, especially during wafer and FPGA testing. To reduce the number of required tests while maintaining predictive accuracy, this study investigates three baseline sampling strategies: Random Sampling, Stratified Sampling, and k-means Clustering Sampling. To further enhance these methods, this study proposes a novel algorithm that improves the sampling quality of each approach. This research is conducted using real industrial production data from wafer-level tests and silicon measurements from various FPGAs. This study introduces two hybrid strategies: Stratified with Short Distance Elimination (S-SDE) and k-means with Short Distance Elimination (K-SDE). Their performance is evaluated within the framework of Gaussian Process Regression (GPR) for predicting wafer and FPGA test data. At the core of our proposed approach is the Short Distance Elimination (SDE) algorithm, which excludes spatially proximate candidate points during sampling, thereby ensuring a more uniform distribution of training data across the physical domain. A parameter sweep was conducted over the (alpha, beta) thresholds, where alpha and beta take values in {0, 1, 2, 3, 4} and are not both zero, to identify the combination that minimizes RMSD. Experimental results on a randomly selected wafer file reveal that (alpha, beta) = (2, 2) yields the lowest RMSD. Accordingly, all subsequent experiments adopt this parameter configuration. The results demonstrate that the proposed SDE-based strategies enhance predictive accuracy: K-SDE improves upon k-means sampling by 16.26 percent (wafer) and 13.07 percent (FPGA), while S-SDE improves upon stratified sampling by 16.49 percent (wafer) and 8.84 percent (FPGA).
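A greedy single-threshold sketch of the SDE core idea; the paper's two-threshold (alpha, beta) rule is more elaborate and is not reproduced here:

    import numpy as np

    def short_distance_elimination(candidates, min_dist):
        # Keep a candidate point only if it lies at least min_dist from every
        # point already kept, yielding a more spatially uniform training set
        # for the downstream GPR model.
        kept = []
        for pt in np.asarray(candidates, dtype=float):
            if all(np.linalg.norm(pt - q) >= min_dist for q in kept):
                kept.append(pt)
        return np.array(kept)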
Authors: Hiroki Shiraishi, Hisao Ishibuchi, Masaya Nakata
Abstract: The decision-making process significantly influences the predictions of machine learning models. This is especially important in rule-based systems such as Learning Fuzzy-Classifier Systems (LFCSs) where the selection and application of rules directly determine prediction accuracy and reliability. LFCSs combine evolutionary algorithms with supervised learning to optimize fuzzy classification rules, offering enhanced interpretability and robustness. Despite these advantages, research on improving decision-making mechanisms (i.e., class inference schemes) in LFCSs remains limited. Most LFCSs use voting-based or single-winner-based inference schemes. These schemes rely on classification performance on training data and may not perform well on unseen data, risking overfitting. To address these limitations, this article introduces a novel class inference scheme for LFCSs based on the Dempster-Shafer Theory of Evidence (DS theory). The proposed scheme handles uncertainty well. By using the DS theory, the scheme calculates belief masses (i.e., measures of belief) for each specific class and the "I don't know" state from each fuzzy rule and infers a class from these belief masses. Unlike the conventional schemes, the proposed scheme also considers the "I don't know" state that reflects uncertainty, thereby improving the transparency and reliability of LFCSs. Applied to a variant of LFCS (i.e., Fuzzy-UCS), the proposed scheme demonstrates statistically significant improvements in terms of test macro F1 scores across 30 real-world datasets compared to conventional voting-based and single-winner-based fuzzy inference schemes. It forms smoother decision boundaries, provides reliable confidence measures, and enhances the robustness and generalizability of LFCSs in real-world applications. Our implementation is available at https://github.com/YNU-NakataLab/jUCS.
Authors: Minh Luu, Surya Jasper, Khoi Le, Evan Pan, Michael Quinn, Aakash Tyagi, Jiang Hu
Abstract: Failure triage in design functional verification is critical but time-intensive, relying on manual specification reviews, log inspections, and waveform analyses. While machine learning (ML) has improved areas like stimulus generation and coverage closure, its application to RTL-level simulation failure triage, particularly for large designs, remains limited. VCDiag offers an efficient, adaptable approach using VCD data to classify failing waveforms and pinpoint likely failure locations. In the largest experiment, VCDiag achieves over 94% accuracy in identifying the top three most likely modules. The framework introduces a novel signal selection and statistical compression approach, achieving over 120x reduction in raw data size while preserving features essential for classification. It can also be integrated into diverse Verilog/SystemVerilog designs and testbenches.
Authors: Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi
Abstract: The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at-scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and how correcting the eigenvalues directly can eliminate the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.
Authors: Hiroki Shiraishi, Yohei Hayamizu, Tomonori Hashiyama, Keiki Takadama, Hisao Ishibuchi, Masaya Nakata
Abstract: Rule representations significantly influence the search capabilities and decision boundaries within the search space of Learning Classifier Systems (LCSs), a family of rule-based machine learning systems that evolve interpretable models through evolutionary processes. However, it is very difficult to choose an appropriate rule representation for each problem. Additionally, some problems benefit from using different representations for different subspaces within the input space. Thus, an adaptive mechanism is needed to choose an appropriate rule representation for each rule in LCSs. This article introduces a flexible rule representation using a four-parameter beta distribution and integrates it into a fuzzy-style LCS. The four-parameter beta distribution can form various function shapes, and this flexibility enables our LCS to automatically select appropriate representations for different subspaces. Our rule representation can represent crisp/fuzzy decision boundaries in various boundary shapes, such as rectangles and bells, by controlling its four parameters, offering greater flexibility than standard representations such as trapezoidal ones. Leveraging this flexibility, our LCS is designed to adapt the appropriate rule representation for each subspace. Moreover, our LCS incorporates a generalization bias favoring crisp rules where feasible, enhancing model interpretability without compromising accuracy. Experimental results on real-world classification tasks show that our LCS achieves significantly superior test accuracy and produces more compact rule sets. Our implementation is available at https://github.com/YNU-NakataLab/Beta4-UCS. An extended abstract related to this work is available at https://doi.org/10.36227/techrxiv.174900805.59801248/v1.
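A sketch, assuming SciPy, of a membership function built from a four-parameter beta density; the peak normalization below is our assumption, not necessarily Beta4-UCS's exact formulation:

```python
import numpy as np
from scipy.stats import beta

def beta4_membership(x, a, b, p, q):
    # Rescale a beta(p, q) density on the support [a, b] to peak at 1.
    # p = q = 1 yields a crisp, rectangle-like rule; p, q > 1 yields
    # bell-shaped fuzzy boundaries.
    z = (np.asarray(x, float) - a) / (b - a)
    inside = (z >= 0.0) & (z <= 1.0)
    dens = np.where(inside, beta.pdf(np.clip(z, 0.0, 1.0), p, q), 0.0)
    if p > 1 and q > 1:
        peak = beta.pdf((p - 1) / (p + q - 2), p, q)
    else:
        peak = max(dens.max(), 1e-12)
    return dens / peak

x = np.linspace(-0.5, 1.5, 9)
print(beta4_membership(x, a=0.0, b=1.0, p=1.0, q=1.0))  # crisp interval
print(beta4_membership(x, a=0.0, b=1.0, p=3.0, q=3.0))  # fuzzy bell
```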
Authors: Jiayi Wan, Xiang Zhu, Fanzhen Liu, Wei Fan, Xiaolong Xu
Abstract: Federated learning, as a distributed architecture, shows great promise for applications in Cyber-Physical-Social Systems (CPSS). In order to mitigate the privacy risks inherent in CPSS, the integration of differential privacy with federated learning has attracted considerable attention. Existing research mainly focuses on dynamically adjusting the noise added or discarding certain gradients to mitigate the noise introduced by differential privacy. However, these approaches fail to remove the noise that hinders convergence and correct the gradients affected by the noise, which significantly reduces the accuracy of model classification. To overcome these challenges, this paper proposes a novel framework for differentially private federated learning that balances rigorous privacy guarantees with accuracy by introducing a server-side gradient correction mechanism. Specifically, after clients perform gradient clipping and noise perturbation, our framework detects deviations in the noisy local gradients and employs a projection mechanism to correct them, mitigating the negative impact of noise. Simultaneously, gradient projection promotes the alignment of gradients from different clients and guides the model towards convergence to a global optimum. We evaluate our framework on several benchmark datasets, and the experimental results demonstrate that it achieves state-of-the-art performance under the same privacy budget.
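The abstract does not spell out the projection mechanism; one plausible reading, sketched below in Python, suppresses noise components orthogonal to the clients' consensus direction (our simplification, not the paper's exact method):

```python
import numpy as np

def project_correct(noisy_grads):
    # Project each noisy client gradient onto the direction of the client
    # average, discarding components orthogonal to the consensus; this
    # both denoises the gradients and aligns them across clients.
    g_bar = noisy_grads.mean(axis=0)
    unit = g_bar / (np.linalg.norm(g_bar) + 1e-12)
    return np.array([np.dot(g, unit) * unit for g in noisy_grads])
```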
Authors: Yidi Wang, Jiawei Gu, Xiaobing Pei, Xubin Zheng, Xiao Luo, Pengyang Wang, Ziyue Qiao
Abstract: This paper studies a novel problem of out-of-distribution graph models merging, which aims to construct a generalized model from multiple graph models pre-trained on different domains with distribution discrepancy. This problem is challenging because of the difficulty in learning domain-invariant knowledge implicitly in model parameters and consolidating expertise from potentially heterogeneous GNN backbones. In this work, we propose a graph generation strategy that instantiates the mixture distribution of multiple domains. Then, we merge and fine-tune the pre-trained graph models via a MoE module and a masking mechanism for generalized adaptation. Our framework is architecture-agnostic and can operate without any source/target domain data. Both theoretical analysis and experimental results demonstrate the effectiveness of our approach in addressing the model generalization problem.
Authors: Fang Wang, Paolo Ceravolo, Ernesto Damiani
Abstract: Predictive Business Process Monitoring (PBPM) aims to forecast future outcomes of ongoing business processes. However, existing methods often lack flexibility to handle real-world challenges such as simultaneous events, class imbalance, and multi-level attributes. While prior work has explored static encoding schemes and fixed LSTM architectures, they struggle to support adaptive representations and generalize across heterogeneous datasets. To address these limitations, we propose a suite of dynamic LSTM HyperModels that integrate two-level hierarchical encoding for event and sequence attributes, character-based decomposition of event labels, and novel pseudo-embedding techniques for durations and attribute correlations. We further introduce specialized LSTM variants for simultaneous event modeling, leveraging multidimensional embeddings and time-difference flag augmentation. Experimental validation on four public and real-world datasets demonstrates up to 100% accuracy on balanced datasets and F1 scores exceeding 86% on imbalanced ones. Our approach advances PBPM by offering modular and interpretable models better suited for deployment in complex settings. Beyond PBPM, it contributes to the broader AI community by improving temporal outcome prediction, supporting data heterogeneity, and promoting explainable process intelligence frameworks.
Authors: Xiansheng Cai, Sihan Hu, Tao Wang, Yuan Huang, Pan Zhang, Youjin Deng, Kun Chen
Abstract: Fundamental physics often confronts complex symbolic problems with few guiding exemplars or established principles. While artificial intelligence (AI) offers promise, its typical need for vast datasets to learn from hinders its use in these information-scarce frontiers. We introduce learning at criticality (LaC), a reinforcement learning (RL) scheme that tunes Large Language Models (LLMs) to a sharp learning transition, addressing this information scarcity. At this transition, LLMs achieve peak generalization from minimal data, exemplified by 7-digit base-7 addition -- a test of nontrivial arithmetic reasoning. To elucidate this peak, we analyze a minimal concept-network model (CoNet) designed to capture the essence of how LLMs might link tokens. Trained on a single exemplar, this model also undergoes a sharp learning transition. This transition exhibits hallmarks of a second-order phase transition, notably power-law distributed solution path lengths. At this critical point, the system maximizes a ``critical thinking pattern" crucial for generalization, enabled by the underlying scale-free exploration. This suggests LLMs reach peak performance by operating at criticality, where such explorative dynamics enable the extraction of underlying operational rules. We demonstrate LaC in quantum field theory: an 8B-parameter LLM, tuned to its critical point by LaC using a few exemplars of symbolic Matsubara sums, solves unseen, higher-order problems, significantly outperforming far larger models. LaC thus leverages critical phenomena, a physical principle, to empower AI for complex, data-sparse challenges in fundamental physics.
Authors: Quentin Bertrand, Anne Gagneux, Mathurin Massias, R\'emi Emonet
Abstract: Modern deep generative models can now produce high-quality synthetic samples that are often indistinguishable from real training data. A growing body of research aims to understand why recent methods -- such as diffusion and flow matching techniques -- generalize so effectively. Among the proposed explanations are the inductive biases of deep learning architectures and the stochastic nature of the conditional flow matching loss. In this work, we rule out the latter -- the noisy nature of the loss -- as a primary contributor to generalization in flow matching. First, we empirically show that in high-dimensional settings, the stochastic and closed-form versions of the flow matching loss yield nearly equivalent losses. Then, using state-of-the-art flow matching models on standard image datasets, we demonstrate that both variants achieve comparable statistical performance, with the surprising observation that using the closed-form can even improve performance.
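To make the comparison concrete, here is a sketch of both loss targets for the linear path x_t = (1 - t) x0 + t x1 with a Gaussian source (our notation; the closed-form expression is the standard kernel-weighted marginal velocity over an empirical dataset, assumed rather than taken verbatim from the paper):

```python
import numpy as np

def stochastic_cfm_target(x0, x1):
    # Per-pair (noisy) conditional flow matching target: x1 - x0.
    return x1 - x0

def closed_form_velocity(x_t, t, data):
    # Marginal velocity at (x_t, t): a softmax-weighted average of the
    # per-example velocities (x1_i - x_t) / (1 - t), with weights
    # proportional to N(x_t; t * x1_i, (1 - t)^2 I).
    logw = -np.sum((x_t - t * data) ** 2, axis=1) / (2 * (1 - t) ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return (w[:, None] * (data - x_t) / (1 - t)).sum(axis=0)
```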
Authors: Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov
Abstract: Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in distributed learning. Nevertheless, it is impossible to automatically determine the effective stepsize from a theoretical standpoint, since it depends on parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, and methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
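For concreteness, the basic update the variants build on (a sketch; the stepsize schedules that are the paper's actual contribution are not reproduced):

```python
import numpy as np

def sign_sgd_step(w, grad, lr):
    # One Sign-SGD update: only the sign of each gradient coordinate is
    # used, saving memory in single-node training and compressing
    # gradients to one bit per coordinate in distributed settings.
    return w - lr * np.sign(grad)
```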
Authors: Razvan-Andrei Lascu, David \v{S}i\v{s}ka, {\L}ukasz Szpruch
Abstract: Proximal Policy Optimization (PPO) has become a widely adopted algorithm for reinforcement learning, offering a practical policy gradient method with strong empirical performance. Despite its popularity, PPO lacks formal theoretical guarantees for policy improvement and convergence. PPO is motivated by Trust Region Policy Optimization (TRPO) that utilizes a surrogate loss with a KL divergence penalty, which arises from linearizing the value function within a flat geometric space. In this paper, we derive a tighter surrogate in the Fisher-Rao (FR) geometry, yielding a novel variant, Fisher-Rao PPO (FR-PPO). Our proposed scheme provides strong theoretical guarantees, including monotonic policy improvement. Furthermore, in the tabular setting, we demonstrate that FR-PPO achieves sub-linear convergence without any dependence on the dimensionality of the action or state spaces, marking a significant step toward establishing formal convergence results for PPO-based algorithms.
Authors: Daniel Palenicek, Florian Vogt, Jan Peters
Abstract: Reinforcement learning has achieved significant milestones, but sample efficiency remains a bottleneck for real-world applications. Recently, CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1. In this work, we explore CrossQ's scaling behavior with higher UTD ratios. We identify challenges in the training dynamics which are emphasized by higher UTDs, particularly Q-bias explosion and the growing magnitude of critic network weights. To address this, we integrate weight normalization into the CrossQ framework, a solution that stabilizes training, prevents potential loss of plasticity and keeps the effective learning rate constant. Our proposed approach reliably scales with increasing UTD ratios, achieving competitive or superior performance across a range of challenging tasks on the DeepMind control benchmark, notably the complex dog and humanoid environments. This work eliminates the need for drastic interventions, such as network resets, and offers a robust pathway for improving sample efficiency and scalability in model-free reinforcement learning.
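A minimal sketch, assuming PyTorch, of a weight-normalized linear layer of the kind the abstract integrates into the critics; the exact placement of normalization and gains inside CrossQ is not specified here:

```python
import torch
import torch.nn as nn

class WNLinear(nn.Module):
    """Linear layer whose weight rows are renormalized to unit norm on
    every forward pass, with a learnable per-output gain; this caps the
    growth of critic weight magnitudes and keeps the effective learning
    rate roughly constant."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        self.bias = nn.Parameter(torch.zeros(d_out))
        self.gain = nn.Parameter(torch.ones(d_out))

    def forward(self, x):
        w = self.weight / self.weight.norm(dim=1, keepdim=True)
        return nn.functional.linear(x, self.gain[:, None] * w, self.bias)
```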
Authors: Li Zhang, Zhongxuan Han, Chaochao Chen, Xiaohua Feng, Jiaming Zhang, Yuyuan Li
Abstract: With the emerging application of Federated Learning (FL) in decision-making scenarios, it is imperative to regulate model fairness to prevent disparities across sensitive groups (e.g., female, male). Current research predominantly focuses on two concepts of group fairness within FL: Global Fairness (overall model disparity across all clients) and Local Fairness (the disparity within each client). However, the non-decomposable, non-differentiable nature of fairness criteria poses two fundamental, unresolved challenges for fair FL: (i) Harmonizing global and local fairness in multi-class classification; (ii) Enabling a controllable, optimal accuracy-fairness trade-off. To tackle the aforementioned challenges, we propose a novel controllable federated group-fairness calibration framework, named FedFACT. FedFACT identifies the Bayes-optimal classifiers under both global and local fairness constraints in the multi-class case, yielding models with minimal performance decline while guaranteeing fairness. To effectively realize an adjustable, optimal accuracy-fairness balance, we derive specific characterizations of the Bayes-optimal fair classifiers, reformulating fair FL as a personalized cost-sensitive learning problem for in-processing and a bi-level optimization problem for post-processing. Theoretically, we provide convergence and generalization guarantees for FedFACT to approach the near-optimal accuracy under given fairness levels. Extensive experiments on multiple datasets across various degrees of data heterogeneity demonstrate that FedFACT consistently outperforms baselines in balancing accuracy and global-local fairness.
Authors: Beatrix M. G. Nielsen, Emanuele Marconato, Andrea Dittadi, Luigi Gresele
Abstract: When and why representations learned by different deep neural networks are similar is an active research topic. We choose to address these questions from the perspective of identifiability theory, which suggests that a measure of representational similarity should be invariant to transformations that leave the model distribution unchanged. Focusing on a model family which includes several popular pre-training approaches, e.g., autoregressive language models, we explore when models which generate distributions that are close have similar representations. We prove that a small Kullback-Leibler divergence between the model distributions does not guarantee that the corresponding representations are similar. This has the important corollary that models arbitrarily close to maximizing the likelihood can still learn dissimilar representations, a phenomenon mirrored in our empirical observations on models trained on CIFAR-10. We then define a distributional distance for which closeness implies representational similarity, and in synthetic experiments, we find that wider networks learn distributions which are closer with respect to our distance and have more similar representations. Our results establish a link between closeness in distribution and representational similarity.
Authors: Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, Yi Ma
Abstract: Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of \textit{only} self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations \textit{at a linear rate} with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.
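A sketch of one such layer, assuming PyTorch (width and head count are illustrative):

```python
import torch.nn as nn

class DenoisingAttentionLayer(nn.Module):
    """One unrolled denoising step: multi-head self-attention plus a skip
    connection, and nothing else (no MLP block), matching the compact
    design the abstract derives."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out
```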
Authors: Andreas Athanasopoulos, Christos Dimitrakakis
Abstract: We investigate the problem of learning an equilibrium in a generalized two-sided matching market, where agents can adaptively choose their actions based on their assigned matches. Specifically, we consider a setting in which matched agents engage in a zero-sum game with initially unknown payoff matrices, and we explore whether a centralized procedure can learn an equilibrium from bandit feedback. We adopt the solution concept of matching equilibrium, where a pair consisting of a matching $\mathfrak{m}$ and a set of agent strategies $X$ forms an equilibrium if no agent has the incentive to deviate from $(\mathfrak{m}, X)$. To measure the deviation of a given pair $(\mathfrak{m}, X)$ from the equilibrium pair $(\mathfrak{m}^\star, X^\star)$, we introduce matching instability that can serve as a regret measure for the corresponding learning problem. We then propose a UCB algorithm in which agents form preferences and select actions based on optimistic estimates of the game payoffs, and prove that it achieves sublinear, instance-independent regret over a time horizon $T$.
Authors: Lili Chen, Changyang She, Jingge Zhu, Jamie Evans
Abstract: As the number of mobile devices continues to grow, interference has become a major bottleneck in improving data rates in wireless networks. Efficient joint channel and power allocation (JCPA) is crucial for managing interference. In this paper, we first propose an enhanced WMMSE (eWMMSE) algorithm to solve the JCPA problem in multi-channel wireless networks. To reduce the computational complexity of iterative optimization, we further introduce JCPGNN-M, a graph neural network-based solution that enables simultaneous multi-channel allocation for each user. We reformulate the problem as a Lagrangian function, which allows us to enforce the total power constraints systematically. Our solution combines this Lagrangian framework with GNNs, iteratively updating the Lagrange multipliers and the resource allocation scheme. Unlike existing GNN-based methods that limit each user to a single channel, JCPGNN-M supports efficient spectrum reuse and scales well in dense network scenarios. Simulation results show that JCPGNN-M achieves higher data rates than eWMMSE. Meanwhile, the inference time of JCPGNN-M is much lower than that of eWMMSE, and it generalizes well to larger networks.
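A stand-in sketch of the Lagrangian loop described, in Python; in JCPGNN-M the primal update is produced by a GNN forward pass, whereas here it is a plain gradient step, and rate_grad_fn is any stand-in gradient of the sum rate with respect to the power vector:

```python
import numpy as np

def jcpa_dual_ascent(rate_grad_fn, p_init, p_max, eta=0.05, steps=200):
    # Alternate a primal step on the power vector with dual ascent on the
    # multiplier lam enforcing the total power budget sum(p) <= p_max.
    p, lam = np.asarray(p_init, float).copy(), 0.0
    for _ in range(steps):
        p = np.clip(p + eta * (rate_grad_fn(p) - lam), 0.0, None)
        lam = max(0.0, lam + eta * (p.sum() - p_max))
    return p, lam
```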
Authors: Julius Gonsior, Tim Rie{\ss}, Anja Reusch, Claudio Hartmann, Maik Thiele, Wolfgang Lehner
Abstract: Annotating data is a time-consuming and costly task, but it is inherently required for supervised machine learning. Active Learning (AL) is an established method that minimizes human labeling effort by iteratively selecting the most informative unlabeled samples for expert annotation, thereby improving the overall classification performance. Even though AL has been known for decades, it is still rarely used in real-world applications. As indicated by two web surveys on AL conducted within the NLP community, two main reasons continue to hold practitioners back from using AL: first, the complexity of setting it up, and second, a lack of trust in its effectiveness. We hypothesize that both reasons share the same culprit: the large hyperparameter space of AL. This mostly unexplored hyperparameter space often leads to misleading and irreproducible AL experiment results. In this study, we first compiled a large hyperparameter grid of over 4.6 million hyperparameter combinations, second, recorded the performance of all combinations in the so-far biggest conducted AL study, and third, analyzed the impact of each hyperparameter in the experiment results. In the end, we give recommendations about the influence of each hyperparameter, demonstrate the surprising influence of the concrete AL strategy implementation, and outline an experimental study design for reproducible AL experiments with minimal computational effort, thus contributing to more reproducible and trustworthy AL research in the future.
Authors: Jianyuan Yin, Qianxiao Li
Abstract: We consider learning a predictive model to be subsequently used for a given downstream task (described by an algorithm) that requires access to the model evaluation. This task need not be prediction, and this situation is frequently encountered in machine-learning-augmented scientific computing. We show that this setting differs from classical supervised learning, and in general it cannot be solved by minimizing the mean square error of the model predictions as is frequently performed in the literature. Instead, we find that the maximum prediction error on the support of the downstream task algorithm can serve as an effective estimate for the subsequent task performance. With this insight, we formulate a task-specific supervised learning problem based on the given sampling measure, whose solution serves as a reliable surrogate model for the downstream task. Then, we discretize the empirical risk based on training data, and develop an iterative algorithm to solve the task-specific supervised learning problem. Three illustrative numerical examples on trajectory prediction, optimal control and minimum energy path computation demonstrate the effectiveness of the approach.
Authors: Tobias Pielok, Bernd Bischl, David R\"ugamer
Abstract: Recent years have witnessed growing interest in semi-implicit variational inference (SIVI) methods due to their ability to rapidly generate samples from complex distributions. However, since the likelihood of these samples is non-trivial to estimate in high dimensions, current research focuses on finding effective SIVI training routines. Although unbiased implicit variational inference (UIVI) has largely been dismissed as imprecise and computationally prohibitive because of its inner MCMC loop, we revisit this method and show that UIVI's MCMC loop can be effectively replaced via importance sampling and the optimal proposal distribution can be learned stably by minimizing an expected forward Kullback-Leibler divergence without bias. Our refined approach demonstrates superior performance or parity with state-of-the-art methods on established SIVI benchmarks.
Authors: Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong
Abstract: Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representation on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which estimates data vulnerability, partitions data into "vulnerable" and "invulnerable" groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the currently underperforming group and then applies group-dependent adversarial perturbations to the data during training, aiming to encourage a balanced learning process across groups. Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines.
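A sketch, in Python, of the group-selection step in a Group DRO loop; VAA's learned adversarial sampler and group-dependent perturbations are more involved than these exponentiated-loss weights:

```python
import numpy as np

def sample_group(group_losses, eta=1.0, rng=None):
    # Pick the "vulnerable" or "invulnerable" group with probability
    # increasing in its current loss, so training attends to whichever
    # group is underperforming.
    rng = np.random.default_rng(rng)
    w = np.exp(eta * np.asarray(group_losses, float))
    return rng.choice(len(group_losses), p=w / w.sum())
```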
Authors: Mingxuan Xia, Haobo Wang, Yixuan Li, Zewei Yu, Jindong Wang, Junbo Zhao, Runze Wu
Abstract: Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy by prompting the LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising the data quality for downstream applications. Motivated by ambiguity aversion in human behaviors, we propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when incurring uncertainty. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework CanDist that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.
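One common way to distill a candidate label set into a student, sketched in PyTorch under our assumptions (CanDist's exact objective may differ): maximize the student's total probability mass on the teacher's candidates, as in partial-label learning:

```python
import torch.nn.functional as F

def candidate_distill_loss(student_logits, candidate_mask):
    # candidate_mask: [batch, classes], 1 where the teacher LLM listed
    # the class as a candidate. Minimize the negative log of the
    # student's probability mass on the candidate set.
    probs = F.softmax(student_logits, dim=-1)
    cand_prob = (probs * candidate_mask).sum(dim=-1)
    return -cand_prob.clamp_min(1e-12).log().mean()
```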
Authors: Mohd. Farhan Israk Soumik, Syed Mhamudul Hasan, Abdur R. Shahid
Abstract: The misuse of Large Language Models (LLMs) to infer emotions from text for malicious purposes, known as emotion inference attacks, poses a significant threat to user privacy. In this paper, we investigate the potential of Apple Intelligence's writing tools, integrated across iPhone, iPad, and MacBook, to mitigate these risks through text modifications such as rewriting and tone adjustment. By developing novel datasets specifically for this purpose, we empirically assess how different text modifications influence LLM-based detection. This capability suggests strong potential for Apple Intelligence's writing tools as privacy-preserving mechanisms. Our findings lay the groundwork for future adaptive rewriting systems capable of dynamically neutralizing sensitive emotional content to enhance user privacy. To the best of our knowledge, this research provides the first empirical analysis of Apple Intelligence's text-modification tools within a privacy-preservation context, with the broader goal of developing on-device, user-centric privacy-preserving mechanisms to protect against LLM-based advanced inference attacks on deployed systems.
Authors: Pau Vilimelis Aceituno, Jack William Miller, Noah Marti, Youssef Farag, Victor Boussange
Abstract: When training autoregressive models for dynamical systems, a critical question arises: how far into the future should the model be trained to predict? Too short a horizon may miss long-term trends, while too long a horizon can impede convergence due to accumulating prediction errors. In this work, we formalize this trade-off by analyzing how the geometry of the loss landscape depends on the training horizon. We prove that for chaotic systems, the loss landscape's roughness grows exponentially with the training horizon, while for limit cycles, it grows linearly, making long-horizon training inherently challenging. However, we also show that models trained on long horizons generalize well to short-term forecasts, whereas those trained on short horizons suffer exponentially (resp. linearly) worse long-term predictions in chaotic (resp. periodic) systems. We validate our theory through numerical experiments and discuss practical implications for selecting training horizons. Our results provide a principled foundation for hyperparameter optimization in autoregressive forecasting models.
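The training objective whose landscape the analysis concerns, sketched below (names are ours; works on NumPy or PyTorch tensors alike):

```python
def rollout_loss(model, x0, targets, horizon):
    # Roll the autoregressive model forward `horizon` steps from x0 and
    # accumulate squared error against the ground-truth trajectory; the
    # abstract's trade-off is over the choice of `horizon`.
    x, loss = x0, 0.0
    for h in range(horizon):
        x = model(x)
        loss = loss + ((x - targets[h]) ** 2).mean()
    return loss / horizon
```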
Authors: Pierre-Fran\c{c}ois Massiani, Christian Fiedler, Lukas Haverbeck, Friedrich Solowjow, Sebastian Trimpe
Abstract: We propose a framework for hypothesis testing on conditional probability distributions, which we then use to construct conditional two-sample statistical tests. These tests identify the inputs -- called covariates in this context -- where two conditional expectations differ with high probability. Our key idea is to transform confidence bounds of a learning method into a conditional two-sample test, and we instantiate this principle for kernel ridge regression (KRR) and conditional kernel mean embeddings. We generalize existing pointwise-in-time or time-uniform confidence bounds for KRR to previously-inaccessible yet essential cases such as infinite-dimensional outputs with non-trace-class kernels. These bounds enable circumventing the need for independent data in our statistical tests, since they allow online sampling. We also introduce bootstrapping schemes leveraging the parametric form of testing thresholds identified in theory to avoid tuning inaccessible parameters, making our method readily applicable in practice. Such conditional two-sample tests are especially relevant in applications where data arrive sequentially or non-independently, or when output distributions vary with operational parameters. We demonstrate their utility through examples in process monitoring and comparison of dynamical systems. Overall, our results establish a comprehensive foundation for conditional two-sample testing, from theoretical guarantees to practical implementation, and advance the state-of-the-art on the concentration of vector-valued least squares estimation.
Authors: Shyam Prabhu, P Akshay Kumar, Antov Selwinston, Pavan Taduvai, Shreya Bairi, Rohit Batra
Abstract: Materials design problems often require optimizing multiple variables, rendering full factorial exploration impractical. Design of experiment (DOE) methods, such as the Taguchi technique, are commonly used to efficiently sample the design space, but they inherently lack the ability to capture non-linear dependencies among process variables. In this work, we demonstrate how machine learning (ML) methods can be used to overcome these limitations. We compare the performance of the Taguchi method against an active learning based Gaussian process regression (GPR) model in a wire arc additive manufacturing (WAAM) process to accurately predict aspects of bead geometry, including penetration depth, bead width, and height. While the Taguchi method utilized a three-factor, five-level L25 orthogonal array to suggest weld parameters, the GPR model used an uncertainty-based exploration acquisition function coupled with Latin hypercube sampling for initial training data. Accuracy and efficiency of both models were evaluated on 15 test cases, with GPR outperforming Taguchi in both metrics. This work applies to the broader materials processing domain requiring efficient exploration of complex parameter spaces.
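A minimal sketch, assuming scikit-learn, of the uncertainty-based acquisition step described:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def next_weld_parameters(gpr: GaussianProcessRegressor, candidates):
    # Query the candidate WAAM process setting where the fitted GP is
    # least certain, i.e., has the largest predictive standard deviation.
    _, std = gpr.predict(candidates, return_std=True)
    return candidates[np.argmax(std)]
```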
Authors: Chamsi Hssaine, Yichun Hu, Ciara Pike-Burke
Abstract: Points-based rewards programs are a prevalent way to incentivize customer loyalty; in these programs, customers who make repeated purchases from a seller accumulate points, working toward eventual redemption of a free reward. These programs have recently come under scrutiny due to accusations of unfair practices in their implementation. Motivated by these concerns, we study the problem of fairly designing points-based rewards programs, with a focus on two obstacles that put fairness at odds with their effectiveness. First, due to customer heterogeneity, the seller should set different redemption thresholds for different customers to generate high revenue. Second, the relationship between customer behavior and the number of accumulated points is typically unknown; this requires experimentation which may unfairly devalue customers' previously earned points. We first show that an individually fair rewards program that uses the same redemption threshold for all customers suffers a loss in revenue of at most a factor of $1+\ln 2$, compared to the optimal personalized strategy that differentiates between customers. We then tackle the problem of designing temporally fair learning algorithms in the presence of demand uncertainty. Toward this goal, we design a learning algorithm that limits the risk of point devaluation due to experimentation by only changing the redemption threshold $O(\log T)$ times, over a horizon of length $T$. This algorithm achieves the optimal (up to polylogarithmic factors) $\widetilde{O}(\sqrt{T})$ regret in expectation. We then modify this algorithm to only ever decrease redemption thresholds, leading to improved fairness at a cost of only a constant factor in regret. Extensive numerical experiments show the limited value of personalization in average-case settings, in addition to demonstrating the strong practical performance of our proposed learning algorithms.
Authors: Eduardo Santos Escriche, Stefanie Jegelka
Abstract: Recently, a trend has emerged that favors learning relevant symmetries from data in geometric domains instead of designing constrained architectures. To do so, two popular options are (1) to modify the training protocol, e.g., with a specific loss and data augmentations (soft equivariance), or (2) to ignore equivariance and infer it only implicitly. However, both options have limitations: soft equivariance requires a priori knowledge about relevant symmetries, while inferring symmetries merely via the task and larger data lacks interpretability. To address both limitations, we propose SEMoLA, an end-to-end approach that jointly (1) discovers a priori unknown symmetries in the data via learnable data augmentations, and (2) softly encodes the respective approximate equivariance into an arbitrary unconstrained model. Hence, it does not need prior knowledge about symmetries, it offers interpretability, and it maintains robustness to distribution shifts. Empirically, we demonstrate the ability of SEMoLA to robustly discover relevant symmetries while achieving high prediction accuracy across various datasets, encompassing multiple data modalities and underlying symmetry groups.
Authors: Lorenz Kummer, Samir Moustafa, Anatol Ehrlich, Franka Bause, Nikolaus Suess, Wilfried N. Gansterer, Nils M. Kriege
Abstract: The lottery ticket hypothesis (LTH) is well-studied for convolutional neural networks but has been validated only empirically for graph neural networks (GNNs), for which theoretical findings are largely lacking. In this paper, we identify the expressivity of sparse subnetworks, i.e., their ability to distinguish non-isomorphic graphs, as crucial for finding winning tickets that preserve the predictive performance. We establish conditions under which the expressivity of a sparsely initialized GNN matches that of the full network, particularly when compared to the Weisfeiler-Leman test, and in that context put forward and prove a Strong Expressive Lottery Ticket Hypothesis. We subsequently show that an increased expressivity in the initialization potentially accelerates model convergence and improves generalization. Our findings establish novel theoretical foundations for both LTH and GNN research, highlighting the importance of maintaining expressivity in sparsely initialized GNNs. We illustrate our results using examples from drug discovery.
Authors: Yotam Alexander, Yonatan Slutzky, Yuval Ran-Milo, Nadav Cohen
Abstract: Conventional wisdom attributes the mysterious generalization abilities of overparameterized neural networks to gradient descent (and its variants). The recent volume hypothesis challenges this view: it posits that these generalization abilities persist even when gradient descent is replaced by Guess & Check (G&C), i.e., by drawing weight settings until one that fits the training data is found. The validity of the volume hypothesis for wide and deep neural networks remains an open question. In this paper, we theoretically investigate this question for matrix factorization (with linear and non-linear activation)--a common testbed in neural network theory. We first prove that generalization under G&C deteriorates with increasing width, establishing what is, to our knowledge, the first case where G&C is provably inferior to gradient descent. Conversely, we prove that generalization under G&C improves with increasing depth, revealing a stark contrast between wide and deep networks, which we further validate empirically. These findings suggest that even in simple settings, there may not be a simple answer to the question of whether neural networks need gradient descent to generalize well.
Authors: C\'edric L\'eonard (Technical University of Munich, Munich, Germany, Remote Sensing Technology Institute), Dirk Stober (Technical University of Munich, Munich, Germany), Martin Schulz (Technical University of Munich, Munich, Germany)
Abstract: New UAV technologies and the NewSpace era are transforming Earth Observation missions and data acquisition. Numerous small platforms generate large data volume, straining bandwidth and requiring onboard decision-making to transmit high-quality information in time. While Machine Learning allows real-time autonomous processing, FPGAs balance performance with adaptability to mission-specific requirements, enabling onboard deployment. This review systematically analyzes 66 experiments deploying ML models on FPGAs for Remote Sensing applications. We introduce two distinct taxonomies to capture both efficient model architectures and FPGA implementation strategies. For transparency and reproducibility, we follow PRISMA 2020 guidelines and share all data and code at https://github.com/CedricLeon/Survey_RS-ML-FPGA.
Authors: Shiyi Yang, Can Chen, Didong Li
Abstract: Networks with higher-order interactions, prevalent in biological, social, and information systems, are naturally represented as hypergraphs, yet their structural complexity poses fundamental challenges for geometric characterization. While curvature-based methods offer powerful insights in graph analysis, existing extensions to hypergraphs suffer from critical trade-offs: combinatorial approaches such as Forman-Ricci curvature capture only coarse features, whereas geometric methods like Ollivier-Ricci curvature offer richer expressivity but demand costly optimal transport computations. To address these challenges, we introduce hypergraph lower Ricci curvature (HLRC), a novel curvature metric defined in closed form that achieves a principled balance between interpretability and efficiency. Evaluated across diverse synthetic and real-world hypergraph datasets, HLRC consistently reveals meaningful higher-order organization, distinguishing intra- from inter-community hyperedges, uncovering latent semantic labels, tracking temporal dynamics, and supporting robust clustering of hypergraphs based on global structure. By unifying geometric sensitivity with algorithmic simplicity, HLRC provides a versatile foundation for hypergraph analytics, with broad implications for tasks including node classification, anomaly detection, and generative modeling in complex systems.
Authors: Aojun Lu, Hangjie Yuan, Tao Feng, Yanan Sun
Abstract: The quest for Continual Learning (CL) seeks to empower neural networks with the ability to learn and adapt incrementally. Central to this pursuit is addressing the stability-plasticity dilemma, which involves striking a balance between two conflicting objectives: preserving previously learned knowledge and acquiring new knowledge. While numerous CL methods aim to achieve this trade-off, they often overlook the impact of network architecture on stability and plasticity, restricting the trade-off to the parameter level. In this paper, we delve into the conflict between stability and plasticity at the architectural level. We reveal that under an equal parameter constraint, deeper networks exhibit better plasticity, while wider networks are characterized by superior stability. To address this architectural-level dilemma, we introduce a novel framework denoted Dual-Arch, which serves as a plug-in component for CL. This framework leverages the complementary strengths of two distinct and independent networks: one dedicated to plasticity and the other to stability. Each network is designed with a specialized and lightweight architecture, tailored to its respective objective. Extensive experiments demonstrate that Dual-Arch enhances the performance of existing CL methods while being up to 87% more compact in terms of parameters. Code: https://github.com/byyx666/Dual-Arch.
Authors: Jianqing Zhang, Xinghao Wu, Yanbing Zhou, Xiaoting Sun, Qiqi Cai, Yang Liu, Yang Hua, Zhenzhe Zheng, Jian Cao, Qiang Yang
Abstract: As AI evolves, collaboration among heterogeneous models helps overcome data scarcity by enabling knowledge transfer across institutions and devices. Traditional Federated Learning (FL) only supports homogeneous models, limiting collaboration among clients with heterogeneous model architectures. To address this, Heterogeneous Federated Learning (HtFL) methods are developed to enable collaboration across diverse heterogeneous models while tackling the data heterogeneity issue at the same time. However, a comprehensive benchmark for standardized evaluation and analysis of the rapidly growing HtFL methods is lacking. Firstly, the highly varied datasets, model heterogeneity scenarios, and different method implementations become hurdles to making easy and fair comparisons among HtFL methods. Secondly, the effectiveness and robustness of HtFL methods are under-explored in various scenarios, such as the medical domain and sensor signal modality. To fill this gap, we introduce the first Heterogeneous Federated Learning Library (HtFLlib), an easy-to-use and extensible framework that integrates multiple datasets and model heterogeneity scenarios, offering a robust benchmark for research and practical applications. Specifically, HtFLlib integrates (1) 12 datasets spanning various domains, modalities, and data heterogeneity scenarios; (2) 40 model architectures, ranging from small to large, across three modalities; (3) a modularized and easy-to-extend HtFL codebase with implementations of 10 representative HtFL methods; and (4) systematic evaluations in terms of accuracy, convergence, computation costs, and communication costs. We emphasize the advantages and potential of state-of-the-art HtFL methods and hope that HtFLlib will catalyze advancing HtFL research and enable its broader applications. The code is released at https://github.com/TsingZ0/HtFLlib.
Authors: Aojun Lu, Tao Feng, Hangjie Yuan, Chunhui Ding, Yanan Sun
Abstract: Continual Learning (CL) seeks to enable neural networks to incrementally acquire new knowledge (plasticity) while retaining existing knowledge (stability). While pre-trained models (PTMs) have become pivotal in CL, prevailing approaches freeze the PTM backbone to preserve stability, limiting their plasticity, particularly when encountering significant domain gaps in incremental tasks. Conversely, sequentially finetuning the entire PTM risks catastrophic forgetting of generalizable knowledge, exposing a critical stability-plasticity trade-off. To address this challenge, we propose Adapting PTMs before the core CL process (ACL), a novel framework that refines the PTM backbone through a plug-and-play adaptation phase before learning each new task with existing CL approaches (e.g., prompt tuning). ACL enhances plasticity by aligning embeddings with their original class prototypes while distancing them from others, theoretically and empirically shown to balance stability and plasticity. Extensive experiments demonstrate that ACL significantly improves CL performance across benchmarks and integrated methods, offering a versatile solution for PTM-based CL. Code is available at https://github.com/byyx666/ACL_code.
Authors: HyunGi Kim, Jisoo Mok, Dongjun Lee, Jaihyun Lew, Sungjae Kim, Sungroh Yoon
Abstract: Utilizing the complex inter-variable causal relationships within multivariate time-series provides a promising avenue toward more robust and reliable multivariate time-series anomaly detection (MTSAD) but remains an underexplored area of research. This paper proposes Causality-Aware contrastive learning for RObust multivariate Time-Series (CAROTS), a novel MTSAD pipeline that incorporates the notion of causality into contrastive learning. CAROTS employs two data augmentors to obtain causality-preserving and -disturbing samples that serve as a wide range of normal variations and synthetic anomalies, respectively. With causality-preserving and -disturbing samples as positives and negatives, CAROTS performs contrastive learning to train an encoder whose latent space separates normal and abnormal samples based on causality. Moreover, CAROTS introduces a similarity-filtered one-class contrastive loss that encourages the contrastive learning process to gradually incorporate more semantically diverse samples with common causal relationships. Extensive experiments on five real-world and two synthetic datasets validate that the integration of causal relationships endows CAROTS with improved MTSAD capabilities. The code is available at https://github.com/kimanki/CAROTS.
Authors: Haoxuan Chen, Yinuo Ren, Martin Renqiang Min, Lexing Ying, Zachary Izzo
Abstract: Diffusion models (DMs) have proven to be effective in modeling high-dimensional distributions, leading to their widespread adoption for representing complex priors in Bayesian inverse problems (BIPs). However, current DM-based posterior sampling methods proposed for solving common BIPs rely on heuristic approximations to the generative process. To exploit the generative capability of DMs and avoid the usage of such approximations, we propose an ensemble-based algorithm that performs posterior sampling without the use of heuristic approximations. Our algorithm is motivated by existing works that combine DM-based methods with the sequential Monte Carlo (SMC) method. By examining how the prior evolves through the diffusion process encoded by the pre-trained score function, we derive a modified partial differential equation (PDE) governing the evolution of the corresponding posterior distribution. This PDE includes a modified diffusion term and a reweighting term, which can be simulated via stochastic weighted particle methods. Theoretically, we prove that the error between our sampled distribution and the true posterior can be bounded in terms of the training error of the pre-trained score function and the number of particles in the ensemble. Empirically, we validate our algorithm on several inverse problems in imaging to show that our method gives more accurate reconstructions compared to existing DM-based methods.
Authors: Lianfeng Shi, Ao Li, Benjamin Ward-Cherrier
Abstract: Spiking Neural Networks (SNNs) have emerged as a new generation of energy-efficient neural networks suitable for implementation on neuromorphic hardware. As neuromorphic hardware has limited memory and computing resources, weight pruning and quantization have recently been explored to improve SNNs' efficiency. State-of-the-art SNN pruning/quantization methods employ multiple compression and training iterations, increasing the cost for pre-trained or very large SNNs. In this paper, we propose a new one-shot post-training pruning/quantization framework, Optimal Spiking Brain Compression (OSBC), that adapts the Optimal Brain Compression (OBC) method of [Frantar, Singh, and Alistarh, 2023] for SNNs. Rather than minimizing the loss on neuron input current as OBC does, OSBC achieves more efficient and accurate SNN compression in one pass by minimizing the loss on spiking neuron membrane potential with a small sample dataset. Our experiments on neuromorphic datasets (N-MNIST, CIFAR10-DVS, DVS128-Gesture) demonstrate that OSBC can achieve 97% sparsity through pruning with 1.41%, 10.20%, and 1.74% accuracy loss, or 4-bit symmetric quantization with 0.17%, 1.54%, and 7.71% accuracy loss, respectively. Code will be available on GitHub.
Authors: Han Ji, Yuqi Feng, Jiahao Fan, Yanan Sun
Abstract: Performance predictors have emerged as a promising method to accelerate the evaluation stage of neural architecture search (NAS). These predictors estimate the performance of unseen architectures by learning from the correlation between a small set of trained architectures and their performance. However, most existing predictors ignore the inherent distribution shift between limited training samples and diverse test samples. Hence, they tend to learn spurious correlations as shortcuts to predictions, leading to poor generalization. To address this, we propose a Causality-guided Architecture Representation Learning (CARL) method aiming to separate critical (causal) and redundant (non-causal) features of architectures for generalizable architecture performance prediction. Specifically, we employ a substructure extractor to split the input architecture into critical and redundant substructures in the latent space. Then, we generate multiple interventional samples by pairing critical representations with diverse redundant representations to prioritize critical features. Extensive experiments on five NAS search spaces demonstrate the state-of-the-art accuracy and superior interpretability of CARL. For instance, CARL achieves 97.67% top-1 accuracy on CIFAR-10 using DARTS.
Authors: Cl\'ement B\'enesse, Patrick Mesana, Ath\'ena\"is Gautier, S\'ebastien Gambs
Abstract: In machine learning, knowing the impact of a given datum on model training is a fundamental task referred to as Data Valuation. Building on previous works from the literature, we have designed a novel canonical decomposition allowing practitioners to analyze any data valuation method as the combination of two parts: a utility function that captures characteristics from a given model and an aggregation procedure that merges such information. We also propose to use Gaussian Processes as a means to easily access the utility function on ``sub-models'', which are models trained on a subset of the training set. The strength of our approach stems from both its theoretical grounding in Bayesian theory, and its practical reach, by enabling fast estimation of valuations thanks to efficient update formulae.
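To make the two-part decomposition concrete, here is a sketch (our construction) of a standard aggregation procedure, Monte Carlo Shapley, driven by an arbitrary utility function; in the paper, the expensive sub-model utility would be emulated cheaply by a Gaussian Process:

```python
import numpy as np

def mc_shapley(utility, n_points, n_perms=200, rng=None):
    # Monte Carlo Shapley values: average each point's marginal utility
    # contribution over random orderings of the training set.
    rng = np.random.default_rng(rng)
    values = np.zeros(n_points)
    for _ in range(n_perms):
        subset, prev = set(), utility(set())
        for i in rng.permutation(n_points):
            subset.add(int(i))
            cur = utility(subset)
            values[i] += cur - prev
            prev = cur
    return values / n_perms

# Toy usage with a synthetic utility defined on index sets.
vals = mc_shapley(lambda s: len(s) ** 0.5, n_points=5, rng=0)
```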
Authors: Alexander Semenenko, Ivan Butakov, Alexey Frolov, Ivan Oseledets
Abstract: Sliced Mutual Information (SMI) is widely used as a scalable alternative to mutual information for measuring non-linear statistical dependence. Despite its advantages, such as faster convergence, robustness to high dimensionality, and nullification only under statistical independence, we demonstrate that SMI is highly susceptible to data manipulation and exhibits counterintuitive behavior. Through extensive benchmarking and theoretical analysis, we show that SMI saturates easily, fails to detect increases in statistical dependence (even under linear transformations designed to enhance the extraction of information), prioritizes redundancy over informative content, and in some cases, performs worse than simpler dependence measures like the correlation coefficient.
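For reference, a crude SMI estimator sketched under a Gaussian approximation for the one-dimensional mutual information (real SMI studies use sharper 1-D estimators; this is only illustrative):

```python
import numpy as np

def sliced_mi(x, y, n_slices=500, rng=None):
    # Average, over random unit projections (theta, phi), the Gaussian
    # approximation MI = -0.5 * log(1 - rho^2) between theta^T x and
    # phi^T y; saturation as |rho| -> 1 hints at the failure modes above.
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_slices):
        theta = rng.normal(size=x.shape[1]); theta /= np.linalg.norm(theta)
        phi = rng.normal(size=y.shape[1]); phi /= np.linalg.norm(phi)
        rho = np.corrcoef(x @ theta, y @ phi)[0, 1]
        total += -0.5 * np.log(max(1.0 - rho ** 2, 1e-12))
    return total / n_slices
```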
Authors: Luiz Manella Pereira, M. Hadi Amini
Abstract: Federated learning (FL) is a subfield of machine learning that avoids sharing local data with a central server, which can enhance privacy and scalability. The inability to consolidate data leads to a unique problem called dataset imbalance, where agents in a network do not have equal representation of the labels one is trying to learn to predict. In FL, fusing locally-trained models with unbalanced datasets may deteriorate the performance of global model aggregation, and reduce the quality of updated local models and the accuracy of the distributed agents' decisions. In this work, we introduce an Optimal Transport-based preprocessing algorithm that aligns the datasets by minimizing the distributional discrepancy of data along the edge devices. We accomplish this by leveraging Wasserstein barycenters when computing channel-wise averages. These barycenters are collected in a trusted central server where they collectively generate a target RGB space. By projecting our dataset towards this target space, we minimize the distributional discrepancy on a global level, which facilitates the learning process due to a minimization of variance across the samples. We demonstrate the capabilities of the proposed approach over the CIFAR-10 dataset, where we show its capability of reaching higher degrees of generalization in fewer communication rounds.
Authors: Jun-Peng Jiang, Yu Xia, Hai-Long Sun, Shiyin Lu, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Abstract: Tabular reasoning involves multi-step information extraction and logical inference over tabular data. While recent advances have leveraged large language models (LLMs) for reasoning over structured tables, such high-quality textual representations are often unavailable in real-world settings, where tables typically appear as images. In this paper, we tackle the task of tabular reasoning from table images, leveraging privileged structured information available during training to enhance multimodal large language models (MLLMs). The key challenges lie in the complexity of accurately aligning structured information with visual representations, and in effectively transferring structured reasoning skills to MLLMs despite the input modality gap. To address these, we introduce TabUlar Reasoning with Bridged infOrmation ({\sc Turbo}), a new framework for multimodal tabular reasoning with privileged structured tables. {\sc Turbo} benefits from a structure-aware reasoning trace generator based on DeepSeek-R1, contributing to high-quality modality-bridged data. On this basis, {\sc Turbo} repeatedly generates and selects the advantageous reasoning paths, further enhancing the model's tabular reasoning ability. Experimental results demonstrate that, with limited ($9$k) data, {\sc Turbo} achieves state-of-the-art performance ($+7.2\%$ vs. previous SOTA) across multiple datasets.
Authors: Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov
Abstract: As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed. However, it is difficult to compare them because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), the fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.
Authors: Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis
Abstract: We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $\pi_S(y\mid x)$. We provably approximate the optimal tilted policy $\pi_{\beta,B}(y\mid x) \propto \pi_B(y\mid x)\exp(\beta\,r(x,y))$ of soft best-of-$n$ under the primary model $\pi_B$. We derive a theoretical bound on the KL divergence between our induced distribution and the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math), our method achieves higher accuracy than standard soft best-of-$n$ with $\pi_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $\pi_B$. The code is available at https://github.com/j-geuter/GSI .
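The soft best-of-$n$ component can be sketched in a few lines: draw $n$ draft samples from the small model, then select one with probability proportional to $\exp(\beta\,r(x,y))$. The full GSI method additionally verifies the drafts speculatively against the primary model $\pi_B$, which is omitted here; `sample_fn` and `reward_fn` are placeholders for $\pi_S$ and $r$.

```python
import math, random

def soft_best_of_n(prompt, sample_fn, reward_fn, n=8, beta=2.0):
    """Soft best-of-n over draft samples: draw n candidates, then pick
    one with probability proportional to exp(beta * reward)."""
    ys = [sample_fn(prompt) for _ in range(n)]
    rs = [beta * reward_fn(prompt, y) for y in ys]
    m = max(rs)  # subtract the max for numerical stability
    ws = [math.exp(r - m) for r in rs]
    return random.choices(ys, weights=ws)[0]
```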
Authors: Yujun Kim, Jaeyoung Cha, Chulhee Yun
Abstract: Recent theoretical results demonstrate that the convergence rates of permutation-based SGD (e.g., random reshuffling SGD) are faster than uniform-sampling SGD; however, these studies focus mainly on the large epoch regime, where the number of epochs $K$ exceeds the condition number $\kappa$. In contrast, little is known when $K$ is smaller than $\kappa$, and it is still a challenging open question whether permutation-based SGD can converge faster in this small epoch regime (Safran and Shamir, 2021). As a step toward understanding this gap, we study the naive deterministic variant, Incremental Gradient Descent (IGD), on smooth and strongly convex functions. Our lower bounds reveal that for the small epoch regime, IGD can exhibit surprisingly slow convergence even when all component functions are strongly convex. Furthermore, when some component functions are allowed to be nonconvex, we prove that the optimality gap of IGD can be significantly worse throughout the small epoch regime. Our analyses reveal that the convergence properties of permutation-based SGD in the small epoch regime may vary drastically depending on the assumptions on component functions. Lastly, we supplement the paper with tight upper and lower bounds for IGD in the large epoch regime.
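For readers unfamiliar with the setup, IGD minimizes a finite-sum objective by cycling through the components in a fixed order (a standard formulation, not this paper's exact notation):

\[
F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad
x^{k}_{i} = x^{k}_{i-1} - \eta\, \nabla f_{i}\big(x^{k}_{i-1}\big), \quad i = 1, \dots, n,
\]

with $x^{k+1}_{0} = x^{k}_{n}$ after each of the $K$ epochs; random reshuffling SGD instead draws a fresh permutation of the component order every epoch.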
Authors: Yashas Samaga, Varun Yerram, Spandana Raj Babbula, Prateek Jain, Praneeth Netrapalli
Abstract: We consider the Top-$K$ selection problem, which aims to identify the largest-$K$ elements from an array. Top-$K$ selection arises in many machine learning algorithms and often becomes a bottleneck on accelerators, which are optimized for dense matrix multiplications. To address this problem, \citet{chern2022tpuknnknearestneighbor} proposed a fast two-stage \textit{approximate} Top-$K$ algorithm: (i) partition the input array and select the top-$1$ element from each partition, (ii) sort this \textit{smaller subset} and return the top $K$ elements. In this paper, we consider a generalized version of this algorithm, where the first stage selects top-$K'$ elements, for some $1 \leq K' \leq K$, from each partition. Our contributions are as follows: (i) we derive an expression for the expected recall of this generalized algorithm and show that choosing $K' > 1$ with fewer partitions in the first stage reduces the input size to the second stage more effectively while maintaining the same expected recall as the original algorithm, (ii) we derive a bound on the expected recall for the original algorithm in \citet{chern2022tpuknnknearestneighbor} that is provably tighter by a factor of $2$ than the one in that paper, and (iii) we implement our algorithm on Cloud TPUv5e and achieve around an order of magnitude speedups over the original algorithm without sacrificing recall on real-world tasks.
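A minimal sketch of the generalized two-stage algorithm follows; the padding and argpartition-based per-partition selection are implementation choices for clarity, not the paper's TPU kernel.

```python
import numpy as np

def approx_topk(x, k, k_prime, n_partitions):
    """Two-stage approximate Top-K: take the top-k' of each partition,
    then an exact top-k over the surviving candidates."""
    n = len(x)
    pad = (-n) % n_partitions
    x = np.concatenate([x, np.full(pad, -np.inf)])  # pad to equal partitions
    parts = x.reshape(n_partitions, -1)
    # stage 1: top-k' per partition (argpartition is linear per row)
    idx = np.argpartition(parts, -k_prime, axis=1)[:, -k_prime:]
    candidates = np.take_along_axis(parts, idx, axis=1).ravel()
    # stage 2: exact top-k over the much smaller candidate set
    return np.sort(candidates)[-k:][::-1]

vals = np.random.default_rng(1).normal(size=10_000)
print(approx_topk(vals, k=16, k_prime=2, n_partitions=256))
```

Choosing $K' > 1$ with fewer partitions shrinks the stage-2 input ($K' \times$ partitions candidates) while, per the paper's analysis, preserving the expected recall.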
Authors: Caleb Chin, Aashish Khubchandani, Harshvardhan Maskara, Kyuseong Choi, Jacob Feitelberg, Albert Gong, Manit Paul, Tathagata Sadhukhan, Anish Agarwal, Raaz Dwivedi
Abstract: Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications. This paper introduces N$^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface. Built for both researchers and practitioners, N$^2$ supports rapid experimentation and benchmarking. Using this framework, we introduce a new NN variant that achieves state-of-the-art results in several settings. We also release a benchmark suite of real-world datasets, from healthcare and recommender systems to causal inference and LLM evaluation, designed to stress-test matrix completion methods beyond synthetic scenarios. Our experiments demonstrate that while classical methods excel on idealized data, NN-based techniques consistently outperform them in real-world settings.
Authors: Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, Sergey Levine
Abstract: In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000x larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL. Code: https://github.com/seohongpark/horizon-reduction
Authors: Utkarsh Utkarsh, Pengfei Cai, Alan Edelman, Rafael Gomez-Bombarelli, Christopher Vincent Rackauckas
Abstract: Deep generative models have recently been applied to physical systems governed by partial differential equations (PDEs), offering scalable simulation and uncertainty-aware inference. However, enforcing physical constraints, such as conservation laws (linear and nonlinear) and physical consistencies, remains challenging. Existing methods often rely on soft penalties or architectural biases that fail to guarantee hard constraints. In this work, we propose Physics-Constrained Flow Matching (PCFM), a zero-shot inference framework that enforces arbitrary nonlinear constraints in pretrained flow-based generative models. PCFM continuously guides the sampling process through physics-based corrections applied to intermediate solution states, while remaining aligned with the learned flow and satisfying physical constraints. Empirically, PCFM outperforms both unconstrained and constrained baselines on a range of PDEs, including those with shocks, discontinuities, and sharp features, while ensuring exact constraint satisfaction at the final solution. Our method provides a general framework for enforcing hard constraints in both scientific and general-purpose generative models, especially in applications where constraint satisfaction is essential.
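As a concrete, deliberately simplified picture of constraint-corrected sampling: for a linear conservation law such as fixed total mass, the physics-based correction reduces to a Euclidean projection onto the constraint set after each integration step of the learned flow. PCFM handles general nonlinear constraints with more sophisticated corrections; `velocity_fn` below stands in for a pretrained flow-matching model.

```python
import numpy as np

def project_mass(u, target_mean):
    """Euclidean projection onto the affine set {u : mean(u) = target_mean}."""
    return u + (target_mean - u.mean())

def constrained_flow_sample(velocity_fn, u0, target_mean, n_steps=100):
    """Euler integration of a flow model with a hard constraint
    projection applied after every step (PCFM-style, linear case)."""
    u, dt = u0.copy(), 1.0 / n_steps
    for s in range(n_steps):
        t = s * dt
        u = u + dt * velocity_fn(u, t)    # learned flow step
        u = project_mass(u, target_mean)  # physics-based correction
    return u

# toy usage with a dummy velocity field; mass stays exactly at 2.0
u = constrained_flow_sample(lambda u, t: -u, np.ones(8) * 2.0, target_mean=2.0)
```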
Authors: Shreenidhi Srinivasan, Lydia Manikonda
Abstract: Generating realistic synthetic tabular data presents a critical challenge in machine learning, and the challenge is compounded when the data contain class imbalance. This paper presents a novel token-aware data imputation method that leverages the in-context learning capabilities of large language models. This is achieved through the combination of a structured group-wise CSV-style prompting technique and the elimination of irrelevant contextual information from the input prompt. We test this approach on two class-imbalanced binary classification datasets and evaluate the effectiveness of imputation using classification-based evaluation metrics. The experimental results demonstrate that our approach significantly reduces the input prompt size while maintaining or improving imputation quality compared to our baseline prompt, especially for relatively small datasets. The contributions of this work are two-fold -- 1) it sheds light on the importance of prompt design when leveraging LLMs for synthetic data generation and 2) it addresses a critical gap in LLM-based data imputation for class-imbalanced datasets with missing data by providing a practical solution within computational constraints. We hope that our work will foster further research and discussion on leveraging the potential of LLMs and prompt engineering techniques for synthetic data generation.
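A hypothetical illustration of group-wise CSV-style prompting: keep only rows that share the incomplete row's group, serialize them compactly, and mark missing cells. The abstract does not specify the exact template, so the format, field names, and instruction text below are assumptions.

```python
def build_csv_prompt(complete_rows, incomplete_row, columns, group_col):
    """Assemble a compact CSV-style prompt from rows that share the
    incomplete row's group, dropping unrelated context."""
    group = incomplete_row[group_col]
    peers = [r for r in complete_rows if r[group_col] == group]
    header = ",".join(columns)
    body = "\n".join(",".join(str(r[c]) for c in columns) for r in peers)
    query = ",".join("?" if incomplete_row[c] is None else str(incomplete_row[c])
                     for c in columns)
    return f"Fill in the missing values marked '?'.\n{header}\n{body}\n{query}"

row = {"age": 52, "bmi": None, "label": 1, "group": "A"}
data = [{"age": 49, "bmi": 27.1, "label": 1, "group": "A"},
        {"age": 61, "bmi": 31.4, "label": 0, "group": "B"}]
print(build_csv_prompt(data, row, ["age", "bmi", "label"], "group"))
```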
Authors: Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Ludwig Schmidt
Abstract: Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available at https://openthoughts.ai.
URLs: https://openthoughts.ai.
Authors: Yuxuan Cao, Jiarong Xu, Chen Zhao, Jiaan Wang, Carl Yang, Chunping Wang, Yang Yang
Abstract: In recent years, graph anomaly detection has found extensive applications in various domains such as social, financial, and communication networks. However, anomalies in graph-structured data present unique challenges, including label scarcity, ill-defined anomalies, and varying anomaly types, making supervised or semi-supervised methods unreliable. Researchers often adopt unsupervised approaches to address these challenges, assuming that anomalies deviate significantly from the normal data distribution. Yet, when the available data is insufficient, capturing the normal distribution accurately and comprehensively becomes difficult. To overcome this limitation, we propose to utilize external graph data (i.e., graph data in the wild) to help anomaly detection tasks. This naturally raises the question: How can we use external data to help graph anomaly detection tasks? To answer this question, we propose a framework called Wild-GAD. It is built upon a unified database, UniWildGraph, which comprises a large and diverse collection of graph data with broad domain coverage, ample data volume, and a unified feature space. Further, we develop selection criteria based on representativity and diversity to identify the most suitable external data for the anomaly detection task. Extensive experiments on six real-world datasets demonstrate the effectiveness of Wild-GAD: our framework achieves average improvements of 18% in AUCROC and 32% in AUCPR over the best-competing methods.
Authors: Elena Zamaraeva, Christopher M. Collins, George R. Darling, Matthew S. Dyer, Bei Peng, Rahul Savani, Dmytro Antypov, Vladimir V. Gusev, Judith Clymo, Paul G. Spirakis, Matthew J. Rosseinsky
Abstract: Geometry optimization of atomic structures is a common and crucial task in computational chemistry and materials design. Following the learning to optimize paradigm, we propose a new multi-agent reinforcement learning method called Multi-Agent Crystal Structure optimization (MACS) to address periodic crystal structure optimization. MACS treats geometry optimization as a partially observable Markov game in which atoms are agents that adjust their positions to collectively discover a stable configuration. We train MACS across various compositions of reported crystalline materials to obtain a policy that successfully optimizes structures from the training compositions as well as structures of larger sizes and unseen compositions, confirming its excellent scalability and zero-shot transferability. We benchmark our approach against a broad range of state-of-the-art optimization methods and demonstrate that MACS optimizes periodic crystal structures significantly faster, with fewer energy calculations, and the lowest failure rate.
Authors: Jinghan Jia, Hadi Reisizadeh, Chongyu Fan, Nathalie Baracaldo, Mingyi Hong, Sijia Liu
Abstract: Large language models (LLMs) have shown remarkable reasoning capabilities when trained with chain-of-thought (CoT) supervision. However, the long and verbose CoT traces, especially those distilled from large reasoning models (LRMs) such as DeepSeek-R1, significantly increase training costs during the distillation process, where a non-reasoning base model is taught to replicate the reasoning behavior of an LRM. In this work, we study the problem of CoT condensation for resource-efficient reasoning training, aimed at pruning intermediate reasoning steps (i.e., thoughts) in CoT traces, enabling supervised model training on length-reduced CoT data while preserving both answer accuracy and the model's ability to generate coherent reasoning. Our rationale is that CoT traces typically follow a three-stage structure: problem understanding, exploration, and solution convergence. Through empirical analysis, we find that retaining the structure of the reasoning trace, especially the early stage of problem understanding (rich in reflective cues) and the final stage of solution convergence, is sufficient to achieve lossless reasoning supervision. To this end, we propose an Edge-Preserving Condensation method, EPiC, which selectively retains only the initial and final segments of each CoT trace while discarding the middle portion. This design draws an analogy to preserving the "edge" of a reasoning trajectory, capturing both the initial problem framing and the final answer synthesis, to maintain logical continuity. Experiments across multiple model families (Qwen and LLaMA) and benchmarks show that EPiC reduces training time by over 34% while achieving lossless reasoning accuracy on MATH500, comparable to full CoT supervision. To the best of our knowledge, this is the first study to explore thought-level CoT condensation for efficient reasoning model distillation.
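The condensation rule itself is simple to sketch: keep the leading and trailing segments of the thought sequence and drop the middle exploration. The retention fractions below are hypothetical; the paper's selection granularity and ratios may differ.

```python
def epic_condense(cot_steps, head_frac=0.25, tail_frac=0.25):
    """Edge-Preserving Condensation: keep the opening (problem framing)
    and closing (solution convergence) segments of a CoT trace,
    discarding the middle exploration."""
    n = len(cot_steps)
    head = cot_steps[: max(1, int(n * head_frac))]
    tail = cot_steps[n - max(1, int(n * tail_frac)):]
    return head + tail

trace = [f"thought {i}" for i in range(20)]
print(epic_condense(trace))  # keeps thoughts 0-4 and 15-19
```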
Authors: Reza Ramezanpour, Victor M. Tenorio, Antonio G. Marques, Ashutosh Sabharwal, Santiago Segarra
Abstract: Graphons, as limit objects of dense graph sequences, play a central role in the statistical analysis of network data. However, existing graphon estimation methods often struggle with scalability to large networks and resolution-independent approximation, due to their reliance on estimating latent variables or costly metrics such as the Gromov-Wasserstein distance. In this work, we propose a novel, scalable graphon estimator that directly recovers the graphon via moment matching, leveraging implicit neural representations (INRs). Our approach avoids latent variable modeling by training an INR--mapping coordinates to graphon values--to match empirical subgraph counts (i.e., moments) from observed graphs. This direct estimation mechanism yields a polynomial-time solution and crucially sidesteps the combinatorial complexity of Gromov-Wasserstein optimization. Building on foundational results, we establish a theoretical guarantee: when the observed subgraph motifs sufficiently represent those of the true graphon (a condition met with sufficiently large or numerous graph samples), the estimated graphon achieves a provable upper bound in cut distance from the ground truth. Additionally, we introduce MomentMixup, a data augmentation technique that performs mixup in the moment space to enhance graphon-based learning. Our graphon estimation method achieves strong empirical performance--demonstrating high accuracy on small graphs and superior computational efficiency on large graphs--outperforming state-of-the-art scalable estimators in 75\% of benchmark settings and matching them in the remaining cases. Furthermore, MomentMixup demonstrated improved graph classification accuracy on the majority of our benchmarks.
Authors: Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng
Abstract: Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and challenging AIME2024 and AIME2025.
Authors: Robert Bayer, Julian Priest, Daniel Kjellberg, Jeppe Lindhard, Nikolaj Sørenesen, Nicolaj Valsted, Ívar Óli, Pınar Tözün
Abstract: CubeSats offer a low-cost platform for space research, particularly for Earth observation. However, their resource-constrained nature and their operation in space challenge the flexibility and complexity of the deployed image processing pipelines and their orchestration. This paper introduces two novel systems, DIPP and DISH, to address these challenges. DIPP is a modular and configurable image processing pipeline framework that allows for adaptability to changing mission goals even after deployment, while preserving robustness. DISH is a domain-specific language (DSL) and runtime system designed to schedule complex imaging workloads on low-power and memory-constrained processors. Our experiments demonstrate that DIPP's decomposition of the processing pipelines adds negligible overhead, while significantly reducing the network requirements of updating pipelines and remaining robust against erroneous module uploads. Furthermore, we compare DISH to Lua, a general-purpose scripting language, and demonstrate its comparable expressiveness and lower memory requirements.
Authors: Junzhe Jiang, Chang Yang, Xinrun Wang, Bo Li
Abstract: Stock market indices serve as fundamental market measurements that quantify systematic market dynamics. However, accurate index price prediction remains challenging, primarily because existing approaches treat indices as isolated time series and frame the prediction as a simple regression task. These methods fail to capture indices' inherent nature as aggregations of constituent stocks with complex, time-varying interdependencies. To address these limitations, we propose Cubic, a novel end-to-end framework that explicitly models the adaptive fusion of constituent stocks for index price prediction. Our main contributions are threefold. i) Fusion in the latent space: we introduce a fusion mechanism over the latent embeddings of the stocks to extract information from the vast number of constituents. ii) Binary encoding classification: since regression on continuous values is challenging, we reformulate the regression as a classification task, where the target value is converted to a binary code and the prediction of each digit is optimized with a cross-entropy loss. iii) Confidence-guided prediction and trading: we introduce a regularization loss to address market prediction uncertainty and design rule-based trading policies based on the model's confidence. Extensive experiments across multiple stock markets and indices demonstrate that Cubic consistently outperforms state-of-the-art baselines in stock index prediction tasks, achieving superior performance on both forecasting accuracy metrics and downstream trading profitability.
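Contribution ii) can be illustrated directly: normalize the target price into a fixed range, quantize it, and train one binary classification head per bit. A minimal encoding/decoding sketch (bit width and range handling are assumptions):

```python
import numpy as np

def encode_binary(y, y_min, y_max, n_bits=10):
    """Map a continuous target to an n-bit code: normalize to [0, 1),
    quantize to an integer, and emit its binary digits (one
    classification head per digit in a Cubic-style setup)."""
    levels = 2 ** n_bits
    q = np.clip(((y - y_min) / (y_max - y_min) * levels).astype(int),
                0, levels - 1)
    return ((q[:, None] >> np.arange(n_bits)[::-1]) & 1).astype(np.float32)

def decode_binary(bits, y_min, y_max):
    """Invert the code back to a point estimate (bin midpoint)."""
    n_bits = bits.shape[1]
    q = (bits.astype(int) * (1 << np.arange(n_bits)[::-1])).sum(axis=1)
    return y_min + (q + 0.5) / (2 ** n_bits) * (y_max - y_min)

y = np.array([101.3, 98.7])
bits = encode_binary(y, 90.0, 110.0)
print(decode_binary(bits, 90.0, 110.0))  # close to the original values
```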
Authors: Ziyang Yu, Wenbing Huang, Yang Liu
Abstract: Molecular Dynamics (MD) simulations are essential for understanding the atomic-level behavior of molecular systems, giving insights into their transitions and interactions. However, classical MD techniques are limited by the trade-off between accuracy and efficiency, while recent deep learning-based improvements have mostly focused on single-domain molecules, lacking transferability to unfamiliar molecular systems. Therefore, we propose \textbf{Uni}fied \textbf{Sim}ulator (UniSim), which leverages cross-domain knowledge to enhance the understanding of atomic interactions. First, we employ a multi-head pretraining approach to learn a unified atomic representation model from a large and diverse set of molecular data. Then, based on the stochastic interpolant framework, we learn the state transition patterns over long timesteps from MD trajectories, and introduce a force guidance module for rapidly adapting to different chemical environments. Our experiments demonstrate that UniSim achieves highly competitive performance across small molecules, peptides, and proteins.
Authors: Damith Chamalke Senadeera, Xiaoyun Yang, Dimitrios Kollias, Gregory Slabaugh
Abstract: The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture combining a dual-branch design and a state-space model (SSM) backbone, where one branch captures spatial features and the other focuses on temporal dynamics, with continuous fusion via a gating mechanism. We also present a new benchmark for video violence detection by merging the RWF-2000, RLVS, and VioPeru datasets, ensuring strict separation between training and testing sets. Our model achieves state-of-the-art performance on this benchmark, offering an optimal balance between accuracy and computational efficiency and demonstrating the promise of SSMs for scalable, real-time surveillance violence detection.
Authors: Long Tan Le, Senura Hansaja Wanasekara, Zerun Niu, Yansong Shi, Nguyen H. Tran, Phuong Vo, Walid Saad, Dusit Niyato, Zhu Han, Choong Seon Hong, H. Vincent Poor
Abstract: 6G wireless systems are expected to support massive volumes of data with ultra-low latency. However, conventional bit-level transmission strategies cannot support the efficiency and adaptability required by modern, data-intensive applications. The concept of semantic communication (SemCom) addresses this limitation by focusing on transmitting task-relevant semantic information instead of raw data. While recent efforts incorporating deep learning and large-scale AI models have improved SemCom's performance, existing systems remain vulnerable to both semantic-level and transmission-level noise because they often rely on domain-specific architectures that hinder generalizability. In this paper, a novel and generalized semantic communication framework called WaSeCom is proposed to systematically address uncertainty and enhance robustness. In particular, Wasserstein distributionally robust optimization is employed to provide resilience against semantic misinterpretation and channel perturbations. A rigorous theoretical analysis is performed to establish the robust generalization guarantees of the proposed framework. Experimental results on image and text transmission demonstrate that WaSeCom achieves improved robustness under noise and adversarial perturbations. These results highlight its effectiveness in preserving semantic fidelity across varying wireless conditions.
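For reference, Wasserstein distributionally robust optimization is typically written as a min-max problem over an ambiguity ball around the empirical distribution; the paper's exact formulation may differ, but the standard form is

\[
\min_{\theta}\; \sup_{Q \,:\, W_p(Q,\, \widehat{P}_n) \le \rho}\; \mathbb{E}_{(x,y)\sim Q}\big[\ell(f_\theta(x), y)\big],
\]

where $\widehat{P}_n$ is the empirical distribution of training pairs, $W_p$ is the order-$p$ Wasserstein distance, and the radius $\rho$ controls how much semantic or channel perturbation the learned model must withstand.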
Authors: Dawen Jiang, Zhishu Shen, Qiushi Zheng, Tiehua Zhang, Wei Xiang, Jiong Jin
Abstract: Amid the challenges posed by global population growth and climate change, traditional agricultural Internet of Things (IoT) systems are currently undergoing a significant digital transformation to facilitate efficient big data processing. While smart agriculture utilizes artificial intelligence (AI) technologies to enable precise control, it still encounters significant challenges, including excessive reliance on agricultural expert knowledge, difficulties in fusing multimodal data, poor adaptability to dynamic environments, and bottlenecks in real-time decision-making at the edge. Large language models (LLMs), with their exceptional capabilities in knowledge acquisition and semantic understanding, provide a promising solution to address these challenges. To this end, we propose Farm-LightSeek, an edge-centric multimodal agricultural IoT data analytics framework that integrates LLMs with edge computing. This framework collects real-time farmland multi-source data (images, weather, geographic information) via sensors, performs cross-modal reasoning and disease detection at edge nodes, conducts low-latency management decisions, and enables cloud collaboration for model updates. The main innovations of Farm-LightSeek include: (1) an agricultural "perception-decision-action" closed-loop architecture; (2) cross-modal adaptive monitoring; and (3) a lightweight LLM deployment strategy balancing performance and efficiency. Experiments conducted on two real-world datasets demonstrate that Farm-LightSeek consistently achieves reliable performance in mission-critical tasks, even under the limitations of edge computing resources. This work advances intelligent real-time agricultural solutions and highlights the potential for deeper integration of agricultural IoT with LLMs.
Authors: Murthy L, Subarna Tripathi
Abstract: The risk of text-to-image generative models being misused for malicious purposes, especially given the open-source development of such models, has become a serious concern. As a risk mitigation strategy, attributing generative models with neural fingerprinting is emerging as a popular technique. A plethora of recent work aims to address neural fingerprinting, and the trade-off between the attribution accuracy and generation quality of such models has been studied extensively. However, none of the existing methods has yet achieved $100\%$ attribution accuracy, and any model with less than \emph{perfect} accuracy is practically non-deployable. In this work, we propose an accurate method to incorporate neural fingerprinting for text-to-image diffusion models, leveraging the concept of cyclic error correcting codes from coding theory.
Authors: Koki Matsuishi, Kosuke Ukita, Tsuyoshi Okita
Abstract: In recent years, the widespread adoption of wearable devices has highlighted the growing importance of behavior analysis using inertial measurement units (IMUs). While applications span diverse fields such as healthcare and robotics, recent studies have increasingly focused on multimodal analysis in addition to unimodal analysis. Several studies have proposed multimodal foundation models that incorporate first-person video and text data; however, these models still fall short in providing a detailed analysis of full-body human activity. To address this limitation, we propose Activity Understanding and Representations Alignment - Multimodal Foundation Model (AURA-MFM), a foundational model integrating four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity, which first-person perspectives alone fail to capture. Additionally, a Transformer-based IMU encoder is employed to enhance the model's overall performance. Experimental evaluations on retrieval and activity recognition tasks demonstrate that our model surpasses existing methods. Notably, in zero-shot classification for action recognition, our method achieves significantly higher performance, with an F1-score of 0.6226 and an accuracy of 0.7320, whereas the existing method recorded an F1-score of 0.0747 and an accuracy of 0.1961.
Authors: Shivani Chiranjeevi, Hossein Zaremehrjerdi, Zi K. Deng, Talukder Z. Jubery, Ari Grele, Arti Singh, Asheesh K Singh, Soumik Sarkar, Nirav Merchant, Harold F. Greeney, Baskar Ganapathysubramanian, Chinmay Hegde
Abstract: The rapid global loss of biodiversity, particularly among insects, represents an urgent ecological crisis. Current methods for insect species discovery are manual, slow, and severely constrained by taxonomic expertise, hindering timely conservation actions. We introduce TerraIncognita, a dynamic benchmark designed to evaluate state-of-the-art multimodal models for the challenging problem of identifying unknown, potentially undescribed insect species from image data. Our benchmark dataset combines a mix of expertly annotated images of insect species likely known to frontier AI models, and images of rare and poorly known species, for which few/no publicly available images exist. These images were collected from underexplored biodiversity hotspots, realistically mimicking open-world discovery scenarios faced by ecologists. The benchmark assesses models' proficiency in hierarchical taxonomic classification, their capability to detect and abstain from out-of-distribution (OOD) samples representing novel species, and their ability to generate explanations aligned with expert taxonomic knowledge. Notably, top-performing models achieve over 90\% F1 at the Order level on known species, but drop below 2\% at the Species level, highlighting the sharp difficulty gradient from coarse to fine taxonomic prediction (Order $\rightarrow$ Family $\rightarrow$ Genus $\rightarrow$ Species). TerraIncognita will be updated regularly, and by committing to quarterly dataset expansions (of both known and novel species), will provide an evolving platform for longitudinal benchmarking of frontier AI methods. All TerraIncognita data, results, and future updates are available \href{https://baskargroup.github.io/TerraIncognita/}{here}.
Authors: Yaşar Utku Alçalar, Yu Cao, Mehmet Akçakaya
Abstract: Physics-driven artificial intelligence (PD-AI) reconstruction methods have emerged as the state-of-the-art for accelerating MRI scans, enabling higher spatial and temporal resolutions. However, the high resolution of these scans generates massive data volumes, leading to challenges in transmission, storage, and real-time processing. This is particularly pronounced in functional MRI, where hundreds of volumetric acquisitions further exacerbate these demands. Edge computing with FPGAs presents a promising solution for enabling PD-AI reconstruction near the MRI sensors, reducing data transfer and storage bottlenecks. However, this requires optimization of PD-AI models for hardware efficiency through quantization and bypassing traditional FFT-based approaches, which can be a limitation due to their computational demands. In this work, we propose a novel PD-AI computational MRI approach optimized for FPGA-based edge computing devices, leveraging 8-bit complex data quantization and eliminating redundant FFT/IFFT operations. Our results show that this strategy improves computational efficiency while maintaining reconstruction quality comparable to conventional PD-AI methods, and outperforms standard clinical methods. Our approach presents an opportunity for high-resolution MRI reconstruction on resource-constrained devices, highlighting its potential for real-world deployment.
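A minimal sketch of 8-bit complex quantization in the spirit described, using per-array symmetric scaling of the real and imaginary parts; the paper's exact calibration scheme is not detailed in the abstract.

```python
import numpy as np

def quantize_complex_int8(z):
    """Symmetric 8-bit quantization applied separately to the real and
    imaginary parts of complex k-space / image data (nonzero input assumed)."""
    scale = max(np.abs(z.real).max(), np.abs(z.imag).max()) / 127.0
    q_re = np.clip(np.round(z.real / scale), -127, 127).astype(np.int8)
    q_im = np.clip(np.round(z.imag / scale), -127, 127).astype(np.int8)
    return q_re, q_im, scale

def dequantize_complex_int8(q_re, q_im, scale):
    return scale * (q_re.astype(np.float32) + 1j * q_im.astype(np.float32))

z = (np.random.randn(4, 4) + 1j * np.random.randn(4, 4)).astype(np.complex64)
z_hat = dequantize_complex_int8(*quantize_complex_int8(z))
print(np.abs(z - z_hat).max())  # small quantization error
```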
Authors: Mahe Zabin, Ho-Jin Choi, Md. Monirul Islam, Jia Uddin
Abstract: The performance of a classifier depends on the tuning of its parameters. In this paper, we experiment with the impact of various tuning parameters on the performance of a deep convolutional neural network (DCNN). In the experimental evaluation, we consider a DCNN classifier that consists of 2 convolutional layers (CL), 2 pooling layers (PL), 1 dropout, and a dense layer. To observe the impact of pooling, activation function, and optimizer tuning parameters, we utilize a crack image dataset having two classes: negative and positive. The experimental results demonstrate that with max pooling, the DCNN performs best with the adam optimizer and the tanh activation function.
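The described architecture is easy to reproduce as a Keras sketch with the best-found settings (max pooling, tanh, adam); filter counts, kernel sizes, and input resolution are assumptions, since the abstract does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers

# 2 CLs, 2 PLs, 1 dropout, and a dense head; hyperparameters assumed.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(120, 120, 3)),
    layers.Conv2D(32, (3, 3), activation="tanh"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="tanh"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),  # negative vs. positive crack
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```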
Authors: Duaa Kareem Qasim, Sabah Abdulazeez Jebur, Lafta Raheem Ali, Abdul Jalil M. Khalaf, Abir Jaafar Hussain
Abstract: Retinal diseases such as Diabetic Retinopathy (DR) and Macular Hole (MH) significantly impact vision and affect millions worldwide. Early detection is crucial, as DR, a complication of diabetes, damages retinal blood vessels, potentially leading to blindness, while MH disrupts central vision, affecting tasks like reading and facial recognition. This paper employed two lightweight and efficient Convolutional Neural Network (CNN) architectures, MobileNet and NASNetMobile, for the classification of Normal, DR, and MH retinal images. The models were trained on the RFMiD dataset, consisting of 3,200 fundus images, after undergoing preprocessing steps such as resizing, normalization, and augmentation. To address data scarcity, this study leveraged transfer learning and data augmentation techniques, enhancing model generalization and performance. The experimental results demonstrate that MobileNetV2 achieved the highest accuracy of 90.8%, outperforming NASNetMobile, which achieved 89.5% accuracy. These findings highlight the effectiveness of CNNs in retinal disease classification, providing a foundation for AI-assisted ophthalmic diagnosis and early intervention.
Authors: Ekram Alam, Abu Sufian, Paramartha Dutta, Marco Leo
Abstract: Unintentional or accidental falls are one of the most significant health issues for senior persons, whose population is increasing steadily, so there is a need for an automated fall detection monitoring system. This paper introduces a vision-based fall detection system using a pre-trained 3D CNN. Unlike a 2D CNN, a 3D CNN extracts not only spatial but also temporal features. The proposed model leverages the original learned weights of a 3D CNN model pre-trained on the Sports1M dataset to extract the spatio-temporal features. Only the SVM classifier was trained, which saves the time required to train the 3D CNN. Stratified-shuffle five-split cross-validation was used to split the dataset into training and testing data. Extracted features from the proposed 3D CNN model were fed to an SVM classifier to classify the activity as a fall or an activity of daily living (ADL). Two datasets, GMDCSA and CAUCAFall, were utilized to conduct the experiments. The source code for this work can be accessed via the following link: https://github.com/ekramalam/HFD_3DCNN.
Authors: Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande
Abstract: Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara test style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved a 33.6% accuracy on the numeric `easy' task and a striking 3% on the alphanumeric `hard' task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap in MLLMs. We open-source HueManity dataset and code to foster further research in improving perceptual robustness of MLLMs.
Authors: Dania Herzalla, Willian T. Lunardi, Martin Andreoni
Abstract: Graph-based learning has emerged as a transformative approach for modeling complex relationships across diverse domains, yet its potential in wireless security remains largely unexplored. In this work, we introduce the first application of graph-based learning for jamming source localization, addressing the imminent threat of jamming attacks in wireless networks. Unlike geometric optimization techniques that struggle under environmental uncertainties and dense interference, we reformulate localization as an inductive graph regression task. Our approach integrates structured node representations that encode local and global signal aggregation, ensuring spatial coherence and adaptive signal fusion. To enhance robustness, we incorporate an attention-based graph neural network that adaptively refines neighborhood influence and introduces a confidence-guided estimation mechanism that dynamically balances learned predictions with domain-informed priors. We evaluate our approach under complex radio frequency environments with varying sampling densities and signal propagation conditions, conducting comprehensive ablation studies on graph construction, feature selection, and pooling strategies. Results demonstrate that our novel graph-based learning framework significantly outperforms established localization baselines, particularly in challenging scenarios with sparse and obfuscated signal information. Code is available at [https://github.com/daniaherzalla/gnn-jamming-source-localization](https://github.com/daniaherzalla/gnn-jamming-source-localization).
Authors: Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Yanjie Liang, Zuming Huang, Haozhe Wang, Jun Huang, Ling Chen, Wei Chu, Yuan Qi
Abstract: Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
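A rough sketch of such a composite reward, combining the three terms named in the abstract; difflib's ratio stands in for normalized edit distance, and the weights are illustrative rather than the paper's values.

```python
import difflib

def layout_reward(pred_text, gt_text, pred_paras, gt_paras,
                  pred_order, gt_order, w=(0.6, 0.2, 0.2)):
    """Composite parsing reward: edit similarity, paragraph count
    accuracy, and reading order preservation (weights illustrative)."""
    # similarity in [0, 1], a stand-in for 1 - normalized edit distance
    edit_sim = difflib.SequenceMatcher(None, pred_text, gt_text).ratio()
    para_acc = max(1.0 - abs(pred_paras - gt_paras) / max(gt_paras, 1), 0.0)
    # fraction of adjacent ground-truth block pairs kept in order
    pos = {b: i for i, b in enumerate(pred_order)}
    pairs = list(zip(gt_order, gt_order[1:]))
    kept = sum(pos[a] < pos[b] for a, b in pairs if a in pos and b in pos)
    order_score = kept / max(len(pairs), 1)
    return w[0] * edit_sim + w[1] * para_acc + w[2] * order_score
```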
Authors: Giuseppe Di Caro, Vahagn Kirakosyan, Alexander G. Abanov, Luca Candelori, Nadine Hartmann, Ernest T. Lam, Kharen Musaelian, Ryan Samson, Dario Villani, Martin T. Wells, Richard J. Wenstrup, Mengjia Xu
Abstract: The accurate prediction of chromosomal instability from the morphology of circulating tumor cells (CTCs) enables real-time detection of CTCs with high metastatic potential in the context of liquid biopsy diagnostics. However, it presents a significant challenge due to the high dimensionality and complexity of single-cell digital pathology data. Here, we introduce the application of Quantum Cognition Machine Learning (QCML), a quantum-inspired computational framework, to estimate morphology-predicted chromosomal instability in CTCs from patients with metastatic breast cancer. QCML leverages quantum mechanical principles to represent data as state vectors in a Hilbert space, enabling context-aware feature modeling, dimensionality reduction, and enhanced generalization without requiring curated feature selection. QCML outperforms conventional machine learning methods when tested on out of sample verification CTCs, achieving higher accuracy in identifying predicted large-scale state transitions (pLST) status from CTC-derived morphology features. These preliminary findings support the application of QCML as a novel machine learning tool with superior performance in high-dimensional, low-sample-size biomedical contexts. QCML enables the simulation of cognition-like learning for the identification of biologically meaningful prediction of chromosomal instability from CTC morphology, offering a novel tool for CTC classification in liquid biopsy.
Authors: Itxasne Antúnez Sáenz, Ane Alberdi Aramendi, David Dunaway, Juling Ong, Lara Deliège, Amparo Sáenz, Anita Ahmadi Birjandi, Noor Ul Owase Jeelani, Silvia Schievano, Alessandro Borghi
Abstract: Craniosynostosis is a medical condition that affects the growth of babies' heads, caused by an early fusion of cranial sutures. In recent decades, surgical treatments for craniosynostosis have significantly improved, leading to reduced invasiveness, faster recovery, and less blood loss. At Great Ormond Street Hospital (GOSH), the main surgical treatment for patients diagnosed with sagittal craniosynostosis (SC) is spring assisted cranioplasty (SAC). This procedure involves a 15x15 mm$^2$ osteotomy, where two springs are inserted to induce distraction. Despite the numerous advantages of this surgical technique for patients, the outcome remains unpredictable due to the lack of efficient preoperative planning tools. The surgeon's experience and the baby's age are currently relied upon to determine the osteotomy location and spring selection. Previous tools for predicting the surgical outcome of SC relied on finite element modeling (FEM), which involved computed tomography (CT) imaging and required engineering expertise and lengthy calculations. The main goal of this research is to develop a real-time prediction tool for the surgical outcome of patients, eliminating the need for CT scans to minimise radiation exposure during preoperative planning. The proposed methodology involves creating personalised synthetic skulls based on three-dimensional (3D) photographs, incorporating population average values of suture location, skull thickness, and soft tissue properties. A machine learning (ML) surrogate model is employed to predict the desired surgical outcome. The resulting multi-output support vector regressor model achieves an R$^2$ of 0.95, with MSE and MAE below 0.13. In the future, this model could not only simulate various surgical scenarios but also provide optimal parameters for achieving a maximum cranial index (CI).
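The surrogate maps skull and spring descriptors to several outcome measures at once; a minimal multi-output SVR sketch with scikit-learn follows, where feature and target dimensions are placeholders rather than the study's actual variables.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# toy stand-ins: skull/spring descriptors -> outcome measures
X = np.random.rand(200, 12)  # e.g., suture location, thickness, spring params
Y = np.random.rand(200, 3)   # e.g., per-landmark displacement targets

model = MultiOutputRegressor(
    make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)))
model.fit(X, Y)
print(model.predict(X[:2]))
```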
Authors: Tinghuan Li, Shuheng Chen, Junyi Fan, Elham Pishgar, Kamiar Alaei, Greg Placencia, Maryam Pishgar
Abstract: Postoperative stroke remains a critical complication in elderly surgical intensive care unit (SICU) patients, contributing to prolonged hospitalization, elevated healthcare costs, and increased mortality. Accurate early risk stratification is essential to enable timely intervention and improve clinical outcomes. We constructed a combined cohort of 19,085 elderly SICU admissions from the MIMIC-III and MIMIC-IV databases and developed an interpretable machine learning (ML) framework to predict in-hospital stroke using clinical data from the first 24 hours of Intensive Care Unit (ICU) stay. The preprocessing pipeline included removal of high-missingness features, iterative Singular Value Decomposition (SVD) imputation, z-score normalization, one-hot encoding, and class imbalance correction via the Adaptive Synthetic Sampling (ADASYN) algorithm. A two-stage feature selection process, combining Recursive Feature Elimination with Cross-Validation (RFECV) and SHapley Additive exPlanations (SHAP), reduced the initial 80 variables to 20 clinically informative predictors. Among eight ML models evaluated, CatBoost achieved the best performance with an AUROC of 0.8868 (95% CI: 0.8802--0.8937). SHAP analysis and ablation studies identified prior cerebrovascular disease, serum creatinine, and systolic blood pressure as the most influential risk factors. Our results highlight the potential of interpretable ML approaches to support early detection of postoperative stroke and inform decision-making in perioperative critical care.
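A condensed sketch of the preprocessing-plus-model pipeline described (ADASYN balancing, RFECV selection, CatBoost classifier) on synthetic stand-in data; the SHAP refinement stage is omitted, and the RFECV base estimator here is a simplification, since the paper does not state it.

```python
# requires: pip install scikit-learn imbalanced-learn catboost
import numpy as np
from catboost import CatBoostClassifier
from imblearn.over_sampling import ADASYN
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 80), (np.random.rand(1000) < 0.1).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_tr, y_tr)  # imbalance
selector = RFECV(LogisticRegression(max_iter=1000),
                 min_features_to_select=20)                      # selection
X_sel = selector.fit_transform(X_bal, y_bal)
clf = CatBoostClassifier(verbose=0).fit(X_sel, y_bal)            # model
proba = clf.predict_proba(selector.transform(X_te))[:, 1]
```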
Authors: Arbind Agrahari Baniya, Tsz-Kwan Lee, Peter Eklund, Sunil Aryal
Abstract: Video super-resolution (VSR) is a prominent research topic in low-level computer vision, where deep learning technologies have played a significant role. The rapid progress in deep learning and its applications in VSR has led to a proliferation of tools and techniques in the literature. However, the usage of these methods is often not adequately explained, and decisions are primarily driven by quantitative improvements. Given the significance of VSR's potential influence across multiple domains, it is imperative to conduct a comprehensive analysis of the elements and deep learning methodologies employed in VSR research. This methodical analysis will facilitate the informed development of models tailored to specific application needs. In this paper, we present an overarching overview of deep learning-based video super-resolution models, investigating each component and discussing its implications. Furthermore, we provide a synopsis of key components and technologies employed by state-of-the-art and earlier VSR models. By elucidating the underlying methodologies and categorising them systematically, we identified trends, requirements, and challenges in the domain. As a first-of-its-kind survey of deep learning-based VSR models, this work also establishes a multi-level taxonomy to guide current and future VSR research, enhancing the maturation and interpretation of VSR practices for various practical applications.
Authors: Alina Wernick, Kristof Meding
Abstract: The EU has become one of the vanguards in regulating the digital age. A particularly important regulation in the Artificial Intelligence (AI) domain is the EU AI Act, which entered into force in 2024. The AI Act specifies -- due to a risk-based approach -- various obligations for providers of AI systems. These obligations, for example, include a cascade of documentation and compliance measures, which represent a potential obstacle to science. But do these obligations also apply to AI researchers? This position paper argues that, indeed, the AI Act's obligations could apply in many more cases than the AI community is aware of. In our analysis of the AI Act and its applicability, we contribute the following: 1.) We give a high-level introduction to the AI Act aimed at non-legal AI research scientists. 2.) We explain with everyday research examples why the AI Act applies to research. 3.) We analyse the exceptions to the AI Act's applicability and find that, in particular, the scientific research exceptions fail to account for current AI research practices. 4.) We propose changes to the AI Act to provide more legal certainty for AI researchers and give two recommendations for AI researchers to reduce the risk of not complying with the AI Act. We see our paper as a starting point for a discussion between policymakers, legal scholars, and AI researchers to avoid unintended side effects of the AI Act on research.
Authors: Yajie Zhou, Jiajun Ruan, Eric S. Wang, Sadjad Fouladi, Francis Y. Yan, Kevin Hsieh, Zaoxing Liu
Abstract: Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations that demand reliability for deployments. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing interesting fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
Authors: Andrea Ferrario
Abstract: Modern AI systems are man-made objects that leverage machine learning to support our lives across a myriad of contexts and applications. Despite extensive epistemological and ethical debates, their metaphysical foundations remain relatively underexplored. The orthodox view simply suggests that AI systems, as artifacts, lack well-posed identity and persistence conditions -- their metaphysical kinds are no real kinds. In this work, we challenge this perspective by introducing a theory of the metaphysical identity of AI systems. We do so by characterizing their kinds and introducing identity criteria -- formal rules that answer the questions "When are two AI systems the same?" and "When does an AI system persist, despite change?" Building on Carrara and Vermaas' account of fine-grained artifact kinds, we argue that AI trustworthiness provides a lens to understand AI system kinds and formalize the identity of these artifacts by relating their functional requirements to their physical make-ups. The identity criteria of AI systems are determined by their trustworthiness profiles -- the collection of capabilities that the systems must uphold over time throughout their artifact histories, and their effectiveness in maintaining these capabilities. Our approach suggests that the identity and persistence of AI systems is sensitive to the socio-technical context of their design and utilization via their trustworthiness, providing a solid metaphysical foundation for the epistemological, ethical, and legal discussions about these artifacts.
Authors: Jigang Fan, Quanlin Wu, Shengjie Luo, Liwei Wang
Abstract: The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.
Authors: My Youssef El Hafidi, Achraf Toufah, Mohamed Achraf Kadim
Abstract: In recent years, quantum machine learning has emerged as a promising intersection between quantum physics and artificial intelligence, particularly in domains requiring advanced pattern recognition such as healthcare. This study investigates the effectiveness of Quantum Support Vector Machines (QSVM), which leverage quantum mechanical phenomena like superposition and entanglement to construct high-dimensional Hilbert spaces for data classification. Focusing on lung cancer diagnosis, a concrete and critical healthcare application, we analyze how different quantum feature maps influence classification performance. Using a real-world dataset of 309 patient records with significant class imbalance (39 non-cancer vs. 270 cancer cases), we constructed six balanced subsets for robust evaluation. QSVM models were implemented using Qiskit and executed on the qasm simulator, employing three distinct quantum feature maps: ZFeatureMap, ZZFeatureMap, and PauliFeatureMap. Performance was assessed using accuracy, precision, recall, specificity, and F1-score. Results show that the PauliFeatureMap consistently outperformed the others, achieving perfect classification in three subsets and strong performance overall. These findings demonstrate how quantum computational principles can be harnessed to enhance diagnostic capabilities, reinforcing the importance of physics-based modeling in emerging AI applications within healthcare.
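A minimal QSVM sketch along these lines with synthetic data; the imports follow recent qiskit-machine-learning releases, which differ from older Qiskit versions that exposed the qasm simulator backend directly, so treat the exact API as version-dependent. ZFeatureMap or ZZFeatureMap can be swapped in for the PauliFeatureMap shown.

```python
import numpy as np
from qiskit.circuit.library import PauliFeatureMap
from qiskit_machine_learning.algorithms import QSVC
from qiskit_machine_learning.kernels import FidelityQuantumKernel

X = np.random.rand(40, 4)             # 4 tabular features per patient
y = np.random.randint(0, 2, size=40)  # cancer vs. non-cancer labels

feature_map = PauliFeatureMap(feature_dimension=4, reps=2,
                              paulis=["Z", "ZZ"])
kernel = FidelityQuantumKernel(feature_map=feature_map)
qsvc = QSVC(quantum_kernel=kernel)
qsvc.fit(X, y)
print(qsvc.score(X, y))
```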
Authors: Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, Atticus Geiger
Abstract: Steering language models (LMs) by modifying internal activations is a popular approach for controlling text generation. Unsupervised dictionary learning methods, e.g., sparse autoencoders, can be scaled to produce many steering vectors, but lack guarantees on the individual efficacy of each vector and control over the coverage of relevant steering tasks. In contrast, supervised methods for constructing steering vectors are targeted and effective, but require more data collection and training for each additional steering vector produced. In this work, we introduce HyperSteer, a family of hypernetwork-based architectures which are trained end-to-end to generate steering vectors conditioned on the natural language steering prompts and the internals of the steered LM. In our evaluations, we show that scaling HyperSteer with thousands of steering prompts exceeds the performance of state-of-the-art activation steering methods, even on steering prompts never seen during training. Moreover, HyperSteer performs on par with steering-via-prompting.
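The core idea admits a compact sketch: a hypernetwork consumes an embedding of the steering prompt together with a summary of the steered LM's internals and emits a vector added to the residual stream. The dimensions and two-layer architecture below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SteeringHypernet(nn.Module):
    """Maps a steering-prompt embedding (plus a summary of the steered
    LM's hidden state) to a steering vector for one layer."""
    def __init__(self, prompt_dim, hidden_dim, model_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(prompt_dim + model_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, model_dim),
        )

    def forward(self, prompt_emb, hidden_summary):
        return self.net(torch.cat([prompt_emb, hidden_summary], dim=-1))

hyper = SteeringHypernet(prompt_dim=384, hidden_dim=512, model_dim=4096)
steer = hyper(torch.randn(1, 384), torch.randn(1, 4096))
# add the steering vector to every position of the residual stream
steered_hidden = torch.randn(1, 16, 4096) + steer.unsqueeze(1)
```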
Authors: Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen
Abstract: We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable. Even one-shot RL requires hundreds of GPU hours. This raises a critical question: Is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to or even surpass the results from RL with 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.
Authors: Mustafa Eyceoz, Nikhil Shivakumar Nayak, Hao Wang, Ligong Han, Akash Srivastava
Abstract: Modern causal language models stack many attention blocks to improve performance, but not all blocks are necessary for every task. We propose Hopscotch, a simple yet effective method that identifies and skips the attention blocks that contribute least to a task, then adapts the remaining blocks to preserve output quality. Hopscotch jointly optimizes which blocks to skip and how to scale the outputs of the remaining layers. By introducing lightweight, trainable scaling parameters to attention and MLP blocks, it mitigates distribution shifts in hidden states caused by removing attention blocks. Hopscotch does not modify model weights or require access to pretraining or instruction-tuning data, and is compatible with existing model compression techniques. When applied to $\texttt{Llama-3.1-8B}$ and $\texttt{Qwen2.5-7B}$, Hopscotch achieves less than a 2% drop in performance even after skipping four attention blocks.
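As a rough illustration of this mechanism (our own minimal PyTorch sketch, not the authors' implementation), a block wrapper can either drop the attention branch entirely or rescale it with a lightweight trainable scalar:

```python
# Minimal PyTorch sketch of skip-and-rescale for attention blocks.
import torch
import torch.nn as nn

class ScaledResidualAttention(nn.Module):
    def __init__(self, attn: nn.Module, skipped: bool = False):
        super().__init__()
        self.attn = attn          # frozen pretrained attention block
        self.skipped = skipped    # chosen by the block-selection step
        self.scale = nn.Parameter(torch.ones(1))  # lightweight trainable scale

    def forward(self, x):
        if self.skipped:
            return x                               # attention branch removed
        return x + self.scale * self.attn(x)       # rescaled residual branch
```

Training only the `scale` parameters would match the abstract's claim that the original model weights are left untouched.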
Authors: Celia Chen, Scotty Beland, Ingo Burghardt, Jill Byczek, William J. Conway, Eric Cotugno, Sadaf Davre, Megan Fletcher, Rajesh Kumar Gnanasekaran, Kristin Hamilton, Marilyn Harbert, Jordan Heustis, Tanaya Jha, Emily Klein, Hayden Kramer, Alex Leitch, Jessica Perkins, Casi Sherman, Celia Sterrn, Logan Stevens, Rebecca Zarrella, Jennifer Golbeck
Abstract: Violent threats remain a significant problem across social media platforms. Useful, high-quality data facilitates research into the understanding and detection of malicious content, including violence. In this paper, we introduce a cross-platform dataset of 30,000 posts hand-coded for violent threats and sub-types of violence, including political and sexual violence. To evaluate the signal present in this dataset, we perform a machine learning analysis with an existing dataset of violent comments from YouTube. We find that, despite originating from different platforms and using different coding criteria, we achieve high classification accuracy both by training on one dataset and testing on the other, and in a merged dataset condition. These results have implications for content-classification strategies and for understanding violent content across social media.
Authors: Yuntian Wang, Zafer Yilmaz, Yuhang Li, Edward Liu, Eric Ahlberg, Farid Ghahari, Ertugrul Taciroglu, Aydogan Ozcan
Abstract: Structural Health Monitoring (SHM) is vital for maintaining the safety and longevity of civil infrastructure, yet current solutions remain constrained by cost, power consumption, scalability, and the complexity of data processing. Here, we present a diffractive vibration monitoring system, integrating a jointly optimized diffractive layer with a shallow neural network-based backend to remotely extract 3D structural vibration spectra, offering a low-power, cost-effective and scalable solution. This architecture eliminates the need for dense sensor arrays or extensive data acquisition; instead, it uses a spatially-optimized passive diffractive layer that encodes 3D structural displacements into modulated light, captured by a minimal number of detectors and decoded in real-time by shallow and low-power neural networks to reconstruct the 3D displacement spectra of structures. The diffractive system's efficacy was demonstrated both numerically and experimentally using millimeter-wave illumination on a laboratory-scale building model with a programmable shake table. Our system achieves more than an order-of-magnitude improvement in accuracy over conventional optics or separately trained modules, establishing a foundation for high-throughput 3D monitoring of structures. Beyond SHM, the 3D vibration monitoring capabilities of this cost-effective and data-efficient framework establish a new computational sensing modality with potential applications in disaster resilience, aerospace diagnostics, and autonomous navigation, where energy efficiency, low latency, and high-throughput are critical.
Authors: Victor H. Cid, James Mork
Abstract: We investigated the feasibility of predicting Medical Subject Headings (MeSH) Publication Types (PTs) from MEDLINE citation metadata using pre-trained Transformer-based models BERT and DistilBERT. This study addresses limitations in the current automated indexing process, which relies on legacy NLP algorithms. We evaluated monolithic multi-label classifiers and binary classifier ensembles to enhance the retrieval of biomedical literature. Results demonstrate the potential of Transformer models to significantly improve PT tagging accuracy, paving the way for scalable, efficient biomedical indexing.
Authors: Artur Grigorev, Khaled Saleh, Jiwon Kim, Adriana-Simona Mihaita
Abstract: Traffic incidents remain a critical public safety concern worldwide, with Australia recording 1,300 road fatalities in 2024, its highest toll in 12 years. Similarly, the United States reports approximately 6 million crashes annually, posing significant challenges for fast response times and operational management. Traditional response protocols rely on human decision-making, which introduces potential inconsistencies and delays during critical moments when every minute impacts both safety outcomes and network performance. To address this issue, we propose a novel Incident Response Benchmark that uses generative artificial intelligence to automatically generate response plans for incoming traffic incidents. Our approach aims to significantly reduce incident resolution times by suggesting context-appropriate actions such as variable message sign deployment, lane closures, and emergency resource allocation adapted to specific incident characteristics. First, the proposed methodology uses real-world incident reports from the Performance Measurement System (PeMS) as training and evaluation data. We extract historically implemented actions from these reports and compare them against AI-generated response plans that suggest specific actions, such as lane closures, variable message sign announcements, and/or dispatching appropriate emergency resources. Second, model evaluations reveal that advanced generative AI models like GPT-4o and Grok 2 achieve superior alignment with expert solutions, demonstrated by minimized Hamming distances (averaging 2.96-2.98) and low weighted differences (approximately 0.27-0.28). Conversely, while Gemini 1.5 Pro records the lowest count of missed actions, its extremely high number of unnecessary actions (1547 compared to 225 for GPT-4o) indicates an over-triggering strategy that reduces the overall plan efficiency.
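To make the Hamming-distance evaluation concrete, here is a toy illustration (ours, with hypothetical action names, not taken from the PeMS reports) of scoring a generated plan against the historically implemented one over binary action indicators:

```python
# Toy scoring of an AI-generated plan against the expert plan; the action
# slots and their names below are hypothetical.
expert_plan = [1, 0, 1, 1, 0]  # e.g., [lane_closure, vms, tow_truck, police, ambulance]
ai_plan     = [1, 1, 1, 0, 0]

hamming = sum(a != b for a, b in zip(expert_plan, ai_plan))
print(hamming)  # 2: one unnecessary action and one missed action
```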
Authors: Tri Kurniawan Wijaya, Xinyang Shao, Gonzalo Fiz Pontiveros, Edoardo D'Amico
Abstract: Recommender systems are pivotal in delivering personalized experiences across industries, yet their adoption and scalability remain hindered by the need for extensive dataset- and task-specific configurations. Existing systems often require significant manual intervention, domain expertise, and engineering effort to adapt to new datasets or tasks, creating barriers to entry and limiting reusability. In contrast, recent advancements in large language models (LLMs) have demonstrated the transformative potential of reusable systems, where a single model can handle diverse tasks without significant reconfiguration. Inspired by this paradigm, we propose the Dataset- and Task-Independent Recommender System (DTIRS), a framework aimed at maximizing the reusability of recommender systems while minimizing barriers to entry. Unlike LLMs, which achieve task generalization directly, DTIRS focuses on eliminating the need to rebuild or reconfigure recommendation pipelines for every new dataset or task, even though models may still need retraining on new data. By leveraging the novel Dataset Description Language (DsDL), DTIRS enables standardized dataset descriptions and explicit task definitions, allowing autonomous feature engineering, model selection, and optimization. This paper introduces the concept of DTIRS and establishes a roadmap for transitioning from Level-1 automation (dataset-agnostic but task-specific systems) to Level-2 automation (fully dataset- and task-independent systems). Achieving this paradigm would maximize code reusability and lower barriers to adoption. We discuss key challenges, including the trade-offs between generalization and specialization, computational overhead, and scalability, while presenting DsDL as a foundational tool for this vision.
Authors: Lukas Meyer, Josef Gr\"un, Maximilian Weiherer, Bernhard Egger, Marc Stamminger, Linus Franke
Abstract: We present MS-Splatting -- a multi-spectral 3D Gaussian Splatting (3DGS) framework that is able to generate multi-view consistent novel views from images of multiple, independent cameras with different spectral domains. In contrast to previous approaches, our method does not require cross-modal camera calibration and is versatile enough to model a variety of different spectra, including thermal and near-infrared, without any algorithmic changes. Unlike existing 3DGS-based frameworks that treat each modality separately (by optimizing per-channel spherical harmonics) and therefore fail to exploit the underlying spectral and spatial correlations, our method leverages a novel neural color representation that encodes multi-spectral information into a learned, compact, per-splat feature embedding. A shallow multi-layer perceptron (MLP) then decodes this embedding to obtain spectral color values, enabling joint learning of all bands within a unified representation. Our experiments show that this simple yet effective strategy is able to improve multi-spectral rendering quality, while also leading to improved per-spectrum rendering quality over state-of-the-art methods. We demonstrate the effectiveness of this new technique in agricultural applications to render vegetation indices, such as the normalized difference vegetation index (NDVI).
Authors: Muhammad Zubair Hasan, Fahmida Yasmin Rifat
Abstract: Skin cancer is among the most prevalent and life-threatening diseases worldwide, with early detection being critical to patient outcomes. This work presents a hybrid machine and deep learning-based approach for classifying malignant and benign skin lesions using the SLICE-3D dataset from ISIC 2024, which comprises 401,059 cropped lesion images extracted from 3D Total Body Photography (TBP), emulating non-dermoscopic, smartphone-like conditions. Our method combines vision transformers (EVA02) and our designed convolutional ViT hybrid (EdgeNeXtSAC) to extract robust features, employing a segmentation-assisted classification pipeline to enhance lesion localization. Predictions from these models are fused with a gradient-boosted decision tree (GBDT) ensemble enriched by engineered features and patient-specific relational metrics. To address class imbalance and improve generalization, we augment malignant cases with Stable Diffusion-generated synthetic lesions and apply a diagnosis-informed relabeling strategy to harmonize external datasets into a 3-class format. Using partial AUC (pAUC) above 80 percent true positive rate (TPR) as the evaluation metric, our approach achieves a pAUC of 0.1755 -- the highest among all configurations. These results underscore the potential of hybrid, interpretable AI systems for skin cancer triage in telemedicine and resource-constrained settings.
Authors: Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj
Abstract: Evaluating explainability techniques, such as SHAP and LRP, in the context of audio deepfake detection is challenging due to the lack of clear ground-truth annotations. In cases where ground truth can be obtained, we find that these methods struggle to provide accurate explanations. In this work, we propose a novel data-driven approach to identify artifact regions in deepfake audio. We consider paired real and vocoded audio, and use the difference in time-frequency representation as the ground-truth explanation. The difference signal then serves as supervision to train a diffusion model to expose the deepfake artifacts in a given vocoded audio. Experimental results on the VocV4 and LibriSeVoc datasets demonstrate that our method outperforms traditional explainability techniques, both qualitatively and quantitatively.
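The ground-truth construction is easy to sketch: assuming paired, equal-length waveforms, the artifact map is the magnitude difference between two short-time Fourier transforms (our illustration with random placeholder signals, not the authors' code):

```python
# Sketch of the ground-truth artifact map: |STFT| difference of paired audio.
import numpy as np
from scipy.signal import stft

fs = 16000
real = np.random.randn(fs)     # placeholder for a real utterance
vocoded = np.random.randn(fs)  # placeholder for its vocoded counterpart

_, _, Z_real = stft(real, fs=fs, nperseg=512)
_, _, Z_voc = stft(vocoded, fs=fs, nperseg=512)

# Regions of large magnitude difference mark vocoder artifacts; in the paper's
# setup this map supervises a diffusion model that exposes the artifacts.
artifact_map = np.abs(np.abs(Z_voc) - np.abs(Z_real))
```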
Authors: Yang Cai, Haipeng Luo, Chen-Yu Wei, Weiqiang Zheng
Abstract: The convergence of online learning algorithms in games under self-play is a fundamental question in game theory and machine learning. Among various notions of convergence, last-iterate convergence is particularly desirable, as it reflects the actual decisions made by the learners and captures the day-to-day behavior of the learning dynamics. While many algorithms are known to converge in the average-iterate, achieving last-iterate convergence typically requires considerably more effort in both the design and the analysis of the algorithm. Somewhat surprisingly, we show in this paper that for a large family of games, there exists a simple black-box reduction that transforms the average iterates of an uncoupled learning dynamics into the last iterates of a new uncoupled learning dynamics, thus also providing a reduction from last-iterate convergence to average-iterate convergence. Our reduction applies to games where each player's utility is linear in both their own strategy and the joint strategy of all opponents. This family includes two-player bimatrix games and generalizations such as multi-player polymatrix games. By applying our reduction to the Optimistic Multiplicative Weights Update algorithm, we obtain new state-of-the-art last-iterate convergence rates for uncoupled learning dynamics in two-player zero-sum normal-form games: (1) an $O(\frac{\log d}{T})$ last-iterate convergence rate under gradient feedback, representing an exponential improvement in the dependence on the dimension $d$ (i.e., the maximum number of actions available to either player); and (2) an $\widetilde{O}(d^{\frac{1}{5}} T^{-\frac{1}{5}})$ last-iterate convergence rate under bandit feedback, improving upon the previous best rates of $\widetilde{O}(\sqrt{d} T^{-\frac{1}{8}})$ and $\widetilde{O}(\sqrt{d} T^{-\frac{1}{6}})$.
Authors: Hang Liu, Anna Scaglione, Sean Peisert
Abstract: Gaussian Mixture Models (GMMs) are widely used statistical models for representing multi-modal data distributions, with numerous applications in data mining, pattern recognition, data simulation, and machine learning. However, recent research has shown that releasing GMM parameters poses significant privacy risks, potentially exposing sensitive information about the underlying data. In this paper, we address the challenge of releasing GMM parameters while ensuring differential privacy (DP) guarantees. Specifically, we focus on the privacy protection of mixture weights, component means, and covariance matrices. We propose to use Kullback-Leibler (KL) divergence as a utility metric to assess the accuracy of the released GMM, as it captures the joint impact of noise perturbation on all the model parameters. To achieve privacy, we introduce a DP mechanism that adds carefully calibrated random perturbations to the GMM parameters. Through theoretical analysis, we quantify the effects of privacy budget allocation and perturbation statistics on the DP guarantee, and derive a tractable expression for evaluating KL divergence. We formulate and solve an optimization problem to minimize the KL divergence between the released and original models, subject to a given $(\epsilon, \delta)$-DP constraint. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach achieves strong privacy guarantees while maintaining high utility.
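As a rough illustration of the overall recipe (ours, using the textbook Gaussian mechanism rather than the paper's optimized allocation; the sensitivity value is an assumed placeholder, and the covariance perturbation, which must preserve positive semidefiniteness, is omitted):

```python
# Simplified sketch of DP release for GMM parameters via the standard
# Gaussian mechanism (Dwork & Roth); NOT the paper's calibrated scheme.
import numpy as np

def gaussian_sigma(eps, delta, sensitivity):
    # Classical calibration for the Gaussian mechanism, valid for eps in (0, 1).
    return np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / eps

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])
means = np.array([[0.0, 0.0], [2.0, 2.0]])

sigma = gaussian_sigma(eps=0.5, delta=1e-5, sensitivity=1.0)  # sensitivity assumed
noisy_means = means + rng.normal(0.0, sigma, means.shape)
noisy_weights = np.clip(weights + rng.normal(0.0, sigma, weights.shape), 1e-6, None)
noisy_weights /= noisy_weights.sum()  # re-project the weights onto the simplex
# Covariances are omitted here: perturbing them while keeping them valid is
# exactly the part the paper's mechanism and budget allocation handle carefully.
```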
Authors: Tuan Le, Risal Shefin, Debashis Gupta, Thai Le, Sarra Alqahtani
Abstract: Ensuring the safety of reinforcement learning (RL) policies in high-stakes environments requires not only formal verification but also interpretability and targeted falsification. While model checking provides formal guarantees, its effectiveness is limited by abstraction quality and the completeness of the underlying trajectory dataset. We propose a hybrid framework that integrates (1) explainability, (2) model checking, and (3) risk-guided falsification to achieve both rigor and coverage. Our approach begins by constructing a human-interpretable abstraction of the RL policy using Comprehensible Abstract Policy Summarization (CAPS). This abstract graph, derived from offline trajectories, is verifier-friendly and semantically meaningful, and can be given as input to the Storm probabilistic model checker to verify satisfaction of temporal safety specifications. If the model checker identifies a violation, it returns an interpretable counterexample trace showing how the policy fails the safety requirement. However, if no violation is detected, we cannot conclude satisfaction due to potential limitations in the abstraction and in the coverage of the offline dataset. In such cases, we estimate associated risk during model checking to guide a falsification strategy that prioritizes searching in high-risk states and regions underrepresented in the trajectory dataset. We further provide PAC-style guarantees on the likelihood of uncovering undetected violations. Finally, we incorporate a lightweight safety shield that switches to a fallback policy at runtime when such a risk exceeds a threshold, facilitating failure mitigation without retraining.
Authors: Liam Hodgkinson, Zhichao Wang, Michael W. Mahoney
Abstract: Recent theoretical and empirical successes in deep learning, including the celebrated neural scaling laws, are punctuated by the observation that many objects of interest tend to exhibit some form of heavy-tailed or power law behavior. In particular, the prevalence of heavy-tailed spectral densities in Jacobians, Hessians, and weight matrices has led to the introduction of the concept of heavy-tailed mechanistic universality (HT-MU). Multiple lines of empirical evidence suggest a robust correlation between heavy-tailed metrics and model performance, indicating that HT-MU may be a fundamental aspect of deep learning efficacy. Here, we propose a general family of random matrix models -- the high-temperature Marchenko-Pastur (HTMP) ensemble -- to explore attributes that give rise to heavy-tailed behavior in trained neural networks. Under this model, spectral densities with power laws on (upper and lower) tails arise through a combination of three independent factors (complex correlation structures in the data; reduced temperatures during training; and reduced eigenvector entropy), appearing as an implicit bias in the model structure, and they can be controlled with an "eigenvalue repulsion" parameter. Implications of our model on other appearances of heavy tails, including neural scaling laws, optimizer trajectories, and the five-plus-one phases of neural network training, are discussed.
Authors: Masaya Kawamura, Takuya Hasumi, Yuma Shirahata, Ryuichi Yamamoto
Abstract: This paper proposes a highly compact, lightweight text-to-speech (TTS) model for on-device applications. To reduce the model size, the proposed model introduces two techniques. First, we introduce quantization-aware training (QAT), which quantizes model parameters during training to as low as 1.58 bits. In this case, most of the 32-bit model parameters are quantized to the ternary values {-1, 0, 1}. Second, we propose a method named weight indexing. In this method, we save a group of 1.58-bit weights as a single int8 index. This allows for efficient storage of model parameters, even on hardware that treats values in units of 8 bits. Experimental results demonstrate that the proposed method achieved an 83% reduction in model size while outperforming an unquantized baseline of similar model size in synthesis quality.
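To illustrate how weight indexing can work (a sketch under our own assumptions: the abstract does not state the group size, and we pick groups of five since $3^5 = 243$ ternary combinations fit within one 8-bit index):

```python
# Pack groups of five ternary weights {-1, 0, 1} into single 8-bit indices.
import numpy as np

def pack(w):
    # w: ternary weights, length a multiple of 5; map {-1,0,1} -> {0,1,2},
    # then encode each group of five digits as a base-3 integer in [0, 242].
    digits = (w.reshape(-1, 5) + 1).astype(np.uint8)
    powers = 3 ** np.arange(5, dtype=np.uint8)
    return (digits * powers).sum(axis=1).astype(np.uint8)

def unpack(idx):
    # Decode base-3 digits and map {0,1,2} back to {-1,0,1}.
    idx = idx.astype(np.int32)[:, None]
    digits = (idx // 3 ** np.arange(5)) % 3
    return (digits - 1).reshape(-1).astype(np.int8)

w = np.array([-1, 0, 1, 1, -1, 0, 0, 0, 1, -1], dtype=np.int8)
assert np.array_equal(unpack(pack(w)), w)  # lossless round trip
```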
Authors: Langlin Huang, Chengsong Huang, Jixuan Leng, Di Huang, Jiaxin Huang
Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from declining draft-token prediction quality at later positions, due to error accumulation in the features generated by the draft model. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve the token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experiment results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
Authors: Rui Yann, Xianglei Xing
Abstract: We present ViTSGMM, an image recognition network that leverages semi-supervised learning in a highly efficient manner. Existing works often rely on complex training techniques and architectures, while their generalization ability when dealing with extremely limited labeled data remains to be improved. To address these limitations, we construct a hierarchical mixture density classification decision mechanism by optimizing mutual information between feature representations and target classes, compressing redundant information while retaining crucial discriminative components. Experimental results demonstrate that our method achieves state-of-the-art performance on the STL-10 and CIFAR-10/100 datasets while using a negligible number of labeled samples. Notably, this paper also reveals a long-overlooked data leakage issue in the STL-10 dataset for semi-supervised learning tasks and removes duplicates to ensure the reliability of experimental results. Code available at https://github.com/Shu1L0n9/ViTSGMM.
Authors: Shengjie Lin, Jiading Fang, Muhammad Zubair Irshad, Vitor Campagnolo Guizilini, Rares Andrei Ambrus, Greg Shakhnarovich, Matthew R. Walter
Abstract: Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SplArt, a self-supervised, category-agnostic framework that leverages 3D Gaussian Splatting (3DGS) to reconstruct articulated objects and infer kinematics from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SplArt augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SplArt exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors. Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SplArt's state-of-the-art performance and real-world practicality. Code is publicly available at https://github.com/ripl/splart.
Authors: Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye
Abstract: Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to a single, implicit preference distribution. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often produces biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. RewardAnything achieves state-of-the-art performance on traditional RM benchmarks simply by being given a well-defined principle, and results on RABench show that it excels at adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods, and we demonstrate through a case study how LLMs can be aligned automatically and efficiently using only natural language principles.
Authors: Leonardo Martins Bianco (LMO), Christine Keribin (LMO), Zacharie Naulet (INRAE, MaIAGE)
Abstract: Community detection is a fundamental task in graph analysis, with methods often relying on fitting models like the Stochastic Block Model (SBM) to observed networks. While many algorithms can accurately estimate SBM parameters when the input graph is a perfect sample from the model, real-world graphs rarely conform to such idealized assumptions. Therefore, robust algorithms are crucial: ones that can recover model parameters even when the data deviates from the assumed distribution. In this work, we propose SubSearch, an algorithm for robustly estimating SBM parameters by exploring the space of subgraphs in search of one that closely aligns with the model's assumptions. Our approach also functions as an outlier detection method, properly identifying nodes responsible for the graph's deviation from the model and going beyond simple techniques like pruning high-degree nodes. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method.
Authors: Valerie Krug, Sebastian Stober
Abstract: Deep Learning models have achieved remarkable success. Training them is often accelerated by building on top of pre-trained models, which poses the risk of perpetuating encoded biases. Here, we investigate biases in the representations of commonly used ImageNet classifiers for facial images, considering intersections of the sensitive variables age, race, and gender. To assess the biases, we use linear classifier probes and visualize activations as topographic maps. We find that representations in ImageNet classifiers particularly allow differentiation between ages. To a lesser extent, the models also appear to associate certain ethnicities and to distinguish genders in middle-aged groups.
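A linear classifier probe of the kind mentioned above can be fit in a few lines; the sketch below (ours, with random placeholder activations and labels standing in for the study's data) shows the generic recipe:

```python
# Generic linear-probe recipe: fit a linear classifier on frozen activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))    # placeholder penultimate-layer activations
age_group = rng.integers(0, 3, 1000)   # placeholder sensitive attribute labels

X_tr, X_te, y_tr, y_te = train_test_split(acts, age_group, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(probe.score(X_te, y_te))
```

Probe accuracy well above chance indicates that the attribute is linearly decodable from, i.e., encoded in, the representation.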
Authors: Ivan Melev, Goeran Kauermann
Abstract: Due to their intuitive appeal, Bayesian methods of modeling and uncertainty quantification have become popular in modern machine and deep learning. When providing a prior distribution over the parameter space, it is straightforward to obtain a distribution over the parameters that is conventionally interpreted as uncertainty quantification of the model. We challenge the validity of such Bayesian uncertainty quantification by discussing the equivalent optimization-based representation of Bayesian updating, provide an alternative interpretation that is coherent with the optimization-based perspective, propose measures of the quality of the Bayesian inferential stage, and suggest directions for future work.
Authors: Sobihan Surendran (LPSM), Adeline Fermanian (LPSM), Sylvain Le Corff (LPSM)
Abstract: Combinatorial Optimization problems are widespread in domains such as logistics, manufacturing, and drug discovery, yet their NP-hard nature makes them computationally challenging. Recent Neural Combinatorial Optimization methods leverage deep learning to learn solution strategies, trained via Supervised or Reinforcement Learning (RL). While promising, these approaches often rely on task-specific augmentations, perform poorly on out-of-distribution instances, and lack robust inference mechanisms. Moreover, existing latent space models either require labeled data or rely on pre-trained policies. In this work, we propose LGS-Net, a novel latent space model that conditions on problem instances, and introduce an efficient inference method, Latent Guided Sampling (LGS), based on Markov Chain Monte Carlo and Stochastic Approximation. We show that the iterations of our method form a time-inhomogeneous Markov Chain and provide rigorous theoretical convergence guarantees. Empirical results on benchmark routing tasks show that our method achieves state-of-the-art performance among RL-based approaches.
Authors: Melika Ayoughi, Samira Abnar, Chen Huang, Chris Sandino, Sayeri Lala, Eeshan Gunesh Dhekane, Dan Busbridge, Shuangfei Zhai, Vimal Thilak, Josh Susskind, Pascal Mettes, Paul Groth, Hanlin Goh
Abstract: The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images: an off-grid structural relative positioning process that generalizes beyond occlusions and deformations. In tasks requiring precise spatial understanding, such as object detection and time series prediction, PART outperforms strong grid-based methods like MAE and DropPos, while also maintaining competitive performance on global classification tasks with minimal hyperparameter tuning. By breaking free from grid constraints, PART opens up an exciting new trajectory for universal self-supervised pretraining across diverse data types, from natural images to EEG signals, with promising potential in video, medical imaging, and audio.
Authors: Swagat Kumar, Jan-Nico Zaech, Colin Michael Wilmott, Luc Van Gool
Abstract: Variational Quantum Algorithms (VQAs) are a promising approach for leveraging powerful Noisy Intermediate-Scale Quantum (NISQ) computers. When applied to machine learning tasks, VQAs give rise to NISQ-compatible Quantum Neural Networks (QNNs), which have been shown to outperform classical neural networks with a similar number of trainable parameters. While the quantum circuit structures of VQAs for physics simulations are determined by the physical properties of the systems, identifying effective QNN architectures for general machine learning tasks is a difficult challenge due to the lack of domain-specific priors. Indeed, existing Quantum Architecture Search (QAS) algorithms, adaptations of classical neural architecture search techniques, often overlook the inherent quantum nature of the circuits they produce. Approaching QAS from the ground up and from a quantum perspective, we resolve this limitation by proposing $\rho$DARTS, a differentiable QAS algorithm that models the search process as the evolution of a quantum mixed state, emerging from the search space of quantum architectures. We validate our method by finding circuits for state initialization, Hamiltonian optimization, and image classification. Further, we demonstrate better convergence against existing QAS techniques and show improved robustness to noise.
Authors: C\'esar Sabater, Sonia Ben Mokhtar, Jan Ramon
Abstract: Achieving differentially private computations in decentralized settings poses significant challenges, particularly regarding accuracy, communication cost, and robustness against information leakage. While cryptographic solutions offer promise, they often suffer from high communication overhead or require centralization in the presence of network failures. Conversely, existing fully decentralized approaches typically rely on relaxed adversarial models or pairwise noise cancellation, the latter suffering from substantial accuracy degradation if parties unexpectedly disconnect. In this work, we propose IncA, a new protocol for fully decentralized mean estimation, a widely used primitive in data-intensive processing. Our protocol, which enforces differential privacy, requires no central orchestration and employs low-variance correlated noise, achieved by incrementally injecting sensitive information into the computation. First, we theoretically demonstrate that, when no parties permanently disconnect, our protocol achieves accuracy comparable to that of a centralized setting, already an improvement over most existing decentralized differentially private techniques. Second, we empirically show that our use of low-variance correlated noise significantly mitigates the accuracy loss experienced by existing techniques in the presence of dropouts.
Authors: R\'ois\'in Luo
Abstract: We present a theoretical framework for deriving the general $n$-th order Fr\'echet derivatives of singular values in real rectangular matrices, by leveraging reduced resolvent operators from Kato's analytic perturbation theory for self-adjoint operators. Deriving closed-form expressions for higher-order derivatives of singular values is notoriously challenging through standard matrix-analysis techniques. To overcome this, we treat a real rectangular matrix as a compact operator on a finite-dimensional Hilbert space, and embed the rectangular matrix into a block self-adjoint operator so that non-symmetric perturbations are captured. Applying Kato's asymptotic eigenvalue expansion to this construction, we obtain a general, closed-form expression for the infinitesimal $n$-th order spectral variations. Specializing to $n=2$ and deploying a Kronecker-product representation with matrix convention yields the Hessian of a singular value, a result not previously available in the literature. By bridging abstract operator-theoretic perturbation theory with matrices, our framework equips researchers with a practical toolkit for higher-order spectral sensitivity studies in random matrix applications (e.g., adversarial perturbation in deep learning).
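For orientation, the classical first-order case that this framework generalizes is the standard perturbation result for a simple singular value $\sigma_i$ of $A$ with unit left and right singular vectors $u_i, v_i$ and perturbation direction $E$:
\[
\left.\frac{d}{dt}\,\sigma_i(A + tE)\right|_{t=0} = u_i^{\top} E\, v_i .
\]
The paper's contribution is the closed-form continuation of this expansion to arbitrary order $n$; the $n=2$ term yields the Hessian mentioned above.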
Authors: Hachem Kadri, Joachim Tomasi, Yuka Hashimoto, Sandrine Anthoine
Abstract: Quantum kernels are reproducing kernel functions built using quantum-mechanical principles and are studied with the aim of outperforming their classical counterparts. The enthusiasm for quantum kernel machines has been tempered by recent studies that have suggested that quantum kernels could not offer speed-ups when learning on classical data. However, most of the research in this area has been devoted to scalar-valued kernels in standard classification or regression settings for which classical kernel methods are efficient and effective, leaving very little room for improvement with quantum kernels. This position paper argues that quantum kernel research should focus on more expressive kernel classes. We build upon recent advances in operator-valued kernels, and propose guidelines for investigating quantum kernels. This should help to design a new generation of quantum kernel machines and fully explore their potentials.
Authors: Hasan Fallahgoul
Abstract: Recent advances in machine learning have shown promising results for financial prediction using large, over-parameterized models. This paper provides theoretical foundations and empirical validation for understanding when and how these methods achieve predictive success. I examine three key aspects of high-dimensional learning in finance. First, I prove that within-sample standardization in Random Fourier Features implementations fundamentally alters the underlying Gaussian kernel approximation, replacing shift-invariant kernels with training-set dependent alternatives. Second, I derive sample complexity bounds showing when reliable learning becomes information-theoretically impossible under weak signal-to-noise ratios typical in finance. Third, VC-dimension analysis reveals that ridgeless regression's effective complexity is bounded by sample size rather than nominal feature dimension. Comprehensive numerical validation confirms these theoretical predictions, revealing systematic breakdown of claimed theoretical properties across realistic parameter ranges. These results show that when sample size is small and features are high-dimensional, observed predictive success is necessarily driven by low-complexity artifacts, not genuine high-dimensional learning.
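For reference, the standard Random Fourier Features construction at issue in the first result, together with the within-sample standardization step, can be sketched as follows (our illustration on synthetic data, not the paper's experimental code):

```python
# Random Fourier Features (Rahimi & Recht) approximating a Gaussian kernel,
# followed by the within-sample standardization the paper analyzes.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic predictors
D, gamma = 500, 1.0             # number of random features, kernel width

W = rng.normal(scale=np.sqrt(2 * gamma), size=(10, D))
b = rng.uniform(0, 2 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)  # z(x)^T z(y) ~= exp(-gamma * ||x - y||^2)

# Standardizing with training-set statistics makes the implied kernel
# data-dependent -- the departure from shift invariance studied in the paper.
Z_std = (Z - Z.mean(axis=0)) / Z.std(axis=0)
```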
Authors: Penelope Madysa, Sabrina Appel, Verena Kain, Michael Schenk
Abstract: Geoff is a collection of Python packages that form a framework for automation of particle accelerator controls. With particle accelerator laboratories around the world researching machine learning techniques to improve accelerator performance and uptime, a multitude of approaches and algorithms have emerged. The purpose of Geoff is to harmonize these approaches and to minimize friction when comparing or migrating between them. It provides standardized interfaces for optimization problems, utility functions to speed up development, and a reference GUI application that ties everything together. Geoff is an open-source library developed at CERN and maintained and updated in collaboration between CERN and GSI as part of the EURO-LABS project. This paper gives an overview of Geoff's design, features, and current usage.
Authors: Peter Pfeiffer, Alexander Rombach, Maxim Majlatow, Nijat Mehdiyev
Abstract: Traditional Business Process Management (BPM) struggles with rigidity, opacity, and scalability in dynamic environments while emerging Large Language Models (LLMs) present transformative opportunities alongside risks. This paper explores four real-world use cases that demonstrate how LLMs, augmented with trustworthy process intelligence, redefine process modeling, prediction, and automation. Grounded in early-stage research projects with industrial partners, the work spans manufacturing, modeling, life-science, and design processes, addressing domain-specific challenges through human-AI collaboration. In manufacturing, an LLM-driven framework integrates uncertainty-aware explainable Machine Learning (ML) with interactive dialogues, transforming opaque predictions into auditable workflows. For process modeling, conversational interfaces democratize BPMN design. Pharmacovigilance agents automate drug safety monitoring via knowledge-graph-augmented LLMs. Finally, sustainable textile design employs multi-agent systems to navigate regulatory and environmental trade-offs. We intend to examine tensions between transparency and efficiency, generalization and specialization, and human agency versus automation. By mapping these trade-offs, we advocate for context-sensitive integration prioritizing domain needs, stakeholder values, and iterative human-in-the-loop workflows over universal solutions. This work provides actionable insights for researchers and practitioners aiming to operationalize LLMs in critical BPM environments.
Authors: Marc Aurel Vischer, Noelia Otero, Jackie Ma
Abstract: We present a dataset for rainfall-streamflow modeling that is fully spatially resolved, with the aim of taking neural network-driven hydrological modeling beyond lumped catchments. To this end, we compiled data covering five river basins in central Europe: upper Danube, Elbe, Oder, Rhine, and Weser. The dataset contains meteorological forcings, as well as ancillary information on soil, rock, land cover, and orography. The data is harmonized to a regular 9 km × 9 km grid and contains daily values that span from October 1981 to September 2011. We also provide code to further combine our dataset with publicly available river discharge data for end-to-end rainfall-streamflow modeling.
Authors: Xiao-Qi Han, Ze-Feng Gao, Xin-De Wang, Zhenfeng Ouyang, Peng-Jie Guo, Zhong-Yi Lu
Abstract: The discovery of high-temperature superconducting materials holds great significance for human industry and daily life. In recent years, research on predicting superconducting transition temperatures using artificial intelligence (AI) has gained popularity, with most of these tools claiming to achieve remarkable accuracy. However, the lack of widely accepted benchmark datasets in this field has severely hindered fair comparisons between different AI algorithms and impeded further advancement of these methods. In this work, we present HTSC-2025, an ambient-pressure high-temperature superconducting benchmark dataset. This comprehensive compilation encompasses theoretically predicted superconducting materials discovered by theoretical physicists from 2023 to 2025 based on BCS superconductivity theory, including the renowned X$_2$YH$_6$ system, perovskite MXH$_3$ system, M$_3$XH$_8$ system, cage-like BCN-doped metal atomic systems derived from LaH$_{10}$ structural evolution, and two-dimensional honeycomb-structured systems evolving from MgB$_2$. The HTSC-2025 benchmark has been open-sourced at https://github.com/xqh19970407/HTSC-2025 and will be continuously updated. This benchmark holds significant importance for accelerating the discovery of superconducting materials using AI-based methods.
Authors: Benjamin Dupuis, Dario Shariatian, Maxime Haddouche, Alain Durmus, Umut Simsekli
Abstract: Score-based generative models (SGMs) have emerged as one of the most popular classes of generative models. A substantial body of work now exists on the analysis of SGMs, focusing either on discretization aspects or on their statistical performance. In the latter case, bounds have been derived, under various metrics, between the true data distribution and the distribution induced by the SGM, often demonstrating polynomial convergence rates with respect to the number of training samples. However, these approaches adopt a largely approximation theory viewpoint, which tends to be overly pessimistic and relatively coarse. In particular, they fail to fully explain the empirical success of SGMs or capture the role of the optimization algorithm used in practice to train the score network. To support this observation, we first present simple experiments illustrating the concrete impact of optimization hyperparameters on the generalization ability of the generated distribution. Then, this paper aims to bridge this theoretical gap by providing the first algorithmic- and data-dependent generalization analysis for SGMs. In particular, we establish bounds that explicitly account for the optimization dynamics of the learning algorithm, offering new insights into the generalization behavior of SGMs. Our theoretical findings are supported by empirical results on several datasets.
Authors: Hao Li, Qi Lv, Rui Shao, Xiang Deng, Yinchuan Li, Jianye Hao, Liqiang Nie
Abstract: Transforming complex actions into discrete skill abstractions has demonstrated strong potential for robotic manipulation. Existing approaches mainly leverage latent variable models, e.g., VQ-VAE, to learn skill abstractions through learned vectors (codebooks), but they suffer from codebook collapse and struggle to model the causal relationships between learned skills. To address these limitations, we present \textbf{S}kill \textbf{T}raining with \textbf{A}ugmented \textbf{R}otation (\textbf{STAR}), a framework that advances both skill learning and composition to complete complex behaviors. Specifically, to prevent codebook collapse, we devise rotation-augmented residual skill quantization (RaRSQ), which encodes relative angles between encoder outputs into the gradient flow via a rotation-based gradient mechanism. Points within the same skill code are forced to be either pushed apart or pulled closer together depending on gradient directions. Further, to capture the causal relationship between skills, we present the causal skill transformer (CST), which explicitly models dependencies between skill representations through an autoregressive mechanism for coherent action generation. Extensive experiments demonstrate the superiority of STAR on both the LIBERO benchmark and real-world tasks, with around a 12\% improvement over the baselines.
Authors: Claire Barale, Michael Rovatsos, Nehal Bhuta
Abstract: Legal decisions are increasingly evaluated for fairness, consistency, and bias using machine learning (ML) techniques. In high-stakes domains like refugee adjudication, such methods are often applied to detect disparities in outcomes. Yet it remains unclear whether statistical methods can meaningfully assess fairness in legal contexts shaped by discretion, normative complexity, and limited ground truth. In this paper, we empirically evaluate three common ML approaches (feature-based analysis, semantic clustering, and predictive modeling) on a large, real-world dataset of 59,000+ Canadian refugee decisions (AsyLex). Our experiments show that these methods produce divergent and sometimes contradictory signals, that predictive modeling often depends on contextual and procedural features rather than legal features, and that semantic clustering fails to capture substantive legal reasoning. We show limitations of statistical fairness evaluation, challenge the assumption that statistical regularity equates to fairness, and argue that current computational approaches fall short of evaluating fairness in legally discretionary domains. We argue that evaluating fairness in law requires methods grounded not only in data, but in legal reasoning and institutional context.
Authors: Cl\'ement Elvira, Th\'eo Guyard, C\'edric Herzet
Abstract: We present a generic Branch-and-Bound procedure designed to solve L0-penalized optimization problems. Existing approaches primarily focus on quadratic losses and construct relaxations using "Big-M" constraints and/or L2-norm penalties. In contrast, our method accommodates a broader class of loss functions and allows greater flexibility in relaxation design through a general penalty term, encompassing existing techniques as special cases. We establish theoretical results ensuring that all key quantities required for the Branch-and-Bound implementation admit closed-form expressions under the general blanket assumptions considered in our work. Leveraging this framework, we introduce El0ps, an open-source Python solver with a plug-and-play workflow that enables user-defined losses and penalties in L0-penalized problems. Through extensive numerical experiments, we demonstrate that El0ps achieves state-of-the-art performance on classical instances and extends computational feasibility to previously intractable ones.
Authors: Hieu Trung Nguyen, Bao Nguyen, Viet Anh Nguyen
Abstract: Model pruning in transformer-based language models, traditionally viewed as a means of achieving computational savings, can enhance the model's reasoning capabilities. In this work, we uncover a surprising phenomenon: the selective pruning of certain attention heads leads to improvements in reasoning performance, particularly on challenging tasks. Motivated by this observation, we propose SPRINT, a novel contrastive learning framework that dynamically selects the optimal head and layer to prune during inference. By aligning question embeddings with head embeddings, SPRINT identifies those pruned-head configurations that result in more accurate reasoning. Extensive experiments demonstrate that our method significantly outperforms traditional best-of-$N$ and random head selection strategies on the MATH500 and GSM8K datasets.
Authors: Hicham Eddoubi, Jonas Ricker, Federico Cocchi, Lorenzo Baraldi, Angelo Sotgiu, Maura Pintor, Marcella Cornia, Lorenzo Baraldi, Asja Fischer, Rita Cucchiara, Battista Biggio
Abstract: AI-generated images have reached a quality level at which humans are incapable of reliably distinguishing them from real images. To counteract the inherent risk of fraud and disinformation, the detection of AI-generated images is a pressing challenge and an active research topic. While many of the presented methods claim to achieve high detection accuracy, they are usually evaluated under idealized conditions. In particular, the adversarial robustness is often neglected, potentially due to a lack of awareness or the substantial effort required to conduct a comprehensive robustness analysis. In this work, we tackle this problem by providing a simpler means to assess the robustness of AI-generated image detectors. We present RAID (Robust evaluation of AI-generated image Detectors), a dataset of 72k diverse and highly transferable adversarial examples. The dataset is created by running attacks against an ensemble of seven state-of-the-art detectors and images generated by four different text-to-image models. Extensive experiments show that our methodology generates adversarial images that transfer with a high success rate to unseen detectors, which can be used to quickly provide an approximate yet still reliable estimate of a detector's adversarial robustness. Our findings indicate that current state-of-the-art AI-generated image detectors can be easily deceived by adversarial examples, highlighting the critical need for the development of more robust methods. We release our dataset at https://huggingface.co/datasets/aimagelab/RAID and evaluation code at https://github.com/pralab/RAID.
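Given the Hugging Face path in the abstract, the dataset can presumably be pulled with the standard `datasets` API; the split and column layout are not documented here and should be checked against the dataset card:

```python
# Loading sketch for the released dataset; the Hub path comes from the paper,
# but split/column names are assumptions -- consult the dataset card first.
from datasets import load_dataset

raid = load_dataset("aimagelab/RAID")
print(raid)  # inspect the available splits and columns
```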
Authors: Fernando de Meer Pardo, Branka Hadji Misheva, Martin Braschler, Kurt Stockinger
Abstract: We present TransClean, a method for detecting false positive predictions of entity matching algorithms under real-world conditions characterized by large-scale, noisy, and unlabeled multi-source datasets that undergo distributional shifts. TransClean is explicitly designed to operate with multiple data sources in an efficient, robust and fast manner while accounting for edge cases and requiring limited manual labeling. TransClean leverages the Transitive Consistency of a matching, a measure of the consistency of a pairwise matching model $f_\theta$ on the matching $G_{f_\theta}$ it produces, based both on its predictions on directly evaluated record pairs and its predictions on implied record pairs. TransClean iteratively modifies a matching by gradually removing false positive matches while removing as few true positive matches as possible. In each of these steps, the estimation of the Transitive Consistency is done exclusively through model evaluations and produces quantities that can be used as proxies of the amounts of true and false positives in the matching while not requiring any manual labeling, yielding an estimate of the quality of the matching and indicating which record groups are likely to contain false positives. In our experiments, we compare combining TransClean with a naively trained pairwise matching model (DistilBERT) and with a state-of-the-art end-to-end matching method (CLER), and illustrate the flexibility of TransClean in being able to detect most of the false positives of either setup across a variety of datasets. Our experiments show that TransClean induces an average +24.42 F1 score improvement for entity matching in a multi-source setting when compared to traditional pair-wise matching algorithms.
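The notion of implied record pairs can be illustrated with a toy graph (our simplification, not the paper's estimator): pairs connected through transitivity but not directly predicted as matches are the natural places to look for false-positive links in a group:

```python
# Toy transitive-consistency check over a predicted matching graph.
from itertools import combinations
import networkx as nx

predicted_match = {("a", "b"), ("b", "c"), ("d", "e")}  # direct model matches
g = nx.Graph(predicted_match)

for group in nx.connected_components(g):
    for x, y in combinations(sorted(group), 2):
        if not g.has_edge(x, y):
            # Implied pair with no direct match prediction; in TransClean's
            # spirit, the model is re-evaluated on such pairs, and
            # disagreement lowers the group's consistency score.
            print("implied but unverified pair:", (x, y))  # -> ('a', 'c')
```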
Authors: Adam Ran\c{c}on, Ulysse Ran\c{c}on, Tomislav Ivek, Ivan Balog
Abstract: We explore how minimal neural networks can invert the renormalization group (RG) coarse-graining procedure in the two-dimensional Ising model, effectively "dreaming up" microscopic configurations from coarse-grained states. This task, formally impossible at the level of individual configurations, can be approached probabilistically, allowing machine learning models to reconstruct scale-invariant distributions without relying on microscopic input. We demonstrate that even neural networks with as few as three trainable parameters can learn to generate critical configurations, reproducing the scaling behavior of observables such as magnetic susceptibility, heat capacity, and Binder ratios. A real-space renormalization group analysis of the generated configurations confirms that the models capture not only scale invariance but also reproduce nontrivial eigenvalues of the RG transformation. Surprisingly, we find that increasing network complexity by introducing multiple layers offers no significant benefit. These findings suggest that simple local rules, akin to those generating fractal structures, are sufficient to encode the universality of critical phenomena, opening the door to efficient generative models of statistical ensembles in physics.
Authors: Akshat Naik, Patrick Quinn, Guillermo Bosch, Emma Goun\'e, Francisco Javier Campos Zabala, Jason Ross Brown, Edward James Young
Abstract: As Large Language Model (LLM) agents become more widespread, associated misalignment risks increase. Prior work has examined agents' ability to enact misaligned behaviour (misalignment capability) and their compliance with harmful instructions (misuse propensity). However, the likelihood of agents attempting misaligned behaviours in real-world settings (misalignment propensity) remains poorly understood. We introduce a misalignment propensity benchmark, AgentMisalignment, consisting of a suite of realistic scenarios in which LLM agents have the opportunity to display misaligned behaviour. We organise our evaluations into subcategories of misaligned behaviours, including goal-guarding, resisting shutdown, sandbagging, and power-seeking. We report the performance of frontier models on our benchmark, observing higher misalignment on average when evaluating more capable models. Finally, we systematically vary agent personalities through different system prompts. We find that persona characteristics can dramatically and unpredictably influence misalignment tendencies -- occasionally far more than the choice of model itself -- highlighting the importance of careful system prompt engineering for deployed AI agents. Our work highlights the failure of current alignment methods to generalise to LLM agents, and underscores the need for further propensity evaluations as autonomous systems become more prevalent.
Authors: Neeva Oza, Ishaan Govil, Parul Gupta, Dinesh Khandelwal, Dinesh Garg, Parag Singla
Abstract: LLMs have been extensively used for the task of automated code generation. In this work, we examine the applicability of LLMs for the related but relatively unexplored task of code-equivalence checking, i.e., given two programs, deciding whether they are functionally equivalent or not. This is an important problem since benchmarking code equivalence can play a critical role in evaluating LLM capabilities for tasks such as code re-writing and code translation. Towards this end, we present CETBench - Code Equivalence with Transformations Benchmark, constructed via a repository of programs, where two programs in the repository may be solving the same or different tasks. Each instance in our dataset is obtained by taking a pair of programs in the repository and applying a random series of pre-defined code transformations, resulting in (non-)equivalent pairs. Our analysis of this dataset reveals a surprising finding: very simple code transformations in the underlying pair of programs can result in a significant drop in the performance of SOTA LLMs on the task of code-equivalence checking. To remedy this, we present a simple fine-tuning-based approach to boost LLM performance on the transformed pairs of programs. Our approach to dataset generation is generic: it can be used with repositories of varying program difficulty levels and allows for applying varying numbers as well as kinds of transformations. In our experiments, we perform ablations over the difficulty level of the original programs, as well as the kinds of transformations used in generating pairs for equivalence checking. Our analysis presents deep insights into the working of LLMs for the task of code-equivalence, and points to the fact that they may still be far from what could be termed a semantic understanding of the underlying code.
Authors: Qiyue Xia, J. Michael Herrmann
Abstract: Multi-objective reinforcement learning (MORL) aims at optimising several, often conflicting goals in order to improve flexibility and reliability of RL in practical tasks. This can be achieved by finding diverse policies that are optimal for some objective preferences and non-dominated by optimal policies for other preferences so that they form a Pareto front in the multi-objective performance space. The relation between the multi-objective performance space and the parameter space that represents the policies is generally non-unique. Using a training scheme that is based on a locally linear map between the parameter space and the performance space, we show that an approximate Pareto front can provide an interpretation of the current parameter vectors in terms of the objectives which enables an effective search within contiguous solution domains. Experiments are conducted with and without retraining across different domains, and the comparison with previous methods demonstrates the efficiency of our approach.
Authors: Chengdong Wu, Sven Kirchner, Nils Purschke, Alois C. Knoll
Abstract: The controller is one of the most important modules in the autonomous driving pipeline, ensuring the vehicle reaches its desired position. In this work, a reinforcement learning based lateral control approach is presented that remains effective despite imperfections in the vehicle models caused by measurement errors and simplifications. Our approach ensures comfortable, efficient, and robust control performance while accounting for the interface between the controller and other modules. The controller consists of a conventional Model Predictive Control (MPC)-PID part, which serves as the basis and the demonstrator, and a Deep Reinforcement Learning (DRL) part, which leverages online information from the MPC-PID part. The controller's performance is evaluated in CARLA using the ground truth of the waypoints as inputs. Experimental results demonstrate the effectiveness of the controller when vehicle information is incomplete, and that the training of DRL can be stabilized with the demonstration part. These findings highlight the potential to reduce development and integration efforts for autonomous driving pipelines in the future.
Authors: Mikel K. Ngueajio, Flor Miriam Plaza-del-Arco, Yi-Ling Chung, Danda B. Rawat, Amanda Cercas Curry
Abstract: Automated counter-narratives (CN) offer a promising strategy for mitigating online hate speech, yet concerns about their affective tone, accessibility, and ethical risks remain. We propose a framework for evaluating Large Language Model (LLM)-generated CNs across four dimensions: persona framing, verbosity and readability, affective tone, and ethical robustness. Using GPT-4o-Mini, Cohere's CommandR-7B, and Meta's LLaMA 3.1-70B, we assess three prompting strategies on the MT-Conan and HatEval datasets. Our findings reveal that LLM-generated CNs are often verbose and adapted for people with college-level literacy, limiting their accessibility. While emotionally guided prompts yield more empathetic and readable responses, there remain concerns surrounding safety and effectiveness.
Authors: Vu Thi Huong, Ida Litzel, Thorsten Koch
Abstract: Fuzzy clustering, which allows an article to belong to multiple clusters with soft membership degrees, plays a vital role in analyzing publication data. This problem can be formulated as a constrained optimization model, where the goal is to minimize the discrepancy between the similarity observed from data and the similarity derived from a predicted distribution. While this approach benefits from leveraging state-of-the-art optimization algorithms, tailoring them to work with real, massive databases like OpenAlex or Web of Science (containing about 70 million articles and a billion citations) poses significant challenges. We analyze the potential and challenges of the approach from both mathematical and computational perspectives. Among other things, second-order optimality conditions are established, providing new theoretical insights, and practical solution methods are proposed by exploiting the structure of the problem. Specifically, we accelerate the gradient projection method using GPU-based parallel computing to efficiently handle large-scale data.
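A minimal sketch of one projected-gradient step for simplex-constrained soft memberships follows; the squared-discrepancy loss and toy sizes are illustrative stand-ins for the paper's actual model, while the row-wise projection is the standard sort-based Euclidean projection onto the probability simplex.

```python
# A sketch of a projected-gradient step for fuzzy membership matrices whose
# rows must lie on the probability simplex. Loss and shapes are illustrative.
import torch

def project_simplex(u):
    """Euclidean projection of each row of u onto the probability simplex."""
    s, _ = torch.sort(u, dim=1, descending=True)
    cssv = s.cumsum(dim=1) - 1.0
    ks = torch.arange(1, u.shape[1] + 1, device=u.device, dtype=u.dtype)
    cond = s - cssv / ks > 0
    rho = cond.float().cumsum(dim=1).argmax(dim=1, keepdim=True)  # last True
    theta = cssv.gather(1, rho) / (rho + 1).to(u.dtype)
    return torch.clamp(u - theta, min=0.0)

device = "cuda" if torch.cuda.is_available() else "cpu"
n, k = 1000, 10                        # articles x clusters (toy sizes)
S = torch.rand(n, n, device=device)    # observed similarity (placeholder)
U = torch.full((n, k), 1.0 / k, device=device, requires_grad=True)

loss = ((U @ U.T - S) ** 2).mean()     # discrepancy between similarities
loss.backward()
with torch.no_grad():                  # one projected-gradient step
    U.copy_(project_simplex(U - 0.1 * U.grad))
```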
Authors: Paul Fuchs, Weilong Chen, Stephan Thaler, Julija Zavadlav
Abstract: Machine learning potentials (MLPs) have advanced rapidly and show great promise to transform molecular dynamics (MD) simulations. However, most existing software tools are tied to specific MLP architectures, lack integration with standard MD packages, or are not parallelizable across GPUs. To address these challenges, we present chemtrain-deploy, a framework that enables model-agnostic deployment of MLPs in LAMMPS. chemtrain-deploy supports any JAX-defined semi-local potential, allowing users to exploit the functionality of LAMMPS and perform large-scale MLP-based MD simulations on multiple GPUs. It achieves state-of-the-art efficiency and scales to systems containing millions of atoms. We validate its performance and scalability using graph neural network architectures, including MACE, Allegro, and PaiNN, applied to a variety of systems, such as liquid-vapor interfaces, crystalline materials, and solvated peptides. Our results highlight the practical utility of chemtrain-deploy for real-world, high-performance simulations and provide guidance for MLP architecture selection and future design.
Authors: Alex Sotiropoulos, Sulyab Thottungal Valapu, Linus Lei, Jared Coleman, Bhaskar Krishnamachari
Abstract: Large Language Models (LLMs) increasingly rely on Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to align model responses with human preferences. While RLHF employs a reinforcement learning approach with a separate reward model, SFT uses human-curated datasets for supervised learning. Both approaches traditionally depend on small, vetted groups of annotators, making them costly, prone to bias, and limited in scalability. We propose an open, crowd-sourced fine-tuning framework that addresses these limitations by enabling broader feedback collection for SFT without extensive annotator training. Our framework promotes incentive fairness via a point-based reward system correlated with Shapley values and guides model convergence through iterative model updates. Our multi-model selection framework demonstrates up to a 55% reduction in target distance over single-model selection, and subsequent experiments validate that our point-based reward mechanism aligns closely with Shapley values (a well-established method for attributing individual contributions), thereby supporting fair and scalable participation.
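For readers unfamiliar with Shapley values, the sketch below shows a standard Monte Carlo estimator of one contributor's value; the utility function is a placeholder for the framework's actual model-improvement metric, not its implementation.

```python
# Monte Carlo Shapley estimation of one contributor's value: average the
# contributor's marginal gain over random coalitions of the other players.
import random

def utility(coalition):
    # Placeholder: in the framework this would score a model fine-tuned on
    # the coalition's feedback (e.g., negative target distance).
    return len(coalition) ** 0.5

def shapley_mc(player, players, n_samples=10_000, seed=0):
    rng = random.Random(seed)
    others = [p for p in players if p != player]
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(others)
        k = rng.randint(0, len(others))       # uniform coalition size
        coalition = set(others[:k])            # uniform coalition of size k
        total += utility(coalition | {player}) - utility(coalition)
    return total / n_samples                   # unbiased Shapley estimate

players = list(range(8))
print(shapley_mc(player=0, players=players))
```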
Authors: Pedro Henrique Martins, João Alves, Patrick Fernandes, Nuno M. Guerreiro, Ricardo Rei, Amin Farajian, Mateusz Klimaszewski, Duarte M. Alves, José Pombal, Manuel Faysse, Pierre Colombo, François Yvon, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins
Abstract: This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.
Authors: Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, Bo Jin
Abstract: We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning.
Authors: Disha Sheshanarayana, Tanishka Magar, Ayushi Mittal, Neelam Chaplot
Abstract: Courtrooms are places where lives are determined and fates are sealed, yet they are not impervious to manipulation. Strategic use of manipulation in legal jargon can sway the opinions of judges and affect the decisions. Despite the growing advancements in NLP, its application in detecting and analyzing manipulation within the legal domain remains largely unexplored. Our work addresses this gap by introducing LegalCon, a dataset of 1,063 annotated courtroom conversations labeled for manipulation detection, identification of primary manipulators, and classification of manipulative techniques, with a focus on long conversations. Furthermore, we propose CLAIM, a two-stage, Intent-driven Multi-agent framework designed to enhance manipulation analysis by enabling context-aware and informed decision-making. Our results highlight the potential of incorporating agentic frameworks to improve fairness and transparency in judicial processes. We hope that this contributes to the broader application of NLP in legal discourse analysis and the development of robust tools to support fairness in legal decision-making. Our code and data are available at https://github.com/Disha1001/CLAIM.
Authors: Jiaheng Hu, Peter Stone, Roberto Martín-Martín
Abstract: Building capable household and industrial robots requires mastering the control of versatile, high-degree-of-freedom (DoF) systems such as mobile manipulators. While reinforcement learning (RL) holds promise for autonomously acquiring robot control policies, scaling it to high-DoF embodiments remains challenging. Direct RL in the real world demands both safe exploration and high sample efficiency, which are difficult to achieve in practice. Sim-to-real RL, on the other hand, is often brittle due to the reality gap. This paper introduces SLAC, a method that renders real-world RL feasible for complex embodiments by leveraging a low-fidelity simulator to pretrain a task-agnostic latent action space. SLAC trains this latent action space via a customized unsupervised skill discovery method designed to promote temporal abstraction, disentanglement, and safety, thereby facilitating efficient downstream learning. Once a latent action space is learned, SLAC uses it as the action interface for a novel off-policy RL algorithm to autonomously learn downstream tasks through real-world interactions. We evaluate SLAC against existing methods on a suite of bimanual mobile manipulation tasks, where it achieves state-of-the-art performance. Notably, SLAC learns contact-rich whole-body tasks in under an hour of real-world interactions, without relying on any demonstrations or hand-crafted behavior priors. More information, code, and videos at robo-rl.github.io
Authors: Piotr Białas, Piotr Korcyl, Tomasz Stebel, Dawid Zapolski
Abstract: We present an application of autoregressive neural networks to Monte Carlo simulations of quantum spin chains using the correspondence with classical two-dimensional spin systems. We use a hierarchy of neural networks capable of estimating conditional probabilities of consecutive spins to evaluate elements of reduced density matrices directly. Using the Ising chain as an example, we calculate the continuum limit of the ground state's von Neumann and Rényi bipartite entanglement entropies of an interval built of up to 5 spins. We demonstrate that our architecture is able to estimate all the needed matrix elements with just a single training for a fixed time discretization and lattice volume. Our method can be applied to other types of spin chains, possibly with defects, as well as to estimating entanglement entropies of thermal states at non-zero temperature.
Authors: Stephen R. Pfohl, Natalie Harris, Chirag Nagpal, David Madras, Vishwali Mhasawade, Olawale Salaudeen, Awa Dieng, Shannon Sequeira, Santiago Arciniegas, Lillian Sung, Nnamdi Ezeanochie, Heather Cole-Lewis, Katherine Heller, Sanmi Koyejo, Alexander D'Amour
Abstract: Disaggregated evaluation across subgroups is critical for assessing the fairness of machine learning models, but its uncritical use can mislead practitioners. We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of the relevant populations but reflective of real-world disparities. Furthermore, when data are not representative due to selection bias, both disaggregated evaluation and alternative approaches based on conditional independence testing may be invalid without explicit assumptions regarding the bias mechanism. We use causal graphical models to predict metric stability across subgroups under different data generating processes. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift, including conditional independence testing and weighted performance estimation. These findings have broad implications for how practitioners design and interpret model assessments given the ubiquity of disaggregated evaluation.
Authors: Yang Cai, Alkis Kalavasis, Katerina Mamali, Anay Mehrotra, Manolis Zampetakis
Abstract: Most of the widely used estimators of the average treatment effect (ATE) in causal inference rely on the assumptions of unconfoundedness and overlap. Unconfoundedness requires that the observed covariates account for all correlations between the outcome and treatment. Overlap requires the existence of randomness in treatment decisions for all individuals. Nevertheless, many types of studies frequently violate unconfoundedness or overlap, for instance, observational studies with deterministic treatment decisions -- popularly known as Regression Discontinuity designs -- violate overlap. In this paper, we initiate the study of general conditions that enable the identification of the average treatment effect, extending beyond unconfoundedness and overlap. In particular, following the paradigm of statistical learning theory, we provide an interpretable condition that is sufficient and nearly necessary for the identification of ATE. Moreover, this condition characterizes the identification of the average treatment effect on the treated (ATT) and can be used to characterize other treatment effects as well. To illustrate the utility of our condition, we present several well-studied scenarios where our condition is satisfied and, hence, we prove that ATE can be identified in regimes that prior works could not capture. For example, under mild assumptions on the data distributions, this holds for the models proposed by Tan (2006) and Rosenbaum (2002), and the Regression Discontinuity design model introduced by Thistlethwaite and Campbell (1960). For each of these scenarios, we also show that, under natural additional assumptions, ATE can be estimated from finite samples. We believe these findings open new avenues for bridging learning-theoretic insights and causal inference methodologies, particularly in observational studies with complex treatment mechanisms.
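For reference, the two assumptions the paper moves beyond, and the standard identification they yield, can be stated as follows; this is textbook material, not the paper's new condition.

```latex
% Potential outcomes Y(1), Y(0); treatment T; covariates X.
\begin{align*}
\mathrm{ATE} &= \mathbb{E}[Y(1) - Y(0)], \\
\text{Unconfoundedness:}\quad & (Y(1), Y(0)) \perp\!\!\!\perp T \mid X, \\
\text{Overlap:}\quad & 0 < \Pr(T = 1 \mid X = x) < 1 \ \text{for all } x, \\
\Rightarrow\ \mathrm{ATE} &= \mathbb{E}_X\!\big[\mathbb{E}[Y \mid T = 1, X] - \mathbb{E}[Y \mid T = 0, X]\big].
\end{align*}
```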
Authors: Yanting Wang, Wei Zou, Runpeng Geng, Jinyuan Jia
Abstract: Long context large language models (LLMs) are deployed in many real-world applications such as RAG, agents, and broader LLM-integrated applications. Given an instruction and a long context (e.g., documents, PDF files, webpages), a long context LLM can generate an output grounded in the provided context, aiming to provide more accurate, up-to-date, and verifiable outputs while reducing hallucinations and unsupported claims. This raises a research question: how to pinpoint the texts (e.g., sentences, passages, or paragraphs) in the context that contribute most to or are responsible for the output generated by an LLM? This process, which we call context traceback, has various real-world applications, such as 1) debugging LLM-based systems, 2) conducting post-attack forensic analysis of attacks on an LLM (e.g., prompt injection and knowledge corruption attacks), and 3) highlighting knowledge sources to enhance the trust of users towards outputs generated by LLMs. When applied to context traceback for long context LLMs, existing feature attribution methods such as Shapley have sub-optimal performance and/or incur a large computational cost. In this work, we develop TracLLM, the first generic context traceback framework tailored to long context LLMs. Our framework can improve the effectiveness and efficiency of existing feature attribution methods. To improve the efficiency, we develop an informed search based algorithm in TracLLM. We also develop contribution score ensemble/denoising techniques to improve the accuracy of TracLLM. Our evaluation results show TracLLM can effectively identify texts in a long context that lead to the output of an LLM. Our code and data are at: https://github.com/Wang-Yanting/TracLLM.
Authors: Martin Beseda, Vittorio Cortellessa, Daniele Di Pompeo, Luca Traini, Michele Tucci
Abstract: This paper addresses the challenge of accurately detecting the transition from the warmup phase to the steady state in performance metric time series, which is a critical step for effective benchmarking. The goal is to introduce a method that avoids premature or delayed detection, which can lead to inaccurate or inefficient performance analysis. The proposed approach adapts techniques from the chemical reactors domain, detecting steady states online through the combination of kernel-based step detection and statistical methods. By using a window-based approach, it provides detailed information and improves the accuracy of identifying phase transitions, even in noisy or irregular time series. Results show that the new approach reduces total error by 14.5% compared to the state-of-the-art method. It offers more reliable detection of the steady-state onset, delivering greater precision for benchmarking tasks. For users, the new approach enhances the accuracy and stability of performance benchmarking, efficiently handling diverse time series data. Its robustness and adaptability make it a valuable tool for real-world performance evaluation, ensuring consistent and reproducible results.
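A minimal sketch of window-based online steady-state detection follows; the slope test is a simple stand-in for the paper's combination of kernel-based step detection and statistical methods, and the thresholds are illustrative.

```python
# A sketch of window-based steady-state detection: declare steady state once
# the fitted trend over a trailing window is negligible relative to the level.
import numpy as np

def detect_steady_state(series, window=50, rel_tol=0.02):
    """Return the first index at which the trailing window looks steady."""
    for t in range(window, len(series)):
        w = series[t - window:t]
        drift = abs(np.polyfit(np.arange(window), w, 1)[0])  # slope per step
        if drift < rel_tol * (abs(w.mean()) + 1e-12):
            return t
    return None  # never reached steady state

rng = np.random.default_rng(1)
warmup = 5.0 * np.exp(-np.arange(300) / 60.0)     # decaying warmup phase
series = np.concatenate([warmup, np.zeros(300)]) + 1.0 + rng.normal(0, 0.05, 600)
print(detect_steady_state(series))                # index of detected onset
```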
Authors: Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang
Abstract: Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an {\em interactive object-aware audio generation} model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the {\em object} level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: https://tinglok.netlify.app/files/avobject/
Authors: Alex DeWeese, Guannan Qu
Abstract: Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) are known to be NEXP-Complete and intractable to solve. However, for problems such as cooperative navigation, obstacle avoidance, and formation control, basic assumptions can be made about local visibility and local dependencies. The work of DeWeese and Qu (2024) formalized these assumptions in the construction of the Locally Interdependent Multi-Agent MDP. In this setting, it establishes three closed-form policies that are tractable to compute in various situations and are exponentially close to optimal with respect to visibility. However, it is also shown that these solutions can have poor performance when the visibility is small and fixed, often getting stuck during simulations due to the so-called "Penalty Jittering" phenomenon. In this work, we establish the Extended Cutoff Policy Class, which is, to the best of our knowledge, the first non-trivial class of near-optimal closed-form partially observable policies that are exponentially close to optimal with respect to the visibility for any Locally Interdependent Multi-Agent MDP. These policies are able to remember agents beyond their visibilities, which allows them to perform significantly better in many small and fixed visibility settings, resolve Penalty Jittering occurrences, and under certain circumstances guarantee fully observable joint optimal behavior despite the partial observability. We also propose a generalized form of the Locally Interdependent Multi-Agent MDP that allows for transition dependence and extended reward dependence, then replicate our theoretical results in this setting.
Authors: Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta
Abstract: Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV's likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations (R^2=0.8) than the best existing open-loop approach (R^2=0.7). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation. Our code is available at https://github.com/autonomousvision/navsim.
Authors: Zhao-Heng Yin, Sherry Yang, Pieter Abbeel
Abstract: Learning robot control policies from human videos is a promising direction for scaling up robot learning. However, how to extract action knowledge (or action representations) from videos for policy learning remains a key challenge. Existing action representations such as video frames, pixelflow, and pointcloud flow have inherent limitations such as modeling complexity or loss of information. In this paper, we propose to use object-centric 3D motion fields to represent actions for robot learning from human videos, and present a novel framework for extracting this representation from videos for zero-shot control. We introduce two novel components in its implementation. First, a novel training pipeline for training a "denoising" 3D motion field estimator to robustly extract fine object 3D motions from human videos with noisy depth. Second, a dense object-centric 3D motion field prediction architecture that favors both cross-embodiment transfer and policy generalization to background variations. We evaluate the system in real-world setups. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method, achieves a 55% average success rate in diverse tasks where prior approaches fail ($\lesssim 10\%$), and can even acquire fine-grained manipulation skills like insertion.
Authors: Liyan Chen, Philippos Mordohai, Sergul Aydore
Abstract: Dropout as a common regularizer to prevent overfitting in deep neural networks has been less effective in convolutional layers than in fully connected layers. This is because Dropout drops features randomly, without considering local structure. When features are spatially correlated, as in the case of convolutional layers, information from the dropped features can still propagate to subsequent layers via neighboring features. To address this problem, structured forms of Dropout have been proposed. A drawback of these methods is that they do not adapt to the data. In this work, we leverage the structure in the outputs of convolutional layers and introduce a novel structured regularization method named DropCluster. Our approach clusters features in convolutional layers, and drops the resulting clusters randomly during training iterations. Experiments on CIFAR-10/100, SVHN, and APPA-REAL datasets demonstrate that our approach is effective and controls overfitting better than other approaches.
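The following is a minimal sketch of the DropCluster idea: cluster the channels of a convolutional feature map, then drop whole clusters at random during training. Clustering channels by their mean activation profile is an illustrative simplification of the paper's data-driven clustering.

```python
# A sketch of cluster-wise dropout for conv features: channels in a randomly
# selected cluster are zeroed together, so spatially correlated information
# cannot leak through neighboring channels.
import torch
from sklearn.cluster import KMeans

def drop_cluster(feat, n_clusters=4, drop_prob=0.25, training=True):
    """feat: (N, C, H, W) activations from a convolutional layer."""
    if not training:
        return feat
    C = feat.shape[1]
    # cluster channels by mean activation profile (illustrative choice)
    profiles = feat.detach().mean(dim=(0, 2, 3)).cpu().numpy().reshape(C, 1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(profiles)
    keep = torch.ones(C, device=feat.device)
    for c in range(n_clusters):
        if torch.rand(1).item() < drop_prob:          # drop the whole cluster
            keep[torch.from_numpy(labels == c)] = 0.0
    keep = keep / keep.mean().clamp(min=1e-6)         # inverted-dropout rescale
    return feat * keep.view(1, C, 1, 1)

x = torch.randn(8, 32, 16, 16)
print(drop_cluster(x).shape)  # torch.Size([8, 32, 16, 16])
```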
Authors: Xavier Aramayo Carrasco, Maksim Nekrashevich, Petr Mokrov, Evgeny Burnaev, Alexander Korotin
Abstract: Recently, the Gromov-Wasserstein Optimal Transport (GWOT) problem has attracted the special attention of the ML community. In this problem, given two distributions supported on two (possibly different) spaces, one has to find the most isometric map between them. In the discrete variant of GWOT, the task is to learn an assignment between given discrete sets of points. In the more advanced continuous formulation, one aims at recovering a parametric mapping between unknown continuous distributions based on i.i.d. samples derived from them. The clear geometrical intuition behind the GWOT makes it a natural choice for several practical use cases, giving rise to a number of proposed solvers. Some of them claim to solve the continuous version of the problem. At the same time, GWOT is notoriously hard, both theoretically and numerically. Moreover, all existing continuous GWOT solvers still heavily rely on discrete techniques. Natural questions arise: to what extent do existing methods unravel the GWOT problem, what difficulties do they encounter, and under which conditions are they successful? Our benchmark paper is an attempt to answer these questions. We specifically focus on the continuous GWOT as the most interesting and debatable setup. We crash-test existing continuous GWOT approaches on different scenarios, carefully record and analyze the obtained results, and identify issues. Our findings experimentally testify that the scientific community is still missing a reliable continuous GWOT solver, which necessitates further research efforts. As the first step in this direction, we propose a new continuous GWOT method which does not rely on discrete techniques and partially solves some of the problems of the competitors.
Authors: Tianyu Du, Ayush Kanodia, Susan Athey
Abstract: $\texttt{torch-choice}$ is an open-source library for flexible, fast choice modeling with Python and PyTorch. $\texttt{torch-choice}$ provides a $\texttt{ChoiceDataset}$ data structure to manage databases flexibly and memory-efficiently. The paper demonstrates how to construct a $\texttt{ChoiceDataset}$ from databases of various formats and showcases the functionalities of $\texttt{ChoiceDataset}$. The package implements two widely used models, namely the multinomial logit and nested logit models, and supports regularization during model estimation. The package incorporates the option to take advantage of GPUs for estimation, allowing it to scale to massive datasets while being computationally efficient. Models can be initialized using either R-style formula strings or Python dictionaries. We conclude with a comparison of the computational efficiencies of $\texttt{torch-choice}$ and $\texttt{mlogit}$ in R as (1) the number of observations increases, (2) the number of covariates increases, and (3) the item set expands. Finally, we demonstrate the scalability of $\texttt{torch-choice}$ on large-scale datasets.
Authors: Jaeseung Heo, Seungbeom Lee, Sungsoo Ahn, Dongwoo Kim
Abstract: Data augmentation plays a critical role in improving model performance across various domains, but it becomes challenging with graph data due to their complex and irregular structure. To address this issue, we propose EPIC (Edit Path Interpolation via learnable Cost), a novel interpolation-based method for augmenting graph datasets. To interpolate between two graphs lying in an irregular domain, EPIC leverages the concept of graph edit distance, constructing an edit path that represents the transformation process between two graphs via edit operations. Moreover, our method introduces a context-sensitive cost model that accounts for the importance of specific edit operations formulated through a learning framework. This allows for a more nuanced transformation process, where the edit distance is not merely count-based but reflects meaningful graph attributes. With randomly sampled graphs from the edit path, we enrich the training set to enhance the generalization capability of classification models. Experimental evaluations across several benchmark datasets demonstrate that our approach outperforms existing augmentation techniques in many tasks.
Authors: Xinshun Liu, Yizhi Fang, Yichao Jiang
Abstract: This paper offers a new perspective on Artificial Neural Networks (ANNs) architecture. Traditional ANNs commonly use tree-like or DAG structures for simplicity, which can be preset or determined by Neural Architecture Search (NAS). Yet, these structures restrict network collaboration and capability due to the absence of horizontal and backward communication. Biological neural systems, however, feature billions of neural units with highly complex connections, allowing each biological neuron to connect with others based on specific situations. Inspired by biological systems, we propose a novel framework that learns to construct arbitrary graph structures during training and introduce the concept of Neural Modules for organizing neural units, which facilitates communication between any nodes and collaboration among modules. Unlike traditional NAS methods that rely on DAG search spaces, our framework learns from complete graphs, enabling free communication between neurons akin to biological neural networks. Furthermore, we present a method to compute these structures and a regularization technique that organizes them into multiple independent, balanced neural modules. This approach reduces overfitting and improves efficiency through parallel computing. Overall, our method allows ANNs to learn effective arbitrary structures similar to biological ones. It is adaptable to various tasks and compatible across different scenarios, with experimental results demonstrating its potential.
Authors: Marcial Sanchis-Agudo, Yuning Wang, Roger Arnau, Luca Guastoni, Jasmin Lim, Karthik Duraisamy, Ricardo Vinuesa
Abstract: To improve the robustness of transformer neural networks used for temporal-dynamics prediction of chaotic systems, we propose a novel attention mechanism called easy attention, which we demonstrate in time-series reconstruction and prediction. While standard self attention only makes use of the inner product of queries and keys, we demonstrate that the keys, queries, and softmax are not necessary for obtaining the attention score required to capture long-term dependencies in temporal sequences. Through singular-value decomposition (SVD) of the softmax attention score, we further observe that self attention compresses the contributions from both queries and keys in the space spanned by the attention score. Therefore, our proposed easy-attention method directly treats the attention scores as learnable parameters. This approach produces excellent results when reconstructing and predicting the temporal dynamics of chaotic systems, exhibiting more robustness and less complexity than self attention or the widely used long short-term memory (LSTM) network. We show the improved performance of the easy-attention method on the Lorenz system, a turbulent shear flow, and a model of a nuclear reactor.
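A minimal sketch contrasting the two mechanisms follows, assuming a fixed sequence length so the attention scores can be stored as a single learnable matrix; layer sizes and the identity initialization are illustrative choices.

```python
# Easy attention treats the attention-score matrix itself as a learnable
# parameter (no queries, keys, or softmax), in contrast to self attention.
import torch
import torch.nn as nn

class EasyAttention(nn.Module):
    def __init__(self, seq_len, d_model):
        super().__init__()
        self.scores = nn.Parameter(torch.eye(seq_len))  # learnable scores
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return self.scores @ self.value(x)

class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(d_model, d_model) for _ in range(3))

    def forward(self, x):
        a = torch.softmax(self.q(x) @ self.k(x).transpose(-2, -1)
                          / x.shape[-1] ** 0.5, dim=-1)
        return a @ self.v(x)

x = torch.randn(2, 64, 32)
print(EasyAttention(64, 32)(x).shape, SelfAttention(32)(x).shape)
```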
Authors: Zheng Lin, Guanqiao Qu, Qiyuan Chen, Xianhao Chen, Zhe Chen, Kaibin Huang
Abstract: Large language models (LLMs), which have shown remarkable capabilities, are revolutionizing AI development and potentially shaping our future. However, given their multimodality, the status quo cloud-based deployment faces some critical challenges: 1) long response time; 2) high bandwidth costs; and 3) the violation of data privacy. 6G mobile edge computing (MEC) systems may resolve these pressing issues. In this article, we explore the potential of deploying LLMs at the 6G edge. We start by introducing killer applications powered by multimodal LLMs, including robotics and healthcare, to highlight the need for deploying LLMs in the vicinity of end users. Then, we identify the critical challenges for LLM deployment at the edge and envision the 6G MEC architecture for LLMs. Furthermore, we delve into two design aspects, i.e., edge training and edge inference for LLMs. In both aspects, considering the inherent resource limitations at the edge, we discuss various cutting-edge techniques, including split learning/inference, parameter-efficient fine-tuning, quantization, and parameter-sharing inference, to facilitate the efficient deployment of LLMs. This article serves as a position paper for thoroughly identifying the motivation, challenges, and pathway for empowering LLMs at the 6G edge.
Authors: Rudrajit Das, Xi Chen, Bertram Ieong, Parikshit Bansal, Sujay Sanghavi
Abstract: It is well known that selecting samples with large losses/gradients can significantly reduce the number of training steps. However, the selection overhead is often too high to yield any meaningful gains in terms of overall training time. In this work, we focus on the greedy approach of selecting samples with large \textit{approximate losses} instead of exact losses in order to reduce the selection overhead. For smooth convex losses, we show that such a greedy strategy can converge to a constant factor of the minimum value of the average loss in fewer iterations than the standard approach of random selection. We also theoretically quantify the effect of the approximation level. We then develop SIFT, which uses early exiting to obtain approximate losses from an intermediate layer's representations for sample selection. We evaluate SIFT on the task of training a 110M-parameter, 12-layer BERT base model, and show significant gains (in terms of training hours and number of backpropagation steps) without any optimized implementation over vanilla training. For example, to reach 64% validation accuracy, SIFT with exit at the first layer takes ~43 hours compared to ~57 hours of vanilla training.
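A minimal sketch of the selection mechanism follows, with an illustrative toy network rather than the paper's BERT setup: a cheap early-exit head produces approximate losses, and only the hardest samples receive the full forward and backward pass.

```python
# A sketch of early-exit sample selection: score all samples cheaply with an
# intermediate-layer head, then train on the highest-approximate-loss subset.
import torch
import torch.nn as nn

d, n_classes = 64, 10
layer1 = nn.Linear(d, d)                 # "early" part of the network
rest = nn.Sequential(nn.ReLU(), nn.Linear(d, d),
                     nn.ReLU(), nn.Linear(d, n_classes))
exit_head = nn.Linear(d, n_classes)      # cheap head on layer-1 features
loss_fn = nn.CrossEntropyLoss(reduction="none")

x = torch.randn(256, d)
y = torch.randint(0, n_classes, (256,))

with torch.no_grad():                    # cheap selection pass
    approx_losses = loss_fn(exit_head(layer1(x)), y)
keep = approx_losses.topk(64).indices    # keep the 64 hardest samples

logits = rest(layer1(x[keep]))           # full forward/backward on subset
loss_fn(logits, y[keep]).mean().backward()
```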
Authors: Zheng Lin, Guanqiao Qu, Wei Wei, Xianhao Chen, Kin K. Leung
Abstract: The increasing complexity of deep neural networks poses significant barriers to democratizing them to resource-limited edge devices. To address this challenge, split federated learning (SFL) has emerged as a promising solution by offloading the primary training workload to a server via model partitioning while enabling parallel training among edge devices. However, although system optimization substantially influences the performance of SFL under resource-constrained systems, the problem remains largely uncharted. In this paper, we provide a convergence analysis of SFL which quantifies the impact of model splitting (MS) and client-side model aggregation (MA) on the learning performance, serving as a theoretical foundation. Then, we propose AdaptSFL, a novel resource-adaptive SFL framework, to expedite SFL under resource-constrained edge computing systems. Specifically, AdaptSFL adaptively controls client-side MA and MS to balance communication-computing latency and training convergence. Extensive simulations across various datasets validate that our proposed AdaptSFL framework takes considerably less time to achieve a target accuracy than benchmarks, demonstrating the effectiveness of the proposed strategies.
Authors: Jinho Chang, Jong Chul Ye
Abstract: With the emergence of diffusion models as a frontline generative model, many researchers have proposed molecule generation techniques with conditional diffusion models. However, the unavoidable discreteness of a molecule makes it difficult for a diffusion model to connect raw data with highly complex conditions like natural language. To address this, here we present a novel latent diffusion model dubbed LDMol for text-conditioned molecule generation. By recognizing that the suitable latent space design is the key to the diffusion model performance, we employ a contrastive learning strategy to extract novel feature space from text data that embeds the unique characteristics of the molecule structure. Experiments show that LDMol outperforms the existing autoregressive baselines on the text-to-molecule generation benchmark, being one of the first diffusion models that outperforms autoregressive models in textual data generation with a better choice of the latent domain. Furthermore, we show that LDMol can be applied to downstream tasks such as molecule-to-text retrieval and text-guided molecule editing, demonstrating its versatility as a diffusion model.
Authors: Benjamin Thérien, Charles-Étienne Joseph, Boris Knyazev, Edouard Oyallon, Irina Rish, Eugene Belilovsky
Abstract: Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially reducing training costs. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($\mu$P) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for $\mu$-parameterized LOs ($\mu$LOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (SP), as they are trained in existing work. We also empirically observe that $\mu$LOs trained with our recipe exhibit unexpectedly improved meta-generalization to deeper networks ($5\times$ meta-training) and surprising generalization to much longer training horizons ($25\times$ meta-training) when compared to SP LOs.
Authors: Faisal Hamman, Pasan Dissanayake, Saumitra Mishra, Freddy Lecue, Sanghamitra Dutta
Abstract: Fine-tuning LLMs on tabular classification tasks can lead to the phenomenon of fine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same input. Fine-tuning multiplicity can arise due to variations in the training process, e.g., seed, weight initialization, or minor changes to the training data, raising concerns about the reliability of Tabular LLMs in high-stakes applications such as finance, hiring, education, and healthcare. Our work formalizes this unique challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel measure to quantify the consistency of individual predictions without expensive model retraining. Our measure quantifies a prediction's consistency by analyzing (sampling) the model's local behavior around that input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic guarantees on prediction consistency under a broad class of fine-tuned models, i.e., inputs with sufficiently high local stability (as defined by our measure) also remain consistent across several fine-tuned models with high probability. We perform experiments on multiple real-world datasets to show that our local stability measure preemptively captures consistency under actual multiplicity across several fine-tuned models, outperforming competing measures.
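A minimal sketch of such a sampling-based consistency score follows; the Gaussian neighborhood, its radius, and the toy classifier are illustrative assumptions, not the paper's measure.

```python
# A sketch of a local consistency score: perturb an input's embedding and
# measure how often the predicted class stays the same.
import torch
import torch.nn as nn

def local_stability(model, emb, n_samples=200, sigma=0.05):
    with torch.no_grad():
        base = model(emb).argmax(dim=-1)                 # original prediction
        noise = sigma * torch.randn(n_samples, emb.shape[-1])
        preds = model(emb + noise).argmax(dim=-1)        # perturbed predictions
        return (preds == base).float().mean().item()

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
emb = torch.randn(1, 32)            # embedding of one tabular input
print(local_stability(model, emb))  # near 1.0 => likely consistent prediction
```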
Authors: Zhicheng Yang, Yiwei Wang, Yinya Huang, Zhijiang Guo, Wei Shi, Xiongwei Han, Liang Feng, Linqi Song, Xiaodan Liang, Jing Tang
Abstract: Large language models (LLMs) have exhibited their problem-solving abilities in mathematical reasoning. Solving realistic optimization (OPT) problems in application scenarios requires advanced and applied mathematics ability. However, current OPT benchmarks that merely solve linear programming are far from complex realistic situations. In this work, we propose OptiBench, a benchmark for end-to-end optimization problem-solving with human-readable inputs and outputs. OptiBench contains rich optimization problems, including linear and nonlinear programming with or without tabular data, which can comprehensively evaluate LLMs' solving ability. In our benchmark, LLMs are required to call a code solver to provide precise numerical answers. Furthermore, to alleviate the data scarcity for optimization problems, and to bridge the gap between small-scale open-source LLMs (e.g., Llama-3-8b) and closed-source LLMs (e.g., GPT-4), we further propose a data synthesis method named ReSocratic. Unlike general data synthesis methods that proceed from questions to answers, ReSocratic first incrementally synthesizes formatted optimization demonstrations with mathematical formulations step by step and then back-translates the generated demonstrations into questions. Based on this, we synthesize the ReSocratic-29k dataset. We further conduct supervised fine-tuning with ReSocratic-29k on multiple open-source models. Experimental results show that ReSocratic-29k significantly improves the performance of open-source models.
Authors: Ioannis Athanasiadis, Fredrik Lindsten, Michael Felsberg
Abstract: Variational Autoencoders (VAEs) are a popular framework for unsupervised learning and data generation. A plethora of methods have been proposed focusing on improving VAEs, with the incorporation of adversarial objectives and the integration of prior learning mechanisms being prominent directions. When it comes to the former, an indicative instance is the recently introduced family of Introspective VAEs aiming at ensuring that a low likelihood is assigned to unrealistic samples. In this study, we focus on the Soft-IntroVAE (S-IntroVAE), one of only two members of the Introspective VAE family, the other being the original IntroVAE. We select S-IntroVAE for its state-of-the-art status and its training stability. In particular, we investigate the implications of incorporating a multimodal and trainable prior into S-IntroVAE. Namely, we formulate the prior as a third player and show that, when trained in cooperation with the decoder, it constitutes an effective way of performing prior learning, which shares the Nash Equilibrium with the vanilla S-IntroVAE. Furthermore, based on a modified formulation of the optimal ELBO in S-IntroVAE, we develop theoretically motivated regularizations, namely (i) adaptive variance clipping to stabilize training when learning the prior and (ii) responsibility regularization to discourage the formation of inactive prior modes. Finally, we perform a series of targeted experiments on a 2D density estimation benchmark and in an image generation setting comprised of the (F)-MNIST and CIFAR-10 datasets, demonstrating the effect of prior learning in S-IntroVAE on generation and representation learning.
Authors: Michał Wiliński, Mononito Goswami, Willa Potosnak, Nina Żukowska, Artur Dubrawski
Abstract: Time series foundation models (TSFMs) promise to be powerful tools for a wide range of applications. However, their internal representations and learned concepts are still not well understood. In this study, we investigate the structure and redundancy of representations across various TSFMs, examining the self-similarity of model layers within and across different model sizes. This analysis reveals block-like redundancy in the representations, which can be utilized for informed pruning to improve inference speed and efficiency. Additionally, we explore the concepts learned by these models, such as periodicity and trends, and how these can be manipulated through latent space steering to influence model behavior. Our experiments show that steering interventions can introduce new features, e.g., adding periodicity or trends to signals that initially lacked them. These findings underscore the value of representational analysis for optimizing models and demonstrate how conceptual steering offers new possibilities for more controlled and efficient time series analysis with TSFMs.
Authors: Vladimír Boža, Vladimír Macko
Abstract: Neural networks are often challenging to work with due to their large size and complexity. To address this, various methods aim to reduce model size by sparsifying or decomposing weight matrices, such as magnitude pruning and low-rank or block-diagonal factorization. In this work, we present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices. Although solving this problem exactly is computationally infeasible, we propose an efficient heuristic based on alternating minimization via ADMM that achieves state-of-the-art results, enabling unprecedented sparsification of neural networks. For instance, in a one-shot pruning setting, our method can reduce the size of the LLaMA2-13B model by 50% while maintaining better performance than the dense LLaMA2-7B model. We also compare favorably with Optimal Brain Compression, the state-of-the-art layer-wise pruning approach for convolutional neural networks. Furthermore, accuracy improvements of our method persist even after further model fine-tuning. Code available at: https://github.com/usamec/double_sparse.
Authors: Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, Nicolas Le Roux
Abstract: Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without it, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimates of expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across the MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLMs.
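A minimal sketch of the Monte Carlo value estimate that replaces the value network: from an intermediate reasoning state, sample several independent completions and average their rewards. The rollout and reward functions below are hypothetical stand-ins for the policy sampler and answer checker.

```python
# Monte Carlo value estimation: V(state) is approximated by the mean reward
# of k independent completions sampled from the current policy.
import random

def sample_completion(policy, state, rng):
    return state + policy(state, rng)       # placeholder rollout

def reward(trajectory):
    return float("42" in trajectory)         # placeholder answer check

def mc_value(policy, state, k=8, seed=0):
    rng = random.Random(seed)
    returns = [reward(sample_completion(policy, state, rng)) for _ in range(k)]
    return sum(returns) / k                  # unbiased estimate of V(state)

toy_policy = lambda state, rng: rng.choice([" ... 42", " ... 41"])
print(mc_value(toy_policy, state="Q: 6*7 = ?"))
```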
Authors: Navdeep Kumar, Priyank Agrawal, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor
Abstract: We analyze the global convergence of the single-timescale actor-critic (AC) algorithm for the infinite-horizon discounted Markov Decision Processes (MDPs) with finite state spaces. To this end, we introduce an elegant analytical framework for handling complex, coupled recursions inherent in the algorithm. Leveraging this framework, we establish that the algorithm converges to an $\epsilon$-close \textbf{globally optimal} policy with a sample complexity of \( O(\epsilon^{-3}) \). This significantly improves upon the existing complexity of $O(\epsilon^{-2})$ to achieve $\epsilon$-close \textbf{stationary policy}, which is equivalent to the complexity of $O(\epsilon^{-4})$ to achieve $\epsilon$-close \textbf{globally optimal} policy using gradient domination lemma. Furthermore, we demonstrate that to achieve this improvement, the step sizes for both the actor and critic must decay as \( O(k^{-\frac{2}{3}}) \) with iteration $k$, diverging from the conventional \( O(k^{-\frac{1}{2}}) \) rates commonly used in (non)convex optimization.
Authors: Zachary Chase, Idan Mehalel
Abstract: In binary ($0/1$) online classification with apple tasting feedback, the learner receives feedback only when predicting $1$. Besides some degenerate learning tasks, all previously known learning algorithms for this model are randomized. Consequently, prior to this work it was unknown whether deterministic apple tasting is generally feasible. In this work, we provide the first widely-applicable deterministic apple tasting learner, and show that in the realizable case, a hypothesis class is learnable if and only if it is deterministically learnable, confirming a conjecture of [Raman, Subedi, Raman, Tewari-24]. Quantitatively, we show that every class $\mathcal{H}$ is learnable with mistake bound $O \left(\sqrt{\mathtt{L}(\mathcal{H}) T \log T} \right)$ (where $\mathtt{L}(\mathcal{H})$ is the Littlestone dimension of $\mathcal{H}$), and that this is tight for some classes. We further study the agnostic case, in which the best hypothesis makes at most $k$ many mistakes, and prove a trichotomy stating that every class $\mathcal{H}$ must be either easy, hard, or unlearnable. Easy classes have (both randomized and deterministic) mistake bound $\Theta_{\mathcal{H}}(k)$. Hard classes have randomized mistake bound $\tilde{\Theta}_{\mathcal{H}} \left(k + \sqrt{T} \right)$, and deterministic mistake bound $\tilde{\Theta}_{\mathcal{H}} \left(\sqrt{k \cdot T} \right)$, where $T$ is the time horizon. Unlearnable classes have (both randomized and deterministic) mistake bound $\Theta(T)$. Our upper bound is based on a deterministic algorithm for learning from expert advice with apple tasting feedback, a problem interesting in its own right. For this problem, we show that the optimal deterministic mistake bound is $\Theta \left(\sqrt{T (k + \log n)} \right)$ for all $k$ and $T \leq n \leq 2^T$, where $n$ is the number of experts.
Authors: Yutong He, Pengrui Li, Yipeng Hu, Chuyan Chen, Kun Yuan
Abstract: Subspace optimization algorithms, such as GaLore (Zhao et al., 2024), have gained attention for pre-training and fine-tuning large language models (LLMs) due to their memory efficiency. However, their convergence guarantees remain unclear, particularly in stochastic settings. In this paper, we reveal that GaLore does not always converge to the optimal solution and provide an explicit counterexample to support this finding. We further explore the conditions under which GaLore achieves convergence, showing that it does so when either (i) a sufficiently large mini-batch size is used or (ii) the gradient noise is isotropic. More significantly, we introduce GoLore (Gradient random Low-rank projection), a novel variant of GaLore that provably converges in typical stochastic settings, even with standard batch sizes. Our convergence analysis extends naturally to other subspace optimization algorithms. Finally, we empirically validate our theoretical results and thoroughly test the proposed mechanisms. Codes are available at https://github.com/pkumelon/Golore.
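A minimal sketch of the shared projection mechanism: gradients are projected into a rank-r subspace (where the optimizer state would live) and the update is projected back. Drawing P as a fresh random orthonormal basis is the GoLore-style choice; GaLore instead derives P from an SVD of the gradient. Shapes and the plain-SGD update are illustrative.

```python
# Low-rank gradient projection: optimizer state is kept at size (r, n)
# instead of (m, n), saving memory for large weight matrices.
import torch

m, n, r, lr = 256, 128, 8, 1e-2
W = torch.randn(m, n)

def low_rank_step(W, grad, P):
    g_small = P.T @ grad      # (r, n): where optimizer state would live
    update = P @ g_small      # project the update back to (m, n)
    return W - lr * update

grad = torch.randn(m, n)                   # stand-in for a real gradient
P, _ = torch.linalg.qr(torch.randn(m, r))  # random orthonormal basis (GoLore)
W = low_rank_step(W, grad, P)
print(W.shape)
```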
Authors: Wei Chen, Yuxuan Liang
Abstract: The widespread deployment of sensing devices leads to a surge in data for spatio-temporal forecasting applications such as traffic flow, air quality, and wind energy. Although spatio-temporal graph neural networks have achieved success in modeling various static spatio-temporal forecasting scenarios, real-world spatio-temporal data are typically received in a streaming manner, and the network continuously expands with the installation of new sensors. Thus, spatio-temporal forecasting in streaming scenarios faces dual challenges: the inefficiency of retraining models over newly arrived data and the detrimental effects of catastrophic forgetting over long-term history. To address these challenges, we propose a novel prompt tuning-based continuous forecasting method, following two fundamental tuning principles guided by empirical and theoretical analysis: expand and compress, which effectively resolve the aforementioned problems with lightweight tuning parameters. Specifically, we integrate the base spatio-temporal graph neural network with a continuous prompt pool, utilizing stored prompts (i.e., few learnable parameters) in memory, and jointly optimize them with the base spatio-temporal graph neural network. This method ensures that the model sequentially learns from the spatio-temporal data stream to accomplish tasks for corresponding periods. Extensive experimental results on multiple real-world datasets demonstrate the multi-faceted superiority of our method over the state-of-the-art baselines, including effectiveness, efficiency, universality, etc.
Authors: Jinyang Liu, Tessa Steensgaard, Marvin N. Wright, Niklas Pfister, Munir Hiabu
Abstract: Many existing interpretation methods are based on Partial Dependence (PD) functions that, for a pre-trained machine learning model, capture how a subset of the features affects the predictions by averaging over the remaining features. Notable methods include Shapley additive explanations (SHAP) which computes feature contributions based on a game theoretical interpretation and PD plots (i.e., 1-dim PD functions) that capture average marginal main effects. Recent work has connected these approaches using a functional decomposition and argues that SHAP values can be misleading since they merge main and interaction effects into a single local effect. However, a major advantage of SHAP compared to other PD-based interpretations has been the availability of fast estimation techniques, such as \texttt{TreeSHAP}. In this paper, we propose a new tree-based estimator, \texttt{FastPD}, which efficiently estimates arbitrary PD functions. We show that \texttt{FastPD} consistently estimates the desired population quantity -- in contrast to path-dependent \texttt{TreeSHAP} which is inconsistent when features are correlated. For moderately deep trees, \texttt{FastPD} improves the complexity of existing methods from quadratic to linear in the number of observations. By estimating PD functions for arbitrary feature subsets, \texttt{FastPD} can be used to extract PD-based interpretations such as SHAP, PD plots and higher-order interaction effects.
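For orientation, a brute-force PD estimate for a feature subset S looks as follows; this is the population quantity being targeted, while FastPD's contribution is computing it efficiently inside tree ensembles. The model and data here are placeholders.

```python
# Brute-force partial dependence: fix the features in S at x_S, keep the
# remaining features as observed, and average the model's predictions.
import numpy as np

def partial_dependence(model, X, S, x_S):
    """Estimate PD_S(x_S) = E[ f(x_S, X_rest) ] over the data distribution."""
    X_mod = X.copy()
    X_mod[:, S] = x_S              # overwrite S, keep the rest as observed
    return model(X_mod).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
model = lambda X: X[:, 0] * X[:, 1] + X[:, 2]         # toy fitted model
print(partial_dependence(model, X, S=[0], x_S=1.0))   # approx E[X_1] + E[X_2]
```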
Authors: Xingran Chen, Zhenke Wu, Xu Shi, Hyunghoon Cho, Bhramar Mukherjee
Abstract: We conduct a scoping review of existing approaches for synthetic EHR data generation, and benchmark major methods with proposed open-source software to offer recommendations for practitioners. We search three academic databases for our scoping review. Methods are benchmarked on open-source EHR datasets, MIMIC-III/IV. Seven existing methods covering major categories and two baseline methods are implemented and compared. Evaluation metrics concern data fidelity, downstream utility, privacy protection, and computational cost. 42 studies are identified and classified into five categories. Seven open-source methods covering all categories are selected, trained on MIMIC-III, and evaluated on MIMIC-III or MIMIC-IV for transportability considerations. Among them, GAN-based methods demonstrate competitive performance in fidelity and utility on MIMIC-III; rule-based methods excel in privacy protection. Similar findings are observed on MIMIC-IV, except that GAN-based methods further outperform the baseline methods in preserving fidelity. A Python package, "SynthEHRella", is provided to integrate various choices of approaches and evaluation metrics, enabling more streamlined exploration and evaluation of multiple methods. We found that method choice is governed by the relative importance of the evaluation metrics in downstream use cases. We provide a decision tree to guide the choice among the benchmarked methods. Based on the decision tree, GAN-based methods excel when distributional shifts exist between the training and testing populations. Otherwise, CorGAN and MedGAN are most suitable for association modeling and predictive modeling, respectively. Future research should prioritize enhancing fidelity of the synthetic data while controlling privacy exposure, and comprehensive benchmarking of longitudinal or conditional generation methods.
Authors: Moshik Hershcovitch, Andrew Wood, Leshem Choshen, Guy Girmonsky, Roy Leibovitz, Ilias Ennmouri, Michal Malka, Peter Chin, Swaminathan Sundararaman, Danny Harnik
Abstract: With the growth of model sizes and the scale of their deployment, their sheer size burdens the infrastructure, requiring more network bandwidth and storage to accommodate them. While there is a vast model compression literature on deleting parts of the model weights for faster inference, we investigate a more traditional type of compression, one that represents the model in a compact form and is coupled with a decompression algorithm that returns it to its original form and size, namely lossless compression. We present ZipNN, a lossless compression method tailored to neural networks. Somewhat surprisingly, we show that specific lossless compression can gain significant network and storage reduction on popular models, often saving 33% and at times reducing over 50% of the model size. We investigate the source of model compressibility and introduce specialized compression variants tailored for models that further increase the effectiveness of compression. On popular models (e.g., Llama 3), ZipNN shows space savings that are over 17% better than vanilla compression while also improving compression and decompression speeds by 62%. We estimate that these methods could save over an ExaByte per month of network traffic downloaded from a large model hub like Hugging Face.
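As a rough illustration of why model weights are compressible at all (our toy probe, not ZipNN's actual algorithm), one can compare a general-purpose compressor on raw float16 weight bytes against the same compressor after grouping the low and high bytes of each weight into separate streams; the exponent-carrying high bytes of near-Gaussian weights tend to be highly repetitive:

```python
import zlib
import numpy as np

# Synthetic "weights" with a typical small-variance Gaussian distribution.
w = np.random.default_rng(0).normal(0, 0.02, size=1_000_000).astype(np.float16)
raw = w.tobytes()
# Little-endian float16: byte 0 holds mantissa LSBs, byte 1 holds
# sign + exponent + mantissa MSBs. Split them into two streams.
grouped = w.view(np.uint8).reshape(-1, 2).T.copy().tobytes()
print("plain zlib  :", len(zlib.compress(raw)) / len(raw))
print("byte-grouped:", len(zlib.compress(grouped)) / len(raw))  # often smaller
```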
Authors: Erica Coppolillo, Federico Cinus, Marco Minici, Francesco Bonchi, Giuseppe Manco
Abstract: Large Language Models (LLMs) demonstrate significant persuasive capabilities in one-on-one interactions, but their influence within social networks, where interconnected users and complex opinion dynamics pose unique challenges, remains underexplored. This paper addresses the research question: \emph{Can LLMs generate meaningful content that maximizes user engagement on social networks?} To answer this, we propose a pipeline using reinforcement learning with simulated feedback, where the network's response to LLM-generated content (i.e., the reward) is simulated through a formal engagement model. This approach bypasses the temporal cost and complexity of live experiments, enabling an efficient feedback loop between the LLM and the network under study. It also allows control over endogenous factors such as the LLM's position within the social network and the distribution of opinions on a given topic. Our approach is adaptive to the opinion distribution of the underlying network and agnostic to the specifics of the engagement model, which is embedded as a plug-and-play component. Such flexibility makes it suitable for more complex engagement tasks and interventions in computational social science. Using our framework, we analyze the performance of LLMs in generating social engagement under different conditions, showcasing their full potential in this task. The experimental code is publicly available at https://github.com/mminici/Engagement-Driven-Content-Generation.
URLs: https://github.com/mminici/Engagement-Driven-Content-Generation.
Authors: Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, Murali Annavaram
Abstract: Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) cache is used to store intermediate activations, which significantly lowers the computational overhead for token generation. However, the memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory. A cost-effective alternative is to offload KV cache to CPU memory, which alleviates GPU memory pressure, but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. Fully overlapping the PCIe communication latency becomes challenging as the size of the KV cache grows and/or the GPU compute capabilities increase. In this paper, we introduce KVPR, an efficient I/O-aware LLM inference method where the CPU first transfers a partial set of activations, from which the GPU can start recomputing the KV cache values. While the GPU recomputes the partial KV cache, the remaining portion of the KV cache is transferred concurrently from the CPU. This approach overlaps GPU recomputation with KV cache transfer to minimize idle GPU time and maximize inference performance. KVPR is fully automated by integrating a profiler module that utilizes input characteristics and system hardware information, a scheduler module to optimize the distribution of computation and communication workloads, and a runtime module to efficiently execute the derived execution plan. Experimental results show that KVPR achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches. The code is available at https://github.com/chaoyij/KVPR.
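The overlap idea can be sketched with two CUDA streams: the default stream recomputes the head of the KV cache from a partial set of activations while a side stream transfers the tail from pinned CPU memory. This is our simplified illustration, not KVPR's profiler/scheduler/runtime pipeline; recompute_fn is a hypothetical stand-in for the attention layer's KV projection:

```python
import torch

def fetch_kv_with_overlap(x_head_cpu, kv_tail_cpu, recompute_fn):
    # Side stream: asynchronously transfer the cache tail from pinned CPU
    # memory while the default stream recomputes the head on the GPU.
    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        kv_tail = kv_tail_cpu.pin_memory().cuda(non_blocking=True)
    x_head = x_head_cpu.pin_memory().cuda(non_blocking=True)
    kv_head = recompute_fn(x_head)          # overlaps with the copy above
    torch.cuda.current_stream().wait_stream(copy_stream)  # sync before use
    return torch.cat([kv_head, kv_tail], dim=0)

if torch.cuda.is_available():
    kv = fetch_kv_with_overlap(
        torch.randn(128, 256), torch.randn(896, 256),
        recompute_fn=lambda x: x @ torch.randn(256, 256, device="cuda"))
    print(kv.shape)   # (1024, 256)
```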
Authors: Shaowen Wang, Anan Liu, Jian Xiao, Huan Liu, Yuekui Yang, Cong Xu, Qianqian Pu, Suncong Zheng, Wei Zhang, Di Wang, Jie Jiang, Jian Li
Abstract: Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is the Adam optimizer, which integrates momentum ($m_t$) and adaptive learning rate ($v_t$). However, the volatile nature of online learning data, characterized by its frequent distribution shifts and presence of noise, poses significant challenges to Adam's standard optimization process: (1) Adam may use outdated momentum and the average of squared gradients, resulting in slower adaptation to distribution changes, and (2) Adam's performance is adversely affected by data noise. To mitigate these issues, we introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates. If momentum and gradient are in sync, CAdam proceeds with parameter updates according to Adam's original formulation; if not, it temporarily withholds updates and monitors potential shifts in data distribution in subsequent iterations. This method allows CAdam to distinguish true distribution shifts from mere noise and to adapt more quickly to new data distributions. In various settings with distribution shift or noise, our experiments demonstrate that CAdam surpasses other well-known optimizers, including the original Adam. Furthermore, in large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam, leading to substantial increases in the system's gross merchandise volume (GMV).
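A minimal sketch of one CAdam-style step follows. The per-dimension gate below (withhold the update wherever momentum and gradient disagree in sign) is our reading of the abstract; the paper's exact confidence test may differ:

```python
import numpy as np

def cadam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam moment updates ...
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # ... plus a confidence gate: update only where momentum and gradient agree.
    confident = (m * g) > 0
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    theta = theta - np.where(confident, update, 0.0)
    return theta, m, v

theta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
for t in range(1, 101):
    # Noisy gradient of sum((theta - 0.5)^2).
    g = 2 * theta - 1 + 0.1 * np.random.default_rng(t).normal(size=4)
    theta, m, v = cadam_step(theta, g, m, v, t)
print(theta)   # approaches the minimizer 0.5
```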
Authors: Tal Zeevi, Ravid Shwartz-Ziv, Yann LeCun, Lawrence H. Staib, John A. Onofrey
Abstract: Accurate uncertainty estimation is crucial for deploying neural networks in risk-sensitive applications such as medical diagnosis. Monte Carlo Dropout is a widely used technique for approximating predictive uncertainty by performing stochastic forward passes with dropout during inference. However, using static dropout rates across all layers and inputs can lead to suboptimal uncertainty estimates, as it fails to adapt to the varying characteristics of individual inputs and network layers. Existing approaches optimize dropout rates during training using labeled data, resulting in fixed inference-time parameters that cannot adjust to new data distributions, compromising uncertainty estimates in Monte Carlo simulations. In this paper, we propose Rate-In, an algorithm that dynamically adjusts dropout rates during inference by quantifying the information loss induced by dropout in each layer's feature maps. By treating dropout as controlled noise injection and leveraging information-theoretic principles, Rate-In adapts dropout rates per layer and per input instance without requiring ground truth labels. By quantifying the functional information loss in feature maps, we adaptively tune dropout rates to maintain perceptual quality across diverse medical imaging tasks and architectural configurations. Our extensive empirical study on synthetic data and real-world medical imaging tasks demonstrates that Rate-In improves calibration and sharpens uncertainty estimates compared to fixed or heuristic dropout rates without compromising predictive performance. Rate-In offers a practical, unsupervised, inference-time approach to optimizing dropout for more reliable predictive uncertainty estimation in critical applications.
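A rough sketch of the inference-time tuning loop: for a given layer's feature map, choose the largest dropout rate whose induced distortion stays within a budget. We proxy the paper's information-theoretic loss measure with a simple relative MSE, which is our assumption:

```python
import torch

def tune_dropout_rate(h, budget=0.1, n_samples=32):
    # Pick the largest dropout rate whose relative feature-map distortion
    # stays within `budget`; relative MSE stands in for information loss.
    best_p = 0.0
    for p in torch.linspace(0.05, 0.5, 10):
        keep = 1.0 - p.item()
        h_rep = h.expand(n_samples, *h.shape)
        mask = torch.bernoulli(torch.full_like(h_rep, keep))
        h_drop = h_rep * mask / keep               # inverted dropout
        rel_mse = ((h_drop - h_rep) ** 2).mean() / (h ** 2).mean()
        if rel_mse <= budget:                      # distortion grows with p
            best_p = p.item()
    return best_p

layer_output = torch.randn(64)                     # one layer's feature map
print(tune_dropout_rate(layer_output))
```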
Authors: Yael Itzhakev, Amit Giloni, Yuval Elovici, Asaf Shabtai
Abstract: Machine learning models trained on tabular data are vulnerable to adversarial attacks, even in realistic scenarios where attackers only have access to the model's outputs. Since tabular data contains complex interdependencies among features, it presents a unique challenge for adversarial samples which must maintain coherence and respect these interdependencies to remain indistinguishable from benign data. Moreover, existing attack evaluation metrics-such as the success rate, perturbation magnitude, and query count-fail to account for this challenge. To address those gaps, we propose a technique for perturbing dependent features while preserving sample coherence. In addition, we introduce Class-Specific Anomaly Detection (CSAD), an effective novel anomaly detection approach, along with concrete metrics for assessing the quality of tabular adversarial attacks. CSAD evaluates adversarial samples relative to their predicted class distribution, rather than a broad benign distribution. It ensures that subtle adversarial perturbations, which may appear coherent in other classes, are correctly identified as anomalies. We integrate SHAP explainability techniques to detect inconsistencies in model decision-making, extending CSAD for SHAP-based anomaly detection. Our evaluation incorporates both anomaly detection rates with SHAP-based assessments to provide a more comprehensive measure of adversarial sample quality. We evaluate various attack strategies, examining black-box query-based and transferability-based gradient attacks across four target models. Experiments on benchmark tabular datasets reveal key differences in the attacker's risk and effort and attack quality, offering insights into the strengths, limitations, and trade-offs faced by attackers and defenders. Our findings lay the groundwork for future research on adversarial attacks and defense development in the tabular domain.
Authors: Aditya Desai, Shuo Yang, Alejandro Cuadron, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Abstract: Leveraging long contexts is crucial for advanced AI systems, but attention computation poses a scalability challenge. While scaled dot-product attention (SDPA) exhibits token sparsity, i.e., only a few pivotal tokens contribute significantly to the output, exploiting this sparsity remains challenging. Existing methods either suffer from quality degradation or require substantial additional resources. We show that identifying pivotal tokens is a Maximum Inner Product Search (MIPS) problem. However, existing MIPS solutions are not well-suited for SDPA, as they are not GPU-friendly and often underperform because queries and keys follow different distributions. This paper introduces HashAttention, framing pivotal token identification as a recommendation problem. Given a query, HashAttention encodes keys and queries in Hamming space, capturing the required semantic similarity, using learned mapping functions. HashAttention efficiently identifies pivotal tokens for a given query using bitwise operations and computes attention using only these tokens, improving the overall attention efficiency. Trained on generic data, HashAttention reduces tokens used by up to $16\times$ with minimal quality loss, requiring only 32 bits of auxiliary memory per token. Sparsity can be further improved to $32\times$ through task-specific fine-tuning. On A100 GPU, at $32\times$ sparsity, incorporating HashAttention reduces attention latency by up to $4.3\times$ in GPT-FAST and $2.54\times$ in FlashDecode, and achieves up to $3.12\times$ higher throughput for GPT-FAST.
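The retrieval step can be sketched as follows: sign-binarize learned projections of queries and keys, rank keys by Hamming distance, and attend only over the top-k pivotal tokens. Random matrices stand in for the learned mapping functions, and the bit-packing that makes this fast on GPUs is omitted:

```python
import numpy as np

def hamming_topk_attention(q, K, V, Wq, Wk, k=8):
    # Sign-binarize projected queries/keys, rank keys by Hamming distance,
    # then run softmax attention over only the top-k pivotal tokens.
    q_bits = (q @ Wq) > 0
    K_bits = (K @ Wk) > 0
    ham = (q_bits[None, :] != K_bits).sum(axis=1)       # Hamming distances
    idx = np.argsort(ham)[:k]                           # pivotal tokens
    scores = K[idx] @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
d, n, bits = 64, 1024, 32
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
Wq = Wk = rng.normal(size=(d, bits))                    # stand-ins for learned maps
print(hamming_topk_attention(q, K, V, Wq, Wk).shape)    # (64,)
```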
Authors: Xingwu Sun, Shuaipeng Li, Ruobing Xie, Weidong Han, Kan Wu, Zhen Yang, Yixing Li, An Wang, Shuai Li, Jinbao Xue, Yu Cheng, Yangyu Tao, Zhanhui Kang, Chengzhong Xu, Di Wang, Jie Jiang
Abstract: Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization and pay less attention to the constituents of floating-point (FP) quantization, and thus cannot fit the LLM losses well in this scenario. In contrast, while FP quantization training is more commonly implemented in production, research on it has been relatively superficial. In this paper, we thoroughly explore the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the FP quantization training performance of LLMs. In addition to an accurate unified scaling law for FP quantization, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit numbers, which is available for future reference by hardware manufacturers; (2) We discover the formation of a critical data size in low-precision LLM training. Training data exceeding the critical data size will inversely degrade LLM performance; (3) The optimal FP quantization precision is directly proportional to the computational power within a wide range of computational power. We estimate that the best cost-performance precision should lie between 4 and 8 bits.
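To make the exponent/mantissa trade-off concrete, here is a toy FP quantizer (ignoring subnormals, NaNs, and overflow semantics) that rounds the mantissa to m_bits and clamps the exponent to an e_bits range; it illustrates the design space only and is not the paper's quantization procedure:

```python
import numpy as np

def fp_quantize(x, e_bits=4, m_bits=3):
    # Toy float quantizer: round the mantissa to m_bits and clamp the
    # exponent to a symmetric e_bits range (no subnormals, NaNs, or
    # saturation logic).
    m, e = np.frexp(x)                              # x = m * 2**e, |m| in [0.5, 1)
    m = np.round(m * 2.0 ** (m_bits + 1)) / 2.0 ** (m_bits + 1)
    e_max = 2 ** (e_bits - 1)
    return np.ldexp(m, np.clip(e, -e_max, e_max))

x = np.random.default_rng(0).normal(size=4).astype(np.float32)
print(x)
print(fp_quantize(x))          # more mantissa bits -> finer quantization steps
```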
Authors: Satchel Grant
Abstract: When can we say that two neural systems are the same? The answer to this question is goal-dependent, and it is often addressed through correlative methods such as Representational Similarity Analysis (RSA) and Centered Kernel Alignment (CKA). What nuances do we miss, however, when we fail to causally probe the representations? Do the dangers of cause vs. correlation exist in comparative representational analyses? In this work, we introduce a method for connecting neural representational similarity to behavior through causal interventions. The method learns orthogonal transformations that find an aligned subspace in which behavioral information from multiple distributed networks' representations can be isolated and interchanged. We first show that the method can be used to transfer the behavior from one frozen Neural Network (NN) to another in a manner similar to model stitching, and we show how the method can complement correlative similarity measures like RSA. We then introduce an efficient subspace orthogonalization technique using the Gram-Schmidt process -- which can also be used for Distributed Alignment Search (DAS) -- allowing us to perform analyses on larger models. Next, we empirically and theoretically show how our method can be equivalent to model stitching when desired, or it can take a form that is more restrictive to causal information, and in both cases, it reduces the number of required matrices for a comparison of n models from quadratic to linear in n. We then show how we can augment the loss objective with an auxiliary loss to train causally relevant alignments even when we can only read the representations from one of the two networks during training (like with biological networks). Lastly, we use number representations as a case study to explore how our method can be used to compare specific types of representational information across tasks and models.
Authors: Tehila Dahan, Kfir Y. Levy
Abstract: We address the challenges of Byzantine-robust training in asynchronous distributed machine learning systems, aiming to enhance efficiency amid massive parallelization and heterogeneous computing resources. Asynchronous systems, marked by independently operating workers and intermittent updates, uniquely struggle with maintaining integrity against Byzantine failures, which encompass malicious or erroneous actions that disrupt learning. The inherent delays in such settings not only introduce additional bias to the system but also obscure the disruptions caused by Byzantine faults. To tackle these issues, we adapt the Byzantine framework to asynchronous dynamics by introducing a novel weighted robust aggregation framework. This allows for the extension of robust aggregators and a recent meta-aggregator to their weighted versions, mitigating the effects of delayed updates. By further incorporating a recent variance-reduction technique, we achieve an optimal convergence rate for the first time in an asynchronous Byzantine environment. Our methodology is rigorously validated through empirical and theoretical analysis, demonstrating its effectiveness in enhancing fault tolerance and optimizing performance in asynchronous ML systems.
Authors: Yangming Li, Chaoyu Liu, Carola-Bibiane Sch\"onlieb
Abstract: Flow-based generative models have demonstrated promising performance across a broad spectrum of data modalities (e.g., image and text). However, there are few works exploring their extension to unordered data (e.g., spatial point set), which is not trivial because previous models are mostly designed for vector data that are naturally ordered. In this paper, we present unordered flow, a type of flow-based generative model for set-structured data generation. Specifically, we convert unordered data into an appropriate function representation, and learn the probability measure of such representations through function-valued flow matching. For the inverse map from a function representation to unordered data, we propose a method similar to particle filtering, with Langevin dynamics to first warm up the initial particles and gradient-based search to update them until convergence. We have conducted extensive experiments on multiple real-world datasets, showing that our unordered flow model is very effective in generating set-structured data and significantly outperforms previous baselines.
Authors: Yunzhen Yao, Lie He, Michael Gastpar
Abstract: This paper considers the sample-efficiency of preference learning, which models and predicts human choices based on comparative judgments. The minimax optimal estimation error rate $\Theta(d/n)$ in classical estimation theory requires that the number of samples $n$ scales linearly with the dimensionality of the feature space $d$. However, the high dimensionality of the feature space and the high cost of collecting human-annotated data challenge the efficiency of traditional estimation methods. To remedy this, we leverage sparsity in the preference model and establish sharp error rates. We show that under the sparse random utility model, where the parameter of the reward function is $k$-sparse, the minimax optimal rate can be reduced to $\Theta(k/n \log(d/k))$. Furthermore, we analyze the $\ell_{1}$-regularized estimator and show that it achieves near-optimal rate under mild assumptions on the Gram matrix. Experiments on synthetic data and LLM alignment data validate our theoretical findings, showing that sparsity-aware methods significantly reduce sample complexity and improve prediction accuracy.
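The $\ell_1$-regularized estimator can be sketched on synthetic data: under a Bradley-Terry-style choice model with a $k$-sparse reward parameter (our stand-in for the paper's sparse random utility model), lasso-penalized logistic regression on feature differences recovers a small support:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, k = 2000, 200, 5
theta = np.zeros(d)
theta[:k] = 1.0                                        # k-sparse reward parameter
xa, xb = rng.normal(size=(n, d)), rng.normal(size=(n, d))
z = (xa - xb) @ theta
y = rng.random(n) < 1 / (1 + np.exp(-z))               # logistic choice model

# l1-regularized estimator on pairwise feature differences.
est = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(xa - xb, y)
print("recovered support size:", int((np.abs(est.coef_) > 1e-6).sum()))
```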
Authors: Konrad Mundinger, Max Zimmer, Aldo Kiem, Christoph Spiegel, Sebastian Pokutta
Abstract: We demonstrate how neural networks can drive mathematical discovery through a case study of the Hadwiger-Nelson problem, a long-standing open problem at the intersection of discrete geometry and extremal combinatorics that is concerned with coloring the plane while avoiding monochromatic unit-distance pairs. Using neural networks as approximators, we reformulate this mixed discrete-continuous geometric coloring problem with hard constraints as an optimization task with a probabilistic, differentiable loss function. This enables gradient-based exploration of admissible configurations, most notably leading to the discovery of two novel six-colorings that provide the first improvement in thirty years to the off-diagonal variant of the original problem. Here, we establish the underlying machine learning approach used to obtain these results and demonstrate its broader applicability through additional numerical insights.
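The probabilistic, differentiable loss can be sketched directly: a network maps points of the plane to a distribution over six colors, and the loss is the probability that a randomly sampled unit-distance pair receives the same color. Architecture, sampling region, and step counts below are our assumptions, not the paper's exact setup:

```python
import math
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 6),                      # logits over 6 colors
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(200):
    x = (torch.rand(4096, 2) - 0.5) * 10          # points in a square patch
    ang = 2 * math.pi * torch.rand(4096, 1)
    y = x + torch.cat([torch.cos(ang), torch.sin(ang)], dim=1)  # unit-distance partners
    px, py = net(x).softmax(-1), net(y).softmax(-1)
    loss = (px * py).sum(-1).mean()               # P(pair is monochromatic)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))                                # driven toward conflict-free colorings
```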
Authors: Giovanni Catania, Aur\'elien Decelle, Cyril Furtlehner, Beatriz Seoane
Abstract: We investigate the impact of limited data on training pairwise energy-based models for inverse problems aimed at identifying interaction networks. Utilizing the Gaussian model as a testbed, we dissect training trajectories across the eigenbasis of the coupling matrix, exploiting the independent evolution of eigenmodes and revealing that the learning timescales are tied to the spectral decomposition of the empirical covariance matrix. We see that optimal points for early stopping arise from the interplay between these timescales and the initial conditions of training. Moreover, we show that finite data corrections can be accurately modeled through asymptotic random matrix theory calculations and provide the counterpart of generalized cross-validation in the energy-based model context. Our analytical framework extends to binary-variable maximum-entropy pairwise models with minimal variations. These findings offer strategies to control overfitting in discrete-variable models through empirical shrinkage corrections, improving the management of overfitting in energy-based generative models. Finally, we propose a generalization to arbitrary energy-based models by deriving the neural tangent kernel dynamics of the score function under the score-matching algorithm.
Authors: Yuan Gao, Hao Wu, Ruiqi Shu, Huanshuo Dong, Fan Xu, Rui Ray Chen, Yibo Yan, Qingsong Wen, Xuming Hu, Kun Wang, Jiahao Wu, Qing Li, Hui Xiong, Xiaomeng Huang
Abstract: Accurate weather forecasts are important for disaster prevention, agricultural planning, and other applications. Traditional numerical weather prediction (NWP) methods offer physically interpretable high-accuracy predictions but are computationally expensive and fail to fully leverage rapidly growing historical data. In recent years, deep learning models have made significant progress in weather forecasting, but challenges remain, such as balancing global and regional high-resolution forecasts, excessive smoothing in extreme event predictions, and insufficient dynamic system modeling. To address these issues, this paper proposes a global-regional nested weather forecasting framework (OneForecast) based on graph neural networks. By combining a dynamic system perspective with multi-grid theory, we construct a multi-scale graph structure and densify the target region to capture local high-frequency features. We introduce an adaptive messaging mechanism, using dynamic gating units to deeply integrate node and edge features for more accurate extreme event forecasting. For high-resolution regional forecasts, we propose a neural nested grid method to mitigate boundary information loss. Experimental results show that OneForecast performs excellently across global to regional scales and short-term to long-term forecasts, especially in extreme event predictions. The code is available at https://github.com/YuanGao-YG/OneForecast.
Authors: Gholamali Aminian, Amir R. Asadi, Idan Shenfeld, Youssef Mroueh
Abstract: Recent methods for aligning large language models (LLMs) with human feedback predominantly rely on a single reference model, which limits diversity, risks model overfitting, and underutilizes the wide range of available pre-trained models. Incorporating multiple reference models has the potential to address these limitations by broadening perspectives, reducing bias, and leveraging the strengths of diverse open-source LLMs. However, integrating multiple reference models into reinforcement learning with human feedback (RLHF) frameworks poses significant theoretical challenges, where achieving exact solutions has remained an open problem. This paper presents the first \emph{exact solution} to the multiple reference model problem in reverse KL-regularized RLHF. We introduce a comprehensive theoretical framework that includes rigorous statistical analysis and provides sample complexity guarantees. Additionally, we extend our analysis to forward KL-regularized RLHF, offering new insights into sample complexity requirements in multiple reference scenarios. Our contributions lay the foundation for more advanced and adaptable LLM alignment techniques, enabling the effective use of multiple reference models. This work paves the way for developing alignment frameworks that are both theoretically sound and better suited to the challenges of modern AI ecosystems.
Authors: Wei Yao, Wenkai Yang, Gengze Xu, Ziqiao Wang, Yankai Lin, Yong Liu
Abstract: Weak-to-strong generalization, where weakly supervised strong models outperform their weaker teachers, offers a promising approach to aligning superhuman models with human values. To deepen the understanding of this approach, we provide theoretical insights into its capabilities and limitations. First, in the classification setting, we establish upper and lower generalization error bounds for the strong model, identifying the primary limitations as stemming from the weak model's generalization error and the optimization objective itself. Additionally, we derive lower and upper bounds on the calibration error of the strong model. These theoretical bounds reveal two critical insights: (1) the weak model should demonstrate strong generalization performance and maintain well-calibrated predictions, and (2) the strong model's training process must strike a careful balance, as excessive optimization could undermine its generalization capability by over-relying on the weak supervision signals. Finally, in the regression setting, we extend the work of Charikar et al. (2024) to a loss function based on Kullback-Leibler (KL) divergence, offering guarantees that the strong student can outperform its weak teacher by at least the magnitude of their disagreement. We conduct sufficient experiments to validate our theory.
Authors: Zeman Li, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
Abstract: Modern foundation models are trained on diverse datasets to enhance generalization across tasks and domains. A central challenge in this process is determining how to effectively mix and sample data from multiple sources. This naturally leads to a multitask learning (MTL) perspective. While prior work in MTL has emphasized mitigating gradient conflicts, we observe that large-scale pretraining scenarios, such as multilingual or multi-domain training, often exhibit little to no gradient conflict. Motivated by this observation, we propose PiKE (Positive gradient interaction-based K-task weights Estimator), an adaptive data mixing algorithm that dynamically adjusts sampling weights during training. PiKE exploits non-conflicting gradient interactions to minimize a near-tight upper bound on the average loss decrease at each step, while incurring negligible computational overhead. We provide theoretical convergence guarantees and show that PiKE outperforms static and non-adaptive mixing baselines. Furthermore, we extend PiKE to promote balanced learning across tasks. Extensive experiments on large-scale language model pretraining confirm that PiKE achieves faster convergence and improved downstream performance compared to existing approaches.
Authors: Nikita P. Kalinin, Jalaj Upadhyay, Christoph H. Lampert
Abstract: We propose Joint Moment Estimation (JME), a method for continually and privately estimating both the first and second moments of data with reduced noise compared to naive approaches. JME uses the matrix mechanism and a joint sensitivity analysis to allow the second moment estimation with no additional privacy cost, thereby improving accuracy while maintaining privacy. We demonstrate JME's effectiveness in two applications: estimating the running mean and covariance matrix for Gaussian density estimation, and model training with DP-Adam on CIFAR-10.
Authors: Massimiliano Ciranni, Vito Paolo Pastore, Roberto Di Via, Enzo Tartaglione, Francesca Odone, Vittorio Murino
Abstract: Deep learning model effectiveness in classification tasks is often challenged by the quality and quantity of training data whenever they are affected by strong spurious correlations between specific attributes and target labels. This results in a form of bias affecting training data, which typically leads to unrecoverable weak generalization in prediction. This paper addresses this problem by leveraging bias amplification with generated synthetic data: we introduce Diffusing DeBias (DDB), a novel approach acting as a plug-in for common methods of unsupervised model debiasing, exploiting the inherent bias-learning tendency of diffusion models in data generation. Specifically, our approach adopts conditional diffusion models to generate synthetic bias-aligned images, which replace the original training set for learning an effective bias amplifier model that we subsequently incorporate into an end-to-end and a two-step unsupervised debiasing approach. By tackling the fundamental issue of bias-conflicting training sample memorization in learning auxiliary models, typical of such techniques, our proposed method outperforms the current state-of-the-art on multiple benchmark datasets, demonstrating its potential as a versatile and effective tool for tackling bias in deep learning models.
Authors: Chuanhui Liu, Xiao Wang
Abstract: This paper provides a comprehensive analysis of variational inference in latent variable models for survival analysis, emphasizing the distinctive challenges associated with applying variational methods to survival data. We identify a critical weakness in the existing methodology, demonstrating how a poorly designed variational distribution may hinder the objective of survival analysis tasks -- modeling time-to-event distributions. We prove that the optimal variational distribution, which perfectly bounds the log-likelihood, may depend on the censoring mechanism. To address this issue, we propose censor-dependent variational inference (CDVI), tailored for latent variable models in survival analysis. More practically, we introduce CD-CVAE, a V-structure Variational Autoencoder (VAE) designed for the scalable implementation of CDVI. Further discussion extends some existing theories and training techniques to survival analysis. Extensive experiments validate our analysis and demonstrate significant improvements in the estimation of individual survival distributions.
Authors: Adrian M\"uller, Jon Schneider, Stratis Skoulakis, Luca Viano, Volkan Cevher
Abstract: In this paper, we investigate the existence of online learning algorithms with bandit feedback that simultaneously guarantee $O(1)$ regret compared to a given comparator strategy, and $\tilde{O}(\sqrt{T})$ regret compared to any fixed strategy, where $T$ is the number of rounds. We provide the first affirmative answer to this question whenever the comparator strategy supports every action. In the context of zero-sum games with min-max value zero, both in normal- and extensive form, we show that our results allow us to guarantee a loss of at most $O(1)$ while being able to gain $\Omega(T)$ from exploitable opponents, thereby combining the benefits of both no-regret algorithms and minimax play.
Authors: Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matari\'c
Abstract: We present an end-to-end framework for generating synthetic users for evaluating interactive agents designed to encourage positive behavior changes, such as in health and lifestyle coaching. The synthetic users are grounded in health and lifestyle conditions, specifically sleep and diabetes management in this study, to ensure realistic interactions with the health coaching agent. Synthetic users are created in two stages: first, structured data are generated grounded in real-world health and lifestyle factors in addition to basic demographics and behavioral attributes; second, full profiles of the synthetic users are developed conditioned on the structured data. Interactions between synthetic users and the coaching agent are simulated using generative agent-based models such as Concordia, or directly by prompting a language model. Using two independently-developed agents for sleep and diabetes coaching as case studies, the validity of this framework is demonstrated by analyzing the coaching agent's understanding of the synthetic users' needs and challenges. Finally, through multiple blinded evaluations of user-coach interactions by human experts, we demonstrate that our synthetic users with health and behavioral attributes more accurately portray real human users with the same attributes, compared to generic synthetic users not grounded in such attributes. The proposed framework lays the foundation for efficient development of conversational agents through extensive, realistic, and grounded simulated interactions.
Authors: Joongwon Lee, Seonghwan Kim, Seokhyun Moon, Hyunwoo Kim, Woo Youn Kim
Abstract: We introduce FragFM, a novel hierarchical framework via fragment-level discrete flow matching for efficient molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoder to reconstruct details at the atom level. Together with a stochastic fragment bag strategy to effectively handle an extensive fragment space, our framework enables more efficient and scalable molecular generation. We demonstrate that our fragment-based approach achieves better property control than the atom-based method and additional flexibility through conditioning the fragment bag. We also propose a Natural Product Generation benchmark (NPGen) to evaluate modern molecular graph generative models' ability to generate natural product-like molecules. Since natural products are biologically prevalidated and differ from typical drug-like molecules, our benchmark provides a more challenging yet meaningful evaluation relevant to drug discovery. We conduct a comparative study of FragFM against various models on diverse molecular generation benchmarks, including NPGen, demonstrating superior performance. The results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space.
Authors: Akshay Mehra, Trisha Mittal, Subhadra Gopalakrishnan, Joshua Kimball
Abstract: Coreset Selection (CS) aims to identify a subset of the training dataset that achieves model performance comparable to using the entire dataset. Many state-of-the-art CS methods select coresets using scores whose computation requires training the downstream model on the entire dataset first and recording changes in the model's behavior on samples as it trains (training dynamics). These scores are inefficient to compute and hard to interpret, as they do not indicate whether a sample is difficult to learn in general or only for a specific downstream model. Our work addresses these challenges by proposing a score that computes a sample's difficulty using human-understandable textual attributes (concepts) independent of any downstream model. Specifically, we measure the alignment between a sample's visual features and concept bottlenecks, derived via large language models, by training a linear concept bottleneck layer and computing the sample's difficulty score using it. We then use stratified sampling based on this score to generate a coreset of the dataset. Crucially, our score is efficiently computable without training the downstream model on the full dataset even once, leads to high-performing coresets for various downstream models, and is computable even for an unlabeled dataset. Through experiments on CIFAR-10/100, and ImageNet-1K, we show that our coresets outperform random subsets, even at high pruning rates, and achieve model performance comparable to or better than coresets found by training dynamics-based methods.
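The stratified sampling step mentioned above admits a short sketch (the binning scheme and defaults are ours): bin samples by difficulty score and draw from each bin in proportion to its size, so the coreset preserves the score distribution:

```python
import numpy as np

def stratified_coreset(scores, budget, n_strata=10, seed=0):
    # Bin samples by difficulty score, then sample from each bin in
    # proportion to its size.
    rng = np.random.default_rng(seed)
    edges = np.quantile(scores, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, scores, side="right") - 1,
                     0, n_strata - 1)
    selected = []
    for s in range(n_strata):
        idx = np.where(strata == s)[0]
        if len(idx) == 0:
            continue
        take = min(len(idx), max(1, round(budget * len(idx) / len(scores))))
        selected.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(selected)

scores = np.random.default_rng(1).random(10_000)   # per-sample difficulty scores
coreset = stratified_coreset(scores, budget=1_000)
print(len(coreset))
```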
Authors: Xiangwen Wang, Yibo Jacky Zhang, Zhoujie Ding, Katherine Tsai, Haolun Wu, Sanmi Koyejo
Abstract: Compound AI systems, comprising multiple interacting components such as LLMs, foundation models, and external tools, have demonstrated remarkable improvements compared to single models in various tasks. To ensure their effective deployment in real-world applications, aligning these systems with human preferences is crucial. However, aligning the compound system via policy optimization, unlike the alignment of a single model, is challenging for two main reasons: (i) non-differentiable interactions between components make end-to-end gradient-based optimization methods inapplicable, and (ii) system-level preferences cannot be directly transformed into component-level preferences. To address these challenges, we first formulate compound AI systems as Directed Acyclic Graphs (DAGs), explicitly modeling both component interactions and the associated data flows. Building on this formulation, we introduce $\textbf{SysDPO}$, a framework that extends Direct Preference Optimization (DPO) to enable joint system-level alignment. We propose two variants, SysDPO-Direct and SysDPO-Sampling, tailored for scenarios depending on whether we construct a system-specific preference dataset. We empirically demonstrate the effectiveness of our approach across two applications: the joint alignment of a language model and a diffusion model, and the joint alignment of an LLM collaboration system.
Authors: Samuele Sabella, Chiara Boldrini, Lorenzo Valerio, Andrea Passarella, Marco Conti
Abstract: Decentralized federated learning (DFL) enables devices to collaboratively train models over complex network topologies without relying on a central controller. In this setting, local data remains private, but its quality and quantity can vary significantly across nodes. The extent to which a fully decentralized system is vulnerable to poor-quality or corrupted data remains unclear, but several factors could contribute to potential risks. Without a central authority, there can be no unified mechanism to detect or correct errors, and each node operates with a localized view of the data distribution, making it difficult for the node to assess whether its perspective aligns with the true distribution. Moreover, models trained on low-quality data can propagate through the network, amplifying errors. To explore the impact of low-quality data on DFL, we simulate two scenarios with degraded data quality -- one where the corrupted data is evenly distributed in a subset of nodes and one where it is concentrated on a single node -- using a decentralized implementation of FedAvg. Our results reveal that averaging-based decentralized learning is remarkably robust to localized bad data, even when the corrupted data resides in the most influential nodes of the network. Counterintuitively, this robustness is further enhanced when the corrupted data is concentrated on a single node, regardless of its centrality in the communication network topology. This phenomenon is explained by the averaging process, which ensures that no single node -- however central -- can disproportionately influence the overall learning process.
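The averaging effect is easy to see in a toy simulation of decentralized averaging over a ring (our construction, much simpler than the paper's DFL setup): a single corrupted node's contribution is diluted to roughly 1/n of the consensus value:

```python
import numpy as np

n = 20
W = np.zeros((n, n))
for i in range(n):                       # ring topology with self-loops
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3

x = np.random.default_rng(0).normal(1.0, 0.1, size=n)   # local "model" estimates
x[0] = 10.0                              # one corrupted node
for _ in range(100):
    x = W @ x                            # one round of neighbor averaging
print(x.round(3))                        # all nodes near the global average
```

Since the mixing matrix is doubly stochastic, every node converges to the global average, so the corrupted value contributes only one twentieth of the consensus here.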
Authors: Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji
Abstract: Consistency Training (CT) has recently emerged as a strong alternative to diffusion models for image generation. However, non-distillation CT often suffers from high variance and instability, motivating ongoing research into its training dynamics. We propose Variational Consistency Training (VCT), a flexible and effective framework compatible with various forward kernels, including those in flow matching. Its key innovation is a learned noise-data coupling scheme inspired by Variational Autoencoders, where a data-dependent encoder models noise emission. This enables VCT to adaptively learn noise-to-data pairings, reducing training variance relative to the fixed, unsorted pairings in classical CT. Experiments on multiple image datasets demonstrate significant improvements: our method surpasses baselines, achieves state-of-the-art FID among non-distillation CT approaches on CIFAR-10, and matches SoTA performance on ImageNet 64 x 64 with only two sampling steps. Code is available at https://github.com/sony/vct.
Authors: Michael Somma, Thomas Gallien, Branka Stojanovic
Abstract: Anomaly detection in complex dynamical systems is essential for ensuring reliability, safety, and efficiency in industrial and cyber-physical infrastructures. Predictive maintenance helps prevent costly failures, while cybersecurity monitoring has become critical as digitized systems face growing threats. Many of these systems exhibit oscillatory behaviors and bounded motion, requiring anomaly detection methods that capture structured temporal dependencies while adhering to physical consistency principles. In this work, we propose a system-theoretic approach to anomaly detection, grounded in classical embedding theory and physics-inspired consistency principles. We build upon the Fractal Whitney Embedding Prevalence Theorem that extends traditional embedding techniques to complex system dynamics. Additionally, we introduce state-derivative pairs as an embedding strategy to capture system evolution. To enforce temporal coherence, we develop a Temporal Differential Consistency Autoencoder (TDC-AE), incorporating a TDC-Loss that aligns the approximated derivatives of latent variables with their dynamic representations. We evaluate our method on the C-MAPSS dataset, a benchmark for turbofan aeroengine degradation. TDC-AE outperforms LSTMs and Transformers while achieving a nearly 200x reduction in MAC operations, making it particularly suited for lightweight edge computing. Our findings support the hypothesis that anomalies disrupt stable system dynamics, providing a robust signal for anomaly detection.
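The TDC-Loss idea of pairing states with derivatives can be sketched as a finite-difference consistency term on latent trajectories; the exact pairing and weighting used in the paper may differ:

```python
import torch

def tdc_loss(z_seq, z_dot_seq, dt=1.0):
    # Align the finite-difference derivative of the latent trajectory with
    # the model's own latent-derivative estimates (state-derivative pairs).
    fd = (z_seq[:, 1:] - z_seq[:, :-1]) / dt        # finite-difference derivative
    return ((fd - z_dot_seq[:, :-1]) ** 2).mean()

z = torch.randn(4, 16, 8)        # (batch, time, latent) latent states
z_dot = torch.randn(4, 16, 8)    # model's latent-derivative estimates
print(tdc_loss(z, z_dot))        # added to the autoencoder reconstruction loss
```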
Authors: Yixiang Sun, Haotian Fu, Michael Littman, George Konidaris
Abstract: We propose DRAGO, a novel approach for continual model-based reinforcement learning aimed at improving the incremental development of world models across a sequence of tasks that differ in their reward functions but not the state space or dynamics. DRAGO comprises two key components: Synthetic Experience Rehearsal, which leverages generative models to create synthetic experiences from past tasks, allowing the agent to reinforce previously learned dynamics without storing data, and Regaining Memories Through Exploration, which introduces an intrinsic reward mechanism to guide the agent toward revisiting relevant states from prior tasks. Together, these components enable the agent to maintain a comprehensive and continually developing world model, facilitating more effective learning and adaptation across diverse environments. Empirical evaluations demonstrate that DRAGO is able to preserve knowledge across tasks, achieving superior performance in various continual learning scenarios.
Authors: Prakash Thakolkaran, Yaqi Guo, Shivam Saini, Mathias Peirlinck, Benjamin Alheit, Siddhant Kumar
Abstract: Traditional constitutive models rely on hand-crafted parametric forms with limited expressivity and generalizability, while neural network-based models can capture complex material behavior but often lack interpretability. To balance these trade-offs, we present monotonic Input-Convex Kolmogorov-Arnold Networks (ICKANs) for learning polyconvex hyperelastic constitutive laws. ICKANs leverage the Kolmogorov-Arnold representation, decomposing the model into compositions of trainable univariate spline-based activation functions for rich expressivity. We introduce trainable monotonic input-convex splines within the KAN architecture, ensuring physically admissible polyconvex models for isotropic compressible hyperelasticity. The resulting models are both compact and interpretable, enabling explicit extraction of analytical constitutive relationships through a monotonic input-convex symbolic regression technique. Through unsupervised training on full-field strain data and limited global force measurements, ICKANs accurately capture nonlinear stress-strain behavior across diverse strain states. Finite element simulations of unseen geometries with trained ICKAN hyperelastic constitutive models confirm the framework's robustness and generalization capability.
Authors: Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, Bo Han
Abstract: Large language model (LLM) unlearning has demonstrated its essential role in removing privacy- and copyright-related responses, crucial for the legal and safe application of LLMs. However, the pursuit of complete unlearning often comes with substantial costs due to compromises in general functionality, leading to a notorious trade-off between unlearning and retention. This motivates us to explore enhanced unlearning schemes that can mitigate this trade-off. Specifically, we propose Gradient Rectified Unlearning (GRU), an improved framework that regulates the directions of gradient updates during the unlearning procedure such that their side effects on other, unrelated responses are minimized. GRU is easy and general to implement, demonstrating practical effectiveness across a variety of well-established unlearning benchmarks.
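One plausible instantiation of the gradient regulation (our assumption; the paper's exact rectification rule may differ) is a PCGrad-style projection: when the unlearning gradient conflicts with a retention gradient, remove the conflicting component:

```python
import torch

def rectify(g_forget, g_retain):
    # If the unlearning gradient points against the retention gradient,
    # project out the conflicting component so the update does not degrade
    # unrelated responses.
    dot = torch.dot(g_forget.flatten(), g_retain.flatten())
    if dot < 0:
        g_forget = g_forget - dot / g_retain.norm() ** 2 * g_retain
    return g_forget

gf, gr = torch.randn(10), torch.randn(10)
print(torch.dot(rectify(gf, gr), gr) >= -1e-6)   # no residual conflict
```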
Authors: Fengze Sun, Yanchuan Chang, Egemen Tanin, Shanika Karunasekera, Jianzhong Qi
Abstract: The increasing availability of urban data offers new opportunities for learning region representations, which can be used as input to machine learning models for downstream tasks such as check-in or crime prediction. While existing solutions have produced promising results, an issue is their fixed formation of regions and fixed input region features, which may not suit the needs of different downstream tasks. To address this limitation, we propose a model named FlexiReg for urban region representation learning that is flexible with both the formation of urban regions and the input region features. FlexiReg is based on a spatial grid partitioning over the spatial area of interest. It learns representations for the grid cells, leveraging publicly accessible data, including POI, land use, satellite imagery, and street view imagery. We propose adaptive aggregation to fuse the cell representations and prompt learning techniques to tailor the representations towards different tasks, addressing the needs of varying formations of urban regions and downstream tasks. Extensive experiments on five real-world datasets demonstrate that FlexiReg outperforms state-of-the-art models by up to 202% in terms of accuracy on four diverse downstream tasks using the produced urban region representations.
Authors: Yaorui Shi, Jiaqi Yang, Changhao Nai, Sihang Li, Junfeng Fang, Xiang Wang, Zhiyuan Liu, Yang Zhang
Abstract: Single-cell RNA sequencing (scRNA-seq) offers detailed insights into cellular heterogeneity. Recent advancements leverage single-cell large language models (scLLMs) for effective representation learning. However, these models focus exclusively on transcriptomic data, neglecting complementary biological knowledge from textual descriptions. To overcome this limitation, we propose scMMGPT, a novel multimodal framework designed for language-enhanced representation learning in single-cell transcriptomics. Unlike existing methods, scMMGPT employs robust cell representation extraction, preserving quantitative gene expression data, and introduces an innovative two-stage pre-training strategy combining discriminative precision with generative flexibility. Extensive experiments demonstrate that scMMGPT significantly outperforms unimodal and multimodal baselines across key downstream tasks, including cell annotation and clustering, and exhibits superior generalization in out-of-distribution scenarios.
Authors: Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
Abstract: Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across eight diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across eight recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at: www.neuronpedia.org/sae-bench
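For readers unfamiliar with the object under evaluation, a minimal generic SAE (not any specific architecture from the suite) is a wide ReLU dictionary trained with reconstruction plus an L1 sparsity penalty on the codes:

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    # Overcomplete ReLU dictionary over model activations; the codes f are
    # the interpretable "features" SAEBench-style metrics evaluate.
    def __init__(self, d_model=768, d_dict=16 * 768):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_dict)
        self.dec = torch.nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))          # sparse feature activations
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse codes.
    return ((x_hat - x) ** 2).mean() + l1_coeff * f.abs().mean()

x = torch.randn(32, 768)                     # residual-stream activations
sae = SparseAutoencoder()
x_hat, f = sae(x)
print(sae_loss(x, x_hat, f))
```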
Authors: Nathan Drenkow, Mitchell Pavlak, Keith Harrigian, Ayah Zirikly, Adarsh Subbaswamy, Mohammad Mehdi Farhangi, Nicholas Petrick, Mathias Unberath
Abstract: Artificial Intelligence (AI) is now firmly at the center of evidence-based medicine. Despite many success stories that edge the path of AI's rise in healthcare, there are comparably many reports of significant shortcomings and unexpected behavior of AI in deployment. A major reason for these limitations is AI's reliance on association-based learning, where non-representative machine learning datasets can amplify latent bias during training and/or hide it during testing. To unlock new tools capable of foreseeing and preventing such AI bias issues, we present G-AUDIT. Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT) for datasets is a modality-agnostic dataset auditing framework that allows for generating targeted hypotheses about sources of bias in training or testing data. Our method examines the relationship between task-level annotations (commonly referred to as ``labels'') and data properties including patient attributes (e.g., age, sex) and environment/acquisition characteristics (e.g., clinical site, imaging protocols). G-AUDIT quantifies the extent to which the observed data attributes pose a risk for shortcut learning, or in the case of testing data, might hide predictions made based on spurious associations. We demonstrate the broad applicability of our method by analyzing large-scale medical datasets for three distinct modalities and machine learning tasks: skin lesion classification in images, stigmatizing language classification in Electronic Health Records (EHR), and mortality prediction for ICU tabular data. In each setting, G-AUDIT successfully identifies subtle biases commonly overlooked by traditional qualitative methods, underscoring its practical value in exposing dataset-level risks and supporting the downstream development of reliable AI systems.
Authors: Jacob Comeau, Mathieu Bazinet, Pascal Germain, Cem Subakan
Abstract: Continual learning algorithms aim to learn from a sequence of tasks, making the training distribution non-stationary. The majority of existing continual learning approaches in the literature rely on heuristics and do not provide learning guarantees. In this paper, we present a new method called Continual Pick-to-Learn (CoP2L), which is able to retain the most representative samples for each task in an efficient way. CoP2L combines the Pick-to-Learn algorithm (rooted in the sample compression theory) and the experience replay continual learning scheme. This allows us to provide non-vacuous upper bounds on the generalization loss of the learned predictors, numerically computable after each task. We empirically evaluate our approach on several standard continual learning benchmarks across Class-Incremental, Task-Incremental, and Domain-Incremental settings. Our results show that CoP2L is highly competitive across all setups, often outperforming existing baselines, and significantly mitigating catastrophic forgetting compared to vanilla experience replay in the Class-Incremental setting. It is possible to leverage the bounds provided by CoP2L in practical scenarios to certify the predictor reliability on previously learned tasks, in order to improve the trustworthiness of the continual learning algorithm.
Authors: Chaoyu Liu, Davide Murari, Lihao Liu, Yangming Li, Chris Budd, Carola-Bibiane Sch\"onlieb
Abstract: Partial Differential Equation (PDE) problems often exhibit strong local spatial structures, and effectively capturing these structures is critical for approximating their solutions. Recently, the Fourier Neural Operator (FNO) has emerged as an efficient approach for solving these PDE problems. By using parametrization in the frequency domain, FNOs can efficiently capture global patterns. However, this approach inherently overlooks the critical role of local spatial features, as frequency-domain parameterized convolutions primarily emphasize global interactions without encoding comprehensive localized spatial dependencies. Although several studies have attempted to address this limitation, their extracted Local Spatial Features (LSFs) remain insufficient, and computational efficiency is often compromised. To address this limitation, we introduce a convolutional neural network (CNN)-based feature pre-extractor to capture LSFs directly from input data, resulting in a hybrid architecture termed \textit{Conv-FNO}. Furthermore, we introduce two novel resizing schemes to make our Conv-FNO resolution invariant. In this work, we focus on demonstrating the effectiveness of incorporating LSFs into FNOs by conducting both a theoretical analysis and extensive numerical experiments. Our findings show that this simple yet impactful modification enhances the representational capacity of FNOs and significantly improves performance on challenging PDE benchmarks.
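The hybrid design can be sketched in a few lines: a small convolution supplies local spatial features before a standard FNO spectral convolution. This is our simplified single-block illustration (only the lowest-frequency corner modes are kept), not the paper's full architecture or its resizing schemes:

```python
import torch

class SpectralConv2d(torch.nn.Module):
    # FNO-style spectral convolution: learned complex weights applied to the
    # lowest-frequency corner modes (negative-frequency rows omitted for brevity).
    def __init__(self, ch, modes=12):
        super().__init__()
        self.modes = modes
        self.w = torch.nn.Parameter(
            torch.randn(ch, ch, modes, modes, dtype=torch.cfloat) / ch)

    def forward(self, x):                          # x: (batch, ch, H, W)
        x_ft = torch.fft.rfft2(x)
        out = torch.zeros_like(x_ft)
        m = self.modes
        out[:, :, :m, :m] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :m, :m], self.w)
        return torch.fft.irfft2(out, s=x.shape[-2:])

class ConvFNOBlock(torch.nn.Module):
    # Local CNN pre-extractor feeding the global spectral convolution,
    # with a residual connection.
    def __init__(self, ch=32):
        super().__init__()
        self.local = torch.nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.spectral = SpectralConv2d(ch)

    def forward(self, x):
        return torch.relu(self.spectral(self.local(x)) + x)

print(ConvFNOBlock()(torch.randn(2, 32, 64, 64)).shape)
```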
Authors: Bicheng Ying, Zhe Li, Haibo Yang
Abstract: This work tackles the fundamental challenges in Federated Learning (FL) posed by arbitrary client participation and data heterogeneity, prevalent characteristics in practical FL settings. It is well-established that popular FedAvg-style algorithms struggle with exact convergence and can suffer from slow convergence rates since a decaying learning rate is required to mitigate these scenarios. To address these issues, we introduce the concept of stochastic matrices and the corresponding time-varying graphs as a novel modeling tool to accurately capture the dynamics of arbitrary client participation and the local update procedure. Leveraging this approach, we offer a fresh perspective on designing FL algorithms, provide a rigorous quantitative analysis of the limitations inherent in the FedAvg algorithm, and present FOCUS, Federated Optimization with Exact Convergence via Push-pull Strategy, a provably convergent algorithm designed to effectively overcome the previously mentioned two challenges. More specifically, we provide a rigorous proof demonstrating that FOCUS achieves exact convergence with a linear rate regardless of the arbitrary client participation, establishing it as the first work to demonstrate this significant result.
Authors: Matthieu Meunier, Christoph Reisinger, Yufei Zhang
Abstract: Designing efficient learning algorithms with complexity guarantees for Markov decision processes (MDPs) with large or continuous state and action spaces remains a fundamental challenge. We address this challenge for entropy-regularized MDPs with Polish state and action spaces, assuming access to a generative model of the environment. We propose a novel family of multilevel Monte Carlo (MLMC) algorithms that integrate fixed-point iteration with MLMC techniques and a generic stochastic approximation of the Bellman operator. We quantify the precise impact of the chosen approximate Bellman operator on the accuracy of the resulting MLMC estimator. Leveraging this error analysis, we show that using a biased plain MC estimate for the Bellman operator results in quasi-polynomial sample complexity, whereas an unbiased randomized multilevel approximation of the Bellman operator achieves polynomial sample complexity in expectation. Notably, these complexity bounds are independent of the dimensions or cardinalities of the state and action spaces, distinguishing our approach from existing algorithms whose complexities scale with the sizes of these spaces. We validate these theoretical performance guarantees through numerical experiments.
Authors: Tomomasa Yamasaki, Zhehui Wang, Tao Luo, Niangjun Chen, Bo Wang
Abstract: Neural Architecture Search (NAS) is an automated technique to design optimal neural network architectures for a specific workload. Conventionally, evaluating candidate networks in NAS involves extensive training, which requires significant time and computational resources. To address this, training-free NAS has been proposed to expedite network evaluation with minimal search time. However, state-of-the-art training-free NAS algorithms struggle to precisely distinguish well-performing networks from poorly-performing ones, resulting in inaccurate performance predictions and consequently sub-optimal top-1 network accuracy. Moreover, they are less effective in activation function exploration. To tackle these challenges, this paper proposes RBFleX-NAS, a novel training-free NAS framework that accounts for both activation outputs and input features of the last layer with a Radial Basis Function (RBF) kernel. We also present a detection algorithm to identify optimal hyperparameters using the obtained activation outputs and input feature maps. We verify the efficacy of RBFleX-NAS over a variety of NAS benchmarks. RBFleX-NAS significantly outperforms state-of-the-art training-free NAS methods in terms of top-1 accuracy, achieving this with short search time in NAS-Bench-201 and NAS-Bench-SSS. In addition, it demonstrates higher Kendall correlation compared to layer-based training-free NAS algorithms. Furthermore, we propose NAFBee, a new activation design space that extends the activation type to encompass various commonly used functions. In this extended design space, RBFleX-NAS demonstrates its superiority by accurately identifying the best-performing network during activation function search, providing a significant advantage over other NAS algorithms.
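As a rough illustration of the scoring idea, the sketch below builds RBF kernel matrices from a candidate network's activation outputs and last-layer input features over a minibatch and correlates them; the exact score and hyperparameter detection in RBFleX-NAS differ, so treat this purely as a conceptual stand-in.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Pairwise RBF kernel matrix over the rows of X."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def training_free_score(acts, feats, gamma_a=1e-3, gamma_f=1e-3):
    """Hypothetical RBF-based score: agreement between the similarity
    structure of activation outputs and of last-layer input features
    over a minibatch (higher agreement -> better predicted network)."""
    Ka = rbf_kernel(acts.reshape(len(acts), -1), gamma_a)
    Kf = rbf_kernel(feats.reshape(len(feats), -1), gamma_f)
    return float(np.corrcoef(Ka.ravel(), Kf.ravel())[0, 1])
```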
Authors: Thomas Mulc, Mike Anderson, Paul Cubre, Huikun Zhang, Ivy Liu, Saket Kumar
Abstract: We present NNN, an experimental Transformer-based neural network approach to marketing measurement. Unlike Marketing Mix Models (MMMs) which rely on scalar inputs and parametric decay functions, NNN uses rich embeddings to capture both quantitative and qualitative aspects of marketing and organic channels (e.g., search queries, ad creatives). This, combined with its attention mechanism, potentially enables NNN to model complex interactions, capture long-term effects, and improve sales attribution accuracy. We show that L1 regularization permits the use of such expressive models in typical data-constrained settings. Evaluating NNN on simulated and real-world data demonstrates its efficacy, particularly through considerable improvement in predictive power. In addition to marketing measurement, the NNN framework can provide valuable, complementary insights through model probing, such as evaluating keyword or creative effectiveness.
Authors: Matthew B. Webster, Dongheon Lee, Joonnyong Lee
Abstract: Biosignals can be viewed as mixtures measuring particular physiological events, and blind source separation (BSS) aims to extract underlying source signals from mixtures. This paper proposes a self-supervised multi-encoder autoencoder (MEAE) to separate heartbeat-related source signals from photoplethysmogram (PPG), enhancing heart rate (HR) detection in noisy PPG data. The MEAE is trained on PPG signals from a large open polysomnography database without any pre-processing or data selection. The trained network is then applied to a noisy PPG dataset collected during the daily activities of nine subjects. The extracted heartbeat-related source signal significantly improves HR detection as compared to the original PPG. The absence of pre-processing and the self-supervised nature of the proposed method, combined with its strong performance, highlight the potential of MEAE for BSS in biosignal analysis.
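A minimal sketch of the multi-encoder autoencoder pattern, with assumed layer sizes: each encoder is intended to latch onto one latent source, and zeroing all but one code at inference isolates that source's reconstruction.

```python
import torch
import torch.nn as nn

class MultiEncoderAE(nn.Module):
    """Self-supervised multi-encoder autoencoder sketch for BSS: each
    encoder is meant to capture one source in the PPG mixture; a shared
    decoder reconstructs the input from the concatenated codes."""
    def __init__(self, n_sources=2, sig_len=1024, latent=64):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Linear(sig_len, 256), nn.ReLU(),
                          nn.Linear(256, latent))
            for _ in range(n_sources)])
        self.decoder = nn.Sequential(
            nn.Linear(latent * n_sources, 256), nn.ReLU(),
            nn.Linear(256, sig_len))

    def forward(self, x, keep=None):
        codes = [enc(x) for enc in self.encoders]
        if keep is not None:  # isolate one source at inference time
            codes = [c if i == keep else torch.zeros_like(c)
                     for i, c in enumerate(codes)]
        return self.decoder(torch.cat(codes, dim=-1))
```

Training then amounts to minimizing a reconstruction loss such as F.mse_loss(model(x), x) on raw PPG windows, with no labels or pre-processing.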
Authors: Youssef Allouah, Rachid Guerraoui, John Stephan
Abstract: Resilience against malicious participants and data privacy are essential for trustworthy federated learning, yet achieving both with good utility typically requires the strong assumption of a trusted central server. This paper shows that a significantly weaker assumption suffices: each pair of participants shares a randomness seed unknown to others. In a setting where malicious participants may collude with an untrusted server, we propose CafCor, an algorithm that integrates robust gradient aggregation with correlated noise injection, using shared randomness between participants. We prove that CafCor achieves strong privacy-utility trade-offs, significantly outperforming local differential privacy (DP) methods, which do not make any trust assumption, while approaching central DP utility, where the server is fully trusted. Empirical results on standard benchmarks validate CafCor's practicality, showing that privacy and robustness can coexist in distributed systems without sacrificing utility or trusting the server.
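The core trick of pairwise shared randomness can be sketched in a few lines: each pair of clients derives the same Gaussian vector from its shared seed and adds it with opposite signs, so the noise masks individual updates from the untrusted server yet cancels in the honest aggregate. CafCor additionally combines this with robust gradient aggregation, which this sketch omits.

```python
import numpy as np

def pairwise_correlated_noise(i, n_clients, seed_table, dim, sigma):
    """Noise from pairwise shared seeds: for each pair (a, b), both
    parties draw the same Gaussian vector and add it with opposite
    signs, so the terms cancel in the honest sum."""
    noise = np.zeros(dim)
    for j in range(n_clients):
        if j == i:
            continue
        a, b = min(i, j), max(i, j)
        rng = np.random.default_rng(seed_table[(a, b)])
        z = rng.normal(0.0, sigma, dim)
        noise += z if i == a else -z
    return noise

# sanity check: the pairwise terms vanish in the sum over all clients
seeds = {(a, b): 1000 * a + b for a in range(4) for b in range(a + 1, 4)}
total = sum(pairwise_correlated_noise(i, 4, seeds, 5, 1.0) for i in range(4))
assert np.allclose(total, 0.0)
```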
Authors: Pablo Flores, Olga Graf, Pavlos Protopapas, Karim Pichara
Abstract: Physics-Informed Neural Networks (PINNs) have been widely used to obtain solutions to various physical phenomena modeled as Differential Equations. As PINNs are not naturally equipped with mechanisms for Uncertainty Quantification, some work has been done to quantify the different uncertainties that arise when dealing with PINNs. In this paper, we use a two-step procedure to train Bayesian Neural Networks that provide uncertainties over the solutions to differential equation systems provided by PINNs. We use available error bounds over PINNs to formulate a heteroscedastic variance that improves the uncertainty estimation. Furthermore, we solve forward problems and utilize the obtained uncertainties when doing parameter estimation in inverse problems in cosmology.
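The second training step optimizes, in generic form, a heteroscedastic Gaussian negative log-likelihood; in the paper's spirit the variance term is informed by PINN error bounds rather than left as a free prediction. A minimal PyTorch version:

```python
import torch

def heteroscedastic_nll(mean, log_var, target):
    """Gaussian negative log-likelihood with input-dependent variance
    (up to an additive constant). Predicting log-variance keeps the
    variance positive and the optimization stable."""
    inv_var = torch.exp(-log_var)
    return 0.5 * torch.mean(inv_var * (target - mean) ** 2 + log_var)
```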
Authors: Victor Petr\'en Bach Hansen, Lasse Krogsb{\o}ll, Jonas Lyngs{\o}, Mathias Baltzersen, Andreas Motzfeldt, Kevin Pelgrims, Lars Maal{\o}e
Abstract: There are now a multitude of AI-scribing solutions for healthcare promising the utilization of large language models for ambient documentation. However, these AI scribes still rely on one-shot or few-shot prompts for generating notes after the consultation has ended, employing little to no reasoning. This risks long notes with an increase in hallucinations, misrepresentation of the clinician's intent, and reliance on the clinician's proofreading to catch errors: a dangerous combination for patient safety if vigilance is compromised by workload and fatigue. In this paper, we introduce a method for extracting salient clinical information in real-time alongside the healthcare consultation, denoted Facts, and use that information recursively to generate the final note. The FactsR method results in more accurate and concise notes by placing the clinician in the loop of note generation, while opening up new use cases within real-time decision support.
Authors: Emmanuel Noutahi, Jason Hartford, Prudencio Tossou, Shawn Whitfield, Alisandra K. Denton, Cas Wognum, Kristina Ulicna, Michael Craig, Jonathan Hsu, Michael Cuccarese, Emmanuel Bengio, Dominique Beaini, Christopher Gibson, Daniel Cohen, Berton Earnshaw
Abstract: Drug discovery is fundamentally a process of inferring the effects of treatments on patients, and would therefore benefit immensely from computational models that can reliably simulate patient responses, enabling researchers to generate and test large numbers of therapeutic hypotheses safely and economically before initiating costly clinical trials. Even a more specific model that predicts the functional response of cells to a wide range of perturbations would be tremendously valuable for discovering safe and effective treatments that successfully translate to the clinic. Creating such virtual cells has long been a goal of the computational research community that unfortunately remains unachieved given the daunting complexity and scale of cellular biology. Nevertheless, recent advances in AI, computing power, lab automation, and high-throughput cellular profiling provide new opportunities for reaching this goal. In this perspective, we present a vision for developing and evaluating virtual cells that builds on our experience at Recursion. We argue that in order to be a useful tool to discover novel biology, virtual cells must accurately predict the functional response of a cell to perturbations and explain how the predicted response is a consequence of modifications to key biomolecular interactions. We then introduce key principles for designing therapeutically-relevant virtual cells, describe a lab-in-the-loop approach for generating novel insights with them, and advocate for biologically-grounded benchmarks to guide virtual cell development. Finally, we make the case that our approach to virtual cells provides a useful framework for building other models at higher levels of organization, including virtual patients. We hope that these directions prove useful to the research community in developing virtual models optimized for positive impact on drug discovery outcomes.
Authors: Susav Shrestha, Brad Settlemyer, Nikoli Dryden, Narasimha Reddy
Abstract: Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly more expensive at scale, while its head sparsity remains stable and batch-invariant. We develop hardware-efficient, sparsity-aware GPU kernels for selective MLP and Attention computations, delivering up to \(2.2\times\) end-to-end speedups for models like OPT, LLaMA-2 \& 3, across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes, making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. Our code is available at: https://github.com/susavlsh10/Polar-Sparsity.
Authors: Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li
Abstract: In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and an MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks and offers better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and code: https://ml-gsai.github.io/LLaDA-V-demo/.
Authors: Kun Yang, Neena Imam
Abstract: Federated Learning (FL) enables collaborative machine learning across decentralized data sources without sharing raw data. It offers a promising approach to privacy-preserving AI. However, FL remains vulnerable to adversarial threats from malicious participants, referred to as Byzantine clients, who can send misleading updates to corrupt the global model. Traditional aggregation methods, such as simple averaging, are not robust to such attacks. More resilient approaches, like the Krum algorithm, require prior knowledge of the number of malicious clients, which is often unavailable in real-world scenarios. To address these limitations, we propose Average-rKrum (ArKrum), a novel aggregation strategy designed to enhance both the resilience and privacy guarantees of FL systems. Building on our previous work (rKrum), ArKrum introduces two key innovations. First, it includes a median-based filtering mechanism that removes extreme outliers before estimating the number of adversarial clients. Second, it applies a multi-update averaging scheme to improve stability and performance, particularly when client data distributions are not identical. We evaluate ArKrum on benchmark image and text datasets under three widely studied Byzantine attack types. Results show that ArKrum consistently achieves high accuracy and stability. It performs as well as or better than other robust aggregation methods. These findings demonstrate that ArKrum is an effective and practical solution for secure FL systems in adversarial environments.
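A sketch of the three-stage aggregation pattern described above, with the filtering threshold and scoring details as assumptions rather than the paper's exact choices:

```python
import numpy as np

def arkrum(updates, n_avg=3):
    """ArKrum-style aggregation sketch: (1) median-based filtering of
    extreme outliers, (2) Krum-style neighbor-distance scoring, and
    (3) averaging the best-scoring updates for stability."""
    U = np.stack(updates)                                  # (n, d)
    d_med = np.linalg.norm(U - np.median(U, axis=0), axis=1)
    U = U[d_med <= np.median(d_med) + 2.0 * d_med.std()]   # step 1: filter
    D = np.linalg.norm(U[:, None] - U[None, :], axis=2)    # pairwise dists
    k = max(1, len(U) // 2)
    scores = np.sort(D, axis=1)[:, 1:k + 1].sum(axis=1)    # step 2: score
    best = np.argsort(scores)[:n_avg]
    return U[best].mean(axis=0)                            # step 3: average
```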
Authors: Tina Khezresmaeilzadeh, Parsa Razmara, Seyedarmin Azizi, Mohammad Erfan Sadeghi, Erfan Baghaei Potraghloo
Abstract: Stock price prediction remains a complex and high-stakes task in financial analysis, traditionally addressed using statistical models or, more recently, language models. In this work, we introduce VISTA (Vision-Language Inference for Stock Time-series Analysis), a novel, training-free framework that leverages Vision-Language Models (VLMs) for multi-modal stock forecasting. VISTA prompts a VLM with both textual representations of historical stock prices and their corresponding line charts to predict future price values. By combining numerical and visual modalities in a zero-shot setting and using carefully designed chain-of-thought prompts, VISTA captures complementary patterns that unimodal approaches often miss. We benchmark VISTA against standard baselines, including ARIMA and text-only LLM-based prompting methods. Experimental results show that VISTA outperforms these baselines by up to 89.83%, demonstrating the effectiveness of multi-modal inference for stock time-series analysis and highlighting the potential of VLMs in financial forecasting tasks without requiring task-specific training.
Authors: Zhuoyu Cheng, Kohei Hatano, Eiji Takimoto
Abstract: We consider a bandit optimization problem for nonconvex and non-smooth functions, where in each trial the loss function is the sum of a linear function and a small but arbitrary perturbation chosen after observing the player's choice. We give both expected and high-probability regret bounds for the problem. Our result also implies an improved high-probability regret bound for bandit linear optimization, a special case with no perturbation. We also give a lower bound on the expected regret.
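In symbols, the setting can be summarized as follows, where the perturbation bound $\delta$ is notation introduced here for illustration:

```latex
f_t(x_t) = \langle \theta_t, x_t \rangle + \varepsilon_t(x_t),
\qquad |\varepsilon_t(x)| \le \delta
\quad \text{($\varepsilon_t$ chosen after $x_t$ is observed)},
\qquad
R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x).
```

Setting $\delta = 0$ recovers bandit linear optimization as the special case.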
Authors: Hung Anh Vu, Galen Reeves, Emily Wenger
Abstract: The internet is full of AI-generated content while also serving as a common source of training data for generative AI (genAI) models. This duality raises the possibility that future genAI models may be trained on other models' generated outputs. Prior work has studied consequences of models training on their own generated outputs, but limited work has considered what happens if models ingest content produced by other models. Given society's increasing dependence on genAI tools, understanding downstream effects of such data-mediated model interactions is critical. To this end, we provide empirical evidence for how data-mediated interactions might unfold in practice, develop a theoretical model for this interactive training process, and experimentally show possible long-term outcomes of such interactions. We find that data-mediated interactions can benefit models by exposing them to novel concepts perhaps missed in original training data, but can also homogenize their performance on shared tasks.
Authors: Josiah Couch, Miao Li, Rima Arnaout, Ramy Arnaout
Abstract: In the universal quest to optimize machine-learning classifiers, three factors -- model architecture, dataset size, and class balance -- have been shown to influence test-time performance but do not fully account for it. Previously, evidence was presented for an additional factor that can be referred to as dataset quality, but it was unclear whether this was actually a joint property of the dataset and the model architecture, or an intrinsic property of the dataset itself. If quality is truly dataset-intrinsic and independent of model architecture, dataset size, and class balance, then the same datasets should perform better (or worse) regardless of these other factors. To test this hypothesis, here we create thousands of datasets, each controlled for size and class balance, and use them to train classifiers with a wide range of architectures, from random forests and support-vector machines to deep networks. We find that classifier performance correlates strongly by subset across architectures ($R^2=0.79$), supporting quality as an intrinsic property of datasets independent of dataset size and class balance and of model architecture. Digging deeper, we find that dataset quality appears to be an emergent property of something more fundamental: the quality of datasets' constituent classes. Thus, quality joins size, class balance, and model architecture as an independent correlate of performance and a separate target for optimizing machine-learning-based classification.
Authors: Chenjie Li, Amir Gilad, Boris Glavic, Zhengjie Miao, Sudeepa Roy
Abstract: Programmatic weak supervision (PWS) significantly reduces human effort for labeling data by combining the outputs of user-provided labeling functions (LFs) on unlabeled datapoints. However, the quality of the generated labels depends directly on the accuracy of the LFs. In this work, we study the problem of fixing LFs based on a small set of labeled examples. Towards this goal, we develop novel techniques for repairing a set of LFs by minimally changing their results on the labeled examples such that the fixed LFs ensure that (i) there is sufficient evidence for the correct label of each labeled datapoint and (ii) the accuracy of each repaired LF is sufficiently high. We model LFs as conditional rules, which enables us to refine them, i.e., to selectively change their output for some inputs. We demonstrate experimentally that our system improves the quality of LFs based on surprisingly small sets of labeled datapoints.
Authors: Raj Ghugare, Benjamin Eysenbach
Abstract: Modern reinforcement learning (RL) algorithms have found success by using powerful probabilistic models, such as transformers, energy-based models, and diffusion/flow-based models. To this end, RL researchers often choose to pay the price of accommodating these models into their algorithms -- diffusion models are expressive, but are computationally intensive due to their reliance on solving differential equations, while autoregressive transformer models are scalable but typically require learning discrete representations. Normalizing flows (NFs), by contrast, seem to provide an appealing alternative, as they enable likelihoods and sampling without solving differential equations or relying on autoregressive architectures. However, their potential in RL has received limited attention, partly due to the prevailing belief that normalizing flows lack sufficient expressivity. We show that this is not the case. Building on recent work in NFs, we propose a single NF architecture which integrates seamlessly into RL algorithms, serving as a policy, Q-function, and occupancy measure. Our approach leads to much simpler algorithms, and achieves higher performance in imitation learning, offline RL, goal-conditioned RL, and unsupervised RL.
Authors: Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei
Abstract: Reinforcement learning algorithms are fundamental to align large language models with human preferences and to enhance their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO integrates a practically feasible formulation of the optimal reward baseline that minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is merged into the verl library at https://verl.readthedocs.io/en/latest/algo/opo.html.
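The variance-minimizing scalar baseline for a REINFORCE-style estimator is b* = E[w R] / E[w], with w the squared gradient norm of the log-probability; using the response length in tokens as a cheap proxy for w is an assumption of this sketch, not necessarily the paper's exact formulation.

```python
import torch

def opo_style_advantages(rewards, lengths):
    """Advantages under a variance-minimizing scalar baseline:
    b* = sum(w * R) / sum(w), where w = ||grad log pi||^2 is
    approximated here by the response length (an assumption)."""
    w = lengths.float()
    baseline = (w * rewards).sum() / w.sum()
    return rewards - baseline

# toy usage: three sampled responses with binary rewards
adv = opo_style_advantages(torch.tensor([1.0, 0.0, 1.0]),
                           torch.tensor([120, 340, 95]))
```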
Authors: James R. Golden
Abstract: We demonstrate that the inference operations of several open-weight large language models (LLMs) can be mapped to an exactly equivalent linear system for an input sequence without modifying the model weights or altering output predictions. Extending techniques from image diffusion models that exhibit local or piecewise linearity, we strategically alter the gradient computation with respect to a given input sequence for a next-token prediction such that the Jacobian of the model nearly exactly reproduces the forward prediction with a linear system. We demonstrate this approach across models (Llama 3, Gemma 3, Qwen 3, Phi 4, Mistral Ministral and OLMo 2, up to Llama 3.3 70B Q4) and show through the singular value decomposition of the detached Jacobian that these LLMs operate in extremely low-dimensional subspaces where many of the largest singular vectors decode to concepts related to the most-likely output token. This approach also allows us to examine the operation of each successive layer (and its attention and MLP components) as nearly-exact linear systems and observe the emergence of semantic concepts. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through nearly-exact locally linear decompositions that provide insights into their internal representations and reveal interpretable semantic structures in the next-token prediction process.
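A bare-bones version of the measurement, assuming a Hugging Face-style model that accepts inputs_embeds: compute the Jacobian of the final next-token logits with respect to the input embeddings, then compare the Jacobian applied to the embeddings against the forward logits. Without the paper's gradient-detachment modifications, the match is only approximate.

```python
import torch

def detached_jacobian_logits(model, embeds):
    """Jacobian of the next-token logits w.r.t. the input embeddings
    (embeds: (T, D)). If the network is locally linear at this input,
    einsum("vtd,td->v", J, embeds) nearly reproduces the logits."""
    def f(e):
        return model(inputs_embeds=e.unsqueeze(0)).logits[0, -1]
    with torch.no_grad():
        ref = f(embeds)                                  # forward logits
    J = torch.autograd.functional.jacobian(f, embeds)    # (vocab, T, D)
    lin = torch.einsum("vtd,td->v", J, embeds)           # linear replay
    return J, ref, lin
```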
Authors: Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
Abstract: Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77$\times$ training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
Authors: Md Shahriar Rahim Siddiqui, Moshe Eliasof, Eldad Haber
Abstract: Flow matching casts sample generation as learning a continuous-time velocity field that transports noise to data. Existing flow matching networks typically predict each point's velocity independently, considering only its location and time along its flow trajectory, and ignoring neighboring points. However, this pointwise approach may overlook correlations between points along the generation trajectory that could enhance velocity predictions, thereby improving downstream generation quality. To address this, we propose Graph Flow Matching (GFM), a lightweight enhancement that decomposes the learned velocity into a reaction term -- any standard flow matching network -- and a diffusion term that aggregates neighbor information via a graph neural module. This reaction-diffusion formulation retains the scalability of deep flow models while enriching velocity predictions with local context, all at minimal additional computational cost. Operating in the latent space of a pretrained variational autoencoder, GFM consistently improves Fr\'echet Inception Distance (FID) and recall across five image generation benchmarks (LSUN Church, LSUN Bedroom, FFHQ, AFHQ-Cat, and CelebA-HQ at $256\times256$), demonstrating its effectiveness as a modular enhancement to existing flow matching architectures.
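The decomposition is easy to state in code: the total velocity is a pointwise reaction network plus a graph message-passing correction over neighboring points. The message network below is an assumed design, not the paper's exact module.

```python
import torch
import torch.nn as nn

class GraphFlowVelocity(nn.Module):
    """GFM-style reaction-diffusion split: velocity = reaction term
    (any standard flow matching network) + diffusion term aggregating
    messages from neighboring points on the trajectory."""
    def __init__(self, reaction_net, dim, hidden=128):
        super().__init__()
        self.reaction = reaction_net         # called as reaction_net(x, t)
        self.msg = nn.Sequential(nn.Linear(2 * dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x, t, edges):
        # x: (N, D) latent points; edges: (2, E) neighbor index pairs
        v = self.reaction(x, t)
        src, dst = edges
        m = self.msg(torch.cat([x[dst], x[src] - x[dst]], dim=-1))
        return v + torch.zeros_like(x).index_add_(0, dst, m)
```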
Authors: David Steinmann, Wolfgang Stammer, Antonia W\"ust, Kristian Kersting
Abstract: Developing high-performing, yet interpretable models remains a critical challenge in modern AI. Concept-based models (CBMs) attempt to address this by extracting human-understandable concepts from a global encoding (e.g., image encoding) and then applying a linear classifier on the resulting concept activations, enabling transparent decision-making. However, their reliance on holistic image encodings limits their expressiveness in object-centric real-world settings and thus hinders their ability to solve complex vision tasks beyond single-label classification. To tackle these challenges, we introduce Object-Centric Concept Bottlenecks (OCB), a framework that combines the strengths of CBMs and pre-trained object-centric foundation models, boosting performance and interpretability. We evaluate OCB on complex image datasets and conduct a comprehensive ablation study to analyze key components of the framework, such as strategies for aggregating object-concept encodings. The results show that OCB outperforms traditional CBMs and allows one to make interpretable decisions for complex visual tasks.
Authors: Omri Lev, Vishwak Srinivasan, Moshe Shenfeld, Katrina Ligett, Ayush Sekhari, Ashia C. Wilson
Abstract: Gaussian sketching, which consists of pre-multiplying the data with a random Gaussian matrix, is a widely used technique for multiple problems in data science and machine learning, with applications spanning computationally efficient optimization, coded computing, and federated learning. This operation also provides differential privacy guarantees due to its inherent randomness. In this work, we revisit this operation through the lens of Renyi Differential Privacy (RDP), providing a refined privacy analysis that yields significantly tighter bounds than prior results. We then demonstrate how this improved analysis leads to performance improvement in different linear regression settings, establishing theoretical utility guarantees. Empirically, our methods improve performance across multiple datasets and, in several cases, reduce runtime.
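The operation itself is a one-liner; the sketch below applies it to least squares, where solving the smaller sketched problem approximates the original solution while the Gaussian randomness supplies the privacy.

```python
import numpy as np

def gaussian_sketch(M, k, seed=0):
    """Pre-multiply by a k x n Gaussian matrix with entry variance 1/k,
    so that E[S^T S] = I. Reusing the seed yields the same S for inputs
    with a matching number of rows."""
    S = np.random.default_rng(seed).normal(0.0, 1.0 / np.sqrt(k),
                                           size=(k, M.shape[0]))
    return S @ M

# sketched least squares: solve the smaller problem min ||S A x - S b||
rng = np.random.default_rng(1)
A, b = rng.normal(size=(10_000, 50)), rng.normal(size=(10_000, 1))
SA, Sb = gaussian_sketch(A, 500), gaussian_sketch(b, 500)  # same S (seed)
x_hat, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
```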
Authors: Jun Wu, Yirong Xiong, Jiangtao Wen, Yuxing Han
Abstract: Despite rapid advancements in the research and deployment of large language models (LLMs), the statistical distribution of model parameters, as well as their influence on initialization, training dynamics, and downstream efficiency, has received surprisingly little attention. A recent work introduced BackSlash, a training-time compression algorithm. It first demonstrated that pre-trained LLM parameters are well modeled by generalized Gaussian distributions (GGDs). By optimizing GG priors during training, BackSlash can reduce parameters by up to 90\% with minimal performance loss. Building on this foundational insight, we propose a unified, end-to-end framework for LLM optimization based on the GG model. Our contributions are threefold: (1) a GG-based initialization scheme that aligns with the statistical structure of trained models, resulting in faster convergence and improved accuracy; (2) DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile, improving compressibility with minimal performance degradation; and (3) RF8, a compact and hardware-efficient 8-bit floating-point format designed for GG-distributed-initialized BackSlash training, enabling low-cost inference without compromising accuracy. Experiments across diverse model architectures show that our framework consistently yields smaller and faster models that match or outperform standard training baselines. By grounding LLM development in principled statistical modeling, this work forges a new path toward efficient, scalable, and hardware-aware AI systems. The code is available on our project page: https://huggingface.co/spaces/shifeng3711/gg_prior.
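For point (1), sampling initial weights from a generalized Gaussian is directly available in SciPy; the shape and scale values below are illustrative, not the paper's fitted values.

```python
import numpy as np
from scipy.stats import gennorm

def gg_init(shape, beta=1.2, scale=0.02, seed=0):
    """Draw weights from a generalized Gaussian: beta=2 recovers the
    Gaussian, beta=1 the Laplace. beta and scale here are illustrative."""
    rng = np.random.default_rng(seed)
    return gennorm.rvs(beta, scale=scale, size=shape, random_state=rng)

W = gg_init((768, 768))   # e.g., one linear layer's weight matrix
```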
Authors: Junaid Muzaffar, Khubaib Ahmed, Ingo Frommholz, Zeeshan Pervez, Ahsan ul Haq
Abstract: Training reinforcement learning (RL) agents often requires significant computational resources and extended training times. To address this, we build upon the foundation laid by Google Brain's Sensory Neuron, which introduced a novel neural architecture for reinforcement learning tasks that maintains permutation invariance in the sensory neuron system. While the baseline model demonstrated significant performance improvements over traditional approaches, we identified opportunities to enhance the efficiency of the learning process further. We propose a modified attention mechanism incorporating a non-linear transformation of the key vectors (K) using a mapping function, resulting in a new set of key vectors (K'). This non-linear mapping enhances the representational capacity of the attention mechanism, allowing the model to encode more complex feature interactions and accelerating convergence without compromising performance. Our enhanced model demonstrates significant improvements in learning efficiency, showcasing the potential of non-linear attention mechanisms in advancing reinforcement learning algorithms.
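The modification amounts to one extra map in standard scaled dot-product attention; tanh below is an illustrative choice of mapping function, not necessarily the one used in the paper.

```python
import torch
import torch.nn.functional as F

def attention_with_key_map(Q, K, V, phi=torch.tanh):
    """Scaled dot-product attention where keys first pass through a
    non-linear map: K' = phi(K)."""
    Kp = phi(K)
    scores = Q @ Kp.transpose(-2, -1) / Q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ V
```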
Authors: Olya Mastikhina, Dhruv Sreenivas, Pablo Samuel Castro
Abstract: Actor-critic methods have been central to many of the recent advances in deep reinforcement learning. The most common approach is to use symmetric architectures, whereby both actor and critic have the same network topology and number of parameters. However, recent works have argued for the advantages of asymmetric setups, specifically with the use of smaller actors. We perform broad empirical investigations and analyses to better understand the implications of this and find that, in general, smaller actors result in performance degradation and overfit critics. Our analyses suggest poor data collection, due to value underestimation, as one of the main causes for this behavior, and further highlight the crucial role the critic can play in alleviating this pathology. We explore techniques to mitigate the observed value underestimation, which enables further research in asymmetric actor-critic methods.
Authors: Xiaojun Shan, Qi Cao, Xing Han, Haofei Yu, Paul Pu Liang
Abstract: Recent advances in multimodal foundation models have achieved state-of-the-art performance across a range of tasks. These breakthroughs are largely driven by new pre-training paradigms that leverage large-scale, unlabeled multimodal data, followed by instruction fine-tuning on curated labeled datasets and high-quality prompts. While there is growing interest in scaling instruction fine-tuning to ever-larger datasets in both quantity and scale, our findings reveal that simply increasing the number of instruction-tuning tasks does not consistently yield better performance. Instead, we observe that grouping tasks by the common interactions across modalities, such as discovering redundant shared information, prioritizing modality selection with unique information, or requiring synergistic fusion to discover new information from both modalities, encourages the models to learn transferable skills within a group while suppressing interference from mismatched tasks. To this end, we introduce MINT, a simple yet surprisingly effective task-grouping strategy based on the type of multimodal interaction. We demonstrate that the proposed method greatly outperforms existing task grouping baselines for multimodal instruction tuning, striking an effective balance between generalization and specialization.
Authors: Ze Yu Zhang, Bolin Ding, Bryan Kian Hsiang Low
Abstract: Mixture-of-Experts (MoE) has been gaining popularity due to its successful adaptation to large language models (LLMs). In this work, we introduce Privacy-preserving Collaborative Mixture-of-Experts (PC-MoE), which leverages the sparsity of the MoE architecture for memory-efficient decentralized collaborative LLM training, enabling multiple parties with limited GPU-memory and data resources to collectively train more capable LLMs than they could achieve individually. At the same time, this approach protects training data privacy of each participant by keeping training data, as well as parts of the forward pass signal and gradients, locally within each party. By design, PC-MoE synergistically combines the strengths of distributed computation with strong confidentiality assurances. Unlike most privacy-preserving schemes, which pay for confidentiality with lower task accuracy, our framework breaks that trade-off: across seven popular LLM benchmarks, it almost matches (and sometimes exceeds) the performance and convergence rate of a fully centralized model, enjoys nearly 70% peak GPU RAM reduction, while being fully robust against reconstruction attacks.
Authors: Zhexiao Lin, Fang Han
Abstract: Establishing the limiting distribution of Chatterjee's rank correlation for a general, possibly non-independent, pair of random variables has been eagerly awaited by many. This paper shows that (a) Chatterjee's rank correlation is asymptotically normal as long as one variable is not a measurable function of the other, (b) the corresponding asymptotic variance is uniformly bounded by 36, and (c) a consistent variance estimator exists. Similar results also hold for Azadkia-Chatterjee's graph-based correlation coefficient, a multivariate analogue of Chatterjee's original proposal. The proof is given by appealing to H\'ajek representation and Chatterjee's nearest-neighbor CLT.
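For reference, Chatterjee's coefficient has a short closed form in the no-ties case: sort the pairs by x, rank the y values, and set xi_n = 1 - 3 * sum_i |r_{i+1} - r_i| / (n^2 - 1). A NumPy version:

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi_n (continuous, no-ties case)."""
    order = np.argsort(x)
    r = np.argsort(np.argsort(y[order])) + 1  # ranks of y, sorted by x
    n = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(chatterjee_xi(x, x**2))                 # near 1: y is a function of x
print(chatterjee_xi(x, rng.normal(size=1000)))  # near 0: independence
```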
Authors: Nathan Kallus, Miruna Oprescu
Abstract: The conditional average treatment effect (CATE) is the best measure of individual causal effects given baseline covariates. However, the CATE only captures the (conditional) average, and can overlook risks and tail events, which are important to treatment choice. In aggregate analyses, this is usually addressed by measuring the distributional treatment effect (DTE), such as differences in quantiles or tail expectations between treatment groups. Hypothetically, one can similarly fit conditional quantile regressions in each treatment group and take their difference, but this would not be robust to misspecification or provide agnostic best-in-class predictions. We provide a new robust and model-agnostic methodology for learning the conditional DTE (CDTE) for a class of problems that includes conditional quantile treatment effects, conditional super-quantile treatment effects, and conditional treatment effects on coherent risk measures given by $f$-divergences. Our method is based on constructing a special pseudo-outcome and regressing it on covariates using any regression learner. Our method is model-agnostic in that it can provide the best projection of CDTE onto the regression model class. Our method is robust in that even if we learn these nuisances nonparametrically at very slow rates, we can still learn CDTEs at rates that depend on the class complexity and even conduct inferences on linear projections of CDTEs. We investigate the behavior of our proposal in simulations, as well as in a case study of 401(k) eligibility effects on wealth.
Authors: Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan
Abstract: For high-dimensional Gaussian data, we investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to an improvement in the approximation capacity from initialization. First, we compare the influence of batch size to that of multiple steps. For a single step, a batch of size $n = \mathcal{O}(d)$ is both necessary and sufficient to align with the target function, although only a single direction can be learned. In contrast, $n = \mathcal{O}(d^2)$ is essential for neurons to specialize in multiple relevant directions of the target with a single gradient step. Even in this case, we show there might exist ``hard'' directions requiring $n = \mathcal{O}(d^\ell)$ samples to be learned, where $\ell$ is known as the leap index of the target. Second, we show that the picture drastically improves over multiple gradient steps: a batch size of $n = \mathcal{O}(d)$ is indeed sufficient to learn multiple target directions satisfying a staircase property, where more and more directions can be learned over time. Finally, we discuss how these directions allow for a drastic improvement in the approximation capacity and generalization error over the initialization, illustrating a separation of scale between the random features/lazy regime and the feature learning regime. Our technical analysis leverages a combination of techniques related to concentration, projection-based conditioning, and Gaussian equivalence, which we believe are of independent interest. By pinning down the conditions necessary for specialization and learning, our results highlight the intertwined role of the structure of the task to learn, the details of the algorithm, and the architecture, shedding new light on how neural networks adapt to features and learn complex tasks from data over time.
Authors: Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, Dongmei Zhang
Abstract: Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data, without math-specific optimization. In this paper, we present WizardMath, which enhances the mathematical CoT reasoning abilities of LLMs without using external Python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro and GPT-4-early-version. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance. For more details refer to https://github.com/nlpxucan/WizardLM
Authors: Rong Jiang, Cong Ma
Abstract: We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy updates are made at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose a novel batch learning algorithm that achieves the optimal regret (up to logarithmic factors). In essence, our procedure dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch size. Our theoretical results suggest that for nonparametric contextual bandits, a nearly constant number of policy updates can attain optimal regret in the fully online setting.
Authors: Alban Farchi, Marcin Chrust, Marc Bocquet, Massimo Bonavita
Abstract: In recent years, there has been significant progress in the development of fully data-driven global numerical weather prediction models. These machine learning weather prediction models have their strengths, notably accuracy and low computational requirements, but also their weaknesses: they struggle to represent fundamental dynamical balances, and they are far from being suitable for data assimilation experiments. Hybrid modelling emerges as a promising approach to address these limitations. Hybrid models integrate a physics-based core component with a statistical component, typically a neural network, to enhance prediction capabilities. In this article, we propose to develop a model error correction for the operational Integrated Forecasting System (IFS) of the European Centre for Medium-Range Weather Forecasts using a neural network. The neural network is initially pre-trained offline using a large dataset of operational analyses and analysis increments. Subsequently, the trained network is integrated into the IFS within the Object-Oriented Prediction System (OOPS) so as to be used in data assimilation and forecast experiments. It is then further trained online using a recently developed variant of weak-constraint 4D-Var. The results show that the pre-trained neural network already provides a reliable model error correction, which translates into reduced forecast errors in many conditions, and that the online training further improves the accuracy of the hybrid model.
Authors: Peng Liu, Cong Xu, Ming Zhao, Jiawei Zhu, Bin Wang, Yi Ren
Abstract: As the last critical stage of recommender systems (RSs), Multi-Task Fusion (MTF) is responsible for combining the multiple scores output by Multi-Task Learning (MTL) into a final score to maximize user satisfaction, which determines the ultimate recommendation results. Recently, to optimize long-term user satisfaction within a recommendation session, Reinforcement Learning (RL) has been used for MTF in the industry. However, the offline RL algorithms used for MTF so far have the following severe problems: 1) to avoid the out-of-distribution (OOD) problem, their constraints are overly strict, which seriously damages their performance; 2) they are unaware of the exploration policy used for producing training data and never interact with the real environment, so only a suboptimal policy can be learned; 3) the traditional exploration policies are inefficient and hurt the user experience. To solve the above problems, we propose a novel method named IntegratedRL-MTF customized for MTF in large-scale RSs. IntegratedRL-MTF integrates the offline RL model with our online exploration policy to relax overstrict and complicated constraints, which significantly improves its performance. We also design an extremely efficient exploration policy, which eliminates low-value exploration space and focuses on exploring potential high-value state-action pairs. Moreover, we adopt a progressive training mode to further enhance our model's performance with the help of our exploration policy. We conduct extensive offline and online experiments in the short video channel of Tencent News. The results demonstrate that our model outperforms other models remarkably. IntegratedRL-MTF has been fully deployed in our RS and other large-scale RSs in Tencent, which have achieved significant improvements.
Authors: Christian Peukert, Florian Abeillon, J\'er\'emie Haese, Franziska Kaiser, Alexander Staub
Abstract: Artificial intelligence (AI) systems rely heavily on human-generated data, yet the people behind that data are often overlooked. Human behavior can play a major role in AI training datasets, be it in limiting access to existing works or in deciding which types of new works to create or whether to create any at all. We examine creators' behavioral change when their works become training data for commercial AI. Specifically, we focus on contributors on Unsplash, a popular stock image platform with about 6 million high-quality photos and illustrations. In the summer of 2020, Unsplash launched a research program and released a dataset of 25,000 images for commercial AI use. We study contributors' reactions, comparing contributors whose works were included in this dataset to contributors whose works were not. Our results suggest that treated contributors left the platform at a higher-than-usual rate and substantially slowed down the rate of new uploads. Professional photographers and more heavily affected users had a stronger reaction than amateurs and less affected users. We also show that affected users changed the variety and novelty of contributions to the platform, which can potentially lead to lower-quality AI outputs in the long run. Our findings highlight a critical trade-off: the drive to expand AI capabilities versus the incentives of those producing training data. We conclude with policy proposals, including dynamic compensation schemes and structured data markets, to realign incentives at the data frontier.
Authors: Shengpeng Ji, Qian Chen, Wen Wang, Jialong Zuo, Minghui Fang, Ziyue Jiang, Hai Huang, Zehan Wang, Xize Cheng, Siqi Zheng, Zhou Zhao
Abstract: In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style. Prior zero-shot TTS models only mimic the speaker's voice without further control and adjustment capabilities, while prior controllable TTS models cannot perform speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging task: a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture codec representations corresponding to timbre, content, and style in a discrete decoupling codec space. Moreover, we analyze the many-to-many issue in textual style control and propose the Style Mixture Semantic Density (SMSD) module, which is based on Gaussian mixture density networks, to resolve this problem. To facilitate empirical validations, we make available a new style controllable dataset called VccmDataset. Our experimental results demonstrate that ControlSpeech exhibits comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability. The relevant code and demo are available at https://github.com/jishengpeng/ControlSpeech .
Authors: Daoyuan Chen, Haibin Wang, Yilun Huang, Ce Ge, Yaliang Li, Bolin Ding, Jingren Zhou
Abstract: The emergence of multimodal large models has advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to historically isolated paths of model-centric and data-centric developments, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective iteration and guided refinement of both data and models. Our proposed ``Probe-Analyze-Refine'' workflow, validated through practical use cases on multimodal tasks such as image-text pre-training with CLIP, image-to-text generation with LLaVA-like models, and text-to-video generation with DiT-based models, yields transferable and notable performance boosts, such as topping the VBench leaderboard. A comprehensive set of over 100 experiments demonstrated the suite's usability and extensibility, while also uncovering insights into the interplay between data quality, diversity, model behavior, and computational costs. All codes, datasets, and models are open-sourced to foster future research and applications that would otherwise be infeasible due to the lack of a dedicated co-development infrastructure.
Authors: Mohammad Aamir Sohail, Mohsen Heidari, S. Sandeep Pradhan
Abstract: Variational quantum algorithms, optimized using gradient-based methods, often exhibit sub-optimal convergence performance due to their dependence on Euclidean geometry. Quantum natural gradient descent (QNGD) is a more efficient method that incorporates the geometry of the state space via a quantum information metric. However, QNGD is computationally intensive and suffers from high sample complexity. In this work, we formulate a novel quantum information metric and construct an unbiased estimator for this metric using single-shot measurements. We develop a quantum optimization algorithm that leverages the geometry of the state space via this estimator while avoiding full-state tomography, as in conventional techniques. We provide the convergence analysis of the algorithm under mild conditions. Furthermore, we provide experimental results that demonstrate the better sample complexity and faster convergence of our algorithm compared to the state-of-the-art approaches. Our results illustrate the algorithm's ability to avoid saddle points and local minima.
Authors: Daan Caljon, Jente Van Belle, Jeroen Berrevoets, Wouter Verbeke
Abstract: In Influence Maximization (IM), the objective is to -- given a budget -- select the optimal set of entities in a network to target with a treatment so as to maximize the total effect. For instance, in marketing, the objective is to target the set of customers that maximizes the total response rate, resulting from both direct treatment effects on targeted customers and indirect, spillover, effects that follow from targeting these customers. Recently, new methods to estimate treatment effects in the presence of network interference have been proposed. However, the issue of how to leverage these models to make better treatment allocation decisions has been largely overlooked. Traditionally, in Uplift Modeling (UM), entities are ranked according to estimated treatment effect, and the top entities are allocated treatment. Since, in a network context, entities influence each other, the UM ranking approach will be suboptimal. The problem of finding the optimal treatment allocation in a network setting is NP-hard, and generally has to be solved heuristically. To fill the gap between IM and UM, we propose OTAPI: Optimizing Treatment Allocation in the Presence of Interference to find solutions to the IM problem using treatment effect estimates. OTAPI consists of two steps. First, a causal estimator is trained to predict treatment effects in a network setting. Second, this estimator is leveraged to identify an optimal treatment allocation by integrating it into classic IM algorithms. We demonstrate that this novel method outperforms classic IM and UM approaches on both synthetic and semi-synthetic datasets.
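Step two reduces to plugging the estimator into a greedy IM loop; the estimator interface and the toy effect function below are illustrative assumptions, not the paper's implementation.

```python
def greedy_allocation(nodes, budget, total_effect):
    """Greedy treatment allocation: total_effect(S) is assumed to be a
    trained causal estimator returning the predicted total (direct plus
    spillover) effect of treating the set S."""
    chosen = set()
    for _ in range(budget):
        best = max((v for v in nodes if v not in chosen),
                   key=lambda v: total_effect(chosen | {v}))
        chosen.add(best)
    return chosen

# toy usage on a path graph: 1 per treated node + 0.5 per untreated neighbor
effect = lambda S: len(S) + 0.5 * len({u for v in S
                                       for u in (v - 1, v + 1)} - S)
print(greedy_allocation(range(10), budget=3, total_effect=effect))
```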
Authors: Jin Hwa Lee, Thomas Jiralerspong, Lei Yu, Yoshua Bengio, Emily Cheng
Abstract: By virtue of linguistic compositionality, few syntactic rules and a finite lexicon can generate an unbounded number of sentences. That is, language, though seemingly high-dimensional, can be explained using relatively few degrees of freedom. An open question is whether contemporary language models (LMs) reflect the intrinsic simplicity of language that is enabled by compositionality. We take a geometric view of this problem by relating the degree of compositionality in a dataset to the intrinsic dimension (ID) of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations' ID, but that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Finally, our analyses reveal a striking contrast between nonlinear and linear dimensionality, showing they respectively encode semantic and superficial aspects of linguistic composition.
Authors: Yu Fei, Yasaman Razeghi, Sameer Singh
Abstract: Large language models (LLMs) require alignment to effectively and safely follow user instructions. This process necessitates training an aligned version for every base model, resulting in significant computational overhead. In this work, we propose NUDGING, a simple, training-free algorithm that aligns any base model at inference time using a small aligned model. NUDGING is motivated by recent findings that alignment primarily alters the model's behavior on a small subset of stylistic tokens (e.g., discourse markers). We find that base models are significantly more uncertain when generating these tokens. Building on this insight, NUDGING employs a small aligned model to generate nudging tokens to guide the base model's output during decoding when the base model's uncertainty is high, with only a minor additional inference overhead. We evaluate NUDGING across 3 model families on a diverse range of open-instruction tasks. Without any training, nudging a large base model with a 7x-14x smaller aligned model achieves zero-shot performance comparable to, and sometimes surpassing, that of large aligned models. By operating at the token level, NUDGING enables off-the-shelf collaboration between model families. For instance, nudging Gemma-2-27b with Llama-2-7b-chat outperforms Llama-2-70b-chat on various tasks. Overall, our work offers a modular and cost-efficient solution to LLM alignment. Our code and demo are available at: https://fywalter.github.io/nudging/ .
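In pseudocode-like Python, the decoding loop is simple; the next_token interface and the uncertainty threshold are assumptions of this sketch, not the paper's exact formulation.

```python
def nudged_decode(base, aligned, prompt, max_new=128, tau=0.4):
    """Token-level nudging sketch: decode with the base model, but when
    its top-token probability drops below tau (high uncertainty), take
    the next token from the small aligned model instead. `base` and
    `aligned` are assumed to expose next_token(text) -> (token, prob)."""
    text = prompt
    for _ in range(max_new):
        tok, p = base.next_token(text)
        if p < tau:                          # uncertain: likely a stylistic
            tok, _ = aligned.next_token(text)  # token, so nudge it
        text += tok
    return text
```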
Authors: Lin Wang, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Zhitao Zhang
Abstract: Multimodal conversation, a crucial form of human communication, carries rich emotional content, making the exploration of the causes of emotions within it a research endeavor of significant importance. However, existing research on the causes of emotions typically employs an utterance selection method within a single textual modality to locate causal utterances. This approach remains limited to coarse-grained assessments, lacks nuanced explanations of emotional causation, and demonstrates inadequate capability in identifying multimodal emotional triggers. Therefore, we introduce a new task: \textbf{Multimodal Emotion Cause Explanation in Conversation (MECEC)}. This task aims to generate a summary based on the multimodal context of conversations, clearly and intuitively describing the reasons that trigger a given emotion. To adapt to this task, we develop a new dataset (ECEM) based on the MELD dataset. ECEM combines video clips with detailed explanations of character emotions, helping to explore the causal factors behind emotional expression in multimodal conversations. We further propose a novel approach, FAME-Net, which harnesses the power of Large Language Models (LLMs) to analyze visual data and accurately interpret the emotions conveyed through facial expressions in videos. By exploiting the contagion effect of facial emotions, FAME-Net effectively captures the emotional causes of individuals engaged in conversations. Our experimental results on the newly constructed dataset show that FAME-Net outperforms several strong baselines. Code and dataset are available at https://github.com/3222345200/FAME-Net.
Authors: Laure Ciernik, Lorenz Linhardt, Marco Morik, Jonas Dippel, Simon Kornblith, Lukas Muttenthaler
Abstract: The Platonic Representation Hypothesis claims that recent foundation models are converging to a shared representation space as a function of their downstream task performance, irrespective of the objectives and data modalities used to train these models (Huh et al., 2024). Representational similarity is generally measured for individual datasets and is not necessarily consistent across datasets. Thus, one may wonder whether this convergence of model representations is confounded by the datasets commonly used in machine learning. Here, we propose a systematic way to measure how representational similarity between models varies with the set of stimuli used to construct the representations. We find that the objective function is a crucial factor in determining the consistency of representational similarities across datasets. Specifically, self-supervised vision models learn representations whose relative pairwise similarities generalize better from one dataset to another compared to those of image classification or image-text models. Moreover, the correspondence between representational similarities and the models' task behavior is dataset-dependent, being most strongly pronounced for single-domain datasets. Our work provides a framework for analyzing similarities of model representations across datasets and linking those similarities to differences in task behavior.
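One standard instrument for such measurements is linear CKA; computing it for the same model pair on two different stimulus sets probes exactly the dataset dependence studied here. This is a generic sketch, not the paper's specific similarity measure.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (samples x dims):
    ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), after centering."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") *
                   np.linalg.norm(Y.T @ Y, "fro"))

# dataset dependence: compare a model pair's similarity on two stimulus sets
# sim_d1 = linear_cka(feats_model_a_d1, feats_model_b_d1)
# sim_d2 = linear_cka(feats_model_a_d2, feats_model_b_d2)
```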
Authors: Jiaan Han, Junxiao Chen, Yanzhe Fu
Abstract: We introduce CatNet, an algorithm that effectively controls the False Discovery Rate (FDR) and selects significant features in LSTM networks. CatNet employs the derivative of SHAP values to quantify feature importance and constructs a vector-formed mirror statistic for FDR control with the Gaussian Mirror algorithm. To avoid instability due to nonlinear or temporal correlations among features, we also propose a new kernel-based independence measure. CatNet performs robustly across different model settings on both simulated and real-world data, reducing overfitting and improving the interpretability of the model. Our framework, which introduces SHAP-based feature importance into FDR control algorithms and improves the Gaussian Mirror, can be naturally extended to other time-series or sequential deep learning models.
Authors: Kangyu Zhu, Peng Xia, Yun Li, Hongtu Zhu, Sheng Wang, Huaxiu Yao
Abstract: The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have inadequately accounted for the clinical relevance of preference data, making these samples easily distinguishable and reducing alignment effectiveness. To address this challenge, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, and integrate these scores into the preference optimization process as weights, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving average improvements of 14.2% and 51.7% over existing preference optimization methods on the Med-VQA and report generation tasks, respectively. Our code is available at https://github.com/aiming-lab/MMedPO.
Authors: Yiwen Gu, Mahir Patel, Margrit Betke
Abstract: In this paper, we present a contrastive learning based framework, ExeChecker, for the interpretation of rehabilitation exercises. Our work builds upon state-of-the-art advances in the area of human pose estimation, graph-attention neural networks, and transformer interpretability. The downstream task is to assist rehabilitation by providing informative feedback to users while they are performing prescribed exercises. We utilize a contrastive learning strategy during training. Given a tuple of correctly and incorrectly executed exercises, our model is able to identify and highlight those joints that are involved in an incorrect movement and thus require the user's attention. We collected an in-house dataset, ExeCheck, with paired recordings of both correct and incorrect execution of exercises. In our experiments, we tested our method on this dataset as well as the UI-PRMD dataset and found that ExeChecker outperforms the baseline method based on pairwise sequence alignment in identifying joints of physical relevance in rehabilitation exercises.
Authors: Taywon Min, Haeone Lee, Yongchan Kwon, Kimin Lee
Abstract: In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. In our experiments, we demonstrate two key applications of influence functions: (1) detecting common forms of labeler bias in human feedback datasets and (2) guiding labelers to refine their strategies to align more closely with expert feedback. By quantifying the impact of human feedback on reward models, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback. Source code is available at https://github.com/mintaywon/IF_RLHF
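As a rough illustration of the underlying quantity, here is a first-order, Hessian-free sketch -- a TracIn-style gradient dot product rather than the paper's actual compute-efficient approximation. The `loss_fn` callable and the batches are placeholders.

```python
# Hedged sketch: approximate a training example's influence on validation loss
# as the dot product of training and validation gradients (first-order only).
import torch

def grad_vector(model, loss):
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

def influence(model, loss_fn, train_batch, val_batch):
    g_train = grad_vector(model, loss_fn(model, train_batch))
    g_val = grad_vector(model, loss_fn(model, val_batch))
    # Positive score: the example pulls parameters in a direction that reduces
    # validation loss; large negative scores flag potentially noisy/biased labels.
    return torch.dot(g_train, g_val).item()
```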
Authors: Akash Kundu
Abstract: The Sachdev-Ye-Kitaev (SYK) model, known for its strong quantum correlations and chaotic behavior, serves as a key platform for quantum gravity studies. However, variationally preparing thermal states on near-term quantum processors for large systems ($N>12$, where $N$ is the number of Majorana fermions) presents a significant challenge due to the rapid growth in the complexity of parameterized quantum circuits. This paper addresses this challenge by integrating reinforcement learning (RL) with convolutional neural networks, employing an iterative approach to optimize the quantum circuit and its parameters. The refinement process is guided by a composite reward signal derived from entropy and the expectation values of the SYK Hamiltonian. This approach reduces the number of CNOT gates by two orders of magnitude for systems with $N\geq12$ compared to traditional methods like first-order Trotterization. We demonstrate the effectiveness of the RL framework in both noiseless and noisy quantum hardware environments, maintaining high accuracy in thermal state preparation. This work advances a scalable, RL-based framework with applications to quantum gravity studies and the computation of out-of-time-ordered thermal correlators in quantum many-body systems on near-term quantum hardware. The code is available at https://github.com/Aqasch/solving_SYK_model_with_RL.
Authors: Oliver Biggar, Christos Papadimitriou
Abstract: Characterizing the limit behavior -- that is, the attractors -- of learning dynamics is one of the most fundamental open questions in game theory. In recent work on this front, it was conjectured that the attractors of the replicator dynamic are in one-to-one correspondence with the sink equilibria of the game -- the sink strongly connected components of a game's preference graph -- and it was established that they do stand in at least one-to-many correspondence with them. We make threefold progress on the problem of characterizing attractors. First, we show through a topological construction that the one-to-one conjecture is false. The counterexamples derive from objects called local sources -- fixed points which lie within the sink equilibrium yet are locally repelling. Second, we make progress on the attractor characterization problem for two-player games by establishing that the one-to-one conjecture is true when a local property called pseudoconvexity holds. Pseudoconvexity prevents the existence of local sources, and generalizes the existing cases -- such as zero-sum games and potential games -- where the conjecture was known to be true.
Authors: Haonan Shi, Tu Ouyang, An Wang
Abstract: Federated Distillation (FD) has emerged as a popular federated training framework, enabling clients to collaboratively train models without sharing private data. Public Dataset-Assisted Federated Distillation (PDA-FD), which leverages public datasets for knowledge sharing, has become widely adopted. Although PDA-FD enhances privacy compared to traditional Federated Learning, we demonstrate that the use of public datasets still poses significant privacy risks to clients' private training data. This paper presents the first comprehensive privacy analysis of PDA-FD in the presence of an honest-but-curious server. We show that the server can exploit clients' inference results on public datasets to extract two critical types of private information: label distributions and membership information of the private training dataset. To quantify these vulnerabilities, we introduce two novel attacks specifically designed for the PDA-FD setting: a label distribution inference attack and innovative membership inference methods based on the Likelihood Ratio Attack (LiRA). Through extensive evaluation of three representative PDA-FD frameworks (FedMD, DS-FL, and Cronus), our attacks achieve state-of-the-art performance, with label distribution attacks reaching minimal KL-divergence and membership inference attacks maintaining high True Positive Rates under low False Positive Rate constraints. Our findings reveal significant privacy risks in current PDA-FD frameworks and emphasize the need for more robust privacy protection mechanisms in collaborative learning systems.
Authors: Yuhui Zhang, Yuchang Su, Chenyu Wang, Tianhong Li, Zoe Wefers, Jeffrey Nirschl, James Burgess, Daisy Ding, Alejandro Lozano, Emma Lundberg, Serena Yeung-Levy
Abstract: Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlux, an image-generative model that simulates cellular morphology changes induced by chemical and genetic perturbations using flow matching. Unlike prior methods, CellFlux models distribution-wise transformations from unperturbed to perturbed cell states, effectively distinguishing actual perturbation effects from experimental artifacts such as batch effects -- a major challenge in biological data. Evaluated on chemical (BBBC021), genetic (RxRx1), and combined perturbation (JUMP) datasets, CellFlux generates biologically meaningful cell images that faithfully capture perturbation-specific morphological changes, achieving a 35% improvement in FID scores and a 12% increase in mode-of-action prediction accuracy over existing methods. Additionally, CellFlux enables continuous interpolation between cellular states, providing a potential tool for studying perturbation dynamics. These capabilities mark a significant step toward realizing virtual cell modeling for biomedical research. Project page: https://yuhui-zh15.github.io/CellFlux/.
Authors: Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, Matthieu Wyart
Abstract: Natural data is often organized as a hierarchical composition of features. How many samples do generative models need in order to learn the composition rules, so as to produce a combinatorially large number of novel samples? What signal in the data is exploited to learn those rules? We investigate these questions in the context of diffusion models both theoretically and empirically. Theoretically, we consider a simple probabilistic context-free grammar - a tree-like graphical model used to represent the hierarchical and compositional structure of data such as language and images. We demonstrate that diffusion models learn the grammar's composition rules with the sample complexity required for clustering features with statistically similar context, a process similar to the word2vec algorithm. However, this clustering emerges hierarchically: higher-level features associated with longer contexts require more data to be identified. This mechanism leads to a sample complexity that scales polynomially with the context size. As a result, diffusion models trained on an intermediate dataset size generate data coherent up to a certain scale, but lacking global coherence. We test these predictions across different domains and find remarkable agreement: both generated texts and images achieve progressively larger coherence lengths as the training time or dataset size grows. We discuss connections between the hierarchical clustering mechanism we introduce here and the renormalization group in physics.
Authors: Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, O\u{g}uzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, Afshin Dehghan
Abstract: Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image's inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256x256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID<2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine "visual vocabulary", and that the number of tokens to generate depends on the complexity of the generation task.
Authors: Leena Mathur, Marian Qian, Paul Pu Liang, Louis-Philippe Morency
Abstract: Social reasoning abilities are crucial for AI systems to effectively interpret and respond to multimodal human communication and interaction within social contexts. We introduce SOCIAL GENOME, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models. SOCIAL GENOME contains 272 videos of interactions and 1,486 human-annotated reasoning traces related to inferences about these interactions. These traces contain 5,777 reasoning steps that reference evidence from visual cues, verbal cues, vocal cues, and external knowledge (contextual knowledge external to videos). SOCIAL GENOME is also the first modeling challenge to study external knowledge in social reasoning. SOCIAL GENOME computes metrics to holistically evaluate semantic and structural qualities of model-generated social reasoning traces. We demonstrate the utility of SOCIAL GENOME through experiments with state-of-the-art models, identifying performance gaps and opportunities for future research to improve the grounded social reasoning abilities of multimodal models.
Authors: Hong Cai Chen, Yi Pin Xu, Yang Zhang
Abstract: In electronic design, engineers often manually search through extensive documents to retrieve component parameters required for constructing SPICE models, a process that is both labor-intensive and time-consuming. To address this challenge, we present an automated framework called D2S-FLOW that leverages large language models (LLMs) to extract electrical parameters from datasheets and generate SPICE models with high precision and efficiency, significantly reducing the need for manual intervention. Unlike traditional RAG systems, D2S-FLOW employs a workflow to enhance precision in handling unstructured documents and inconsistent naming conventions through three innovative mechanisms: Attention-Guided Document Focusing (AGDF), Hierarchical Document-Enhanced Retrieval (HDER), and Heterogeneous Named Entity Normalization (HNEN). AGDF narrows retrieval to user-selected documents, HDER utilizes document structure for precise parameter localization, and HNEN standardizes terminology via semantic inference. Experimental results demonstrate that the framework achieves an Exact Match (EM) of 0.86, an F1 score of 0.92, and an Exact Correctness (EC) of 0.96, outperforming the strongest baseline by 19.4%, 5.7%, and 13.1%, respectively. Additionally, it reduces API token consumption by 38% and minimizes the irrelevant information ratio to 4%, showcasing substantial improvements in resource efficiency. This research provides an effective automated solution for circuit design.
Authors: Zonghao Chen, Masha Naslidnyk, Fran\c{c}ois-Xavier Briol
Abstract: This paper considers the challenging computational task of estimating nested expectations. Existing algorithms, such as nested Monte Carlo or multilevel Monte Carlo, are known to be consistent but require a large number of samples at both inner and outer levels to converge. Instead, we propose a novel estimator consisting of nested kernel quadrature estimators and we prove that it has a faster convergence rate than all baseline methods when the integrands have sufficient smoothness. We then demonstrate empirically that our proposed method does indeed require fewer samples to estimate nested expectations on real-world applications including Bayesian optimisation, option pricing, and health economics.
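For orientation, here is a sketch of the nested Monte Carlo baseline the paper improves upon, on a toy problem of the form E_X[f(E_Y[g(X, Y) | X])]; the integrands f and g are arbitrary illustrations, not from the paper.

```python
# Sketch of the baseline nested Monte Carlo estimator for a nested expectation.
import numpy as np

rng = np.random.default_rng(0)
f = np.abs                      # outer nonlinearity (toy choice)
g = lambda x, y: x + y          # inner integrand (toy choice)

def nested_mc(n_outer=1000, n_inner=100):
    outer = rng.standard_normal(n_outer)            # samples of X
    inner_means = np.array([
        g(x, rng.standard_normal(n_inner)).mean()   # estimate E_Y[g(x, Y)]
        for x in outer
    ])
    return f(inner_means).mean()                    # average f over outer samples

print(nested_mc())  # both sample levels must grow for this to converge
```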
Authors: Zichuan Fu, Wentao Song, Yejing Wang, Xian Wu, Yefeng Zheng, Yingying Zhang, Derong Xu, Xuetao Wei, Tong Xu, Xiangyu Zhao
Abstract: Recent advances in transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their quadratic computational complexity with respect to sequence length remains a significant bottleneck for processing long documents. As a result, many efforts like sparse attention and state space models have been proposed to improve the efficiency of LLMs over long sequences. Though effective, these approaches either compromise performance or introduce structural complexity. This calls for a simple yet efficient model that preserves the fundamental Transformer architecture. To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation. Then, we replace softmax with the sigmoid function and utilize balanced ALiBi and Rotary Position Embedding for efficient information compression and retention. Experiments demonstrate that SWAT achieves state-of-the-art performance compared with linear recurrent architectures on eight benchmarks. Code is available at https://github.com/Fzkuji/swat-attention.
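A minimal sketch of the attention variant the abstract describes, assuming single-head attention and omitting the balanced ALiBi and rotary embeddings; the window size is illustrative.

```python
# Hedged sketch: causal sliding-window attention with sigmoid in place of softmax.
import torch

def sliding_window_sigmoid_attention(q, k, v, window=256):
    # q, k, v: [batch, seq, dim]
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5            # [batch, seq, seq]
    i = torch.arange(q.size(1))
    causal = i[:, None] >= i[None, :]                    # no attending to the future
    in_window = (i[:, None] - i[None, :]) < window       # sliding window constraint
    scores = scores.masked_fill(~(causal & in_window), float("-inf"))
    weights = torch.sigmoid(scores)                      # sigmoid(-inf) = 0, so masked
    return weights @ v                                   # positions contribute nothing
```

Unlike softmax, the sigmoid weights are not normalized across positions, which is precisely what removes the competition that produces attention sinks.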
Authors: Ziyan Wang, Zhicheng Zhang, Fei Fang, Yali Du
Abstract: Designing effective reward functions in multi-agent reinforcement learning (MARL) is a significant challenge, often leading to suboptimal or misaligned behaviors in complex, coordinated environments. We introduce Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality ($\text{M}^3\text{HF}$), a novel framework that integrates multi-phase human feedback of mixed quality into the MARL training process. By involving humans with diverse expertise levels to provide iterative guidance, $\text{M}^3\text{HF}$ leverages both expert and non-expert feedback to continuously refine agents' policies. During training, we strategically pause agent learning for human evaluation, parse the feedback with large language models to assign it appropriately, and update reward functions through predefined templates and adaptive weights that use weight decay and performance-based adjustments. Our approach enables the integration of nuanced human insights across various levels of quality, enhancing the interpretability and robustness of multi-agent cooperation. Empirical results in challenging environments demonstrate that $\text{M}^3\text{HF}$ significantly outperforms state-of-the-art methods, effectively addressing the complexities of reward design in MARL and enabling broader human participation in the training process.
Authors: Nikita Kazeev, Wei Nong, Ignat Romanov, Ruiming Zhu, Andrey Ustyuzhanin, Shuya Yamazaki, Kedar Hippalgaonkar
Abstract: Crystal symmetry plays a fundamental role in determining a crystal's physical, chemical, and electronic properties, such as electrical and thermal conductivity, optical and polarization behavior, and mechanical strength. Almost all known crystalline materials have internal symmetry. However, this is often inadequately addressed by existing generative models, making the consistent generation of stable and symmetrically valid crystal structures a significant challenge. We introduce WyFormer, a generative model that directly tackles this by formally conditioning on space group symmetry. It achieves this by using Wyckoff positions as the basis for an elegant, compressed, and discrete structure representation. To model the distribution, we develop a permutation-invariant autoregressive model based on a Transformer encoder without positional encoding. Extensive experimentation demonstrates WyFormer's compelling combination of attributes: it achieves best-in-class symmetry-conditioned generation, incorporates a physics-motivated inductive bias, produces structures with competitive stability, predicts material properties with competitive accuracy even without atomic coordinates, and exhibits unparalleled inference speed.
Authors: Ziqiao Wang, Cheng Long, Yongyi Mao
Abstract: Federated learning (FL) is a widely adopted privacy-preserving distributed learning framework, yet its generalization performance remains less explored compared to centralized learning. In FL, the generalization error consists of two components: the out-of-sample gap, which measures the gap between the empirical and true risk for participating clients, and the participation gap, which quantifies the risk difference between participating and non-participating clients. In this work, we apply an information-theoretic analysis via the conditional mutual information (CMI) framework to study FL's two-level generalization. Beyond the traditional supersample-based CMI framework, we introduce a superclient construction to accommodate the two-level generalization setting in FL. We derive multiple CMI-based bounds, including hypothesis-based CMI bounds, illustrating how privacy constraints in FL can imply generalization guarantees. Furthermore, we propose fast-rate evaluated CMI bounds that recover the best-known convergence rate for two-level FL generalization in the small empirical risk regime. For specific FL model aggregation strategies and structured loss functions, we refine our bounds to achieve improved convergence rates with respect to the number of participating clients. Empirical evaluations confirm that our evaluated CMI bounds are non-vacuous and accurately capture the generalization behavior of FL algorithms.
Authors: Ming Zhang, Yuhui Wang, Yujiong Shen, Tingyi Yang, Changhao Jiang, Yilong Wu, Shihan Dou, Qinhao Chen, Zhiheng Xi, Zhihao Zhang, Yi Dong, Zhen Wang, Zhihui Fei, Mingyang Wan, Tao Liang, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang
Abstract: Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct the Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on the PlantUML specification, each UML flowchart is converted into atomic dialogue units, i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained on merely 800 samples and a 0.5B model trained on the full dataset can both surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o by up to 43.88%, with an average improvement of 11.00%. We further evaluate models' performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released at https://github.com/KongLongGeFDU/PFDial.
Authors: Yunli Wang, Zhen Zhang, Zhiqiang Wang, Zixuan Yang, Yu Li, Jian Yang, Shiyang Wen, Peng Jiang, Kun Gai
Abstract: Cascade Ranking is a prevalent architecture in large-scale top-k selection systems like recommendation and advertising platforms. Traditional training methods focus on single-stage optimization, neglecting interactions between stages. Recent advances have introduced interaction-aware training paradigms, but still struggle to 1) align training objectives with the goal of the entire cascade ranking (i.e., end-to-end recall of ground-truth items) and 2) learn effective collaboration patterns for different stages. To address these challenges, we propose LCRON, which introduces a novel surrogate loss function derived from a lower bound on the probability that ground-truth items are selected by cascade ranking, ensuring alignment with the overall objective of the system. According to the properties of the derived bound, we further design an auxiliary loss for each stage to drive the reduction of this bound, leading to a more robust and effective top-k selection. LCRON enables end-to-end training of the entire cascade ranking system as a unified network. Experimental results demonstrate that LCRON achieves significant improvement over existing methods on public benchmarks and industrial applications, addressing key limitations in cascade ranking training and significantly enhancing system performance.
Authors: Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
Abstract: This work presents a first evaluation of two state-of-the-art Large Reasoning Models (LRMs), OpenAI's o3-mini and DeepSeek R1, on analogical reasoning, focusing on well-established nonverbal human IQ tests based on Raven's progressive matrices. We benchmark with the I-RAVEN dataset and its extension, I-RAVEN-X, which tests the ability to generalize to longer reasoning rules and ranges of the attribute values. To assess the influence of visual uncertainties on these symbolic analogical reasoning tests, we extend the I-RAVEN-X dataset, which otherwise assumes an oracle perception. We adopt a two-fold strategy to simulate this imperfect visual perception: 1) we introduce confounding attributes which, being sampled at random, do not contribute to the prediction of the correct answer of the puzzles, and 2) we smoothen the distributions of the input attributes' values. We observe a sharp decline in OpenAI's o3-mini task accuracy, dropping from 86.6% on the original I-RAVEN to just 17.0% -- approaching random chance -- on the more challenging I-RAVEN-X, which increases input length and range and emulates perceptual uncertainty. This drop occurred despite spending 3.4x more reasoning tokens. A similar trend is also observed for DeepSeek R1: from 80.6% to 23.2%. On the other hand, a neuro-symbolic probabilistic abductive model, ARLC, that achieves state-of-the-art performance on I-RAVEN, can robustly reason under all these out-of-distribution tests, maintaining strong performance with only a modest accuracy reduction from 98.6% to 88.0%. Our code is available at https://github.com/IBM/raven-large-language-models.
Authors: Zhongchun Zheng, Kan Wu, Long Cheng, Lu Li, Rodrigo C. O. Rocha, Tianyi Liu, Wei Wei, Jianjiang Zeng, Xianwei Zhang, Yaoqing Gao
Abstract: Auto-vectorization is a fundamental optimization for modern compilers to exploit SIMD parallelism. However, state-of-the-art approaches still struggle to handle intricate code patterns, often requiring manual hints or domain-specific expertise. Large language models (LLMs), with their ability to capture intricate patterns, provide a promising solution, yet their effective application in compiler optimizations remains an open challenge due to issues such as hallucinations and a lack of domain-specific reasoning. In this paper, we present VecTrans, a novel framework that leverages LLMs to enhance compiler-based code vectorization. VecTrans first employs compiler analysis to identify potentially vectorizable code regions. It then utilizes an LLM to refactor these regions into patterns that are more amenable to the compiler's auto-vectorization. To ensure semantic correctness, VecTrans further integrates a hybrid validation mechanism at the intermediate representation (IR) level. With the above efforts, VecTrans combines the adaptability of LLMs with the precision of compiler vectorization, thereby effectively opening up vectorization opportunities. Experimental results show that among all TSVC functions unvectorizable by GCC, ICC, Clang, and BiSheng Compiler, VecTrans achieves a geomean speedup of 1.77x and successfully vectorizes 24 of 51 test cases. This marks a significant advancement over state-of-the-art approaches while maintaining a cost efficiency of $0.012 per function optimization for LLM API usage.
Authors: Aayush Gautam, Susav Shrestha, Narasimha Reddy
Abstract: Speculative decoding accelerates large language model (LLM) inference by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, selecting an optimal speculation length is critical for maximizing speedup while minimizing wasted computation. We introduce \textit{GammaTune} and \textit{GammaTune+}, training-free adaptive algorithms that dynamically adjust speculation length based on token acceptance rates using a heuristic-based switching mechanism. Evaluated on SpecBench across multiple tasks and model pairs, our method outperforms other heuristic-based approaches and fixed-length speculative decoding, achieving an average speedup of 15\% ($\pm$5\%) with \textit{GammaTune} and 16\% ($\pm$3\%) with \textit{GammaTune+}, while reducing performance variance. This makes \textit{GammaTune} a robust and efficient solution for real-world deployment.
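A hedged sketch of the kind of acceptance-rate-driven switching heuristic described; the thresholds and step sizes below are illustrative assumptions, not the published algorithm.

```python
# Sketch: adapt the speculation length gamma from the draft-token acceptance rate.
class AdaptiveSpeculationLength:
    def __init__(self, gamma=4, lo=0.5, hi=0.8, min_gamma=1, max_gamma=16):
        self.gamma, self.lo, self.hi = gamma, lo, hi
        self.min_gamma, self.max_gamma = min_gamma, max_gamma

    def update(self, n_accepted):
        rate = n_accepted / self.gamma
        if rate > self.hi:      # most drafts accepted: speculate further ahead
            self.gamma = min(self.gamma + 1, self.max_gamma)
        elif rate < self.lo:    # frequent rejections: drafting is wasting compute
            self.gamma = max(self.gamma - 1, self.min_gamma)
        return self.gamma       # speculation length for the next decoding round
```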
Authors: Chenxuan He, Yuan Gao, Liping Zhu, Jian Huang
Abstract: Testing conditional independence between two random vectors given a third is a fundamental and challenging problem in statistics, particularly in multivariate nonparametric settings due to the complexity of conditional structures. We propose an innovative framework for testing conditional independence using transport maps. At the population level, we show that two well-defined transport maps can transform the conditional independence test into an unconditional independence test, which substantially simplifies the problem. These transport maps are estimated from data using conditional continuous normalizing flow models. Within this framework, we derive a test statistic and prove its asymptotic validity under both the null and alternative hypotheses. A permutation-based procedure is employed to evaluate the significance of the test. We validate the proposed method through extensive simulations and real-data analysis. Our numerical studies demonstrate the practical effectiveness of the proposed method for conditional independence testing.
Authors: Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, Sinead Williamson
Abstract: Uncertainty Quantification (UQ) in Language Models (LMs) is key to improving their safety and reliability. Evaluations often use metrics like AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). We show that mutual biases--when both UQ methods and correctness functions are biased by the same factors--systematically distort evaluation. First, we formally prove that any mutual bias non-randomly skews AUROC rankings, compromising benchmark integrity. Second, we confirm this happens empirically by testing 7 widely used correctness functions, from lexical-based and embedding-based metrics to LM-as-a-judge approaches, across 4 datasets x 4 models x 8 UQ methods. Our analysis shows that length biases in correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LM-as-a-judge methods as the least length-biased, offering a promising path for a fairer UQ evaluation.
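For context, here is a minimal sketch of the AUROC-based evaluation protocol under scrutiny, with toy numbers in place of real model outputs.

```python
# Sketch: rank responses by a UQ score (negative sequence log-probability)
# and compute AUROC against binary correctness labels.
import numpy as np
from sklearn.metrics import roc_auc_score

seq_logprobs = np.array([-5.1, -2.3, -9.8, -1.2, -7.4])  # per-response log-probs
uncertainty = -seq_logprobs                                # higher = more uncertain
correct = np.array([1, 1, 0, 1, 0])                        # from a correctness function

# AUROC of "uncertainty predicts incorrectness". If the correctness function
# shares a bias (e.g., length) with the UQ score, this number is distorted --
# which is the paper's central point.
print(roc_auc_score(1 - correct, uncertainty))
```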
Authors: Yinlong Dai, Robert Ramirez Sanchez, Ryan Jeronimus, Shahabedin Sagheb, Cara M. Nunez, Heramb Nemlekar, Dylan P. Losey
Abstract: Today's robots learn new tasks by imitating human examples. However, this standard approach to visual imitation learning is fundamentally limited: the robot observes what the human does, but not why the human chooses those behaviors. Without understanding the features that factor into the human's decisions, robot learners often misinterpret the data and fail to perform the task when the environment changes. We therefore propose a shift in perspective: instead of asking human teachers just to show what actions the robot should take, we also enable humans to indicate task-relevant features using markers and language prompts. Our proposed algorithm, CIVIL, leverages this augmented data to filter the robot's visual observations and extract a feature representation that causally informs human actions. CIVIL then applies these causal features to train a transformer-based policy that emulates human behaviors without being confused by visual distractors. Our simulations, real-world experiments, and user study demonstrate that robots trained with CIVIL can learn from fewer human demonstrations and perform better than state-of-the-art baselines, especially in previously unseen scenarios. See videos at our project website: https://civil2025.github.io
Authors: Uday Kiran Reddy Tadipatri, Benjamin D. Haeffele, Joshua Agterberg, Ingvar Ziemann, Ren\'e Vidal
Abstract: Low-order linear System IDentification (SysID) addresses the challenge of estimating the parameters of a linear dynamical system from finite samples of observations and control inputs with minimal state representation. Traditional approaches often utilize Hankel-rank minimization, which relies on convex relaxations that can require numerous, costly singular value decompositions (SVDs) to optimize. In this work, we propose two nonconvex reformulations to tackle low-order SysID: (i) Burer-Monteiro (BM) factorization of the Hankel matrix for efficient nuclear norm minimization, and (ii) optimizing directly over system parameters for real, diagonalizable systems with an atomic norm-style decomposition. These reformulations circumvent the need for repeated heavy SVD computations, significantly improving computational efficiency. Moreover, we prove that optimizing directly over the system parameters yields lower statistical error rates, and lower sample complexities that do not scale linearly with trajectory length like in Hankel-nuclear norm minimization. Additionally, while our proposed formulations are nonconvex, we provide theoretical guarantees of achieving global optimality in polynomial time. Finally, we demonstrate algorithms that solve these nonconvex programs and validate our theoretical claims on synthetic data.
Authors: Alexander Baumann, Leonardo Ayala, Silvia Seidlitz, Jan Sellner, Alexander Studier-Fischer, Berkin \"Ozdemir, Lena Maier-Hein, Slobodan Ilic
Abstract: Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding, and is already established as a critical modality in remote sensing. However, variability in channel dimensionality and captured wavelengths among spectral cameras impede the development of AI-driven methodologies, leading to camera-specific models with limited generalizability and inadequate cross-camera applicability. To address this bottleneck, we introduce $\textbf{CARL}$, a model for $\textbf{C}$amera-$\textbf{A}$gnostic $\textbf{R}$epresentation $\textbf{L}$earning across RGB, multispectral, and hyperspectral imaging modalities. To enable the conversion of a spectral image with any channel dimensionality to a camera-agnostic embedding, we introduce wavelength positional encoding and a self-attention-cross-attention mechanism to compress spectral information into learned query representations. Spectral-spatial pre-training is achieved with a novel spectral self-supervised JEPA-inspired strategy tailored to CARL. Large-scale experiments across the domains of medical imaging, autonomous driving, and satellite imaging demonstrate our model's unique robustness to spectral heterogeneity, outperforming on datasets with simulated and real-world cross-camera spectral variations. The scalability and versatility of the proposed approach position our model as a backbone for future spectral foundation models.
Authors: Ze Zhang, Georg Hess, Junjie Hu, Emmanuel Dean, Lennart Svensson, Knut {\AA}kesson
Abstract: This paper proposes an integrated approach for the safe and efficient control of mobile robots in dynamic and uncertain environments. The approach consists of two key steps: one-shot multimodal motion prediction to anticipate motions of dynamic obstacles and model predictive control to incorporate these predictions into the motion planning process. Motion prediction is driven by an energy-based neural network that generates high-resolution, multi-step predictions in a single operation. The prediction outcomes are further utilized to create geometric shapes formulated as mathematical constraints. Instead of treating each dynamic obstacle individually, predicted obstacles are grouped by proximity in an unsupervised way to improve performance and efficiency. The overall collision-free navigation is handled by model predictive control with a specific design for proactive dynamic obstacle avoidance. The proposed approach allows mobile robots to navigate effectively in dynamic environments. Its performance is assessed across various scenarios that represent typical warehouse settings. The results demonstrate that the proposed approach outperforms other existing dynamic obstacle avoidance methods.
Authors: Jean-Michel Tucny, Mihir Durve, Sauro Succi
Abstract: The rise of deep learning challenges the longstanding scientific ideal of insight - the human capacity to understand phenomena by uncovering underlying mechanisms. In many modern applications, accurate predictions no longer require interpretable models, prompting debate about whether explainability is a realistic or even meaningful goal. From our perspective in physics, we examine this tension through a concrete case study: a physics-informed neural network (PINN) trained on a rarefied gas dynamics problem governed by the Boltzmann equation. Despite the system's clear structure and well-understood governing laws, the trained network's weights resemble Gaussian-distributed random matrices, with no evident trace of the physical principles involved. This suggests that deep learning and traditional simulation may follow distinct cognitive paths to the same outcome - one grounded in mechanistic insight, the other in statistical interpolation. Our findings raise critical questions about the limits of explainable AI and whether interpretability can -- or should -- remain a universal standard in artificial reasoning.
Authors: Yexiang Liu, Zekun Li, Zhi Fang, Nan Xu, Ran He, Tieniu Tan
Abstract: Recently, scaling test-time compute on Large Language Models (LLM) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks. Experiment results consistently show that as the number of samples and the computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling budgets, eliminating the need for resource-intensive inference processes in practical applications. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance. Code is available at https://github.com/MraDonkey/rethinking_prompting.
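The scaling setting studied is plain majority voting; a minimal sketch follows, with `generate_answer` as a placeholder for any prompting strategy plus LLM call.

```python
# Sketch: sample N answers from a stochastic generator and take the mode.
from collections import Counter

def majority_vote(generate_answer, question, n_samples=16):
    answers = [generate_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```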
Authors: Ahmet Bilican, M. Ak{\i}n Y{\i}lmaz, A. Murat Tekalp, R. G\"okberk Cinbi\c{s}
Abstract: Efficiently adapting large foundation models is critical, especially with tight compute and memory budgets. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA offer limited granularity and effectiveness in few-parameter regimes. We propose Wavelet Fine-Tuning (WaveFT), a novel PEFT method that learns highly sparse updates in the wavelet domain of residual matrices. WaveFT allows precise control of trainable parameters, offering fine-grained capacity adjustment and excelling at remarkably low parameter counts, potentially far fewer than LoRA's minimum, making it ideal for extreme parameter-efficient scenarios. Evaluated on personalized text-to-image generation with Stable Diffusion XL as the baseline, WaveFT significantly outperforms LoRA and other PEFT methods, especially at low parameter counts, achieving superior subject fidelity, prompt alignment, and image diversity.
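A rough sketch of the parameterization idea using PyWavelets, under the simplifying assumptions of a single-level Haar transform and hand-set coefficients standing in for learned ones.

```python
# Hedged sketch: a weight update parameterized by sparse wavelet coefficients.
import numpy as np
import pywt

W = np.random.randn(64, 64)                        # frozen pretrained weight (toy)
# Wavelet coefficients of the update start at zero.
cA, (cH, cV, cD) = pywt.dwt2(np.zeros_like(W), "haar")
# In real training only a few chosen coefficients would be learnable;
# here two are set by hand purely for illustration.
cA[0, 0], cD[3, 7] = 0.5, -0.2
dW = pywt.idwt2((cA, (cH, cV, cD)), "haar")        # dense update from sparse coeffs
W_adapted = W + dW                                 # adapted layer weight
```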
Authors: Aayam Bansal
Abstract: As financial institutions increasingly rely on machine learning models to automate lending decisions, concerns about algorithmic fairness have risen. This paper explores the tradeoff between enforcing fairness constraints (such as demographic parity or equal opportunity) and maximizing lender profitability. Through simulations on synthetic data that reflects real-world lending patterns, we quantify how different fairness interventions impact profit margins and default rates. Our results demonstrate that equal opportunity constraints typically impose lower profit costs than demographic parity, but surprisingly, removing protected attributes from the model (fairness through unawareness) outperforms explicit fairness interventions in both fairness and profitability metrics. We further identify the specific economic conditions under which fair lending becomes profitable and analyze the feature-specific drivers of unfairness. These findings offer practical guidance for designing lending algorithms that balance ethical considerations with business objectives.
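The two fairness constraints compared can be made concrete as simple gap metrics over model decisions; a minimal sketch follows (array names are illustrative).

```python
# Sketch: demographic parity vs. equal opportunity, computed from approvals,
# true repayment outcomes, and a binary protected attribute.
import numpy as np

def demographic_parity_gap(approved, group):
    # difference in approval rates between the two groups
    return abs(approved[group == 0].mean() - approved[group == 1].mean())

def equal_opportunity_gap(approved, repaid, group):
    # difference in approval rates among applicants who would actually repay
    tpr0 = approved[(group == 0) & (repaid == 1)].mean()
    tpr1 = approved[(group == 1) & (repaid == 1)].mean()
    return abs(tpr0 - tpr1)
```

Equal opportunity only constrains decisions on creditworthy applicants, which is one intuition for why the paper finds it cheaper in profit terms than full demographic parity.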
Authors: Nadav Har-Tuv, Or Tal, Yossi Adi
Abstract: We present PAST, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. Unlike previous approaches that rely on pretrained self-supervised models, PAST employs supervised phonetic data, directly integrating domain knowledge into the tokenization process via auxiliary tasks. Additionally, we introduce a streamable, causal variant of PAST, enabling real-time speech applications. Results demonstrate that PAST surpasses existing evaluated baseline tokenizers across common evaluation metrics, including phonetic representation and speech reconstruction. Notably, PAST also achieves superior performance when serving as a speech representation for speech language models, further highlighting its effectiveness as a foundation for spoken language generation. To foster further research, we release the full implementation. For code, model checkpoints, and samples see: https://pages.cs.huji.ac.il/adiyoss-lab/PAST
Authors: Gowtham Premananth, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Espy-Wilson
Abstract: Studies on schizophrenia assessments using deep learning typically treat it as a classification task to detect the presence or absence of the disorder, oversimplifying the condition and reducing its clinical applicability. This traditional approach overlooks the complexity of schizophrenia, limiting its practical value in healthcare settings. This study shifts the focus to individual symptom severity estimation using a multimodal approach that integrates speech, video, and text inputs. We develop unimodal models for each modality and a multimodal framework to improve accuracy and robustness. By capturing a more detailed symptom profile, this approach can help in enhancing diagnostic precision and support personalized treatment, offering a scalable and objective tool for mental health assessment.
Authors: Anna Van Elst, Igor Colin, Stephan Cl\'emen\c{c}on
Abstract: This paper addresses the problem of robust estimation in gossip algorithms over arbitrary communication graphs. Gossip algorithms are fully decentralized, relying only on local neighbor-to-neighbor communication, making them well-suited for situations where communication is constrained. A fundamental challenge in existing mean-based gossip algorithms is their vulnerability to malicious or corrupted nodes. In this paper, we show that an outlier-robust mean can be computed by globally estimating a robust statistic. More specifically, we propose a novel gossip algorithm for rank estimation, referred to as \textsc{GoRank}, and leverage it to design a gossip procedure dedicated to trimmed mean estimation, coined \textsc{GoTrim}. In addition to a detailed description of the proposed methods, a key contribution of our work is a precise convergence analysis: we establish an $\mathcal{O}(1/t)$ rate for rank estimation and an $\mathcal{O}(\log(t)/\sqrt{t})$ rate for trimmed mean estimation, where $t$ denotes the number of iterations. Moreover, we provide a breakdown point analysis of \textsc{GoTrim}. We empirically validate our theoretical results through experiments on diverse network topologies, data distributions, and contamination schemes.
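For intuition, here is a sketch of the pairwise gossip primitive such algorithms build on -- plain mean averaging over random edges, not GoRank/GoTrim themselves.

```python
# Sketch: decentralized gossip averaging; each step averages one random edge.
import random

def gossip_average(values, edges, n_iters=10_000, seed=0):
    rng = random.Random(seed)
    x = list(values)
    for _ in range(n_iters):
        i, j = rng.choice(edges)          # one neighbor-to-neighbor exchange
        x[i] = x[j] = (x[i] + x[j]) / 2   # both nodes adopt the pair's mean
    return x

# Ring of 8 nodes holding values 0..7; every node converges to the mean 3.5.
ring = [(i, (i + 1) % 8) for i in range(8)]
print(gossip_average(list(range(8)), ring))
```

A single corrupted node can pull this plain mean arbitrarily far, which is the vulnerability that motivates replacing it with rank-based trimmed mean estimation.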
Authors: Sebastian Antony Joseph, Syed Murtaza Husain, Stella S. R. Offner, St\'ephanie Juneau, Paul Torrey, Adam S. Bolton, Juan P. Farias, Niall Gaffney, Greg Durrett, Junyi Jessy Li
Abstract: Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model's ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.
Authors: Shahrooz Pouryousef, Ali Montazeralghaem
Abstract: User-item interactions contain rich collaborative signals that form the backbone of many successful recommender systems. While recent work has explored the use of large language models (LLMs) for recommendation, it remains unclear whether LLMs can effectively reason over this type of collaborative information. In this paper, we conduct a systematic comparison between LLMs and classical matrix factorization (MF) models to assess LLMs' ability to leverage user-item interaction data. We further introduce a simple retrieval-augmented generation (RAG) method that enhances LLMs by grounding their predictions in structured interaction data. Our experiments reveal that current LLMs often fall short in capturing collaborative patterns inherent to MF models, but that our RAG-based approach substantially improves recommendation quality, highlighting a promising direction for future LLM-based recommenders.
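A minimal sketch of the classical MF baseline referenced, trained with SGD on (user, item, rating) triples; the hyperparameters are illustrative.

```python
# Sketch: matrix factorization via SGD -- learn user/item embeddings whose
# dot product approximates observed ratings.
import numpy as np

def fit_mf(ratings, n_users, n_items, k=16, lr=0.01, reg=0.02, epochs=20):
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, k))   # user embeddings
    V = rng.normal(scale=0.1, size=(n_items, k))   # item embeddings
    for _ in range(epochs):
        for u, i, r in ratings:                    # (user, item, rating) triples
            err = r - U[u] @ V[i]                  # prediction error
            U[u] += lr * (err * V[i] - reg * U[u]) # gradient step with L2 penalty
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V

U, V = fit_mf([(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0)], n_users=2, n_items=2)
print(U[1] @ V[1])  # predicted rating for an unseen (user, item) pair
```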
Authors: Xianling Mu, Joseph Ternasky, Fuat Alican, Yigit Ihlamur
Abstract: Early-stage startup investment is a high-risk endeavor characterized by scarce data and uncertain outcomes. Traditional machine learning approaches often require large, labeled datasets and extensive fine-tuning, yet remain opaque and difficult for domain experts to interpret or improve. In this paper, we propose a transparent and data-efficient investment decision framework powered by memory-augmented large language models (LLMs) using in-context learning (ICL). Central to our method is a natural language policy embedded directly into the LLM prompt, enabling the model to apply explicit reasoning patterns and allowing human experts to easily interpret, audit, and iteratively refine the logic. We introduce a lightweight training process that combines few-shot learning with an in-context learning loop, enabling the LLM to update its decision policy iteratively based on structured feedback. With only minimal supervision and no gradient-based optimization, our system predicts startup success far more accurately than existing benchmarks. It is over 20x more precise than random chance, which succeeds 1.9% of the time. It is also 7.1x more precise than the typical 5.6% success rate of top-tier venture capital (VC) firms.
Authors: Pin-Han Chen, Yu-Sheng Lin, Wei-Cheng Lee, Tin-Yu Leu, Po-Hsiang Hsu, Anjana Dissanayake, Sungjin Oh, Chinq-Shiun Chiu
Abstract: RF/Analog design is essential for bridging digital technologies with real-world signals, ensuring the functionality and reliability of a wide range of electronic systems. However, analog design procedures are often intricate, time-consuming and reliant on expert intuition, and hinder the time and cost efficiency of circuit development. To overcome the limitations of the manual circuit design, we introduce MenTeR - a multiagent workflow integrated into an end-to-end analog design framework. By employing multiple specialized AI agents that collaboratively address different aspects of the design process, such as specification understanding, circuit optimization, and test bench validation, MenTeR reduces the dependency on frequent trial-and-error-style intervention. MenTeR not only accelerates the design cycle time but also facilitates a broader exploration of the design space, demonstrating robust capabilities in handling real-world analog systems. We believe that MenTeR lays the groundwork for future "RF/Analog Copilots" that can collaborate seamlessly with human designers.
Authors: Antonio D'Orazio, Maria Rosaria Briglia, Donato Crisostomi, Dario Loi, Emanuele Rodol\`a, Iacopo Masi
Abstract: CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
Authors: Kazuki Egashira, Robin Staab, Mark Vero, Jingxuan He, Martin Vechev
Abstract: With the increasing size of frontier LLMs, post-training quantization has become the standard for memory-efficient deployment. Recent work has shown that basic rounding-based quantization schemes pose security risks, as they can be exploited to inject malicious behaviors into quantized models that remain hidden in full precision. However, existing attacks cannot be applied to more complex quantization methods, such as the GGUF family used in the popular ollama and llama.cpp frameworks. In this work, we address this gap by introducing the first attack on GGUF. Our key insight is that the quantization error -- the difference between the full-precision weights and their (de-)quantized version -- provides sufficient flexibility to construct malicious quantized models that appear benign in full precision. Leveraging this, we develop an attack that trains the target malicious LLM while constraining its weights based on quantization errors. We demonstrate the effectiveness of our attack on three popular LLMs across nine GGUF quantization data types on three diverse attack scenarios: insecure code generation ($\Delta$=$88.7\%$), targeted content injection ($\Delta$=$85.0\%$), and benign instruction refusal ($\Delta$=$30.1\%$). Our attack highlights that (1) the most widely used post-training quantization method is susceptible to adversarial interferences, and (2) the complexity of quantization schemes alone is insufficient as a defense.
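The key insight can be illustrated with a plain round-to-nearest uniform quantizer -- much simpler than GGUF's k-quant formats: every full-precision weight may move anywhere within its rounding interval without changing the quantized model, and the attack trains malicious behavior inside exactly that slack.

```python
# Hedged sketch: the per-weight interval that maps to the same quantized value
# under round-to-nearest uniform quantization (a stand-in, not actual GGUF).
import numpy as np

def rounding_interval(w, scale):
    q = np.round(w / scale)        # quantized integer code for weight w
    lo = (q - 0.5) * scale         # any w' in (approximately) [lo, hi)
    hi = (q + 0.5) * scale         # dequantizes to the same value as w
    return lo, hi

w, scale = 0.137, 0.05
lo, hi = rounding_interval(w, scale)
print(f"weights in [{lo:.4f}, {hi:.4f}) all quantize identically")
```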
Authors: Hadi Askari, Shivanshu Gupta, Fei Wang, Anshuman Chhabra, Muhao Chen
Abstract: Pretrained Large Language Models (LLMs) achieve strong performance across a wide range of tasks, yet exhibit substantial variability in the various layers' training quality with respect to specific downstream applications, limiting their downstream performance. It is therefore critical to estimate layer-wise training quality in a manner that accounts for both model architecture and training data. However, existing approaches predominantly rely on model-centric heuristics (such as spectral statistics, outlier detection, or uniform allocation) while overlooking the influence of data. To address these limitations, we propose LayerIF, a data-driven framework that leverages Influence Functions to quantify the training quality of individual layers in a principled and task-sensitive manner. By isolating each layer's gradients and measuring the sensitivity of the validation loss to training examples by computing layer-wise influences, we derive data-driven estimates of layer importance. Notably, our method produces task-specific layer importance estimates for the same LLM, revealing how layers specialize for different test-time evaluation tasks. We demonstrate the utility of our scores by leveraging them for two downstream applications: (a) expert allocation in LoRA-MoE architectures and (b) layer-wise sparsity distribution for LLM pruning. Experiments across multiple LLM architectures demonstrate that our model-agnostic, influence-guided allocation leads to consistent gains in task performance.
Authors: Jesse Thibodeau, Hadi Nekoei, Afaf Ta\"ik, Janarthanan Rajendran, Golnoosh Farnadi
Abstract: Dynamic, risk-based pricing can systematically exclude vulnerable consumer groups from essential resources such as health insurance and consumer credit. We show that a regulator can realign private incentives with social objectives through a learned, interpretable tax schedule. First, we provide a formal proposition that bounding each firm's \emph{local} demographic gap implicitly bounds the \emph{global} opt-out disparity, motivating firm-level penalties. Building on this insight we introduce \texttt{MarketSim} -- an open-source, scalable simulator of heterogeneous consumers and profit-maximizing firms -- and train a reinforcement learning (RL) social planner (SP) that selects a bracketed fairness-tax while remaining close to a simple linear prior via an $\mathcal{L}_1$ regularizer. The learned policy is thus both transparent and easily interpretable. In two empirically calibrated markets, i.e., U.S. health-insurance and consumer-credit, our planner simultaneously raises demand-fairness by up to $16\%$ relative to unregulated Free Market while outperforming a fixed linear schedule in terms of social welfare without explicit coordination. These results illustrate how AI-assisted regulation can convert a competitive social dilemma into a win-win equilibrium, providing a principled and practical framework for fairness-aware market oversight.
Authors: Nicholas H. Barbara, Ruigang Wang, Alexandre Megretski, Ian R. Manchester
Abstract: We study parameterizations of stabilizing nonlinear policies for learning-based control. We propose a structure based on a nonlinear version of the Youla-Kucera parameterization combined with robust neural networks such as the recurrent equilibrium network (REN). The resulting parameterizations are unconstrained, and hence can be searched over with first-order optimization methods, while always ensuring closed-loop stability by construction. We study the combination of (a) nonlinear dynamics, (b) partial observation, and (c) incremental closed-loop stability requirements (contraction and Lipschitzness). We find that with any two of these three difficulties, a contracting and Lipschitz Youla parameter always leads to contracting and Lipschitz closed loops. However, if all three hold, then incremental stability can be lost with exogenous disturbances. Instead, a weaker condition is maintained, which we call d-tube contraction and Lipschitzness. We further obtain converse results showing that the proposed parameterization covers all contracting and Lipschitz closed loops for certain classes of nonlinear systems. Numerical experiments illustrate the utility of our parameterization when learning controllers with built-in stability certificates for: (i) "economic" rewards without stabilizing effects; (ii) short training horizons; and (iii) uncertain systems.
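For intuition, the classical linear skeleton of the Youla parameterization looks as follows (the paper's contribution is the nonlinear version with a REN playing the role of Q; this sketch uses hand-picked stabilizing gains on a double integrator):

```python
import numpy as np

class YoulaController:
    """Observer-based controller u = -K x_hat + Q(innov), with innovation
    innov = y - C x_hat. For any stable Q the closed loop remains stable,
    so Q can be searched over freely with first-order methods."""

    def __init__(self, A, B, C, K, L, Q):
        self.A, self.B, self.C, self.K, self.L, self.Q = A, B, C, K, L, Q
        self.xhat = np.zeros(A.shape[0])

    def step(self, y: np.ndarray) -> np.ndarray:
        innov = y - self.C @ self.xhat
        u = -self.K @ self.xhat + self.Q @ innov
        self.xhat = self.A @ self.xhat + self.B @ u + self.L @ innov
        return u

# Discrete-time double integrator with stabilizing feedback/observer gains.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])
K = np.array([[2.0, 3.0]])
L = np.array([[0.5], [0.5]])
Q = np.array([[0.1]])  # a static gain here; a contracting REN in the paper
ctrl = YoulaController(A, B, C, K, L, Q)
print(ctrl.step(np.array([1.0])))
```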
Authors: Pengcuo Dege, Qiuming Luo, Rui Mao, Chang Kong
Abstract: Efficient inference of Multi-Head Latent Attention (MLA) is challenging when deploying the DeepSeek-R1 671B model on a single multi-GPU server. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition to align the KV context length with the \(M\)-dimension in WGMMA operations, significantly reducing redundant computations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE (\(1.25 \times 10^{-5}\)) than FlashAttention-3. Furthermore, ETAP's design enables seamless integration into frameworks like FlashAttention-3 and FlashInfer, supported by a detailed theoretical analysis. Our work addresses a critical gap in resource-constrained inference, offering a scalable solution for mid-tier GPUs and paving the way for broader adoption in hardware-aware optimization. Code is available at https://github.com/pengcuo/FlashMLA-ETAP.
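A NumPy sketch of the transposition at the heart of ETAP (illustrative only; the actual gains come from how WGMMA tiles the M-dimension on H20 hardware): during decoding the query count is tiny while the KV context is long, so computing K Q^T instead of Q K^T puts the long context length on the matmul's M-dimension.

```python
import numpy as np

d, n_q, n_kv = 64, 1, 4096          # head dim, decode queries, KV context
Q = np.random.randn(n_q, d).astype(np.float32)
K = np.random.randn(n_kv, d).astype(np.float32)
V = np.random.randn(n_kv, d).astype(np.float32)

S = Q @ K.T                          # standard: (n_q, n_kv), short M-dim
S_t = K @ Q.T                        # transposed: (n_kv, n_q), long M-dim
assert np.allclose(S, S_t.T, atol=1e-4)  # same scores, different layout

P = np.exp(S - S.max(axis=1, keepdims=True))   # softmax over the KV axis
P /= P.sum(axis=1, keepdims=True)
out = P @ V                          # (n_q, d) attention output
```

The two layouts are mathematically identical; the speedup is purely about mapping the long dimension onto the axis the tensor cores tile efficiently.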
Authors: Michael Li, Nishant Subramani
Abstract: Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today's language models, we investigate how both classical architectures (BERT, DeBERTa, GPT-2) and contemporary large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) represent lexical identity and inflectional morphology. We train linear and nonlinear classifiers on layer-wise activations to predict word lemmas and inflectional features. We discover that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout the layers. Further analysis reveals that these models encode inflectional morphology through generalizable abstractions, but rely predominantly on memorization to encode lexical identity. Remarkably, these patterns emerge across all 16 models we test, despite differences in architecture, size, and training regime (including pretrained and instruction-tuned variants). This consistency suggests that, despite substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties could be fundamental for next token prediction and are learned early during pretraining. Our code is available at https://github.com/ml5885/model_internal_sleuthing
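A minimal version of this probing setup (synthetic data and an assumed label scheme, shown only to make the protocol concrete): fit a linear and a small nonlinear classifier on one layer's activations and compare their accuracies; repeating this per layer traces where, and how linearly, a property is encoded.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_tokens, hidden, n_classes = 2000, 128, 10
acts = rng.normal(size=(n_tokens, hidden))           # one layer's activations
labels = rng.integers(0, n_classes, size=n_tokens)   # e.g. lemma or tense IDs

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
nonlin = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300).fit(X_tr, y_tr)
# A large nonlinear-minus-linear gap suggests the property is present but
# not linearly separable at this layer.
print(linear.score(X_te, y_te), nonlin.score(X_te, y_te))
```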
Authors: Adam Pardyl, Dominik Matuszek, Mateusz Przebieracz, Marek Cygan, Bartosz Zieliński, Maciej Wołczyk
Abstract: The real world is messy and unstructured. Uncovering critical information often requires active, goal-driven exploration. It remains to be seen whether Vision-Language Models (VLMs), which recently emerged as a popular zero-shot tool in many difficult tasks, can operate effectively in such conditions. In this paper, we answer this question by introducing FlySearch, a 3D, outdoor, photorealistic environment for searching and navigating to objects in complex scenes. We define three sets of scenarios with varying difficulty and observe that state-of-the-art VLMs cannot reliably solve even the simplest exploration tasks, with the gap to human performance increasing as the tasks get harder. We identify a set of central causes, ranging from vision hallucination, through context misunderstanding, to task planning failures, and we show that some of them can be addressed by finetuning. We publicly release the benchmark, scenarios, and the underlying codebase.
Authors: Junghyun Lee, Kyoungseok Jang, Kwang-Sung Jun, Milan Vojnović, Se-Young Yun
Abstract: We present `GL-LowPopArt`, a novel Catoni-style estimator for generalized low-rank trace regression. Building on `LowPopArt` (Jang et al., 2024), it employs a two-stage approach: nuclear norm regularization followed by matrix Catoni estimation. We establish state-of-the-art estimation error bounds, surpassing existing guarantees (Fan et al., 2019; Kang et al., 2022), and reveal a novel experimental design objective, $\mathrm{GL}(\pi)$. The key technical challenge is controlling bias from the nonlinear inverse link function, which we address by our two-stage approach. We prove a *local* minimax lower bound, showing that our `GL-LowPopArt` enjoys instance-wise optimality up to the condition number of the ground-truth Hessian. Applications include generalized linear matrix completion, where `GL-LowPopArt` achieves a state-of-the-art Frobenius error guarantee, and **bilinear dueling bandits**, a novel setting inspired by general preference learning (Zhang et al., 2024). Our analysis of a `GL-LowPopArt`-based explore-then-commit algorithm reveals a new, potentially interesting problem-dependent quantity, along with an improved Borda regret bound compared to vectorization (Wu et al., 2024).
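To unpack the two-stage recipe, here is a hedged sketch of the first stage, nuclear-norm-regularized trace regression solved by proximal gradient (the second-stage matrix Catoni refinement and the GLM link are omitted; all constants are illustrative):

```python
import numpy as np

def svd_shrink(Theta: np.ndarray, tau: float) -> np.ndarray:
    """Proximal operator of the nuclear norm: soft-threshold singular values."""
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def stage1(X, y, shape, lam=0.1, lr=0.1, iters=300) -> np.ndarray:
    """Trace regression y_i ~ <X_i, Theta> with a nuclear-norm penalty."""
    Theta = np.zeros(shape)
    for _ in range(iters):
        resid = np.array([np.sum(Xi * Theta) for Xi in X]) - y
        grad = sum(r * Xi for r, Xi in zip(resid, X)) / len(y)
        Theta = svd_shrink(Theta - lr * grad, lr * lam)
    return Theta

# Toy data: rank-1 ground truth seen through random linear measurements.
rng = np.random.default_rng(0)
Theta_star = np.outer(rng.normal(size=5), rng.normal(size=5))
X = [rng.normal(size=(5, 5)) for _ in range(400)]
y = np.array([np.sum(Xi * Theta_star) for Xi in X]) + 0.1 * rng.normal(size=400)
est = stage1(X, y, (5, 5))
print(np.linalg.norm(est - Theta_star) / np.linalg.norm(Theta_star))
```

In the full method this pilot estimate is then refined by a matrix Catoni estimator, which is what controls the bias from the nonlinear inverse link.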