new P2W: From Power Traces to Weights Matrix -- An Unconventional Transfer Learning Approach

Authors: Roozbeh Siyadatzadeh, Fatemeh Mehrafrooz, Nele Mentens, Todor Stefanov

Abstract: The rapid growth of deploying machine learning (ML) models within embedded systems on a chip (SoCs) has led to transformative shifts in fields like healthcare and autonomous vehicles. One of the primary challenges for training such embedded ML models is the lack of publicly available high-quality training data. Transfer learning approaches address this challenge by utilizing the knowledge encapsulated in an existing ML model as a starting point for training a new ML model. However, existing transfer learning approaches require direct access to the existing model which is not always feasible, especially for ML models deployed on embedded SoCs. Therefore, in this paper, we introduce a novel unconventional transfer learning approach to train a new ML model by extracting and using weights from an existing ML model running on an embedded SoC without having access to the model within the SoC. Our approach captures power consumption measurements from the SoC while it is executing the ML model and translates them to an approximated weights matrix used to initialize the new ML model. This improves the learning efficiency and predictive performance of the new model, especially in scenarios with limited data available to train the model. Our novel approach can effectively increase the accuracy of the new ML model up to 3 times compared to classical training methods using the same amount of limited training data.

new EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models

Authors: Nastaran Darabi, Devashri Naik, Sina Tayebati, Dinithi Jayasuriya, Ranganath Krishnan, Amit Ranjan Trivedi

Abstract: Vision-Language Models (VLMs) inherit adversarial vulnerabilities of Large Language Models (LLMs), which are further exacerbated by their multimodal nature. Existing defenses, including adversarial training, input transformations, and heuristic detection, are computationally expensive, architecture-dependent, and fragile against adaptive attacks. We introduce EigenShield, an inference-time defense leveraging Random Matrix Theory to quantify adversarial disruptions in high-dimensional VLM representations. Unlike prior methods that rely on empirical heuristics, EigenShield employs the spiked covariance model to detect structured spectral deviations. Using a Robustness-based Nonconformity Score (RbNS) and quantile-based thresholding, it separates causal eigenvectors, which encode semantic information, from correlational eigenvectors that are susceptible to adversarial artifacts. By projecting embeddings onto the causal subspace, EigenShield filters adversarial noise without modifying model parameters or requiring adversarial training. This architecture-independent, attack-agnostic approach significantly reduces the attack success rate, establishing spectral analysis as a principled alternative to conventional defenses. Our results demonstrate that EigenShield consistently outperforms all existing defenses, including adversarial training, UNIGUARD, and CIDER.

new Generative Modeling of Individual Behavior at Scale

Authors: Nabil Omi, Lucas Caccia, Anurag Sarkar, Jordan T. Ash, Siddhartha Sen

Abstract: There has been a growing interest in using AI to model human behavior, particularly in domains where humans interact with this technology. While most existing work models human behavior at an aggregate level, our goal is to model behavior at the individual level. Recent approaches to behavioral stylometry -- or the task of identifying a person from their actions alone -- have shown promise in domains like chess, but these approaches are either not scalable (e.g., fine-tune a separate model for each person) or not generative, in that they cannot generate actions. We address these limitations by framing behavioral stylometry as a multi-task learning problem -- where each task represents a distinct person -- and use parameter-efficient fine-tuning (PEFT) methods to learn an explicit style vector for each person. Style vectors are generative: they selectively activate shared "skill" parameters to generate actions in the style of each person. They also induce a latent space that we can interpret and manipulate algorithmically. In particular, we develop a general technique for style steering that allows us to steer a player's style vector towards a desired property. We apply our approach to two very different games, at unprecedented scales: chess (47,864 players) and Rocket League (2,000 players). We also show generality beyond gaming by applying our method to image generation, where we learn style vectors for 10,177 celebrities and use these vectors to steer their images.

new Digital implementations of deep feature extractors are intrinsically informative

Authors: Max Getter

Abstract: Rapid information (energy) propagation in deep feature extractors is crucial to balance computational complexity versus expressiveness as a representation of the input. We prove an upper bound for the speed of energy propagation in a unified framework that covers different neural network models, both over Euclidean and non-Euclidean domains. Additional structural information about the signal domain can be used to explicitly determine or improve the rate of decay. To illustrate this, we show global exponential energy decay for a range of 1) feature extractors with discrete-domain input signals, and 2) convolutional neural networks (CNNs) via scattering over locally compact abelian (LCA) groups.

new Understanding the Design Principles of Link Prediction in Directed Settings

Authors: Jun Zhai, Muberra Ozmen, Thomas Markovich

Abstract: Link prediction is a widely studied task in Graph Representation Learning (GRL) for modeling relational data. The early theories in GRL were based on the assumption of a symmetric adjacency matrix, reflecting an undirected setting. As a result, much of the following state-of-the-art research has continued to operate under this symmetry assumption, even though real-world data often involve crucial information conveyed through the direction of relationships. This oversight limits the ability of these models to fully capture the complexity of directed interactions. In this paper, we focus on the challenge of directed link prediction by evaluating key heuristics that have been successful in undirected settings. We propose simple but effective adaptations of these heuristics to the directed link prediction task and demonstrate that these modifications produce competitive performance compared to the leading Graph Neural Networks (GNNs) originally designed for undirected graphs. Through an extensive set of experiments, we derive insights that inform the development of a novel framework for directed link prediction, which not only surpasses baseline methods but also outperforms state-of-the-art GNNs on multiple benchmarks.

new Towards Physics-Guided Foundation Models

Authors: Majid Farhadloo, Arun Sharma, Mingzhou Yang, Bharat Jayaprakash, William Northrop, Shashi Shekhar

Abstract: Traditional foundation models are pre-trained on broad datasets to reduce the training resources (e.g., time, energy, labeled samples) needed for fine-tuning a wide range of downstream tasks. However, traditional foundation models struggle with out-of-distribution prediction and can produce outputs that are unrealistic and physically infeasible. We propose the notation of physics-guided foundation models (PGFM), that is, foundation models integrated with broad or general domain (e.g., scientific) physical knowledge applicable to a wide range of downstream tasks.

new Accelerating Neural Network Training: An Analysis of the AlgoPerf Competition

Authors: Priya Kasimbeg, Frank Schneider, Runa Eschenhagen, Juhan Bae, Chandramouli Shama Sastry, Mark Saroufim, Boyuan Feng, Less Wright, Edward Z. Yang, Zachary Nado, Sourabh Medapati, Philipp Hennig, Michael Rabbat, George E. Dahl

Abstract: The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. This paper presents the inaugural AlgoPerf competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using Distributed Shampoo, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like Adam, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the Schedule Free AdamW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. These results highlight both the significant progress so far, and the considerable room for further improvements.

new TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation

Authors: Juntong Ni, Zewen Liu, Shiyu Wang, Ming Jin, Wei Jin

Abstract: Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLP. Additionally, we provide a theoretical analysis, demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7X faster inference and requires 130X fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.

new Interpreting Adversarial Attacks and Defences using Architectures with Enhanced Interpretability

Authors: Akshay G Rao, Chandrashekhar Lakshminarayanan, Arun Rajkumar

Abstract: Adversarial attacks in deep learning represent a significant threat to the integrity and reliability of machine learning models. Adversarial training has been a popular defence technique against these adversarial attacks. In this work, we capitalize on a network architecture, namely Deep Linearly Gated Networks (DLGN), which has better interpretation capabilities than regular deep network architectures. Using this architecture, we interpret robust models trained using PGD adversarial training and compare them with standard training. Feature networks in DLGN act as feature extractors, making them the only medium through which an adversary can attack the model. We analyze the feature network of DLGN with fully connected layers with respect to properties like alignment of the hyperplanes, hyperplane relation with PCA, and sub-network overlap among classes and compare these properties between robust and standard models. We also consider this architecture having CNN layers wherein we qualitatively (using visualizations) and quantitatively contrast gating patterns between robust and standard models. We uncover insights into hyperplanes resembling principal components in PGD-AT and STD-TR models, with PGD-AT hyperplanes aligned farther from the data points. We use path activity analysis to show that PGD-AT models create diverse, non-overlapping active subnetworks across classes, preventing attack-induced gating overlaps. Our visualization ideas show the nature of representations learnt by PGD-AT and STD-TR models.

new GeoAggregator: An Efficient Transformer Model for Geo-Spatial Tabular Data

Authors: Rui Deng, Ziqi Li, Mingshu Wang

Abstract: Modeling geospatial tabular data with deep learning has become a promising alternative to traditional statistical and machine learning approaches. However, existing deep learning models often face challenges related to scalability and flexibility as datasets grow. To this end, this paper introduces GeoAggregator, an efficient and lightweight algorithm based on transformer architecture designed specifically for geospatial tabular data modeling. GeoAggregators explicitly account for spatial autocorrelation and spatial heterogeneity through Gaussian-biased local attention and global positional awareness. Additionally, we introduce a new attention mechanism that uses the Cartesian product to manage the size of the model while maintaining strong expressive power. We benchmark GeoAggregator against spatial statistical models, XGBoost, and several state-of-the-art geospatial deep learning methods using both synthetic and empirical geospatial datasets. The results demonstrate that GeoAggregators achieve the best or second-best performance compared to their competitors on nearly all datasets. GeoAggregator's efficiency is underscored by its reduced model size, making it both scalable and lightweight. Moreover, ablation experiments offer insights into the effectiveness of the Gaussian bias and Cartesian attention mechanism, providing recommendations for further optimizing the GeoAggregator's performance.

new Approximating Latent Manifolds in Neural Networks via Vanishing Ideals

Authors: Nico Pelleriti, Max Zimmer, Elias Wirth, Sebastian Pokutta

Abstract: Deep neural networks have reshaped modern machine learning by learning powerful latent representations that often align with the manifold hypothesis: high-dimensional data lie on lower-dimensional manifolds. In this paper, we establish a connection between manifold learning and computational algebra by demonstrating how vanishing ideals can characterize the latent manifolds of deep networks. To that end, we propose a new neural architecture that (i) truncates a pretrained network at an intermediate layer, (ii) approximates each class manifold via polynomial generators of the vanishing ideal, and (iii) transforms the resulting latent space into linearly separable features through a single polynomial layer. The resulting models have significantly fewer layers than their pretrained baselines, while maintaining comparable accuracy, achieving higher throughput, and utilizing fewer parameters. Furthermore, drawing on spectral complexity analysis, we derive sharper theoretical guarantees for generalization, showing that our approach can in principle offer tighter bounds than standard deep networks. Numerical experiments confirm the effectiveness and efficiency of the proposed approach.

new GiGL: Large-Scale Graph Neural Networks at Snapchat

Authors: Tong Zhao, Yozen Liu, Matthew Kolodner, Kyle Montemayor, Elham Ghazizadeh, Ankit Batra, Zihao Fan, Xiaobin Gao, Xuan Guo, Jiwen Ren, Serim Park, Peicheng Yu, Jun Yu, Shubham Vij, Neil Shah

Abstract: Recent advances in graph machine learning (ML) with the introduction of Graph Neural Networks (GNNs) have led to a widespread interest in applying these approaches to business applications at scale. GNNs enable differentiable end-to-end (E2E) learning of model parameters given graph structure which enables optimization towards popular node, edge (link) and graph-level tasks. While the research innovation in new GNN layers and training strategies has been rapid, industrial adoption and utility of GNNs has lagged considerably due to the unique scale challenges that large-scale graph ML problems create. In this work, we share our approach to training, inference, and utilization of GNNs at Snapchat. To this end, we present GiGL (Gigantic Graph Learning), an open-source library to enable large-scale distributed graph ML to the benefit of researchers, ML engineers, and practitioners. We use GiGL internally at Snapchat to manage the heavy lifting of GNN workflows, including graph data preprocessing from relational DBs, subgraph sampling, distributed training, inference, and orchestration. GiGL is designed to interface cleanly with open-source GNN modeling libraries prominent in academia like PyTorch Geometric (PyG), while handling scaling and productionization challenges that make it easier for internal practitioners to focus on modeling. GiGL is used in multiple production settings, and has powered over 35 launches across multiple business domains in the last 2 years in the contexts of friend recommendation, content recommendation and advertising. This work details high-level design and tools the library provides, scaling properties, case studies in diverse business settings with industry-scale graphs, and several key lessons learned in employing graph ML at scale on large social data. GiGL is open-sourced at https://github.com/snap-research/GiGL.

URLs: https://github.com/snap-research/GiGL.

new Visualizing Machine Learning Models for Enhanced Financial Decision-Making and Risk Management

Authors: Priyam Ganguly, Ramakrishna Garine, Isha Mukherjee

Abstract: This study emphasizes how crucial it is to visualize machine learning models, especially for the banking industry, in order to improve interpretability and support predictions in high stakes financial settings. Visual tools enable performance improvements and support the creation of innovative financial models by offering crucial insights into the algorithmic decision-making processes. Within a financial machine learning framework, the research uses visually guided experiments to make important concepts, such risk assessment and portfolio allocation, more understandable. The study also examines variations in trading tactics and how they relate to risk appetite, coming to the conclusion that the frequency of portfolio rebalancing is negatively correlated with risk tolerance. Finding these ideas is made possible in large part by visualization. The study concludes by presenting a novel method of locally stochastic asset weighing, where visualization facilitates data extraction and validation. This highlights the usefulness of these methods in furthering the field of financial machine learning research.

new More for Keys, Less for Values: Adaptive KV Cache Quantization

Authors: Mohsen Hariri, Lam Nguyen, Sixu Chen, Shaochen Zhong, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary

Abstract: This paper introduces an information-aware quantization framework that adaptively compresses the key-value (KV) cache in large language models (LLMs). Although prior work has underscored the distinct roles of key and value cache during inference, our systematic analysis -- examining singular value distributions, spectral norms, and Frobenius norms -- reveals, for the first time, that key matrices consistently exhibit higher norm values and are more sensitive to quantization than value matrices. Furthermore, our theoretical analysis shows that matrices with higher spectral norms amplify quantization errors more significantly. Motivated by these insights, we propose a mixed-precision quantization strategy, KV-AdaQuant, which allocates more bit-width for keys and fewer for values since key matrices have higher norm values. With the same total KV bit budget, this approach effectively mitigates error propagation across transformer layers while achieving significant memory savings. Our extensive experiments on multiple LLMs (1B--70B) demonstrate that our mixed-precision quantization scheme maintains high model accuracy even under aggressive compression. For instance, using 4-bit for Key and 2-bit for Value achieves an accuracy of 75.2%, whereas reversing the assignment (2-bit for Key and 4-bit for Value) yields only 54.7% accuracy. The code is available at https://tinyurl.com/kv-adaquant

URLs: https://tinyurl.com/kv-adaquant

new UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning

Authors: Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal

Abstract: User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model's other abilities intact, with a failure to balance this trade-off leading to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that the model damage is correlated with the variance of the model's representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. We evaluate UPCORE across three standard unlearning methods consistently achieving a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric, measuring the area-under-the-curve (AUC) across standard metrics. We find that UPCORE improves both standard metrics and AUC, benefitting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.

new Leveraging ChatGPT for Sponsored Ad Detection and Keyword Extraction in YouTube Videos

Authors: Brice Valentin Kok-Shun, Johnny Chan

Abstract: This work-in-progress paper presents a novel approach to detecting sponsored advertisement segments in YouTube videos and comparing the advertisement with the main content. Our methodology involves the collection of 421 auto-generated and manual transcripts which are then fed into a prompt-engineered GPT-4o for ad detection, a KeyBERT for keyword extraction, and another iteration of ChatGPT for category identification. The results revealed a significant prevalence of product-related ads across various educational topics, with ad categories refined using GPT-4o into succinct 9 content and 4 advertisement categories. This approach provides a scalable and efficient alternative to traditional ad detection methods while offering new insights into the types and relevance of ads embedded within educational content. This study highlights the potential of LLMs in transforming ad detection processes and improving our understanding of advertisement strategies in digital media.

new Assessing a Single Student's Concentration on Learning Platforms: A Machine Learning-Enhanced EEG-Based Framework

Authors: Zewen Zhuo, Mohamad Najafi, Hazem Zein, Amine Nait-Ali

Abstract: This study introduces a specialized pipeline designed to classify the concentration state of an individual student during online learning sessions by training a custom-tailored machine learning model. Detailed protocols for acquiring and preprocessing EEG data are outlined, along with the extraction of fifty statistical features from five EEG signal bands: alpha, beta, theta, delta, and gamma. Following feature extraction, a thorough feature selection process was conducted to optimize the data inputs for a personalized analysis. The study also explores the benefits of hyperparameter fine-tuning to enhance the classification accuracy of the student's concentration state. EEG signals were captured from the student using a Muse headband (Gen 2), equipped with five electrodes (TP9, AF7, AF8, TP10, and a reference electrode NZ), during engagement with educational content on computer-based e-learning platforms. Employing a random forest model customized to the student's data, we achieved remarkable classification performance, with test accuracies of 97.6% in the computer-based learning setting and 98% in the virtual reality setting. These results underscore the effectiveness of our approach in delivering personalized insights into student concentration during online educational activities.

new MONSTER: Monash Scalable Time Series Evaluation Repository

Authors: Angus Dempster, Navid Mohammadi Foumani, Chang Wei Tan, Lynn Miller, Amish Mishra, Mahsa Salehi, Charlotte Pelletier, Daniel F. Schmidt, Geoffrey I. Webb

Abstract: We introduce MONSTER-the MONash Scalable Time Series Evaluation Repository-a collection of large datasets for time series classification. The field of time series classification has benefitted from common benchmarks set by the UCR and UEA time series classification repositories. However, the datasets in these benchmarks are small, with median sizes of 217 and 255 examples, respectively. In consequence they favour a narrow subspace of models that are optimised to achieve low classification error on a wide variety of smaller datasets, that is, models that minimise variance, and give little weight to computational issues such as scalability. Our hope is to diversify the field by introducing benchmarks using larger datasets. We believe that there is enormous potential for new progress in the field by engaging with the theoretical and practical challenges of learning effectively from larger quantities of data.

new Data Complexity Measures for Quantum Circuits Architecture Recommendation

Authors: Fernando M de Paula Neto

Abstract: Quantum Parametric Circuits are constructed as an alternative to reduce the size of quantum circuits, meaning to decrease the number of quantum gates and, consequently, the depth of these circuits. However, determining the optimal circuit for a given problem remains an open question. Testing various combinations is challenging due to the infinite possibilities. In this work, a quantum circuit recommendation architecture for classification problems is proposed using database complexity measures. A quantum circuit is defined based on a circuit layer and the number of times this layer is iterated. Fourteen databases of varying dimensions and different numbers of classes were used to evaluate six quantum circuits, each with 1, 2, 3, 4, 8, and 16-layer repetitions. Using data complexity measures from the databases, it was possible to identify the optimal circuit capable of solving all problems with up to 100$\%$ accuracy. Furthermore, with a mean absolute error of 0.80 $\pm$ 2.17, one determined the appropriate number of layer repetitions, allowing for an error margin of up to three additional layers. Sixteen distinct machine learning models were employed for the selection of quantum circuits, alongside twelve classical regressor models to dynamically define the number of layers.

new Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF

Authors: Nuoya Xiong, Aarti Singh

Abstract: Reinforcement Learning with Human Feedback (RLHF) is a widely used fine-tuning approach that aligns machine learning model, particularly Language Model (LM) with human preferences. There are typically multiple objectives driving the preference, hence humans find it easier to express per-objective comparisons rather than a global preference between two choices. %, e.g. compare two papers on their novelty, clarity, correctness, etc. Multi-Objective RLHF (MORLHF) aims to use per-objective preference feedback and achieve Pareto optimality among these objectives by aggregating them into a single unified objective for optimization. However, nearly all prior works rely on linear aggregation, which rules out policies that favor specific objectives such as the worst one. The only existing approach using non-linear aggregation is computationally expensive due to its reward-based nature and the need for retraining whenever the aggregation parameters change. In this work, we address this limitation by transforming the non-linear aggregation maximization problem into a series of sub-problems. Each sub-problem involves only linear aggregation, making it computationally efficient to solve. We further extend our framework to handle multi-group scenarios, where each group has distinct weights for the objectives. Our method enables achieving consensus or maximizing the aggregated objective across all groups. Theoretically, we demonstrate that our algorithmic framework achieves sublinear regret and can be easily adapted to a reward-free algorithm. Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained.

new Optimizing Product Provenance Verification using Data Valuation Methods

Authors: Raquib Bin Yousuf, Hoang Anh Just, Shengzhe Xu, Brian Mayer, Victor Deklerck, Jakub Truszkowski, John C. Simeone, Jade Saunders, Chang-Tien Lu, Ruoxi Jia, Naren Ramakrishnan

Abstract: Determining and verifying product provenance remains a critical challenge in global supply chains, particularly as geopolitical conflicts and shifting borders create new incentives for misrepresentation of commodities, such as hiding the origin of illegally harvested timber or stolen agricultural products. Stable Isotope Ratio Analysis (SIRA), combined with Gaussian process regression-based isoscapes, has emerged as a powerful tool for geographic origin verification. However, the effectiveness of these models is often constrained by data scarcity and suboptimal dataset selection. In this work, we introduce a novel data valuation framework designed to enhance the selection and utilization of training data for machine learning models applied in SIRA. By prioritizing high-informative samples, our approach improves model robustness and predictive accuracy across diverse datasets and geographies. We validate our methodology with extensive experiments, demonstrating its potential to significantly enhance provenance verification, mitigate fraudulent trade practices, and strengthen regulatory enforcement of global supply chains.

new Graph-Based Deep Learning on Stereo EEG for Predicting Seizure Freedom in Epilepsy Patients

Authors: Artur Agaronyan, Syeda Abeera Amir, Nunthasiri Wittayanakorn, John Schreiber, Marius G. Linguraru, William Gaillard, Chima Oluigbo, Syed Muhammad Anwar

Abstract: Predicting seizure freedom is essential for tailoring epilepsy treatment. But accurate prediction remains challenging with traditional methods, especially with diverse patient populations. This study developed a deep learning-based graph neural network (GNN) model to predict seizure freedom from stereo electroencephalography (sEEG) data in patients with refractory epilepsy. We utilized high-quality sEEG data from 15 pediatric patients to train a deep learning model that can accurately predict seizure freedom outcomes and advance understanding of brain connectivity at the seizure onset zone. Our model integrates local and global connectivity using graph convolutions with multi-scale attention mechanisms to capture connections between difficult-to-study regions such as the thalamus and motor regions. The model achieved an accuracy of 92.4% in binary class analysis, 86.6% in patient-wise analysis, and 81.4% in multi-class analysis. Node and edge-level feature analysis highlighted the anterior cingulate and frontal pole regions as key contributors to seizure freedom outcomes. The nodes identified by our model were also more likely to coincide with seizure onset zones. Our findings underscore the potential of new connectivity-based deep learning models such as GNNs for enhancing the prediction of seizure freedom, predicting seizure onset zones, connectivity analysis of the brain during seizure, as well as informing AI-assisted personalized epilepsy treatment planning.

new PairBench: A Systematic Framework for Selecting Reliable Judge VLMs

Authors: Aarash Feizi, Sai Rajeswar, Adriana Romero-Soriano, Reihaneh Rabbany, Spandana Gella, Valentina Zantedeschi, Jo\~ao Monteiro

Abstract: As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare data pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across various modalities and scenarios. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores: alignment with human annotations, consistency for data pairs irrespective of their order, smoothness of similarity distributions, and controllability through prompting. Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics; the optimal choice depends on an auto evaluator's desired behavior (e.g., a smooth vs. a sharp judge), highlighting risks of widespread adoption of VLMs as evaluators without thorough assessment. For instance, the majority of VLMs struggle with maintaining symmetric similarity scores regardless of order. Additionally, our results show that the performance of VLMs on the metrics in PairBench closely correlates with popular benchmarks, showcasing its predictive power in ranking models.

new The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning

Authors: Sheila Schoepp, Masoud Jafaripour, Yingyue Cao, Tianpei Yang, Fatemeh Abdollahi, Shadan Golestan, Zahin Sufiyan, Osmar R. Zaiane, Matthew E. Taylor

Abstract: Reinforcement learning (RL) has shown impressive results in sequential decision-making tasks. Meanwhile, Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged, exhibiting impressive capabilities in multimodal understanding and reasoning. These advances have led to a surge of research integrating LLMs and VLMs into RL. In this survey, we review representative works in which LLMs and VLMs are used to overcome key challenges in RL, such as lack of prior knowledge, long-horizon planning, and reward design. We present a taxonomy that categorizes these LLM/VLM-assisted RL approaches into three roles: agent, planner, and reward. We conclude by exploring open problems, including grounding, bias mitigation, improved representations, and action advice. By consolidating existing research and identifying future directions, this survey establishes a framework for integrating LLMs and VLMs into RL, advancing approaches that unify natural language and visual understanding with sequential decision-making.

new Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

Authors: Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, Dianbo Liu

Abstract: Given the remarkable performance of Large Language Models (LLMs), an important question arises: Can LLMs conduct human-like scientific research and discover new knowledge, and act as an AI scientist? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions; however, no standardized benchmark specifically designed for scientific discovery exists for LLM agents. In response to these limitations, we introduce a novel benchmark, \textit{Auto-Bench}, that encompasses necessary aspects to evaluate LLMs for scientific discovery in both natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. By engaging interactively with an oracle, the models iteratively refine their understanding of underlying interactions, the chemistry and social interactions, through strategic interventions. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as the problem complexity increases, which suggests an important gap between machine and human intelligence that future development of LLMs need to take into consideration.

new Multi-agent Multi-armed Bandits with Minimum Reward Guarantee Fairness

Authors: Piyushi Manupriya, Himanshu, SakethaNath Jagarlapudi, Ganesh Ghalme

Abstract: We investigate the problem of maximizing social welfare while ensuring fairness in a multi-agent multi-armed bandit (MA-MAB) setting. In this problem, a centralized decision-maker takes actions over time, generating random rewards for various agents. Our goal is to maximize the sum of expected cumulative rewards, a.k.a. social welfare, while ensuring that each agent receives an expected reward that is at least a constant fraction of the maximum possible expected reward. Our proposed algorithm, RewardFairUCB, leverages the Upper Confidence Bound (UCB) technique to achieve sublinear regret bounds for both fairness and social welfare. The fairness regret measures the positive difference between the minimum reward guarantee and the expected reward of a given policy, whereas the social welfare regret measures the difference between the social welfare of the optimal fair policy and that of the given policy. We show that RewardFairUCB algorithm achieves instance-independent social welfare regret guarantees of $\tilde{O}(T^{1/2})$ and a fairness regret upper bound of $\tilde{O}(T^{3/4})$. We also give the lower bound of $\Omega(\sqrt{T})$ for both social welfare and fairness regret. We evaluate RewardFairUCB's performance against various baseline and heuristic algorithms using simulated data and real world data, highlighting trade-offs between fairness and social welfare regrets.

new Real-Time Moving Flock Detection in Pedestrian Trajectories Using Sequential Deep Learning Models

Authors: Amartaivan Sanjjamts, Hiroshi Morita, Togootogtokh Enkhtogtokh

Abstract: Understanding collective pedestrian movement is crucial for applications in crowd management, autonomous navigation, and human-robot interaction. This paper investigates the use of sequential deep learning models, including Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers, for real-time flock detection in multi-pedestrian trajectories. Our proposed approach consists of a two-stage process: first, a pre-trained binary classification model is used for pairwise trajectory classification, and second, the learned representations are applied to identify multi-agent flocks dynamically. We validate our method using real-world group movement datasets, demonstrating its robustness across varying sequence lengths and diverse movement patterns. Experimental results indicate that our model consistently detects pedestrian flocks with high accuracy and stability, even in dynamic and noisy environments. Furthermore, we extend our approach to identify other forms of collective motion, such as convoys and swarms, paving the way for more comprehensive multi-agent behavior analysis.

new Towards a Reward-Free Reinforcement Learning Framework for Vehicle Control

Authors: Jielong Yang, Daoyuan Huang

Abstract: Reinforcement learning plays a crucial role in vehicle control by guiding agents to learn optimal control strategies through designing or learning appropriate reward signals. However, in vehicle control applications, rewards typically need to be manually designed while considering multiple implicit factors, which easily introduces human biases. Although imitation learning methods does not rely on explicit reward signals, they necessitate high-quality expert actions, which are often challenging to acquire. To address these issues, we propose a reward-free reinforcement learning framework (RFRLF). This framework directly learns the target states to optimize agent behavior through a target state prediction network (TSPN) and a reward-free state-guided policy network (RFSGPN), avoiding the dependence on manually designed reward signals. Specifically, the policy network is learned via minimizing the differences between the predicted state and the expert state. Experimental results demonstrate the effectiveness of the proposed RFRLF in controlling vehicle driving, showing its advantages in improving learning efficiency and adapting to reward-free environments.

new Hyperspherical Normalization for Scalable Deep Reinforcement Learning

Authors: Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, Jaegul Choo

Abstract: Scaling up the model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norm by hyperspherical normalization; and (ii) using a distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using the soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains. The code is available at https://dojeon-ai.github.io/SimbaV2.

URLs: https://dojeon-ai.github.io/SimbaV2.

new Beyond Fixed Variables: Expanding-variate Time Series Forecasting via Flat Scheme and Spatio-temporal Focal Learning

Authors: Minbo Ma, Kai Tang, Huan Li, Fei Teng, Dalin Zhang, Tianrui Li

Abstract: Multivariate Time Series Forecasting (MTSF) has long been a key research focus. Traditionally, these studies assume a fixed number of variables, but in real-world applications, Cyber-Physical Systems often expand as new sensors are deployed, increasing variables in MTSF. In light of this, we introduce a novel task, Expanding-variate Time Series Forecasting (EVTSF). This task presents unique challenges, specifically (1) handling inconsistent data shapes caused by adding new variables, and (2) addressing imbalanced spatio-temporal learning, where expanding variables have limited observed data due to the necessity for timely operation. To address these challenges, we propose STEV, a flexible spatio-temporal forecasting framework. STEV includes a new Flat Scheme to tackle the inconsistent data shape issue, which extends the graph-based spatio-temporal modeling architecture into 1D space by flattening the 2D samples along the variable dimension, making the model variable-scale-agnostic while still preserving dynamic spatial correlations through a holistic graph. We introduce a novel Spatio-temporal Focal Learning strategy that incorporates a negative filter to resolve potential conflicts between contrastive learning and graph representation, and a focal contrastive loss as its core to guide the framework to focus on optimizing the expanding variables. We benchmark EVTSF performance using three real-world datasets and compare it against three potential solutions employing SOTA MTSF models tailored for EVSTF. Experimental results show that STEV significantly outperforms its competitors, particularly on expanding variables. Notably, STEV, with only 5% of observations from the expanding period, is on par with SOTA MTSF models trained with complete observations. Further exploration of various expanding strategies underscores the generalizability of STEV in real-world applications.

new SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention

Authors: Hong Yankun, Li Xing, Zhen Hui-Ling, Yu Xianzhi, Liu Wulong, Yuan Mingxuan

Abstract: For the efficient inference of Large Language Models (LLMs), the effective compression of key-value (KV) cache is essential. Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD) - based mixed precision quantization method for K cache. Initially, K cache is transformed into latent channels using SVD basis representations. Since the values in latent channels decay rapidly and become negligible after only a few latent channels, our method then incorporates importance-aware quantization and compression for latent channels. This enables the effective allocation of higher precision to more significant channels. Theoretically, we prove that SVDq results in quantization errors (x0.1 or even lower) that are much lower than those of per-channel key quantization in the original space. Our findings based on RULER and LongBench benchmarks demonstrate that SVDq can achieve an equivalent key cache precision as low as 1.25-bit. When combined with key sparsity, it can reach a key compression ratio of up to 410x for attention computation, all while maintaining comparable model performance. Notably, our method is nearly lossless for LongBench datasets. This indicates that SVDq enables high-precision low-bit quantization, providing a more efficient solution for KV cache compression in LLMs.

new Tight Clusters Make Specialized Experts

Authors: Stefan K. Nielsen, Rachel S. Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen

Abstract: Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.

new Learning with Limited Shared Information in Multi-agent Multi-armed Bandit

Authors: Junning Shao, Siwei Wang, Zhixuan Fang

Abstract: Multi-agent multi-armed bandit (MAMAB) is a classic collaborative learning model and has gained much attention in recent years. However, existing studies do not consider the case where an agent may refuse to share all her information with others, e.g., when some of the data contains personal privacy. In this paper, we propose a novel limited shared information multi-agent multi-armed bandit (LSI-MAMAB) model in which each agent only shares the information that she is willing to share, and propose the Balanced-ETC algorithm to help multiple agents collaborate efficiently with limited shared information. Our analysis shows that Balanced-ETC is asymptotically optimal and its average regret (on each agent) approaches a constant when there are sufficient agents involved. Moreover, to encourage agents to participate in this collaborative learning, an incentive mechanism is proposed to make sure each agent can benefit from the collaboration system. Finally, we present experimental results to validate our theoretical results.

new Efficiently Solving Discounted MDPs with Predictions on Transition Matrices

Authors: Lixing Lyu, Jiashuo Jiang, Wang Chi Cheung

Abstract: We study infinite-horizon Discounted Markov Decision Processes (DMDPs) under a generative model. Motivated by the Algorithm with Advice framework Mitzenmacher and Vassilvitskii 2022, we propose a novel framework to investigate how a prediction on the transition matrix can enhance the sample efficiency in solving DMDPs and improve sample complexity bounds. We focus on the DMDPs with $N$ state-action pairs and discounted factor $\gamma$. Firstly, we provide an impossibility result that, without prior knowledge of the prediction accuracy, no sampling policy can compute an $\epsilon$-optimal policy with a sample complexity bound better than $\tilde{O}((1-\gamma)^{-3} N\epsilon^{-2})$, which matches the state-of-the-art minimax sample complexity bound with no prediction. In complement, we propose an algorithm based on minimax optimization techniques that leverages the prediction on the transition matrix. Our algorithm achieves a sample complexity bound depending on the prediction error, and the bound is uniformly better than $\tilde{O}((1-\gamma)^{-4} N \epsilon^{-2})$, the previous best result derived from convex optimization methods. These theoretical findings are further supported by our numerical experiments.

new Efficient and Provable Algorithms for Covariate Shift

Authors: Deeksha Adil, Jaros{\l}aw B{\l}asiok

Abstract: Covariate shift, a widely used assumption in tackling {\it distributional shift} (when training and test distributions differ), focuses on scenarios where the distribution of the labels conditioned on the feature vector is the same, but the distribution of features in the training and test data are different. Despite the significance and extensive work on covariate shift, theoretical guarantees for algorithms in this domain remain sparse. In this paper, we distill the essence of the covariate shift problem and focus on estimating the average $\mathbb{E}_{\tilde{\mathbf{x}}\sim p_{\mathrm{test}}}\mathbf{f}(\tilde{\mathbf{x}})$, of any unknown and bounded function $\mathbf{f}$, given labeled training samples $(\mathbf{x}_i, \mathbf{f}(\mathbf{x}_i))$, and unlabeled test samples $\tilde{\mathbf{x}}_i$; this is a core subroutine for several widely studied learning problems. We give several efficient algorithms, with provable sample complexity and computational guarantees. Moreover, we provide the first rigorous analysis of algorithms in this space when $\mathbf{f}$ is unrestricted, laying the groundwork for developing a solid theoretical foundation for covariate shift problems.

new Learning Chern Numbers of Topological Insulators with Gauge Equivariant Neural Networks

Authors: Longde Huang, Oleksandr Balabanov, Hampus Linander, Mats Granath, Daniel Persson, Jan E. Gerken

Abstract: Equivariant network architectures are a well-established tool for predicting invariant or equivariant quantities. However, almost all learning problems considered in this context feature a global symmetry, i.e. each point of the underlying space is transformed with the same group element, as opposed to a local ``gauge'' symmetry, where each point is transformed with a different group element, exponentially enlarging the size of the symmetry group. Gauge equivariant networks have so far mainly been applied to problems in quantum chromodynamics. Here, we introduce a novel application domain for gauge-equivariant networks in the theory of topological condensed matter physics. We use gauge equivariant networks to predict topological invariants (Chern numbers) of multiband topological insulators. The gauge symmetry of the network guarantees that the predicted quantity is a topological invariant. We introduce a novel gauge equivariant normalization layer to stabilize the training and prove a universal approximation theorem for our setup. We train on samples with trivial Chern number only but show that our models generalize to samples with non-trivial Chern number. We provide various ablations of our setup. Our code is available at https://github.com/sitronsea/GENet/tree/main.

URLs: https://github.com/sitronsea/GENet/tree/main.

new Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution

Authors: Carlos Eiras-Franco, Anna Hedstr\"om, Marina M. -C. H\"ohne

Abstract: Obtaining high-quality explanations of a model's output enables developers to identify and correct biases, align the system's behavior with human values, and ensure ethical compliance. Explainable Artificial Intelligence (XAI) practitioners rely on specific measures to gauge the quality of such explanations. These measures assess key attributes, such as how closely an explanation aligns with a model's decision process (faithfulness), how accurately it pinpoints the relevant input features (localization), and its consistency across different cases (robustness). Despite providing valuable information, these measures do not fully address a critical practitioner's concern: how does the quality of a given explanation compare to other potential explanations? Traditionally, the quality of an explanation has been assessed by comparing it to a randomly generated counterpart. This paper introduces an alternative: the Quality Gap Estimate (QGE). The QGE method offers a direct comparison to what can be viewed as the `inverse' explanation, one that conceptually represents the antithesis of the original explanation. Our extensive testing across multiple model architectures, datasets, and established quality metrics demonstrates that the QGE method is superior to the traditional approach. Furthermore, we show that QGE enhances the statistical reliability of these quality assessments. This advance represents a significant step toward a more insightful evaluation of explanations that enables a more effective inspection of a model's behavior.

new Single-pass Detection of Jailbreaking Input in Large Language Models

Authors: Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher

Abstract: Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection SPD, leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD can not only detect attacks effectively on open-source models, but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.

new Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

Authors: Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma

Abstract: Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB establishes a new Pareto frontier in the tradeoff between communication and performance, offering an efficient and scalable solution for both private and non-private federated fine-tuning. Our code is publicly available at https://github.com/CERT-Lab/fed-sb.

URLs: https://github.com/CERT-Lab/fed-sb.

new A fast convergence algorithm based on binary integer programming for expert load balancing in MoE LLMs

Authors: Yuan Sun

Abstract: MoE (Mixture-of-Expert) architectures appear frequently in large language models, and the number of experts can be over one hundred recently. However, the expert load imbalance problem always happens in MoE model pre-training, which will cause routing collapse or increased computational overhead. In order to balance loads on experts, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q that can help change the top-K order of s by solving a binary integer programming with very small time costs. In simulation experiments, we observe that BIP-Based Balancing make imbalance disappoint very fast, while the final sum of routine scores decreases very little. Our algorithm achieves nearly perfect trade-off between expert load balance and pre-training efficiency under the simulation view.

new R-LoRA: Random Initialization of Multi-Head LoRA for Multi-Task Learning

Authors: Jinda Liu, Yi Chang, Yuan Wu

Abstract: Fine-tuning large language models (LLMs) is prohibitively expensive in terms of computational and memory costs. Low-rank Adaptation (LoRA), as one of the most popular parameter-efficient fine-tuning (PEFT) methods, offers a cost-effective alternative by approximating the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of down-projection matrix $A \in \mathbb{R}^{m \times r}$ and head matrix $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$. In real-world scenarios, LLMs are fine-tuned on data from multiple domains to perform tasks across various fields, embodying multi-task learning (MTL). LoRA often underperforms in such complex scenarios. To enhance LoRA's capability in multi-task learning, we propose R-LoRA, which incorporates Multi-Head Randomization. Multi-Head Randomization diversifies the head matrices through Multi-Head Random Initialization and Multi-Head Dropout, enabling more efficient learning of task-specific features while maintaining shared knowledge representation. Extensive experiments demonstrate that R-LoRA is better at capturing task-specific knowledge, thereby improving performance in multi-task scenarios. The code is available at https://github.com/jinda-liu/R-LoRA.

URLs: https://github.com/jinda-liu/R-LoRA.

new Mitigating Data Scarcity in Time Series Analysis: A Foundation Model with Series-Symbol Data Generation

Authors: Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang, Jing Liu

Abstract: Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as data scarcity and data imbalance continue to hinder their development. To address this, we consider modeling complex systems through symbolic expressions that serve as semantic descriptors of time series. Building on this concept, we introduce a series-symbol (S2) dual-modulity data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic representations. Leveraging the S2 dataset, we develop SymTime, a pre-trained foundation model for TSA. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned with downstream task, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of dual-modality data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance.

new Decoding for Punctured Convolutional and Turbo Codes: A Deep Learning Solution for Protocols Compliance

Authors: Yongli Yan, Linglong Dai

Abstract: Neural network-based decoding methods have shown promise in enhancing error correction performance, but traditional approaches struggle with the challenges posed by punctured codes. In particular, these methods fail to address the complexities of variable code rates and the need for protocol compatibility. This paper presents a unified Long Short-Term Memory (LSTM)-based decoding architecture specifically designed to overcome these challenges. The proposed method unifies punctured convolutional and Turbo codes. A puncture embedding mechanism integrates puncturing patterns directly into the network, enabling seamless adaptation to varying code rates, while balanced bit error rate training ensures robustness across different code lengths, rates, and channels, maintaining protocol flexibility. Extensive simulations in Additive White Gaussian Noise and Rayleigh fading channels demonstrate that the proposed approach outperforms conventional decoding techniques, providing significant improvements in decoding accuracy and robustness. These results underscore the potential of LSTM-based decoding as a promising solution for next-generation artificial intelligence powered communication systems.

new MoMa: A Modular Deep Learning Framework for Material Property Prediction

Authors: Botian Wang, Yawen Ouyang, Yaohui Li, Yiqun Wang, Haorui Cui, Jianbing Zhang, Xiaonan Wang, Wei-Ying Ma, Hao Zhou

Abstract: Deep learning methods for material property prediction have been widely explored to advance materials discovery. However, the prevailing pre-train then fine-tune paradigm often fails to address the inherent diversity and disparity of material tasks. To overcome these challenges, we introduce MoMa, a Modular framework for Materials that first trains specialized modules across a wide range of tasks and then adaptively composes synergistic modules tailored to each downstream scenario. Evaluation across 17 datasets demonstrates the superiority of MoMa, with a substantial 14% average improvement over the strongest baseline. Few-shot and continual learning experiments further highlight MoMa's potential for real-world applications. Pioneering a new paradigm of modular material learning, MoMa will be open-sourced to foster broader community collaboration.

new Network Resource Optimization for ML-Based UAV Condition Monitoring with Vibration Analysis

Authors: Alexandre Gemayel, Dimitrios Michael Manias, Abdallah Shami

Abstract: As smart cities begin to materialize, the role of Unmanned Aerial Vehicles (UAVs) and their reliability becomes increasingly important. One aspect of reliability relates to Condition Monitoring (CM), where Machine Learning (ML) models are leveraged to identify abnormal and adverse conditions. Given the resource-constrained nature of next-generation edge networks, the utilization of precious network resources must be minimized. This work explores the optimization of network resources for ML-based UAV CM frameworks. The developed framework uses experimental data and varies the feature extraction aggregation interval to optimize ML model selection. Additionally, by leveraging dimensionality reduction techniques, there is a 99.9% reduction in network resource consumption.

new Verification and Validation for Trustworthy Scientific Machine Learning

Authors: John D. Jakeman, Lorena A. Barba, Joaquim R. R. A. Martins, Thomas O'Leary-Roseberry

Abstract: Scientific machine learning (SciML) models are transforming many scientific disciplines. However, the development of good modeling practices to increase the trustworthiness of SciML has lagged behind its application, limiting its potential impact. The goal of this paper is to start a discussion on establishing consensus-based good practices for predictive SciML. We identify key challenges in applying existing computational science and engineering guidelines, such as verification and validation protocols, and provide recommendations to address these challenges. Our discussion focuses on predictive SciML, which uses machine learning models to learn, improve, and accelerate numerical simulations of physical systems. While centered on predictive applications, our 16 recommendations aim to help researchers conduc

new Activation Steering in Neural Theorem Provers

Authors: Shashank Kirtania

Abstract: Large Language Models (LLMs) have shown promise in proving formal theorems using proof assistants like Lean. However, current state of the art language models struggles to predict next step in proofs leading practitioners to use different sampling techniques to improve LLMs capabilities. We observe that the LLM is capable of predicting the correct tactic; however, it faces challenges in ranking it appropriately within the set of candidate tactics, affecting the overall selection process. To overcome this hurdle, we use activation steering to guide LLMs responses to improve the generations at the time of inference. Our results suggest that activation steering offers a promising lightweight alternative to specialized fine-tuning for enhancing theorem proving capabilities in LLMs, particularly valuable in resource-constrained environments.

new SALSA-RL: Stability Analysis in the Latent Space of Actions for Reinforcement Learning

Authors: Xuyang Li, Romit Maulik

Abstract: Modern deep reinforcement learning (DRL) methods have made significant advances in handling continuous action spaces. However, real-world control systems--especially those requiring precise and reliable performance--often demand formal stability, and existing DRL approaches typically lack explicit mechanisms to ensure or analyze stability. To address this limitation, we propose SALSA-RL (Stability Analysis in the Latent Space of Actions), a novel RL framework that models control actions as dynamic, time-dependent variables evolving within a latent space. By employing a pre-trained encoder-decoder and a state-dependent linear system, our approach enables both stability analysis and interpretability. We demonstrated that SALSA-RL can be deployed in a non-invasive manner for assessing the local stability of actions from pretrained RL agents without compromising on performance across diverse benchmark environments. By enabling a more interpretable analysis of action generation, SALSA-RL provides a powerful tool for advancing the design, analysis, and theoretical understanding of RL systems.

new Solving Inverse Problems with Deep Linear Neural Networks: Global Convergence Guarantees for Gradient Descent with Weight Decay

Authors: Hannah Laus, Suzanna Parkinson, Vasileios Charisopoulos, Felix Krahmer, Rebecca Willett

Abstract: Machine learning methods are commonly used to solve inverse problems, wherein an unknown signal must be estimated from few measurements generated via a known acquisition procedure. In particular, neural networks perform well empirically but have limited theoretical guarantees. In this work, we study an underdetermined linear inverse problem that admits several possible solution mappings. A standard remedy (e.g., in compressed sensing) establishing uniqueness of the solution mapping is to assume knowledge of latent low-dimensional structure in the source signal. We ask the following question: do deep neural networks adapt to this low-dimensional structure when trained by gradient descent with weight decay regularization? We prove that mildly overparameterized deep linear networks trained in this manner converge to an approximate solution that accurately solves the inverse problem while implicitly encoding latent subspace structure. To our knowledge, this is the first result to rigorously show that deep linear networks trained with weight decay automatically adapt to latent subspace structure in the data under practical stepsize and weight initialization schemes. Our work highlights that regularization and overparameterization improve generalization, while overparameterization also accelerates convergence during training.

new Model Privacy: A Unified Framework to Understand Model Stealing Attacks and Defenses

Authors: Ganghua Wang, Yuhong Yang, Jie Ding

Abstract: The use of machine learning (ML) has become increasingly prevalent in various domains, highlighting the importance of understanding and ensuring its safety. One pressing concern is the vulnerability of ML applications to model stealing attacks. These attacks involve adversaries attempting to recover a learned model through limited query-response interactions, such as those found in cloud-based services or on-chip artificial intelligence interfaces. While existing literature proposes various attack and defense strategies, these often lack a theoretical foundation and standardized evaluation criteria. In response, this work presents a framework called ``Model Privacy'', providing a foundation for comprehensively analyzing model stealing attacks and defenses. We establish a rigorous formulation for the threat model and objectives, propose methods to quantify the goodness of attack and defense strategies, and analyze the fundamental tradeoffs between utility and privacy in ML models. Our developed theory offers valuable insights into enhancing the security of ML models, especially highlighting the importance of the attack-specific structure of perturbations for effective defenses. We demonstrate the application of model privacy from the defender's perspective through various learning scenarios. Extensive experiments corroborate the insights and the effectiveness of defense mechanisms developed under the proposed framework.

new A Cautionary Tale About "Neutrally" Informative AI Tools Ahead of the 2025 Federal Elections in Germany

Authors: Ina Dormuth, Sven Franke, Marlies Hafer, Tim Katzke, Alexander Marx, Emmanuel M\"uller, Daniel Neider, Markus Pauly, J\'er\^ome Rutinowski

Abstract: In this study, we examine the reliability of AI-based Voting Advice Applications (VAAs) and large language models (LLMs) in providing objective political information. Our analysis is based upon a comparison with party responses to 38 statements of the Wahl-O-Mat, a well-established German online tool that helps inform voters by comparing their views with political party positions. For the LLMs, we identify significant biases. They exhibit a strong alignment (over 75% on average) with left-wing parties and a substantially lower alignment with center-right (smaller 50%) and right-wing parties (around 30%). Furthermore, for the VAAs, intended to objectively inform voters, we found substantial deviations from the parties' stated positions in Wahl-O-Mat: While one VAA deviated in 25% of cases, another VAA showed deviations in more than 50% of cases. For the latter, we even observed that simple prompt injections led to severe hallucinations, including false claims such as non-existent connections between political parties and right-wing extremist ties.

new Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Authors: Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano

Abstract: Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.

new PDeepPP:A Deep learning framework with Pretrained Protein language for peptide classification

Authors: Jixiu Zhai, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Xueying Wang, Dan Huang

Abstract: Protein post-translational modifications (PTMs) and bioactive peptides (BPs) play critical roles in various biological processes and have significant therapeutic potential. However, identifying PTM sites and bioactive peptides through experimental methods is often labor-intensive, costly, and time-consuming. As a result, computational tools, particularly those based on deep learning, have become effective solutions for predicting PTM sites and peptide bioactivity. Despite progress in this field, existing methods still struggle with the complexity of protein sequences and the challenge of requiring high-quality predictions across diverse datasets. To address these issues, we propose a deep learning framework that integrates pretrained protein language models with a neural network combining transformer and CNN for peptide classification. By leveraging the ability of pretrained models to capture complex relationships within protein sequences, combined with the predictive power of parallel networks, our approach improves feature extraction while enhancing prediction accuracy. This framework was applied to multiple tasks involving PTM site and bioactive peptide prediction, utilizing large-scale datasets to enhance the model's robustness. In the comparison across 33 tasks, the model achieved state-of-the-art (SOTA) performance in 25 of them, surpassing existing methods and demonstrating its versatility across different datasets. Our results suggest that this approach provides a scalable and effective solution for large-scale peptide discovery and PTM analysis, paving the way for more efficient peptide classification and functional annotation.

new The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer

Authors: Marthe Ballon, Andres Algaba, Vincent Ginis

Abstract: Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and test-time compute scaling. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more efficient reasoning. We systematically analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.

new Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification

Authors: Vasilii Feofanov, Songkang Wen, Marius Alonso, Romain Ilbert, Hongbo Guo, Malik Tiomoko, Lujia Pan, Jianfeng Zhang, Ievgen Redko

Abstract: In recent years, there has been increasing interest in developing foundation models for time series data that can generalize across diverse downstream tasks. While numerous forecasting-oriented foundation models have been introduced, there is a notable scarcity of models tailored for time series classification. To address this gap, we present Mantis, a new open-source foundation model for time series classification based on the Vision Transformer (ViT) architecture that has been pre-trained using a contrastive learning approach. Our experimental results show that Mantis outperforms existing foundation models both when the backbone is frozen and when fine-tuned, while achieving the lowest calibration error. In addition, we propose several adapters to handle the multivariate setting, reducing memory requirements and modeling channel interdependence.

new Training Neural ODEs Using Fully Discretized Simultaneous Optimization

Authors: Mariia Shapovalova, Calvin Tsay

Abstract: Neural Ordinary Differential Equations (Neural ODEs) represent continuous-time dynamics with neural networks, offering advancements for modeling and control tasks. However, training Neural ODEs requires solving differential equations at each epoch, leading to high computational costs. This work investigates simultaneous optimization methods as a faster training alternative. In particular, we employ a collocation-based, fully discretized formulation and use IPOPT--a solver for large-scale nonlinear optimization--to simultaneously optimize collocation coefficients and neural network parameters. Using the Van der Pol Oscillator as a case study, we demonstrate faster convergence compared to traditional training methods. Furthermore, we introduce a decomposition framework utilizing Alternating Direction Method of Multipliers (ADMM) to effectively coordinate sub-models among data batches. Our results show significant potential for (collocation-based) simultaneous Neural ODE training pipelines.

new AutoTandemML: Active Learning Enhanced Tandem Neural Networks for Inverse Design Problems

Authors: Luka Grbcic, Juliane M\"uller, Wibe Albert de Jong

Abstract: Inverse design in science and engineering involves determining optimal design parameters that achieve desired performance outcomes, a process often hindered by the complexity and high dimensionality of design spaces, leading to significant computational costs. To tackle this challenge, we propose a novel hybrid approach that combines active learning with Tandem Neural Networks to enhance the efficiency and effectiveness of solving inverse design problems. Active learning allows to selectively sample the most informative data points, reducing the required dataset size without compromising accuracy. We investigate this approach using three benchmark problems: airfoil inverse design, photonic surface inverse design, and scalar boundary condition reconstruction in diffusion partial differential equations. We demonstrate that integrating active learning with Tandem Neural Networks outperforms standard approaches across the benchmark suite, achieving better accuracy with fewer training samples.

new Predicting gene essentiality and drug response from perturbation screens in preclinical cancer models with LEAP: Layered Ensemble of Autoencoders and Predictors

Authors: Barbara Bodinier, Gaetan Dissez, Linus Bleistein, Antonin Dauvin

Abstract: Preclinical perturbation screens, where the effects of genetic, chemical, or environmental perturbations are systematically tested on disease models, hold significant promise for machine learning-enhanced drug discovery due to their scale and causal nature. Predictive models can infer perturbation responses for previously untested disease models based on molecular profiles. These in silico labels can expand databases and guide experimental prioritization. However, modelling perturbation-specific effects and generating robust prediction performances across diverse biological contexts remain elusive. We introduce LEAP (Layered Ensemble of Autoencoders and Predictors), a novel ensemble framework to improve robustness and generalization. LEAP leverages multiple DAMAE (Data Augmented Masked Autoencoder) representations and LASSO regressors. By combining diverse gene expression representation models learned from different random initializations, LEAP consistently outperforms state-of-the-art approaches in predicting gene essentiality or drug responses in unseen cell lines, tissues and disease models. Notably, our results show that ensembling representation models, rather than prediction models alone, yields superior predictive performance. Beyond its performance gains, LEAP is computationally efficient, requires minimal hyperparameter tuning and can therefore be readily incorporated into drug discovery pipelines to prioritize promising targets and support biomarker-driven stratification. The code and datasets used in this work are made publicly available.

new Logit Disagreement: OoD Detection with Bayesian Neural Networks

Authors: Kevin Raina

Abstract: Bayesian neural networks (BNNs), which estimate the full posterior distribution over model parameters, are well-known for their role in uncertainty quantification and its promising application in out-of-distribution detection (OoD). Amongst other uncertainty measures, BNNs provide a state-of-the art estimation of predictive entropy (total uncertainty) which can be decomposed as the sum of mutual information and expected entropy. In the context of OoD detection the estimation of predictive uncertainty in the form of the predictive entropy score confounds aleatoric and epistemic uncertainty, the latter being hypothesized to be high for OoD points. Despite these justifications, the mutual information score has been shown to perform worse than predictive entropy. Taking inspiration from Bayesian variational autoencoder (BVAE) literature, this work proposes to measure the disagreement between a corrected version of the pre-softmax quantities, otherwise known as logits, as an estimate of epistemic uncertainty for Bayesian NNs under mean field variational inference. The three proposed epistemic uncertainty scores demonstrate marked improvements over mutual information on a range of OoD experiments, with equal performance otherwise. Moreover, the epistemic uncertainty scores perform on par with the Bayesian benchmark predictive entropy on a range of MNIST and CIFAR10 experiments.

new Testing the limits of fine-tuning to improve reasoning in vision language models

Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Elif Akata, Matthias Bethge, Joshua B. Tenenbaum, Eric Schulz

Abstract: Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.

new One-step Diffusion Models with $f$-Divergence Distribution Matching

Authors: Yilun Xu, Weili Nie, Arash Vahdat

Abstract: Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation speed, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher's distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching which is known to be mode seeking. In this paper, we generalize the distribution matching approach using a novel $f$-divergence minimization framework, termed $f$-distill, that covers different divergences with different trade-offs in terms of mode coverage and training variance. We derive the gradient of the $f$-divergence between the teacher and student distributions and show that it is expressed as the product of their score differences and a weighting function determined by their density ratio. This weighting function naturally emphasizes samples with higher density in the teacher distribution, when using a less mode-seeking divergence. We observe that the popular variational score distillation approach using the reverse-KL divergence is a special case within our framework. Empirically, we demonstrate that alternative $f$-divergences, such as forward-KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using Jensen-Shannon divergence, $f$-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: https://research.nvidia.com/labs/genair/f-distill

URLs: https://research.nvidia.com/labs/genair/f-distill

cross Automating Customer Needs Analysis: A Comparative Study of Large Language Models in the Travel Industry

Authors: Simone Barandoni, Filippo Chiarello, Lorenzo Cascone, Emiliano Marrale, Salvatore Puccio

Abstract: In the rapidly evolving landscape of Natural Language Processing (NLP), Large Language Models (LLMs) have emerged as powerful tools for many tasks, such as extracting valuable insights from vast amounts of textual data. In this study, we conduct a comparative analysis of LLMs for the extraction of travel customer needs from TripAdvisor posts. Leveraging a diverse range of models, including both open-source and proprietary ones such as GPT-4 and Gemini, we aim to elucidate their strengths and weaknesses in this specialized domain. Through an evaluation process involving metrics such as BERTScore, ROUGE, and BLEU, we assess the performance of each model in accurately identifying and summarizing customer needs. Our findings highlight the efficacy of opensource LLMs, particularly Mistral 7B, in achieving comparable performance to larger closed models while offering affordability and customization benefits. Additionally, we underscore the importance of considering factors such as model size, resource requirements, and performance metrics when selecting the most suitable LLM for customer needs analysis tasks. Overall, this study contributes valuable insights for businesses seeking to leverage advanced NLP techniques to enhance customer experience and drive operational efficiency in the travel industry.

cross Exploring Quantum Control Landscape and Solution Space Complexity through Dimensionality Reduction & Optimization Algorithms

Authors: Haftu W. Fentaw, Steve Campbell, Simon Caton

Abstract: Understanding the quantum control landscape (QCL) is important for designing effective quantum control strategies. In this study, we analyze the QCL for a single two-level quantum system (qubit) using various control strategies. We employ Principal Component Analysis (PCA), to visualize and analyze the QCL for higher dimensional control parameters. Our results indicate that dimensionality reduction techniques such as PCA, can play an important role in understanding the complex nature of quantum control in higher dimensions. Evaluations of traditional control techniques and machine learning algorithms reveal that Genetic Algorithms (GA) outperform Stochastic Gradient Descent (SGD), while Q-learning (QL) shows great promise compared to Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO). Additionally, our experiments highlight the importance of reward function design in DQN and PPO demonstrating that using immediate reward results in improved performance rather than delayed rewards for systems with short time steps. A study of solution space complexity was conducted by using Cluster Density Index (CDI) as a key metric for analyzing the density of optimal solutions in the landscape. The CDI reflects cluster quality and helps determine whether a given algorithm generates regions of high fidelity or not. Our results provide insights into effective quantum control strategies, emphasizing the significance of parameter selection and algorithm optimization.

cross Applications of Random Matrix Theory in Machine Learning and Brain Mapping

Authors: Katrina Lawrence

Abstract: Brain mapping analyzes the wavelengths of brain signals and outputs them in a map, which is then analyzed by a radiologist. Introducing Machine Learning (ML) into the brain mapping process reduces the variable of human error in reading such maps and increases efficiency. A key area of interest is determining the correlation between the functional areas of the brain on a voxel (3-dimensional pixel) wise basis. This leads to determining how a brain is functioning and can be used to detect diseases, disabilities, and sicknesses. As such, random noise presents a challenge in consistently determining the actual signals from the scan. This paper discusses how an algorithm created by Random Matrix Theory (RMT) can be used as a tool for ML, as it detects the correlation of the functional areas of the brain. Random matrices are simulated to represent the voxel signal intensity strength for each time interval where a stimulus is presented in an fMRI scan. Using the Marchenko-Pastur law for Wishart Matrices, a result of RMT, it was found that no matter what type of noise was added to the random matrices, the observed eigenvalue distribution of the Wishart Matrices would converge to the theoretical distribution. This means that RMT is robust and has a high test-re-test reliability. These results further indicate that a strong correlation exists between the eigenvalues, and hence the functional regions of the brain. Any eigenvalue that differs significantly from those predicted from RMT may indicate the discovery of a new discrete brain network.

cross SEM-CLIP: Precise Few-Shot Learning for Nanoscale Defect Detection in Scanning Electron Microscope Image

Authors: Qian Jin, Yuqi Jiang, Xudong Lu, Yumeng Liu, Yining Chen, Dawei Gao, Qi Sun, Cheng Zhuo

Abstract: In the field of integrated circuit manufacturing, the detection and classification of nanoscale wafer defects are critical for subsequent root cause analysis and yield enhancement. The complex background patterns observed in scanning electron microscope (SEM) images and the diverse textures of the defects pose significant challenges. Traditional methods usually suffer from insufficient data, labels, and poor transferability. In this paper, we propose a novel few-shot learning approach, SEM-CLIP, for accurate defect classification and segmentation. SEM-CLIP customizes the Contrastive Language-Image Pretraining (CLIP) model to better focus on defect areas and minimize background distractions, thereby enhancing segmentation accuracy. We employ text prompts enriched with domain knowledge as prior information to assist in precise analysis. Additionally, our approach incorporates feature engineering with textual guidance to categorize defects more effectively. SEM-CLIP requires little annotated data, substantially reducing labor demands in the semiconductor industry. Extensive experimental validation demonstrates that our model achieves impressive classification and segmentation results under few-shot learning scenarios.

cross NOTA: Multimodal Music Notation Understanding for Visual Large Language Model

Authors: Mingni Tang, Jiajia Li, Lu Yang, Zhiqiang Zhang, Jinghao Tian, Zuchao Li, Lefei Zhang, Ping Wang

Abstract: Symbolic music is represented in two distinct forms: two-dimensional, visually intuitive score images, and one-dimensional, standardized text annotation sequences. While large language models have shown extraordinary potential in music, current research has primarily focused on unimodal symbol sequence text. Existing general-domain visual language models still lack the ability of music notation understanding. Recognizing this gap, we propose NOTA, the first large-scale comprehensive multimodal music notation dataset. It consists of 1,019,237 records, from 3 regions of the world, and contains 3 tasks. Based on the dataset, we trained NotaGPT, a music notation visual large language model. Specifically, we involve a pre-alignment training phase for cross-modal alignment between the musical notes depicted in music score images and their textual representation in ABC notation. Subsequent training phases focus on foundational music information extraction, followed by training on music notation analysis. Experimental results demonstrate that our NotaGPT-7B achieves significant improvement on music understanding, showcasing the effectiveness of NOTA and the training pipeline. Our datasets are open-sourced at https://huggingface.co/datasets/MYTH-Lab/NOTA-dataset.

URLs: https://huggingface.co/datasets/MYTH-Lab/NOTA-dataset.

cross FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction

Authors: Jowaria Khan, Alexa Friedman, Sydney Evans, Runzi Wang, Kaley Beins, David Andrews, Elizabeth Bondi-Kelly

Abstract: Per and polyfluoroalkyl substances (PFAS), chemicals found in products like non-stick cookware, are unfortunately persistent environmental pollutants with severe health risks. Accurately mapping PFAS contamination is crucial for guiding targeted remediation efforts and protecting public and environmental health, yet detection across large regions remains challenging due to the cost of testing and the difficulty of simulating their spread. In this work, we introduce FOCUS, a geospatial deep learning framework with a label noise-aware loss function, to predict PFAS contamination in surface water over large regions. By integrating hydrological flow data, land cover information, and proximity to known PFAS sources, our approach leverages both spatial and environmental context to improve prediction accuracy. We evaluate the performance of our approach through extensive ablation studies and comparative analyses against baselines like sparse segmentation, as well as existing scientific methods, including Kriging and pollutant transport simulations. Results highlight our framework's potential for scalable PFAS monitoring.

cross Revisiting Financial Sentiment Analysis: A Language Model Approach

Authors: Hamid Moradi-Kamali, Mohammad-Hossein Rajabi-Ghozlou, Mahdi Ghazavi, Ali Soltani, Amirreza Sattarzadeh, Reza Entezari-Maleki

Abstract: Financial Sentiment Analysis (FSA) traditionally relies on human-annotated sentiment labels to infer investor sentiment and forecast market movements. However, inferring the potential market impact of words based on human-perceived intentions is inherently challenging. We hypothesize that the historical market reactions to words, offer a more reliable indicator of their potential impact on markets than subjective sentiment interpretations by human annotators. To test this hypothesis, a market-derived labeling approach is proposed to assign tweet labels based on ensuing short-term price trends, enabling the language model to capture the relationship between textual signals and market dynamics directly. A domain-specific language model was fine-tuned on these labels, achieving up to an 11% improvement in short-term trend prediction accuracy over traditional sentiment-based benchmarks. Moreover, by incorporating market and temporal context through prompt-tuning, the proposed context-aware language model demonstrated an accuracy of 89.6% on a curated dataset of 227 impactful Bitcoin-related news events with significant market impacts. Aggregating daily tweet predictions into trading signals, our method outperformed traditional fusion models (which combine sentiment-based and price-based predictions). It challenged the assumption that sentiment-based signals are inferior to price-based predictions in forecasting market movements. Backtesting these signals across three distinct market regimes yielded robust Sharpe ratios of up to 5.07 in trending markets and 3.73 in neutral markets. Our findings demonstrate that language models can serve as effective short-term market predictors. This paradigm shift underscores the untapped capabilities of language models in financial decision-making and opens new avenues for market prediction applications.

cross UPCMR: A Universal Prompt-guided Model for Random Sampling Cardiac MRI Reconstruction

Authors: Donghang Lyu, Chinmay Rao, Marius Staring, Matthias J. P. van Osch, Mariya Doneva, Hildo J. Lamb, Nicola Pezzotti

Abstract: Cardiac magnetic resonance imaging (CMR) is vital for diagnosing heart diseases, but long scan time remains a major drawback. To address this, accelerated imaging techniques have been introduced by undersampling k-space, which reduces the quality of the resulting images. Recent deep learning advancements aim to speed up scanning while preserving quality, but adapting to various sampling modes and undersampling factors remains challenging. Therefore, building a universal model is a promising direction. In this work, we introduce UPCMR, a universal unrolled model designed for CMR reconstruction. This model incorporates two kinds of learnable prompts, undersampling-specific prompt and spatial-specific prompt, and integrates them with a UNet structure in each block. Overall, by using the CMRxRecon2024 challenge dataset for training and validation, the UPCMR model highly enhances reconstructed image quality across all random sampling scenarios through an effective training strategy compared to some traditional methods, demonstrating strong adaptability potential for this task.

cross Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models

Authors: Jonathan Bourne

Abstract: Oscar Wilde said, "The difference between literature and journalism is that journalism is unreadable, and literature is not read." Unfortunately, The digitally archived journalism of Oscar Wilde's 19th century often has no or poor quality Optical Character Recognition (OCR), reducing the accessibility of these archives and making them unreadable both figuratively and literally. This paper helps address the issue by performing OCR on "The Nineteenth Century Serials Edition" (NCSE), an 84k-page collection of 19th-century English newspapers and periodicals, using Pixtral 12B, a pre-trained image-to-text language model. The OCR capability of Pixtral was compared to 4 other OCR approaches, achieving a median character error rate of 1%, 5x lower than the next best model. The resulting NCSE v2.0 dataset features improved article identification, high-quality OCR, and text classified into four types and seventeen topics. The dataset contains 1.4 million entries, and 321 million words. Example use cases demonstrate analysis of topic similarity, readability, and event tracking. NCSE v2.0 is freely available to encourage historical and sociological research. As a result, 21st-century readers can now share Oscar Wilde's disappointment with 19th-century journalistic standards, reading the unreadable from the comfort of their own computers.

cross Towards an automated workflow in materials science for combining multi-modal simulative and experimental information using data mining and large language models

Authors: Balduin Katzer, Steffen Klinder, Katrin Schulz

Abstract: To retrieve and compare scientific data of simulations and experiments in materials science, data needs to be easily accessible and machine readable to qualify and quantify various materials science phenomena. The recent progress in open science leverages the accessibility to data. However, a majority of information is encoded within scientific documents limiting the capability of finding suitable literature as well as material properties. This manuscript showcases an automated workflow, which unravels the encoded information from scientific literature to a machine readable data structure of texts, figures, tables, equations and meta-data, using natural language processing and language as well as vision transformer models to generate a machine-readable database. The machine-readable database can be enriched with local data, as e.g. unpublished or private material data, leading to knowledge synthesis. The study shows that such an automated workflow accelerates information retrieval, proximate context detection and material property extraction from multi-modal input data exemplarily shown for the research field of microstructural analyses of face-centered cubic single crystals. Ultimately, a Retrieval-Augmented Generation (RAG) based Large Language Model (LLM) enables a fast and efficient question answering chat bot.

cross Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence

Authors: Bhavik Agarwal, Ishan Joshi, Viktoria Rojkova

Abstract: In this paper, we address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging LLM reasoning capabilities. Building on the DeepSeek R1 reinforcement learning framework, our approach trains structured reasoning skills of a 1.5B parameter model through a novel pipeline that combines synthetic reasoning dataset construction with custom reward functions under Group Relative Policy Optimization (GRPO). Specifically, we first perform R1 reinforcement learning on a 20K sample unstructured-to-structured dataset, mirroring the original DeepSeek R1 methods, to establish core reasoning abilities. Subsequently, we performed supervised fine-tuning on a separate 10K reasoning sample dataset, focusing on refining schema adherence for downstream tasks. Despite the relatively modest training scope, requiring approximately 20 hours on an 8xH100 GPU cluster for GRPO training and 3 hours on 1xA100 for SFT, our model demonstrates robust performance in enforcing schema consistency. We compare our ThinkJSON approach against the original DeepSeek R1 (671B), distilled versions of DeepSeek R1 (Qwen-1.5B and Qwen-7B), and Gemini 2.0 Flash (70B), showcasing its effectiveness in real-world applications. Our results underscore the practical utility of a resource-efficient framework for schema-constrained text generation.

cross KOALA: Knowledge Conflict Augmentations for Robustness in Vision Language Models

Authors: Peter Carragher, Nikitha Rao, Abhinand Jha, R Raghav, Kathleen M. Carley

Abstract: The robustness of large language models (LLMs) against knowledge conflicts in unimodal question answering systems has been well studied. However, the effect of conflicts in information sources on vision language models (VLMs) in multimodal settings has not yet been explored. In this work, we propose \segsub, a framework that applies targeted perturbations to image sources to study and improve the robustness of VLMs against three different types of knowledge conflicts, namely parametric, source, and counterfactual conflicts. Contrary to prior findings that showed that LLMs are sensitive to parametric conflicts arising from textual perturbations, we find VLMs are largely robust to image perturbation. On the other hand, VLMs perform poorly on counterfactual examples (<30% accuracy) and fail to reason over source conflicts (<1% accuracy). We also find a link between hallucinations and image context, with GPT-4o prone to hallucination when presented with highly contextualized counterfactual examples. While challenges persist with source conflicts, finetuning models significantly improves reasoning over counterfactual samples. Our findings highlight the need for VLM training methodologies that enhance their reasoning capabilities, particularly in addressing complex knowledge conflicts between multimodal sources.

cross Universal Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery

Authors: Yunze Jia, Yuehui Xian, Yangyang Xu, Pengfei Dang, Xiangdong Ding, Jun Sun, Yumei Zhou, Dezhen Xue

Abstract: We present a framework for generating universal semantic embeddings of chemical elements to advance materials inference and discovery. This framework leverages ElementBERT, a domain-specific BERT-based natural language processing model trained on 1.29 million abstracts of alloy-related scientific papers, to capture latent knowledge and contextual relationships specific to alloys. These semantic embeddings serve as robust elemental descriptors, consistently outperforming traditional empirical descriptors with significant improvements across multiple downstream tasks. These include predicting mechanical and transformation properties, classifying phase structures, and optimizing materials properties via Bayesian optimization. Applications to titanium alloys, high-entropy alloys, and shape memory alloys demonstrate up to 23% gains in prediction accuracy. Our results show that ElementBERT surpasses general-purpose BERT variants by encoding specialized alloy knowledge. By bridging contextual insights from scientific literature with quantitative inference, our framework accelerates the discovery and optimization of advanced materials, with potential applications extending beyond alloys to other material classes.

cross What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs

Authors: Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Boqiang Zhang, Nianzu Yang, Pandeng Li, Yun Zheng, Hongtao Xie

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have rendered traditional visual captioning benchmarks obsolete, as they primarily evaluate short descriptions with outdated metrics. While recent benchmarks address these limitations by decomposing captions into visual elements and adopting model-based evaluation, they remain incomplete-overlooking critical aspects, while providing vague, non-explanatory scores. To bridge this gap, we propose CV-CapBench, a Comprehensive Visual Caption Benchmark that systematically evaluates caption quality across 6 views and 13 dimensions. CV-CapBench introduces precision, recall, and hit rate metrics for each dimension, uniquely assessing both correctness and coverage. Experiments on leading MLLMs reveal significant capability gaps, particularly in dynamic and knowledge-intensive dimensions. These findings provide actionable insights for future research. The code and data will be released.

cross Optimizing Gene-Based Testing for Antibiotic Resistance Prediction

Authors: David Hagerman, Anna Johnning, Roman Naeem, Fredrik Kahl, Erik Kristiansson, Lennart Svensson

Abstract: Antibiotic Resistance (AR) is a critical global health challenge that necessitates the development of cost-effective, efficient, and accurate diagnostic tools. Given the genetic basis of AR, techniques such as Polymerase Chain Reaction (PCR) that target specific resistance genes offer a promising approach for predictive diagnostics using a limited set of key genes. This study introduces GenoARM, a novel framework that integrates reinforcement learning (RL) with transformer-based models to optimize the selection of PCR gene tests and improve AR predictions, leveraging observed metadata for improved accuracy. In our evaluation, we developed several high-performing baselines and compared them using publicly available datasets derived from real-world bacterial samples representing multiple clinically relevant pathogens. The results show that all evaluated methods achieve strong and reliable performance when metadata is not utilized. When metadata is introduced and the number of selected genes increases, GenoARM demonstrates superior performance due to its capacity to approximate rewards for unseen and sparse combinations. Overall, our framework represents a major advancement in optimizing diagnostic tools for AR in clinical settings.

cross The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text

Authors: Matthieu Meeus, Lukas Wutschitz, Santiago Zanella-B\'eguelin, Shruti Tople, Reza Shokri

Abstract: How much information about training samples can be gleaned from synthetic data generated by Large Language Models (LLMs)? Overlooking the subtleties of information flow in synthetic data generation pipelines can lead to a false sense of privacy. In this paper, we design membership inference attacks (MIAs) that target data used to fine-tune pre-trained LLMs that are then used to synthesize data, particularly when the adversary does not have access to the fine-tuned model but only to the synthetic data. We show that such data-based MIAs do significantly better than a random guess, meaning that synthetic data leaks information about the training data. Further, we find that canaries crafted to maximize vulnerability to model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released. Such out-of-distribution canaries have limited influence on the model's output when prompted to generate useful, in-distribution synthetic data, which drastically reduces their vulnerability. To tackle this problem, we leverage the mechanics of auto-regressive models to design canaries with an in-distribution prefix and a high-perplexity suffix that leave detectable traces in synthetic data. This enhances the power of data-based MIAs and provides a better assessment of the privacy risks of releasing synthetic data generated by LLMs.

cross Fast and Accurate Blind Flexible Docking

Authors: Zizhuo Zhang, Lijun Wu, Kaiyuan Gao, Jiangchao Yao, Tao Qin, Bo Han

Abstract: Molecular docking that predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To address these challenges, we propose FABFlex, a fast and accurate regression-based multi-task learning model designed for realistic blind flexible docking scenarios, where proteins exhibit flexibility and binding pocket sites are unknown (blind). Specifically, FABFlex's architecture comprises three specialized modules working in concert: (1) A pocket prediction module that identifies potential binding sites, addressing the challenges inherent in blind docking scenarios. (2) A ligand docking module that predicts the bound (holo) structures of ligands from their unbound (apo) states. (3) A pocket docking module that forecasts the holo structures of protein pockets from their apo conformations. Notably, FABFlex incorporates an iterative update mechanism that serves as a conduit between the ligand and pocket docking modules, enabling continuous structural refinements. This approach effectively integrates the three subtasks of blind flexible docking-pocket identification, ligand conformation prediction, and protein flexibility modeling-into a unified, coherent framework. Extensive experiments on public benchmark datasets demonstrate that FABFlex not only achieves superior effectiveness in predicting accurate binding modes but also exhibits a significant speed advantage (208 $\times$) compared to existing state-of-the-art methods. Our code is released at https://github.com/tmlr-group/FABFlex.

URLs: https://github.com/tmlr-group/FABFlex.

cross Reward-Guided Iterative Refinement in Diffusion Models at Test-Time with Applications to Protein and DNA Design

Authors: Masatoshi Uehara, Xingyu Su, Yulai Zhao, Xiner Li, Aviv Regev, Shuiwang Ji, Sergey Levine, Tommaso Biancalani

Abstract: To fully leverage the capabilities of diffusion models, we are often interested in optimizing downstream reward functions during inference. While numerous algorithms for reward-guided generation have been recently proposed due to their significance, current approaches predominantly focus on single-shot generation, transitioning from fully noised to denoised states. We propose a novel framework for inference-time reward optimization with diffusion models inspired by evolutionary algorithms. Our approach employs an iterative refinement process consisting of two steps in each iteration: noising and reward-guided denoising. This sequential refinement allows for the gradual correction of errors introduced during reward optimization. Besides, we provide a theoretical guarantee for our framework. Finally, we demonstrate its superior empirical performance in protein and cell-type-specific regulatory DNA design. The code is available at \href{https://github.com/masa-ue/ProDifEvo-Refinement}{https://github.com/masa-ue/ProDifEvo-Refinement}.

URLs: https://github.com/masa-ue/ProDifEvo-Refinement, https://github.com/masa-ue/ProDifEvo-Refinement

cross KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

Authors: Ahmed Heakl, Abdullah Sohail, Mukul Ranjan, Rania Hossam, Ghazi Ahmed, Mohamed El-Geish, Omar Maher, Zhiqiang Shen, Fahad Khan, Salman Khan

Abstract: With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.

cross Symmetric observations without symmetric causal explanations

Authors: Christian William, Patrick Remy, Jean-Daniel Bancal, Yu Cai, Nicolas Brunner, Alejandro Pozas-Kerstjens

Abstract: Inferring causal models from observed correlations is a challenging task, crucial to many areas of science. In order to alleviate the effort, it is important to know whether symmetries in the observations correspond to symmetries in the underlying realization. Via an explicit example, we answer this question in the negative. We use a tripartite probability distribution over binary events that is realized by using three (different) independent sources of classical randomness. We prove that even removing the condition that the sources distribute systems described by classical physics, the requirements that i) the sources distribute the same physical systems, ii) these physical systems respect relativistic causality, and iii) the correlations are the observed ones, are incompatible.

cross Few-shot Species Range Estimation

Authors: Christian Lange, Max Hamilton, Elijah Cole, Alexander Shepard, Samuel Heinrich, Angela Zhu, Subhransu Maji, Grant Van Horn, Oisin Mac Aodha

Abstract: Knowing where a particular species can or cannot be found on Earth is crucial for ecological research and conservation efforts. By mapping the spatial ranges of all species, we would obtain deeper insights into how global biodiversity is affected by climate change and habitat loss. However, accurate range estimates are only available for a relatively small proportion of all known species. For the majority of the remaining species, we often only have a small number of records denoting the spatial locations where they have previously been observed. We outline a new approach for few-shot species range estimation to address the challenge of accurately estimating the range of a species from limited data. During inference, our model takes a set of spatial locations as input, along with optional metadata such as text or an image, and outputs a species encoding that can be used to predict the range of a previously unseen species in feed-forward manner. We validate our method on two challenging benchmarks, where we obtain state-of-the-art range estimation performance, in a fraction of the compute time, compared to recent alternative approaches.

cross Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models

Authors: Mark Russinovich, Ahmed Salem

Abstract: Recent copyright agreements between AI companies and content creators have highlighted the need for precise control over language models' ability to reproduce copyrighted content. While existing approaches rely on either complete concept removal through unlearning or simple output filtering, we propose Obliviate, a novel post-training technique that selectively prevents verbatim reproduction of specific text while preserving semantic understanding. Obliviate operates by selecting tokens within memorized sequences and modifying the model's probability distribution to prevent exact reproduction while maintaining contextual understanding. We evaluate Obliviate on multiple large language models (LLaMA-3.1 8B, LLaMA-3.1-instruct 8B, Qwen-2.5-7B, and Yi-1.5 6B) across both synthetic memorization tasks and organic copyright content. Our results demonstrate that Obliviate achieves orders of magnitude reduction, e.g., 100x, in verbatim memorization while maintaining model performance within 1% of baseline on standard benchmarks (HellaSwag, MMLU, TruthfulQA, and Winogrande). This makes Obliviate particularly suitable for practical deployment scenarios where companies need to efficiently address copyright concerns in pretrained models without compromising their general capabilities.

cross Low degree conjecture implies sharp computational thresholds in stochastic block model

Authors: Jingqiu Ding, Yiding Hua, Lucas Slot, David Steurer

Abstract: We investigate implications of the (extended) low-degree conjecture (recently formalized in [MW23]) in the context of the symmetric stochastic block model. Assuming the conjecture holds, we establish that no polynomial-time algorithm can weakly recover community labels below the Kesten-Stigum (KS) threshold. In particular, we rule out polynomial-time estimators that, with constant probability, achieve correlation with the true communities that is significantly better than random. Whereas, above the KS threshold, polynomial-time algorithms are known to achieve constant correlation with the true communities with high probability[Mas14,AS15]. To our knowledge, we provide the first rigorous evidence for the sharp transition in recovery rate for polynomial-time algorithms at the KS threshold. Notably, under a stronger version of the low-degree conjecture, our lower bound remains valid even when the number of blocks diverges. Furthermore, our results provide evidence of a computational-to-statistical gap in learning the parameters of stochastic block models. In contrast to prior work, which either (i) rules out polynomial-time algorithms for hypothesis testing with 1-o(1) success probability [Hopkins18, BBK+21a] under the low-degree conjecture, or (ii) rules out low-degree polynomials for learning the edge connection probability matrix [LG23], our approach provides stronger lower bounds on the recovery and learning problem. Our proof combines low-degree lower bounds from [Hopkins18, BBK+21a] with graph splitting and cross-validation techniques. In order to rule out general recovery algorithms, we employ the correlation preserving projection method developed in [HS17].

cross An Interpretable Machine Learning Approach to Understanding the Relationships between Solar Flares and Source Active Regions

Authors: Huseyin Cavus, Jason T. L. Wang, Teja P. S. Singampalli, Gani Caglar Coban, Hongyang Zhang, Abd-ur Raheem, Haimin Wang

Abstract: Solar flares are defined as outbursts on the surface of the Sun. They occur when energy accumulated in magnetic fields enclosing solar active regions (ARs) is abruptly expelled. Solar flares and associated coronal mass ejections are sources of space weather that adversely impact devices at or near Earth, including the obstruction of high-frequency radio waves utilized for communication and the deterioration of power grid operations. Tracking and delivering early and precise predictions of solar flares is essential for readiness and catastrophe risk mitigation. This paper employs the random forest (RF) model to address the binary classification task, analyzing the links between solar flares and their originating ARs with observational data gathered from 2011 to 2021 by SolarMonitor.org and the XRT flare database. We seek to identify the physical features of a source AR that significantly influence its potential to trigger >=C-class flares. We found that the features of AR_Type_Today, Hale_Class_Yesterday are the most and the least prepotent features, respectively. NoS_Difference has a remarkable effect in decision-making in both global and local interpretations.

cross Modifying Final Splits of Classification Tree for Fine-tuning Subpopulation Target in Policy Making

Authors: Lei Bill Wang, Zhenbang Jiao, Fangyi Wang

Abstract: Policymakers often use Classification and Regression Trees (CART) to partition populations based on binary outcomes and target subpopulations whose probability of the binary event exceeds a threshold. However, classic CART and knowledge distillation method whose student model is a CART (referred to as KD-CART) do not minimize the misclassification risk associated with classifying the latent probabilities of these binary events. To reduce the misclassification risk, we propose two methods, Penalized Final Split (PFS) and Maximizing Distance Final Split (MDFS). PFS incorporates a tunable penalty into the standard CART splitting criterion function. MDFS maximizes a weighted sum of distances between node means and the threshold. It can point-identify the optimal split under the unique intersect latent probability assumption. In addition, we develop theoretical result for MDFS splitting rule estimation, which has zero asymptotic risk. Through extensive simulation studies, we demonstrate that these methods predominately outperform classic CART and KD-CART in terms of misclassification error. Furthermore, in our empirical evaluations, these methods provide deeper insights than the two baseline methods.

cross Forecasting Local Ionospheric Parameters Using Transformers

Authors: Daniel J. Alford-Lago, Christopher W. Curtis, Alexander T. Ihler, Katherine A. Zawdie, Douglas P. Drob

Abstract: We present a novel method for forecasting key ionospheric parameters using transformer-based neural networks. The model provides accurate forecasts and uncertainty quantification of the F2-layer peak plasma frequency (foF2), the F2-layer peak density height (hmF2), and total electron content (TEC) for a given geographic location. It supports a number of exogenous variables, including F10.7cm solar flux and disturbance storm time (Dst). We demonstrate how transformers can be trained in a data assimilation-like fashion that use these exogenous variables along with na\"ive predictions from climatology to generate 24-hour forecasts with non-parametric uncertainty bounds. We call this method the Local Ionospheric Forecast Transformer (LIFT). We demonstrate that the trained model can generalize to new geographic locations and time periods not seen during training, and we compare its performance to that of the International Reference Ionosphere (IRI).

cross LUME: LLM Unlearning with Multitask Evaluations

Authors: Anil Ramakrishna, Yixin Wan, Xiaomeng Jin, Kai-Wei Chang, Zhiqi Bu, Bhanukiran Vinzamuri, Volkan Cevher, Mingyi Hong, Rahul Gupta

Abstract: Unlearning aims to remove copyrighted, sensitive, or private content from large language models (LLMs) without a full retraining. In this work, we develop a multi-task unlearning benchmark (LUME) which features three tasks: (1) unlearn synthetically generated creative short novels, (2) unlearn synthetic biographies with sensitive information, and (3) unlearn a collection of public biographies. We further release two fine-tuned LLMs of 1B and 7B parameter sizes as the target models. We conduct detailed evaluations of several recently proposed unlearning algorithms and present results on carefully crafted metrics to understand their behavior and limitations.

cross Social Genome: Grounded Social Reasoning Abilities of Multimodal Models

Authors: Leena Mathur, Marian Qian, Paul Pu Liang, Louis-Philippe Morency

Abstract: Social reasoning abilities are crucial for AI systems to effectively interpret and respond to multimodal human communication and interaction within social contexts. We introduce Social Genome, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models. Social Genome contains 272 videos of interactions and 1,486 human-annotated reasoning traces related to inferences about these interactions. These traces contain 5,777 reasoning steps that reference evidence from visual cues, verbal cues, vocal cues, and external knowledge (contextual knowledge external to videos). Social Genome is also the first modeling challenge to study external knowledge in social reasoning. Social Genome computes metrics to holistically evaluate semantic and structural qualities of model-generated social reasoning traces. We demonstrate the utility of Social Genome through experiments with state-of-the-art models, identifying performance gaps and opportunities for future research to improve the grounded social reasoning abilities of multimodal models.

cross Variational phylogenetic inference with products over bipartitions

Authors: Evan Sidrow, Alexandre Bouchard-C\^ot\'e, Lloyd T. Elliott

Abstract: Bayesian phylogenetics requires accurate and efficient approximation of posterior distributions over trees. In this work, we develop a variational Bayesian approach for ultrametric phylogenetic trees. We present a novel variational family based on coalescent times of a single-linkage clustering and derive a closed-form density of the resulting distribution over trees. Unlike existing methods for ultrametric trees, our method performs inference over all of tree space, it does not require any Markov chain Monte Carlo subroutines, and our variational family is differentiable. Through experiments on benchmark genomic datasets and an application to SARS-CoV-2, we demonstrate that our method achieves competitive accuracy while requiring significantly fewer gradient evaluations than existing state-of-the-art techniques.

cross Do we really need the Rademacher complexities?

Authors: Daniel Bartl, Shahar Mendelson

Abstract: We study the fundamental problem of learning with respect to the squared loss in a convex class. The state-of-the-art sample complexity estimates in this setting rely on Rademacher complexities, which are generally difficult to control. We prove that, contrary to prevailing belief and under minimal assumptions, the sample complexity is not governed by the Rademacher complexities but rather by the behaviour of the limiting gaussian process. In particular, all such learning problems that have the same $L_2$-structure -- even those with heavy-tailed distributions -- share the same sample complexity. This constitutes the first universality result for general convex learning problems. The proof is based on a novel learning procedure, and its performance is studied by combining optimal mean estimation techniques for real-valued random variables with Talagrand's generic chaining method.

cross Curvature Corrected Nonnegative Manifold Data Factorization

Authors: Joyce Chew, Willem Diepeveen, Deanna Needell

Abstract: Data with underlying nonlinear structure are collected across numerous application domains, necessitating new data processing and analysis methods adapted to nonlinear domain structure. Riemannanian manifolds present a rich environment in which to develop such tools, as manifold-valued data arise in a variety of scientific settings, and Riemannian geometry provides a solid theoretical grounding for geometric data analysis. Low-rank approximations, such as nonnegative matrix factorization (NMF), are the foundation of many Euclidean data analysis methods, so adaptations of these factorizations for manifold-valued data are important building blocks for further development of manifold data analysis. In this work, we propose curvature corrected nonnegative manifold data factorization (CC-NMDF) as a geometry-aware method for extracting interpretable factors from manifold-valued data, analogous to nonnegative matrix factorization. We develop an efficient iterative algorithm for computing CC-NMDF and demonstrate our method on real-world diffusion tensor magnetic resonance imaging data.

cross Optimal and Provable Calibration in High-Dimensional Binary Classification: Angular Calibration and Platt Scaling

Authors: Yufan Li, Pragya Sur

Abstract: We study the fundamental problem of calibrating a linear binary classifier of the form $\sigma(\hat{w}^\top x)$, where the feature vector $x$ is Gaussian, $\sigma$ is a link function, and $\hat{w}$ is an estimator of the true linear weight $w^\star$. By interpolating with a noninformative $\textit{chance classifier}$, we construct a well-calibrated predictor whose interpolation weight depends on the angle $\angle(\hat{w}, w_\star)$ between the estimator $\hat{w}$ and the true linear weight $w_\star$. We establish that this angular calibration approach is provably well-calibrated in a high-dimensional regime where the number of samples and features both diverge, at a comparable rate. The angle $\angle(\hat{w}, w_\star)$ can be consistently estimated. Furthermore, the resulting predictor is uniquely $\textit{Bregman-optimal}$, minimizing the Bregman divergence to the true label distribution within a suitable class of calibrated predictors. Our work is the first to provide a calibration strategy that satisfies both calibration and optimality properties provably in high dimensions. Additionally, we identify conditions under which a classical Platt-scaling predictor converges to our Bregman-optimal calibrated solution. Thus, Platt-scaling also inherits these desirable properties provably in high dimensions.

cross CoT-ICL Lab: A Petri Dish for Studying Chain-of-Thought Learning from In-Context Demonstrations

Authors: Vignesh Kothapalli, Hamed Firooz, Maziar Sanjabi

Abstract: We introduce CoT-ICL Lab, a framework and methodology to generate synthetic tokenized datasets and systematically study chain-of-thought (CoT) in-context learning (ICL) in language models. CoT-ICL Lab allows fine grained control over the complexity of in-context examples by decoupling (1) the causal structure involved in chain token generation from (2) the underlying token processing functions. We train decoder-only transformers (up to 700M parameters) on these datasets and show that CoT accelerates the accuracy transition to higher values across model sizes. In particular, we find that model depth is crucial for leveraging CoT with limited in-context examples, while more examples help shallow models match deeper model performance. Additionally, limiting the diversity of token processing functions throughout training improves causal structure learning via ICL. We also interpret these transitions by analyzing transformer embeddings and attention maps. Overall, CoT-ICL Lab serves as a simple yet powerful testbed for theoretical and empirical insights into ICL and CoT in language models.

cross Tensor Product Neural Networks for Functional ANOVA Model

Authors: Seokhun Park, Insung Kong, Yongchan Choi, Chanmoo Park, Yongdai Kim

Abstract: Interpretability for machine learning models is becoming more and more important as machine learning models become more complex. The functional ANOVA model, which decomposes a high-dimensional function into a sum of lower dimensional functions so called components, is one of the most popular tools for interpretable AI, and recently, various neural network models have been developed for estimating each component in the functional ANOVA model. However, such neural networks are highly unstable when estimating components since the components themselves are not uniquely defined. That is, there are multiple functional ANOVA decompositions for a given function. In this paper, we propose a novel interpretable model which guarantees a unique functional ANOVA decomposition and thus is able to estimate each component stably. We call our proposed model ANOVA-NODE since it is a modification of Neural Oblivious Decision Ensembles (NODE) for the functional ANOVA model. Theoretically, we prove that ANOVA-NODE can approximate a smooth function well. Additionally, we experimentally show that ANOVA-NODE provides much more stable estimation of each component and thus much more stable interpretation when training data and initial values of the model parameters vary than existing neural network models do.

cross FormalSpecCpp: A Dataset of C++ Formal Specifications created using LLMs

Authors: Madhurima Chakraborty, Peter Pirkelbauer, Qing Yi

Abstract: FormalSpecCpp is a dataset designed to fill the gap in standardized benchmarks for verifying formal specifications in C++ programs. To the best of our knowledge, this is the first comprehensive collection of C++ programs with well-defined preconditions and postconditions. It provides a structured benchmark for evaluating specification inference tools and testing theaccuracy of generated specifications. Researchers and developers can use this dataset to benchmark specification inference tools,fine-tune Large Language Models (LLMs) for automated specification generation, and analyze the role of formal specifications in improving program verification and automated testing. By making this dataset publicly available, we aim to advance research in program verification, specification inference, and AI-assisted software development. The dataset and the code are available at https://github.com/MadhuNimmo/FormalSpecCpp.

URLs: https://github.com/MadhuNimmo/FormalSpecCpp.

cross Steganographic Embeddings as an Effective Data Augmentation

Authors: Nicholas DiSalvo

Abstract: Image Steganography is a cryptographic technique that embeds secret information into an image, ensuring the hidden data remains undetectable to the human eye while preserving the image's original visual integrity. Least Significant Bit (LSB) Steganography achieves this by replacing the k least significant bits of an image with the k most significant bits of a secret image, maintaining the appearance of the original image while simultaneously encoding the essential elements of the hidden data. In this work, we shift away from conventional applications of steganography in deep learning and explore its potential from a new angle. We present experimental results on CIFAR-10 showing that LSB Steganography, when used as a data augmentation strategy for downstream computer vision tasks such as image classification, can significantly improve the training efficiency of deep neural networks. It can also act as an implicit, uniformly discretized piecewise linear approximation of color augmentations such as (brightness, contrast, hue, and saturation), without introducing additional training overhead through a new joint image training regime that disregards the need for tuning sensitive augmentation hyperparameters.

cross Comparative Analysis of Black Hole Mass Estimation in Type-2 AGNs: Classical vs. Quantum Machine Learning and Deep Learning Approaches

Authors: Sathwik Narkedimilli, Venkata Sriram Amballa, N V Saran Kumar, R Arun Kumar, R Praneeth Reddy, Satvik Raghav, Manish M, Aswath Babu H

Abstract: In the case of Type-2 AGNs, estimating the mass of the black hole is challenging. Understanding how galaxies form and evolve requires considerable insight into the mass of black holes. This work compared different classical and quantum machine learning (QML) algorithms for black hole mass estimation, wherein the classical algorithms are Linear Regression, XGBoost Regression, Random Forest Regressor, Support Vector Regressor (SVR), Lasso Regression, Ridge Regression, Elastic Net Regression, Bayesian Regression, Decision Tree Regressor, Gradient Booster Regressor, Classical Neural Networks, Gated Recurrent Unit (GRU), LSTM, Deep Residual Networks (ResNets) and Transformer-Based Regression. On the other hand, quantum algorithms including Hybrid Quantum Neural Networks (QNN), Quantum Long Short-Term Memory (Q-LSTM), Sampler-QNN, Estimator-QNN, Variational Quantum Regressor (VQR), Quantum Linear Regression(Q-LR), QML with JAX optimization were also tested. The results revealed that classical algorithms gave better R^2, MAE, MSE, and RMSE results than the quantum models. Among the classical models, LSTM has the best result with an accuracy of 99.77%. Estimator-QNN has the highest accuracy for quantum algorithms with an MSE of 0.0124 and an accuracy of 99.75%. This study ascertains both the strengths and weaknesses of the classical and the quantum approaches. As far as our knowledge goes, this work could pave the way for the future application of quantum algorithms in astrophysical data analysis.

cross A Data-Driven Real-Time Optimal Power Flow Algorithm Using Local Feedback

Authors: Heng Liang, Yujin Huang, Changhong Zhao

Abstract: The increasing penetration of distributed energy resources (DERs) adds variability as well as fast control capabilities to power networks. Dispatching the DERs based on local information to provide real-time optimal network operation is the desideratum. In this paper, we propose a data-driven real-time algorithm that uses only the local measurements to solve time-varying AC optimal power flow (OPF). Specifically, we design a learnable function that takes the local feedback as input in the algorithm. The learnable function, under certain conditions, will result in a unique stationary point of the algorithm, which in turn transfers the OPF problems to be optimized over the parameters of the function. We then develop a stochastic primal-dual update to solve the variant of the OPF problems based on a deep neural network (DNN) parametrization of the learnable function, which is referred to as the training stage. We also design a gradient-free alternative to bypass the cumbersome gradient calculation of the nonlinear power flow model. The OPF solution-tracking error bound is established in the sense of universal approximation of DNN. Numerical results on the IEEE 37-bus test feeder show that the proposed method can track the time-varying OPF solutions with higher accuracy and faster computation compared to benchmark methods.

cross Utilizing Sequential Information of General Lab-test Results and Diagnoses History for Differential Diagnosis of Dementia

Authors: Yizong Xing, Dhita Putri Pratama, Yuke Wang, Yufan Zhang, Brian E. Chapman

Abstract: Early diagnosis of Alzheimer's Disease (AD) faces multiple data-related challenges, including high variability in patient data, limited access to specialized diagnostic tests, and overreliance on single-type indicators. These challenges are exacerbated by the progressive nature of AD, where subtle pathophysiological changes often precede clinical symptoms by decades. To address these limitations, this study proposes a novel approach that takes advantage of routinely collected general laboratory test histories for the early detection and differential diagnosis of AD. By modeling lab test sequences as "sentences", we apply word embedding techniques to capture latent relationships between tests and employ deep time series models, including long-short-term memory (LSTM) and Transformer networks, to model temporal patterns in patient records. Experimental results demonstrate that our approach improves diagnostic accuracy and enables scalable and costeffective AD screening in diverse clinical settings.

cross Lightweight yet Efficient: An External Attentive Graph Convolutional Network with Positional Prompts for Sequential Recommendation

Authors: Jinyu Zhang, Chao Li, Zhongying Zhao

Abstract: Graph-based Sequential Recommender systems (GSRs) have gained significant research attention due to their ability to simultaneously handle user-item interactions and sequential relationships between items. Current GSRs often utilize composite or in-depth structures for graph encoding (e.g., the Graph Transformer). Nevertheless, they have high computational complexity, hindering the deployment on resource-constrained edge devices. Moreover, the relative position encoding in Graph Transformer has difficulty in considering the complicated positional dependencies within sequence. To this end, we propose an External Attentive Graph convolutional network with Positional prompts for Sequential recommendation, namely EA-GPS. Specifically, we first introduce an external attentive graph convolutional network that linearly measures the global associations among nodes via two external memory units. Then, we present a positional prompt-based decoder that explicitly treats the absolute item positions as external prompts. By introducing length-adaptive sequential masking and a soft attention network, such a decoder facilitates the model to capture the long-term positional dependencies and contextual relationships within sequences. Extensive experimental results on five real-world datasets demonstrate that the proposed EA-GPS outperforms the state-of-the-art methods. Remarkably, it achieves the superior performance while maintaining a smaller parameter size and lower training overhead. The implementation of this work is publicly available at https://github.com/ZZY-GraphMiningLab/EA-GPS.

URLs: https://github.com/ZZY-GraphMiningLab/EA-GPS.

cross Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment

Authors: Pedram Zaree, Md Abdullah Al Mamun, Quazi Mishkatul Alam, Yue Dong, Ihsen Alouani, Nael Abu-Ghazaleh

Abstract: Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential Jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective Jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks, that are also transferrable. The attacks amplify the success rate of existing Jailbreak algorithms including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B/AdvBench, using less than a third of the generation time).

cross Drug-Target Interaction/Affinity Prediction: Deep Learning Models and Advances Review

Authors: Ali Vefghi, Zahed Rahmati, Mohammad Akbari

Abstract: Drug discovery remains a slow and expensive process that involves many steps, from detecting the target structure to obtaining approval from the Food and Drug Administration (FDA), and is often riddled with safety concerns. Accurate prediction of how drugs interact with their targets and the development of new drugs by using better methods and technologies have immense potential to speed up this process, ultimately leading to faster delivery of life-saving medications. Traditional methods used for drug-target interaction prediction show limitations, particularly in capturing complex relationships between drugs and their targets. As an outcome, deep learning models have been presented to overcome the challenges of interaction prediction through their precise and efficient end results. By outlining promising research avenues and models, each with a different solution but similar to the problem, this paper aims to give researchers a better idea of methods for even more accurate and efficient prediction of drug-target interaction, ultimately accelerating the development of more effective drugs. A total of 180 prediction methods for drug-target interactions were analyzed throughout the period spanning 2016 to 2025 using different frameworks based on machine learning, mainly deep learning and graph neural networks. Additionally, this paper discusses the novelty, architecture, and input representation of these models.

cross AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms

Authors: Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen

Abstract: Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the landscape of attention variants expands, so too do the challenges of optimizing their performance, particularly across different hardware platforms. Current optimization strategies are often narrowly focused, requiring extensive manual intervention to accommodate changes in model configurations or hardware environments. In this paper, we introduce AttentionEngine, a comprehensive framework designed to streamline the optimization of attention mechanisms across heterogeneous hardware backends. By decomposing attention computation into modular operations with customizable components, AttentionEngine enables flexible adaptation to diverse algorithmic requirements. The framework further automates kernel optimization through a combination of programmable templates and a robust cross-platform scheduling strategy. Empirical results reveal performance gains of up to 10x on configurations beyond the reach of existing methods. AttentionEngine offers a scalable, efficient foundation for developing and deploying attention mechanisms with minimal manual tuning. Our code has been open-sourced and is available at https://github.com/microsoft/AttentionEngine.

URLs: https://github.com/microsoft/AttentionEngine.

cross Fr\'echet Cumulative Covariance Net for Deep Nonlinear Sufficient Dimension Reduction with Random Objects

Authors: Hang Yuan, Christina Dan Wang, Zhou Yu

Abstract: Nonlinear sufficient dimension reduction\citep{libing_generalSDR}, which constructs nonlinear low-dimensional representations to summarize essential features of high-dimensional data, is an important branch of representation learning. However, most existing methods are not applicable when the response variables are complex non-Euclidean random objects, which are frequently encountered in many recent statistical applications. In this paper, we introduce a new statistical dependence measure termed Fr\'echet Cumulative Covariance (FCCov) and develop a novel nonlinear SDR framework based on FCCov. Our approach is not only applicable to complex non-Euclidean data, but also exhibits robustness against outliers. We further incorporate Feedforward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs) to estimate nonlinear sufficient directions in the sample level. Theoretically, we prove that our method with squared Frobenius norm regularization achieves unbiasedness at the $\sigma$-field level. Furthermore, we establish non-asymptotic convergence rates for our estimators based on FNNs and ResNet-type CNNs, which match the minimax rate of nonparametric regression up to logarithmic factors. Intensive simulation studies verify the performance of our methods in both Euclidean and non-Euclidean settings. We apply our method to facial expression recognition datasets and the results underscore more realistic and broader applicability of our proposal.

cross TAG: A Decentralized Framework for Multi-Agent Hierarchical Reinforcement Learning

Authors: Giuseppe Paolo, Abdelhakim Benechehab, Hamza Cherkaoui, Albert Thomas, Bal\'azs K\'egl

Abstract: Hierarchical organization is fundamental to biological systems and human societies, yet artificial intelligence systems often rely on monolithic architectures that limit adaptability and scalability. Current hierarchical reinforcement learning (HRL) approaches typically restrict hierarchies to two levels or require centralized training, which limits their practical applicability. We introduce TAME Agent Framework (TAG), a framework for constructing fully decentralized hierarchical multi-agent systems.TAG enables hierarchies of arbitrary depth through a novel LevelEnv concept, which abstracts each hierarchy level as the environment for the agents above it. This approach standardizes information flow between levels while preserving loose coupling, allowing for seamless integration of diverse agent types. We demonstrate the effectiveness of TAG by implementing hierarchical architectures that combine different RL agents across multiple levels, achieving improved performance over classical multi-agent RL baselines on standard benchmarks. Our results show that decentralized hierarchical organization enhances both learning speed and final performance, positioning TAG as a promising direction for scalable multi-agent systems.

cross Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs

Authors: Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, Prasanna Sattigeri, Kush Varshney

Abstract: As large language models (LLMs) become integrated into everyday applications, ensuring their robustness and security is increasingly critical. In particular, LLMs can be manipulated into unsafe behaviour by prompts known as jailbreaks. The variety of jailbreak styles is growing, necessitating the use of external defences known as guardrails. While many jailbreak defences have been proposed, not all defences are able to handle new out-of-distribution attacks due to the narrow segment of jailbreaks used to align them. Moreover, the lack of systematisation around defences has created significant gaps in their practical application. In this work, we perform systematic benchmarking across 15 different defences, considering a broad swathe of malicious and benign datasets. We find that there is significant performance variation depending on the style of jailbreak a defence is subject to. Additionally, we show that based on current datasets available for evaluation, simple baselines can display competitive out-of-distribution performance compared to many state-of-the-art defences. Code is available at https://github.com/IBM/Adversarial-Prompt-Evaluation.

URLs: https://github.com/IBM/Adversarial-Prompt-Evaluation.

cross Dimension-free bounds in high-dimensional linear regression via error-in-operator approach

Authors: Fedor Noskov, Nikita Puchkin, Vladimir Spokoiny

Abstract: We consider a problem of high-dimensional linear regression with random design. We suggest a novel approach referred to as error-in-operator which does not estimate the design covariance $\Sigma$ directly but incorporates it into empirical risk minimization. We provide an expansion of the excess prediction risk and derive non-asymptotic dimension-free bounds on the leading term and the remainder. This helps us to show that auxiliary variables do not increase the effective dimension of the problem, provided that parameters of the procedure are tuned properly. We also discuss computational aspects of our method and illustrate its performance with numerical experiments.

cross PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

Authors: Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan G\'omez-Luna, Huawei Li, Xiaowei Li, Ying Wang, Onur Mutlu

Abstract: Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM model relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) one-size-fits-all approach of designing PIM units inefficient due to a large degree of heterogeneity even in memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly-used LLMs show that PAPI achieves 1.8$\times$ and 11.1$\times$ speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.

cross Sheaf theory: from deep geometry to deep learning

Authors: Anton Ayzenberg, Thomas Gebhart, German Magai, Grigory Solomadin

Abstract: This paper provides an overview of the applications of sheaf theory in deep learning, data science, and computer science in general. The primary text of this work serves as a friendly introduction to applied and computational sheaf theory accessible to those with modest mathematical familiarity. We describe intuitions and motivations underlying sheaf theory shared by both theoretical researchers and practitioners, bridging classical mathematical theory and its more recent implementations within signal processing and deep learning. We observe that most notions commonly considered specific to cellular sheaves translate to sheaves on arbitrary posets, providing an interesting avenue for further generalization of these methods in applications, and we present a new algorithm to compute sheaf cohomology on arbitrary finite posets in response. By integrating classical theory with recent applications, this work reveals certain blind spots in current machine learning practices. We conclude with a list of problems related to sheaf-theoretic applications that we find mathematically insightful and practically instructive to solve. To ensure the exposition of sheaf theory is self-contained, a rigorous mathematical introduction is provided in appendices which moves from an introduction of diagrams and sheaves to the definition of derived functors, higher order cohomology, sheaf Laplacians, sheaf diffusion, and interconnections of these subjects therein.

cross Generalization Guarantees for Representation Learning via Data-Dependent Gaussian Mixture Priors

Authors: Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski

Abstract: We establish in-expectation and tail bounds on the generalization error of representation learning type algorithms. The bounds are in terms of the relative entropy between the distribution of the representations extracted from the training and "test'' datasets and a data-dependent symmetric prior, i.e., the Minimum Description Length (MDL) of the latent variables for the training and test datasets. Our bounds are shown to reflect the "structure" and "simplicity'' of the encoder and significantly improve upon the few existing ones for the studied model. We then use our in-expectation bound to devise a suitable data-dependent regularizer; and we investigate thoroughly the important question of the selection of the prior. We propose a systematic approach to simultaneously learning a data-dependent Gaussian mixture prior and using it as a regularizer. Interestingly, we show that a weighted attention mechanism emerges naturally in this procedure. Our experiments show that our approach outperforms the now popular Variational Information Bottleneck (VIB) method as well as the recent Category-Dependent VIB (CDVIB).

cross Estimating Vehicle Speed on Roadways Using RNNs and Transformers: A Video-based Approach

Authors: Sai Krishna Reddy Mareddy, Dhanush Upplapati, Dhanush Kumar Antharam

Abstract: This project explores the application of advanced machine learning models, specifically Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Transformers, to the task of vehicle speed estimation using video data. Traditional methods of speed estimation, such as radar and manual systems, are often constrained by high costs, limited coverage, and potential disruptions. In contrast, leveraging existing surveillance infrastructure and cutting-edge neural network architectures presents a non-intrusive, scalable solution. Our approach utilizes LSTM and GRU to effectively manage long-term dependencies within the temporal sequence of video frames, while Transformers are employed to harness their self-attention mechanisms, enabling the processing of entire sequences in parallel and focusing on the most informative segments of the data. This study demonstrates that both LSTM and GRU outperform basic Recurrent Neural Networks (RNNs) due to their advanced gating mechanisms. Furthermore, increasing the sequence length of input data consistently improves model accuracy, highlighting the importance of contextual information in dynamic environments. Transformers, in particular, show exceptional adaptability and robustness across varied sequence lengths and complexities, making them highly suitable for real-time applications in diverse traffic conditions. The findings suggest that integrating these sophisticated neural network models can significantly enhance the accuracy and reliability of automated speed detection systems, thus promising to revolutionize traffic management and road safety.

cross A Defensive Framework Against Adversarial Attacks on Machine Learning-Based Network Intrusion Detection Systems

Authors: Benyamin Tafreshian, Shengzhi Zhang

Abstract: As cyberattacks become increasingly sophisticated, advanced Network Intrusion Detection Systems (NIDS) are critical for modern network security. Traditional signature-based NIDS are inadequate against zero-day and evolving attacks. In response, machine learning (ML)-based NIDS have emerged as promising solutions; however, they are vulnerable to adversarial evasion attacks that subtly manipulate network traffic to bypass detection. To address this vulnerability, we propose a novel defensive framework that enhances the robustness of ML-based NIDS by simultaneously integrating adversarial training, dataset balancing techniques, advanced feature engineering, ensemble learning, and extensive model fine-tuning. We validate our framework using the NSL-KDD and UNSW-NB15 datasets. Experimental results show, on average, a 35% increase in detection accuracy and a 12.5% reduction in false positives compared to baseline models, particularly under adversarial conditions. The proposed defense against adversarial attacks significantly advances the practical deployment of robust ML-based NIDS in real-world networks.

cross Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation

Authors: Tim R\"adsch, Leon Mayer, Simon Pavicic, A. Emre Kavur, Marcel Knopp, Bar{\i}\c{s} \"Ozt\"urk, Klaus Maier-Hein, Paul F. Jaeger, Fabian Isensee, Annika Reinke, Lena Maier-Hein

Abstract: Reliable evaluation of AI models is critical for scientific progress and practical application. While existing VLM benchmarks provide general insights into model capabilities, their heterogeneous designs and limited focus on a few imaging domains pose significant challenges for both cross-domain performance comparison and targeted domain-specific evaluation. To address this, we propose three key contributions: (1) a framework for the resource-efficient creation of domain-specific VLM benchmarks enabled by task augmentation for creating multiple diverse tasks from a single existing task, (2) the release of new VLM benchmarks for seven domains, created according to the same homogeneous protocol and including 162,946 thoroughly human-validated answers, and (3) an extensive benchmarking of 22 state-of-the-art VLMs on a total of 37,171 tasks, revealing performance variances across domains and tasks, thereby supporting the need for tailored VLM benchmarks. Adoption of our methodology will pave the way for the resource-efficient domain-specific selection of models and guide future research efforts toward addressing core open questions.

cross Feature maps for the Laplacian kernel and its generalizations

Authors: Sudhendu Ahir, Parthe Pandit

Abstract: Recent applications of kernel methods in machine learning have seen a renewed interest in the Laplacian kernel, due to its stability to the bandwidth hyperparameter in comparison to the Gaussian kernel, as well as its expressivity being equivalent to that of the neural tangent kernel of deep fully connected networks. However, unlike the Gaussian kernel, the Laplacian kernel is not separable. This poses challenges for techniques to approximate it, especially via the random Fourier features (RFF) methodology and its variants. In this work, we provide random features for the Laplacian kernel and its two generalizations: Mat\'{e}rn kernel and the Exponential power kernel. We provide efficiently implementable schemes to sample weight matrices so that random features approximate these kernels. These weight matrices have a weakly coupled heavy-tailed randomness. Via numerical experiments on real datasets we demonstrate the efficacy of these random feature maps.

cross Context-Aware Doubly-Robust Semi-Supervised Learning

Authors: Clement Ruah, Houssem Sifaou, Osvaldo Simeone, Bashir Al-Hashimi

Abstract: The widespread adoption of artificial intelligence (AI) in next-generation communication systems is challenged by the heterogeneity of traffic and network conditions, which call for the use of highly contextual, site-specific, data. A promising solution is to rely not only on real-world data, but also on synthetic pseudo-data generated by a network digital twin (NDT). However, the effectiveness of this approach hinges on the accuracy of the NDT, which can vary widely across different contexts. To address this problem, this paper introduces context-aware doubly-robust (CDR) learning, a novel semi-supervised scheme that adapts its reliance on the pseudo-data to the different levels of fidelity of the NDT across contexts. CDR is evaluated on the task of downlink beamforming, showing superior performance compared to previous state-of-the-art semi-supervised approaches.

cross LightThinker: Thinking Step-by-Step Compression

Authors: Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang

Abstract: Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance. Code will be released at https://github.com/zjunlp/LightThinker.

URLs: https://github.com/zjunlp/LightThinker.

cross KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation

Authors: Yoonjin Chung, Pilsun Eu, Junwon Lee, Keunwoo Choi, Juhan Nam, Ben Sangbae Chon

Abstract: Although being widely adopted for evaluating generated audio signals, the Fr\'echet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.

cross Do Multilingual LLMs Think In English?

Authors: Lisa Schut, Yarin Gal, Sebastian Farquhar

Abstract: Large language models (LLMs) have multilingual capabilities and can solve tasks across various languages. However, we show that current LLMs make key decisions in a representation space closest to English, regardless of their input and output languages. Exploring the internal representations with a logit lens for sentences in French, German, Dutch, and Mandarin, we show that the LLM first emits representations close to English for semantically-loaded words before translating them into the target language. We further show that activation steering in these LLMs is more effective when the steering vectors are computed in English rather than in the language of the inputs and outputs. This suggests that multilingual LLMs perform key reasoning steps in a representation that is heavily shaped by English in a way that is not transparent to system users.

cross On the Robustness of Transformers against Context Hijacking for Linear Classification

Authors: Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou

Abstract: Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, revealing a significant robustness issue. To understand this phenomenon theoretically, we explore an in-context linear classification problem based on recent advances in linear transformers. In our setup, context tokens are designed as factually correct query-answer pairs, where the queries are similar to the final query but have opposite labels. Then, we develop a general theoretical analysis on the robustness of the linear transformers, which is formulated as a function of the model depth, training context lengths, and number of hijacking context tokens. A key finding is that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations. We show that this improvement arises because deeper layers enable more fine-grained optimization steps, effectively mitigating interference from context hijacking. This is also well supported by our numerical experiments. Our findings provide theoretical insights into the benefits of deeper architectures and contribute to enhancing the understanding of transformer architectures.

cross Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing

Authors: Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar

Abstract: We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing-using just 1.5% of FLOPs-can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction compared to the state-of-the-art method at a 40% pruning ratio. Our code is available at https://github.com/Qi-Le1/Probe_Pruning.

URLs: https://github.com/Qi-Le1/Probe_Pruning.

cross Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture

Authors: John Burden, Marko Te\v{s}i\'c, Lorenzo Pacchiardi, Jos\'e Hern\'andez-Orallo

Abstract: Research in AI evaluation has grown increasingly complex and multidisciplinary, attracting researchers with diverse backgrounds and objectives. As a result, divergent evaluation paradigms have emerged, often developing in isolation, adopting conflicting terminologies, and overlooking each other's contributions. This fragmentation has led to insular research trajectories and communication barriers both among different paradigms and with the general public, contributing to unmet expectations for deployed AI systems. To help bridge this insularity, in this paper we survey recent work in the AI evaluation landscape and identify six main paradigms. We characterise major recent contributions within each paradigm across key dimensions related to their goals, methodologies and research cultures. By clarifying the unique combination of questions and approaches associated with each paradigm, we aim to increase awareness of the breadth of current evaluation approaches and foster cross-pollination between different paradigms. We also identify potential gaps in the field to inspire future research directions.

cross Sparks of cognitive flexibility: self-guided context inference for flexible stimulus-response mapping by attentional routing

Authors: Rowan Sommers, Sushrut Thorat, Daniel Anthes, Tim C. Kietzmann

Abstract: Flexible cognition demands discovering hidden rules to quickly adapt stimulus-response mappings. Standard neural networks struggle in tasks requiring rapid, context-driven remapping. Recently, Hummos (2023) introduced a fast-and-slow learning algorithm to mitigate this shortfall, but its scalability to complex, image-computable tasks was unclear. Here, we propose the Wisconsin Neural Network (WiNN), which expands on fast-and-slow learning for real-world tasks demanding flexible rule-based behavior. WiNN employs a pretrained convolutional neural network for vision, coupled with an adjustable "context state" that guides attention to relevant features. If WiNN produces an incorrect response, it first iteratively updates its context state to refocus attention on task-relevant cues, then performs minimal parameter updates to attention and readout layers. This strategy preserves generalizable representations in the sensory network, reducing catastrophic forgetting. We evaluate WiNN on an image-based extension of the Wisconsin Card Sorting Task, revealing several markers of cognitive flexibility: (i) WiNN autonomously infers underlying rules, (ii) requires fewer examples to do so than control models reliant on large-scale parameter updates, (iii) can perform context-based rule inference solely via context-state adjustments-further enhanced by slow updates of attention and readout parameters, and (iv) generalizes to unseen compositional rules through context-state inference alone. By blending fast context inference with targeted attentional guidance, WiNN achieves "sparks" of flexibility. This approach offers a path toward context-sensitive models that retain knowledge while rapidly adapting to complex, rule-based tasks.

cross Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models

Authors: Anirudh Sundar, Sinead Williamson, Katherine Metcalf, Barry-John Theobald, Skyler Seto, Masha Fedzechkina

Abstract: Aligned representations across languages is a desired property in multilingual large language models (mLLMs), as alignment can improve performance in cross-lingual tasks. Typically alignment requires fine-tuning a model, which is computationally expensive, and sizable language data, which often may not be available. A data-efficient alternative to fine-tuning is model interventions -- a method for manipulating model activations to steer generation into the desired direction. We analyze the effect of a popular intervention (finding experts) on the alignment of cross-lingual representations in mLLMs. We identify the neurons to manipulate for a given language and introspect the embedding space of mLLMs pre- and post-manipulation. We show that modifying the mLLM's activations changes its embedding space such that cross-lingual alignment is enhanced. Further, we show that the changes to the embedding space translate into improved downstream performance on retrieval tasks, with up to 2x improvements in top-1 accuracy on cross-lingual retrieval.

cross Machine-generated text detection prevents language model collapse

Authors: George Drayson, Vasileios Lampos

Abstract: As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. Since web data is the primary resource for LLM pretraining, future models will be trained on an unknown portion of synthetic data. This will lead to model collapse, a degenerative process which causes models to reinforce their own errors and experience a drop in model performance. In this study, we investigate the impact of decoding strategy on model collapse, where we analyse the characteristics of the generated data during recursive training, its similarity to human references and the resulting model performance. Using the decoding strategies that lead to the most significant model degradation, we tackle the question: how to avoid model collapse when the origin (human or synthetic) of the training data is unknown. We design a novel methodology based on resampling the data distribution using importance weights from our machine-generated text detector. Our method is validated on two LLM variants (GPT-2 and SmolLM2) on the open-ended text generation task, demonstrating that we can successfully prevent model collapse and when there is enough human-authored data in the training dataset, our method improves model performance.

cross Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

Authors: Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, S\"oren Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King

Abstract: The leading AI companies are increasingly focused on building generalist AI agents -- systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory. Accordingly, we propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.

cross Automating Curriculum Learning for Reinforcement Learning using a Skill-Based Bayesian Network

Authors: Vincent Hsiao, Mark Roberts, Laura M. Hiatt, George Konidaris, Dana Nau

Abstract: A major challenge for reinforcement learning is automatically generating curricula to reduce training time or improve performance in some target task. We introduce SEBNs (Skill-Environment Bayesian Networks) which model a probabilistic relationship between a set of skills, a set of goals that relate to the reward structure, and a set of environment features to predict policy performance on (possibly unseen) tasks. We develop an algorithm that uses the inferred estimates of agent success from SEBN to weigh the possible next tasks by expected improvement. We evaluate the benefit of the resulting curriculum on three environments: a discrete gridworld, continuous control, and simulated robotics. The results show that curricula constructed using SEBN frequently outperform other baselines.

cross Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing

Authors: Shoumik Saha, Soheil Feizi

Abstract: The growing use of large language models (LLMs) for text generation has led to widespread concerns about AI-generated content detection. However, an overlooked challenge is AI-polished text, where human-written content undergoes subtle refinements using AI tools. This raises a critical question: should minimally polished text be classified as AI-generated? Misclassification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content. In this study, we systematically evaluate eleven state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation (APT-Eval) dataset, which contains $11.7K$ samples refined at varying AI-involvement levels. Our findings reveal that detectors frequently misclassify even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models. These limitations highlight the urgent need for more nuanced detection methodologies.

cross FLEKE: Federated Locate-then-Edit Knowledge Editing

Authors: Zongkai Zhao, Guozeng Xu, Xiuhua Li, Kaiwen Wei, Jiang Zhong

Abstract: Locate-then-Edit Knowledge Editing (LEKE) is a key technique for updating large language models (LLMs) without full retraining. However, existing methods assume a single-user setting and become inefficient in real-world multi-client scenarios, where decentralized organizations (e.g., hospitals, financial institutions) independently update overlapping knowledge, leading to redundant mediator knowledge vector (MKV) computations and privacy concerns. To address these challenges, we introduce Federated Locate-then-Edit Knowledge Editing (FLEKE), a novel task that enables multiple clients to collaboratively perform LEKE while preserving privacy and reducing computational overhead. To achieve this, we propose FedEdit, a two-stage framework that optimizes MKV selection and reuse. In the first stage, clients locally apply LEKE and upload the computed MKVs. In the second stage, rather than relying solely on server-based MKV sharing, FLEKE allows clients retrieve relevant MKVs based on cosine similarity, enabling knowledge re-edit and minimizing redundant computations. Experimental results on two benchmark datasets demonstrate that FedEdit retains over 96% of the performance of non-federated LEKE while significantly outperforming a FedAvg-based baseline by approximately twofold. Besides, we find that MEMIT performs more consistently than PMET in the FLEKE task with our FedEdit framework. Our code is available at https://github.com/zongkaiz/FLEKE.

URLs: https://github.com/zongkaiz/FLEKE.

replace GraphFM: Graph Factorization Machines for Feature Interaction Modeling

Authors: Shu Wu, Zekun Li, Yunyue Su, Zeyu Cui, Xiaoyu Zhang, Liang Wang

Abstract: Factorization machine (FM) is a prevalent approach to modeling pairwise (second-order) feature interactions when dealing with high-dimensional sparse data. However, on the one hand, FM fails to capture higher-order feature interactions suffering from combinatorial expansion. On the other hand, taking into account interactions between every pair of features may introduce noise and degrade prediction accuracy. To solve the problems, we propose a novel approach, Graph Factorization Machine (GraphFM), by naturally representing features in the graph structure. In particular, we design a mechanism to select the beneficial feature interactions and formulate them as edges between features. Then the proposed model, which integrates the interaction function of FM into the feature aggregation strategy of Graph Neural Network (GNN), can model arbitrary-order feature interactions on the graph-structured features by stacking layers. Experimental results on several real-world datasets have demonstrated the rationality and effectiveness of our proposed approach. The code and data are available at https://github.com/CRIPAC-DIG/GraphCTR}{https://github.com/CRIPAC-DIG/GraphCTR

URLs: https://github.com/CRIPAC-DIG/GraphCTR, https://github.com/CRIPAC-DIG/GraphCTR

replace Graph Neural Diffusion Networks for Semi-supervised Learning

Authors: Wei Ye, Zexi Huang, Yunqi Hong, Ambuj Singh

Abstract: Graph Convolutional Networks (GCN) is a pioneering model for graph-based semi-supervised learning. However, GCN does not perform well on sparsely-labeled graphs. Its two-layer version cannot effectively propagate the label information to the whole graph structure (i.e., the under-smoothing problem) while its deep version over-smoothens and is hard to train (i.e., the over-smoothing problem). To solve these two issues, we propose a new graph neural network called GND-Nets (for Graph Neural Diffusion Networks) that exploits the local and global neighborhood information of a vertex in a single layer. Exploiting the shallow network mitigates the over-smoothing problem while exploiting the local and global neighborhood information mitigates the under-smoothing problem. The utilization of the local and global neighborhood information of a vertex is achieved by a new graph diffusion method called neural diffusions, which integrate neural networks into the conventional linear and nonlinear graph diffusions. The adoption of neural networks makes neural diffusions adaptable to different datasets. Extensive experiments on various sparsely-labeled graphs verify the effectiveness and efficiency of GND-Nets compared to state-of-the-art approaches.

replace IGN : Implicit Generative Networks

Authors: Haozheng Luo, Tianyi Wu, Colin Feiyu Han, Zhijun Yan

Abstract: In this work, we build recent advances in distributional reinforcement learning to give a state-of-art distributional variant of the model based on the IQN. We achieve this by using the GAN model's generator and discriminator function with the quantile regression to approximate the full quantile value for the state-action return distribution. We demonstrate improved performance on our baseline dataset - 57 Atari 2600 games in the ALE. Also, we use our algorithm to show the state-of-art training performance of risk-sensitive policies in Atari games with the policy optimization and evaluation.

replace Graph Perceiver IO: A General Architecture for Graph Structured Data

Authors: Seyun Bae, Hoyoon Byun, Changdae Oh, Yoon-Sik Cho, Kyungwoo Song

Abstract: Multimodal machine learning has been widely studied for the development of general intelligence. Recently, the Perceiver and Perceiver IO, show competitive results for diverse dataset domains and tasks. However, recent works, Perceiver and Perceiver IO, have focused on heterogeneous modalities, including image, text, and there are few research works for graph structured datasets. A graph has an adjacency matrix different from other datasets such as text and image, and it is not trivial to handle the topological information. In this study, we provide a Graph Perceiver IO (GPIO), the Perceiver IO for the graph structured dataset. We keep the main structure of the GPIO as the Perceiver IO because the Perceiver IO already handles the diverse dataset well, except for the graph structured dataset. The GPIO is a general method that handles diverse datasets, such as graph-structured data, text, and images, by leveraging positional encoding and output query smoothing. Compared to graph neural networks (GNNs), GPIO requires lower complexity and can efficiently incorporate global and local information, which is also empirically validated through experiments. Furthermore, we propose GPIO+ for the multimodal few-shot classification that incorporates both images and graphs simultaneously. GPIO achieves higher benchmark accuracy than GNNs across multiple tasks, including graph classification, node classification, and multimodal text classification, while also attaining superior AP and AUC in link prediction. Additionally, GPIO+ outperforms GNNs in multimodal few-shot classification. Our GPIO(+) can serve as a general architecture for handling various modalities and tasks.

replace Towards Comparable Active Learning

Authors: Thorben Werner, Johannes Burchert, Lars Schmidt-Thieme

Abstract: Active Learning has received significant attention in the field of machine learning for its potential in selecting the most informative samples for labeling, thereby reducing data annotation costs. However, we show that the reported lifts in recent literature generalize poorly to other domains leading to an inconclusive landscape in Active Learning research. Furthermore, we highlight overlooked problems for reproducing AL experiments that can lead to unfair comparisons and increased variance in the results. This paper addresses these issues by providing an Active Learning framework for a fair comparison of algorithms across different tasks and domains, as well as a fast and performant oracle algorithm for evaluation. To the best of our knowledge, we propose the first AL benchmark that tests algorithms in 3 major domains: Tabular, Image, and Text. We report empirical results for 6 widely used algorithms on 7 real-world and 2 synthetic datasets and aggregate them into a domain-specific ranking of AL algorithms.

replace DualDynamics: Synergizing Implicit and Explicit Methods for Robust Irregular Time Series Analysis

Authors: YongKyung Oh, Dong-Young Lim, Sungil Kim

Abstract: Real-world time series analysis faces significant challenges when dealing with irregular and incomplete data. While Neural Differential Equation (NDE) based methods have shown promise, they struggle with limited expressiveness, scalability issues, and stability concerns. Conversely, Neural Flows offer stability but falter with irregular data. We introduce 'DualDynamics', a novel framework that synergistically combines NDE-based method and Neural Flow-based method. This approach enhances expressive power while balancing computational demands, addressing critical limitations of existing techniques. We demonstrate DualDynamics' effectiveness across diverse tasks: classification of robustness to dataset shift, irregularly-sampled series analysis, interpolation of missing data, and forecasting with partial observations. Our results show consistent outperformance over state-of-the-art methods, indicating DualDynamics' potential to advance irregular time series analysis significantly.

replace Three Mechanisms of Feature Learning in a Linear Network

Authors: Yizhou Xu, Liu Ziyin

Abstract: Understanding the dynamics of neural networks in different width regimes is crucial for improving their training and performance. We present an exact solution for the learning dynamics of a one-hidden-layer linear network, with one-dimensional data, across any finite width, uniquely exhibiting both kernel and feature learning phases. This study marks a technical advancement by enabling the analysis of the training trajectory from any initialization and a detailed phase diagram under varying common hyperparameters such as width, layer-wise learning rates, and scales of output and initialization. We identify three novel prototype mechanisms specific to the feature learning regime: (1) learning by alignment, (2) learning by disalignment, and (3) learning by rescaling, which contrast starkly with the dynamics observed in the kernel regime. Our theoretical findings are substantiated with empirical evidence showing that these mechanisms also manifest in deep nonlinear networks handling real-world tasks, enhancing our understanding of neural network training dynamics and guiding the design of more effective learning strategies.

replace Privacy-Enhanced Training-as-a-Service for On-Device Intelligence: Concept, Architectural Scheme, and Open Problems

Authors: Zhiyuan Wu, Sheng Sun, Yuwei Wang, Min Liu, Bo Gao, Tianliu He, Wen Wang

Abstract: On-device intelligence (ODI) enables artificial intelligence (AI) applications to run on end devices, providing real-time and customized AI inference without relying on remote servers. However, training models for on-device deployment face significant challenges due to the decentralized and privacy-sensitive nature of users' data, along with end-side constraints related to network connectivity, computation efficiency, etc. Existing training paradigms, such as cloud-based training, federated learning, and transfer learning, fail to sufficiently address these practical constraints that are prevalent for devices. To overcome these challenges, we propose Privacy-Enhanced Training-as-a-Service (PTaaS), a novel service computing paradigm that provides privacy-friendly, customized AI model training for end devices. PTaaS outsources the core training process to remote and powerful cloud or edge servers, efficiently developing customized on-device models based on uploaded anonymous queries, enhancing data privacy while reducing the computation load on individual devices. We explore the definition, goals, and design principles of PTaaS, alongside emerging technologies that support the PTaaS paradigm. An architectural scheme for PTaaS is also presented, followed by a series of open problems that set the stage for future research directions in the field of PTaaS.

replace Large Language Models for Mobility Analysis in Transportation Systems: A Survey on Forecasting Tasks

Authors: Zijian Zhang, Yujie Sun, Zepu Wang, Yuqi Nie, Xiaobo Ma, Ruolin Li, Peng Sun, Xuegang Ban

Abstract: Mobility analysis is a crucial element in the research area of transportation systems. Forecasting traffic information offers a viable solution to address the conflict between increasing transportation demands and the limitations of transportation infrastructure. Predicting human travel is significant in aiding various transportation and urban management tasks, such as taxi dispatch and urban planning. Machine learning and deep learning methods are favored for their flexibility and accuracy. Nowadays, with the advent of large language models (LLMs), many researchers have combined these models with previous techniques or applied LLMs to directly predict future traffic information and human travel behaviors. However, there is a lack of comprehensive studies on how LLMs can contribute to this field. This survey explores existing approaches using LLMs for time series forecasting problems for mobility in transportation systems. We provide a literature review concerning the forecasting applications within transportation systems, elucidating how researchers utilize LLMs, showcasing recent state-of-the-art advancements, and identifying the challenges that must be overcome to fully leverage LLMs in this domain.

replace Filtered not Mixed: Stochastic Filtering-Based Online Gating for Mixture of Large Language Models

Authors: Raeid Saqur, Anastasis Kratsios, Florian Krach, Yannick Limmer, Jacob-Junqi Tian, John Willes, Blanka Horvath, Frank Rudzicz

Abstract: We propose MoE-F - a formalized mechanism for combining $N$ pre-trained Large Language Models (LLMs) for online time-series prediction by adaptively forecasting the best weighting of LLM predictions at every time step. Our mechanism leverages the conditional information in each expert's running performance to forecast the best combination of LLMs for predicting the time series in its next step. Diverging from static (learned) Mixture of Experts (MoE) methods, our approach employs time-adaptive stochastic filtering techniques to combine experts. By framing the expert selection problem as a finite state-space, continuous-time Hidden Markov model (HMM), we can leverage the Wohman-Shiryaev filter. Our approach first constructs N parallel filters corresponding to each of the $N$ individual LLMs. Each filter proposes its best combination of LLMs, given the information that they have access to. Subsequently, the N filter outputs are optimally aggregated to maximize their robust predictive power, and this update is computed efficiently via a closed-form expression, generating our ensemble predictor. Our contributions are: **(I)** the MoE-F plug-and-play filtering harness algorithm, **(II)** theoretical optimality guarantees of the proposed filtering-based gating algorithm (via optimality guarantees for its parallel Bayesian filtering and its robust aggregation steps), and **(III)** empirical evaluation and ablative results using state-of-the-art foundational and MoE LLMs on a real-world __Financial Market Movement__ task where MoE-F attains a remarkable 17\% absolute and 48.5\% relative F1 measure improvement over the next best performing individual LLM expert predicting short-horizon market movement based on streaming news. Further, we provide empirical evidence of substantial performance gains in applying MoE-F over specialized models in the long-horizon time-series forecasting domain.

replace Tuning the Frequencies: Robust Training for Sinusoidal Neural Networks

Authors: Tiago Novello, Diana Aldana, Andre Araujo, Luiz Velho

Abstract: Sinusoidal neural networks have been shown effective as implicit neural representations (INRs) of low-dimensional signals, due to their smoothness and high representation capacity. However, initializing and training them remain empirical tasks which lack on deeper understanding to guide the learning process. To fill this gap, our work introduces a theoretical framework that explains the capacity property of sinusoidal networks and offers robust control mechanisms for initialization and training. Our analysis is based on a novel amplitude-phase expansion of the sinusoidal multilayer perceptron, showing how its layer compositions produce a large number of new frequencies expressed as integer combinations of the input frequencies. This relationship can be directly used to initialize the input neurons, as a form of spectral sampling, and to bound the network's spectrum while training. Our method, referred to as TUNER (TUNing sinusoidal nEtwoRks), greatly improves the stability and convergence of sinusoidal INR training, leading to detailed reconstructions, while preventing overfitting.

replace Selective Prompt Anchoring for Code Generation

Authors: Yuan Tian, Tianyi Zhang

Abstract: Recent advances in large language models (LLMs) have transformed software development by automatically generating code from natural language. Yet challenges remain in generating fully correct code that aligns with user intent. Our study reveals that LLMs tend to pay less attention to user prompts as more code tokens are generated. We hypothesize that this attention dilution issue is an important reason for code generation errors. To mitigate this issue, we propose Selective Prompt Anchoring (SPA) to guide code LLMs to pay more attention to user intent when generating code. We evaluate SPA using six base LLMs across six benchmarks. Our results demonstrate that SPA enhances Pass@1 by up to 12.9%, consistently outperforming SOTA code generation methods in all settings. Our code is available at https://github.com/magic-YuanTian/Selective-Prompt-Anchoring.

URLs: https://github.com/magic-YuanTian/Selective-Prompt-Anchoring.

replace Interactive incremental learning of generalizable skills with local trajectory modulation

Authors: Markus Knauer, Alin Albu-Sch\"affer, Freek Stulp, Jo\~ao Silv\'erio

Abstract: The problem of generalization in learning from demonstration (LfD) has received considerable attention over the years, particularly within the context of movement primitives, where a number of approaches have emerged. Recently, two important approaches have gained recognition. While one leverages via-points to adapt skills locally by modulating demonstrated trajectories, another relies on so-called task-parameterized models that encode movements with respect to different coordinate systems, using a product of probabilities for generalization. While the former are well-suited to precise, local modulations, the latter aim at generalizing over large regions of the workspace and often involve multiple objects. Addressing the quality of generalization by leveraging both approaches simultaneously has received little attention. In this work, we propose an interactive imitation learning framework that simultaneously leverages local and global modulations of trajectory distributions. Building on the kernelized movement primitives (KMP) framework, we introduce novel mechanisms for skill modulation from direct human corrective feedback. Our approach particularly exploits the concept of via-points to incrementally and interactively 1) improve the model accuracy locally, 2) add new objects to the task during execution and 3) extend the skill into regions where demonstrations were not provided. We evaluate our method on a bearing ring-loading task using a torque-controlled, 7-DoF, DLR SARA robot.

replace Optimizing DNN Inference on Multi-Accelerator SoCs at Training-time

Authors: Matteo Risso, Alessio Burrello, Daniele Jahier Pagliari

Abstract: The demand for executing Deep Neural Networks (DNNs) with low latency and minimal power consumption at the edge has led to the development of advanced heterogeneous Systems-on-Chips (SoCs) that incorporate multiple specialized computing units (CUs), such as accelerators. Offloading DNN computations to a specific CU from the available set often exposes accuracy vs efficiency trade-offs, due to differences in their supported operations (e.g., standard vs. depthwise convolution) or data representations (e.g., more/less aggressively quantized). A challenging yet unresolved issue is how to map a DNN onto these multi-CU systems to maximally exploit the parallelization possibilities while taking accuracy into account. To address this problem, we present ODiMO, a hardware-aware tool that efficiently explores fine-grain mapping of DNNs among various on-chip CUs, during the training phase. ODiMO strategically splits individual layers of the neural network and executes them in parallel on the multiple available CUs, aiming to balance the total inference energy consumption or latency with the resulting accuracy, impacted by the unique features of the different hardware units. We test our approach on CIFAR-10, CIFAR-100, and ImageNet, targeting two open-source heterogeneous SoCs, i.e., DIANA and Darkside. We obtain a rich collection of Pareto-optimal networks in the accuracy vs. energy or latency space. We show that ODiMO reduces the latency of a DNN executed on the Darkside SoC by up to 8x at iso-accuracy, compared to manual heuristic mappings. When targeting energy, on the same SoC, ODiMO produced up to 50.8x more efficient mappings, with minimal accuracy drop (< 0.3%).

replace Deep Parallel Spectral Neural Operators for Solving Partial Differential Equations with Enhanced Low-Frequency Learning Capability

Authors: Qinglong Ma, Peizhi Zhao, Sen Wang, Tao Song

Abstract: Designing universal artificial intelligence (AI) solver for partial differential equations (PDEs) is an open-ended problem and a significant challenge in science and engineering. Currently, data-driven solvers have achieved great success, such as neural operators. However, the ability of various neural operator solvers to learn low-frequency information still needs improvement. In this study, we propose a Deep Parallel Spectral Neural Operator (DPNO) to enhance the ability to learn low-frequency information. Our method enhances the neural operator's ability to learn low-frequency information through parallel modules. In addition, due to the presence of truncation coefficients, some high-frequency information is lost during the nonlinear learning process. We smooth this information through convolutional mappings, thereby reducing high-frequency errors. We selected several challenging partial differential equation datasets for experimentation, and DPNO performed exceptionally well. As a neural operator, DPNO also possesses the capability of resolution invariance.

replace NEAT: Nonlinear Parameter-efficient Adaptation of Pre-trained Models

Authors: Yibo Zhong, Haoxiang Jiang, Lincan Li, Ryumei Nakada, Tianci Liu, Linjun Zhang, Huaxiu Yao, Haoyu Wang

Abstract: Fine-tuning pre-trained models often yields state-of-the-art performance but is computationally expensive when updating all parameters. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address this by freezing pre-trained weights and introducing low-rank matrices. However, because LoRA relies on low-rank decomposition, it struggles to capture complex nonlinear dynamics and optimal optimization trajectories, resulting in a performance gap relative to full fine-tuning and inefficient parameter utilization. To overcome these issues, we propose NEAT, a nonlinear PEFT approach that employs a lightweight neural network to learn a nonlinear transformation of the pre-trained weights, thereby better approximating cumulative weight updates. Our theoretical analysis shows that NEAT achieves greater efficiency than LoRA while maintaining equivalent expressivity. Extensive experiments on four benchmarks and over twenty datasets demonstrate that NEAT significantly outperforms state-of-the-art baselines in both vision and text tasks.

replace Active Learning of Deep Neural Networks via Gradient-Free Cutting Planes

Authors: Erica Zhang, Fangzhao Zhang, Mert Pilanci

Abstract: Active learning methods aim to improve sample complexity in machine learning. In this work, we investigate an active learning scheme via a novel gradient-free cutting-plane training method for ReLU networks of arbitrary depth and develop a convergence theory. We demonstrate, for the first time, that cutting-plane algorithms, traditionally used in linear models, can be extended to deep neural networks despite their nonconvexity and nonlinear decision boundaries. Moreover, this training method induces the first deep active learning scheme known to achieve convergence guarantees, revealing a geometric contraction rate of the feasible set. We exemplify the effectiveness of our proposed active learning method against popular deep active learning baselines via both synthetic data experiments and sentimental classification task on real datasets.

replace Instance-Level Difficulty: A Missing Perspective in Machine Unlearning

Authors: Hammad Rizwan, Mahtab Sarvmaili, Hassan Sajjad, Ga Wu

Abstract: Current research on deep machine unlearning primarily focuses on improving or evaluating the overall effectiveness of unlearning methods while overlooking the varying difficulty of unlearning individual training samples. As a result, the broader feasibility of machine unlearning remains under-explored. This paper studies the cruxes that make machine unlearning difficult through a thorough instance-level unlearning performance analysis over various unlearning algorithms and datasets. In particular, we summarize four factors that make unlearning a data point difficult, and we empirically show that these factors are independent of a specific unlearning algorithm but only relevant to the target model and its training data. Given these findings, we argue that machine unlearning research should pay attention to the instance-level difficulty of unlearning.

replace Compute Or Load KV Cache? Why Not Both?

Authors: Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Z. Morley Mao

Abstract: Large Language Models (LLMs) are increasingly deployed in large-scale online services, enabling sophisticated applications. However, the computational overhead of generating key-value (KV) caches in the prefill stage presents a major bottleneck, particularly for long-context inputs. Prefix caching mitigates this issue by storing KV caches for reuse, reducing redundant computation. Despite its advantages, prefix caching suffers from high latency due to the limited I/O bandwidth of storage devices, constraining inference efficiency. To address this challenge, we introduce Cake, a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel. Cake employs a bidirectional scheduling strategy that dynamically balances KV cache computation and loading, ensuring efficient resource utilization. Additionally, Cake incorporates an adaptive scheduling mechanism that seamlessly integrates with non-prefix caching requests, improving system throughput and adapting to fluctuating resource availabilty. Through extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves on average 2.6x reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. Our findings highlight Cake as an effective and practical solution for optimizing long-context LLM inference, bridging the gap between computation and I/O efficiency in large-scale AI deployments.

replace TimeDART: A Diffusion Autoregressive Transformer for Self-Supervised Time Series Representation

Authors: Daoyu Wang, Mingyue Cheng, Zhiding Liu, Qi Liu, Enhong Chen

Abstract: Self-supervised learning has garnered increasing attention in time series analysis for benefiting various downstream tasks and reducing reliance on labeled data. Despite its effectiveness, existing methods often struggle to comprehensively capture both long-term dynamic evolution and subtle local patterns in a unified manner. In this work, we propose TimeDART, a novel self-supervised time series pre-training framework that unifies two powerful generative paradigms to learn more transferable representations. Specifically, we first employ a causal Transformer encoder, accompanied by a patch-based embedding strategy, to model the evolving trends from left to right. Building on this global modeling, we further introduce a denoising diffusion process to capture fine-grained local patterns through forward diffusion and reverse denoising. Finally, we optimize the model in an autoregressive manner. As a result, TimeDART effectively accounts for both global and local sequence features in a coherent way. We conduct extensive experiments on public datasets for time series forecasting and classification. The experimental results demonstrate that TimeDART consistently outperforms previous compared methods, validating the effectiveness of our approach. Our code is available at https://github.com/Melmaphother/TimeDART.

URLs: https://github.com/Melmaphother/TimeDART.

replace Optimizing Estimators of Squared Calibration Errors in Classification

Authors: Sebastian G. Gruber, Francis Bach

Abstract: In this work, we propose a mean-squared error-based risk that enables the comparison and optimization of estimators of squared calibration errors in practical settings. Improving the calibration of classifiers is crucial for enhancing the trustworthiness and interpretability of machine learning models, especially in sensitive decision-making scenarios. Although various calibration (error) estimators exist in the current literature, there is a lack of guidance on selecting the appropriate estimator and tuning its hyperparameters. By leveraging the bilinear structure of squared calibration errors, we reformulate calibration estimation as a regression problem with independent and identically distributed (i.i.d.) input pairs. This reformulation allows us to quantify the performance of different estimators even for the most challenging calibration criterion, known as canonical calibration. Our approach advocates for a training-validation-testing pipeline when estimating a calibration error on an evaluation dataset. We demonstrate the effectiveness of our pipeline by optimizing existing calibration estimators and comparing them with novel kernel ridge regression-based estimators on standard image classification tasks.

replace Towards Foundation Models for Mixed Integer Linear Programming

Authors: Sirui Li, Janardhan Kulkarni, Ishai Menache, Cathy Wu, Beibin Li

Abstract: Mixed Integer Linear Programming (MILP) is essential for modeling complex decision-making problems but faces challenges in computational tractability and requires expert formulation. Current deep learning approaches for MILP focus on specific problem classes and do not generalize to unseen classes. To address this shortcoming, we take a foundation model training approach, where we train a single deep learning model on a diverse set of MILP problems to generalize across problem classes. As existing datasets for MILP lack diversity and volume, we introduce MILP-Evolve, a novel LLM-based evolutionary framework that is capable of generating a large set of diverse MILP classes with an unlimited amount of instances. We study our methodology on three key learning tasks that capture diverse aspects of MILP: (1) integrality gap prediction, (2) learning to branch, and (3) a new task of aligning MILP instances with natural language descriptions. Our empirical results show that models trained on the data generated by MILP-Evolve achieve significant improvements on unseen problems, including MIPLIB benchmarks. Our work highlights the potential of moving towards a foundation model approach for MILP that can generalize to a broad range of MILP applications. Our code and data are publicly available at https://github.com/microsoft/OptiGuide.

URLs: https://github.com/microsoft/OptiGuide.

replace KLay: Accelerating Arithmetic Circuits for Neurosymbolic AI

Authors: Jaron Maene, Vincent Derkinderen, Pedro Zuidberg Dos Martires

Abstract: A popular approach to neurosymbolic AI involves mapping logic formulas to arithmetic circuits (computation graphs consisting of sums and products) and passing the outputs of a neural network through these circuits. This approach enforces symbolic constraints onto a neural network in a principled and end-to-end differentiable way. Unfortunately, arithmetic circuits are challenging to run on modern AI accelerators as they exhibit a high degree of irregular sparsity. To address this limitation, we introduce knowledge layers (KLay), a new data structure to represent arithmetic circuits that can be efficiently parallelized on GPUs. Moreover, we contribute two algorithms used in the translation of traditional circuit representations to KLay and a further algorithm that exploits parallelization opportunities during circuit evaluations. We empirically show that KLay achieves speedups of multiple orders of magnitude over the state of the art, thereby paving the way towards scaling neurosymbolic AI to larger real-world applications.

replace Generalization Bounds via Meta-Learned Model Representations: PAC-Bayes and Sample Compression Hypernetworks

Authors: Benjamin Leblanc, Mathieu Bazinet, Nathaniel D'Amours, Alexandre Drouin, Pascal Germain

Abstract: PAC-Bayesian and Sample Compress learning frameworks are instrumental for deriving tight (non-vacuous) generalization bounds for neural networks. We leverage these results in a meta-learning scheme, relying on a hypernetwork that outputs the parameters of a downstream predictor from a dataset input. The originality of our approach lies in the investigated hypernetwork architectures that encode the dataset before decoding the parameters: (1) a PAC-Bayesian encoder that expresses a posterior distribution over a latent space, (2) a Sample Compress encoder that selects a small sample of the dataset input along with a message from a discrete set, and (3) a hybrid between both approaches motivated by a new Sample Compress theorem handling continuous messages. The latter theorem exploits the pivotal information transiting at the encoder-decoder junction to compute generalization guarantees for each downstream predictor obtained by our meta-learning scheme.

replace DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents

Authors: Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao

Abstract: On-device control agents, especially on mobile devices, are responsible for operating mobile devices to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control presents significant challenges due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to enhance the efficiency of online RL fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in the context of dynamic online interactions. Additionally, the framework is backed by our tailor-made RL algorithm, which effectively balances exploration with the prioritized utilization of collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3X improvement in training efficiency and enables training data collection 2.4X faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate compared to state-of-the-art methods on general Android tasks from an open benchmark, significantly outperforming existing approaches while maintaining the same training time. These results validate DistRL as a scalable and efficient solution, offering substantial improvements in both training efficiency and agent performance for real-world, in-the-wild device control tasks.

replace SAMG: Offline-to-Online Reinforcement Learning via State-Action-Conditional Offline Model Guidance

Authors: Liyu Zhang, Haochi Wu, Xu Wan, Quan Kong, Ruilong Deng, Mingyang Sun

Abstract: Offline-to-online (O2O) reinforcement learning (RL) pre-trains models on offline data and refines policies through online fine-tuning. However, existing O2O RL algorithms typically require maintaining the tedious offline datasets to mitigate the effects of out-of-distribution (OOD) data, which significantly limits their efficiency in exploiting online samples. To address this deficiency, we introduce a new paradigm for O2O RL called State-Action-Conditional Offline \Model Guidance (SAMG). It freezes the pre-trained offline critic to provide compact offline understanding for each state-action sample, thus eliminating the need for retraining on offline data. The frozen offline critic is incorporated with the online target critic weighted by a state-action-adaptive coefficient. This coefficient aims to capture the offline degree of samples at the state-action level, and is updated adaptively during training. In practice, SAMG could be easily integrated with Q-function-based algorithms. Theoretical analysis shows good optimality and lower estimation error. Empirically, SAMG outperforms state-of-the-art O2O RL algorithms on the D4RL benchmark.

replace Ask, and it shall be given: On the Turing completeness of prompting

Authors: Ruizhong Qiu, Zhe Xu, Wenxuan Bao, Hanghang Tong

Abstract: Since the success of GPT, large language models (LLMs) have been revolutionizing machine learning and have initiated the so-called LLM prompting paradigm. In the era of LLMs, people train a single general-purpose LLM and provide the LLM with different prompts to perform different tasks. However, such empirical success largely lacks theoretical understanding. Here, we present the first theoretical study on the LLM prompting paradigm to the best of our knowledge. In this work, we show that prompting is in fact Turing-complete: there exists a finite-size Transformer such that for any computable function, there exists a corresponding prompt following which the Transformer computes the function. Furthermore, we show that even though we use only a single finite-size Transformer, it can still achieve nearly the same complexity bounds as that of the class of all unbounded-size Transformers. Overall, our result reveals that prompting can enable a single finite-size Transformer to be efficiently universal, which establishes a theoretical underpinning for prompt engineering in practice.

replace Are Neuromorphic Architectures Inherently Privacy-preserving? An Exploratory Study

Authors: Ayana Moshruba, Ihsen Alouani, Maryam Parsa

Abstract: While machine learning (ML) models are becoming mainstream, especially in sensitive application areas, the risk of data leakage has become a growing concern. Attacks like membership inference (MIA) have shown that trained models can reveal sensitive data, jeopardizing confidentiality. While traditional Artificial Neural Networks (ANNs) dominate ML applications, neuromorphic architectures, specifically Spiking Neural Networks (SNNs), are emerging as promising alternatives due to their low power consumption and event-driven processing, akin to biological neurons. Privacy in ANNs is well-studied; however, little work has explored the privacy-preserving properties of SNNs. This paper examines whether SNNs inherently offer better privacy. Using MIAs, we assess the privacy resilience of SNNs versus ANNs across diverse datasets. We analyze the impact of learning algorithms (surrogate gradient and evolutionary), frameworks (snnTorch, TENNLab, LAVA), and parameters on SNN privacy. Our findings show that SNNs consistently outperform ANNs in privacy preservation, with evolutionary algorithms offering additional resilience. For instance, on CIFAR-10, SNNs achieve an AUC of 0.59, significantly lower than ANNs' 0.82, and on CIFAR-100, SNNs maintain an AUC of 0.58 compared to ANNs' 0.88. Additionally, we explore the privacy-utility trade-off with Differentially Private Stochastic Gradient Descent (DPSGD), finding that SNNs sustain less accuracy loss than ANNs under similar privacy constraints.

replace Provable Partially Observable Reinforcement Learning with Privileged Information

Authors: Yang Cai, Xiangyu Liu, Argyris Oikonomou, Kaiqing Zhang

Abstract: Partial observability of the underlying states generally presents significant challenges for reinforcement learning (RL). In practice, certain \emph{privileged information}, e.g., the access to states from simulators, has been exploited in training and has achieved prominent empirical successes. To better understand the benefits of privileged information, we revisit and examine several simple and practically used paradigms in this setting. Specifically, we first formalize the empirical paradigm of \emph{expert distillation} (also known as \emph{teacher-student} learning), demonstrating its pitfall in finding near-optimal policies. We then identify a condition of the partially observable environment, the \emph{deterministic filter condition}, under which expert distillation achieves sample and computational complexities that are \emph{both} polynomial. Furthermore, we investigate another useful empirical paradigm of \emph{asymmetric actor-critic}, and focus on the more challenging setting of observable partially observable Markov decision processes. We develop a belief-weighted asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexities, in which one key component is a new provable oracle for learning belief states that preserve \emph{filter stability} under a misspecified model, which may be of independent interest. Finally, we also investigate the provable efficiency of partially observable multi-agent RL (MARL) with privileged information. We develop algorithms featuring \emph{centralized-training-with-decentralized-execution}, a popular framework in empirical MARL, with polynomial sample and (quasi-)polynomial computational complexities in both paradigms above. Compared with a few recent related theoretical studies, our focus is on understanding practically inspired algorithmic paradigms, without computationally intractable oracles.

replace SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment

Authors: Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan

Abstract: Direct Preference Optimization (DPO) has proven effective in aligning large language models with human preferences but is often constrained to pairwise comparisons -- overlooking additional positive and negative responses that are commonly available in real-world settings. We propose Simultaneous Weighted Preference Optimization (SWEPO), which incorporates multiple responses per query and prioritizes those that deviate most from the average reward. This deviation-based weighting focuses training on the most informative outliers, akin to a built-in curriculum. Theoretically, we prove that such multi-preference sampling lowers alignment bias, bounding the expected deviation from the true acceptable-response distribution at a rate of $\mathcal{O}(\tfrac{1}{\sqrt{k}})$. Empirically, SWEPO outperforms state-of-the-art baselines on the Ultra-Feedback dataset and demonstrates substantial improvements over DPO and InfoNCA, yielding boosts of up to $\sim 4$% on length-controlled win-rate on AlpacaEval.

replace Infinite-dimensional next-generation reservoir computing

Authors: Lyudmila Grigoryeva, Hannah Lim Jing Ting, Juan-Pablo Ortega

Abstract: Next-generation reservoir computing (NG-RC) has attracted much attention due to its excellent performance in spatio-temporal forecasting of complex systems and its ease of implementation. This paper shows that NG-RC can be encoded as a kernel ridge regression that makes training efficient and feasible even when the space of chosen polynomial features is very large. Additionally, an extension to an infinite number of covariates is possible, which makes the methodology agnostic with respect to the lags into the past that are considered as explanatory factors, as well as with respect to the number of polynomial covariates, an important hyperparameter in traditional NG-RC. We show that this approach has solid theoretical backing and good behavior based on kernel universality properties previously established in the literature. Various numerical illustrations show that these generalizations of NG-RC outperform the traditional approach in several forecasting applications.

replace SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training

Authors: Chao Ma, Wenbo Gong, Meyer Scetbon, Edward Meeds

Abstract: Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require to maintain optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and whitening. We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN has the same memory footprint as SGD, achieving $\approx 50\%$ reduction on total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training the LLaMA model with 350M and 1.3B parameters, SWAN achieves a 2x speedup by reaching the same evaluation perplexity using half as many tokens.

replace Virtual Nodes Can Help: Tackling Distribution Shifts in Federated Graph Learning

Authors: Xingbo Fu, Zihan Chen, Yinhan He, Song Wang, Binchi Zhang, Chen Chen, Jundong Li

Abstract: Federated Graph Learning (FGL) enables multiple clients to jointly train powerful graph learning models, e.g., Graph Neural Networks (GNNs), without sharing their local graph data for graph-related downstream tasks, such as graph property prediction. In the real world, however, the graph data can suffer from significant distribution shifts across clients as the clients may collect their graph data for different purposes. In particular, graph properties are usually associated with invariant label-relevant substructures (i.e., subgraphs) across clients, while label-irrelevant substructures can appear in a client-specific manner. The issue of distribution shifts of graph data hinders the efficiency of GNN training and leads to serious performance degradation in FGL. To tackle the aforementioned issue, we propose a novel FGL framework entitled FedVN that eliminates distribution shifts through client-specific graph augmentation strategies with multiple learnable Virtual Nodes (VNs). Specifically, FedVN lets the clients jointly learn a set of shared VNs while training a global GNN model. To eliminate distribution shifts, each client trains a personalized edge generator that determines how the VNs connect local graphs in a client-specific manner. Furthermore, we provide theoretical analyses indicating that FedVN can eliminate distribution shifts of graph data across clients. Comprehensive experiments on four datasets under five settings demonstrate the superiority of our proposed FedVN over nine baselines.

replace A View of the Certainty-Equivalence Method for PAC RL as an Application of the Trajectory Tree Method

Authors: Shivaram Kalyanakrishnan, Sheel Shah, Santhosh Kumar Guguloth

Abstract: Reinforcement learning (RL) enables an agent interacting with an unknown MDP $M$ to optimise its behaviour by observing transitions sampled from $M$. A natural entity that emerges in the agent's reasoning is $\widehat{M}$, the maximum likelihood estimate of $M$ based on the observed transitions. The well-known \textit{certainty-equivalence} method (CEM) dictates that the agent update its behaviour to $\widehat{\pi}$, which is an optimal policy for $\widehat{M}$. Not only is CEM intuitive, it has been shown to enjoy minimax-optimal sample complexity in some regions of the parameter space for PAC RL with a generative model~\citep{Agarwal2020GenModel}. A seemingly unrelated algorithm is the ``trajectory tree method'' (TTM)~\citep{Kearns+MN:1999}, originally developed for efficient decision-time planning in large POMDPs. This paper presents a theoretical investigation that stems from the surprising finding that CEM may indeed be viewed as an application of TTM. The qualitative benefits of this view are (1) new and simple proofs of sample complexity upper bounds for CEM, in fact under a (2) weaker assumption on the rewards than is prevalent in the current literature. Our analysis applies to both non-stationary and stationary MDPs. Quantitatively, we obtain (3) improvements in the sample-complexity upper bounds for CEM both for non-stationary and stationary MDPs, in the regime that the ``mistake probability'' $\delta$ is small. Additionally, we show (4) a lower bound on the sample complexity for finite-horizon MDPs, which establishes the minimax-optimality of our upper bound for non-stationary MDPs in the small-$\delta$ regime.

replace SpinSVAR: Estimating Structural Vector Autoregression Assuming Sparse Input

Authors: Panagiotis Misiakos, Markus P\"uschel

Abstract: We introduce SpinSVAR, a novel method for estimating a structural vector autoregression (SVAR) from time-series data under sparse input assumption. Unlike prior approaches using Gaussian noise, we model the input as independent Laplacian variables, enforcing sparsity and yielding a maximum likelihood estimator (MLE) based on least absolute error regression. We provide theoretical consistency guarantees for the MLE under mild assumptions. SpinSVAR is efficient: it can leverage GPU acceleration to scale to thousands of nodes. On synthetic data with Laplacian or Bernoulli-uniform inputs, SpinSVAR outperforms state-of-the-art methods in accuracy and runtime. When applied to S&P 500 data, it clusters stocks by sectors and identifies significant structural shocks linked to major price movements, demonstrating the viability of our sparse input assumption.

replace Humanity's Last Exam

Authors: Long Phan (Michael Pokorny), Alice Gatti (Michael Pokorny), Ziwen Han (Michael Pokorny), Nathaniel Li (Michael Pokorny), Josephina Hu (Michael Pokorny), Hugh Zhang (Michael Pokorny), Chen Bo Calvin Zhang (Michael Pokorny), Mohamed Shaaban (Michael Pokorny), John Ling (Michael Pokorny), Sean Shi (Michael Pokorny), Michael Choi (Michael Pokorny), Anish Agrawal (Michael Pokorny), Arnav Chopra (Michael Pokorny), Adam Khoja (Michael Pokorny), Ryan Kim (Michael Pokorny), Richard Ren (Michael Pokorny), Jason Hausenloy (Michael Pokorny), Oliver Zhang (Michael Pokorny), Mantas Mazeika (Michael Pokorny), Tung Nguyen (Michael Pokorny), Daron Anderson (Michael Pokorny), Imad Ali Shah (Michael Pokorny), Mikhail Doroshenko (Michael Pokorny), Alun Cennyth Stokes (Michael Pokorny), Mobeen Mahmood (Michael Pokorny), Jaeho Lee (Michael Pokorny), Oleksandr Pokutnyi (Michael Pokorny), Oleg Iskra (Michael Pokorny), Jessica P. Wang (Michael Pokorny), Robert Gerbicz (Michael Pokorny), John-Clark Levin (Michael Pokorny), Serguei Popov (Michael Pokorny), Fiona Feng (Michael Pokorny), Steven Y. Feng (Michael Pokorny), Haoran Zhao (Michael Pokorny), Michael Yu (Michael Pokorny), Varun Gangal (Michael Pokorny), Chelsea Zou (Michael Pokorny), Zihan Wang (Michael Pokorny), Mstyslav Kazakov (Michael Pokorny), Geoff Galgon (Michael Pokorny), Johannes Schmitt (Michael Pokorny), Alvaro Sanchez (Michael Pokorny), Yongki Lee (Michael Pokorny), Will Yeadon (Michael Pokorny), Scott Sauers (Michael Pokorny), Marc Roth (Michael Pokorny), Chidozie Agu (Michael Pokorny), S{\o}ren Riis (Michael Pokorny), Fabian Giska (Michael Pokorny), Saiteja Utpala (Michael Pokorny), Antrell Cheatom (Michael Pokorny), Zachary Giboney (Michael Pokorny), Gashaw M. Goshu (Michael Pokorny), Sarah-Jane Crowson (Michael Pokorny), Mohinder Maheshbhai Naiya (Michael Pokorny), Noah Burns (Michael Pokorny), Lennart Finke (Michael Pokorny), Zerui Cheng (Michael Pokorny), Hyunwoo Park (Michael Pokorny), Francesco Fournier-Facio (Michael Pokorny), Jennifer Zampese (Michael Pokorny), John B. Wydallis (Michael Pokorny), Ryan G. Hoerr (Michael Pokorny), Mark Nandor (Michael Pokorny), Tim Gehrunger (Michael Pokorny), Jiaqi Cai (Michael Pokorny), Ben McCarty (Michael Pokorny), Jungbae Nam (Michael Pokorny), Edwin Taylor (Michael Pokorny), Jun Jin (Michael Pokorny), Gautier Abou Loume (Michael Pokorny), Hangrui Cao (Michael Pokorny), Alexis C Garretson (Michael Pokorny), Damien Sileo (Michael Pokorny), Qiuyu Ren (Michael Pokorny), Doru Cojoc (Michael Pokorny), Pavel Arkhipov (Michael Pokorny), Usman Qazi (Michael Pokorny), Aras Bacho (Michael Pokorny), Lianghui Li (Michael Pokorny), Sumeet Motwani (Michael Pokorny), Christian Schroeder de Witt (Michael Pokorny), Alexei Kopylov (Michael Pokorny), Johannes Veith (Michael Pokorny), Eric Singer (Michael Pokorny), Paolo Rissone (Michael Pokorny), Jaehyeok Jin (Michael Pokorny), Jack Wei Lun Shi (Michael Pokorny), Chris G. Willcocks (Michael Pokorny), Ameya Prabhu (Michael Pokorny), Longke Tang (Michael Pokorny), Kevin Zhou (Michael Pokorny), Emily de Oliveira Santos (Michael Pokorny), Andrey Pupasov Maksimov (Michael Pokorny), Edward Vendrow (Michael Pokorny), Kengo Zenitani (Michael Pokorny), Joshua Robinson (Michael Pokorny), Aleksandar Mikov (Michael Pokorny), Julien Guillod (Michael Pokorny), Yuqi Li (Michael Pokorny), Ben Pageler (Michael Pokorny), Joshua Vendrow (Michael Pokorny), Vladyslav Kuchkin (Michael Pokorny), Pierre Marion (Michael Pokorny), Denis Efremov (Michael Pokorny), Jayson Lynch (Michael Pokorny), Kaiqu Liang (Michael Pokorny), Andrew Gritsevskiy (Michael Pokorny), Dakotah Martinez (Michael Pokorny), Nick Crispino (Michael Pokorny), Dimitri Zvonkine (Michael Pokorny), Natanael Wildner Fraga (Michael Pokorny), Saeed Soori (Michael Pokorny), Ori Press (Michael Pokorny), Henry Tang (Michael Pokorny), Julian Salazar (Michael Pokorny), Sean R. Green (Michael Pokorny), Lina Br\"ussel (Michael Pokorny), Moon Twayana (Michael Pokorny), Aymeric Dieuleveut (Michael Pokorny), T. Ryan Rogers (Michael Pokorny), Wenjin Zhang (Michael Pokorny), Ross Finocchio (Michael Pokorny), Bikun Li (Michael Pokorny), Jinzhou Yang (Michael Pokorny), Arun Rao (Michael Pokorny), Gabriel Loiseau (Michael Pokorny), Mikhail Kalinin (Michael Pokorny), Marco Lukas (Michael Pokorny), Ciprian Manolescu (Michael Pokorny), Nate Stambaugh (Michael Pokorny), Subrata Mishra (Michael Pokorny), Ariel Ghislain Kemogne Kamdoum (Michael Pokorny), Tad Hogg (Michael Pokorny), Alvin Jin (Michael Pokorny), Carlo Bosio (Michael Pokorny), Gongbo Sun (Michael Pokorny), Brian P Coppola (Michael Pokorny), Haline Heidinger (Michael Pokorny), Rafael Sayous (Michael Pokorny), Stefan Ivanov (Michael Pokorny), Joseph M Cavanagh (Michael Pokorny), Jiawei Shen (Michael Pokorny), Joseph Marvin Imperial (Michael Pokorny), Philippe Schwaller (Michael Pokorny), Shaipranesh Senthilkuma (Michael Pokorny), Andres M Bran (Michael Pokorny), Andres Algaba (Michael Pokorny), Brecht Verbeken (Michael Pokorny), Kelsey Van den Houte (Michael Pokorny), Lynn Van Der Sypt (Michael Pokorny), David Noever (Michael Pokorny), Lisa Schut (Michael Pokorny), Ilia Sucholutsky (Michael Pokorny), Evgenii Zheltonozhskii (Michael Pokorny), Qiaochu Yuan (Michael Pokorny), Derek Lim (Michael Pokorny), Richard Stanley (Michael Pokorny), Shankar Sivarajan (Michael Pokorny), Tong Yang (Michael Pokorny), John Maar (Michael Pokorny), Julian Wykowski (Michael Pokorny), Mart\'i Oller (Michael Pokorny), Jennifer Sandlin (Michael Pokorny), Anmol Sahu (Michael Pokorny), Cesare Giulio Ardito (Michael Pokorny), Yuzheng Hu (Michael Pokorny), Felipe Meneguitti Dias (Michael Pokorny), Tobias Kreiman (Michael Pokorny), Kaivalya Rawal (Michael Pokorny), Tobias Garcia Vilchis (Michael Pokorny), Yuexuan Zu (Michael Pokorny), Martin Lackner (Michael Pokorny), James Koppel (Michael Pokorny), Jeremy Nguyen (Michael Pokorny), Daniil S. Antonenko (Michael Pokorny), Steffi Chern (Michael Pokorny), Bingchen Zhao (Michael Pokorny), Pierrot Arsene (Michael Pokorny), Sergey Ivanov (Michael Pokorny), Rafa{\l} Po\'swiata (Michael Pokorny), Chenguang Wang (Michael Pokorny), Daofeng Li (Michael Pokorny), Donato Crisostomi (Michael Pokorny), Ali Dehghan (Michael Pokorny), Andrea Achilleos (Michael Pokorny), John Arnold Ambay (Michael Pokorny), Benjamin Myklebust (Michael Pokorny), Archan Sen (Michael Pokorny), David Perrella (Michael Pokorny), Nurdin Kaparov (Michael Pokorny), Mark H Inlow (Michael Pokorny), Allen Zang (Michael Pokorny), Kalyan Ramakrishnan (Michael Pokorny), Daniil Orel (Michael Pokorny), Vladislav Poritski (Michael Pokorny), Shalev Ben-David (Michael Pokorny), Zachary Berger (Michael Pokorny), Parker Whitfill (Michael Pokorny), Michael Foster (Michael Pokorny), Daniel Munro (Michael Pokorny), Linh Ho (Michael Pokorny), Dan Bar Hava (Michael Pokorny), Aleksey Kuchkin (Michael Pokorny), Robert Lauff (Michael Pokorny), David Holmes (Michael Pokorny), Frank Sommerhage (Michael Pokorny), Anji Zhang (Michael Pokorny), Richard Moat (Michael Pokorny), Keith Schneider (Michael Pokorny), Daniel Pyda (Michael Pokorny), Zakayo Kazibwe (Michael Pokorny), Mukhwinder Singh (Michael Pokorny), Don Clarke (Michael Pokorny), Dae Hyun Kim (Michael Pokorny), Sara Fish (Michael Pokorny), Veit Elser (Michael Pokorny), Victor Efren Guadarrama Vilchis (Michael Pokorny), Immo Klose (Michael Pokorny), Christoph Demian (Michael Pokorny), Ujjwala Anantheswaran (Michael Pokorny), Adam Zweiger (Michael Pokorny), Guglielmo Albani (Michael Pokorny), Jeffery Li (Michael Pokorny), Nicolas Daans (Michael Pokorny), Maksim Radionov (Michael Pokorny), V\'aclav Rozho\v{n} (Michael Pokorny), Vincent Ginis (Michael Pokorny), Ziqiao Ma (Michael Pokorny), Christian Stump (Michael Pokorny), Jacob Platnick (Michael Pokorny), Volodymyr Nevirkovets (Michael Pokorny), Luke Basler (Michael Pokorny), Marco Piccardo (Michael Pokorny), Niv Cohen (Michael Pokorny), Virendra Singh (Michael Pokorny), Josef Tkadlec (Michael Pokorny), Paul Rosu (Michael Pokorny), Alan Goldfarb (Michael Pokorny), Piotr Padlewski (Michael Pokorny), Stanislaw Barzowski (Michael Pokorny), Kyle Montgomery (Michael Pokorny), Aline Menezes (Michael Pokorny), Arkil Patel (Michael Pokorny), Zixuan Wang (Michael Pokorny), Jamie Tucker-Foltz (Michael Pokorny), Jack Stade (Michael Pokorny), Declan Grabb (Michael Pokorny), Tom Goertzen (Michael Pokorny), Fereshteh Kazemi (Michael Pokorny), Jeremiah Milbauer (Michael Pokorny), Abhishek Shukla (Michael Pokorny), Hossam Elgnainy (Michael Pokorny), Yan Carlos Leyva Labrador (Michael Pokorny), Hao He (Michael Pokorny), Ling Zhang (Michael Pokorny), Alan Givr\'e (Michael Pokorny), Hew Wolff (Michael Pokorny), G\"ozdenur Demir (Michael Pokorny), Muhammad Fayez Aziz (Michael Pokorny), Younesse Kaddar (Michael Pokorny), Ivar \"Angquist (Michael Pokorny), Yanxu Chen (Michael Pokorny), Elliott Thornley (Michael Pokorny), Robin Zhang (Michael Pokorny), Jiayi Pan (Michael Pokorny), Antonio Terpin (Michael Pokorny), Niklas Muennighoff (Michael Pokorny), Hailey Schoelkopf (Michael Pokorny), Eric Zheng (Michael Pokorny), Avishy Carmi (Michael Pokorny), Jainam Shah (Michael Pokorny), Ethan D. L. Brown (Michael Pokorny), Kelin Zhu (Michael Pokorny), Max Bartolo (Michael Pokorny), Richard Wheeler (Michael Pokorny), Andrew Ho (Michael Pokorny), Shaul Barkan (Michael Pokorny), Jiaqi Wang (Michael Pokorny), Martin Stehberger (Michael Pokorny), Egor Kretov (Michael Pokorny), Peter Bradshaw (Michael Pokorny), JP Heimonen (Michael Pokorny), Kaustubh Sridhar (Michael Pokorny), Zaki Hossain (Michael Pokorny), Ido Akov (Michael Pokorny), Yury Makarychev (Michael Pokorny), Joanna Tam (Michael Pokorny), Hieu Hoang (Michael Pokorny), David M. Cunningham (Michael Pokorny), Vladimir Goryachev (Michael Pokorny), Demosthenes Patramanis (Michael Pokorny), Michael Krause (Michael Pokorny), Andrew Redenti (Michael Pokorny), David Aldous (Michael Pokorny), Jesyin Lai (Michael Pokorny), Shannon Coleman (Michael Pokorny), Jiangnan Xu (Michael Pokorny), Sangwon Lee (Michael Pokorny), Ilias Magoulas (Michael Pokorny), Sandy Zhao (Michael Pokorny), Ning Tang (Michael Pokorny), Michael K. Cohen (Michael Pokorny), Micah Carroll (Michael Pokorny), Orr Paradise (Michael Pokorny), Jan Hendrik Kirchner (Michael Pokorny), Stefan Steinerberger (Michael Pokorny), Maksym Ovchynnikov (Michael Pokorny), Jason O. Matos (Michael Pokorny), Adithya Shenoy (Michael Pokorny), Michael Wang (Michael Pokorny), Yuzhou Nie (Michael Pokorny), Paolo Giordano (Michael Pokorny), Philipp Petersen (Michael Pokorny), Anna Sztyber-Betley (Michael Pokorny), Paolo Faraboschi (Michael Pokorny), Robin Riblet (Michael Pokorny), Jonathan Crozier (Michael Pokorny), Shiv Halasyamani (Michael Pokorny), Antonella Pinto (Michael Pokorny), Shreyas Verma (Michael Pokorny), Prashant Joshi (Michael Pokorny), Eli Meril (Michael Pokorny), Zheng-Xin Yong (Michael Pokorny), Allison Tee (Michael Pokorny), J\'er\'emy Andr\'eoletti (Michael Pokorny), Orion Weller (Michael Pokorny), Raghav Singhal (Michael Pokorny), Gang Zhang (Michael Pokorny), Alexander Ivanov (Michael Pokorny), Seri Khoury (Michael Pokorny), Nils Gustafsson (Michael Pokorny), Hamid Mostaghimi (Michael Pokorny), Kunvar Thaman (Michael Pokorny), Qijia Chen (Michael Pokorny), Tran Quoc Kh\'anh (Michael Pokorny), Jacob Loader (Michael Pokorny), Stefano Cavalleri (Michael Pokorny), Hannah Szlyk (Michael Pokorny), Zachary Brown (Michael Pokorny), Himanshu Narayan (Michael Pokorny), Jonathan Roberts (Michael Pokorny), William Alley (Michael Pokorny), Kunyang Sun (Michael Pokorny), Ryan Stendall (Michael Pokorny), Max Lamparth (Michael Pokorny), Anka Reuel (Michael Pokorny), Ting Wang (Michael Pokorny), Hanmeng Xu (Michael Pokorny), Pablo Hern\'andez-C\'amara (Michael Pokorny), Freddie Martin (Michael Pokorny), Thomas Preu (Michael Pokorny), Tomek Korbak (Michael Pokorny), Marcus Abramovitch (Michael Pokorny), Dominic Williamson (Michael Pokorny), Ida Bosio (Michael Pokorny), Ziye Chen (Michael Pokorny), Bir\'o B\'alint (Michael Pokorny), Eve J. Y. Lo (Michael Pokorny), Maria In\^es S. Nunes (Michael Pokorny), Yibo Jiang (Michael Pokorny), M Saiful Bari (Michael Pokorny), Peyman Kassani (Michael Pokorny), Zihao Wang (Michael Pokorny), Behzad Ansarinejad (Michael Pokorny), Yewen Sun (Michael Pokorny), Stephane Durand (Michael Pokorny), Guillaume Douville (Michael Pokorny), Daniel Tordera (Michael Pokorny), George Balabanian (Michael Pokorny), Earth Anderson (Michael Pokorny), Lynna Kvistad (Michael Pokorny), Alejandro Jos\'e Moyano (Michael Pokorny), Hsiaoyun Milliron (Michael Pokorny), Ahmad Sakor (Michael Pokorny), Murat Eron (Michael Pokorny), Isaac C. McAlister (Michael Pokorny), Andrew Favre D. O. (Michael Pokorny), Shailesh Shah (Michael Pokorny), Xiaoxiang Zhou (Michael Pokorny), Firuz Kamalov (Michael Pokorny), Ronald Clark (Michael Pokorny), Sherwin Abdoli (Michael Pokorny), Tim Santens (Michael Pokorny), Harrison K Wang (Michael Pokorny), Evan Chen (Michael Pokorny), Alessandro Tomasiello (Michael Pokorny), G. Bruno De Luca (Michael Pokorny), Shi-Zhuo Looi (Michael Pokorny), Vinh-Kha Le (Michael Pokorny), Noam Kolt (Michael Pokorny), Niels M\"undler (Michael Pokorny), Avi Semler (Michael Pokorny), Emma Rodman (Michael Pokorny), Jacob Drori (Michael Pokorny), Carl J Fossum (Michael Pokorny), Luk Gloor (Michael Pokorny), Milind Jagota (Michael Pokorny), Ronak Pradeep (Michael Pokorny), Honglu Fan (Michael Pokorny), Tej Shah (Michael Pokorny), Jonathan Eicher (Michael Pokorny), Michael Chen (Michael Pokorny), Kushal Thaman (Michael Pokorny), William Merrill (Michael Pokorny), Moritz Firsching (Michael Pokorny), Carter Harris (Michael Pokorny), Stefan Ciob\^ac\u{a} (Michael Pokorny), Jason Gross (Michael Pokorny), Rohan Pandey (Michael Pokorny), Ilya Gusev (Michael Pokorny), Adam Jones (Michael Pokorny), Shashank Agnihotri (Michael Pokorny), Pavel Zhelnov (Michael Pokorny), Siranut Usawasutsakorn (Michael Pokorny), Mohammadreza Mofayezi (Michael Pokorny), Alexander Piperski (Michael Pokorny), Marc Carauleanu (Michael Pokorny), David K. Zhang (Michael Pokorny), Kostiantyn Dobarskyi (Michael Pokorny), Dylan Ler (Michael Pokorny), Roman Leventov (Michael Pokorny), Ignat Soroko (Michael Pokorny), Thorben Jansen (Michael Pokorny), Scott Creighton (Michael Pokorny), Pascal Lauer (Michael Pokorny), Joshua Duersch (Michael Pokorny), Vage Taamazyan (Michael Pokorny), Dario Bezzi (Michael Pokorny), Wiktor Morak (Michael Pokorny), Wenjie Ma (Michael Pokorny), William Held (Michael Pokorny), Tran {\DJ}uc Huy (Michael Pokorny), Ruicheng Xian (Michael Pokorny), Armel Randy Zebaze (Michael Pokorny), Mohanad Mohamed (Michael Pokorny), Julian Noah Leser (Michael Pokorny), Michelle X Yuan (Michael Pokorny), Laila Yacar (Michael Pokorny), Johannes Lengler (Michael Pokorny), Katarzyna Olszewska (Michael Pokorny), Hossein Shahrtash (Michael Pokorny), Edson Oliveira (Michael Pokorny), Joseph W. Jackson (Michael Pokorny), Daniel Espinosa Gonzalez (Michael Pokorny), Andy Zou (Michael Pokorny), Muthu Chidambaram (Michael Pokorny), Timothy Manik (Michael Pokorny), Hector Haffenden (Michael Pokorny), Dashiell Stander (Michael Pokorny), Ali Dasouqi (Michael Pokorny), Alexander Shen (Michael Pokorny), Emilien Duc (Michael Pokorny), Bita Golshani (Michael Pokorny), David Stap (Michael Pokorny), Mikalai Uzhou (Michael Pokorny), Alina Borisovna Zhidkovskaya (Michael Pokorny), Lukas Lewark (Michael Pokorny), Miguel Orbegozo Rodriguez (Michael Pokorny), M\'aty\'as Vincze (Michael Pokorny), Dustin Wehr (Michael Pokorny), Colin Tang (Michael Pokorny), Shaun Phillips (Michael Pokorny), Fortuna Samuele (Michael Pokorny), Jiang Muzhen (Michael Pokorny), Fredrik Ekstr\"om (Michael Pokorny), Angela Hammon (Michael Pokorny), Oam Patel (Michael Pokorny), Faraz Farhidi (Michael Pokorny), George Medley (Michael Pokorny), Forough Mohammadzadeh (Michael Pokorny), Madellene Pe\~naflor (Michael Pokorny), Haile Kassahun (Michael Pokorny), Alena Friedrich (Michael Pokorny), Claire Sparrow (Michael Pokorny), Rayner Hernandez Perez (Michael Pokorny), Taom Sakal (Michael Pokorny), Omkar Dhamane (Michael Pokorny), Ali Khajegili Mirabadi (Michael Pokorny), Eric Hallman (Michael Pokorny), Kenchi Okutsu (Michael Pokorny), Mike Battaglia (Michael Pokorny), Mohammad Maghsoudimehrabani (Michael Pokorny), Alon Amit (Michael Pokorny), Dave Hulbert (Michael Pokorny), Roberto Pereira (Michael Pokorny), Simon Weber (Michael Pokorny), Handoko (Michael Pokorny), Anton Peristyy (Michael Pokorny), Stephen Malina (Michael Pokorny), Samuel Albanie (Michael Pokorny), Will Cai (Michael Pokorny), Mustafa Mehkary (Michael Pokorny), Rami Aly (Michael Pokorny), Frank Reidegeld (Michael Pokorny), Anna-Katharina Dick (Michael Pokorny), Cary Friday (Michael Pokorny), Jasdeep Sidhu (Michael Pokorny), Hassan Shapourian (Michael Pokorny), Wanyoung Kim (Michael Pokorny), Mariana Costa (Michael Pokorny), Hubeyb Gurdogan (Michael Pokorny), Brian Weber (Michael Pokorny), Harsh Kumar (Michael Pokorny), Tong Jiang (Michael Pokorny), Arunim Agarwal (Michael Pokorny), Chiara Ceconello (Michael Pokorny), Warren S. Vaz (Michael Pokorny), Chao Zhuang (Michael Pokorny), Haon Park (Michael Pokorny), Andrew R. Tawfeek (Michael Pokorny), Daattavya Aggarwal (Michael Pokorny), Michael Kirchhof (Michael Pokorny), Linjie Dai (Michael Pokorny), Evan Kim (Michael Pokorny), Johan Ferret (Michael Pokorny), Yuzhou Wang (Michael Pokorny), Minghao Yan (Michael Pokorny), Krzysztof Burdzy (Michael Pokorny), Lixin Zhang (Michael Pokorny), Antonio Franca (Michael Pokorny), Diana T. Pham (Michael Pokorny), Kang Yong Loh (Michael Pokorny), Joshua Robinson (Michael Pokorny), Abram Jackson (Michael Pokorny), Shreen Gul (Michael Pokorny), Gunjan Chhablani (Michael Pokorny), Zhehang Du (Michael Pokorny), Adrian Cosma (Michael Pokorny), Jesus Colino (Michael Pokorny), Colin White (Michael Pokorny), Jacob Votava (Michael Pokorny), Vladimir Vinnikov (Michael Pokorny), Ethan Delaney (Michael Pokorny), Petr Spelda (Michael Pokorny), Vit Stritecky (Michael Pokorny), Syed M. Shahid (Michael Pokorny), Jean-Christophe Mourrat (Michael Pokorny), Lavr Vetoshkin (Michael Pokorny), Koen Sponselee (Michael Pokorny), Renas Bacho (Michael Pokorny), Florencia de la Rosa (Michael Pokorny), Xiuyu Li (Michael Pokorny), Guillaume Malod (Michael Pokorny), Leon Lang (Michael Pokorny), Julien Laurendeau (Michael Pokorny), Dmitry Kazakov (Michael Pokorny), Fatimah Adesanya (Michael Pokorny), Julien Portier (Michael Pokorny), Lawrence Hollom (Michael Pokorny), Victor Souza (Michael Pokorny), Yuchen Anna Zhou (Michael Pokorny), Julien Degorre (Michael Pokorny), Yi\u{g}it Yal{\i}n (Michael Pokorny), Gbenga Daniel Obikoya (Michael Pokorny), Luca Arnaboldi (Michael Pokorny), Rai (Michael Pokorny), Filippo Bigi (Quinn), M. C. Bosc\'a (Quinn), Oleg Shumar (Quinn), Kaniuar Bacho (Quinn), Pierre Clavier (Quinn), Gabriel Recchia (Quinn), Mara Popescu (Quinn), Nikita Shulga (Quinn), Ngefor Mildred Tanwie (Quinn), Denis Peskoff (Quinn), Thomas C. H. Lux (Quinn), Ben Rank (Quinn), Colin Ni (Quinn), Matthew Brooks (Quinn), Alesia Yakimchyk (Quinn), Huanxu (Quinn), Liu (Tony), Olle H\"aggstr\"om (Tony), Emil Verkama (Tony), Hans Gundlach (Tony), Leonor Brito-Santana (Tony), Brian Amaro (Tony), Vivek Vajipey (Tony), Rynaa Grover (Tony), Yiyang Fan (Tony), Gabriel Poesia Reis e Silva (Tony), Linwei Xin (Tony), Yosi Kratish (Tony), Jakub {\L}ucki (Tony), Wen-Ding Li (Tony), Sivakanth Gopi (Tony), Andrea Caciolai (Tony), Justin Xu (Tony), Kevin Joseph Scaria (Tony), Freddie Vargus (Tony), Farzad Habibi (Tony), Long (Tony), Lian, Emanuele Rodol\`a, Jules Robins, Vincent Cheng, Tony Fruhauff, Brad Raynor, Hao Qi, Xi Jiang, Ben Segev, Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Michael P. Brenner, Mao Mao, Xinyu Zhang, David Avagian, Eshawn Jessica Scipio, Alon Ragoler, Justin Tan, Blake Sims, Rebeka Plecnik, Aaron Kirtland, Omer Faruk Bodur, D. P. Shinde, Zahra Adoul, Mohamed Zekry, Ali Karakoc, Tania C. B. Santos, Samir Shamseldeen, Loukmane Karim, Anna Liakhovitskaia, Nate Resman, Nicholas Farina, Juan Carlos Gonzalez, Gabe Maayan, Sarah Hoback, Rodrigo De Oliveira Pena, Glen Sherman, Elizabeth Kelley, Hodjat Mariji, Rasoul Pouriamanesh, Wentao Wu, Sandra Mendoza, Ismail Alarab, Joshua Cole, Danyelle Ferreira, Bryan Johnson, Mohammad Safdari, Liangti Dai, Siriphan Arthornthurasuk, Alexey Pronin, Jing Fan, Angel Ramirez-Trinidad, Ashley Cartwright, Daphiny Pottmaier, Omid Taheri, David Outevsky, Stanley Stepanic, Samuel Perry, Luke Askew, Ra\'ul Adri\'an Huerta Rodr\'iguez, Ali M. R. Minissi, Sam Ali, Ricardo Lorena, Krishnamurthy Iyer, Arshad Anil Fasiludeen, Sk Md Salauddin, Murat Islam, Juan Gonzalez, Josh Ducey, Maja Somrak, Vasilios Mavroudis, Eric Vergo, Juehang Qin, Benj\'amin Borb\'as, Eric Chu, Jack Lindsey, Anil Radhakrishnan, Antoine Jallon, I. M. J. McInnis, Pawan Kumar, Laxman Prasad Goswami, Daniel Bugas, Nasser Heydari, Ferenc Jeanplong, Archimedes Apronti, Abdallah Galal, Ng Ze-An, Ankit Singh, Joan of Arc Xavier, Kanu Priya Agarwal, Mohammed Berkani, Benedito Alves de Oliveira Junior, Dmitry Malishev, Nicolas Remy, Taylor D. Hartman, Tim Tarver, Stephen Mensah, Javier Gimenez, Roselynn Grace Montecillo, Russell Campbell, Asankhaya Sharma, Khalida Meer, Xavier Alapont, Deepakkumar Patil, Rajat Maheshwari, Abdelkader Dendane, Priti Shukla, Sergei Bogdanov, S\"oren M\"oller, Muhammad Rehan Siddiqi, Prajvi Saxena, Himanshu Gupta, Innocent Enyekwe, Ragavendran P V, Zienab EL-Wasif, Aleksandr Maksapetyan, Vivien Rossbach, Chris Harjadi, Mohsen Bahaloohoreh, Song Bian, John Lai, Justine Leon Uro, Greg Bateman, Mohamed Sayed, Ahmed Menshawy, Darling Duclosel, Yashaswini Jain, Ashley Aaron, Murat Tiryakioglu, Sheeshram Siddh, Keith Krenek, Alex Hoover, Joseph McGowan, Tejal Patwardhan, Summer Yue, Alexandr Wang, Dan Hendrycks

Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,700 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

URLs: https://lastexam.ai.

replace Adaptive Prompt: Unlocking the Power of Visual Prompt Tuning

Authors: Minh Le, Anh Nguyen, Huy Nguyen, Chau Nguyen, Nhat Ho

Abstract: Visual Prompt Tuning (VPT) has recently emerged as a powerful method for adapting pre-trained vision models to downstream tasks. By introducing learnable prompt tokens as task-specific instructions, VPT effectively guides pre-trained transformer models with minimal overhead. Despite its empirical success, a comprehensive theoretical understanding of VPT remains an active area of research. Building on recent insights into the connection between mixture of experts and prompt-based approaches, we identify a key limitation in VPT: the restricted functional expressiveness in prompt formulation. To address this limitation, we propose Visual Adaptive Prompt Tuning (VAPT), a new generation of prompts that redefines prompts as adaptive functions of the input. Our theoretical analysis shows that this simple yet intuitive approach achieves optimal sample efficiency. Empirical results on VTAB-1K and FGVC further demonstrate VAPT's effectiveness, with performance gains of 7.34% and 1.04% over fully fine-tuning baselines, respectively. Notably, VAPT also surpasses VPT by a substantial margin while using fewer parameters. These results highlight both the effectiveness and efficiency of our method and pave the way for future research to explore the potential of adaptive prompts. Our code is publicly available at https://github.com/Minhchuyentoancbn/VAPT

URLs: https://github.com/Minhchuyentoancbn/VAPT

replace Linear $Q$-Learning Does Not Diverge in $L^2$: Convergence Rates to a Bounded Set

Authors: Xinyu Liu, Zixuan Xie, Shangtong Zhang

Abstract: $Q$-learning is one of the most fundamental reinforcement learning algorithms. It is widely believed that $Q$-learning with linear function approximation (i.e., linear $Q$-learning) suffers from possible divergence until the recent work Meyn (2024) which establishes the ultimate almost sure boundedness of the iterates of linear $Q$-learning. Building on this success, this paper further establishes the first $L^2$ convergence rate of linear $Q$-learning iterates (to a bounded set). Similar to Meyn (2024), we do not make any modification to the original linear $Q$-learning algorithm, do not make any Bellman completeness assumption, and do not make any near-optimality assumption on the behavior policy. All we need is an $\epsilon$-softmax behavior policy with an adaptive temperature. The key to our analysis is the general result of stochastic approximations under Markovian noise with fast-changing transition functions. As a side product, we also use this general result to establish the $L^2$ convergence rate of tabular $Q$-learning with an $\epsilon$-softmax behavior policy, for which we rely on a novel pseudo-contraction property of the weighted Bellman optimality operator.

replace Stein Discrepancy for Unsupervised Domain Adaptation

Authors: Anneke von Seeger, Dongmian Zou, Gilad Lerman

Abstract: Unsupervised domain adaptation (UDA) leverages information from a labeled source dataset to improve accuracy on a related but unlabeled target dataset. A common approach to UDA is aligning representations from the source and target domains by minimizing the distance between their data distributions. Previous methods have employed distances such as Wasserstein distance and maximum mean discrepancy. However, these approaches are less effective when the target data is significantly scarcer than the source data. Stein discrepancy is an asymmetric distance between distributions that relies on one distribution only through its score function. In this paper, we propose a novel UDA method that uses Stein discrepancy to measure the distance between source and target domains. We develop a learning framework using both non-kernelized and kernelized Stein discrepancy. Theoretically, we derive an upper bound for the generalization error. Numerical experiments show that our method outperforms existing methods using other domain discrepancy measures when only small amounts of target data are available.

replace Harnessing omnipresent oscillator networks as computational resource

Authors: Thomas Geert de Jong, Hirofumi Notsu, Kohei Nakajima

Abstract: Nature is pervaded with oscillatory behavior. In networks of coupled oscillators patterns can arise when the system synchronizes to an external input. Hence, these networks provide processing and memory of input. We present a universal framework for harnessing oscillator networks as computational resource. This reservoir computing framework is introduced by the ubiquitous model for phase-locking, the Kuramoto model. We force the Kuramoto model by a nonlinear target-system, then after substituting the target-system with a trained feedback-loop it emulates the target-system. Our results are two-fold. Firstly, the trained network inherits performance properties of the Kuramoto model, where all-to-all coupling is performed in linear time with respect to the number of nodes and parameters for synchronization are abundant. Secondly, the learning capabilities of the oscillator network can be explained using Kuramoto model's order parameter. This work provides the foundation for utilizing nature's oscillator networks as a new class of information processing systems.

replace DarwinLM: Evolutionary Structured Pruning of Large Language Models

Authors: Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh

Abstract: Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for \emph{non-uniform} model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose \sysname, a method for \emph{training-aware} structured pruning. \sysname builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, \sysname surpasses ShearedLlama while requiring $5\times$ less training data during post-compression training. Code is at: https://github.com/IST-DASLab/DarwinLM

URLs: https://github.com/IST-DASLab/DarwinLM

replace LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization

Authors: Erica Zhang, Ryunosuke Goto, Naomi Sagan, Jurik Mutter, Nick Phillips, Ash Alizadeh, Kangwook Lee, Jose Blanchet, Mert Pilanci, Robert Tibshirani

Abstract: We introduce LLM-Lasso, a novel framework that leverages large language models (LLMs) to guide feature selection in Lasso $\ell_1$ regression. Unlike traditional methods that rely solely on numerical data, LLM-Lasso incorporates domain-specific knowledge extracted from natural language, enhanced through a retrieval-augmented generation (RAG) pipeline, to seamlessly integrate data-driven modeling with contextual insights. Specifically, the LLM generates penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model, while less relevant features are assigned higher penalties, reducing their influence. Importantly, LLM-Lasso has an internal validation step that determines how much to trust the contextual knowledge in our prediction pipeline. Hence it addresses key challenges in robustness, making it suitable for mitigating potential inaccuracies or hallucinations from the LLM. In various biomedical case studies, LLM-Lasso outperforms standard Lasso and existing feature selection baselines, all while ensuring the LLM operates without prior access to the datasets. To our knowledge, this is the first approach to effectively integrate conventional feature selection techniques directly with LLM-based domain-specific reasoning.

replace Generalization of the Gibbs algorithm with high probability at low temperatures

Authors: Andreas Maurer

Abstract: The paper gives a bound on the generalization error of the Gibbs algorithm, which recovers known data-independent bounds for the high temperature range and extends to the low-temperature range, where generalization depends critically on the data-dependent loss-landscape. It is shown, that with high probability the generalization error of a single hypothesis drawn from the Gibbs posterior decreases with the total prior volume of all hypotheses with similar or smaller empirical error. This gives theoretical support to the belief in the benefit of flat minima. The zero temperature limit is discussed and the bound is extended to a class of similar stochastic algorithms.

replace Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives

Authors: Leo Schwinn, Yan Scholten, Tom Wollschl\"ager, Sophie Xhonneux, Stephen Casper, Stephan G\"unnemann, Gauthier Gidel

Abstract: Misaligned research objectives have considerably hindered progress in adversarial robustness research over the past decade. For instance, an extensive focus on optimizing target metrics, while neglecting rigorous standardized evaluation, has led researchers to pursue ad-hoc heuristic defenses that were seemingly effective. Yet, most of these were exposed as flawed by subsequent evaluations, ultimately contributing little measurable progress to the field. In this position paper, we illustrate that current research on the robustness of large language models (LLMs) risks repeating past patterns with potentially worsened real-world implications. To address this, we argue that realigned objectives are necessary for meaningful progress in adversarial alignment. To this end, we build on established cybersecurity taxonomy to formally define differences between past and emerging threat models that apply to LLMs. Using this framework, we illustrate that progress requires disentangling adversarial alignment into addressable sub-problems and returning to core academic principles, such as measureability, reproducibility, and comparability. Although the field presents significant challenges, the fresh start on adversarial robustness offers the unique opportunity to build on past experience while avoiding previous mistakes.

replace LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities

Authors: Florian Sestak, Artur Toshev, Andreas F\"urst, G\"unter Klambauer, Andreas Mayr, Johannes Brandstetter

Abstract: Generative models are spearheading recent progress in deep learning, showing strong promise for trajectory sampling in dynamical systems as well. However, while latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems -- from chemical molecule structures to collective human behavior -- are described by interactions of entities, making them inherently linked to connectivity patterns and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), combines the advantages of graph neural networks, i.e., the traceability of entities across time-steps, with the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder are frozen to enable generative modeling in the latent space. The core idea of LaM-SLidE is to introduce identifier representations (IDs) to allow for retrieval of entity properties, e.g., entity coordinates, from latent system representations and thus enables traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. Code is available at https://github.com/ml-jku/LaM-SLidE .

URLs: https://github.com/ml-jku/LaM-SLidE

replace Score-Based Diffusion Policy Compatible with Reinforcement Learning via Optimal Transport

Authors: Mingyang Sun, Pengxiang Ding, Weinan Zhang, Donglin Wang

Abstract: Diffusion policies have shown promise in learning complex behaviors from demonstrations, particularly for tasks requiring precise control and long-term planning. However, they face challenges in robustness when encountering distribution shifts. This paper explores improving diffusion-based imitation learning models through online interactions with the environment. We propose OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement learning fine-tuning), a novel method that integrates diffusion policies with RL using optimal transport theory. OTPR leverages the Q-function as a transport cost and views the policy as an optimal transport map, enabling efficient and stable fine-tuning. Moreover, we introduce masked optimal transport to guide state-action matching using expert keypoints and a compatibility-based resampling strategy to enhance training stability. Experiments on three simulation tasks demonstrate OTPR's superior performance and robustness compared to existing methods, especially in complex and sparse-reward environments. In sum, OTPR provides an effective framework for combining IL and RL, achieving versatile and reliable policy learning. The code will be released at https://github.com/Sunmmyy/OTPR.git.

URLs: https://github.com/Sunmmyy/OTPR.git.

replace Is Q-learning an Ill-posed Problem?

Authors: Philipp Wissmann, Daniel Hein, Steffen Udluft, Thomas Runkler

Abstract: This paper investigates the instability of Q-learning in continuous environments, a challenge frequently encountered by practitioners. Traditionally, this instability is attributed to bootstrapping and regression model errors. Using a representative reinforcement learning benchmark, we systematically examine the effects of bootstrapping and model inaccuracies by incrementally eliminating these potential error sources. Our findings reveal that even in relatively simple benchmarks, the fundamental task of Q-learning - iteratively learning a Q-function from policy-specific target values - can be inherently ill-posed and prone to failure. These insights cast doubt on the reliability of Q-learning as a universal solution for reinforcement learning problems.

replace Temporal Misalignment in ANN-SNN Conversion and Its Mitigation via Probabilistic Spiking Neurons

Authors: Velibor Bojkovi\'c, Xiaofeng Wu, Bin Gu

Abstract: Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to Artificial Neural Networks (ANNs) by mimicking biological neural principles, establishing them as a promising approach to mitigate the increasing energy demands of large-scale neural models. However, fully harnessing the capabilities of SNNs remains challenging due to their discrete signal processing and temporal dynamics. ANN-SNN conversion has emerged as a practical approach, enabling SNNs to achieve competitive performance on complex machine learning tasks. In this work, we identify a phenomenon in the ANN-SNN conversion framework, termed temporal misalignment, in which random spike rearrangement across SNN layers leads to performance improvements. Based on this observation, we introduce biologically plausible two-phase probabilistic (TPP) spiking neurons, further enhancing the conversion process. We demonstrate the advantages of our proposed method both theoretically and empirically through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet across a variety of architectures, achieving state-of-the-art results.

replace Not All Data are Good Labels: On the Self-supervised Labeling for Time Series Forecasting

Authors: Yuxuan Yang, Dalin Zhang, Yuxuan Liang, Hua Lu, Gang Chen, Huan Li

Abstract: Time Series Forecasting (TSF) is a crucial task in various domains, yet existing TSF models rely heavily on high-quality data and insufficiently exploit all available data. This paper explores a novel self-supervised approach to re-label time series datasets by inherently constructing candidate datasets. During the optimization of a simple reconstruction network, intermediates are used as pseudo labels in a self-supervised paradigm, improving generalization for any predictor. We introduce the Self-Correction with Adaptive Mask (SCAM), which discards overfitted components and selectively replaces them with pseudo labels generated from reconstructions. Additionally, we incorporate Spectral Norm Regularization (SNR) to further suppress overfitting from a loss landscape perspective. Our experiments on eleven real-world datasets demonstrate that SCAM consistently improves the performance of various backbone models. This work offers a new perspective on constructing datasets and enhancing the generalization of TSF models through self-supervised learning.

replace An Adversarial Analysis of Thompson Sampling for Full-information Online Learning: from Finite to Infinite Action Spaces

Authors: Alexander Terenin, Jeffrey Negrea

Abstract: We develop an analysis of Thompson sampling for online learning under full feedback - also known as prediction with expert advice - where the learner's prior is defined over the space of an adversary's future actions, rather than the space of experts. We show regret decomposes into regret the learner expected a priori, plus a prior-robustness-type term we call excess regret. In the classical finite-expert setting, this recovers optimal rates. As an initial step towards practical online learning in settings with a potentially-uncountably-infinite number of experts, we show that Thompson sampling with a certain Gaussian process prior widely-used in the Bayesian optimization literature has a $\mathcal{O}(\beta\sqrt{T\log(1+\lambda)})$ rate against a $\beta$-bounded $\lambda$-Lipschitz adversary.

replace-cross Concentration Inequalities for Statistical Inference

Authors: Huiming Zhang, Song Xi Chen

Abstract: This paper gives a review of concentration inequalities which are widely employed in non-asymptotical analyses of mathematical statistics in a wide range of settings, from distribution-free to distribution-dependent, from sub-Gaussian to sub-exponential, sub-Gamma, and sub-Weibull random variables, and from the mean to the maximum concentration. This review provides results in these settings with some fresh new results. Given the increasing popularity of high-dimensional data and inference, results in the context of high-dimensional linear and Poisson regressions are also provided. We aim to illustrate the concentration inequalities with known constants and to improve existing bounds with sharper constants.

replace-cross A graph neural network-based model with Out-of-Distribution Robustness for enhancing Antiretroviral Therapy Outcome Prediction for HIV-1

Authors: Giulia Di Teodoro, Federico Siciliano, Valerio Guarrasi, Anne-Mieke Vandamme, Valeria Ghisetti, Anders S\"onnerborg, Maurizio Zazzi, Fabrizio Silvestri, Laura Palagi

Abstract: Predicting the outcome of antiretroviral therapies (ART) for HIV-1 is a pressing clinical challenge, especially when the ART includes drugs with limited effectiveness data. This scarcity of data can arise either due to the introduction of a new drug to the market or due to limited use in clinical settings, resulting in clinical dataset with highly unbalanced therapy representation. To tackle this issue, we introduce a novel joint fusion model, which combines features from a Fully Connected (FC) Neural Network and a Graph Neural Network (GNN) in a multi-modality fashion. Our model uses both tabular data about genetic sequences and a knowledge base derived from Stanford drug-resistance mutation tables, which serve as benchmark references for deducing in-vivo treatment efficacy based on the viral genetic sequence. By leveraging this knowledge base structured as a graph, the GNN component enables our model to adapt to imbalanced data distributions and account for Out-of-Distribution (OoD) drugs. We evaluated these models' robustness against OoD drugs in the test set. Our comprehensive analysis demonstrates that the proposed model consistently outperforms the FC model. These results underscore the advantage of integrating Stanford scores in the model, thereby enhancing its generalizability and robustness, but also extending its utility in contributing in more informed clinical decisions with limited data availability. The source code is available at https://github.com/federicosiciliano/graph-ood-hiv

URLs: https://github.com/federicosiciliano/graph-ood-hiv

replace-cross PiCO: Peer Review in LLMs based on the Consistency Optimization

Authors: Kun-Peng Ning, Shuo Yang, Yu-Yang Liu, Jia-Yu Yao, Zhen-Hui Liu, Yong-Hong Tian, Yibing Song, Li Yuan

Abstract: Existing large language models (LLMs) evaluation methods typically focus on testing the performance on some closed-environment and domain-specific benchmarks with human annotations. In this paper, we explore a novel unsupervised evaluation direction, utilizing peer-review mechanisms to measure LLMs automatically. In this setting, both open-source and closed-source LLMs lie in the same environment, capable of answering unlabeled questions and evaluating each other, where each LLM's response score is jointly determined by other anonymous ones. To obtain the ability hierarchy among these models, we assign each LLM a learnable capability parameter to adjust the final ranking. We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores. The key assumption behind is that high-level LLM can evaluate others' answers more accurately than low-level ones, while higher-level LLM can also achieve higher response scores. Moreover, we propose three metrics called PEN, CIN, and LIS to evaluate the gap in aligning human rankings. We perform experiments on multiple datasets with these metrics, validating the effectiveness of the proposed approach.

replace-cross Scaling Laws for Downstream Task Performance in Machine Translation

Authors: Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo

Abstract: Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by: downstream cross-entropy and translation quality metrics such as BLEU and COMET scores. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream translation quality metrics with good accuracy using a log-law. However, there are cases where moderate misalignment causes the downstream translation scores to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these, we provide new practical insights for choosing appropriate pretraining data.

replace-cross MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations

Authors: Benedikt Alkin, Lukas Miklautz, Sepp Hochreiter, Johannes Brandstetter

Abstract: We introduce MIM (Masked Image Modeling)-Refiner, a contrastive learning boost for pre-trained MIM models. MIM-Refiner is motivated by the insight that strong representations within MIM models generally reside in intermediate layers. Accordingly, MIM-Refiner leverages multiple contrastive heads that are connected to different intermediate layers. In each head, a modified nearest neighbor objective constructs semantic clusters that capture semantic information which improves performance on downstream tasks, including off-the-shelf and fine-tuning settings. The refinement process is short and simple - yet highly effective. Within a few epochs, we refine the features of MIM models from subpar to state-of-the-art, off-the-shelf features. Refining a ViT-H, pre-trained with data2vec 2.0 on ImageNet-1K, sets a new state-of-the-art in linear probing (84.7%) and low-shot classification among models that are pre-trained on ImageNet-1K. MIM-Refiner efficiently combines the advantages of MIM and ID objectives and compares favorably against previous state-of-the-art SSL models on a variety of benchmarks such as low-shot classification, long-tailed classification, clustering and semantic segmentation.

replace-cross Aligning the Objective of LLM-based Program Repair

Authors: Junjielong Xu, Ying Fu, Shin Hwei Tan, Pinjia He

Abstract: Large language models (LLMs) have achieved decent results on automated program repair (APR). However, the next token prediction training objective of decoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction objective of current infilling-style methods, which impedes LLMs from fully leveraging pre-trained knowledge for program repair. In addition, while some LLMs can locate and repair bugs in certain functions using the related artifacts (e.g., test cases), existing methods still depend on statement-level fault localization methods to provide a list of buggy hunks for repair. This restriction hinders LLMs from exploring potential patches beyond the given locations. In this paper, we investigate a new approach to adapt LLMs to program repair. Our core insight is that LLM's APR capability can be greatly improved by simply aligning the output to their training objective and allowing them to refine the whole program without first identifying faulty statements. Based on this insight, we designed D4C, a straightforward prompting framework for APR. D4C can repair 180 bugs correctly in Defects4J, with each patch being sampled only 10 times. This surpasses the SOTA APR methods with perfect fault localization by 10% and reduces the patch sampling number by 90%. Our findings reveal that (1) objective alignment is crucial for fully exploiting LLM's pre-trained capability, and (2) replacing the traditional localize-buggy-hunks-then-repair workflow with direct debugging is more effective for LLM-based APR methods. Thus, we believe this paper introduces a new mindset for harnessing LLMs in APR.

replace-cross Online Planning of Power Flows for Power Systems Against Bushfires Using Spatial Context

Authors: Jianyu Xu, Qiuzhuang Sun, Yang Yang, Huadong Mo, Daoyi Dong

Abstract: The 2019-20 Australia bushfire incurred numerous economic losses and significantly affected the operations of power systems. A power station or transmission line can be significantly affected due to bushfires, leading to an increase in operational costs. We study a fundamental but challenging problem of planning the optimal power flow (OPF) for power systems subject to bushfires. Considering the stochastic nature of bushfire spread, we develop a model to capture such dynamics based on Moore's neighborhood model. Under a periodic inspection scheme that reveals the in-situ bushfire status, we propose an online optimization modeling framework that sequentially plans the power flows in the electricity network. Our framework assumes that the spread of bushfires is non-stationary over time, and the spread and containment probabilities are unknown. To meet these challenges, we develop a contextual online learning algorithm that treats the in-situ geographical information of the bushfire as a 'spatial context'. The online learning algorithm learns the unknown probabilities sequentially based on the observed data and then makes the OPF decision accordingly. The sequential OPF decisions aim to minimize the regret function, which is defined as the cumulative loss against the clairvoyant strategy that knows the true model parameters. We provide a theoretical guarantee of our algorithm by deriving a bound on the regret function, which outperforms the regret bound achieved by other benchmark algorithms. Our model assumptions are verified by the real bushfire data from NSW, Australia, and we apply our model to two power systems to illustrate its applicability.

replace-cross CoverLib: Classifiers-equipped Experience Library by Iterative Problem Distribution Coverage Maximization for Domain-tuned Motion Planning

Authors: Hirokazu Ishida, Naoki Hiraoka, Kei Okada, Masayuki Inaba

Abstract: Library-based methods are known to be very effective for fast motion planning by adapting an experience retrieved from a precomputed library. This article presents CoverLib, a principled approach for constructing and utilizing such a library. CoverLib iteratively adds an experience-classifier-pair to the library, where each classifier corresponds to an adaptable region of the experience within the problem space. This iterative process is an active procedure, as it selects the next experience based on its ability to effectively cover the uncovered region. During the query phase, these classifiers are utilized to select an experience that is expected to be adaptable for a given problem. Experimental results demonstrate that CoverLib effectively mitigates the trade-off between plannability and speed observed in global (e.g. sampling-based) and local (e.g. optimization-based) methods. As a result, it achieves both fast planning and high success rates over the problem domain. Moreover, due to its adaptation-algorithm-agnostic nature, CoverLib seamlessly integrates with various adaptation methods, including nonlinear programming-based and sampling-based algorithms.

replace-cross Feature Aggregation with Latent Generative Replay for Federated Continual Learning of Socially Appropriate Robot Behaviours

Authors: Nikhil Churamani, Saksham Checker, Fethiye Irmak Dogan, Hao-Tien Lewis Chiang, Hatice Gunes

Abstract: It is critical for robots to explore Federated Learning (FL) settings where several robots, deployed in parallel, can learn independently while also sharing their learning with each other. This collaborative learning in real-world environments requires social robots to adapt dynamically to changing and unpredictable situations and varying task settings. Our work contributes to addressing these challenges by exploring a simulated living room environment where robots need to learn the social appropriateness of their actions. First, we propose Federated Root (FedRoot) averaging, a novel weight aggregation strategy which disentangles feature learning across clients from individual task-based learning. Second, to adapt to challenging environments, we extend FedRoot to Federated Latent Generative Replay (FedLGR), a novel Federated Continual Learning (FCL) strategy that uses FedRoot-based weight aggregation and embeds each client with a generator model for pseudo-rehearsal of learnt feature embeddings to mitigate forgetting in a resource-efficient manner. Our results show that FedRoot-based methods offer competitive performance while also resulting in a sizeable reduction in resource consumption (up to 86% for CPU usage and up to 72% for GPU usage). Additionally, our results demonstrate that FedRoot-based FCL methods outperform other methods while also offering an efficient solution (up to 84% CPU and 92% GPU usage reduction), with FedLGR providing the best results across evaluations.

replace-cross Is Free Self-Alignment Possible?

Authors: Dyah Adila, Changho Shin, Yijing Zhang, Frederic Sala

Abstract: Aligning pretrained language models (LMs) often requires large-scale preference data and substantial computational resources. These costs become even more prohibitive for multi-objective or pluralistic alignment. Is this truly necessary? Can we perform efficient alignment using only internal model capabilities, and without additional training? To answer this question, we propose AlignEZ, a novel approach that leverages (1) self-generated preference data and (2) representation editing to achieve cost-effective, efficient alignment. By operating directly on learned representations, AlignEZ independently targets different behavioral aspects without the overhead of traditional alignment methods. Our experiments reveal that this cost-efficient procedure improves performance across diverse tasks: up to 19.9% on general alignment and 1.9% on challenging mathematical reasoning tasks, even when starting from a strong base model. AlignEZ can also align models to multiple objectives simultaneously, granting fine-grained control over multiple preference axes. Finally, we show that AlignEZ can accelerate more expensive alignment procedures--such as DPO--even under limited availability of ground-truth preference data.

replace-cross Vision-LSTM: xLSTM as Generic Vision Backbone

Authors: Benedikt Alkin, Maximilian Beck, Korbinian P\"oppel, Sepp Hochreiter, Johannes Brandstetter

Abstract: Transformers are widely used as generic backbones in computer vision, despite initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture - the xLSTM - which overcomes long-standing LSTM limitations via exponential gating and parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaption of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as new generic backbone for computer vision architectures.

replace-cross Aligned at the Start: Conceptual Groupings in LLM Embeddings

Authors: Mehrdad Khatir, Sanchit Kabara, Chandan K. Reddy

Abstract: This paper shifts focus to the often-overlooked input embeddings - the initial representations fed into transformer blocks. Using fuzzy graph, k-nearest neighbor (k-NN), and community detection, we analyze embeddings from diverse LLMs, finding significant categorical community structure aligned with predefined concepts and categories aligned with humans. We observe these groupings exhibit within-cluster organization (such as hierarchies, topological ordering, etc.), hypothesizing a fundamental structure that precedes contextual processing. To further investigate the conceptual nature of these groupings, we explore cross-model alignments across different LLM categories within their input embeddings, observing a medium to high degree of alignment. Furthermore, provide evidence that manipulating these groupings can play a functional role in mitigating ethnicity bias in LLM tasks.

replace-cross Dataless Quadratic Neural Networks for the Maximum Independent Set Problem

Authors: Ismail Alkhouri, Cedric Le Denmat, Yingjie Li, Cunxi Yu, Jia Liu, Rongrong Wang, Alvaro Velasquez

Abstract: Combinatorial Optimization (CO) addresses many important problems, including the challenging Maximum Independent Set (MIS) problem. Alongside exact and heuristic solvers, differentiable approaches have emerged, often using continuous relaxations of ReLU-based or quadratic objectives. Noting that an MIS in a graph is a Maximum Clique (MC) in its complement, we propose a new quadratic formulation for MIS by incorporating an MC term, improving convergence and exploration. We show that every maximal independent set corresponds to a local minimizer, derive conditions for the MIS size, and characterize stationary points. To solve our non-convex objective, we propose solving parallel multiple initializations using momentum-based gradient descent, complemented by an efficient MIS checking criterion derived from our theory. Therefore, we dub our method as parallelized Clique-Informed Quadratic Optimization for MIS (pCQO-MIS). Our experimental results demonstrate the effectiveness of the proposed method compared to exact, heuristic, sampling, and data-centric approaches. Notably, our method avoids the out-of-distribution tuning and reliance on (un)labeled data required by data-centric methods, while achieving superior MIS sizes and competitive runtime relative to their inference time. Additionally, a key advantage of pCQO-MIS is that, unlike exact and heuristic solvers, the runtime scales only with the number of nodes in the graph, not the number of edges.

replace-cross Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors

Authors: Julien Hauret, Malo Olivier, Thomas Joubaud, Christophe Langrenne, Sarah Poir\'ee, V\'eronique Zimpfer, \'Eric Bavu

Abstract: Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings using five different body-conduction audio sensors : two in-ear microphones, two bone conduction vibration pickups and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 45 hours of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by an high order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performances on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.

replace-cross BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models

Authors: Yupeng Chang, Yi Chang, Yuan Wu

Abstract: Large language models (LLMs) have demonstrated remarkable proficiency across various natural language processing (NLP) tasks. However, adapting LLMs to downstream applications requires computationally intensive and memory-demanding fine-tuning procedures. To alleviate these burdens, parameter-efficient fine-tuning (PEFT) techniques have emerged as a promising approach to tailor LLMs with minimal computational overhead. While PEFT methods offer substantial advantages, they do not fully address the pervasive issue of bias propagation from pre-training data. This work introduces Bias-Alleviating Low-Rank Adaptation (BA-LoRA), a novel PEFT method designed to counteract bias inheritance. BA-LoRA incorporates three distinct regularization terms: (1) a consistency regularizer, (2) a diversity regularizer, and (3) a singular value decomposition regularizer. These regularizers aim to enhance the models' consistency, diversity, and generalization capabilities during fine-tuning. We conduct extensive experiments on natural language understanding (NLU) and natural language generation (NLG) tasks using prominent LLMs such as LLaMA, Mistral, and Gemma. The results demonstrate that BA-LoRA outperforms LoRA and its state-of-the-art variants. Moreover, our method effectively mitigates the adverse effects of pre-training bias, leading to more reliable and robust model outputs. The code is available at https://github.com/cyp-jlu-ai/BA-LoRA.

URLs: https://github.com/cyp-jlu-ai/BA-LoRA.

replace-cross Fast John Ellipsoid Computation with Differential Privacy Optimization

Authors: Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Junwei Yu

Abstract: Determining the John ellipsoid - the largest volume ellipsoid contained within a convex polytope - is a fundamental problem with applications in machine learning, optimization, and data analytics. Recent work has developed fast algorithms for approximating the John ellipsoid using sketching and leverage score sampling techniques. However, these algorithms do not provide privacy guarantees for sensitive input data. In this paper, we present the first differentially private algorithm for fast John ellipsoid computation. Our method integrates noise perturbation with sketching and leverages score sampling to achieve both efficiency and privacy. We prove that (1) our algorithm provides $(\epsilon,\delta)$-differential privacy and the privacy guarantee holds for neighboring datasets that are $\epsilon_0$-close, allowing flexibility in the privacy definition; (2) our algorithm still converges to a $(1+\xi)$-approximation of the optimal John ellipsoid in $\Theta(\xi^{-2}(\log(n/\delta_0) + (L\epsilon_0)^{-2}))$ iterations where $n$ is the number of data point, $L$ is the Lipschitz constant, $\delta_0$ is the failure probability, and $\epsilon_0$ is the closeness of neighboring input datasets. Our theoretical analysis demonstrates the algorithm's convergence and privacy properties, providing a robust approach for balancing utility and privacy in John ellipsoid computation. This is the first differentially private algorithm for fast John ellipsoid computation, opening avenues for future research in privacy-preserving optimization techniques.

replace-cross Integrating the Expected Future in Load Forecasts with Contextually Enhanced Transformer Models

Authors: Raffael Theiler, Leandro Von Krannichfeldt, Giovanni Sansavini, Michael F. Howland, Olga Fink

Abstract: Accurate and reliable energy forecasting is essential for power grid operators who strive to minimize extreme forecasting errors that pose significant operational challenges and incur high intra-day trading costs. Incorporating planning information -- such as anticipated user behavior, scheduled events or timetables -- provides substantial contextual information to enhance forecast accuracy and reduce the occurrence of large forecasting errors. Existing approaches, however, lack the flexibility to effectively integrate both dynamic, forward-looking contextual inputs and historical data. In this work, we conceptualize forecasting as a combined forecasting-regression task, formulated as a sequence-to-sequence prediction problem, and introduce contextually-enhanced transformer models designed to leverage all contextual information effectively. We demonstrate the effectiveness of our approach through a primary case study on nationwide railway energy consumption forecasting, where integrating contextual information into transformer models, particularly timetable data, resulted in a significant average mean absolute error reduction of 26.6%. An auxiliary case study on building energy forecasting, leveraging planned office occupancy data, further illustrates the generalizability of our method, showing an average reduction of 56.3% in mean absolute error. Compared to other state-of-the-art methods, our approach consistently outperforms existing models, underscoring the value of context-aware deep learning techniques in energy forecasting applications.

replace-cross Fully automatic extraction of morphological traits from the Web: utopia or reality?

Authors: Diego Marcos, Robert van de Vlasakker, Ioannis N. Athanasiadis, Pierre Bonnet, Herv\'e Goeau, Alexis Joly, W. Daniel Kissling, C\'esar Leblanc, Andr\'e S. J. van Proosdij, Konstantinos P. Panousis

Abstract: Plant morphological traits, their observable characteristics, are fundamental to understand the role played by each species within their ecosystem. However, compiling trait information for even a moderate number of species is a demanding task that may take experts years to accomplish. At the same time, massive amounts of information about species descriptions is available online in the form of text, although the lack of structure makes this source of data impossible to use at scale. To overcome this, we propose to leverage recent advances in large language models (LLMs) and devise a mechanism for gathering and processing information on plant traits in the form of unstructured textual descriptions, without manual curation. We evaluate our approach by automatically replicating three manually created species-trait matrices. Our method managed to find values for over half of all species-trait pairs, with an F1-score of over 75%. Our results suggest that large-scale creation of structured trait databases from unstructured online text is currently feasible thanks to the information extraction capabilities of LLMs, being limited by the availability of textual descriptions covering all the traits of interest.

replace-cross Packet Inspection Transformer: A Self-Supervised Journey to Unseen Malware Detection with Few Samples

Authors: Kyle Stein, Arash Mahyari, Guillermo Francia III, Eman El-Sheikh

Abstract: As networks continue to expand and become more interconnected, the need for novel malware detection methods becomes more pronounced. Traditional security measures are increasingly inadequate against the sophistication of modern cyber attacks. Deep Packet Inspection (DPI) has been pivotal in enhancing network security, offering an in-depth analysis of network traffic that surpasses conventional monitoring techniques. DPI not only examines the metadata of network packets, but also dives into the actual content being carried within the packet payloads, providing a comprehensive view of the data flowing through networks. While the integration of advanced deep learning techniques with DPI has introduced modern methodologies into malware detection and network traffic classification, state-of-the-art supervised learning approaches are limited by their reliance on large amounts of annotated data and their inability to generalize to novel, unseen malware threats. To address these limitations, this paper leverages the recent advancements in self-supervised learning (SSL) and few-shot learning (FSL). Our proposed self-supervised approach trains a transformer via SSL to learn the embedding of packet content, including payload, from vast amounts of unlabeled data by masking portions of packets, leading to a learned representation that generalizes to various downstream tasks. Once the representation is extracted from the packets, they are used to train a malware detection algorithm. The representation obtained from the transformer is then used to adapt the malware detector to novel types of attacks using few-shot learning approaches. Our experimental results demonstrate that our method achieves classification accuracies of up to 94.76% on the UNSW-NB15 dataset and 83.25% on the CIC-IoT23 dataset.

replace-cross Post-edits Are Preferences Too

Authors: Nathaniel Berger, Miriam Exel, Matthias Huck, Stefan Riezler

Abstract: Preference Optimization (PO) techniques are currently one of the state of the art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences $s_1$ and $s_2$ and asked for a preference judgment, %$s_1 > s_2$; while for post-editing, editors create $s_1$ and know that it should be better than $s_2$. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit-like hypotheses and away from machine translation-like hypotheses. Furthermore, we show that best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.

replace-cross Encryption-Friendly LLM Architecture

Authors: Donghwan Rho, Taeseong Kim, Minje Park, Jung Woo Kim, Hyunsik Chae, Ernest K. Ryu, Jung Hee Cheon

Abstract: Large language models (LLMs) offer personalized responses based on user interactions, but this use case raises serious privacy concerns. Homomorphic encryption (HE) is a cryptographic protocol supporting arithmetic computations in encrypted states and provides a potential solution for privacy-preserving machine learning (PPML). However, the computational intensity of transformers poses challenges for applying HE to LLMs. In this work, we propose a modified HE-friendly transformer architecture with an emphasis on inference following personalized (private) fine-tuning. Utilizing LoRA fine-tuning and Gaussian kernels, we achieve significant computational speedups -- 6.94x for fine-tuning and 2.3x for inference -- while maintaining performance comparable to plaintext models. Our findings provide a viable proof of concept for offering privacy-preserving LLM services in areas where data protection is crucial. Our code is available on GitHub.

replace-cross Identification and estimation for matrix time series CP-factor models

Authors: Jinyuan Chang, Yue Du, Guanglin Huang, Qiwei Yao

Abstract: We propose a new method for identifying and estimating the CP-factor models for matrix time series. Unlike the generalized eigenanalysis-based method of Chang et al.(2023) for which the convergence rates may suffer from small eigengaps as the asymptotic theory is based on some matrix perturbation analysis, the proposed new method enjoys faster convergence rates which are free from any eigengaps. It achieves this by turning the problem into a joint diagonalization of several matrices whose elements are determined by a basis of a linear system, and by choosing the basis carefully to avoid near co-linearity (see Proposition 5 and Section 4.3 below). Furthermore, unlike Chang et al.(2023) which requires the two factor loading matrices to be full-ranked, the new method can handle rank-deficient factor loading matrices. Illustration with both simulated and real matrix time series data shows the advantages of the proposed new method.

replace-cross Feedback Schr\"odinger Bridge Matching

Authors: Panagiotis Theodoropoulos, Nikolaos Komianos, Vincent Pacelli, Guan-Horng Liu, Evangelos A. Theodorou

Abstract: Recent advancements in diffusion bridges for distribution transport problems have heavily relied on matching frameworks, yet existing methods often face a trade-off between scalability and access to optimal pairings during training. Fully unsupervised methods make minimal assumptions but incur high computational costs, limiting their practicality. On the other hand, imposing full supervision of the matching process with optimal pairings improves scalability, however, it can be infeasible in many applications. To strike a balance between scalability and minimal supervision, we introduce Feedback Schr\"odinger Bridge Matching (FSBM), a novel semi-supervised matching framework that incorporates a small portion (less than 8% of the entire dataset) of pre-aligned pairs as state feedback to guide the transport map of non coupled samples, thereby significantly improving efficiency. This is achieved by formulating a static Entropic Optimal Transport (EOT) problem with an additional term capturing the semi-supervised guidance. The generalized EOT objective is then recast into a dynamic formulation to leverage the scalability of matching frameworks. Extensive experiments demonstrate that FSBM accelerates training and enhances generalization by leveraging coupled pairs guidance, opening new avenues for training matching frameworks with partially aligned datasets.

replace-cross Interpretable Multimodal Machine Learning Analysis of X-ray Absorption Near-Edge Spectra and Pair Distribution Functions

Authors: Tanaporn Na Narong, Zoe N. Zachko, Steven B. Torrisi, Simon J. L. Billinge

Abstract: We used interpretable machine learning to combine information from multiple heterogeneous spectra: X-ray absorption near-edge spectra (XANES) and atomic pair distribution functions (PDFs) to extract local structural and chemical environments of transition metal cations in oxides. Random forest models were trained on simulated XANES, PDF, and both combined to extract oxidation state, coordination number, and mean nearest-neighbor bond length. XANES-only models generally outperformed PDF-only models, even for structural tasks, although using the metal's differential PDFs (dPDFs) instead of total PDFs narrowed this gap. When combined with PDFs, information from XANES often dominates the prediction. Our results demonstrate that XANES contain rich structural information and highlight the utility of species-specificity. This interpretable, multimodal approach is quick to implement with suitable databases and offers valuable insights into the relative strengths of different modalities, guiding researchers in experiment design and identifying when combining complementary techniques adds meaningful information to a scientific investigation.

replace-cross Refined Risk Bounds for Unbounded Losses via Transductive Priors

Authors: Jian Qian, Alexander Rakhlin, Nikita Zhivotovskiy

Abstract: We revisit the sequential variants of linear regression with the squared loss, classification problems with hinge loss, and logistic regression, all characterized by unbounded losses in the setup where no assumptions are made on the magnitude of design vectors and the norm of the optimal vector of parameters. The key distinction from existing results lies in our assumption that the set of design vectors is known in advance (though their order is not), a setup sometimes referred to as transductive online learning. While this assumption seems similar to fixed design regression or denoising, we demonstrate that the sequential nature of our algorithms allows us to convert our bounds into statistical ones with random design without making any additional assumptions about the distribution of the design vectors--an impossibility for standard denoising results. Our key tools are based on the exponential weights algorithm with carefully chosen transductive (design-dependent) priors, which exploit the full horizon of the design vectors. Our classification regret bounds have a feature that is only attributed to bounded losses in the literature: they depend solely on the dimension of the parameter space and on the number of rounds, independent of the design vectors or the norm of the optimal solution. For linear regression with squared loss, we further extend our analysis to the sparse case, providing sparsity regret bounds that additionally depend on the magnitude of the response variables. We argue that these improved bounds are specific to the transductive setting and unattainable in the worst-case sequential setup. Our algorithms, in several cases, have polynomial time approximations and reduce to sampling with respect to log-concave measures instead of aggregating over hard-to-construct $\varepsilon$-covers of classes.

replace-cross Securing Healthcare with Deep Learning: A CNN-Based Model for medical IoT Threat Detection

Authors: Alireza Mohamadi, Hosna Ghahramani, Seyyed Amir Asghari, Mehdi Aminian

Abstract: The increasing integration of the Internet of Medical Things (IoMT) into healthcare systems has significantly enhanced patient care but has also introduced critical cybersecurity challenges. This paper presents a novel approach based on Convolutional Neural Networks (CNNs) for detecting cyberattacks within IoMT environments. Unlike previous studies that predominantly utilized traditional machine learning (ML) models or simpler Deep Neural Networks (DNNs), the proposed model leverages the capabilities of CNNs to effectively analyze the temporal characteristics of network traffic data. Trained and evaluated on the CICIoMT2024 dataset, which comprises 18 distinct types of cyberattacks across a range of IoMT devices, the proposed CNN model demonstrates superior performance compared to previous state-of-the-art methods, achieving a perfect accuracy of 99% in binary, categorical, and multiclass classification tasks. This performance surpasses that of conventional ML models such as Logistic Regression, AdaBoost, DNNs, and Random Forests. These findings highlight the potential of CNNs to substantially improve IoMT cybersecurity, thereby ensuring the protection and integrity of connected healthcare systems.

replace-cross Mitigating Tail Narrowing in LLM Self-Improvement via Socratic-Guided Sampling

Authors: Yiwen Ding, Zhiheng Xi, Wei He, Zhuoyuan Li, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang

Abstract: Self-improvement methods enable large language models (LLMs) to generate solutions themselves and iteratively train on filtered, high-quality rationales. This process proves effective and reduces the reliance on human supervision in LLMs' reasoning, but the performance soon plateaus. We delve into the process and find that models tend to over-sample on easy queries and under-sample on queries they have yet to master. As iterations proceed, this imbalance in sampling is exacerbated, leading to a long-tail distribution where solutions to difficult queries almost diminish. This phenomenon limits the performance gain of self-improving models. A straightforward solution is brute-force sampling to balance the distribution, which significantly raises computational costs. In this paper, we introduce Guided Self-Improvement (GSI), a strategy aimed at improving the efficiency of sampling challenging heavy-tailed data. It leverages Socratic-style guidance signals to help LLM reasoning with complex queries, reducing the exploration effort and minimizing computational overhead. Experiments on four models across diverse mathematical tasks show that GSI strikes a balance between performance and efficiency, while also being effective on held-out tasks.

replace-cross Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects

Authors: Ke Liang Xiao, Noah Marshall, Atish Agarwala, Elliot Paquette

Abstract: In recent years, signSGD has garnered interest as both a practical optimizer as well as a simple model to understand adaptive optimizers like Adam. Though there is a general consensus that signSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.

replace-cross R-MTLLMF: Resilient Multi-Task Large Language Model Fusion at the Wireless Edge

Authors: Aladin Djuhera, Vlad C. Andrei, Mohsen Pourghasemian, Haris Gacanin, Holger Boche, Walid Saad

Abstract: Multi-task large language models (MTLLMs) are important for many applications at the wireless edge, where users demand specialized models to handle multiple tasks efficiently. However, training MTLLMs is complex and exhaustive, particularly when tasks are subject to change. Recently, the concept of model fusion via task vectors has emerged as an efficient approach for combining fine-tuning parameters to produce an MTLLM. In this paper, the problem of enabling edge users to collaboratively craft such MTLMs via tasks vectors is studied, under the assumption of worst-case adversarial attacks. To this end, first the influence of adversarial noise to multi-task model fusion is investigated and a relationship between the so-called weight disentanglement error and the mean squared error (MSE) is derived. Using hypothesis testing, it is directly shown that the MSE increases interference between task vectors, thereby rendering model fusion ineffective. Then, a novel resilient MTLLM fusion (R-MTLLMF) is proposed, which leverages insights about the LLM architecture and fine-tuning process to safeguard task vector aggregation under adversarial noise by realigning the MTLLM. The proposed R-MTLLMF is then compared for both worst-case and ideal transmission scenarios to study the impact of the wireless channel. Extensive model fusion experiments with vision LLMs demonstrate R-MTLLMF's effectiveness, achieving close-to-baseline performance across eight different tasks in ideal noise scenarios and significantly outperforming unprotected model fusion in worst-case scenarios. The results further advocate for additional physical layer protection for a holistic approach to resilience, from both a wireless and LLM perspective.

replace-cross From Priest to Doctor: Domain Adaptation for Low-Resource Neural Machine Translation

Authors: Ali Marashian, Enora Rice, Luke Gessler, Alexis Palmer, Katharina von der Wense

Abstract: Many of the world's languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.

replace-cross Automatic Self-supervised Learning for Social Recommendations

Authors: Xin He, Wenqi Fan, Mingchen Sun, Ying Wang, Xin Wang

Abstract: In recent years, researchers have attempted to exploit social relations to improve the performance in recommendation systems. Generally, most existing social recommendation methods heavily depends on substantial domain knowledge and expertise in primary recommendation tasks for designing useful auxiliary tasks. Meanwhile, Self-Supervised Learning (SSL) recently has received considerable attention in the field of recommendation, since it can provide self-supervision signals in assisting the improvement of target recommendation systems by constructing self-supervised auxiliary tasks from raw data without human-annotated labels. Despite the great success, these SSL-based social recommendations are insufficient to adaptively balance various self-supervised auxiliary tasks, since assigning equal weights on various auxiliary tasks can result in sub-optimal recommendation performance, where different self-supervised auxiliary tasks may contribute differently to improving the primary social recommendation across different datasets. To address this issue, in this work, we propose Adaptive Self-supervised Learning for Social Recommendations (AdasRec) by taking advantage of various self-supervised auxiliary tasks. More specifically, an adaptive weighting mechanism is proposed to learn adaptive weights for various self-supervised auxiliary tasks, so as to balance the contribution of such self-supervised auxiliary tasks for enhancing representation learning in social recommendations. The adaptive weighting mechanism is used to assign different weights on auxiliary tasks to achieve an overall weighting of the entire auxiliary tasks and ultimately assist the primary recommendation task, achieved by a meta learning optimization problem with an adaptive weighting network. Comprehensive experiments on various real-world datasets are constructed to verify the effectiveness of our proposed method.

replace-cross ZeroFlow: Overcoming Catastrophic Forgetting is Easier than You Think

Authors: Tao Feng, Wei Li, Didi Zhu, Hangjie Yuan, Wendi Zheng, Dan Zhang, Jie Tang

Abstract: Backpropagation provides a generalized configuration for overcoming catastrophic forgetting. Like, SGD and Adam are commonly used for weight updates in continual learning and continual pre-training. In practice, permission to access gradient information is not always granted (the gradient ban), such as black-box APIs, hardware limitations, and non-differentiable systems. To bridge this gap, we introduce the first benchmark ZeroFlow to evaluate gradient-free optimization algorithms for overcoming forgetting. This benchmark examines a suite of forward pass methods across multiple methods, forgetting scenarios, and datasets. We find that forward passes alone are enough to overcome forgetting. Our findings reveal new optimization principles that highlight the potential of forward-pass in mitigating forgetting, managing task conflicts, and reducing memory demands, alongside novel enhancements that further mitigate forgetting with just one forward pass. This work provides essential insights and tools for advancing forward pass methods to overcome forgetting.

replace-cross Tensorization of neural networks for improved privacy and interpretability

Authors: Jos\'e Ram\'on Pareja Monturiol, Alejandro Pozas-Kerstjens, David P\'erez-Garc\'ia

Abstract: We present a tensorization algorithm for constructing tensor train representations of functions, drawing on sketching and cross interpolation ideas. The method only requires black-box access to the target function and a small set of sample points defining the domain of interest. Thus, it is particularly well-suited for machine learning models, where the domain of interest is naturally defined by the training dataset. We show that this approach can be used to enhance the privacy and interpretability of neural network models. Specifically, we apply our decomposition to (i) obfuscate neural networks whose parameters encode patterns tied to the training data distribution, and (ii) estimate topological phases of matter that are easily accessible from the tensor train representation. Additionally, we show that this tensorization can serve as an efficient initialization method for optimizing tensor trains in general settings, and that, for model compression, our algorithm achieves a superior trade-off between memory and time complexity compared to conventional tensorization methods of neural networks.

replace-cross Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home

Authors: Viktor Moskvoretskii, Maria Lysyuk, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Irina Nikishina, Alexander Panchenko

Abstract: Retrieval Augmented Generation (RAG) improves correctness of Question Answering (QA) and addresses hallucinations in Large Language Models (LLMs), yet greatly increase computational costs. Besides, RAG is not always needed as may introduce irrelevant information. Recent adaptive retrieval methods integrate LLMs' intrinsic knowledge with external information appealing to LLM self-knowledge, but they often neglect efficiency evaluations and comparisons with uncertainty estimation techniques. We bridge this gap by conducting a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, across 6 datasets using 10 metrics for QA performance, self-knowledge, and efficiency. Our findings show that uncertainty estimation techniques often outperform complex pipelines in terms of efficiency and self-knowledge, while maintaining comparable QA performance.

replace-cross Refined climatologies of future precipitation over High Mountain Asia using probabilistic ensemble learning

Authors: Kenza Tazi, Sun Woo P. Kim, Marc Girona-Mata, Richard E. Turner

Abstract: High Mountain Asia holds the largest concentration of frozen water outside the polar regions, serving as a crucial water source for more than 1.9 billion people. In the face of climate change, precipitation represents the largest source of uncertainty for hydrological modelling in this area. Future precipitation predictions remain challenging due to complex orography, lack of in situ hydrological observations, and limitations in climate model resolution and parametrisation for this region. To address the uncertainty posed by these challenges, climate models are often aggregated into multi-model ensembles. While multi-model ensembles are known to improve the predictive accuracy and analysis of future climate projections, consensus regarding how models are aggregated is lacking. In this study, we propose a probabilistic machine learning framework to combine 13 regional climate models from the Coordinated Regional Downscaling Experiment (CORDEX) over High Mountain Asia. Our approach accounts for seasonal and spatial biases within the models, enabling the prediction of more faithful precipitation distributions. The framework is validated against gridded historical precipitation data and is used to generate projections for the near future (2036$\unicode{x2013}$2065) and far future (2066$\unicode{x2013}$2095) under RCP4.5 and RCP8.5 scenarios.

replace-cross Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

Authors: Jia Li, Wenjie Zhao, Ziru Huang, Yunhui Guo, Yapeng Tian

Abstract: Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches, leveraging transformer architectures and powerful foundation models like SAM, have achieved impressive performance on standard benchmarks. Yet, an important question remains: Do these models genuinely integrate audio-visual cues to segment sounding objects? In this paper, we systematically investigate this issue in the context of robust AVS. Our study reveals a fundamental bias in current methods: they tend to generate segmentation masks based predominantly on visual salience, irrespective of the audio context. This bias results in unreliable predictions when sounds are absent or irrelevant. To address this challenge, we introduce AVSBench-Robust, a comprehensive benchmark incorporating diverse negative audio scenarios including silence, ambient noise, and off-screen sounds. We also propose a simple yet effective approach combining balanced training with negative samples and classifier-guided similarity learning. Our extensive experiments show that state-of-theart AVS methods consistently fail under negative audio conditions, demonstrating the prevalence of visual bias. In contrast, our approach achieves remarkable improvements in both standard metrics and robustness measures, maintaining near-perfect false positive rates while preserving highquality segmentation performance.

replace-cross The Observational Partial Order of Causal Structures with Latent Variables

Authors: Marina Maciel Ansanelli, Elie Wolfe, Robert W. Spekkens

Abstract: For two causal structures with the same set of visible variables, one is said to observationally dominate the other if the set of distributions over the visible variables realizable by the first contains the set of distributions over the visible variables realizable by the second. Knowing such dominance relations is useful for adjudicating between these structures given observational data. We here consider the problem of determining the partial order of equivalence classes of causal structures with latent variables relative to observational dominance. We provide a complete characterization of the dominance order in the case of three visible variables, and a partial characterization in the case of four visible variables. Our techniques also help to identify which observational equivalence classes have a set of realizable distributions that is characterized by nontrivial inequality constraints, analogous to Bell inequalities and instrumental inequalities. We find evidence that as one increases the number of visible variables, the equivalence classes satisfying nontrivial inequality constraints become ubiquitous. (Because such classes are the ones for which there can be a difference in the distributions that are quantumly and classically realizable, this implies that the potential for quantum-classical gaps is also ubiquitous.) Furthermore, we find evidence that constraint-based causal discovery algorithms that rely solely on conditional independence constraints have a significantly weaker distinguishing power among observational equivalence classes than algorithms that go beyond these (i.e., algorithms that also leverage nested Markov constraints and inequality constraints).

replace-cross MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning

Authors: Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen

Abstract: Inductive Reasoning (IR), the ability to summarize rules from examples and apply on new ones, has long been viewed as a primal ability for general intelligence and widely studied by cognitive science and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on few-shot (usually $<$10) setting and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs have brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations are mostly focused on classification (a very limited aspect of IR), and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context inductive reasoning benchmark that asks LLM to induce output via input-output examples from underlying functions with diverse data format. Based on MIR-Bench, we study many novel problems for inductive reasoning and many-shot ICL, including robustness against erroneous shots and the effect of Chain-of-Thought (CoT), and acquired insightful findings.

replace-cross OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling

Authors: Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, Zaiwen Wen

Abstract: Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs' robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods. To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), this framework automatically generates problem data (PD) with controllable complexity. Then, a back-translation step is employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. Then a collection of rejected pairs is identified and further filtered. This collection serves as a new benchmark for optimization modeling, containing difficult instances whose lengths are much longer than these of NL4OPT and MAMO. Through extensive experiments, we demonstrate that models of various sizes (0.5B-32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach. Our dataset is publicly available at https://github.com/AuroraLHL/OptMATH.

URLs: https://github.com/AuroraLHL/OptMATH.

replace-cross Quantifying the Capability Boundary of DeepSeek Models: An Application-Driven Performance Analysis

Authors: Kaikai Zhao, Zhaoxiang Liu, Xuejiao Lei, Ning Wang, Jiaojiao Zhao, Zipeng Wang, Zhenhong Long, Peijun Yang, Minjie Hua, Chaoyang Ma, Wen Liu, Kai Wang, Shiguo Lian

Abstract: DeepSeek-R1, known for its low training cost and exceptional reasoning capabilities, has achieved state-of-the-art performance on various benchmarks. However, detailed evaluations from the perspective of real-world applications are lacking, making it challenging for users to select the most suitable DeepSeek models for their specific needs. To address this gap, we evaluate the DeepSeek-V3, DeepSeek-R1, DeepSeek-R1-Distill-Qwen series, and DeepSeek-R1-Distill-Llama series on the improved version A-Eval (A-Eval-2.0), an application-driven benchmark. By comparing original instruction-tuned models with their distilled counterparts, we analyze how reasoning enhancements impact performance across diverse practical tasks. Our results show that reasoning-enhanced models, while generally powerful, do not universally outperform across all tasks, with performance gains varying significantly across tasks and models. To further assist users in model selection, we quantify the capability boundary of DeepSeek models through performance tier classifications and intuitive line charts. Specific examples provide actionable insights to help users select and deploy the most cost-effective DeepSeek models, ensuring optimal performance and resource efficiency in real-world applications. It should be noted that, despite our efforts to establish a comprehensive, objective, and authoritative evaluation benchmark, the selection of test samples, characteristics of data distribution, and the setting of evaluation criteria may inevitably introduce certain biases into the evaluation results. We will continuously optimize the evaluation benchmarks and periodically update this paper to provide more comprehensive and accurate evaluation results. Please refer to the latest version of the paper for the most recent results and conclusions.

replace-cross Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

Authors: Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin

Abstract: Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on-the-fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21x to 1.93x with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against sequential decoding baseline.

replace-cross RLTHF: Targeted Human Feedback for LLM Alignment

Authors: Yifei Xu, Tusher Chakraborty, Emre K{\i}c{\i}man, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra

Abstract: Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model's reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging LLM's correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF's strategic data curation.

replace-cross ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model

Authors: Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Ran Cheng, Yaxin Peng, Chaomin Shen, Feifei Feng

Abstract: Humans possess a unified cognitive ability to perceive, comprehend, and interact with the physical world. Why can't large language models replicate this holistic understanding? Through a systematic analysis of existing training paradigms in vision-language-action models (VLA), we identify two key challenges: spurious forgetting, where robot training overwrites crucial visual-text alignments, and task interference, where competing control and understanding tasks degrade performance when trained jointly. To overcome these limitations, we propose ChatVLA, a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference. ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art vision-language-action (VLA) methods on multimodal understanding benchmarks. Notably, it achieves a six times higher performance on MMMU and scores 47.2% on MMStar with a more parameter-efficient design than ECoT. Furthermore, ChatVLA demonstrates superior performance on 25 real-world robot manipulation tasks compared to existing VLA methods like OpenVLA. Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.