new On the Interpolation Effect of Score Smoothing

Authors: Zhengdao Chen

Abstract: Score-based diffusion models have achieved remarkable progress in various domains with the ability to generate new data samples that do not exist in the training set. In this work, we examine the hypothesis that their generalization ability arises from an interpolation effect caused by a smoothing of the empirical score function. Focusing on settings where the training set lies uniformly in a one-dimensional linear subspace, we study the interplay between score smoothing and the denoising dynamics with mathematically solvable models. In particular, we demonstrate how a smoothed score function can lead to the generation of samples that interpolate among the training data within their subspace while avoiding full memorization. We also present evidence that learning score functions with regularized neural networks can have a similar effect on the denoising dynamics as score smoothing.

new Models That Are Interpretable But Not Transparent

Authors: Chudi Zhong, Panyu Chen, Cynthia Rudin

Abstract: Faithful explanations are essential for machine learning models in high-stakes applications. Inherently interpretable models are well-suited for these applications because they naturally provide faithful explanations by revealing their decision logic. However, model designers often need to keep these models proprietary to maintain their value. This creates a tension: we need models that are interpretable--allowing human decision-makers to understand and justify predictions, but not transparent, so that the model's decision boundary is not easily replicated by attackers. Shielding the model's decision boundary is particularly challenging alongside the requirement of completely faithful explanations, since such explanations reveal the true logic of the model for an entire subspace around each query point. This work provides an approach, FaithfulDefense, that creates model explanations for logical models that are completely faithful, yet reveal as little as possible about the decision boundary. FaithfulDefense is based on a maximum set cover formulation, and we provide multiple formulations for it, taking advantage of submodularity.

new TRIX: A More Expressive Model for Zero-shot Domain Transfer in Knowledge Graphs

Authors: Yucheng Zhang, Beatrice Bevilacqua, Mikhail Galkin, Bruno Ribeiro

Abstract: Fully inductive knowledge graph models can be trained on multiple domains and subsequently perform zero-shot knowledge graph completion (KGC) in new unseen domains. This is an important capability towards the goal of having foundation models for knowledge graphs. In this work, we introduce a more expressive and capable fully inductive model, dubbed TRIX, which not only yields strictly more expressive triplet embeddings (head entity, relation, tail entity) compared to state-of-the-art methods, but also introduces a new capability: directly handling both entity and relation prediction tasks in inductive settings. Empirically, we show that TRIX outperforms the state-of-the-art fully inductive models in zero-shot entity and relation predictions in new domains, and outperforms large-context LLMs in out-of-domain predictions. The source code is available at https://github.com/yuchengz99/TRIX.

URLs: https://github.com/yuchengz99/TRIX.

new Mixtraining: A Better Trade-Off Between Compute and Performance

Authors: Zexin Li, Jiancheng Zhang, Yinglun Zhu, Cong Liu

Abstract: Incorporating self-supervised learning (SSL) before standard supervised learning (SL) has become a widely used strategy to enhance model performance, particularly in data-limited scenarios. However, this approach introduces a trade-off between computation and performance: while SSL helps with representation learning, it requires a separate, often time-consuming training phase, increasing computational overhead and limiting efficiency in resource-constrained settings. To address these challenges, we propose MixTraining, a novel framework that interleaves several SSL and SL epochs within a unified mixtraining training phase, featuring a smooth transition between two learning objectives. MixTraining enhances synergy between SSL and SL for improved accuracy and consolidates shared computation steps to reduce computation overhead. MixTraining is versatile and applicable to both single-task and multi-task learning scenarios. Extensive experiments demonstrate that MixTraining offers a superior compute-performance trade-off compared to conventional pipelines, achieving an 8.81% absolute accuracy gain (18.89% relative accuracy gain) on the TinyImageNet dataset while accelerating training by up to 1.29x with the ViT-Tiny model.

new Analyzing Cost-Sensitive Surrogate Losses via $\mathcal{H}$-calibration

Authors: Sanket Shah, Milind Tambe, Jessie Finocchiaro

Abstract: This paper aims to understand whether machine learning models should be trained using cost-sensitive surrogates or cost-agnostic ones (e.g., cross-entropy). Analyzing this question through the lens of $\mathcal{H}$-calibration, we find that cost-sensitive surrogates can strictly outperform their cost-agnostic counterparts when learning small models under common distributional assumptions. Since these distributional assumptions are hard to verify in practice, we also show that cost-sensitive surrogates consistently outperform cost-agnostic surrogates on classification datasets from the UCI repository. Together, these make a strong case for using cost-sensitive surrogates in practice.

new Retrieval Augmented Anomaly Detection (RAAD): Nimble Model Adjustment Without Retraining

Authors: Sam Pastoriza, Iman Yousfi, Christopher Redino, Marc Vucovich, Abdul Rahman, Sal Aguinaga, Dhruv Nandakumar

Abstract: We propose a novel mechanism for real-time (human-in-the-loop) feedback focused on false positive reduction to enhance anomaly detection models. It was designed for the lightweight deployment of a behavioral network anomaly detection model. This methodology is easily integrable to similar domains that require a premium on throughput while maintaining high precision. In this paper, we introduce Retrieval Augmented Anomaly Detection, a novel method taking inspiration from Retrieval Augmented Generation. Human annotated examples are sent to a vector store, which can modify model outputs on the very next processed batch for model inference. To demonstrate the generalization of this technique, we benchmarked several different model architectures and multiple data modalities, including images, text, and graph-based data.

new High-fidelity Multiphysics Modelling for Rapid Predictions Using Physics-informed Parallel Neural Operator

Authors: Biao Yuan, He Wang, Yanjie Song, Ana Heitor, Xiaohui Chen

Abstract: Modelling complex multiphysics systems governed by nonlinear and strongly coupled partial differential equations (PDEs) is a cornerstone in computational science and engineering. However, it remains a formidable challenge for traditional numerical solvers due to high computational cost, making them impractical for large-scale applications. Neural operators' reliance on data-driven training limits their applicability in real-world scenarios, as data is often scarce or expensive to obtain. Here, we propose a novel paradigm, physics-informed parallel neural operator (PIPNO), a scalable and unsupervised learning framework that enables data-free PDE modelling by leveraging only governing physical laws. The parallel kernel integration design, incorporating ensemble learning, significantly enhances both compatibility and computational efficiency, enabling scalable operator learning for nonlinear and strongly coupled PDEs. PIPNO efficiently captures nonlinear operator mappings across diverse physics, including geotechnical engineering, material science, electromagnetism, quantum mechanics, and fluid dynamics. The proposed method achieves high-fidelity and rapid predictions, outperforming existing operator learning approaches in modelling nonlinear and strongly coupled multiphysics systems. Therefore, PIPNO offers a powerful alternative to conventional solvers, broadening the applicability of neural operators for multiphysics modelling while ensuring efficiency, robustness, and scalability.

new Generalist World Model Pre-Training for Efficient Reinforcement Learning

Authors: Yi Zhao, Aidan Scannell, Yuxin Hou, Tianyu Cui, Le Chen, Dieter B\"uchler, Arno Solin, Juho Kannala, Joni Pajarinen

Abstract: Sample-efficient robot learning is a longstanding goal in robotics. Inspired by the success of scaling in vision and language, the robotics community is now investigating large-scale offline datasets for robot learning. However, existing methods often require expert and/or reward-labeled task-specific data, which can be costly and limit their application in practice. In this paper, we consider a more realistic setting where the offline data consists of reward-free and non-expert multi-embodiment offline data. We show that generalist world model pre-training (WPT), together with retrieval-based experience rehearsal and execution guidance, enables efficient reinforcement learning (RL) and fast task adaptation with such non-curated data. In experiments over 72 visuomotor tasks, spanning 6 different embodiments, covering hard exploration, complex dynamics, and various visual properties, WPT achieves 35.65% and 35% higher aggregated score compared to widely used learning-from-scratch baselines, respectively.

new Extremely Greedy Equivalence Search

Authors: Achille Nazaret, David Blei

Abstract: The goal of causal discovery is to learn a directed acyclic graph from data. One of the most well-known methods for this problem is Greedy Equivalence Search (GES). GES searches for the graph by incrementally and greedily adding or removing edges to maximize a model selection criterion. It has strong theoretical guarantees on infinite data but can fail in practice on finite data. In this paper, we first identify some of the causes of GES's failure, finding that it can get blocked in local optima, especially in denser graphs. We then propose eXtremely Greedy Equivalent Search (XGES), which involves a new heuristic to improve the search strategy of GES while retaining its theoretical guarantees. In particular, XGES favors deleting edges early in the search over inserting edges, which reduces the possibility of the search ending in local optima. A further contribution of this work is an efficient algorithmic formulation of XGES (and GES). We benchmark XGES on simulated datasets with known ground truth. We find that XGES consistently outperforms GES in recovering the correct graphs, and it is 10 times faster. XGES implementations in Python and C++ are available at https://github.com/ANazaret/XGES.

URLs: https://github.com/ANazaret/XGES.

new PhenoProfiler: Advancing Phenotypic Learning for Image-based Drug Discovery

Authors: Bo Li, Bob Zhang, Chengyang Zhang, Minghao Zhou, Weiliang Huang, Shihang Wang, Qing Wang, Mengran Li, Yong Zhang, Qianqian Song

Abstract: In the field of image-based drug discovery, capturing the phenotypic response of cells to various drug treatments and perturbations is a crucial step. However, existing methods require computationally extensive and complex multi-step procedures, which can introduce inefficiencies, limit generalizability, and increase potential errors. To address these challenges, we present PhenoProfiler, an innovative model designed to efficiently and effectively extract morphological representations, enabling the elucidation of phenotypic changes induced by treatments. PhenoProfiler is designed as an end-to-end tool that processes whole-slide multi-channel images directly into low-dimensional quantitative representations, eliminating the extensive computational steps required by existing methods. It also includes a multi-objective learning module to enhance robustness, accuracy, and generalization in morphological representation learning. PhenoProfiler is rigorously evaluated on large-scale publicly available datasets, including over 230,000 whole-slide multi-channel images in end-to-end scenarios and more than 8.42 million single-cell images in non-end-to-end settings. Across these benchmarks, PhenoProfiler consistently outperforms state-of-the-art methods by up to 20%, demonstrating substantial improvements in both accuracy and robustness. Furthermore, PhenoProfiler uses a tailored phenotype correction strategy to emphasize relative phenotypic changes under treatments, facilitating the detection of biologically meaningful signals. UMAP visualizations of treatment profiles demonstrate PhenoProfiler ability to effectively cluster treatments with similar biological annotations, thereby enhancing interpretability. These findings establish PhenoProfiler as a scalable, generalizable, and robust tool for phenotypic learning.

new LORENZA: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM

Authors: Yehonathan Refael, Iftach Arbel, Ofir Lindenbaum, Tom Tirer

Abstract: We study robust parameter-efficient fine-tuning (PEFT) techniques designed to improve accuracy and generalization while operating within strict computational and memory hardware constraints, specifically focusing on large-language models (LLMs). Existing PEFT methods often lack robustness and fail to generalize effectively across diverse tasks, leading to suboptimal performance in real-world scenarios. To address this, we present a new highly computationally efficient framework called AdaZo-SAM, combining Adam and Sharpness-Aware Minimization (SAM) while requiring only a single-gradient computation in every iteration. This is achieved using a stochastic zeroth-order estimation to find SAM's ascent perturbation. We provide a convergence guarantee for AdaZo-SAM and show that it improves the generalization ability of state-of-the-art PEFT methods. Additionally, we design a low-rank gradient optimization method named LORENZA, which is a memory-efficient version of AdaZo-SAM. LORENZA utilizes a randomized SVD scheme to efficiently compute the subspace projection matrix and apply optimization steps onto the selected subspace. This technique enables full-parameter fine-tuning with adaptive low-rank gradient updates, achieving the same reduced memory consumption as gradient-low-rank-projection methods. We provide a convergence analysis of LORENZA and demonstrate its merits for pre-training and fine-tuning LLMs.

new Improving Representation Learning of Complex Critical Care Data with ICU-BERT

Authors: Ricardo Santos, Andr\'e V. Carreiro, Xi Peng, Hugo Gamboa, Holger Fr\"ohlich

Abstract: The multivariate, asynchronous nature of real-world clinical data, such as that generated in Intensive Care Units (ICUs), challenges traditional AI-based decision-support systems. These often assume data regularity and feature independence and frequently rely on limited data scopes and manual feature engineering. The potential of generative AI technologies has not yet been fully exploited to analyze clinical data. We introduce ICU-BERT, a transformer-based model pre-trained on the MIMIC-IV database using a multi-task scheme to learn robust representations of complex ICU data with minimal preprocessing. ICU-BERT employs a multi-token input strategy, incorporating dense embeddings from a biomedical Large Language Model to learn a generalizable representation of complex and multivariate ICU data. With an initial evaluation of five tasks and four additional ICU datasets, ICU-BERT results indicate that ICU-BERT either compares to or surpasses current performance benchmarks by leveraging fine-tuning. By integrating structured and unstructured data, ICU-BERT advances the use of foundational models in medical informatics, offering an adaptable solution for clinical decision support across diverse applications.

new Introduction to Sequence Modeling with Transformers

Authors: Joni-Kristian K\"am\"ar\"ainen

Abstract: Understanding the transformer architecture and its workings is essential for machine learning (ML) engineers. However, truly understanding the transformer architecture can be demanding, even if you have a solid background in machine learning or deep learning. The main working horse is attention, which yields to the transformer encoder-decoder structure. However, putting attention aside leaves several programming components that are easy to implement but whose role for the whole is unclear. These components are 'tokenization', 'embedding' ('un-embedding'), 'masking', 'positional encoding', and 'padding'. The focus of this work is on understanding them. To keep things simple, the understanding is built incrementally by adding components one by one, and after each step investigating what is doable and what is undoable with the current model. Simple sequences of zeros (0) and ones (1) are used to study the workings of each step.

new Learning Ensembles of Interpretable Simple Structure

Authors: Gaurav Arwade, Sigurdur Olafsson

Abstract: Decision-making in complex systems often relies on machine learning models, yet highly accurate models such as XGBoost and neural networks can obscure the reasoning behind their predictions. In operations research applications, understanding how a decision is made is often as crucial as the decision itself. Traditional interpretable models, such as decision trees and logistic regression, provide transparency but may struggle with datasets containing intricate feature interactions. However, complexity in decision-making stem from interactions that are only relevant within certain subsets of data. Within these subsets, feature interactions may be simplified, forming simple structures where simple interpretable models can perform effectively. We propose a bottom-up simple structure-identifying algorithm that partitions data into interpretable subgroups known as simple structure, where feature interactions are minimized, allowing simple models to be trained within each subgroup. We demonstrate the robustness of the algorithm on synthetic data and show that the decision boundaries derived from simple structures are more interpretable and aligned with the intuition of the domain than those learned from a global model. By improving both explainability and predictive accuracy, our approach provides a principled framework for decision support in applications where model transparency is essential.

new PRDP: Progressively Refined Differentiable Physics

Authors: Kanishk Bhatia, Felix Koehler, Nils Thuerey

Abstract: The physics solvers employed for neural network training are primarily iterative, and hence, differentiating through them introduces a severe computational burden as iterations grow large. Inspired by works in bilevel optimization, we show that full accuracy of the network is achievable through physics significantly coarser than fully converged solvers. We propose Progressively Refined Differentiable Physics (PRDP), an approach that identifies the level of physics refinement sufficient for full training accuracy. By beginning with coarse physics, adaptively refining it during training, and stopping refinement at the level adequate for training, it enables significant compute savings without sacrificing network accuracy. Our focus is on differentiating iterative linear solvers for sparsely discretized differential operators, which are fundamental to scientific computing. PRDP is applicable to both unrolled and implicit differentiation. We validate its performance on a variety of learning scenarios involving differentiable physics solvers such as inverse problems, autoregressive neural emulators, and correction-based neural-hybrid solvers. In the challenging example of emulating the Navier-Stokes equations, we reduce training time by 62%.

new Treatment Non-Adherence Bias in Clinical Machine Learning: A Real-World Study on Hypertension Medication

Authors: Zhongyuan Liang, Arvind Suresh, Irene Y. Chen

Abstract: Machine learning systems trained on electronic health records (EHRs) increasingly guide treatment decisions, but their reliability depends on the critical assumption that patients follow the prescribed treatments recorded in EHRs. Using EHR data from 3,623 hypertension patients, we investigate how treatment non-adherence introduces implicit bias that can fundamentally distort both causal inference and predictive modeling. By extracting patient adherence information from clinical notes using a large language model, we identify 786 patients (21.7%) with medication non-adherence. We further uncover key demographic and clinical factors associated with non-adherence, as well as patient-reported reasons including side effects and difficulties obtaining refills. Our findings demonstrate that this implicit bias can not only reverse estimated treatment effects, but also degrade model performance by up to 5% while disproportionately affecting vulnerable populations by exacerbating disparities in decision outcomes and model error rates. This highlights the importance of accounting for treatment non-adherence in developing responsible and equitable clinical machine learning systems.

new Developing robust methods to handle missing data in real-world applications effectively

Authors: Youran Zhou, Mohamed Reda Bouadjenek, Sunil Arya

Abstract: Missing data is a pervasive challenge spanning diverse data types, including tabular, sensor data, time-series, images and so on. Its origins are multifaceted, resulting in various missing mechanisms. Prior research in this field has predominantly revolved around the assumption of the Missing Completely At Random (MCAR) mechanism. However, Missing At Random (MAR) and Missing Not At Random (MNAR) mechanisms, though equally prevalent, have often remained underexplored despite their significant influence. This PhD project presents a comprehensive research agenda designed to investigate the implications of diverse missing data mechanisms. The principal aim is to devise robust methodologies capable of effectively handling missing data while accommodating the unique characteristics of MCAR, MAR, and MNAR mechanisms. By addressing these gaps, this research contributes to an enriched understanding of the challenges posed by missing data across various industries and data modalities. It seeks to provide practical solutions that enable the effective management of missing data, empowering researchers and practitioners to leverage incomplete datasets confidently.

new cMIM: A Contrastive Mutual Information Framework for Unified Generative and Discriminative Representation Learning

Authors: Micha Livne

Abstract: Learning representations that are useful for unknown downstream tasks is a fundamental challenge in representation learning. Prominent approaches in this domain include contrastive learning, self-supervised masking, and denoising auto-encoders. In this paper, we introduce a novel method, termed contrastive Mutual Information Machine (cMIM), which aims to enhance the utility of learned representations for downstream tasks. cMIM integrates a new contrastive learning loss with the Mutual Information Machine (MIM) learning framework, a probabilistic auto-encoder that maximizes the mutual information between inputs and latent representations while clustering the latent codes. Despite MIM's potential, initial experiments indicated that the representations learned by MIM were less effective for discriminative downstream tasks compared to state-of-the-art (SOTA) models. The proposed cMIM method directly addresses this limitation. The main contributions of this work are twofold: (1) We propose a novel contrastive extension to MIM for learning discriminative representations which eliminates the need for data augmentation and is robust to variations in the number of negative examples (i.e., batch size). (2) We introduce a generic method for extracting informative embeddings from encoder-decoder models, which significantly improves performance in discriminative downstream tasks without requiring additional training. This method is applicable to any pre-trained encoder-decoder model. By presenting cMIM, we aim to offer a unified generative model that is effective for both generative and discriminative tasks. Our results demonstrate that the learned representations are valuable for downstream tasks while maintaining the generative capabilities of MIM.

new Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Authors: Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz

Abstract: Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

new Unlocking Multi-Modal Potentials for Dynamic Text-Attributed Graph Representation

Authors: Yuanyuan Xu, Wenjie Zhang, Ying Zhang, Xuemin Lin, Xiwei Xu

Abstract: Dynamic Text-Attributed Graphs (DyTAGs) are a novel graph paradigm that captures evolving temporal edges alongside rich textual attributes. A prior approach to representing DyTAGs leverages pre-trained language models to encode text attributes and subsequently integrates them into dynamic graph models. However, it follows edge-centric modeling, as in dynamic graph learning, which is limited in local structures and fails to exploit the unique characteristics of DyTAGs, leading to suboptimal performance. We observe that DyTAGs inherently comprise three distinct modalities-temporal, textual, and structural-often exhibiting dispersed or even orthogonal distributions, with the first two largely overlooked in existing research. Building on this insight, we propose MoMent, a model-agnostic multi-modal framework that can seamlessly integrate with dynamic graph models for structural modality learning. The core idea is to shift from edge-centric to node-centric modeling, fully leveraging three modalities for node representation. Specifically, MoMent presents non-shared node-centric encoders based on the attention mechanism to capture global temporal and semantic contexts from temporal and textual modalities, together with local structure learning, thus generating modality-specific tokens. To prevent disjoint latent space, we propose a symmetric alignment loss, an auxiliary objective that aligns temporal and textual tokens, ensuring global temporal-semantic consistency with a theoretical guarantee. Last, we design a lightweight adaptor to fuse these tokens, generating comprehensive and cohesive node representations. We theoretically demonstrate that MoMent enhances discriminative power over exclusive edge-centric modeling. Extensive experiments across seven datasets and two downstream tasks show that MoMent achieves up to 33.62% improvement against the baseline using four dynamic graph models.

new Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning

Authors: Shangding Gu, Laixi Shi, Muning Wen, Ming Jin, Eric Mazumdar, Yuejie Chi, Adam Wierman, Costas Spanos

Abstract: Driven by inherent uncertainty and the sim-to-real gap, robust reinforcement learning (RL) seeks to improve resilience against the complexity and variability in agent-environment sequential interactions. Despite the existence of a large number of RL benchmarks, there is a lack of standardized benchmarks for robust RL. Current robust RL policies often focus on a specific type of uncertainty and are evaluated in distinct, one-off environments. In this work, we introduce Robust-Gymnasium, a unified modular benchmark designed for robust RL that supports a wide variety of disruptions across all key RL components-agents' observed state and reward, agents' actions, and the environment. Offering over sixty diverse task environments spanning control and robotics, safe RL, and multi-agent RL, it provides an open-source and user-friendly tool for the community to assess current methods and foster the development of robust RL algorithms. In addition, we benchmark existing standard and robust RL algorithms within this framework, uncovering significant deficiencies in each and offering new insights.

new Variation Matters: from Mitigating to Embracing Zero-Shot NAS Ranking Function Variation

Authors: Pavel Rumiantsev, Mark Coates

Abstract: Neural Architecture Search (NAS) is a powerful automatic alternative to manual design of a neural network. In the zero-shot version, a fast ranking function is used to compare architectures without training them. The outputs of the ranking functions often vary significantly due to different sources of randomness, including the evaluated architecture's weights' initialization or the batch of data used for calculations. A common approach to addressing the variation is to average a ranking function output over several evaluations. We propose taking into account the variation in a different manner, by viewing the ranking function output as a random variable representing a proxy performance metric. During the search process, we strive to construct a stochastic ordering of the performance metrics to determine the best architecture. Our experiments show that the proposed stochastic ordering can effectively boost performance of a search on standard benchmark search spaces.

new Out-of-distribution Generalization for Total Variation based Invariant Risk Minimization

Authors: Yuanchao Wang, Zhao-Rong Lai, Tianqi Zhong

Abstract: Invariant risk minimization is an important general machine learning framework that has recently been interpreted as a total variation model (IRM-TV). However, how to improve out-of-distribution (OOD) generalization in the IRM-TV setting remains unsolved. In this paper, we extend IRM-TV to a Lagrangian multiplier model named OOD-TV-IRM. We find that the autonomous TV penalty hyperparameter is exactly the Lagrangian multiplier. Thus OOD-TV-IRM is essentially a primal-dual optimization model, where the primal optimization minimizes the entire invariant risk and the dual optimization strengthens the TV penalty. The objective is to reach a semi-Nash equilibrium where the balance between the training loss and OOD generalization is maintained. We also develop a convergent primal-dual algorithm that facilitates an adversarial learning scheme. Experimental results show that OOD-TV-IRM outperforms IRM-TV in most situations.

new Training Robust Graph Neural Networks by Modeling Noise Dependencies

Authors: Yeonjun In, Kanghoon Yoon, Sukwon Yun, Kibum Kim, Sungchul Kim, Chanyoung Park

Abstract: In real-world applications, node features in graphs often contain noise from various sources, leading to significant performance degradation in GNNs. Although several methods have been developed to enhance robustness, they rely on the unrealistic assumption that noise in node features is independent of the graph structure and node labels, thereby limiting their applicability. To this end, we introduce a more realistic noise scenario, dependency-aware noise on graphs (DANG), where noise in node features create a chain of noise dependencies that propagates to the graph structure and node labels. We propose a novel robust GNN, DA-GNN, which captures the causal relationships among variables in the data generating process (DGP) of DANG using variational inference. In addition, we present new benchmark datasets that simulate DANG in real-world applications, enabling more practical research on robust GNNs. Extensive experiments demonstrate that DA-GNN consistently outperforms existing baselines across various noise scenarios, including both DANG and conventional noise models commonly considered in this field.

new The Future Outcome Reasoning and Confidence Assessment Benchmark

Authors: Zhangdie Yuan, Zifeng Ding, Andreas Vlachos

Abstract: Forecasting is an important task in many domains, such as technology and economics. However existing forecasting benchmarks largely lack comprehensive confidence assessment, focus on limited question types, and often consist of artificial questions that do not align with real-world human forecasting needs. To address these gaps, we introduce FOReCAst (Future Outcome Reasoning and Confidence Assessment), a benchmark that evaluates models' ability to make predictions and their confidence in them. FOReCAst spans diverse forecasting scenarios involving Boolean questions, timeframe prediction, and quantity estimation, enabling a comprehensive evaluation of both prediction accuracy and confidence calibration for real-world applications.

new Accurate and Scalable Graph Neural Networks via Message Invariance

Authors: Zhihao Shi, Jie Wang, Zhiwei Zhuang, Xize Liang, Bin Li, Feng Wu

Abstract: Message passing-based graph neural networks (GNNs) have achieved great success in many real-world applications. For a sampled mini-batch of target nodes, the message passing process is divided into two parts: message passing between nodes within the batch (MP-IB) and message passing from nodes outside the batch to those within it (MP-OB). However, MP-OB recursively relies on higher-order out-of-batch neighbors, leading to an exponentially growing computational cost with respect to the number of layers. Due to the neighbor explosion, the whole message passing stores most nodes and edges on the GPU such that many GNNs are infeasible to large-scale graphs. To address this challenge, we propose an accurate and fast mini-batch approach for large graph transductive learning, namely topological compensation (TOP), which obtains the outputs of the whole message passing solely through MP-IB, without the costly MP-OB. The major pillar of TOP is a novel concept of message invariance, which defines message-invariant transformations to convert costly MP-OB into fast MP-IB. This ensures that the modified MP-IB has the same output as the whole message passing. Experiments demonstrate that TOP is significantly faster than existing mini-batch methods by order of magnitude on vast graphs (millions of nodes and billions of edges) with limited accuracy degradation.

new Tokens for Learning, Tokens for Unlearning: Mitigating Membership Inference Attacks in Large Language Models via Dual-Purpose Training

Authors: Toan Tran, Ruixuan Liu, Li Xiong

Abstract: Large language models (LLMs) have become the backbone of modern natural language processing but pose privacy concerns about leaking sensitive training data. Membership inference attacks (MIAs), which aim to infer whether a sample is included in a model's training dataset, can serve as a foundation for broader privacy threats. Existing defenses designed for traditional classification models do not account for the sequential nature of text data. As a result, they either require significant computational resources or fail to effectively mitigate privacy risks in LLMs. In this work, we propose a lightweight yet effective empirical privacy defense for protecting training data of language modeling by leveraging the token-specific characteristics. By analyzing token dynamics during training, we propose a token selection strategy that categorizes tokens into hard tokens for learning and memorized tokens for unlearning. Subsequently, our training-phase defense optimizes a novel dual-purpose token-level loss to achieve a Pareto-optimal balance between utility and privacy. Extensive experiments demonstrate that our approach not only provides strong protection against MIAs but also improves language modeling performance by around 10\% across various LLM architectures and datasets compared to the baselines.

new Causal Effect Estimation under Networked Interference without Networked Unconfoundedness Assumption

Authors: Weilin Chen, Ruichu Cai, Jie Qiao, Yuguang Yan, Jos\'e Miguel Hern\'andez-Lobato

Abstract: Estimating causal effects under networked interference is a crucial yet challenging problem. Existing methods based on observational data mainly rely on the networked unconfoundedness assumption, which guarantees the identification of networked effects. However, the networked unconfoundedness assumption is usually violated due to the latent confounders in observational data, hindering the identification of networked effects. Interestingly, in such networked settings, interactions between units provide valuable information for recovering latent confounders. In this paper, we identify three types of latent confounders in networked inference that hinder identification: those affecting only the individual, those affecting only neighbors, and those influencing both. Specifically, we devise a networked effect estimator based on identifiable representation learning techniques. Theoretically, we establish the identifiability of all latent confounders, and leveraging the identified latent confounders, we provide the networked effect identification result. Extensive experiments validate our theoretical results and demonstrate the effectiveness of the proposed method.

new BiRating -- Iterative averaging on a bipartite graph of Beat Saber scores, player skills, and map difficulties

Authors: Juan Casanova

Abstract: Difficulty estimation of Beat Saber maps is an interesting data analysis problem and valuable to the Beat Saber competitive scene. We present a simple algorithm that iteratively averages player skill and map difficulty estimations in a bipartite graph of players and maps, connected by scores, using scores only as input. This approach simultaneously estimates player skills and map difficulties, exploiting each of them to improve the estimation of the other, exploitng the relation of multiple scores by different players on the same map, or on different maps by the same player. While we have been unable to prove or characterize theoretical convergence, the implementation exhibits convergent behaviour to low estimation error in all instances, producing accurate results. An informal qualitative evaluation involving experienced Beat Saber community members was carried out, comparing the difficulty estimations output by our algorithm with their personal perspectives on the difficulties of different maps. There was a significant alignment with player perceived perceptions of difficulty and with other existing methods for estimating difficulty. Our approach showed significant improvement over existing methods in certain known problematic maps that are not typically accurately estimated, but also produces problematic estimations for certain families of maps where the assumptions on the meaning of scores were inadequate (e.g. not enough scores, or scores over optimized by players). The algorithm has important limitations, related to data quality and meaningfulness, assumptions on the domain problem, and theoretical convergence of the algorithm. Future work would significantly benefit from a better understanding of adequate ways to quantify map difficulty in Beat Saber, including multidimensionality of skill and difficulty, and the systematic biases present in score data.

new CirT: Global Subseasonal-to-Seasonal Forecasting with Geometry-inspired Transformer

Authors: Yang Liu, Zinan Zheng, Jiashun Cheng, Fugee Tsung, Deli Zhao, Yu Rong, Jia Li

Abstract: Accurate Subseasonal-to-Seasonal (S2S) climate forecasting is pivotal for decision-making including agriculture planning and disaster preparedness but is known to be challenging due to its chaotic nature. Although recent data-driven models have shown promising results, their performance is limited by inadequate consideration of geometric inductive biases. Usually, they treat the spherical weather data as planar images, resulting in an inaccurate representation of locations and spatial relations. In this work, we propose the geometric-inspired Circular Transformer (CirT) to model the cyclic characteristic of the graticule, consisting of two key designs: (1) Decomposing the weather data by latitude into circular patches that serve as input tokens to the Transformer; (2) Leveraging Fourier transform in self-attention to capture the global information and model the spatial periodicity. Extensive experiments on the Earth Reanalysis 5 (ERA5) reanalysis dataset demonstrate our model yields a significant improvement over the advanced data-driven models, including PanguWeather and GraphCast, as well as skillful ECMWF systems. Additionally, we empirically show the effectiveness of our model designs and high-quality prediction over spatial and temporal dimensions.

new Probabilistic Federated Prompt-Tuning with Non-IID and Imbalanced Data

Authors: Pei-Yau Weng, Minh Hoang, Lam M. Nguyen, My T. Thai, Tsui-Wei Weng, Trong Nghia Hoang

Abstract: Fine-tuning pre-trained models is a popular approach in machine learning for solving complex tasks with moderate data. However, fine-tuning the entire pre-trained model is ineffective in federated data scenarios where local data distributions are diversely skewed. To address this, we explore integrating federated learning with a more effective prompt-tuning method, optimizing for a small set of input prefixes to reprogram the pre-trained model's behavior. Our approach transforms federated learning into a distributed set modeling task, aggregating diverse sets of prompts to globally fine-tune the pre-trained model. We benchmark various baselines based on direct adaptations of existing federated model aggregation techniques and introduce a new probabilistic prompt aggregation method that substantially outperforms these baselines. Our reported results on a variety of computer vision datasets confirm that the proposed method is most effective to combat extreme data heterogeneity in federated learning.

new HALO: Robust Out-of-Distribution Detection via Joint Optimisation

Authors: Hugo Lyons Keenan, Sarah Erfani, Christopher Leckie

Abstract: Effective out-of-distribution (OOD) detection is crucial for the safe deployment of machine learning models in real-world scenarios. However, recent work has shown that OOD detection methods are vulnerable to adversarial attacks, potentially leading to critical failures in high-stakes applications. This discovery has motivated work on robust OOD detection methods that are capable of maintaining performance under various attack settings. Prior approaches have made progress on this problem but face a number of limitations: often only exhibiting robustness to attacks on OOD data or failing to maintain strong clean performance. In this work, we adapt an existing robust classification framework, TRADES, extending it to the problem of robust OOD detection and discovering a novel objective function. Recognising the critical importance of a strong clean/robust trade-off for OOD detection, we introduce an additional loss term which boosts classification and detection performance. Our approach, called HALO (Helper-based AdversariaL OOD detection), surpasses existing methods and achieves state-of-the-art performance across a number of datasets and attack settings. Extensive experiments demonstrate an average AUROC improvement of 3.15 in clean settings and 7.07 under adversarial attacks when compared to the next best method. Furthermore, HALO exhibits resistance to transferred attacks, offers tuneable performance through hyperparameter selection, and is compatible with existing OOD detection frameworks out-of-the-box, leaving open the possibility of future performance gains. Code is available at: https://github.com/hugo0076/HALO

URLs: https://github.com/hugo0076/HALO

new Learning with Exact Invariances in Polynomial Time

Authors: Ashkan Soleymani, Behrooz Tahmasebi, Stefanie Jegelka, Patrick Jaillet

Abstract: We study the statistical-computational trade-offs for learning with exact invariances (or symmetries) using kernel regression. Traditional methods, such as data augmentation, group averaging, canonicalization, and frame-averaging, either fail to provide a polynomial-time solution or are not applicable in the kernel setting. However, with oracle access to the geometric properties of the input space, we propose a polynomial-time algorithm that learns a classifier with \emph{exact} invariances. Moreover, our approach achieves the same excess population risk (or generalization error) as the original kernel regression problem. To the best of our knowledge, this is the first polynomial-time algorithm to achieve exact (not approximate) invariances in this context. Our proof leverages tools from differential geometry, spectral theory, and optimization. A key result in our development is a new reformulation of the problem of learning under invariances as optimizing an infinite number of linearly constrained convex quadratic programs, which may be of independent interest.

new Obtaining Example-Based Explanations from Deep Neural Networks

Authors: Genghua Dong, Henrik Bostr\"om, Michalis Vazirgiannis, Roman Bresson

Abstract: Most techniques for explainable machine learning focus on feature attribution, i.e., values are assigned to the features such that their sum equals the prediction. Example attribution is another form of explanation that assigns weights to the training examples, such that their scalar product with the labels equals the prediction. The latter may provide valuable complementary information to feature attribution, in particular in cases where the features are not easily interpretable. Current example-based explanation techniques have targeted a few model types only, such as k-nearest neighbors and random forests. In this work, a technique for obtaining example-based explanations from deep neural networks (EBE-DNN) is proposed. The basic idea is to use the deep neural network to obtain an embedding, which is employed by a k-nearest neighbor classifier to form a prediction; the example attribution can hence straightforwardly be derived from the latter. Results from an empirical investigation show that EBE-DNN can provide highly concentrated example attributions, i.e., the predictions can be explained with few training examples, without reducing accuracy compared to the original deep neural network. Another important finding from the empirical investigation is that the choice of layer to use for the embeddings may have a large impact on the resulting accuracy.

new In-Context Learning with Hypothesis-Class Guidance

Authors: Ziqian Lin, Shubham Kumar Bharti, Kangwook Lee

Abstract: Recent research has investigated the underlying mechanisms of in-context learning (ICL) both theoretically and empirically, often using data generated from simple function classes. However, the existing work often focuses on the sequence consisting solely of labeled examples, while in practice, labeled examples are typically accompanied by an instruction, providing some side information about the task. In this work, we propose ICL with hypothesis-class guidance (ICL-HCG), a novel synthetic data model for ICL where the input context consists of the literal description of a (finite) hypothesis class $\mathcal{H}$ and $(x,y)$ pairs from a hypothesis chosen from $\mathcal{H}$. Under our framework ICL-HCG, we conduct extensive experiments to explore: (i) a variety of generalization abilities to new hypothesis classes; (ii) different model architectures; (iii) sample complexity; (iv) in-context data imbalance; (v) the role of instruction; and (vi) the effect of pretraining hypothesis diversity. As a result, we show that (a) Transformers can successfully learn ICL-HCG and generalize to unseen hypotheses and unseen hypothesis classes, and (b) compared with ICL without instruction, ICL-HCG achieves significantly higher accuracy, demonstrating the role of instructions.

new Mixtera: A Data Plane for Foundation Model Training

Authors: Maximilian B\"other, Xiaozhe Yao, Tolga Kerimoglu, Ana Klimovic

Abstract: State-of-the-art large language and vision models are trained over trillions of tokens that are aggregated from a large variety of sources. As training data collections grow, manually managing the samples becomes time-consuming, tedious, and prone to errors. Yet recent research shows that the data mixture and the order in which samples are visited during training can significantly influence model accuracy. We build and present Mixtera, a data plane for foundation model training that enables users to declaratively express which data samples should be used in which proportion and in which order during training. Mixtera is a centralized, read-only layer that is deployed on top of existing training data collections and can be declaratively queried. It operates independently of the filesystem structure and supports mixtures across arbitrary properties (e.g., language, source dataset) as well as dynamic adjustment of the mixture based on model feedback. We experimentally evaluate Mixtera and show that our implementation does not bottleneck training and scales to 256 GH200 superchips. We demonstrate how Mixtera supports recent advancements in mixing strategies by implementing the proposed Adaptive Data Optimization (ADO) algorithm in the system and evaluating its performance impact. We also explore the role of mixtures for vision-language models.

new ServoLNN: Lagrangian Neural Networks Driven by Servomechanisms

Authors: Brandon Johns, Zhuomin Zhou, Elahe Abdi

Abstract: Combining deep learning with classical physics facilitates the efficient creation of accurate dynamical models. In a recent class of neural network, Lagrangian mechanics is hard-coded into the architecture, and training the network learns the given system. However, the current architectures do not facilitate the modelling of dynamical systems that are driven by servomechanisms (e.g. servomotors, stepper motors, current sources, volumetric pumps). This article presents ServoLNN, a new architecture to model dynamical systems that are driven by servomechanisms. ServoLNN is compatible for use in real-time applications, where the driving motion is known only just-in-time. A PyTorch implementation of ServoLNN is provided. The derivations and results reveal the occurrence of a possible family of solutions that the training may converge on. The effect of the family of solutions on the predicted physical quantities is explored, as is the resolution to reduce the family of solutions to a single solution. Resultantly, the architecture can simultaneously accurately find the energies, power, rate of work, mass matrix, generalised accelerations, generalised forces, and the generalised forces that drive the servomechanisms.

new Implicit Search via Discrete Diffusion: A Study on Chess

Authors: Jiacheng Ye, Zhenyu Wu, Jiahui Gao, Zhiyong Wu, Xin Jiang, Zhenguo Li, Lingpeng Kong

Abstract: In the post-AlphaGo era, there has been a renewed interest in search techniques such as Monte Carlo Tree Search (MCTS), particularly in their application to Large Language Models (LLMs). This renewed attention is driven by the recognition that current next-token prediction models often lack the ability for long-term planning. Is it possible to instill search-like abilities within the models to enhance their planning abilities without relying on explicit search? We propose DiffuSearch , a model that does \textit{implicit search} by looking into the future world via discrete diffusion modeling. We instantiate DiffuSearch on a classical board game, Chess, where explicit search is known to be essential. Through extensive controlled experiments, we show DiffuSearch outperforms both the searchless and explicit search-enhanced policies. Specifically, DiffuSearch outperforms the one-step policy by 19.2% and the MCTS-enhanced policy by 14% on action accuracy. Furthermore, DiffuSearch demonstrates a notable 30% enhancement in puzzle-solving abilities compared to explicit search-based policies, along with a significant 540 Elo increase in game-playing strength assessment. These results indicate that implicit search via discrete diffusion is a viable alternative to explicit search over a one-step policy. All codes are publicly available at \href{https://github.com/HKUNLP/DiffuSearch}{https://github.com/HKUNLP/DiffuSearch}.

URLs: https://github.com/HKUNLP/DiffuSearch, https://github.com/HKUNLP/DiffuSearch

new Advancing GDP Forecasting: The Potential of Machine Learning Techniques in Economic Predictions

Authors: Bogdan Oancea

Abstract: The quest for accurate economic forecasting has traditionally been dominated by econometric models, which most of the times rely on the assumptions of linear relationships and stationarity in of the data. However, the complex and often nonlinear nature of global economies necessitates the exploration of alternative approaches. Machine learning methods offer promising advantages over traditional econometric techniques for Gross Domestic Product forecasting, given their ability to model complex, nonlinear interactions and patterns without the need for explicit specification of the underlying relationships. This paper investigates the efficacy of Recurrent Neural Networks, in forecasting GDP, specifically LSTM networks. These models are compared against a traditional econometric method, SARIMA. We employ the quarterly Romanian GDP dataset from 1995 to 2023 and build a LSTM network to forecast to next 4 values in the series. Our findings suggest that machine learning models, consistently outperform traditional econometric models in terms of predictive accuracy and flexibility

new GraphSparseNet: a Novel Method for Large Scale Trafffic Flow Prediction

Authors: Weiyang Kong, Kaiqi Wu, Sen Zhang, Yubao Liu

Abstract: Traffic flow forecasting is a critical spatio-temporal data mining task with wide-ranging applications in intelligent route planning and dynamic traffic management. Recent advancements in deep learning, particularly through Graph Neural Networks (GNNs), have significantly enhanced the accuracy of these forecasts by capturing complex spatio-temporal dynamics. However, the scalability of GNNs remains a challenge due to their exponential growth in model complexity with increasing nodes in the graph. Existing methods to address this issue, including sparsification, decomposition, and kernel-based approaches, either do not fully resolve the complexity issue or risk compromising predictive accuracy. This paper introduces GraphSparseNet (GSNet), a novel framework designed to improve both the scalability and accuracy of GNN-based traffic forecasting models. GraphSparseNet is comprised of two core modules: the Feature Extractor and the Relational Compressor. These modules operate with linear time and space complexity, thereby reducing the overall computational complexity of the model to a linear scale. Our extensive experiments on multiple real-world datasets demonstrate that GraphSparseNet not only significantly reduces training time by 3.51x compared to state-of-the-art linear models but also maintains high predictive performance.

new Knowledge Bridger: Towards Training-free Missing Multi-modality Completion

Authors: Guanzhou Ke, Shengfeng He, Xiao Li Wang, Bo Wang, Guoqing Chao, Yuanyang Zhang, Yi Xie, HeXing Su

Abstract: Previous successful approaches to missing modality completion rely on carefully designed fusion techniques and extensive pre-training on complete data, which can limit their generalizability in out-of-domain (OOD) scenarios. In this study, we pose a new challenge: can we develop a missing modality completion model that is both resource-efficient and robust to OOD generalization? To address this, we present a training-free framework for missing modality completion that leverages large multimodal models (LMMs). Our approach, termed the "Knowledge Bridger", is modality-agnostic and integrates generation and ranking of missing modalities. By defining domain-specific priors, our method automatically extracts structured information from available modalities to construct knowledge graphs. These extracted graphs connect the missing modality generation and ranking modules through the LMM, resulting in high-quality imputations of missing modalities. Experimental results across both general and medical domains show that our approach consistently outperforms competing methods, including in OOD generalization. Additionally, our knowledge-driven generation and ranking techniques demonstrate superiority over variants that directly employ LMMs for generation and ranking, offering insights that may be valuable for applications in other domains.

new Revisit the Stability of Vanilla Federated Learning Under Diverse Conditions

Authors: Youngjoon Lee, Jinu Gong, Sun Choi, Joonhyuk Kang

Abstract: Federated Learning (FL) is a distributed machine learning paradigm enabling collaborative model training across decentralized clients while preserving data privacy. In this paper, we revisit the stability of the vanilla FedAvg algorithm under diverse conditions. Despite its conceptual simplicity, FedAvg exhibits remarkably stable performance compared to more advanced FL techniques. Our experiments assess the performance of various FL methods on blood cell and skin lesion classification tasks using Vision Transformer (ViT). Additionally, we evaluate the impact of different representative classification models and analyze sensitivity to hyperparameter variations. The results consistently demonstrate that, regardless of dataset, classification model employed, or hyperparameter settings, FedAvg maintains robust performance. Given its stability, robust performance without the need for extensive hyperparameter tuning, FedAvg is a safe and efficient choice for FL deployments in resource-constrained hospitals handling medical data. These findings underscore the enduring value of the vanilla FedAvg approach as a trusted baseline for clinical practice.

new IL-SOAR : Imitation Learning with Soft Optimistic Actor cRitic

Authors: Stefano Viel, Luca Viano, Volkan Cevher

Abstract: This paper introduces the SOAR framework for imitation learning. SOAR is an algorithmic template that learns a policy from expert demonstrations with a primal dual style algorithm that alternates cost and policy updates. Within the policy updates, the SOAR framework uses an actor critic method with multiple critics to estimate the critic uncertainty and build an optimistic critic fundamental to drive exploration. When instantiated in the tabular setting, we get a provable algorithm with guarantees that matches the best known results in $\epsilon$. Practically, the SOAR template is shown to boost consistently the performance of imitation learning algorithms based on Soft Actor Critic such as f-IRL, ML-IRL and CSIL in several MuJoCo environments. Overall, thanks to SOAR, the required number of episodes to achieve the same performance is reduced by half.

new Graph Probability Aggregation Clustering

Authors: Yuxuan Yan, Na Lu, Difei Mei, Ruofan Yan, Youtian Du

Abstract: Traditional clustering methods typically focus on either cluster-wise global clustering or point-wise local clustering to reveal the intrinsic structures in unlabeled data. Global clustering optimizes an objective function to explore the relationships between clusters, but this approach may inevitably lead to coarse partition. In contrast, local clustering heuristically groups data based on detailed point relationships, but it tends to be less coherence and efficient. To bridge the gap between these two concepts and utilize the strengths of both, we propose Graph Probability Aggregation Clustering (GPAC), a graph-based fuzzy clustering algorithm. GPAC unifies the global clustering objective function with a local clustering constraint. The entire GPAC framework is formulated as a multi-constrained optimization problem, which can be solved using the Lagrangian method. Through the optimization process, the probability of a sample belonging to a specific cluster is iteratively calculated by aggregating information from neighboring samples within the graph. We incorporate a hard assignment variable into the objective function to further improve the convergence and stability of optimization. Furthermore, to efficiently handle large-scale datasets, we introduce an acceleration program that reduces the computational complexity from quadratic to linear, ensuring scalability. Extensive experiments conducted on synthetic, real-world, and deep learning datasets demonstrate that GPAC not only exceeds existing state-of-the-art methods in clustering performance but also excels in computational efficiency, making it a powerful tool for complex clustering challenges.

new SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks

Authors: Nikolay Blagoev, Lydia Yiyu Chen, O\u{g}uzhan Ersoy

Abstract: Data and pipeline parallelism are ubiquitous for training of Large Language Models (LLM) on distributed nodes. Driven by the need for cost-effective training, recent work explores efficient communication arrangement for end to end training. Motivated by LLM's resistance to layer skipping and layer reordering, in this paper, we explore stage (several consecutive layers) skipping in pipeline training, and challenge the conventional practice of sequential pipeline execution. We derive convergence and throughput constraints (guidelines) for pipelining with skipping and swapping pipeline stages. Based on these constraints, we propose SkipPipe, the first partial pipeline framework to reduce the end-to-end training time for LLMs while preserving the convergence. The core of SkipPipe is a path scheduling algorithm that optimizes the paths for individual microbatches and reduces idle time (due to microbatch collisions) on the distributed nodes, complying with the given stage skipping ratio. We extensively evaluate SkipPipe on LLaMa models from 500M to 8B parameters on up to 20 nodes. Our results show that SkipPipe reduces training iteration time by up to $55\%$ compared to full pipeline. Our partial pipeline training also improves resistance to layer omission during inference, experiencing a drop in perplexity of only $7\%$ when running only half the model. Our code is available at https://github.com/gensyn-ai/skippipe.

URLs: https://github.com/gensyn-ai/skippipe.

new Playing Pok\'emon Red via Deep Reinforcement Learning

Authors: Marco Pleines, Daniel Addis, David Rubinstein, Frank Zimmer, Mike Preuss, Peter Whidden

Abstract: Pok\'emon Red, a classic Game Boy JRPG, presents significant challenges as a testbed for agents, including multi-tasking, long horizons of tens of thousands of steps, hard exploration, and a vast array of potential policies. We introduce a simplistic environment and a Deep Reinforcement Learning (DRL) training methodology, demonstrating a baseline agent that completes an initial segment of the game up to completing Cerulean City. Our experiments include various ablations that reveal vulnerabilities in reward shaping, where agents exploit specific reward signals. We also discuss limitations and argue that games like Pok\'emon hold strong potential for future research on Large Language Model agents, hierarchical training algorithms, and advanced exploration methods. Source Code: https://github.com/MarcoMeter/neroRL/tree/poke_red

URLs: https://github.com/MarcoMeter/neroRL/tree/poke_red

new Shifting the Paradigm: A Diffeomorphism Between Time Series Data Manifolds for Achieving Shift-Invariancy in Deep Learning

Authors: Berken Utku Demirel, Christian Holz

Abstract: Deep learning models lack shift invariance, making them sensitive to input shifts that cause changes in output. While recent techniques seek to address this for images, our findings show that these approaches fail to provide shift-invariance in time series, where the data generation mechanism is more challenging due to the interaction of low and high frequencies. Worse, they also decrease performance across several tasks. In this paper, we propose a novel differentiable bijective function that maps samples from their high-dimensional data manifold to another manifold of the same dimension, without any dimensional reduction. Our approach guarantees that samples -- when subjected to random shifts -- are mapped to a unique point in the manifold while preserving all task-relevant information without loss. We theoretically and empirically demonstrate that the proposed transformation guarantees shift-invariance in deep learning models without imposing any limits to the shift. Our experiments on six time series tasks with state-of-the-art methods show that our approach consistently improves the performance while enabling models to achieve complete shift-invariance without modifying or imposing restrictions on the model's topology. The source code is available on \href{https://github.com/eth-siplab/Shifting-the-Paradigm}{GitHub}.

URLs: https://github.com/eth-siplab/Shifting-the-Paradigm

new Incremental Learning with Repetition via Pseudo-Feature Projection

Authors: Benedikt Tscheschner, Eduardo Veas, Marc Masana

Abstract: Incremental Learning scenarios do not always represent real-world inference use-cases, which tend to have less strict task boundaries, and exhibit repetition of common classes and concepts in their continual data stream. To better represent these use-cases, new scenarios with partial repetition and mixing of tasks are proposed, where the repetition patterns are innate to the scenario and unknown to the strategy. We investigate how exemplar-free incremental learning strategies are affected by data repetition, and we adapt a series of state-of-the-art approaches to analyse and fairly compare them under both settings. Further, we also propose a novel method (Horde), able to dynamically adjust an ensemble of self-reliant feature extractors, and align them by exploiting class repetition. Our proposed exemplar-free method achieves competitive results in the classic scenario without repetition, and state-of-the-art performance in the one with repetition.

new Lotus at SemEval-2025 Task 11: RoBERTa with Llama-3 Generated Explanations for Multi-Label Emotion Classification

Authors: Niloofar Ranjbar, Hamed Baghbani

Abstract: This paper presents a novel approach for multi-label emotion detection, where Llama-3 is used to generate explanatory content that clarifies ambiguous emotional expressions, thereby enhancing RoBERTa's emotion classification performance. By incorporating explanatory context, our method improves F1-scores, particularly for emotions like fear, joy, and sadness, and outperforms text-only models. The addition of explanatory content helps resolve ambiguity, addresses challenges like overlapping emotional cues, and enhances multi-label classification, marking a significant advancement in emotion detection tasks.

new Flexible Bivariate Beta Mixture Model: A Probabilistic Approach for Clustering Complex Data Structures

Authors: Yung-Peng Hsu, Hung-Hsuan Chen

Abstract: Clustering is essential in data analysis and machine learning, but traditional algorithms like $k$-means and Gaussian Mixture Models (GMM) often fail with nonconvex clusters. To address the challenge, we introduce the Flexible Bivariate Beta Mixture Model (FBBMM), which utilizes the flexibility of the bivariate beta distribution to handle diverse and irregular cluster shapes. Using the Expectation Maximization (EM) algorithm and Sequential Least Squares Programming (SLSQP) optimizer for parameter estimation, we validate FBBMM on synthetic and real-world datasets, demonstrating its superior performance in clustering complex data structures, offering a robust solution for big data analytics across various domains. We release the experimental code at https://github.com/yung-peng/MBMM-and-FBBMM.

URLs: https://github.com/yung-peng/MBMM-and-FBBMM.

new Algebraic Machine Learning: Learning as computing an algebraic decomposition of a task

Authors: Fernando Martin-Maroto, Nabil Abderrahaman, David Mendez, Gonzalo G. de Polavieja

Abstract: Statistics and Optimization are foundational to modern Machine Learning. Here, we propose an alternative foundation based on Abstract Algebra, with mathematics that facilitates the analysis of learning. In this approach, the goal of the task and the data are encoded as axioms of an algebra, and a model is obtained where only these axioms and their logical consequences hold. Although this is not a generalizing model, we show that selecting specific subsets of its breakdown into algebraic atoms obtained via subdirect decomposition gives a model that generalizes. We validate this new learning principle on standard datasets such as MNIST, FashionMNIST, CIFAR-10, and medical images, achieving performance comparable to optimized multilayer perceptrons. Beyond data-driven tasks, the new learning principle extends to formal problems, such as finding Hamiltonian cycles from their specifications and without relying on search. This algebraic foundation offers a fresh perspective on machine intelligence, featuring direct learning from training data without the need for validation dataset, scaling through model additivity, and asymptotic convergence to the underlying rule in the data.

new Dynamic DropConnect: Enhancing Neural Network Robustness through Adaptive Edge Dropping Strategies

Authors: Yuan-Chih Yang, Hung-Hsuan Chen

Abstract: Dropout and DropConnect are well-known techniques that apply a consistent drop rate to randomly deactivate neurons or edges in a neural network layer during training. This paper introduces a novel methodology that assigns dynamic drop rates to each edge within a layer, uniquely tailoring the dropping process without incorporating additional learning parameters. We perform experiments on synthetic and openly available datasets to validate the effectiveness of our approach. The results demonstrate that our method outperforms Dropout, DropConnect, and Standout, a classic mechanism known for its adaptive dropout capabilities. Furthermore, our approach improves the robustness and generalization of neural network training without increasing computational complexity. The complete implementation of our methodology is publicly accessible for research and replication purposes at https://github.com/ericabd888/Adjusting-the-drop-probability-in-DropConnect-based-on-the-magnitude-of-the-gradient/.

URLs: https://github.com/ericabd888/Adjusting-the-drop-probability-in-DropConnect-based-on-the-magnitude-of-the-gradient/.

new Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches

Authors: Mohammad Moulaeifard, Loic Coquelin, Mantas Rinkevi\v{c}ius, Andrius Solo\v{s}enko, Oskar Pfeffer, Ciaran Bench, Nando Hegemann, Sara Vardanega, Manasi Nandi, Jordi Alastruey, Christian Heiss, Vaidotas Marozas, Andrew Thompson, Philip J. Aston, Peter H. Charlton, Nils Strodthoff

Abstract: Photoplethysmography (PPG) is a widely used non-invasive physiological sensing technique, suitable for various clinical applications. Such clinical applications are increasingly supported by machine learning methods, raising the question of the most appropriate input representation and model choice. Comprehensive comparisons, in particular across different input representations, are scarce. We address this gap in the research landscape by a comprehensive benchmarking study covering three kinds of input representations, interpretable features, image representations and raw waveforms, across prototypical regression and classification use cases: blood pressure and atrial fibrillation prediction. In both cases, the best results are achieved by deep neural networks operating on raw time series as input representations. Within this model class, best results are achieved by modern convolutional neural networks (CNNs). but depending on the task setup, shallow CNNs are often also very competitive. We envision that these results will be insightful for researchers to guide their choice on machine learning tasks for PPG data, even beyond the use cases presented in this work.

new SeisMoLLM: Advancing Seismic Monitoring via Cross-modal Transfer with Pre-trained Large Language Model

Authors: Xinghao Wang, Feng Liu, Rui Su, Zhihui Wang, Lei Bai, Wanli Ouyang

Abstract: Recent advances in deep learning have revolutionized seismic monitoring, yet developing a foundation model that performs well across multiple complex tasks remains challenging, particularly when dealing with degraded signals or data scarcity. This work presents SeisMoLLM, the first foundation model that utilizes cross-modal transfer for seismic monitoring, to unleash the power of large-scale pre-training from a large language model without requiring direct pre-training on seismic datasets. Through elaborate waveform tokenization and fine-tuning of pre-trained GPT-2 model, SeisMoLLM achieves state-of-the-art performance on the DiTing and STEAD datasets across five critical tasks: back-azimuth estimation, epicentral distance estimation, magnitude estimation, phase picking, and first-motion polarity classification. It attains 36 best results out of 43 task metrics and 12 top scores out of 16 few-shot generalization metrics, with many relative improvements ranging from 10% to 50%. In addition to its superior performance, SeisMoLLM maintains efficiency comparable to or even better than lightweight models in both training and inference. These findings establish SeisMoLLM as a promising foundation model for practical seismic monitoring and highlight cross-modal transfer as an exciting new direction for earthquake studies, showcasing the potential of advanced deep learning techniques to propel seismology research forward.

new Do Sparse Autoencoders Generalize? A Case Study of Answerability

Authors: Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost

Abstract: Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across domains, and these features can often manifest differently in each context. We examine this through "answerability"-a model's ability to recognize answerable questions. We extensively evaluate SAE feature generalization across diverse answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply. SAE features demonstrate inconsistent transfer ability, and residual stream probes similarly show high variance out of distribution. Overall, this demonstrates the need for quantitative methods to predict feature generalization in SAE-based interpretability.

new Can Textual Gradient Work in Federated Learning?

Authors: Minghui Chen, Ruinan Jin, Wenlong Deng, Yuanyuan Chen, Zhi Huang, Han Yu, Xiaoxiao Li

Abstract: Recent studies highlight the promise of LLM-based prompt optimization, especially with TextGrad, which automates differentiation'' via texts and backpropagates textual feedback. This approach facilitates training in various real-world applications that do not support numerical gradient propagation or loss calculation. In this paper, we systematically explore the potential and challenges of incorporating textual gradient into Federated Learning (FL). Our contributions are fourfold. Firstly, we introduce a novel FL paradigm, Federated Textual Gradient (FedTextGrad), that allows clients to upload locally optimized prompts derived from textual gradients, while the server aggregates the received prompts. Unlike traditional FL frameworks, which are designed for numerical aggregation, FedTextGrad is specifically tailored for handling textual data, expanding the applicability of FL to a broader range of problems that lack well-defined numerical loss functions. Secondly, building on this design, we conduct extensive experiments to explore the feasibility of FedTextGrad. Our findings highlight the importance of properly tuning key factors (e.g., local steps) in FL training. Thirdly, we highlight a major challenge in FedTextGrad aggregation: retaining essential information from distributed prompt updates. Last but not least, in response to this issue, we improve the vanilla variant of FedTextGrad by providing actionable guidance to the LLM when summarizing client prompts by leveraging the Uniform Information Density principle. Through this principled study, we enable the adoption of textual gradients in FL for optimizing LLMs, identify important issues, and pinpoint future directions, thereby opening up a new research area that warrants further investigation.

new Efficient Time Series Forecasting via Hyper-Complex Models and Frequency Aggregation

Authors: Eyal Yakir, Dor Tsur, Haim Permuter

Abstract: Time series forecasting is a long-standing problem in statistics and machine learning. One of the key challenges is processing sequences with long-range dependencies. To that end, a recent line of work applied the short-time Fourier transform (STFT), which partitions the sequence into multiple subsequences and applies a Fourier transform to each separately. We propose the Frequency Information Aggregation (FIA)-Net, which is based on a novel complex-valued MLP architecture that aggregates adjacent window information in the frequency domain. To further increase the receptive field of the FIA-Net, we treat the set of windows as hyper-complex (HC) valued vectors and employ HC algebra to efficiently combine information from all STFT windows altogether. Using the HC-MLP backbone allows for improved handling of sequences with long-term dependence. Furthermore, due to the nature of HC operations, the HC-MLP uses up to three times fewer parameters than the equivalent standard window aggregation method. We evaluate the FIA-Net on various time-series benchmarks and show that the proposed methodologies outperform existing state of the art methods in terms of both accuracy and efficiency. Our code is publicly available on https://anonymous.4open.science/r/research-1803/.

URLs: https://anonymous.4open.science/r/research-1803/.

new WaveGAS: Waveform Relaxation for Scaling Graph Neural Networks

Authors: Jana Vatter, Mykhaylo Zayats, Marcos Mart\'inez Galindo, Vanessa L\'opez, Ruben Mayer, Hans-Arno Jacobsen, Hoang Thanh Lam

Abstract: With the ever-growing size of real-world graphs, numerous techniques to overcome resource limitations when training Graph Neural Networks (GNNs) have been developed. One such approach, GNNAutoScale (GAS), uses graph partitioning to enable training under constrained GPU memory. GAS also stores historical embedding vectors, which are retrieved from one-hop neighbors in other partitions, ensuring critical information is captured across partition boundaries. The historical embeddings which come from the previous training iteration are stale compared to the GAS estimated embeddings, resulting in approximation errors of the training algorithm. Furthermore, these errors accumulate over multiple layers, leading to suboptimal node embeddings. To address this shortcoming, we propose two enhancements: first, WaveGAS, inspired by waveform relaxation, performs multiple forward passes within GAS before the backward pass, refining the approximation of historical embeddings and gradients to improve accuracy; second, a gradient-tracking method that stores and utilizes more accurate historical gradients during training. Empirical results show that WaveGAS enhances GAS and achieves better accuracy, even outperforming methods that train on full graphs, thanks to its robust estimation of node embeddings.

new Dam Volume Prediction Model Development Using ML Algorithms

Authors: Hugo Retief, Mariangel Garcia Andarcia, Chris Dickens, Surajit Ghosh

Abstract: Reliable reservoir volume estimates are crucial for water resource management, especially in arid and semi-arid regions. The present study investigates applying three machine learning regression techniques - Gradient Boosting, Random Forest, and ElasticNet to predict key dam performance characteristics of the Loskop Dam in South Africa. The models were trained and validated on a dataset comprising geospatial elevation measurements paired with corresponding reservoir supply capacity values. The best-performing approach was a threshold-based blended model that combined random forest for higher volumes with Ridge regression for lower volumes. This model achieved an RMSE of 4.88 MCM and an R2 of 0.99. These findings highlight the ability of ensemble learning techniques to capture complex relationships in dam datasets and underscore their practical utility for reliable dam performance modelling in real-world water resource management scenarios.

new Learning Hamiltonian Density Using DeepONet

Authors: Baige Xu, Yusuke Tanaka, Takashi Matsubara, Takaharu Yaguchi

Abstract: In recent years, deep learning for modeling physical phenomena which can be described by partial differential equations (PDEs) have received significant attention. For example, for learning Hamiltonian mechanics, methods based on deep neural networks such as Hamiltonian Neural Networks (HNNs) and their variants have achieved progress. However, existing methods typically depend on the discretization of data, and the determination of required differential operators is often necessary. Instead, in this work, we propose an operator learning approach for modeling wave equations. In particular, we present a method to compute the variational derivatives that are needed to formulate the equations using the automatic differentiation algorithm. The experiments demonstrated that the proposed method is able to learn the operator that defines the Hamiltonian density of waves from data with unspecific discretization without determination of the differential operators.

new Learning Classifiers That Induce Markets

Authors: Yonatan Sommer, Ivri Hikri, Lotan Amit, Nir Rosenfeld

Abstract: When learning is used to inform decisions about humans, such as for loans, hiring, or admissions, this can incentivize users to strategically modify their features to obtain positive predictions. A key assumption is that modifications are costly, and are governed by a cost function that is exogenous and predetermined. We challenge this assumption, and assert that the deployment of a classifier is what creates costs. Our idea is simple: when users seek positive predictions, this creates demand for important features; and if features are available for purchase, then a market will form, and competition will give rise to prices. We extend the strategic classification framework to support this notion, and study learning in a setting where a classifier can induce a market for features. We present an analysis of the learning task, devise an algorithm for computing market prices, propose a differentiable learning framework, and conduct experiments to explore our novel setting and approach.

new Climate And Resource Awareness is Imperative to Achieving Sustainable AI (and Preventing a Global AI Arms Race)

Authors: Pedram Bakhtiarifard, P{\i}nar T\"oz\"un, Christian Igel, Raghavendra Selvan

Abstract: Sustainability encompasses three key facets: economic, environmental, and social. However, the nascent discourse that is emerging on sustainable artificial intelligence (AI) has predominantly focused on the environmental sustainability of AI, often neglecting the economic and social aspects. Achieving truly sustainable AI necessitates addressing the tension between its climate awareness and its social sustainability, which hinges on equitable access to AI development resources. The concept of resource awareness advocates for broader access to the infrastructure required to develop AI, fostering equity in AI innovation. Yet, this push for improving accessibility often overlooks the environmental costs of expanding such resource usage. In this position paper, we argue that reconciling climate and resource awareness is essential to realizing the full potential of sustainable AI. We use the framework of base-superstructure to analyze how the material conditions are influencing the current AI discourse. We also introduce the Climate and Resource Aware Machine Learning (CARAML) framework to address this conflict and propose actionable recommendations spanning individual, community, industry, government, and global levels to achieve sustainable AI.

new Offline Reinforcement Learning via Inverse Optimization

Authors: Ioannis Dimanidis, Tolga Ok, Peyman Mohajerin Esfahani

Abstract: Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called ``sub-optimality loss" from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert steering a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained by the proposed convex loss function, enjoys ample expressiveness and achieves competitive performance comparing with the state-of-the-art (SOTA) methods in the low-data regime of the MuJoCo benchmark while utilizing three orders of magnitude fewer parameters, thereby requiring significantly fewer computational resources. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments.

new Order-Robust Class Incremental Learning: Graph-Driven Dynamic Similarity Grouping

Authors: Guannan Lai, Yujie Li, Xiangkun Wang, Junbo Zhang, Tianrui Li, Xin Yang

Abstract: Class Incremental Learning (CIL) requires a model to continuously learn new classes without forgetting previously learned ones. While recent studies have significantly alleviated the problem of catastrophic forgetting (CF), more and more research reveals that the order in which classes appear have significant influences on CIL models. Specifically, prioritizing the learning of classes with lower similarity will enhance the model's generalization performance and its ability to mitigate forgetting. Hence, it is imperative to develop an order-robust class incremental learning model that maintains stable performance even when faced with varying levels of class similarity in different orders. In response, we first provide additional theoretical analysis, which reveals that when the similarity among a group of classes is lower, the model demonstrates increased robustness to the class order. Then, we introduce a novel \textbf{G}raph-\textbf{D}riven \textbf{D}ynamic \textbf{S}imilarity \textbf{G}rouping (\textbf{GDDSG}) method, which leverages a graph coloring algorithm for class-based similarity grouping. The proposed approach trains independent CIL models for each group of classes, ultimately combining these models to facilitate joint prediction. Experimental results demonstrate that our method effectively addresses the issue of class order sensitivity while achieving optimal performance in both model accuracy and anti-forgetting capability. Our code is available at https://github.com/AIGNLAI/GDDSG.

URLs: https://github.com/AIGNLAI/GDDSG.

new Recommendations from Sparse Comparison Data: Provably Fast Convergence for Nonconvex Matrix Factorization

Authors: Suryanarayana Sankagiri, Jalal Etesami, Matthias Grossglauser

Abstract: This paper provides a theoretical analysis of a new learning problem for recommender systems where users provide feedback by comparing pairs of items instead of rating them individually. We assume that comparisons stem from latent user and item features, which reduces the task of predicting preferences to learning these features from comparison data. Similar to the classical matrix factorization problem, the main challenge in this learning task is that the resulting loss function is nonconvex. Our analysis shows that the loss function exhibits (restricted) strong convexity near the true solution, which ensures gradient-based methods converge exponentially, given an appropriate warm start. Importantly, this result holds in a sparse data regime, where each user compares only a few pairs of items. Our main technical contribution is to extend certain concentration inequalities commonly used in matrix completion to our model. Our work demonstrates that learning personalized recommendations from comparison data is computationally and statistically efficient.

new A Generative Model Enhanced Multi-Agent Reinforcement Learning Method for Electric Vehicle Charging Navigation

Authors: Tianyang Qi, Shibo Chen, Jun Zhang

Abstract: With the widespread adoption of electric vehicles (EVs), navigating for EV drivers to select a cost-effective charging station has become an important yet challenging issue due to dynamic traffic conditions, fluctuating electricity prices, and potential competition from other EVs. The state-of-the-art deep reinforcement learning (DRL) algorithms for solving this task still require global information about all EVs at the execution stage, which not only increases communication costs but also raises privacy issues among EV drivers. To overcome these drawbacks, we introduce a novel generative model-enhanced multi-agent DRL algorithm that utilizes only the EV's local information while achieving performance comparable to these state-of-the-art algorithms. Specifically, the policy network is implemented on the EV side, and a Conditional Variational Autoencoder-Long Short Term Memory (CVAE-LSTM)-based recommendation model is developed to provide recommendation information. Furthermore, a novel future charging competition encoder is designed to effectively compress global information, enhancing training performance. The multi-gradient descent algorithm (MGDA) is also utilized to adaptively balance the weight between the two parts of the training objective, resulting in a more stable training process. Simulations are conducted based on a practical area in Xi\'an, China. Experimental results show that our proposed algorithm, which relies on local information, outperforms existing local information-based methods and achieves less than 8\% performance loss compared to global information-based methods.

new RIZE: Regularized Imitation Learning via Distributional Reinforcement Learning

Authors: Adib Karimi, Mohammad Mehdi Ebadzadeh

Abstract: We introduce a novel Inverse Reinforcement Learning (IRL) approach that overcomes limitations of fixed reward assignments and constrained flexibility in implicit reward regularization. By extending the Maximum Entropy IRL framework with a squared temporal-difference (TD) regularizer and adaptive targets, dynamically adjusted during training, our method indirectly optimizes a reward function while incorporating reinforcement learning principles. Furthermore, we integrate distributional RL to capture richer return information. Our approach achieves state-of-the-art performance on challenging MuJoCo tasks, demonstrating expert-level results on the Humanoid task with only 3 demonstrations. Extensive experiments and ablation studies validate the effectiveness of our method, providing insights into adaptive targets and reward dynamics in imitation learning.

new Sanity Checking Causal Representation Learning on a Simple Real-World System

Authors: Juan L. Gamella, Simon Bing, Jakob Runge

Abstract: We evaluate methods for causal representation learning (CRL) on a simple, real-world system where these methods are expected to work. The system consists of a controlled optical experiment specifically built for this purpose, which satisfies the core assumptions of CRL and where the underlying causal factors (the inputs to the experiment) are known, providing a ground truth. We select methods representative of different approaches to CRL and find that they all fail to recover the underlying causal factors. To understand the failure modes of the evaluated algorithms, we perform an ablation on the data by substituting the real data-generating process with a simpler synthetic equivalent. The results reveal a reproducibility problem, as most methods already fail on this synthetic ablation despite its simple data-generating process. Additionally, we observe that common assumptions on the mixing function are crucial for the performance of some of the methods but do not hold in the real data. Our efforts highlight the contrast between the theoretical promise of the state of the art and the challenges in its application. We hope the benchmark serves as a simple, real-world sanity check to further develop and validate methodology, bridging the gap towards CRL methods that work in practice. We make all code and datasets publicly available at github.com/simonbing/CRLSanityCheck

new Forward-Cooperation-Backward (FCB) learning in a Multi-Encoding Uni-Decoding neural network architecture

Authors: Prasun Dutta, Koustab Ghosh, Rajat K. De

Abstract: The most popular technique to train a neural network is backpropagation. Recently, the Forward-Forward technique has also been introduced for certain learning tasks. However, in real life, human learning does not follow any of these techniques exclusively. The way a human learns is basically a combination of forward learning, backward propagation and cooperation. Humans start learning a new concept by themselves and try to refine their understanding hierarchically during which they might come across several doubts. The most common approach to doubt solving is a discussion with peers, which can be called cooperation. Cooperation/discussion/knowledge sharing among peers is one of the most important steps of learning that humans follow. However, there might still be a few doubts even after the discussion. Then the difference between the understanding of the concept and the original literature is identified and minimized over several revisions. Inspired by this, the paper introduces Forward-Cooperation-Backward (FCB) learning in a deep neural network framework mimicking the human nature of learning a new concept. A novel deep neural network architecture, called Multi Encoding Uni Decoding neural network model, has been designed which learns using the notion of FCB. A special lateral synaptic connection has also been introduced to realize cooperation. The models have been justified in terms of their performance in dimension reduction on four popular datasets. The ability to preserve the granular properties of data in low-rank embedding has been tested to justify the quality of dimension reduction. For downstream analyses, classification has also been performed. An experimental study on convergence analysis has been performed to establish the efficacy of the FCB learning strategy.

new Identifiable Multi-View Causal Discovery Without Non-Gaussianity

Authors: Ambroise Heurtebise, Omar Chehab, Pierre Ablin, Alexandre Gramfort, Aapo Hyv\"arinen

Abstract: We propose a novel approach to linear causal discovery in the framework of multi-view Structural Equation Models (SEM). Our proposed model relaxes the well-known assumption of non-Gaussian disturbances by alternatively assuming diversity of variances over views, making it more broadly applicable. We prove the identifiability of all the parameters of the model without any further assumptions on the structure of the SEM other than it being acyclic. We further propose an estimation algorithm based on recent advances in multi-view Independent Component Analysis (ICA). The proposed methodology is validated through simulations and application on real neuroimaging data, where it enables the estimation of causal graphs between brain regions.

new Exploring Open-world Continual Learning with Knowns-Unknowns Knowledge Transfer

Authors: Yujie Li, Guannan Lai, Xin Yang, Yonghao Li, Marcello Bonsangue, Tianrui Li

Abstract: Open-World Continual Learning (OWCL) is a challenging paradigm where models must incrementally learn new knowledge without forgetting while operating under an open-world assumption. This requires handling incomplete training data and recognizing unknown samples during inference. However, existing OWCL methods often treat open detection and continual learning as separate tasks, limiting their ability to integrate open-set detection and incremental classification in OWCL. Moreover, current approaches primarily focus on transferring knowledge from known samples, neglecting the insights derived from unknown/open samples. To address these limitations, we formalize four distinct OWCL scenarios and conduct comprehensive empirical experiments to explore potential challenges in OWCL. Our findings reveal a significant interplay between the open detection of unknowns and incremental classification of knowns, challenging a widely held assumption that unknown detection and known classification are orthogonal processes. Building on our insights, we propose \textbf{HoliTrans} (Holistic Knowns-Unknowns Knowledge Transfer), a novel OWCL framework that integrates nonlinear random projection (NRP) to create a more linearly separable embedding space and distribution-aware prototypes (DAPs) to construct an adaptive knowledge space. Particularly, our HoliTrans effectively supports knowledge transfer for both known and unknown samples while dynamically updating representations of open samples during OWCL. Extensive experiments across various OWCL scenarios demonstrate that HoliTrans outperforms 22 competitive baselines, bridging the gap between OWCL theory and practice and providing a robust, scalable framework for advancing open-world learning paradigms.

new FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

Authors: Sotiris Anagnostidis, Gregor Bachmann, Yeongmin Kim, Jonas Kohler, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Albert Pumarola, Ali Thabet, Edgar Sch\"onfeld

Abstract: Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into \emph{flexible} ones -- dubbed FlexiDiT -- allowing them to process inputs at varying compute budgets. We demonstrate how a single \emph{flexible} model can generate images without any drop in quality, while reducing the required FLOPs by more than $40$\% compared to their static counterparts, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended for video generation, where FlexiDiT models generate samples with up to $75$\% less compute without compromising performance.

new Regional climate projections using a deep learning--based model-ranking and downscaling framework: Application to European climate zones

Authors: Parthiban Loganathan, Elias Zea, Ricardo Vinuesa, Evelyn Otero

Abstract: Accurate regional climate forecast calls for high-resolution downscaling of Global Climate Models (GCMs). This work presents a deep-learning-based multi-model evaluation and downscaling framework ranking 32 Coupled Model Intercomparison Project Phase 6 (CMIP6) models using a Deep Learning-TOPSIS (DL-TOPSIS) mechanism and so refines outputs using advanced deep-learning models. Using nine performance criteria, five K\"oppen-Geiger climate zones -- Tropical, Arid, Temperate, Continental, and Polar -- are investigated over four seasons. While TaiESM1 and CMCC-CM2-SR5 show notable biases, ranking results show that NorESM2-LM, GISS-E2-1-G, and HadGEM3-GC31-LL outperform other models. Four models contribute to downscaling the top-ranked GCMs to 0.1$^{\circ}$ resolution: Vision Transformer (ViT), Geospatial Spatiotemporal Transformer with Attention and Imbalance-Aware Network (GeoSTANet), CNN-LSTM, and CNN-Long Short-Term Memory (ConvLSTM). Effectively capturing temperature extremes (TXx, TNn), GeoSTANet achieves the highest accuracy (Root Mean Square Error (RMSE) = 1.57$^{\circ}$C, Kling-Gupta Efficiency (KGE) = 0.89, Nash-Sutcliffe Efficiency (NSE) = 0.85, Correlation ($r$) = 0.92), so reducing RMSE by 20\% over ConvLSTM. CNN-LSTM and ConvLSTM do well in Continental and Temperate zones; ViT finds fine-scale temperature fluctuations difficult. These results confirm that multi-criteria ranking improves GCM selection for regional climate studies and transformer-based downscaling exceeds conventional deep-learning methods. This framework offers a scalable method to enhance high-resolution climate projections, benefiting impact assessments and adaptation plans.

new LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping

Authors: J. Schmidinger, S. Vogel, V. Barkov, A. -D. Pham, R. Gebbers, H. Tavakoli, J. Correa, T. R. Tavares, P. Filippi, E. J. Jones, V. Lukas, E. Boenecke, J. Ruehlmann, I. Schroeter, E. Kramer, S. Paetzold, M. Kodaira, A. M. J. -C. Wadoux, L. Bragazza, K. Metzger, J. Huang, D. S. M. Valente, J. L. Safanelli, E. L. Bottega, R. S. D. Dalmolin, C. Farkas, A. Steiger, T. Z. Horst, L. Ramirez-Lopez, T. Scholten, F. Stumpf, P. Rosso, M. M. Costa, R. S. Zandonadi, J. Wetterlind, M. Atzmueller

Abstract: Digital soil mapping (DSM) relies on a broad pool of statistical methods, yet determining the optimal method for a given context remains challenging and contentious. Benchmarking studies on multiple datasets are needed to reveal strengths and limitations of commonly used methods. Existing DSM studies usually rely on a single dataset with restricted access, leading to incomplete and potentially misleading conclusions. To address these issues, we introduce an open-access dataset collection called Precision Liming Soil Datasets (LimeSoDa). LimeSoDa consists of 31 field- and farm-scale datasets from various countries. Each dataset has three target soil properties: (1) soil organic matter or soil organic carbon, (2) clay content and (3) pH, alongside a set of features. Features are dataset-specific and were obtained by optical spectroscopy, proximal- and remote soil sensing. All datasets were aligned to a tabular format and are ready-to-use for modeling. We demonstrated the use of LimeSoDa for benchmarking by comparing the predictive performance of four learning algorithms across all datasets. This comparison included multiple linear regression (MLR), support vector regression (SVR), categorical boosting (CatBoost) and random forest (RF). The results showed that although no single algorithm was universally superior, certain algorithms performed better in specific contexts. MLR and SVR performed better on high-dimensional spectral datasets, likely due to better compatibility with principal components. In contrast, CatBoost and RF exhibited considerably better performances when applied to datasets with a moderate number (< 20) of features. These benchmarking results illustrate that the performance of a method is highly context-dependent. LimeSoDa therefore provides an important resource for improving the development and evaluation of statistical methods in DSM.

new Your contrastive learning problem is secretly a distribution alignment problem

Authors: Zihao Chen, Chi-Heng Lin, Ran Liu, Jingyun Xiao, Eva L Dyer

Abstract: Despite the success of contrastive learning (CL) in vision and language, its theoretical foundations and mechanisms for building representations remain poorly understood. In this work, we build connections between noise contrastive estimation losses widely used in CL and distribution alignment with entropic optimal transport (OT). This connection allows us to develop a family of different losses and multistep iterative variants for existing CL methods. Intuitively, by using more information from the distribution of latents, our approach allows a more distribution-aware manipulation of the relationships within augmented sample sets. We provide theoretical insights and experimental evidence demonstrating the benefits of our approach for {\em generalized contrastive alignment}. Through this framework, it is possible to leverage tools in OT to build unbalanced losses to handle noisy views and customize the representation space by changing the constraints on alignment. By reframing contrastive learning as an alignment problem and leveraging existing optimization tools for OT, our work provides new insights and connections between different self-supervised learning models in addition to new tools that can be more easily adapted to incorporate domain knowledge into learning.

new Transfer Learning in Latent Contextual Bandits with Covariate Shift Through Causal Transportability

Authors: Mingwei Deng, Ville Kyrki, Dominik Baumann

Abstract: Transferring knowledge from one environment to another is an essential ability of intelligent systems. Nevertheless, when two environments are different, naively transferring all knowledge may deteriorate the performance, a phenomenon known as negative transfer. In this paper, we address this issue within the framework of multi-armed bandits from the perspective of causal inference. Specifically, we consider transfer learning in latent contextual bandits, where the actual context is hidden, but a potentially high-dimensional proxy is observable. We further consider a covariate shift in the context across environments. We show that naively transferring all knowledge for classical bandit algorithms in this setting led to negative transfer. We then leverage transportability theory from causal inference to develop algorithms that explicitly transfer effective knowledge for estimating the causal effects of interest in the target environment. Besides, we utilize variational autoencoders to approximate causal effects under the presence of a high-dimensional proxy. We test our algorithms on synthetic and semi-synthetic datasets, empirically demonstrating consistently improved learning efficiency across different proxies compared to baseline algorithms, showing the effectiveness of our causal framework in transferring knowledge.

new Gradient-Guided Annealing for Domain Generalization

Authors: Aristotelis Ballas, Christos Diou

Abstract: Domain Generalization (DG) research has gained considerable traction as of late, since the ability to generalize to unseen data distributions is a requirement that eludes even state-of-the-art training algorithms. In this paper we observe that the initial iterations of model training play a key role in domain generalization effectiveness, since the loss landscape may be significantly different across the training and test distributions, contrary to the case of i.i.d. data. Conflicts between gradients of the loss components of each domain lead the optimization procedure to undesirable local minima that do not capture the domain-invariant features of the target classes. We propose alleviating domain conflicts in model optimization, by iteratively annealing the parameters of a model in the early stages of training and searching for points where gradients align between domains. By discovering a set of parameter values where gradients are updated towards the same direction for each data distribution present in the training set, the proposed Gradient-Guided Annealing (GGA) algorithm encourages models to seek out minima that exhibit improved robustness against domain shifts. The efficacy of GGA is evaluated on five widely accepted and challenging image classification domain generalization benchmarks, where its use alone is able to establish highly competitive or even state-of-the-art performance. Moreover, when combined with previously proposed domain-generalization algorithms it is able to consistently improve their effectiveness by significant margins.

new Similarity-Distance-Magnitude Universal Verification

Authors: Allen Schmaltz

Abstract: We solve the neural network robustness problem by adding Similarity (i.e., correctly predicted depth-matches into training)-awareness and Distance-to-training-distribution-awareness to the existing output Magnitude (i.e., decision-boundary)-awareness of the softmax function. The resulting sdm activation function provides strong signals of the relative epistemic (reducible) predictive uncertainty. We use this novel behavior to further address the complementary HCI problem of mapping the output to human-interpretable summary statistics over relevant partitions of a held-out calibration set. Estimates of prediction-conditional uncertainty are obtained via a parsimonious learned transform over the class-conditional empirical CDFs of the output of a final-layer sdm activation function. For decision-making and as an intrinsic model check, estimates of class-conditional accuracy are obtained by further partitioning the high-probability regions of this calibrated output into class-conditional, region-specific CDFs. The uncertainty estimates from sdm calibration are remarkably robust to test-time distribution shifts and out-of-distribution inputs; incorporate awareness of the effective sample size; provide estimates of uncertainty from the learning and data splitting processes; and are well-suited for selective classification and conditional branching for additional test-time compute based on the predictive uncertainty, as for selective LLM generation, routing, and composition over multiple models and retrieval. Finally, we construct sdm networks, LLMs with uncertainty-aware verification and interpretability-by-exemplar as intrinsic properties. We provide open-source software implementing these results.

new Mixture of Experts-augmented Deep Unfolding for Activity Detection in IRS-aided Systems

Authors: Zeyi Ren, Qingfeng Lin, Jingreng Lei, Yang Li, Yik-Chung Wu

Abstract: In the realm of activity detection for massive machine-type communications, intelligent reflecting surfaces (IRS) have shown significant potential in enhancing coverage for devices lacking direct connections to the base station (BS). However, traditional activity detection methods are typically designed for a single type of channel model, which does not reflect the complexities of real-world scenarios, particularly in systems incorporating IRS. To address this challenge, this paper introduces a novel approach that combines model-driven deep unfolding with a mixture of experts (MoE) framework. By automatically selecting one of three expert designs and applying it to the unfolded projected gradient method, our approach eliminates the need for prior knowledge of channel types between devices and the BS. Simulation results demonstrate that the proposed MoE-augmented deep unfolding method surpasses the traditional covariance-based method and black-box neural network design, delivering superior detection performance under mixed channel fading conditions.

new Highly Parallelized Reinforcement Learning Training with Relaxed Assignment Dependencies

Authors: Zhouyu He, Peng Qiao, Rongchun Li, Yong Dou, Yusong Tan

Abstract: As the demands for superior agents grow, the training complexity of Deep Reinforcement Learning (DRL) becomes higher. Thus, accelerating training of DRL has become a major research focus. Dividing the DRL training process into subtasks and using parallel computation can effectively reduce training costs. However, current DRL training systems lack sufficient parallelization due to data assignment between subtask components. This assignment issue has been ignored, but addressing it can further boost training efficiency. Therefore, we propose a high-throughput distributed RL training system called TianJi. It relaxes assignment dependencies between subtask components and enables event-driven asynchronous communication. Meanwhile, TianJi maintains clear boundaries between subtask components. To address convergence uncertainty from relaxed assignment dependencies, TianJi proposes a distributed strategy based on the balance of sample production and consumption. The strategy controls the staleness of samples to correct their quality, ensuring convergence. We conducted extensive experiments. TianJi achieves a convergence time acceleration ratio of up to 4.37 compared to related comparison systems. When scaled to eight computational nodes, TianJi shows a convergence time speedup of 1.6 and a throughput speedup of 7.13 relative to XingTian, demonstrating its capability to accelerate training and scalability. In data transmission efficiency experiments, TianJi significantly outperforms other systems, approaching hardware limits. TianJi also shows effectiveness in on-policy algorithms, achieving convergence time acceleration ratios of 4.36 and 2.95 compared to RLlib and XingTian. TianJi is accessible at https://github.com/HiPRL/TianJi.git.

URLs: https://github.com/HiPRL/TianJi.git.

new Mixture of Experts for Recognizing Depression from Interview and Reading Tasks

Authors: Loukas Ilias, Dimitris Askounis

Abstract: Depression is a mental disorder and can cause a variety of symptoms, including psychological, physical, and social. Speech has been proved an objective marker for the early recognition of depression. For this reason, many studies have been developed aiming to recognize depression through speech. However, existing methods rely on the usage of only the spontaneous speech neglecting information obtained via read speech, use transcripts which are often difficult to obtain (manual) or come with high word-error rates (automatic), and do not focus on input-conditional computation methods. To resolve these limitations, this is the first study in depression recognition task obtaining representations of both spontaneous and read speech, utilizing multimodal fusion methods, and employing Mixture of Experts (MoE) models in a single deep neural network. Specifically, we use audio files corresponding to both interview and reading tasks and convert each audio file into log-Mel spectrogram, delta, and delta-delta. Next, the image representations of the two tasks pass through shared AlexNet models. The outputs of the AlexNet models are given as input to a multimodal fusion method. The resulting vector is passed through a MoE module. In this study, we employ three variants of MoE, namely sparsely-gated MoE and multilinear MoE based on factorization. Findings suggest that our proposed approach yields an Accuracy and F1-score of 87.00% and 86.66% respectively on the Androids corpus.

new Teasing Apart Architecture and Initial Weights as Sources of Inductive Bias in Neural Networks

Authors: Gianluca Bencomo, Max Gupta, Ioana Marinescu, R. Thomas McCoy, Thomas L. Griffiths

Abstract: Artificial neural networks can acquire many aspects of human knowledge from data, making them promising as models of human learning. But what those networks can learn depends upon their inductive biases -- the factors other than the data that influence the solutions they discover -- and the inductive biases of neural networks remain poorly understood, limiting our ability to draw conclusions about human learning from the performance of these systems. Cognitive scientists and machine learning researchers often focus on the architecture of a neural network as a source of inductive bias. In this paper we explore the impact of another source of inductive bias -- the initial weights of the network -- using meta-learning as a tool for finding initial weights that are adapted for specific problems. We evaluate four widely-used architectures -- MLPs, CNNs, LSTMs, and Transformers -- by meta-training 430 different models across three tasks requiring different biases and forms of generalization. We find that meta-learning can substantially reduce or entirely eliminate performance differences across architectures and data representations, suggesting that these factors may be less important as sources of inductive bias than is typically assumed. When differences are present, architectures and data representations that perform well without meta-learning tend to meta-train more effectively. Moreover, all architectures generalize poorly on problems that are far from their meta-training experience, underscoring the need for stronger inductive biases for robust generalization.

new The Impact of Transparency in AI Systems on Users' Data-Sharing Intentions: A Scenario-Based Experiment

Authors: Julian Rosenberger, Sophie Kuhlemann, Verena Tiefenbeck, Mathias Kraus, Patrick Zschech

Abstract: Artificial Intelligence (AI) systems are frequently employed in online services to provide personalized experiences to users based on large collections of data. However, AI systems can be designed in different ways, with black-box AI systems appearing as complex data-processing engines and white-box AI systems appearing as fully transparent data-processors. As such, it is reasonable to assume that these different design choices also affect user perception and thus their willingness to share data. To this end, we conducted a pre-registered, scenario-based online experiment with 240 participants and investigated how transparent and non-transparent data-processing entities influenced data-sharing intentions. Surprisingly, our results revealed no significant difference in willingness to share data across entities, challenging the notion that transparency increases data-sharing willingness. Furthermore, we found that a general attitude of trust towards AI has a significant positive influence, especially in the transparent AI condition, whereas privacy concerns did not significantly affect data-sharing decisions.

new Understanding the Limits of Deep Tabular Methods with Temporal Shift

Authors: Hao-Run Cai, Han-Jia Ye

Abstract: Deep tabular models have demonstrated remarkable success on i.i.d. data, excelling in a variety of structured data tasks. However, their performance often deteriorates under temporal distribution shifts, where trends and periodic patterns are present in the evolving data distribution over time. In this paper, we explore the underlying reasons for this failure in capturing temporal dependencies. We begin by investigating the training protocol, revealing a key issue in how model selection perform. While existing approaches use temporal ordering for splitting validation set, we show that even a random split can significantly improve model performance. By minimizing the time lag between training data and test time, while reducing the bias in validation, our proposed training protocol significantly improves generalization across various methods. Furthermore, we analyze how temporal data affects deep tabular representations, uncovering that these models often fail to capture crucial periodic and trend information. To address this gap, we introduce a plug-and-play temporal embedding method based on Fourier series expansion to learn and incorporate temporal patterns, offering an adaptive approach to handle temporal shifts. Our experiments demonstrate that this temporal embedding, combined with the improved training protocol, provides a more effective and robust framework for learning from temporal tabular data.

new On the Importance of Reward Design in Reinforcement Learning-based Dynamic Algorithm Configuration: A Case Study on OneMax with (1+($\lambda$,$\lambda$))-GA

Authors: Tai Nguyen, Phong Le, Andr\'e Biendenkapp, Carola Doerr, Nguyen Dang

Abstract: Dynamic Algorithm Configuration (DAC) has garnered significant attention in recent years, particularly in the prevalence of machine learning and deep learning algorithms. Numerous studies have leveraged the robustness of decision-making in Reinforcement Learning (RL) to address the optimization challenges associated with algorithm configuration. However, making an RL agent work properly is a non-trivial task, especially in reward design, which necessitates a substantial amount of handcrafted knowledge based on domain expertise. In this work, we study the importance of reward design in the context of DAC via a case study on controlling the population size of the $(1+(\lambda,\lambda))$-GA optimizing OneMax. We observed that a poorly designed reward can hinder the RL agent's ability to learn an optimal policy because of a lack of exploration, leading to both scalability and learning divergence issues. To address those challenges, we propose the application of a reward shaping mechanism to facilitate enhanced exploration of the environment by the RL agent. Our work not only demonstrates the ability of RL in dynamically configuring the $(1+(\lambda,\lambda))$-GA, but also confirms the advantages of reward shaping in the scalability of RL agents across various sizes of OneMax problems.

new Large Language Models as Attribution Regularizers for Efficient Model Training

Authors: Davor Vukadin, Marin \v{S}ili\'c, Goran Dela\v{c}

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across diverse domains. However, effectively leveraging their vast knowledge for training smaller downstream models remains an open challenge, especially in domains like tabular data learning, where simpler models are often preferred due to interpretability and efficiency. In this paper, we introduce a novel yet straightforward method for incorporating LLM-generated global task feature attributions into the training process of smaller networks. Specifically, we propose an attribution-matching regularization term that aligns the training dynamics of the smaller model with the insights provided by the LLM. By doing so, our approach yields superior performance in few-shot learning scenarios. Notably, our method requires only black-box API access to the LLM, making it easy to integrate into existing training pipelines with minimal computational overhead. Furthermore, we demonstrate how this method can be used to address common issues in real-world datasets, such as skewness and bias. By integrating high-level knowledge from LLMs, our approach improves generalization, even when training data is limited or imbalanced. We validate its effectiveness through extensive experiments across multiple tasks, demonstrating improved learning efficiency and model robustness.

new Online Meta-learning for AutoML in Real-time (OnMAR)

Authors: Mia Gerber, Anna Sergeevna Bosman, Johan Pieter de Villiers

Abstract: Automated machine learning (AutoML) is a research area focusing on using optimisation techniques to design machine learning (ML) algorithms, alleviating the need for a human to perform manual algorithm design. Real-time AutoML enables the design process to happen while the ML algorithm is being applied to a task. Real-time AutoML is an emerging research area, as such existing real-time AutoML techniques need improvement with respect to the quality of designs and time taken to create designs. To address these issues, this study proposes an Online Meta-learning for AutoML in Real-time (OnMAR) approach. Meta-learning gathers information about the optimisation process undertaken by the ML algorithm in the form of meta-features. Meta-features are used in conjunction with a meta-learner to optimise the optimisation process. The OnMAR approach uses a meta-learner to predict the accuracy of an ML design. If the accuracy predicted by the meta-learner is sufficient, the design is used, and if the predicted accuracy is low, an optimisation technique creates a new design. A genetic algorithm (GA) is the optimisation technique used as part of the OnMAR approach. Different meta-learners (k-nearest neighbours, random forest and XGBoost) are tested. The OnMAR approach is model-agnostic (i.e. not specific to a single real-time AutoML application) and therefore evaluated on three different real-time AutoML applications, namely: composing an image clustering algorithm, configuring the hyper-parameters of a convolutional neural network, and configuring a video classification pipeline. The OnMAR approach is effective, matching or outperforming existing real-time AutoML approaches, with the added benefit of a faster runtime.

new Conformal Tail Risk Control for Large Language Model Alignment

Authors: Catherine Yu-Chi Chen, Jingyan Shen, Zhun Deng, Lihua Lei

Abstract: Recent developments in large language models (LLMs) have led to their widespread usage for various tasks. The prevalence of LLMs in society implores the assurance on the reliability of their performance. In particular, risk-sensitive applications demand meticulous attention to unexpectedly poor outcomes, i.e., tail events, for instance, toxic answers, humiliating language, and offensive outputs. Due to the costly nature of acquiring human annotations, general-purpose scoring models have been created to automate the process of quantifying these tail events. This phenomenon introduces potential human-machine misalignment between the respective scoring mechanisms. In this work, we present a lightweight calibration framework for blackbox models that ensures the alignment of humans and machines with provable guarantees. Our framework provides a rigorous approach to controlling any distortion risk measure that is characterized by a weighted average of quantiles of the loss incurred by the LLM with high confidence. The theoretical foundation of our method relies on the connection between conformal risk control and a traditional family of statistics, i.e., L-statistics. To demonstrate the utility of our framework, we conduct comprehensive experiments that address the issue of human-machine misalignment.

new Scalable Graph Attention-based Instance Selection via Mini-Batch Sampling and Hierarchical Hashing

Authors: Zahiriddin Rustamov, Ayham Zaitouny, Nazar Zaki

Abstract: Instance selection (IS) is important in machine learning for reducing dataset size while keeping key characteristics. Current IS methods often struggle with capturing complex relationships in high-dimensional spaces and scale with large datasets. This paper introduces a graph attention-based instance selection (GAIS) method that uses attention mechanisms to identify informative instances through their structural relationships in graph representations. We present two approaches for scalable graph construction: a distance-based mini-batch sampling technique that reduces computation through strategic batch processing, and a hierarchical hashing approach that allows for efficient similarity computation through random projections. The mini-batch approach keeps class distributions through stratified sampling, while the hierarchical hashing method captures relationships at multiple granularities through single-level, multi-level, and multi-view variants. Experiments across 39 datasets show that GAIS achieves reduction rates above 96\% while maintaining or improving model performance relative to state-of-the-art IS methods. The findings shows that the distance-based mini-batch approach offers an optimal balance of efficiency and effectiveness for large-scale datasets, while multi-view variants provide superior performance for complex, high-dimensional data, demonstrating that attention-based importance scoring can effectively identify instances crucial for maintaining decision boundaries without requiring exhaustive pairwise comparisons.

new Judge a Book by its Cover: Investigating Multi-Modal LLMs for Multi-Page Handwritten Document Transcription

Authors: Benjamin Gutteridge, Matthew Thomas Jackson, Toni Kukurin, Xiaowen Dong

Abstract: Handwritten text recognition (HTR) remains a challenging task, particularly for multi-page documents where pages share common formatting and contextual features. While modern optical character recognition (OCR) engines are proficient with printed text, their performance on handwriting is limited, often requiring costly labeled data for fine-tuning. In this paper, we explore the use of multi-modal large language models (MLLMs) for transcribing multi-page handwritten documents in a zero-shot setting. We investigate various configurations of commercial OCR engines and MLLMs, utilizing the latter both as end-to-end transcribers and as post-processors, with and without image components. We propose a novel method, '+first page', which enhances MLLM transcription by providing the OCR output of the entire document along with just the first page image. This approach leverages shared document features without incurring the high cost of processing all images. Experiments on a multi-page version of the IAM Handwriting Database demonstrate that '+first page' improves transcription accuracy, balances cost with performance, and even enhances results on out-of-sample text by extrapolating formatting and OCR error patterns from a single page.

new An exploration of features to improve the generalisability of fake news detection models

Authors: Nathaniel Hoy, Theodora Koulouri

Abstract: Fake news poses global risks by influencing elections and spreading misinformation, making detection critical. Existing NLP and supervised Machine Learning methods perform well under cross-validation but struggle to generalise across datasets, even within the same domain. This issue stems from coarsely labelled training data, where articles are labelled based on their publisher, introducing biases that token-based models like TF-IDF and BERT are sensitive to. While Large Language Models (LLMs) offer promise, their application in fake news detection remains limited. This study demonstrates that meaningful features can still be extracted from coarsely labelled data to improve real-world robustness. Stylistic features-lexical, syntactic, and semantic-are explored due to their reduced sensitivity to dataset biases. Additionally, novel social-monetisation features are introduced, capturing economic incentives behind fake news, such as advertisements, external links, and social media elements. The study trains on the coarsely labelled NELA 2020-21 dataset and evaluates using the manually labelled Facebook URLs dataset, a gold standard for generalisability. Results highlight the limitations of token-based models trained on biased data and contribute to the scarce evidence on LLMs like LLaMa in this field. Findings indicate that stylistic and social-monetisation features offer more generalisable predictions than token-based methods and LLMs. Statistical and permutation feature importance analyses further reveal their potential to enhance performance and mitigate dataset biases, providing a path forward for improving fake news detection.

new Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications

Authors: Marcus Yu Zhe Wee, Justin Juin Hng Wong, Lynus Lim, Joe Yu Wei Tan, Prannaya Gupta, Dillion Lim, En Hao Tew, Aloysius Keng Siew Han, Yong Zhi Lim

Abstract: Effective communication in Air Traffic Control (ATC) is critical to maintaining aviation safety, yet the challenges posed by accented English remain largely unaddressed in Automatic Speech Recognition (ASR) systems. Existing models struggle with transcription accuracy for Southeast Asian-accented (SEA-accented) speech, particularly in noisy ATC environments. This study presents the development of ASR models fine-tuned specifically for Southeast Asian accents using a newly created dataset. Our research achieves significant improvements, achieving a Word Error Rate (WER) of 0.0982 or 9.82% on SEA-accented ATC speech. Additionally, the paper highlights the importance of region-specific datasets and accent-focused training, offering a pathway for deploying ASR systems in resource-constrained military operations. The findings emphasize the need for noise-robust training techniques and region-specific datasets to improve transcription accuracy for non-Western accents in ATC communications.

new Adversarial Robustness in Parameter-Space Classifiers

Authors: Tamir Shor, Ethan Fetaya, Chaim Baskin, Alex Bronstein

Abstract: Implicit Neural Representations (INRs) have been recently garnering increasing interest in various research fields, mainly due to their ability to represent large, complex data in a compact and continuous manner. Past work further showed that numerous popular downstream tasks can be performed directly in the INR parameter-space. Doing so can substantially reduce the computational resources required to process the represented data in their native domain. A major difficulty in using modern machine-learning approaches, is their high susceptibility to adversarial attacks, which have been shown to greatly limit the reliability and applicability of such methods in a wide range of settings. In this work, we show that parameter-space models trained for classification are inherently robust to adversarial attacks -- without the need of any robust training. To support our claims, we develop a novel suite of adversarial attacks targeting parameter-space classifiers, and furthermore analyze practical considerations of attacking parameter-space classifiers. Code for reproducing all experiments and implementation of all proposed methods will be released upon publication.

new Mixture of Structural-and-Textual Retrieval over Text-rich Graph Knowledge Bases

Authors: Yongjia Lei, Haoyu Han, Ryan A. Rossi, Franck Dernoncourt, Nedim Lipka, Mahantesh M Halappanavar, Jiliang Tang, Yu Wang

Abstract: Text-rich Graph Knowledge Bases (TG-KBs) have become increasingly crucial for answering queries by providing textual and structural knowledge. However, current retrieval methods often retrieve these two types of knowledge in isolation without considering their mutual reinforcement and some hybrid methods even bypass structural retrieval entirely after neighboring aggregation. To fill in this gap, we propose a Mixture of Structural-and-Textual Retrieval (MoR) to retrieve these two types of knowledge via a Planning-Reasoning-Organizing framework. In the Planning stage, MoR generates textual planning graphs delineating the logic for answering queries. Following planning graphs, in the Reasoning stage, MoR interweaves structural traversal and textual matching to obtain candidates from TG-KBs. In the Organizing stage, MoR further reranks fetched candidates based on their structural trajectory. Extensive experiments demonstrate the superiority of MoR in harmonizing structural and textual retrieval with insights, including uneven retrieving performance across different query logics and the benefits of integrating structural trajectories for candidate reranking. Our code is available at https://github.com/Yoega/MoR.

URLs: https://github.com/Yoega/MoR.

new Safety Representations for Safer Policy Learning

Authors: Kaustubh Mani, Vincent Mai, Charlie Gauthier, Annie Chen, Samer Nashed, Liam Paull

Abstract: Reinforcement learning algorithms typically necessitate extensive exploration of the state space to find optimal policies. However, in safety-critical applications, the risks associated with such exploration can lead to catastrophic consequences. Existing safe exploration methods attempt to mitigate this by imposing constraints, which often result in overly conservative behaviours and inefficient learning. Heavy penalties for early constraint violations can trap agents in local optima, deterring exploration of risky yet high-reward regions of the state space. To address this, we introduce a method that explicitly learns state-conditioned safety representations. By augmenting the state features with these safety representations, our approach naturally encourages safer exploration without being excessively cautious, resulting in more efficient and safer policy learning in safety-critical scenarios. Empirical evaluations across diverse environments show that our method significantly improves task performance while reducing constraint violations during training, underscoring its effectiveness in balancing exploration with safety.

new Constrained Generative Modeling with Manually Bridged Diffusion Models

Authors: Saeid Naderiparizi, Xiaoxuan Liang, Berend Zwartsenberg, Frank Wood

Abstract: In this paper we describe a novel framework for diffusion-based generative modeling on constrained spaces. In particular, we introduce manual bridges, a framework that expands the kinds of constraints that can be practically used to form so-called diffusion bridges. We develop a mechanism for combining multiple such constraints so that the resulting multiply-constrained model remains a manual bridge that respects all constraints. We also develop a mechanism for training a diffusion model that respects such multiple constraints while also adapting it to match a data distribution. We develop and extend theory demonstrating the mathematical validity of our mechanisms. Additionally, we demonstrate our mechanism in constrained generative modeling tasks, highlighting a particular high-value application in modeling trajectory initializations for path planning and control in autonomous vehicles.

new When does a predictor know its own loss?

Authors: Aravind Gollakota, Parikshit Gopalan, Aayush Karan, Charlotte Peale, Udi Wieder

Abstract: Given a predictor and a loss function, how well can we predict the loss that the predictor will incur on an input? This is the problem of loss prediction, a key computational task associated with uncertainty estimation for a predictor. In a classification setting, a predictor will typically predict a distribution over labels and hence have its own estimate of the loss that it will incur, given by the entropy of the predicted distribution. Should we trust this estimate? In other words, when does the predictor know what it knows and what it does not know? In this work we study the theoretical foundations of loss prediction. Our main contribution is to establish tight connections between nontrivial loss prediction and certain forms of multicalibration, a multigroup fairness notion that asks for calibrated predictions across computationally identifiable subgroups. Formally, we show that a loss predictor that is able to improve on the self-estimate of a predictor yields a witness to a failure of multicalibration, and vice versa. This has the implication that nontrivial loss prediction is in effect no easier or harder than auditing for multicalibration. We support our theoretical results with experiments that show a robust positive correlation between the multicalibration error of a predictor and the efficacy of training a loss predictor.

new PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation

Authors: Albert Gong, Kamil\.e Stankevi\v{c}i\=ut\.e, Chao Wan, Anmol Kabra, Raphael Thesmar, Johann Lee, Julius Klenke, Carla P. Gomes, Kilian Q. Weinberger

Abstract: High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities. Our code is available at https://github.com/kilian-group/phantom-wiki.

URLs: https://github.com/kilian-group/phantom-wiki.

new Multi-Turn Code Generation Through Single-Step Rewards

Authors: Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury

Abstract: We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.

URLs: https://github.com/portal-cornell/muCode.

new Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis

Authors: Jeffrey Yang Fan Chiang, Seungjae Lee, Jia-Bin Huang, Furong Huang, Yizheng Chen

Abstract: Recent advancements in Web AI agents have demonstrated remarkable capabilities in addressing complex web navigation tasks. However, emerging research shows that these agents exhibit greater vulnerability compared to standalone Large Language Models (LLMs), despite both being built upon the same safety-aligned models. This discrepancy is particularly concerning given the greater flexibility of Web AI Agent compared to standalone LLMs, which may expose them to a wider range of adversarial user inputs. To build a scaffold that addresses these concerns, this study investigates the underlying factors that contribute to the increased vulnerability of Web AI agents. Notably, this disparity stems from the multifaceted differences between Web AI agents and standalone LLMs, as well as the complex signals - nuances that simple evaluation metrics, such as success rate, often fail to capture. To tackle these challenges, we propose a component-level analysis and a more granular, systematic evaluation framework. Through this fine-grained investigation, we identify three critical factors that amplify the vulnerability of Web AI agents; (1) embedding user goals into the system prompt, (2) multi-step action generation, and (3) observational capabilities. Our findings highlights the pressing need to enhance security and robustness in AI agent design and provide actionable insights for targeted defense strategies.

new Walking the Web of Concept-Class Relationships in Incrementally Trained Interpretable Models

Authors: Susmit Agrawal, Deepika Vemuri, Sri Siddarth Chakaravarthy P, Vineeth N. Balasubramanian

Abstract: Concept-based methods have emerged as a promising direction to develop interpretable neural networks in standard supervised settings. However, most works that study them in incremental settings assume either a static concept set across all experiences or assume that each experience relies on a distinct set of concepts. In this work, we study concept-based models in a more realistic, dynamic setting where new classes may rely on older concepts in addition to introducing new concepts themselves. We show that concepts and classes form a complex web of relationships, which is susceptible to degradation and needs to be preserved and augmented across experiences. We introduce new metrics to show that existing concept-based models cannot preserve these relationships even when trained using methods to prevent catastrophic forgetting, since they cannot handle forgetting at concept, class, and concept-class relationship levels simultaneously. To address these issues, we propose a novel method - MuCIL - that uses multimodal concepts to perform classification without increasing the number of trainable parameters across experiences. The multimodal concepts are aligned to concepts provided in natural language, making them interpretable by design. Through extensive experimentation, we show that our approach obtains state-of-the-art classification performance compared to other concept-based models, achieving over 2$\times$ the classification performance in some cases. We also study the ability of our model to perform interventions on concepts, and show that it can localize visual concepts in input images, providing post-hoc interpretations.

new R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou

Abstract: In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)' powerful reasoning capabilities, deterring LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method "Re-Routing in Test-Time(R2-T2) that locally optimizes the vector of routing weights in test-time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.

cross scMamba: A Pre-Trained Model for Single-Nucleus RNA Sequencing Analysis in Neurodegenerative Disorders

Authors: Gyutaek Oh, Baekgyu Choi, Seyoung Jin, Inkyung Jung, Jong Chul Ye

Abstract: Single-nucleus RNA sequencing (snRNA-seq) has significantly advanced our understanding of the disease etiology of neurodegenerative disorders. However, the low quality of specimens derived from postmortem brain tissues, combined with the high variability caused by disease heterogeneity, makes it challenging to integrate snRNA-seq data from multiple sources for precise analyses. To address these challenges, we present scMamba, a pre-trained model designed to improve the quality and utility of snRNA-seq analysis, with a particular focus on neurodegenerative diseases. Inspired by the recent Mamba model, scMamba introduces a novel architecture that incorporates a linear adapter layer, gene embeddings, and bidirectional Mamba blocks, enabling efficient processing of snRNA-seq data while preserving information from the raw input. Notably, scMamba learns generalizable features of cells and genes through pre-training on snRNA-seq data, without relying on dimension reduction or selection of highly variable genes. We demonstrate that scMamba outperforms benchmark methods in various downstream tasks, including cell type annotation, doublet detection, imputation, and the identification of differentially expressed genes.

cross Evolutionary Algorithms Approach For Search Based On Semantic Document Similarity

Authors: Chandrashekar Muniyappa, Eujin Kim

Abstract: Advancements in cloud computing and distributed computing have fostered research activities in Computer science. As a result, researchers have made significant progress in Neural Networks, Evolutionary Computing Algorithms like Genetic, and Differential evolution algorithms. These algorithms are used to develop clustering, recommendation, and question-and-answering systems using various text representation and similarity measurement techniques. In this research paper, Universal Sentence Encoder (USE) is used to capture the semantic similarity of text; And the transfer learning technique is used to apply Genetic Algorithm (GA) and Differential Evolution (DE) algorithms to search and retrieve relevant top N documents based on user query. The proposed approach is applied to the Stanford Question and Answer (SQuAD) Dataset to identify a user query. Finally, through experiments, we prove that text documents can be efficiently represented as sentence embedding vectors using USE to capture the semantic similarity, and by comparing the results of the Manhattan Distance, GA, and DE algorithms we prove that the evolutionary algorithms are good at finding the top N results than the traditional ranking approach.

cross Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting

Authors: Yu Liu, Baoxiong Jia, Ruijie Lu, Junfeng Ni, Song-Chun Zhu, Siyuan Huang

Abstract: Building articulated objects is a key challenge in computer vision. Existing methods often fail to effectively integrate information across different object states, limiting the accuracy of part-mesh reconstruction and part dynamics modeling, particularly for complex multi-part articulated objects. We introduce ArtGS, a novel approach that leverages 3D Gaussians as a flexible and efficient representation to address these issues. Our method incorporates canonical Gaussians with coarse-to-fine initialization and updates for aligning articulated part information across different object states, and employs a skinning-inspired part dynamics modeling module to improve both part-mesh reconstruction and articulation learning. Extensive experiments on both synthetic and real-world datasets, including a new benchmark for complex multi-part objects, demonstrate that ArtGS achieves state-of-the-art performance in joint parameter estimation and part mesh reconstruction. Our approach significantly improves reconstruction quality and efficiency, especially for multi-part articulated objects. Additionally, we provide comprehensive analyses of our design choices, validating the effectiveness of each component to highlight potential areas for future improvement.

cross Practical Evaluation of Copula-based Survival Metrics: Beyond the Independent Censoring Assumption

Authors: Christian Marius Lillelund, Shi-ang Qi, Russell Greiner

Abstract: Conventional survival metrics, such as Harrell's concordance index and the Brier Score, rely on the independent censoring assumption for valid inference in the presence of right-censored data. However, when instances are censored for reasons related to the event of interest, this assumption no longer holds, as this kind of dependent censoring biases the marginal survival estimates of popular nonparametric estimators. In this paper, we propose three copula-based metrics to evaluate survival models in the presence of dependent censoring, and design a framework to create realistic, semi-synthetic datasets with dependent censoring to facilitate the evaluation of the metrics. Our empirical analyses in synthetic and semi-synthetic datasets show that our metrics can give error estimates that are closer to the true error, mainly in terms of predictive accuracy.

cross Conversational Planning for Personal Plans

Authors: Konstantina Christakopoulou, Iris Qu, John Canny, Andrew Goodridge, Cj Adams, Minmin Chen, Maja Matari\'c

Abstract: The language generation and reasoning capabilities of large language models (LLMs) have enabled conversational systems with impressive performance in a variety of tasks, from code generation, to composing essays, to passing STEM and legal exams, to a new paradigm for knowledge search. Besides those short-term use applications, LLMs are increasingly used to help with real-life goals or tasks that take a long time to complete, involving multiple sessions across days, weeks, months, or even years. Thus to enable conversational systems for long term interactions and tasks, we need language-based agents that can plan for long horizons. Traditionally, such capabilities were addressed by reinforcement learning agents with hierarchical planning capabilities. In this work, we explore a novel architecture where the LLM acts as the meta-controller deciding the agent's next macro-action, and tool use augmented LLM-based option policies execute the selected macro-action. We instantiate this framework for a specific set of macro-actions enabling adaptive planning for users' personal plans through conversation and follow-up questions collecting user feedback. We show how this paradigm can be applicable in scenarios ranging from tutoring for academic and non-academic tasks to conversational coaching for personal health plans.

cross GONet: A Generalizable Deep Learning Model for Glaucoma Detection

Authors: Or Abramovich, Hadas Pizem, Jonathan Fhima, Eran Berkowitz, Ben Gofrit, Meishar Meisel, Meital Baskin, Jan Van Eijgen, Ingeborg Stalmans, Eytan Z. Blumenthal, Joachim A. Behar

Abstract: Glaucomatous optic neuropathy (GON) is a prevalent ocular disease that can lead to irreversible vision loss if not detected early and treated. The traditional diagnostic approach for GON involves a set of ophthalmic examinations, which are time-consuming and require a visit to an ophthalmologist. Recent deep learning models for automating GON detection from digital fundus images (DFI) have shown promise but often suffer from limited generalizability across different ethnicities, disease groups and examination settings. To address these limitations, we introduce GONet, a robust deep learning model developed using seven independent datasets, including over 119,000 DFIs with gold-standard annotations and from patients of diverse geographic backgrounds. GONet consists of a DINOv2 pre-trained self-supervised vision transformers fine-tuned using a multisource domain strategy. GONet demonstrated high out-of-distribution generalizability, with an AUC of 0.85-0.99 in target domains. GONet performance was similar or superior to state-of-the-art works and was significantly superior to the cup-to-disc ratio, by up to 21.6%. GONet is available at [URL provided on publication]. We also contribute a new dataset consisting of 768 DFI with GON labels as open access.

cross No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data

Authors: Joshua Kazdan, Lisa Yu, Rylan Schaeffer, Chris Cundy, Sanmi Koyejo, Dvijotham Krishnamurthy

Abstract: Leading language model (LM) providers like OpenAI and Google offer fine-tuning APIs that allow customers to adapt LMs for specific use cases. To prevent misuse, these LM providers implement filtering mechanisms to block harmful fine-tuning data. Consequently, adversaries seeking to produce unsafe LMs via these APIs must craft adversarial training data that are not identifiably harmful. We make three contributions in this context: 1. We show that many existing attacks that use harmless data to create unsafe LMs rely on eliminating model refusals in the first few tokens of their responses. 2. We show that such prior attacks can be blocked by a simple defense that pre-fills the first few tokens from an aligned model before letting the fine-tuned model fill in the rest. 3. We describe a new data-poisoning attack, ``No, Of course I Can Execute'' (NOICE), which exploits an LM's formulaic refusal mechanism to elicit harmful responses. By training an LM to refuse benign requests on the basis of safety before fulfilling those requests regardless, we are able to jailbreak several open-source models and a closed-source model (GPT-4o). We show an attack success rate (ASR) of 57% against GPT-4o; our attack earned a Bug Bounty from OpenAI. Against open-source models protected by simple defenses, we improve ASRs by an average of 3.25 times compared to the best performing previous attacks that use only harmless data. NOICE demonstrates the exploitability of repetitive refusal mechanisms and broadens understanding of the threats closed-source models face from harmless data.

cross Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents

Authors: Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang

Abstract: The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination-generating false information-and the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented question-answering (QA) pipeline and explore how to balance human input and automation. Using a dataset of questions about a Samsung Smart TV user manual, we demonstrate that synthetic data generated by LLMs outperforms crowdsourced data in reducing hallucination in finetuned models. We also compare self-training (fine-tuning models on their own outputs) and knowledge distillation (fine-tuning on stronger models' outputs, e.g., GPT-4o), and find that self-training achieves comparable hallucination reduction. We conjecture that this surprising finding can be attributed to increased exposure bias issues in the knowledge distillation case and support this conjecture with post hoc analysis. We also improve robustness to unanswerable questions and retrieval failures with contextualized "I don't know" responses. These findings show that scalable, cost-efficient QA systems can be built using synthetic data and self-training with open-source models, reducing reliance on proprietary tools or costly human annotations.

cross Advancing calibration for stochastic agent-based models in epidemiology with Stein variational inference and Gaussian process surrogates

Authors: Connor Robertson, Cosmin Safta, Nicholson Collier, Jonathan Ozik, Jaideep Ray

Abstract: Accurate calibration of stochastic agent-based models (ABMs) in epidemiology is crucial to make them useful in public health policy decisions and interventions. Traditional calibration methods, e.g., Markov Chain Monte Carlo (MCMC), that yield a probability density function for the parameters being calibrated, are often computationally expensive. When applied to ABMs which are highly parametrized, the calibration process becomes computationally infeasible. This paper investigates the utility of Stein Variational Inference (SVI) as an alternative calibration technique for stochastic epidemiological ABMs approximated by Gaussian process (GP) surrogates. SVI leverages gradient information to iteratively update a set of particles in the space of parameters being calibrated, offering potential advantages in scalability and efficiency for high-dimensional ABMs. The ensemble of particles yields a joint probability density function for the parameters and serves as the calibration. We compare the performance of SVI and MCMC in calibrating CityCOVID, a stochastic epidemiological ABM, focusing on predictive accuracy and calibration effectiveness. Our results demonstrate that SVI maintains predictive accuracy and calibration effectiveness comparable to MCMC, making it a viable alternative for complex epidemiological models. We also present the practical challenges of using a gradient-based calibration such as SVI which include careful tuning of hyperparameters and monitoring of the particle dynamics.

cross Diffusion-based Planning with Learned Viability Filters

Authors: Nicholas Ioannidis, Daniele Reda, Setareh Cohan, Michiel van de Panne

Abstract: Diffusion models can be used as a motion planner by sampling from a distribution of possible futures. However, the samples may not satisfy hard constraints that exist only implicitly in the training data, e.g., avoiding falls or not colliding with a wall. We propose learned viability filters that efficiently predict the future success of any given plan, i.e., diffusion sample, and thereby enforce an implicit future-success constraint. Multiple viability filters can also be composed together. We demonstrate the approach on detailed footstep planning for challenging 3D human locomotion tasks, showing the effectiveness of viability filters in performing online planning and control for box-climbing, step-over walls, and obstacle avoidance. We further show that using viability filters is significantly faster than guidance-based diffusion prediction.

cross Do Large Language Models Know How Much They Know?

Authors: Gabriele Prato, Jerry Huang, Prasannna Parthasarathi, Shagun Sodhani, Sarath Chandar

Abstract: Large Language Models (LLMs) have emerged as highly capable systems and are increasingly being integrated into various uses. However, the rapid pace of their deployment has outpaced a comprehensive understanding of their internal mechanisms and a delineation of their capabilities and limitations. A desired attribute of an intelligent system is its ability to recognize the scope of its own knowledge. To investigate whether LLMs embody this characteristic, we develop a benchmark designed to challenge these models to enumerate all information they possess on specific topics. This benchmark evaluates whether the models recall excessive, insufficient, or the precise amount of information, thereby indicating their awareness of their own knowledge. Our findings reveal that all tested LLMs, given sufficient scale, demonstrate an understanding of how much they know about specific topics. While different architectures exhibit varying rates of this capability's emergence, the results suggest that awareness of knowledge may be a generalizable attribute of LLMs. Further research is needed to confirm this potential and fully elucidate the underlying mechanisms.

cross Tell me why: Visual foundation models as self-explainable classifiers

Authors: Hugues Turb\'e, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis

Abstract: Visual foundation models (VFMs) have become increasingly popular due to their state-of-the-art performance. However, interpretability remains crucial for critical applications. In this sense, self-explainable models (SEM) aim to provide interpretable classifiers that decompose predictions into a weighted sum of interpretable concepts. Despite their promise, recent studies have shown that these explanations often lack faithfulness. In this work, we combine VFMs with a novel prototypical architecture and specialized training objectives. By training only a lightweight head (approximately 1M parameters) on top of frozen VFMs, our approach (ProtoFM) offers an efficient and interpretable solution. Evaluations demonstrate that our approach achieves competitive classification performance while outperforming existing models across a range of interpretability metrics derived from the literature. Code is available at https://github.com/hturbe/proto-fm.

URLs: https://github.com/hturbe/proto-fm.

cross A City of Millions: Mapping Literary Social Networks At Scale

Authors: Sil Hamilton, Rebecca M. M. Hicke, David Mimno, Matthew Wilkens

Abstract: We release 70,509 high-quality social networks extracted from multilingual fiction and nonfiction narratives. We additionally provide metadata for ~30,000 of these texts (73% nonfiction and 27% fiction) written between 1800 and 1999 in 58 languages. This dataset provides information on historical social worlds at an unprecedented scale, including data for 1,192,855 individuals in 2,805,482 pair-wise relationships annotated for affinity and relationship type. We achieve this scale by automating previously manual methods of extracting social networks; specifically, we adapt an existing annotation task as a language model prompt, ensuring consistency at scale with the use of structured output. This dataset provides an unprecedented resource for the humanities and social sciences by providing data on cognitive models of social realities.

cross Evaluation of Hate Speech Detection Using Large Language Models and Geographical Contextualization

Authors: Anwar Hossain Zahid, Monoshi Kumar Roy, Swarna Das

Abstract: The proliferation of hate speech on social media is one of the serious issues that is bringing huge impacts to society: an escalation of violence, discrimination, and social fragmentation. The problem of detecting hate speech is intrinsically multifaceted due to cultural, linguistic, and contextual complexities and adversarial manipulations. In this study, we systematically investigate the performance of LLMs on detecting hate speech across multilingual datasets and diverse geographic contexts. Our work presents a new evaluation framework in three dimensions: binary classification of hate speech, geography-aware contextual detection, and robustness to adversarially generated text. Using a dataset of 1,000 comments from five diverse regions, we evaluate three state-of-the-art LLMs: Llama2 (13b), Codellama (7b), and DeepSeekCoder (6.7b). Codellama had the best binary classification recall with 70.6% and an F1-score of 52.18%, whereas DeepSeekCoder had the best performance in geographic sensitivity, correctly detecting 63 out of 265 locations. The tests for adversarial robustness also showed significant weaknesses; Llama2 misclassified 62.5% of manipulated samples. These results bring to light the trade-offs between accuracy, contextual understanding, and robustness in the current versions of LLMs. This work has thus set the stage for developing contextually aware, multilingual hate speech detection systems by underlining key strengths and limitations, therefore offering actionable insights for future research and real-world applications.

cross Self-rewarding correction for mathematical reasoning

Authors: Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang

Abstract: We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during the inference time-without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-staged algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.

cross A Method for Evaluating the Interpretability of Machine Learning Models in Predicting Bond Default Risk Based on LIME and SHAP

Authors: Yan Zhang, Lin Chen, Yixiang Tian

Abstract: Interpretability analysis methods for artificial intelligence models, such as LIME and SHAP, are widely used, though they primarily serve as post-model for analyzing model outputs. While it is commonly believed that the transparency and interpretability of AI models diminish as their complexity increases, currently there is no standardized method for assessing the inherent interpretability of the models themselves. This paper uses bond market default prediction as a case study, applying commonly used machine learning algorithms within AI models. First, the classification performance of these algorithms in default prediction is evaluated. Then, leveraging LIME and SHAP to assess the contribution of sample features to prediction outcomes, the paper proposes a novel method for evaluating the interpretability of the models themselves. The results of this analysis are consistent with the intuitive understanding and logical expectations regarding the interpretability of these models.

cross Sensor-Invariant Tactile Representation

Authors: Harsh Gupta, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan

Abstract: High-resolution tactile sensors have become critical for embodied perception and robotic manipulation. However, a key challenge in the field is the lack of transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. This limitation hinders the ability to transfer models or knowledge learned from one sensor to another. To address this, we introduce a novel method for extracting Sensor-Invariant Tactile Representations (SITR), enabling zero-shot transfer across optical tactile sensors. Our approach utilizes a transformer-based architecture trained on a diverse dataset of simulated sensor designs, allowing it to generalize to new sensors in the real world with minimal calibration. Experimental results demonstrate the method's effectiveness across various tactile sensing applications, facilitating data and model transferability for future advancements in the field.

cross Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Authors: Moo Jin Kim, Chelsea Finn, Percy Liang

Abstract: Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$\times$. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs ($\pi_0$ and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.

URLs: https://openvla-oft.github.io/.

cross AutoBS: Autonomous Base Station Deployment Framework with Reinforcement Learning and Digital Twin Network

Authors: Ju-Hyung Lee, Andreas F. Molisch

Abstract: This paper introduces AutoBS, a reinforcement learning (RL)-based framework for optimal base station (BS) deployment in 6G networks. AutoBS leverages the Proximal Policy Optimization (PPO) algorithm and fast, site-specific pathloss predictions from PMNet to efficiently learn deployment strategies that balance coverage and capacity. Numerical results demonstrate that AutoBS achieves 95% for a single BS, and 90% for multiple BSs, of the capacity provided by exhaustive search methods while reducing inference time from hours to milliseconds, making it highly suitable for real-time applications. AutoBS offers a scalable and automated solution for large-scale 6G networks, addressing the challenges of dynamic environments with minimal computational overhead.

cross Spectral Analysis of Representational Similarity with Limited Neurons

Authors: Hyunmo Kang, Abdulkadir Canatar, SueYeon Chung

Abstract: Measuring representational similarity between neural recordings and computational models is challenging due to constraints on the number of neurons that can be recorded simultaneously. In this work, we investigate how such limitations affect similarity measures, focusing on Canonical Correlation Analysis (CCA) and Centered Kernel Alignment (CKA). Leveraging tools from Random Matrix Theory, we develop a predictive spectral framework for these measures and demonstrate that finite neuron sampling systematically underestimates similarity due to eigenvector delocalization. To overcome this, we introduce a denoising method to infer population-level similarity, enabling accurate analysis even with small neuron samples. Our theory is validated on synthetic and real datasets, offering practical strategies for interpreting neural data under finite sampling constraints.

cross SuPreME: A Supervised Pre-training Framework for Multimodal ECG Representation Learning

Authors: Mingsheng Cai, Jiuming Jiang, Wenhao Huang, Che Liu, Rossella Arcucci

Abstract: Cardiovascular diseases are a leading cause of death and disability worldwide. Electrocardiogram (ECG) recordings are critical for diagnosing and monitoring cardiac health, but obtaining large-scale annotated ECG datasets is labor-intensive and time-consuming. Recent ECG Self-Supervised Learning (eSSL) methods mitigate this by learning features without extensive labels but fail to capture fine-grained clinical semantics and require extensive task-specific fine-tuning. To address these challenges, we propose $\textbf{SuPreME}$, a $\textbf{Su}$pervised $\textbf{Pre}$-training framework for $\textbf{M}$ultimodal $\textbf{E}$CG representation learning. SuPreME applies Large Language Models (LLMs) to extract structured clinical entities from free-text ECG reports, filter out noise and irrelevant content, enhance clinical representation learning, and build a high-quality, fine-grained labeled dataset. By using text-based cardiac queries instead of traditional categorical labels, SuPreME enables zero-shot classification of unseen diseases without additional fine-tuning. We evaluate SuPreME on six downstream datasets covering 127 cardiac conditions, achieving superior zero-shot AUC performance over state-of-the-art eSSL and multimodal methods by over 1.96\%. Results demonstrate the effectiveness of SuPreME in leveraging structured, clinically relevant knowledge for high-quality ECG representations. All code and data will be released upon acceptance.

cross Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack

Authors: Chenhe Gu, Jindong Gu, Andong Hua, Yao Qin

Abstract: Multimodal Large Language Models (MLLMs), built upon LLMs, have recently gained attention for their capabilities in image recognition and understanding. However, while MLLMs are vulnerable to adversarial attacks, the transferability of these attacks across different models remains limited, especially under targeted attack setting. Existing methods primarily focus on vision-specific perturbations but struggle with the complex nature of vision-language modality alignment. In this work, we introduce the Dynamic Vision-Language Alignment (DynVLA) Attack, a novel approach that injects dynamic perturbations into the vision-language connector to enhance generalization across diverse vision-language alignment of different models. Our experimental results show that DynVLA significantly improves the transferability of adversarial examples across various MLLMs, including BLIP2, InstructBLIP, MiniGPT4, LLaVA, and closed-source models such as Gemini.

cross BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance

Authors: Xin Ye, Burhaneddin Yaman, Sheng Cheng, Feng Tao, Abhirup Mallik, Liu Ren

Abstract: Bird's-eye-view (BEV) representations play a crucial role in autonomous driving tasks. Despite recent advancements in BEV generation, inherent noise, stemming from sensor limitations and the learning process, remains largely unaddressed, resulting in suboptimal BEV representations that adversely impact the performance of downstream tasks. To address this, we propose BEVDiffuser, a novel diffusion model that effectively denoises BEV feature maps using the ground-truth object layout as guidance. BEVDiffuser can be operated in a plug-and-play manner during training time to enhance existing BEV models without requiring any architectural modifications. Extensive experiments on the challenging nuScenes dataset demonstrate BEVDiffuser's exceptional denoising and generation capabilities, which enable significant enhancement to existing BEV models, as evidenced by notable improvements of 12.3\% in mAP and 10.1\% in NDS achieved for 3D object detection without introducing additional computational complexity. Moreover, substantial improvements in long-tail object detection and under challenging weather and lighting conditions further validate BEVDiffuser's effectiveness in denoising and enhancing BEV representations.

cross Recent Advances on Generalizable Diffusion-generated Image Detection

Authors: Qijie Xu, Defang Chen, Jiawei Chen, Siwei Lyu, Can Wang

Abstract: The rise of diffusion models has significantly improved the fidelity and diversity of generated images. With numerous benefits, these advancements also introduce new risks. Diffusion models can be exploited to create high-quality Deepfake images, which poses challenges for image authenticity verification. In recent years, research on generalizable diffusion-generated image detection has grown rapidly. However, a comprehensive review of this topic is still lacking. To bridge this gap, we present a systematic survey of recent advances and classify them into two main categories: (1) data-driven detection and (2) feature-driven detection. Existing detection methods are further classified into six fine-grained categories based on their underlying principles. Finally, we identify several open challenges and envision some future directions, with the hope of inspiring more research work on this important topic. Reviewed works in this survey can be found at https://github.com/zju-pi/Awesome-Diffusion-generated-Image-Detection.

URLs: https://github.com/zju-pi/Awesome-Diffusion-generated-Image-Detection.

cross Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning

Authors: Xinran Li, Xiaolu Wang, Chenjia Bai, Jun Zhang

Abstract: In cooperative multi-agent reinforcement learning (MARL), well-designed communication protocols can effectively facilitate consensus among agents, thereby enhancing task performance. Moreover, in large-scale multi-agent systems commonly found in real-world applications, effective communication plays an even more critical role due to the escalated challenge of partial observability compared to smaller-scale setups. In this work, we endeavor to develop a scalable communication protocol for MARL. Unlike previous methods that focus on selecting optimal pairwise communication links-a task that becomes increasingly complex as the number of agents grows-we adopt a global perspective on communication topology design. Specifically, we propose utilizing the exponential topology to enable rapid information dissemination among agents by leveraging its small-diameter and small-size properties. This approach leads to a scalable communication protocol, named ExpoComm. To fully unlock the potential of exponential graphs as communication topologies, we employ memory-based message processors and auxiliary tasks to ground messages, ensuring that they reflect global information and benefit decision-making. Extensive experiments on large-scale cooperative benchmarks, including MAgent and Infrastructure Management Planning, demonstrate the superior performance and robust zero-shot transferability of ExpoComm compared to existing communication strategies. The code is publicly available at https://github.com/LXXXXR/ExpoComm.

URLs: https://github.com/LXXXXR/ExpoComm.

cross Bridging the PLC Binary Analysis Gap: A Cross-Compiler Dataset and Neural Framework for Industrial Control Systems

Authors: Yonatan Gizachew Achamyeleh, Shih-Yuan Yu, Gustavo Quir\'os Araya, Mohammad Abdullah Al Faruque

Abstract: Industrial Control Systems (ICS) rely heavily on Programmable Logic Controllers (PLCs) to manage critical infrastructure, yet analyzing PLC executables remains challenging due to diverse proprietary compilers and limited access to source code. To bridge this gap, we introduce PLC-BEAD, a comprehensive dataset containing 2431 compiled binaries from 700+ PLC programs across four major industrial compilers (CoDeSys, GEB, OpenPLC-V2, OpenPLC-V3). This novel dataset uniquely pairs each binary with its original Structured Text source code and standardized functionality labels, enabling both binary-level and source-level analysis. We demonstrate the dataset's utility through PLCEmbed, a transformer-based framework for binary code analysis that achieves 93\% accuracy in compiler provenance identification and 42\% accuracy in fine-grained functionality classification across 22 industrial control categories. Through comprehensive ablation studies, we analyze how compiler optimization levels, code patterns, and class distributions influence model performance. We provide detailed documentation of the dataset creation process, labeling taxonomy, and benchmark protocols to ensure reproducibility. Both PLC-BEAD and PLCEmbed are released as open-source resources to foster research in PLC security, reverse engineering, and ICS forensics, establishing new baselines for data-driven approaches to industrial cybersecurity.

cross FPGA-Accelerated SpeckleNN with SNL for Real-time X-ray Single-Particle Imaging

Authors: Abhilasha Dave, Cong Wang, James Russell, Ryan Herbst, Jana Thayer

Abstract: We implement a specialized version of our SpeckleNN model for real-time speckle pattern classification in X-ray Single-Particle Imaging (SPI) using the SLAC Neural Network Library (SNL) on an FPGA. This hardware is optimized for inference near detectors in high-throughput X-ray free-electron laser (XFEL) facilities like the Linac Coherent Light Source (LCLS). To fit FPGA constraints, we optimized SpeckleNN, reducing parameters from 5.6M to 64.6K (98.8% reduction) with 90% accuracy. We also compressed the latent space from 128 to 50 dimensions. Deployed on a KCU1500 FPGA, the model used 71% of DSPs, 75% of LUTs, and 48% of FFs, with an average power consumption of 9.4W. The FPGA achieved 45.015us inference latency at 200 MHz. On an NVIDIA A100 GPU, the same inference consumed ~73W and had a 400us latency. Our FPGA version achieved an 8.9x speedup and 7.8x power reduction over the GPU. Key advancements include model specialization and dynamic weight loading through SNL, eliminating time-consuming FPGA re-synthesis for fast, continuous deployment of (re)trained models. These innovations enable real-time adaptive classification and efficient speckle pattern vetoing, making SpeckleNN ideal for XFEL facilities. This implementation accelerates SPI experiments and enhances adaptability to evolving conditions.

cross PolyPrompt: Automating Knowledge Extraction from Multilingual Language Models with Dynamic Prompt Generation

Authors: Nathan Roll

Abstract: Large language models (LLMs) showcase increasingly impressive English benchmark scores, however their performance profiles remain inconsistent across multilingual settings. To address this gap, we introduce PolyPrompt, a novel, parameter-efficient framework for enhancing the multilingual capabilities of LLMs. Our method learns a set of trigger tokens for each language through a gradient-based search, identifying the input query's language and selecting the corresponding trigger tokens which are prepended to the prompt during inference. We perform experiments on two ~1 billion parameter models, with evaluations on the global MMLU benchmark across fifteen typologically and resource diverse languages, demonstrating accuracy gains of 3.7%-19.9% compared to naive and translation-pipeline baselines.

cross Inexact Moreau Envelope Lagrangian Method for Non-Convex Constrained Optimization under Local Error Bound Conditions on Constraint Functions

Authors: Yankun Huang, Qihang Lin, Yangyang Xu

Abstract: In this paper, we study the inexact Moreau envelope Lagrangian (iMELa) method for solving smooth non-convex optimization problems over a simple polytope with additional convex inequality constraints. By incorporating a proximal term into the traditional Lagrangian function, the iMELa method approximately solves a convex optimization subproblem over the polyhedral set at each main iteration. Under the assumption of a local error bound condition for subsets of the feasible set defined by subsets of the constraints, we establish that the iMELa method can find an $\epsilon$-Karush-Kuhn-Tucker point with $\tilde O(\epsilon^{-2})$ gradient oracle complexity.

cross EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models

Authors: Che Hyun Lee, Heeseung Kim, Jiheum Yeom, Sungroh Yoon

Abstract: We propose EdiText, a controllable text editing method that modify the reference text to desired attributes at various scales. We integrate an SDEdit-based editing technique that allows for broad adjustments in the degree of text editing. Additionally, we introduce a novel fine-level editing method based on self-conditioning, which allows subtle control of reference text. While being capable of editing on its own, this fine-grained method, integrated with the SDEdit approach, enables EdiText to make precise adjustments within the desired range. EdiText demonstrates its controllability to robustly adjust reference text at broad range of levels across various tasks, including toxicity control and sentiment control.

cross TAPE: Tailored Posterior Difference for Auditing of Machine Unlearning

Authors: Weiqi Wang, Zhiyi Tian, An Liu, Shui Yu

Abstract: With the increasing prevalence of Web-based platforms handling vast amounts of user data, machine unlearning has emerged as a crucial mechanism to uphold users' right to be forgotten, enabling individuals to request the removal of their specified data from trained models. However, the auditing of machine unlearning processes remains significantly underexplored. Although some existing methods offer unlearning auditing by leveraging backdoors, these backdoor-based approaches are inefficient and impractical, as they necessitate involvement in the initial model training process to embed the backdoors. In this paper, we propose a TAilored Posterior diffErence (TAPE) method to provide unlearning auditing independently of original model training. We observe that the process of machine unlearning inherently introduces changes in the model, which contains information related to the erased data. TAPE leverages unlearning model differences to assess how much information has been removed through the unlearning operation. Firstly, TAPE mimics the unlearned posterior differences by quickly building unlearned shadow models based on first-order influence estimation. Secondly, we train a Reconstructor model to extract and evaluate the private information of the unlearned posterior differences to audit unlearning. Existing privacy reconstructing methods based on posterior differences are only feasible for model updates of a single sample. To enable the reconstruction effective for multi-sample unlearning requests, we propose two strategies, unlearned data perturbation and unlearned influence-based division, to augment the posterior difference. Extensive experimental results indicate the significant superiority of TAPE over the state-of-the-art unlearning verification methods, at least 4.5$\times$ efficiency speedup and supporting the auditing for broader unlearning scenarios.

cross Do Retrieval-Augmented Language Models Adapt to Varying User Needs?

Authors: Peilin Wu, Xinlu Zhang, Wenhao Yu, Xingyu Liu, Xinya Du, Zhiyu Zoey Chen

Abstract: Recent advancements in Retrieval-Augmented Language Models (RALMs) have demonstrated their efficacy in knowledge-intensive tasks. However, existing evaluation benchmarks often assume a single optimal approach to leveraging retrieved information, failing to account for varying user needs. This paper introduces a novel evaluation framework that systematically assesses RALMs under three user need cases-Context-Exclusive, Context-First, and Memory-First-across three distinct context settings: Context Matching, Knowledge Conflict, and Information Irrelevant. By varying both user instructions and the nature of retrieved information, our approach captures the complexities of real-world applications where models must adapt to diverse user requirements. Through extensive experiments on multiple QA datasets, including HotpotQA, DisentQA, and our newly constructed synthetic URAQ dataset, we find that restricting memory usage improves robustness in adversarial retrieval conditions but decreases peak performance with ideal retrieval results and model family dominates behavioral differences. Our findings highlight the necessity of user-centric evaluations in the development of retrieval-augmented systems and provide insights into optimizing model performance across varied retrieval contexts. We will release our code and URAQ dataset upon acceptance of the paper.

cross SCU: An Efficient Machine Unlearning Scheme for Deep Learning Enabled Semantic Communications

Authors: Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu

Abstract: Deep learning (DL) enabled semantic communications leverage DL to train encoders and decoders (codecs) to extract and recover semantic information. However, most semantic training datasets contain personal private information. Such concerns call for enormous requirements for specified data erasure from semantic codecs when previous users hope to move their data from the semantic system. {Existing machine unlearning solutions remove data contribution from trained models, yet usually in supervised sole model scenarios. These methods are infeasible in semantic communications that often need to jointly train unsupervised encoders and decoders.} In this paper, we investigate the unlearning problem in DL-enabled semantic communications and propose a semantic communication unlearning (SCU) scheme to tackle the problem. {SCU includes two key components. Firstly,} we customize the joint unlearning method for semantic codecs, including the encoder and decoder, by minimizing mutual information between the learned semantic representation and the erased samples. {Secondly,} to compensate for semantic model utility degradation caused by unlearning, we propose a contrastive compensation method, which considers the erased data as the negative samples and the remaining data as the positive samples to retrain the unlearned semantic models contrastively. Theoretical analysis and extensive experimental results on three representative datasets demonstrate the effectiveness and efficiency of our proposed methods.

cross Text classification using machine learning methods

Authors: Bogdan Oancea

Abstract: In this paper we present the results of an experiment aimed to use machine learning methods to obtain models that can be used for the automatic classification of products. In order to apply automatic classification methods, we transformed the product names from a text representation to numeric vectors, a process called word embedding. We used several embedding methods: Count Vectorization, TF-IDF, Word2Vec, FASTTEXT, and GloVe. Having the product names in a form of numeric vectors, we proceeded with a set of machine learning methods for automatic classification: Logistic Regression, Multinomial Naive Bayes, kNN, Artificial Neural Networks, Support Vector Machines, and Decision trees with several variants. The results show an impressive accuracy of the classification process for Support Vector Machines, Logistic Regression, and Random Forests. Regarding the word embedding methods, the best results were obtained with the FASTTEXT technique.

cross Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

Authors: Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu

Abstract: Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of a MoE layer can occupy 47% time of the entire model execution with popular models and frameworks. Therefore, existing methods suggest the communication in a MoE layer to be pipelined with the computation for overlapping. However, these coarse grained overlapping schemes introduce a notable impairment of computational efficiency and the latency concealing is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$ and for end-to-end execution, COMET delivers a $1.71\times$ speedup on average. COMET has been adopted in the production environment of clusters with ten-thousand-scale of GPUs, achieving savings of millions of GPU hours.

cross Shared Stochastic Gaussian Process Latent Variable Models: A Multi-modal Generative Model for Quasar Spectra

Authors: Vidhi Lalchand, Anna-Christina Eilers

Abstract: This work proposes a scalable probabilistic latent variable model based on Gaussian processes (Lawrence, 2004) in the context of multiple observation spaces. We focus on an application in astrophysics where data sets typically contain both observed spectral features and scientific properties of astrophysical objects such as galaxies or exoplanets. In our application, we study the spectra of very luminous galaxies known as quasars, along with their properties, such as the mass of their central supermassive black hole, accretion rate, and luminosity-resulting in multiple observation spaces. A single data point is then characterized by different classes of observations, each with different likelihoods. Our proposed model extends the baseline stochastic variational Gaussian process latent variable model (GPLVM) introduced by Lalchand et al. (2022) to this setting, proposing a seamless generative model where the quasar spectra and scientific labels can be generated simultaneously using a shared latent space as input to different sets of Gaussian process decoders, one for each observation space. Additionally, this framework enables training in a missing data setting where a large number of dimensions per data point may be unknown or unobserved. We demonstrate high-fidelity reconstructions of the spectra and scientific labels during test-time inference and briefly discuss the scientific interpretations of the results, along with the significance of such a generative model.

cross Fast Debiasing of the LASSO Estimator

Authors: Shuvayan Banerjee, James Saunderson, Radhendushka Srivastava, Ajit Rajwade

Abstract: In high-dimensional sparse regression, the \textsc{Lasso} estimator offers excellent theoretical guarantees but is well-known to produce biased estimates. To address this, \cite{Javanmard2014} introduced a method to ``debias" the \textsc{Lasso} estimates for a random sub-Gaussian sensing matrix $\boldsymbol{A}$. Their approach relies on computing an ``approximate inverse" $\boldsymbol{M}$ of the matrix $\boldsymbol{A}^\top \boldsymbol{A}/n$ by solving a convex optimization problem. This matrix $\boldsymbol{M}$ plays a critical role in mitigating bias and allowing for construction of confidence intervals using the debiased \textsc{Lasso} estimates. However the computation of $\boldsymbol{M}$ is expensive in practice as it requires iterative optimization. In the presented work, we re-parameterize the optimization problem to compute a ``debiasing matrix" $\boldsymbol{W} := \boldsymbol{AM}^{\top}$ directly, rather than the approximate inverse $\boldsymbol{M}$. This reformulation retains the theoretical guarantees of the debiased \textsc{Lasso} estimates, as they depend on the \emph{product} $\boldsymbol{AM}^{\top}$ rather than on $\boldsymbol{M}$ alone. Notably, we provide a simple, computationally efficient, closed-form solution for $\boldsymbol{W}$ under similar conditions for the sensing matrix $\boldsymbol{A}$ used in the original debiasing formulation, with an additional condition that the elements of every row of $\boldsymbol{A}$ have uncorrelated entries. Also, the optimization problem based on $\boldsymbol{W}$ guarantees a unique optimal solution, unlike the original formulation based on $\boldsymbol{M}$. We verify our main result with numerical simulations.

cross Beyond Worst-Case Dimensionality Reduction for Sparse Vectors

Authors: Sandeep Silwal, David P. Woodruff, Qiuyi Zhang

Abstract: We study beyond worst-case dimensionality reduction for $s$-sparse vectors. Our work is divided into two parts, each focusing on a different facet of beyond worst-case analysis: We first consider average-case guarantees. A folklore upper bound based on the birthday-paradox states: For any collection $X$ of $s$-sparse vectors in $\mathbb{R}^d$, there exists a linear map to $\mathbb{R}^{O(s^2)}$ which \emph{exactly} preserves the norm of $99\%$ of the vectors in $X$ in any $\ell_p$ norm (as opposed to the usual setting where guarantees hold for all vectors). We give lower bounds showing that this is indeed optimal in many settings: any oblivious linear map satisfying similar average-case guarantees must map to $\Omega(s^2)$ dimensions. The same lower bound also holds for a wide class of smooth maps, including `encoder-decoder schemes', where we compare the norm of the original vector to that of a smooth function of the embedding. These lower bounds reveal a separation result, as an upper bound of $O(s \log(d))$ is possible if we instead use arbitrary (possibly non-smooth) functions, e.g., via compressed sensing algorithms. Given these lower bounds, we specialize to sparse \emph{non-negative} vectors. For a dataset $X$ of non-negative $s$-sparse vectors and any $p \ge 1$, we can non-linearly embed $X$ to $O(s\log(|X|s)/\epsilon^2)$ dimensions while preserving all pairwise distances in $\ell_p$ norm up to $1\pm \epsilon$, with no dependence on $p$. Surprisingly, the non-negativity assumption enables much smaller embeddings than arbitrary sparse vectors, where the best known bounds suffer exponential dependence. Our map also guarantees \emph{exact} dimensionality reduction for $\ell_{\infty}$ by embedding into $O(s\log |X|)$ dimensions, which is tight. We show that both the non-linearity of $f$ and the non-negativity of $X$ are necessary, and provide downstream algorithmic improvements.

cross NeRFCom: Feature Transform Coding Meets Neural Radiance Field for Free-View 3D Scene Semantic Transmission

Authors: Weijie Yue, Zhongwei Si, Bolin Wu, Sixian Wang, Xiaoqi Qin, Kai Niu, Jincheng Dai, Ping Zhang

Abstract: We introduce NeRFCom, a novel communication system designed for end-to-end 3D scene transmission. Compared to traditional systems relying on handcrafted NeRF semantic feature decomposition for compression and well-adaptive channel coding for transmission error correction, our NeRFCom employs a nonlinear transform and learned probabilistic models, enabling flexible variable-rate joint source-channel coding and efficient bandwidth allocation aligned with the NeRF semantic feature's different contribution to the 3D scene synthesis fidelity. Experimental results demonstrate that NeRFCom achieves free-view 3D scene efficient transmission while maintaining robustness under adverse channel conditions.

cross Physics-Informed Neural Networks for Optimal Vaccination Plan in SIR Epidemic Models

Authors: Minseok Kim, Yeongjong Kim, Yeoneung Kim

Abstract: This work focuses on understanding the minimum eradication time for the controlled Susceptible-Infectious-Recovered (SIR) model in the time-homogeneous setting, where the infection and recovery rates are constant. The eradication time is defined as the earliest time the infectious population drops below a given threshold and remains below it. For time-homogeneous models, the eradication time is well-defined due to the predictable dynamics of the infectious population, and optimal control strategies can be systematically studied. We utilize Physics-Informed Neural Networks (PINNs) to solve the partial differential equation (PDE) governing the eradication time and derive the corresponding optimal vaccination control. The PINN framework enables a mesh-free solution to the PDE by embedding the dynamics directly into the loss function of a deep neural network. We use a variable scaling method to ensure stable training of PINN and mathematically analyze that this method is effective in our setting. This approach provides an efficient computational alternative to traditional numerical methods, allowing for an approximation of the eradication time and the optimal control strategy. Through numerical experiments, we validate the effectiveness of the proposed method in computing the minimum eradication time and achieving optimal control. This work offers a novel application of PINNs to epidemic modeling, bridging mathematical theory and computational practice for time-homogeneous SIR models.

cross A Multiple Transferable Neural Network Method with Domain Decomposition for Elliptic Interface Problems

Authors: Tianzheng Lu, Lili Ju, Liyong Zhu

Abstract: The transferable neural network (TransNet) is a two-layer shallow neural network with pre-determined and uniformly distributed neurons in the hidden layer, and the least-squares solvers can be particularly used to compute the parameters of its output layer when applied to the solution of partial differential equations. In this paper, we integrate the TransNet technique with the nonoverlapping domain decomposition and the interface conditions to develop a novel multiple transferable neural network (Multi-TransNet) method for solving elliptic interface problems, which typically contain discontinuities in both solutions and their derivatives across interfaces. We first propose an empirical formula for the TransNet to characterize the relationship between the radius of the domain-covering ball, the number of hidden-layer neurons, and the optimal neuron shape. In the Multi-TransNet method, we assign each subdomain one distinct TransNet with an adaptively determined number of hidden-layer neurons to maintain the globally uniform neuron distribution across the entire computational domain, and then unite all the subdomain TransNets together by incorporating the interface condition terms into the loss function. The empirical formula is also extended to the Multi-TransNet and further employed to estimate appropriate neuron shapes for the subdomain TransNets, greatly reducing the parameter tuning cost. Additionally, we propose a normalization approach to adaptively select the weighting parameters for the terms in the loss function. Ablation studies and extensive experiments with comparison tests on different types of elliptic interface problems with low to high contrast diffusion coefficients in two and three dimensions are carried out to numerically demonstrate the superior accuracy, efficiency, and robustness of the proposed Multi-TransNet method.

cross Community Detection by ELPMeans: An Unsupervised Approach That Uses Laplacian Centrality and Clustering

Authors: Shahin Momenzadeh, Rojiar Pir Mohammadiani

Abstract: Community detection in network analysis has become more intricate due to the recent hike in social networks (Cai et al., 2024). This paper suggests a new approach named ELPMeans that strives to address this challenge. For community detection in the whole network, ELPMeans combines Laplacian, Hierarchical Clustering as well as K-means algorithms. Our technique employs Laplacian centrality and minimum distance metrics for central node identification while k-means learning is used for efficient convergence to final community structure. Remarkably, ELPMeans is an unsupervised method which is not only simple to implement but also effectively tackles common problems such as random initialization of central nodes, or finding of number of communities (K). Experimental results show that our algorithm improves accuracy and reduces time complexity considerably outperforming recent approaches on real world networks. Moreover, our approach has a wide applicability range in various community detection tasks even with nonconvex shapes and no prior knowledge about the number of communities present.

cross Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

Authors: Yuan Sui, Yufei He, Tri Cao, Simeng Han, Bryan Hooi

Abstract: Large Language Models (LLMs) increasingly rely on prolonged reasoning chains to solve complex tasks. However, this trial-and-error approach often leads to high computational overhead and error propagation, where early mistakes can derail subsequent steps. To address these issues, we introduce Meta-Reasoner, a framework that dynamically optimizes inference-time reasoning by enabling LLMs to "think about how to think." Drawing inspiration from human meta-cognition and dual-process theory, Meta-Reasoner operates as a strategic advisor, decoupling high-level guidance from step-by-step generation. It employs "contextual multi-armed bandits" to iteratively evaluate reasoning progress, and select optimal strategies (e.g., backtrack, clarify ambiguity, restart from scratch, or propose alternative approaches), and reallocates computational resources toward the most promising paths. Our evaluations on mathematical reasoning and puzzles highlight the potential of dynamic reasoning chains to overcome inherent challenges in the LLM reasoning process and also show promise in broader applications, offering a scalable and adaptable solution for reasoning-intensive tasks.

cross Towards Collaborative Anti-Money Laundering Among Financial Institutions

Authors: Zhihua Tian, Yuan Ding, Xiang Yu, Enchao Gong, Jian Liu, Kui Ren

Abstract: Money laundering is the process that intends to legalize the income derived from illicit activities, thus facilitating their entry into the monetary flow of the economy without jeopardizing their source. It is crucial to identify such activities accurately and reliably in order to enforce anti-money laundering (AML). Despite considerable efforts to AML, a large number of such activities still go undetected. Rule-based methods were first introduced and are still widely used in current detection systems. With the rise of machine learning, graph-based learning methods have gained prominence in detecting illicit accounts through the analysis of money transfer graphs. Nevertheless, these methods generally assume that the transaction graph is centralized, whereas in practice, money laundering activities usually span multiple financial institutions. Due to regulatory, legal, commercial, and customer privacy concerns, institutions tend not to share data, restricting their utility in practical usage. In this paper, we propose the first algorithm that supports performing AML over multiple institutions while protecting the security and privacy of local data. To evaluate, we construct Alipay-ECB, a real-world dataset comprising digital transactions from Alipay, the world's largest mobile payment platform, alongside transactions from E-Commerce Bank (ECB). The dataset includes over 200 million accounts and 300 million transactions, covering both intra-institution transactions and those between Alipay and ECB. This makes it the largest real-world transaction graph available for analysis. The experimental results demonstrate that our methods can effectively identify cross-institution money laundering subgroups. Additionally, experiments on synthetic datasets also demonstrate that our method is efficient, requiring only a few minutes on datasets with millions of transactions.

cross Erasing Without Remembering: Safeguarding Knowledge Forgetting in Large Language Models

Authors: Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, Dacheng Tao

Abstract: In this paper, we explore machine unlearning from a novel dimension, by studying how to safeguard model unlearning in large language models (LLMs). Our goal is to prevent unlearned models from recalling any related memory of the targeted knowledge.We begin by uncovering a surprisingly simple yet overlooked fact: existing methods typically erase only the exact expressions of the targeted knowledge, leaving paraphrased or related information intact. To rigorously measure such oversights, we introduce UGBench, the first benchmark tailored for evaluating the generalisation performance across 13 state-of-the-art methods.UGBench reveals that unlearned models can still recall paraphrased answers and retain target facts in intermediate layers. To address this, we propose PERMU, a perturbation-based method that significantly enhances the generalisation capabilities for safeguarding LLM unlearning.Experiments demonstrate that PERMU delivers up to a 50.13% improvement in unlearning while maintaining a 43.53% boost in robust generalisation. Our code can be found in https://github.com/MaybeLizzy/UGBench.

URLs: https://github.com/MaybeLizzy/UGBench.

cross Asymptotics of Non-Convex Generalized Linear Models in High-Dimensions: A proof of the replica formula

Authors: Matteo Vilucchio, Yatin Dandi, Cedric Gerbelot, Florent Krzakala

Abstract: The analytic characterization of the high-dimensional behavior of optimization for Generalized Linear Models (GLMs) with Gaussian data has been a central focus in statistics and probability in recent years. While convex cases, such as the LASSO, ridge regression, and logistic regression, have been extensively studied using a variety of techniques, the non-convex case remains far less understood despite its significance. A non-rigorous statistical physics framework has provided remarkable predictions for the behavior of high-dimensional optimization problems, but rigorously establishing their validity for non-convex problems has remained a fundamental challenge. In this work, we address this challenge by developing a systematic framework that rigorously proves replica-symmetric formulas for non-convex GLMs and precisely determines the conditions under which these formulas are valid. Remarkably, the rigorous replica-symmetric predictions align exactly with the conjectures made by physicists, and the so-called replicon condition. The originality of our approach lies in connecting two powerful theoretical tools: the Gaussian Min-Max Theorem, which we use to provide precise lower bounds, and Approximate Message Passing (AMP), which is shown to achieve these bounds algorithmically. We demonstrate the utility of this framework through significant applications: (i) by proving the optimality of the Tukey loss over the more commonly used Huber loss under a $\varepsilon$ contaminated data model, (ii) establishing the optimality of negative regularization in high-dimensional non-convex regression and (iii) characterizing the performance limits of linearized AMP algorithms. By rigorously validating statistical physics predictions in non-convex settings, we aim to open new pathways for analyzing increasingly complex optimization landscapes beyond the convex regime.

cross Hiring under Congestion and Algorithmic Monoculture: Value of Strategic Behavior

Authors: Jackie Baek, Hamsa Bastani, Shihan Chen

Abstract: We study the impact of strategic behavior in a setting where firms compete to hire from a shared pool of applicants, and firms use a common algorithm to evaluate them. Each applicant is associated with a scalar score that is observed by all firms, provided by the algorithm. Firms simultaneously make interview decisions, where the number of interviews is capacity-constrained. Job offers are given to those who pass the interview, and an applicant who receives multiple offers accepts one of them uniformly at random. We fully characterize the set of Nash equilibria under this model. Defining social welfare as the total number of applicants who find a job, we then compare the social welfare at a Nash equilibrium to a naive baseline where all firms interview applicants with the highest scores. We show that the Nash equilibrium greatly improves upon social welfare compared to the naive baseline, especially when the interview capacity is small and the number of firms is large. We also show that the price of anarchy is small, providing further appeal for the equilibrium solution. We then study how the firms may converge to a Nash equilibrium. We show that when firms make interview decisions sequentially and each firm takes the best response action assuming they are the last to act, this process converges to an equilibrium when interview capacities are small. However, we show that the task of computing the best response is difficult if firms have to use its own historical samples to estimate it, while this task becomes trivial if firms have information on the degree of competition for each applicant. Therefore, converging to an equilibrium can be greatly facilitated if firms have information on the level of competition for each applicant.

cross RouteRL: Multi-agent reinforcement learning framework for urban route choice with autonomous vehicles

Authors: Ahmet Onur Akman, Anastasia Psarou, {\L}ukasz Gorczyca, Zolt\'an Gy\"orgy Varga, Grzegorz Jamr\'oz, Rafa{\l} Kucharski

Abstract: RouteRL is a novel framework that integrates multi-agent reinforcement learning (MARL) with a microscopic traffic simulation, facilitating the testing and development of efficient route choice strategies for autonomous vehicles (AVs). The proposed framework simulates the daily route choices of driver agents in a city, including two types: human drivers, emulated using behavioral route choice models, and AVs, modeled as MARL agents optimizing their policies for a predefined objective. RouteRL aims to advance research in MARL, transport modeling, and human-AI interaction for transportation applications. This study presents a technical report on RouteRL, outlines its potential research contributions, and showcases its impact via illustrative examples.

cross Qini curve estimation under clustered network interference

Authors: Rickard K. A. Karlsson, Bram van den Akker, Felipe Moraes, Hugo M. Proen\c{c}a, Jesse H. Krijthe

Abstract: Qini curves are a widely used tool for assessing treatment policies under allocation constraints as they visualize the incremental gain of a new treatment policy versus the cost of its implementation. Standard Qini curve estimation assumes no interference between units: that is, that treating one unit does not influence the outcome of any other unit. In many real-life applications such as public policy or marketing, however, the presence of interference is common. Ignoring interference in these scenarios can lead to systematically biased Qini curves that over- or under-estimate a treatment policy's cost-effectiveness. In this paper, we address the problem of Qini curve estimation under clustered network interference, where interfering units form independent clusters. We propose a formal description of the problem setting with an experimental study design under which we can account for clustered network interference. Within this framework, we introduce three different estimation strategies suited for different conditions. Moreover, we introduce a marketplace simulator that emulates clustered network interference in a typical e-commerce setting. From both theoretical and empirical insights, we provide recommendations in choosing the best estimation strategy by identifying an inherent bias-variance trade-off among the estimation strategies.

cross Self-Training Elicits Concise Reasoning in Large Language Models

Authors: Tergel Munkhbat, Namgyu Ho, Seohyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun

Abstract: Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning

URLs: https://github.com/TergelMunkhbat/concise-reasoning

cross Discovering Antagonists in Networks of Systems: Robot Deployment

Authors: Ingeborg Wenger, Peter Eberhard, Henrik Ebel

Abstract: A contextual anomaly detection method is proposed and applied to the physical motions of a robot swarm executing a coverage task. Using simulations of a swarm's normal behavior, a normalizing flow is trained to predict the likelihood of a robot motion within the current context of its environment. During application, the predicted likelihood of the observed motions is used by a detection criterion that categorizes a robot agent as normal or antagonistic. The proposed method is evaluated on five different strategies of antagonistic behavior. Importantly, only readily available simulated data of normal robot behavior is used for training such that the nature of the anomalies need not be known beforehand. The best detection criterion correctly categorizes at least 80% of each antagonistic type while maintaining a false positive rate of less than 5% for normal robot agents. Additionally, the method is validated in hardware experiments, yielding results similar to the simulated scenarios. Compared to the state-of-the-art approach, both the predictive performance of the normalizing flow and the robustness of the detection criterion are increased.

cross Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking

Authors: Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin

Abstract: Chain-of-Thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks, and prior research shows that CoT can theoretically increase expressiveness. However, there is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. In this work, we (1) evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT. (2) Next, we identify the circuit, a subset of model components, responsible for tracking the world state, finding that late-layer MLP neurons play a key role. We propose two metrics, compression and distinction, and show that the neuron sets for each state achieve nearly 100% accuracy, providing evidence of an implicit finite state automaton (FSA) embedded within the model. (3) Additionally, we explore three realistic settings: skipping intermediate steps, introducing data noise, and testing length generalization. Our results demonstrate that Transformer+CoT learns robust algorithms (FSA), highlighting its resilience in challenging scenarios.

cross QPM: Discrete Optimization for Globally Interpretable Image Classification

Authors: Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Ramesh Manuvinakurike, Bodo Rosenhahn

Abstract: Understanding the classifications of deep neural networks, e.g. used in safety-critical situations, is becoming increasingly important. While recent models can locally explain a single decision, to provide a faithful global explanation about an accurate model's general behavior is a more challenging open task. Towards that goal, we introduce the Quadratic Programming Enhanced Model (QPM), which learns globally interpretable class representations. QPM represents every class with a binary assignment of very few, typically 5, features, that are also assigned to other classes, ensuring easily comparable contrastive class representations. This compact binary assignment is found using discrete optimization based on predefined similarity measures and interpretability constraints. The resulting optimal assignment is used to fine-tune the diverse features, so that each of them becomes the shared general concept between the assigned classes. Extensive evaluations show that QPM delivers unprecedented global interpretability across small and large-scale datasets while setting the state of the art for the accuracy of interpretable models.

cross Robust sensitivity control in digital pathology via tile score distribution matching

Authors: Arthur Pignet, John Klein, Genevieve Robin, Antoine Olivier

Abstract: Deploying digital pathology models across medical centers is challenging due to distribution shifts. Recent advances in domain generalization improve model transferability in terms of aggregated performance measured by the Area Under Curve (AUC). However, clinical regulations often require to control the transferability of other metrics, such as prescribed sensitivity levels. We introduce a novel approach to control the sensitivity of whole slide image (WSI) classification models, based on optimal transport and Multiple Instance Learning (MIL). Validated across multiple cohorts and tasks, our method enables robust sensitivity control with only a handful of calibration samples, providing a practical solution for reliable deployment of computational pathology systems.

cross Accelerating Model-Based Reinforcement Learning with State-Space World Models

Authors: Maria Krinner, Elie Aljalbout, Angel Romero, Davide Scaramuzza

Abstract: Reinforcement learning (RL) is a powerful approach for robot learning. However, model-free RL (MFRL) requires a large number of environment interactions to learn successful control policies. This is due to the noisy RL training updates and the complexity of robotic systems, which typically involve highly non-linear dynamics and noisy sensor signals. In contrast, model-based RL (MBRL) not only trains a policy but simultaneously learns a world model that captures the environment's dynamics and rewards. The world model can either be used for planning, for data collection, or to provide first-order policy gradients for training. Leveraging a world model significantly improves sample efficiency compared to model-free RL. However, training a world model alongside the policy increases the computational complexity, leading to longer training times that are often intractable for complex real-world scenarios. In this work, we propose a new method for accelerating model-based RL using state-space world models. Our approach leverages state-space models (SSMs) to parallelize the training of the dynamics model, which is typically the main computational bottleneck. Additionally, we propose an architecture that provides privileged information to the world model during training, which is particularly relevant for partially observable environments. We evaluate our method in several real-world agile quadrotor flight tasks, involving complex dynamics, for both fully and partially observable environments. We demonstrate a significant speedup, reducing the world model training time by up to 10 times, and the overall MBRL training time by up to 4 times. This benefit comes without compromising performance, as our method achieves similar sample efficiency and task rewards to state-of-the-art MBRL methods.

cross Re-evaluating Open-ended Evaluation of Large Language Models

Authors: Siqi Liu, Ian Gemp, Luke Marris, Georgios Piliouras, Nicolas Heess, Marc Lanctot

Abstract: Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provide insights into the competitive landscape of LLM development.

cross Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge

Authors: Yan-Lun Chen, Yi-Ru Wei, Chia-Yi Hsu, Chia-Mu Yu, Chun-Ying Huang, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee

Abstract: Large language models (LLMs) demonstrate strong task-specific capabilities through fine-tuning, but merging multiple fine-tuned models often leads to degraded performance due to overlapping instruction-following components. Task Arithmetic (TA), which combines task vectors derived from fine-tuning, enables multi-task learning and task forgetting but struggles to isolate task-specific knowledge from general instruction-following behavior. To address this, we propose Layer-Aware Task Arithmetic (LATA), a novel approach that assigns layer-specific weights to task vectors based on their alignment with instruction-following or task-specific components. By amplifying task-relevant layers and attenuating instruction-following layers, LATA improves task learning and forgetting performance while preserving overall model utility. Experiments on multiple benchmarks, including WikiText-2, GSM8K, and HumanEval, demonstrate that LATA outperforms existing methods in both multi-task learning and selective task forgetting, achieving higher task accuracy and alignment with minimal degradation in output quality. Our findings highlight the importance of layer-wise analysis in disentangling task-specific and general-purpose knowledge, offering a robust framework for efficient model merging and editing.

cross Topological Autoencoders++: Fast and Accurate Cycle-Aware Dimensionality Reduction

Authors: Matt\'eo Cl\'emot, Julie Digne, Julien Tierny

Abstract: This paper presents a novel topology-aware dimensionality reduction approach aiming at accurately visualizing the cyclic patterns present in high dimensional data. To that end, we build on the Topological Autoencoders (TopoAE) formulation. First, we provide a novel theoretical analysis of its associated loss and show that a zero loss indeed induces identical persistence pairs (in high and low dimensions) for the $0$-dimensional persistent homology (PH$^0$) of the Rips filtration. We also provide a counter example showing that this property no longer holds for a naive extension of TopoAE to PH$^d$ for $d\ge 1$. Based on this observation, we introduce a novel generalization of TopoAE to $1$-dimensional persistent homology (PH$^1$), called TopoAE++, for the accurate generation of cycle-aware planar embeddings, addressing the above failure case. This generalization is based on the notion of cascade distortion, a new penalty term favoring an isometric embedding of the $2$-chains filling persistent $1$-cycles, hence resulting in more faithful geometrical reconstructions of the $1$-cycles in the plane. We further introduce a novel, fast algorithm for the exact computation of PH for Rips filtrations in the plane, yielding improved runtimes over previously documented topology-aware methods. Our method also achieves a better balance between the topological accuracy, as measured by the Wasserstein distance, and the visual preservation of the cycles in low dimensions. Our C++ implementation is available at https://github.com/MClemot/TopologicalAutoencodersPlusPlus.

URLs: https://github.com/MClemot/TopologicalAutoencodersPlusPlus.

cross Deep Convolutional Neural Networks for Palm Fruit Maturity Classification

Authors: Mingqiang Han, Chunlin Yi

Abstract: To maximize palm oil yield and quality, it is essential to harvest palm fruit at the optimal maturity stage. This project aims to develop an automated computer vision system capable of accurately classifying palm fruit images into five ripeness levels. We employ deep Convolutional Neural Networks (CNNs) to classify palm fruit images based on their maturity stage. A shallow CNN serves as the baseline model, while transfer learning and fine-tuning are applied to pre-trained ResNet50 and InceptionV3 architectures. The study utilizes a publicly available dataset of over 8,000 images with significant variations, which is split into 80\% for training and 20\% for testing. The proposed deep CNN models achieve test accuracies exceeding 85\% in classifying palm fruit maturity stages. This research highlights the potential of deep learning for automating palm fruit ripeness assessment, which can contribute to optimizing harvesting decisions and improving palm oil production efficiency.

cross Swap Regret and Correlated Equilibria Beyond Normal-Form Games

Authors: Eshwar Ram Arunachaleswaran, Natalie Collina, Yishay Mansour, Mehryar Mohri, Jon Schneider, Balasubramanian Sivan

Abstract: Swap regret is a notion that has proven itself to be central to the study of general-sum normal-form games, with swap-regret minimization leading to convergence to the set of correlated equilibria and guaranteeing non-manipulability against a self-interested opponent. However, the situation for more general classes of games -- such as Bayesian games and extensive-form games -- is less clear-cut, with multiple candidate definitions for swap-regret but no known efficiently minimizable variant of swap regret that implies analogous non-manipulability guarantees. In this paper, we present a new variant of swap regret for polytope games that we call ``profile swap regret'', with the property that obtaining sublinear profile swap regret is both necessary and sufficient for any learning algorithm to be non-manipulable by an opponent (resolving an open problem of Mansour et al., 2022). Although we show profile swap regret is NP-hard to compute given a transcript of play, we show it is nonetheless possible to design efficient learning algorithms that guarantee at most $O(\sqrt{T})$ profile swap regret. Finally, we explore the correlated equilibrium notion induced by low-profile-swap-regret play, and demonstrate a gap between the set of outcomes that can be implemented by this learning process and the set of outcomes that can be implemented by a third-party mediator (in contrast to the situation in normal-form games).

cross Generative adversarial neural networks for simulating neutrino interactions

Authors: Jose L. Bonilla, Krzysztof M. Graczyk, Artur M. Ankowski, Rwik Dharmapal Banerjee, Beata E. Kowal, Hemant Prasad, Jan T. Sobczyk

Abstract: We propose a new approach to simulate neutrino scattering events as an alternative to the standard Monte Carlo generator approach. Generative adversarial neural network (GAN) models are developed to simulate neutrino-carbon collisions in the few-GeV energy range. The models produce scattering events for a given neutrino energy. GAN models are trained on simulation data from NuWro Monte Carlo event generator. Two GAN models have been obtained: one simulating only quasielastic neutrino-nucleus scatterings and another simulating all interactions at given neutrino energy. The performance of both models has been assessed using two statistical metrics. It is shown that both GAN models successfully reproduce the event distributions.

cross HVI: A New color space for Low-light Image Enhancement

Authors: Qingsen Yan, Yixu Feng, Cheng Zhang, Guansong Pang, Kangbiao Shi, Peng Wu, Wei Dong, Jinqiu Sun, Yanning Zhang

Abstract: Low-Light Image Enhancement (LLIE) is a crucial computer vision task that aims to restore detailed visual information from corrupted low-light images. Many existing LLIE methods are based on standard RGB (sRGB) space, which often produce color bias and brightness artifacts due to inherent high color sensitivity in sRGB. While converting the images using Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. The former enforces small distances for red coordinates to remove the red artifacts, while the latter compresses the low-light regions to remove the black artifacts. To fully leverage the chromatic and intensity information, a novel Color and Intensity Decoupling Network (CIDNet) is further introduced to learn accurate photometric mapping function under different lighting conditions in the HVI space. Comprehensive results from benchmark and ablation experiments show that the proposed HVI color space with CIDNet outperforms the state-of-the-art methods on 10 datasets. The code is available at https://github.com/Fediory/HVI-CIDNet.

URLs: https://github.com/Fediory/HVI-CIDNet.

cross Multiple Linked Tensor Factorization

Authors: Zhiyu Kang, Raghavendra B. Rao, Eric F. Lock

Abstract: In biomedical research and other fields, it is now common to generate high content data that are both multi-source and multi-way. Multi-source data are collected from different high-throughput technologies while multi-way data are collected over multiple dimensions, yielding multiple tensor arrays. Integrative analysis of these data sets is needed, e.g., to capture and synthesize different facets of complex biological systems. However, despite growing interest in multi-source and multi-way factorization techniques, methods that can handle data that are both multi-source and multi-way are limited. In this work, we propose a Multiple Linked Tensors Factorization (MULTIFAC) method extending the CANDECOMP/PARAFAC (CP) decomposition to simultaneously reduce the dimension of multiple multi-way arrays and approximate underlying signal. We first introduce a version of the CP factorization with L2 penalties on the latent factors, leading to rank sparsity. When extended to multiple linked tensors, the method automatically reveals latent components that are shared across data sources or individual to each data source. We also extend the decomposition algorithm to its expectation-maximization (EM) version to handle incomplete data with imputation. Extensive simulation studies are conducted to demonstrate MULTIFAC's ability to (i) approximate underlying signal, (ii) identify shared and unshared structures, and (iii) impute missing data. The approach yields an interpretable decomposition on multi-way multi-omics data for a study on early-life iron deficiency.

cross Visual Adaptive Prompting for Compositional Zero-Shot Learning

Authors: Kyle Stein, Arash Mahyari, Guillermo Francia, Eman El-Sheikh

Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in learning joint representations of visual and textual data, making them powerful tools for tasks such as Compositional Zero-Shot Learning (CZSL). CZSL requires models to generalize to novel combinations of visual primitives-such as attributes and objects-that were not explicitly encountered during training. Recent works in prompting for CZSL have focused on modifying inputs for the text encoder, often using static prompts that do not change across varying visual contexts. However, these approaches struggle to fully capture varying visual contexts, as they focus on text adaptation rather than leveraging visual features for compositional reasoning. To address this, we propose Visual Adaptive Prompting System (VAPS) that leverages a learnable visual prompt repository and similarity-based retrieval mechanism within the framework of VLMs to bridge the gap between semantic and visual features. Our method introduces a dynamic visual prompt repository mechanism that selects the most relevant attribute and object prompts based on the visual features of the image. Our proposed system includes a visual prompt adapter that encourages the model to learn a more generalizable embedding space. Experiments on three CZSL benchmarks, across both closed and open-world scenarios, demonstrate state-of-the-art results.

cross LangProBe: a Language Programs Benchmark

Authors: Shangyin Tan, Lakshya A Agrawal, Arnav Singhvi, Liheng Lai, Michael J Ryan, Dan Klein, Omar Khattab, Koushik Sen, Matei Zaharia

Abstract: Composing language models (LMs) into multi-step language programs and automatically optimizing their modular prompts is now a mainstream paradigm for building AI systems, but the tradeoffs in this space have only scarcely been studied before. We introduce LangProBe, the first large-scale benchmark for evaluating the architectures and optimization strategies for language programs, with over 2000 combinations of tasks, architectures, optimizers, and choices of LMs. Using LangProBe, we are the first to study the impact of program architectures and optimizers (and their compositions together and with different models) on tradeoffs of quality and cost. We find that optimized language programs offer strong cost--quality Pareto improvement over raw calls to models, but simultaneously demonstrate that human judgment (or empirical decisions) about which compositions to pursue is still necessary for best performance. We will open source the code and evaluation data for LangProBe.

cross Impilict Runge-Kutta based sparse identification of governing equations in biologically motivated systems

Authors: Mehrdad Anvari, Hamidreza Marasi, Hossein Kheiri

Abstract: Identifying governing equations in physical and biological systems from datasets remains a long-standing challenge across various scientific disciplines, providing mechanistic insights into complex system evolution. Common methods like sparse identification of nonlinear dynamics (SINDy) often rely on precise derivative estimations, making them vulnerable to data scarcity and noise. This study presents a novel data-driven framework by integrating high order implicit Runge-Kutta methods (IRKs) with the sparse identification, termed IRK-SINDy. The framework exhibits remarkable robustness to data scarcity and noise by leveraging the lower stepsize constraint of IRKs. Two methods for incorporating IRKs into sparse regression are introduced: one employs iterative schemes for numerically solving nonlinear algebraic system of equations, while the other utilizes deep neural networks to predict stage values of IRKs. The performance of IRK-SINDy is demonstrated through numerical experiments on benchmark problems with varied dynamical behaviors, including linear and nonlinear oscillators, the Lorenz system, and biologically relevant models like predator-prey dynamics, logistic growth, and the FitzHugh-Nagumo model. Results indicate that IRK-SINDy outperforms conventional SINDy and the RK4-SINDy framework, particularly under conditions of extreme data scarcity and noise, yielding interpretable and generalizable models.

cross T1-PILOT: Optimized Trajectories for T1 Mapping Acceleration

Authors: Tamir Shor, Moti Freiman, Chaim Baskin, Alex Bronstein

Abstract: Cardiac T1 mapping provides critical quantitative insights into myocardial tissue composition, enabling the assessment of pathologies such as fibrosis, inflammation, and edema. However, the inherently dynamic nature of the heart imposes strict limits on acquisition times, making high-resolution T1 mapping a persistent challenge. Compressed sensing (CS) approaches have reduced scan durations by undersampling k-space and reconstructing images from partial data, and recent studies show that jointly optimizing the undersampling patterns with the reconstruction network can substantially improve performance. Still, most current T1 mapping pipelines rely on static, hand-crafted masks that do not exploit the full acceleration and accuracy potential. In this work, we introduce T1-PILOT: an end-to-end method that explicitly incorporates the T1 signal relaxation model into the sampling-reconstruction framework to guide the learning of non-Cartesian trajectories, crossframe alignment, and T1 decay estimation. Through extensive experiments on the CMRxRecon dataset, T1-PILOT significantly outperforms several baseline strategies (including learned single-mask and fixed radial or golden-angle sampling schemes), achieving higher T1 map fidelity at greater acceleration factors. In particular, we observe consistent gains in PSNR and VIF relative to existing methods, along with marked improvements in delineating finer myocardial structures. Our results highlight that optimizing sampling trajectories in tandem with the physical relaxation model leads to both enhanced quantitative accuracy and reduced acquisition times. Code for reproducing all results will be made publicly available upon publication.

cross Improving the Efficiency of a Deep Reinforcement Learning-Based Power Management System for HPC Clusters Using Curriculum Learning

Authors: Thomas Budiarjo, Santana Yuda Pradata, Kadek Gemilang Santiyuda, Muhammad Alfian Amrizal, Reza Pulungan, Hiroyuki Takizawa

Abstract: High energy consumption remains a key challenge in high-performance computing (HPC) systems, which often feature hundreds or thousands of nodes drawing substantial power even in idle or standby modes. Although powering down unused nodes can improve energy efficiency, choosing the wrong time to do so can degrade quality of service by delaying job execution. Machine learning, in particular reinforcement learning (RL), has shown promise in determining optimal times to switch nodes on or off. In this study, we enhance the performance of a deep reinforcement learning (DRL) agent for HPC power management by integrating curriculum learning (CL), a training approach that introduces tasks with gradually increasing difficulty. Using the Batsim-py simulation framework, we compare the proposed CL-based agent to both a baseline DRL method (without CL) and the conventional fixed-time timeout strategy. Experimental results confirm that an easy-to-hard curriculum outperforms other training orders in terms of reducing wasted energy usage. The best agent achieves a 3.73% energy reduction over the baseline DRL method and a 4.66% improvement compared to the best timeout configuration (shutdown every 15 minutes of idle time). In addition, it reduces average job waiting time by 9.24% and maintains a higher job-filling rate, indicating more effective resource utilization. Sensitivity tests across various switch-on durations, power levels, and cluster sizes further reveal the agent's adaptability to changing system parameters without retraining. These findings demonstrate that curriculum learning can significantly improve DRL-based power management in HPC, balancing energy savings, quality of service, and robustness to diverse configurations.

cross Bridging the Creativity Understanding Gap: Small-Scale Human Alignment Enables Expert-Level Humor Ranking in LLMs

Authors: Kuan Lok Zhou, Jiayi Chen, Siddharth Suresh, Reuben Narad, Timothy T. Rogers, Lalit K Jain, Robert D Nowak, Bob Mankoff, Jifan Zhang

Abstract: Large Language Models (LLMs) have shown significant limitations in understanding creative content, as demonstrated by Hessel et al. (2023)'s influential work on the New Yorker Cartoon Caption Contest (NYCCC). Their study exposed a substantial gap between LLMs and humans in humor comprehension, establishing that understanding and evaluating creative content is key challenge in AI development. We revisit this challenge by decomposing humor understanding into three components and systematically improve each: enhancing visual understanding through improved annotation, utilizing LLM-generated humor reasoning and explanations, and implementing targeted alignment with human preference data. Our refined approach achieves 82.4% accuracy in caption ranking, singificantly improving upon the previous 67% benchmark and matching the performance of world-renowned human experts in this domain. Notably, while attempts to mimic subgroup preferences through various persona prompts showed minimal impact, model finetuning with crowd preferences proved remarkably effective. These findings reveal that LLM limitations in creative judgment can be effectively addressed through focused alignment to specific subgroups and individuals. Lastly, we propose the position that achieving artificial general intelligence necessitates systematic collection of human preference data across creative domains. We advocate that just as human creativity is deeply influenced by individual and cultural preferences, training LLMs with diverse human preference data may be essential for developing true creative understanding.

cross Global Framework for Simultaneous Emulation Across the Nuclear Landscape

Authors: Antoine Belley, Jose M. Munoz, Ronald F. Garcia Ruiz

Abstract: We introduce a hierarchical framework that combines ab initio many-body calculations with a Bayesian neural network, developing emulators capable of accurately predicting nuclear properties across the nuclear chart, including multiple isotopes simultaneously. We benchmark our developments using the oxygen isotopic chain, achieving accurate results for ground-state energies and nuclear charge radii, while providing robust uncertainty quantification. Our framework enables global sensitivity analysis of nuclear binding energies and charge radii with respect to the low-energy constants that describe the nuclear force.

cross Tight Inversion: Image-Conditioned Inversion for Real Image Editing

Authors: Edo Kadosh, Nir Goren, Or Patashnik, Daniel Garibi, Daniel Cohen-Or

Abstract: Text-to-image diffusion models offer powerful image editing capabilities. To edit real images, many methods rely on the inversion of the image into Gaussian noise. A common approach to invert an image is to gradually add noise to the image, where the noise is determined by reversing the sampling equation. This process has an inherent tradeoff between reconstruction and editability, limiting the editing of challenging images such as highly-detailed ones. Recognizing the reliance of text-to-image models inversion on a text condition, this work explores the importance of the condition choice. We show that a condition that precisely aligns with the input image significantly improves the inversion quality. Based on our findings, we introduce Tight Inversion, an inversion method that utilizes the most possible precise condition -- the input image itself. This tight condition narrows the distribution of the model's output and enhances both reconstruction and editability. We demonstrate the effectiveness of our approach when combined with existing inversion methods through extensive experiments, evaluating the reconstruction accuracy as well as the integration with various editing methods.

cross Physics-Driven Data Generation for Contact-Rich Manipulation via Trajectory Optimization

Authors: Lujie Yang, H. J. Terry Suh, Tong Zhao, Bernhard Paus Graesdal, Tarik Kelestemur, Jiuguang Wang, Tao Pang, Russ Tedrake

Abstract: We present a low-cost data generation pipeline that integrates physics-based simulation, human demonstrations, and model-based planning to efficiently generate large-scale, high-quality datasets for contact-rich robotic manipulation tasks. Starting with a small number of embodiment-flexible human demonstrations collected in a virtual reality simulation environment, the pipeline refines these demonstrations using optimization-based kinematic retargeting and trajectory optimization to adapt them across various robot embodiments and physical parameters. This process yields a diverse, physically consistent dataset that enables cross-embodiment data transfer, and offers the potential to reuse legacy datasets collected under different hardware configurations or physical parameters. We validate the pipeline's effectiveness by training diffusion policies from the generated datasets for challenging contact-rich manipulation tasks across multiple robot embodiments, including a floating Allegro hand and bimanual robot arms. The trained policies are deployed zero-shot on hardware for bimanual iiwa arms, achieving high success rates with minimal human input. Project website: https://lujieyang.github.io/physicsgen/.

URLs: https://lujieyang.github.io/physicsgen/.

cross Scalable Signature Kernel Computations for Long Time Series via Local Neumann Series Expansions

Authors: Matthew Tamayo-Rios, Alexander Schell, Rima Alaifari

Abstract: The signature kernel is a recent state-of-the-art tool for analyzing high-dimensional sequential data, valued for its theoretical guarantees and strong empirical performance. In this paper, we present a novel method for efficiently computing the signature kernel of long, high-dimensional time series via dynamically truncated recursive local power series expansions. Building on the characterization of the signature kernel as the solution of a Goursat PDE, our approach employs tilewise Neumann-series expansions to derive rapidly converging power series approximations of the signature kernel that are locally defined on subdomains and propagated iteratively across the entire domain of the Goursat solution by exploiting the geometry of the time series. Algorithmically, this involves solving a system of interdependent local Goursat PDEs by recursively propagating boundary conditions along a directed graph via topological ordering, with dynamic truncation adaptively terminating each local power series expansion when coefficients fall below machine precision, striking an effective balance between computational cost and accuracy. This method achieves substantial performance improvements over state-of-the-art approaches for computing the signature kernel, providing (a) adjustable and superior accuracy, even for time series with very high roughness; (b) drastically reduced memory requirements; and (c) scalability to efficiently handle very long time series (e.g., with up to half a million points or more) on a single GPU. These advantages make our method particularly well-suited for rough-path-assisted machine learning, financial modeling, and signal processing applications that involve very long and highly volatile data.

cross Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids

Authors: Toru Lin, Kartik Sachdev, Linxi Fan, Jitendra Malik, Yuke Zhu

Abstract: Reinforcement learning has delivered promising results in achieving human- or even superhuman-level capabilities across diverse problem domains, but success in dexterous robot manipulation remains limited. This work investigates the key challenges in applying reinforcement learning to solve a collection of contact-rich manipulation tasks on a humanoid embodiment. We introduce novel techniques to overcome the identified challenges with empirical validation. Our main contributions include an automated real-to-sim tuning module that brings the simulated environment closer to the real world, a generalized reward design scheme that simplifies reward engineering for long-horizon contact-rich manipulation tasks, a divide-and-conquer distillation process that improves the sample efficiency of hard-exploration problems while maintaining sim-to-real performance, and a mixture of sparse and dense object representations to bridge the sim-to-real perception gap. We show promising results on three humanoid dexterous manipulation tasks, with ablation studies on each technique. Our work presents a successful approach to learning humanoid dexterous manipulation using sim-to-real reinforcement learning, achieving robust generalization and high performance without the need for human demonstration.

replace GSA-Forecaster: Forecasting Graph-Based Time-Dependent Data with Graph Sequence Attention

Authors: Yang Li, Di Wang, Jos\'e M. F. Moura

Abstract: Forecasting graph-based, time-dependent data has broad practical applications but presents challenges. Effective models must capture both spatial and temporal dependencies in the data, while also incorporating auxiliary information to enhance prediction accuracy. In this paper, we identify limitations in current state-of-the-art models regarding temporal dependency handling. To overcome this, we introduce GSA-Forecaster, a new deep learning model designed for forecasting in graph-based, time-dependent contexts. GSA-Forecaster utilizes graph sequence attention, a new attention mechanism proposed in this paper, to effectively manage temporal dependencies. GSA-Forecaster integrates the data's graph structure directly into its architecture, addressing spatial dependencies. Additionally, it incorporates auxiliary information to refine its predictions further. We validate its performance using real-world graph-based, time-dependent datasets, where it demonstrates superior effectiveness compared to existing state-of-the-art models.

replace FedICT: Federated Multi-task Distillation for Multi-access Edge Computing

Authors: Zhiyuan Wu, Sheng Sun, Yuwei Wang, Min Liu, Quyang Pan, Xuefeng Jiang, Bo Gao

Abstract: The growing interest in intelligent services and privacy protection for mobile devices has given rise to the widespread application of federated learning in Multi-access Edge Computing (MEC). Diverse user behaviors call for personalized services with heterogeneous Machine Learning (ML) models on different devices. Federated Multi-task Learning (FMTL) is proposed to train related but personalized ML models for different devices, whereas previous works suffer from excessive communication overhead during training and neglect the model heterogeneity among devices in MEC. Introducing knowledge distillation into FMTL can simultaneously enable efficient communication and model heterogeneity among clients, whereas existing methods rely on a public dataset, which is impractical in reality. To tackle this dilemma, Federated MultI-task Distillation for Multi-access Edge CompuTing (FedICT) is proposed. FedICT direct local-global knowledge aloof during bi-directional distillation processes between clients and the server, aiming to enable multi-task clients while alleviating client drift derived from divergent optimization directions of client-side local models. Specifically, FedICT includes Federated Prior Knowledge Distillation (FPKD) and Local Knowledge Adjustment (LKA). FPKD is proposed to reinforce the clients' fitting of local data by introducing prior knowledge of local data distributions. Moreover, LKA is proposed to correct the distillation loss of the server, making the transferred local knowledge better match the generalized representation. Experiments on three datasets show that FedICT significantly outperforms all compared benchmarks in various data heterogeneous and model architecture settings, achieving improved accuracy with less than 1.2% training communication overhead compared with FedAvg and no more than 75% training communication round compared with FedGKT.

replace Efficient and Accurate Optimal Transport with Mirror Descent and Conjugate Gradients

Authors: Mete Kemertas, Allan D. Jepson, Amir-massoud Farahmand

Abstract: We propose Mirror Descent Optimal Transport (MDOT), a novel method for solving discrete optimal transport (OT) problems with high precision, by unifying temperature annealing in entropic-regularized OT (EOT) with mirror descent techniques. In this framework, temperature annealing produces a sequence of EOT dual problems, whose solution gradually gets closer to the solution of the original OT problem. We solve each problem efficiently using a GPU-parallel nonlinear conjugate gradients algorithm (PNCG) that outperforms traditional Sinkhorn iterations under weak regularization. Moreover, our investigation also reveals that the theoretical convergence rate of Sinkhorn iterations can exceed existing non-asymptotic bounds when its stopping criterion is tuned in a manner analogous to MDOT. Our comprehensive ablation studies of MDOT-PNCG affirm its robustness across a wide range of algorithmic parameters. Benchmarking on 24 problem sets of size $n=4096$ in a GPU environment demonstrate that our method attains high-precision, feasible solutions significantly faster than a representative set of existing OT solvers (including accelerated gradient methods and advanced Sinkhorn variants) in both wall-clock time and number of operations. Empirical convergence rates range between $O(n^2 \varepsilon^{-1/4})$ and $O(n^2 \varepsilon^{-1})$, where $\varepsilon$ is the optimality gap. For problem sizes up to ${n=16,384}$, the empirical runtime scales as $\widetilde{O}(n^2)$ for moderate precision and as $\widetilde{O}(n^{5/2})$ at worst for high precision. These findings establish MDOT-PNCG as a compelling alternative to current OT solvers, particularly in challenging weak-regularization regimes.

replace Near-Linear Time Projection onto the $\ell_{1,\infty}$ Ball; Application to Sparse Autoencoders

Authors: Guillaume Perez, Laurent Condat, Michel Barlaud

Abstract: Looking for sparsity is nowadays crucial to speed up the training of large-scale neural networks. Projections onto the $\ell_{1,2}$ and $\ell_{1,\infty}$ are among the most efficient techniques to sparsify and reduce the overall cost of neural networks. In this paper, we introduce a new projection algorithm for the $\ell_{1,\infty}$ norm ball. The worst-case time complexity of this algorithm is $\mathcal{O}\big(nm+J\log(nm)\big)$ for a matrix in $\mathbb{R}^{n\times m}$. $J$ is a term that tends to 0 when the sparsity is high, and to $nm$ when the sparsity is low. Its implementation is easy and it is guaranteed to converge to the exact solution in a finite time. Moreover, we propose to incorporate the $\ell_{1,\infty}$ ball projection while training an autoencoder to enforce feature selection and sparsity of the weights. Sparsification appears in the encoder to primarily do feature selection due to our application in biology, where only a very small part ($<2\%$) of the data is relevant. We show that both in the biological case and in the general case of sparsity that our method is the fastest.

replace Elucidating the solution space of extended reverse-time SDE for diffusion models

Authors: Qinpeng Cui, Xinyi Zhang, Qiqi Bao, Qingmin Liao

Abstract: Sampling from Diffusion Models can alternatively be seen as solving differential equations, where there is a challenge in balancing speed and image visual quality. ODE-based samplers offer rapid sampling time but reach a performance limit, whereas SDE-based samplers achieve superior quality, albeit with longer iterations. In this work, we formulate the sampling process as an Extended Reverse-Time SDE (ER SDE), unifying prior explorations into ODEs and SDEs. Theoretically, leveraging the semi-linear structure of ER SDE solutions, we offer exact solutions and approximate solutions for VP SDE and VE SDE, respectively. Based on the approximate solution space of the ER SDE, referred to as one-step prediction errors, we yield mathematical insights elucidating the rapid sampling capability of ODE solvers and the high-quality sampling ability of SDE solvers. Additionally, we unveil that VP SDE solvers stand on par with their VE SDE counterparts. Based on these findings, leveraging the dual advantages of ODE solvers and SDE solvers, we devise efficient high-quality samplers, namely ER-SDE-Solvers. Experimental results demonstrate that ER-SDE-Solvers achieve state-of-the-art performance across all stochastic samplers while maintaining efficiency of deterministic samplers. Specifically, on the ImageNet $128\times128$ dataset, ER-SDE-Solvers obtain 8.33 FID in only 20 function evaluations. Code is available at \href{https://github.com/QinpengCui/ER-SDE-Solver}{https://github.com/QinpengCui/ER-SDE-Solver}

URLs: https://github.com/QinpengCui/ER-SDE-Solver, https://github.com/QinpengCui/ER-SDE-Solver

replace PFLlib: A Beginner-Friendly and Comprehensive Personalized Federated Learning Library and Benchmark

Authors: Jianqing Zhang, Yang Liu, Yang Hua, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, Jian Cao

Abstract: Amid the ongoing advancements in Federated Learning (FL), a machine learning paradigm that allows collaborative learning with data privacy protection, personalized FL (pFL)has gained significant prominence as a research direction within the FL domain. Whereas traditional FL (tFL) focuses on jointly learning a global model, pFL aims to balance each client's global and personalized goals in FL settings. To foster the pFL research community, we started and built PFLlib, a comprehensive pFL library with an integrated benchmark platform. In PFLlib, we implemented 37 state-of-the-art FL algorithms (8 tFL algorithms and 29 pFL algorithms) and provided various evaluation environments with three statistically heterogeneous scenarios and 24 datasets. At present, PFLlib has gained more than 1600 stars and 300 forks on GitHub.

replace Vanishing Feature: Diagnosing Model Merging and Beyond

Authors: Xingyu Qu, Samuel Horvath

Abstract: Model merging offers an efficient way to combine pre-trained neural networks but often suffers from inconsistent performance, especially when merging models with different initializations. We identify the ``vanishing feature'' phenomenon, where input-induced features diminish during propagation through the merged model, degrading performance. Through theoretical and empirical analysis, we reveal that this phenomenon underpins challenges like variance collapse and explains techniques like permutation-based merging, post-merging normalization, etc. We show that existing normalization strategies can be enhanced by precisely targeting the vanishing feature issue. Leveraging these insights, we propose the ``Preserve-First Merging'' (PFM) strategy, which focuses on preserving early-layer features, enabling the merged models, for the first time, to outperform the original models in advanced settings without post-training. Furthermore, we demonstrate that the vanishing feature phenomenon extends to other contexts, such as model pruning. Applying post-pruning normalization to mitigate the issue significantly improves one-shot pruning performance at high sparsity, offering a simple and effective post-pruning solution. The code is available at https://github.com/XingyuQu/VF.

URLs: https://github.com/XingyuQu/VF.

replace Universal Physics Transformers: A Framework For Efficiently Scaling Neural Operators

Authors: Benedikt Alkin, Andreas F\"urst, Simon Schmid, Lukas Gruber, Markus Holzleitner, Johannes Brandstetter

Abstract: Neural operators, serving as physics surrogate models, have recently gained increased interest. With ever increasing problem complexity, the natural question arises: what is an efficient way to scale neural operators to larger and more complex simulations - most importantly by taking into account different types of simulation datasets. This is of special interest since, akin to their numerical counterparts, different techniques are used across applications, even if the underlying dynamics of the systems are similar. Whereas the flexibility of transformers has enabled unified architectures across domains, neural operators mostly follow a problem specific design, where GNNs are commonly used for Lagrangian simulations and grid-based models predominate Eulerian simulations. We introduce Universal Physics Transformers (UPTs), an efficient and unified learning paradigm for a wide range of spatio-temporal problems. UPTs operate without grid- or particle-based latent structures, enabling flexibility and scalability across meshes and particles. UPTs efficiently propagate dynamics in the latent space, emphasized by inverse encoding and decoding techniques. Finally, UPTs allow for queries of the latent space representation at any point in space-time. We demonstrate diverse applicability and efficacy of UPTs in mesh-based fluid simulations, and steady-state Reynolds averaged Navier-Stokes simulations, and Lagrangian-based dynamics.

replace Hyperdimensional Representation Learning for Node Classification and Link Prediction

Authors: Abhishek Dalvi, Vasant Honavar

Abstract: We introduce Hyperdimensional Graph Learner (HDGL), a novel method for node classification and link prediction in graphs. HDGL maps node features into a very high-dimensional space (\textit{hyperdimensional} or HD space for short) using the \emph{injectivity} property of node representations in a family of Graph Neural Networks (GNNs) and then uses HD operators such as \textit{bundling} and \textit{binding} to aggregate information from the local neighborhood of each node yielding latent node representations that can support both node classification and link prediction tasks. HDGL, unlike GNNs that rely on computationally expensive iterative optimization and hyperparameter tuning, requires only a single pass through the data set. We report results of experiments using widely used benchmark datasets which demonstrate that, on the node classification task, HDGL achieves accuracy that is competitive with that of the state-of-the-art GNN methods at substantially reduced computational cost; and on the link prediction task, HDGL matches the performance of DeepWalk and related methods, although it falls short of computationally demanding state-of-the-art GNNs.

replace Meta-Task: A Method-Agnostic Framework for Learning to Regularize in Few-Shot Learning

Authors: Mohammad Rostami, Atik Faysal, Huaxia Wang, Avimanyu Sahoo

Abstract: Overfitting is a significant challenge in Few-Shot Learning (FSL), where models trained on small, variable datasets tend to memorize rather than generalize to unseen tasks. Regularization is crucial in FSL to prevent overfitting and enhance generalization performance. To address this issue, we introduce Meta-Task, a novel, method-agnostic framework that leverages both labeled and unlabeled data to enhance generalization through auxiliary tasks for regularization. Specifically, Meta-Task introduces a Task-Decoder, which is a simple example of the broader framework that refines hidden representations by reconstructing input images from embeddings, effectively mitigating overfitting. Our framework's method-agnostic design ensures its broad applicability across various FSL settings. We validate Meta-Task's effectiveness on standard benchmarks, including Mini-ImageNet, Tiered-ImageNet, and FC100, where it consistently improves existing state-of-the-art meta-learning techniques, demonstrating superior performance, faster convergence, reduced generalization error, and lower variance-all without extensive hyperparameter tuning. These results underline Meta-Task's practical applicability and efficiency in real-world, resource-constrained scenarios.

replace Future-Proofing Class-Incremental Learning

Authors: Quentin Jodelet, Xin Liu, Yin Jun Phua, Tsuyoshi Murata

Abstract: Exemplar-Free Class Incremental Learning is a highly challenging setting where replay memory is unavailable. Methods relying on frozen feature extractors have drawn attention recently in this setting due to their impressive performances and lower computational costs. However, those methods are highly dependent on the data used to train the feature extractor and may struggle when an insufficient amount of classes are available during the first incremental step. To overcome this limitation, we propose to use a pre-trained text-to-image diffusion model in order to generate synthetic images of future classes and use them to train the feature extractor. Experiments on the standard benchmarks CIFAR100 and ImageNet-Subset demonstrate that our proposed method can be used to improve state-of-the-art methods for exemplar-free class incremental learning, especially in the most difficult settings where the first incremental step only contains few classes. Moreover, we show that using synthetic samples of future classes achieves higher performance than using real data from different classes, paving the way for better and less costly pre-training methods for incremental learning.

replace Exploring the Decentraland Economy: Multifaceted Parcel Attributes, Key Insights, and Benchmarking

Authors: Dipika Jha, Ankit K. Bhagat, Raju Halder, Rajendra N. Paramanik, Chandra M. Kumar

Abstract: This paper presents a comprehensive Decentraland parcels dataset, called IITP-VDLand, sourced from diverse platforms such as Decentraland, OpenSea, Etherscan, Google BigQuery, and various Social Media Platforms. Unlike existing datasets which have limited attributes and records, IITP-VDLand offers a rich array of attributes, encompassing parcel characteristics, trading history, past activities, transactions, and social media interactions. Alongside, we introduce a key attribute in the dataset, namely Rarity score, which measures the uniqueness of each parcel within the virtual world. Addressing the significant challenge posed by the dispersed nature of this data across various sources, we employ a systematic approach, utilizing both available APIs and custom scripts, to gather it. Subsequently, we meticulously curate and organize the information into four distinct fragments: (1) Characteristics, (2) OpenSea Trading History, (3) Ethereum Activity Transactions, and (4) Social Media. We envisage that this dataset would serve as a robust resource for training machine- and deep-learning models specifically designed to address real-world challenges within the domain of Decentraland parcels. The performance benchmarking of more than 20 state-of-the-art price prediction models on our dataset yields promising results, achieving a maximum R2 score of 0.8251 and an accuracy of 74.23% in case of Extra Trees Regressor and Classifier. The key findings reveal that the ensemble models perform better than both deep learning and linear models for our dataset. We observe a significant impact of coordinates, geographical proximity, rarity score, and few other economic indicators on the prediction of parcel prices.

replace Redefining the Shortest Path Problem Formulation of the Linear Non-Gaussian Acyclic Model: Pairwise Likelihood Ratios, Prior Knowledge, and Path Enumeration

Authors: Hans Jarett J. Ong, Brian Godwin S. Lim, Renzo Roel P. Tan, Kazushi Ikeda

Abstract: Effective causal discovery is essential for learning the causal graph from observational data. The linear non-Gaussian acyclic model (LiNGAM) operates under the assumption of a linear data generating process with non-Gaussian noise in determining the causal graph. Its assumption of unmeasured confounders being absent, however, poses practical limitations. In response, empirical research has shown that the reformulation of LiNGAM as a shortest path problem (LiNGAM-SPP) addresses this limitation. Within LiNGAM-SPP, mutual information is chosen to serve as the measure of independence. A challenge is introduced - parameter tuning is now needed due to its reliance on kNN mutual information estimators. The paper proposes a threefold enhancement to the LiNGAM-SPP framework. First, the need for parameter tuning is eliminated by using the pairwise likelihood ratio in lieu of kNN-based mutual information. This substitution is validated on a general data generating process and benchmark real-world data sets, outperforming existing methods especially when given a larger set of features. The incorporation of prior knowledge is then enabled by a node-skipping strategy implemented on the graph representation of all causal orderings to eliminate violations based on the provided input of relative orderings. Flexibility relative to existing approaches is achieved. Last among the three enhancements is the utilization of the distribution of paths in the graph representation of all causal orderings. From this, crucial properties of the true causal graph such as the presence of unmeasured confounders and sparsity may be inferred. To some extent, the expected performance of the causal discovery algorithm may be predicted. The refinements above advance the practicality and performance of LiNGAM-SPP, showcasing the potential of graph-search-based methodologies in advancing causal discovery.

replace Swarm Learning: A Survey of Concepts, Applications, and Trends

Authors: Elham Shammar, Xiaohui Cui, Mohammed A. A. Al-qaness

Abstract: Deep learning models have raised privacy and security concerns due to their reliance on large datasets on central servers. As the number of Internet of Things (IoT) devices increases, artificial intelligence (AI) will be crucial for resource management, data processing, and knowledge acquisition. To address those issues, federated learning (FL) has introduced a novel approach to building a versatile, large-scale machine learning framework that operates in a decentralized and hardware-agnostic manner. However, FL faces network bandwidth limitations and data breaches. To reduce the central dependency in FL and increase scalability, swarm learning (SL) has been proposed in collaboration with Hewlett Packard Enterprise (HPE). SL represents a decentralized machine learning framework that leverages blockchain technology for secure, scalable, and private data management. A blockchain-based network enables the exchange and aggregation of model parameters among participants, thus mitigating the risk of a single point of failure and eliminating communication bottlenecks. To the best of our knowledge, this survey is the first to introduce the principles of Swarm Learning, its architectural design, and its fields of application. In addition, it highlights numerous research avenues that require further exploration by academic and industry communities to unlock the full potential and applications of SL.

replace Uncertainty for Active Learning on Graphs

Authors: Dominik Fuchsgruber, Tom Wollschl\"ager, Bertrand Charpentier, Antonio Oroz, Stephan G\"unnemann

Abstract: Uncertainty Sampling is an Active Learning strategy that aims to improve the data efficiency of machine learning models by iteratively acquiring labels of data points with the highest uncertainty. While it has proven effective for independent data its applicability to graphs remains under-explored. We propose the first extensive study of Uncertainty Sampling for node classification: (1) We benchmark Uncertainty Sampling beyond predictive uncertainty and highlight a significant performance gap to other Active Learning strategies. (2) We develop ground-truth Bayesian uncertainty estimates in terms of the data generating process and prove their effectiveness in guiding Uncertainty Sampling toward optimal queries. We confirm our results on synthetic data and design an approximate approach that consistently outperforms other uncertainty estimators on real datasets. (3) Based on this analysis, we relate pitfalls in modeling uncertainty to existing methods. Our analysis enables and informs the development of principled uncertainty estimation on graphs.

replace Not All Language Model Features Are One-Dimensionally Linear

Authors: Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, Max Tegmark

Abstract: Recent work has proposed that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Next, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we examine the continuity of the days of the week feature in Mistral 7B. Overall, our work argues that understanding multi-dimensional features is necessary to mechanistically decompose some model behaviors.

replace A Counterfactual Analysis of the Dishonest Casino

Authors: Martin Haugh, Raghav Singal

Abstract: The dishonest casino is a well-known hidden Markov model (HMM) used in educational settings to introduce HMMs and graphical models. Here, a sequence of die rolls is observed, with the casino switching between a fair and a loaded die. Typically, the goal is to use the observed rolls to infer the pattern of fair and loaded dice, leading to filtering, smoothing, and Viterbi algorithms. This paper, however, explores how much of the winnings is attributable to the casino's cheating, a counterfactual question beyond the scope of HMM primitives. To address this, we introduce a structural causal model (SCM) consistent with the HMM and show the expected winnings attributable to cheating (EWAC) (which is only partially identifiable) can be bounded using linear programs (LPs). Through numerical experiments, we compute these bounds and develop intuition using benchmark SCMs based on independence, comonotonic, and counter-monotonic copulas. We show that tighter bounds are obtained with a time-homogeneity condition on the SCM, while looser bounds allow for an almost explicit LP solution. Domain-specific knowledge such as pathwise monotonicity or counterfactual stability can be incorporated via linear constraints. We also show the time-average EWAC is fully identifiable in the limit as the number of time periods goes to infinity. Our work contributes to bounding counterfactuals in causal inference and is the first to develop LP bounds in a dynamic HMM setting, benefiting educational contexts where counterfactual inference is taught.

replace Spectral-Refiner: Accurate Fine-Tuning of Spatiotemporal Fourier Neural Operator for Turbulent Flows

Authors: Shuhao Cao, Francesco Brarda, Ruipeng Li, Yuanzhe Xi

Abstract: Recent advancements in operator-type neural networks have shown promising results in approximating the solutions of spatiotemporal Partial Differential Equations (PDEs). However, these neural networks often entail considerable training expenses, and may not always achieve the desired accuracy required in many scientific and engineering disciplines. In this paper, we propose a new learning framework to address these issues. A new spatiotemporal adaptation is proposed to generalize any Fourier Neural Operator (FNO) variant to learn maps between Bochner spaces, which can perform an arbitrary-length temporal super-resolution for the first time. To better exploit this capacity, a new paradigm is proposed to refine the commonly adopted end-to-end neural operator training and evaluations with the help from the wisdom from traditional numerical PDE theory and techniques. Specifically, in the learning problems for the turbulent flow modeled by the Navier-Stokes Equations (NSE), the proposed paradigm trains an FNO only for a few epochs. Then, only the newly proposed spatiotemporal spectral convolution layer is fine-tuned without the frequency truncation. The spectral fine-tuning loss function uses a negative Sobolev norm for the first time in operator learning, defined through a reliable functional-type a posteriori error estimator whose evaluation is exact thanks to the Parseval identity. Moreover, unlike the difficult nonconvex optimization problems in the end-to-end training, this fine-tuning loss is convex. Numerical experiments on commonly used NSE benchmarks demonstrate significant improvements in both computational efficiency and accuracy, compared to end-to-end evaluation and traditional numerical PDE solvers under certain conditions. The source code is publicly available at https://github.com/scaomath/torch-cfd.

URLs: https://github.com/scaomath/torch-cfd.

replace Deep Modeling of Non-Gaussian Aleatoric Uncertainty

Authors: Aastha Acharya, Caleb Lee, Marissa D'Alonzo, Jared Shamwell, Nisar R. Ahmed, Rebecca Russell

Abstract: Deep learning offers promising new ways to accurately model aleatoric uncertainty in robotic state estimation systems, particularly when the uncertainty distributions do not conform to traditional assumptions of being fixed and Gaussian. In this study, we formulate and evaluate three fundamental deep learning approaches for conditional probability density modeling to quantify non-Gaussian aleatoric uncertainty: parametric, discretized, and generative modeling. We systematically compare the respective strengths and weaknesses of these three methods on simulated non-Gaussian densities as well as on real-world terrain-relative navigation data. Our results show that these deep learning methods can accurately capture complex uncertainty patterns, highlighting their potential for improving the reliability and robustness of estimation systems.

replace Energy-based Epistemic Uncertainty for Graph Neural Networks

Authors: Dominik Fuchsgruber, Tom Wollschl\"ager, Stephan G\"unnemann

Abstract: In domains with interdependent data, such as graphs, quantifying the epistemic uncertainty of a Graph Neural Network (GNN) is challenging as uncertainty can arise at different structural scales. Existing techniques neglect this issue or only distinguish between structure-aware and structure-agnostic uncertainty without combining them into a single measure. We propose GEBM, an energy-based model (EBM) that provides high-quality uncertainty estimates by aggregating energy at different structural levels that naturally arise from graph diffusion. In contrast to logit-based EBMs, we provably induce an integrable density in the data space by regularizing the energy function. We introduce an evidential interpretation of our EBM that significantly improves the predictive robustness of the GNN. Our framework is a simple and effective post hoc method applicable to any pre-trained GNN that is sensitive to various distribution shifts. It consistently achieves the best separation of in-distribution and out-of-distribution data on 6 out of 7 anomaly types while having the best average rank over shifts on \emph{all} datasets.

replace SigDiffusions: Score-Based Diffusion Models for Time Series via Log-Signature Embeddings

Authors: Barbora Barancikova, Zhuoyue Huang, Cristopher Salvi

Abstract: Score-based diffusion models have recently emerged as state-of-the-art generative models for a variety of data modalities. Nonetheless, it remains unclear how to adapt these models to generate long multivariate time series. Viewing a time series as the discretisation of an underlying continuous process, we introduce SigDiffusion, a novel diffusion model operating on log-signature embeddings of the data. The forward and backward processes gradually perturb and denoise log-signatures while preserving their algebraic structure. To recover a signal from its log-signature, we provide new closed-form inversion formulae expressing the coefficients obtained by expanding the signal in a given basis (e.g. Fourier or orthogonal polynomials) as explicit polynomial functions of the log-signature. Finally, we show that combining SigDiffusions with these inversion formulae results in high-quality long time series generation, competitive with the current state-of-the-art on various datasets of synthetic and real-world examples.

replace BoA: Attention-aware Post-training Quantization without Backpropagation

Authors: Junhan Kim, Ho-young Kim, Eulrang Cho, Chungman Lee, Joonyoung Kim, Yongkweon Jeon

Abstract: Post-training quantization (PTQ) is a promising solution for deploying large language models (LLMs) on resource-constrained devices. Early methods developed for smaller networks like ResNet rely on gradient-based optimization, which becomes impractical for hyper-scale LLMs with billions of parameters. While recently proposed backpropagation-free or transformation-based methods alleviate this issue, their performance remains limited by either a lack of inter-layer dependency consideration or the use of naive nearest-rounding-based integer weight assignment to save the heavy computational cost of weight optimization. We thus introduce a novel backpropagation-free PTQ algorithm that optimizes integer weights by considering inter-layer dependencies. The key innovation is the development of attention-aware Hessian matrices that capture inter-layer interactions within the attention module. Extensive experiments demonstrate that our approach not only outperforms existing weight quantization methods but also shows good synergy with conventional methods to suppress activation outliers, leading to state-of-the-art weight-activation quantization performance.

replace Soft-QMIX: Integrating Maximum Entropy For Monotonic Value Function Factorization

Authors: Wentse Chen, Shiyu Huang, Jeff Schneider

Abstract: Multi-agent reinforcement learning (MARL) tasks often utilize a centralized training with decentralized execution (CTDE) framework. QMIX is a successful CTDE method that learns a credit assignment function to derive local value functions from a global value function, defining a deterministic local policy. However, QMIX is hindered by its poor exploration strategy. While maximum entropy reinforcement learning (RL) promotes better exploration through stochastic policies, QMIX's process of credit assignment conflicts with the maximum entropy objective and the decentralized execution requirement, making it unsuitable for maximum entropy RL. In this paper, we propose an enhancement to QMIX by incorporating an additional local Q-value learning method within the maximum entropy RL framework. Our approach constrains the local Q-value estimates to maintain the correct ordering of all actions. Due to the monotonicity of the QMIX value function, these updates ensure that locally optimal actions align with globally optimal actions. We theoretically prove the monotonic improvement and convergence of our method to an optimal solution. Experimentally, we validate our algorithm in matrix games, Multi-Agent Particle Environment and demonstrate state-of-the-art performance in SMAC-v2.

replace Cubic regularized subspace Newton for non-convex optimization

Authors: Jim Zhao, Aurelien Lucchi, Nikita Doikov

Abstract: This paper addresses the optimization problem of minimizing non-convex continuous functions, which is relevant in the context of high-dimensional machine learning applications characterized by over-parametrization. We analyze a randomized coordinate second-order method named SSCN which can be interpreted as applying cubic regularization in random subspaces. This approach effectively reduces the computational complexity associated with utilizing second-order information, rendering it applicable in higher-dimensional scenarios. Theoretically, we establish convergence guarantees for non-convex functions, with interpolating rates for arbitrary subspace sizes and allowing inexact curvature estimation. When increasing subspace size, our complexity matches $\mathcal{O}(\epsilon^{-3/2})$ of the cubic regularization (CR) rate. Additionally, we propose an adaptive sampling scheme ensuring exact convergence rate of $\mathcal{O}(\epsilon^{-3/2}, \epsilon^{-3})$ to a second-order stationary point, even without sampling all coordinates. Experimental results demonstrate substantial speed-ups achieved by SSCN compared to conventional first-order methods.

replace Random Latent Exploration for Deep Reinforcement Learning

Authors: Srinath Mahankali, Zhang-Wei Hong, Ayush Sekhari, Alexander Rakhlin, Pulkit Agrawal

Abstract: We introduce Random Latent Exploration (RLE), a simple yet effective exploration strategy in reinforcement learning (RL). On average, RLE outperforms noise-based methods, which perturb the agent's actions, and bonus-based exploration, which rewards the agent for attempting novel behaviors. The core idea of RLE is to encourage the agent to explore different parts of the environment by pursuing randomly sampled goals in a latent space. RLE is as simple as noise-based methods, as it avoids complex bonus calculations but retains the deep exploration benefits of bonus-based methods. Our experiments show that RLE improves performance on average in both discrete (e.g., Atari) and continuous control tasks (e.g., Isaac Gym), enhancing exploration while remaining a simple and general plug-in for existing RL algorithms. Project website and code: https://srinathm1359.github.io/random-latent-exploration

URLs: https://srinathm1359.github.io/random-latent-exploration

replace Algebraic Adversarial Attacks on Integrated Gradients

Authors: Lachlan Simpson, Federico Costanza, Kyle Millar, Adriel Cheng, Cheng-Chew Lim, Hong Gunn Chew

Abstract: Adversarial attacks on explainability models have drastic consequences when explanations are used to understand the reasoning of neural networks in safety critical systems. Path methods are one such class of attribution methods susceptible to adversarial attacks. Adversarial learning is typically phrased as a constrained optimisation problem. In this work, we propose algebraic adversarial examples and study the conditions under which one can generate adversarial examples for integrated gradients. Algebraic adversarial examples provide a mathematically tractable approach to adversarial examples.

replace Back-Projection Diffusion: Solving the Wideband Inverse Scattering Problem with Diffusion Models

Authors: Borong Zhang, Mart\'in Guerra, Qin Li, Leonardo Zepeda-N\'u\~nez

Abstract: We present Wideband Back-Projection Diffusion, an end-to-end probabilistic framework for approximating the posterior distribution of the refractive index using the wideband scattering data through the inverse scattering map. This framework produces highly accurate reconstructions, leveraging conditional diffusion models to draw samples, and also honors the symmetries of the underlying physics of wave-propagation. The procedure is factored into two steps, with the first, inspired by the filtered back-propagation formula, transforms data into a physics-based latent representation, while the second learns a conditional score function conditioned on this latent representation. These two steps individually obey their associated symmetries and are amenable to compression by imposing the rank structure found in the filtered back-projection formula. Empirically, our framework has both low sample and computational complexity, with its number of parameters scaling only sub-linearly with the target resolution, and has stable training dynamics. It provides sharp reconstructions effortlessly and is capable of recovering even sub-Nyquist features in the multiple-scattering regime.

replace LiD-FL: Towards List-Decodable Federated Learning

Authors: Hong Liu, Liren Shan, Han Bao, Ronghui You, Yuhao Yi, Jiancheng Lv

Abstract: Federated learning is often used in environments with many unverified participants. Therefore, federated learning under adversarial attacks receives significant attention. This paper proposes an algorithmic framework for list-decodable federated learning, where a central server maintains a list of models, with at least one guaranteed to perform well. The framework has no strict restriction on the fraction of honest workers, extending the applicability of Byzantine federated learning to the scenario with more than half adversaries. Under proper assumptions on the loss function, we prove a convergence theorem for our method. Experimental results, including image classification tasks with both convex and non-convex losses, demonstrate that the proposed algorithm can withstand the malicious majority under various attacks.

replace Remove Symmetries to Control Model Expressivity and Improve Optimization

Authors: Liu Ziyin, Yizhou Xu, Isaac Chuang

Abstract: When symmetry is present in the loss function, the model is likely to be trapped in a low-capacity state that is sometimes known as a "collapse". Being trapped in these low-capacity states can be a major obstacle to training across many scenarios where deep learning technology is applied. We first prove two concrete mechanisms through which symmetries lead to reduced capacities and ignored features during training and inference. We then propose a simple and theoretically justified algorithm, syre, to remove almost all symmetry-induced low-capacity states in neural networks. When this type of entrapment is especially a concern, removing symmetries with the proposed method is shown to correlate well with improved optimization or performance. A remarkable merit of the proposed method is that it is model-agnostic and does not require any knowledge of the symmetry.

replace Continual learning with the neural tangent ensemble

Authors: Ari S. Benjamin, Christian Pehle, Kyle Daruwalla

Abstract: A natural strategy for continual learning is to weigh a Bayesian ensemble of fixed functions. This suggests that if a (single) neural network could be interpreted as an ensemble, one could design effective algorithms that learn without forgetting. To realize this possibility, we observe that a neural network classifier with N parameters can be interpreted as a weighted ensemble of N classifiers, and that in the lazy regime limit these classifiers are fixed throughout learning. We call these classifiers the neural tangent experts and show they output valid probability distributions over the labels. We then derive the likelihood and posterior probability of each expert given past data. Surprisingly, the posterior updates for these experts are equivalent to a scaled and projected form of stochastic gradient descent (SGD) over the network weights. Away from the lazy regime, networks can be seen as ensembles of adaptive experts which improve over time. These results offer a new interpretation of neural networks as Bayesian ensembles of experts, providing a principled framework for understanding and mitigating catastrophic forgetting in continual learning settings.

replace Neural CRNs: A Natural Implementation of Learning in Chemical Reaction Networks

Authors: Rajiv Teja Nagipogu, John H. Reif

Abstract: We present Neural CRNs, an efficient, autonomous, and general-purpose implementation of learning within mass action chemical reaction systems. Unlike prior works, which transliterate discrete neural networks into chemical systems, Neural CRNs are a purely analog chemical system, which encodes neural computations in the concentration dynamics of its chemical species. Consequently, the chemical reactions in this system stay true to their nature, behaving as atomic end-to-end computational units, resulting in concise and efficient reaction network implementations. We demonstrate this efficiency by assembling a highly streamlined supervised learning procedure that requires only two clock phases. We further validate the robustness of our framework by constructing Neural CRN circuits for several linear and nonlinear regression and classification tasks. Furthermore, a minimal linear regression circuit is assembled using only 13 reactions and 15 species. Our nonlinear modeling circuits significantly advance the state-of-the-art through compact and simple implementations. The synergistic nature of our framework with the analog chemical computing hardware leaves ample room for optimizations and approximations in the computational model, several of which are discussed in this work. Our work introduces a novel paradigm for chemical computing and learning, providing a foundational platform for future adaptive biochemical circuits with applications in fields such as synthetic biology, bioengineering, and adaptive biomedicine.

replace Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Authors: Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, Ming Jin

Abstract: Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale data Time-300B, which spans over 9 domains and encompassing over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.

replace On the Optimal Memorization Capacity of Transformers

Authors: Tokio Kajitsuka, Issei Sato

Abstract: Recent research in the field of machine learning has increasingly focused on the memorization capacity of Transformers, but how efficient they are is not yet well understood. We demonstrate that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$, which is proved to be optimal up to logarithmic factors. This indicates that Transformers can efficiently perform memorization with little influence from the input length $n$ owing to the benefit of parameter sharing. We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary at least for Transformers with hardmax. These results suggest that while self-attention mechanisms can efficiently identify input sequences, the feed-forward network becomes a bottleneck when associating a label to each token.

replace PhyMPGN: Physics-encoded Message Passing Graph Network for spatiotemporal PDE systems

Authors: Bocheng Zeng, Qi Wang, Mengtao Yan, Yang Liu, Ruizhi Chengze, Yi Zhang, Hongsheng Liu, Zidong Wang, Hao Sun

Abstract: Solving partial differential equations (PDEs) serves as a cornerstone for modeling complex dynamical systems. Recent progresses have demonstrated grand benefits of data-driven neural-based models for predicting spatiotemporal dynamics (e.g., tremendous speedup gain compared with classical numerical methods). However, most existing neural models rely on rich training data, have limited extrapolation and generalization abilities, and suffer to produce precise or reliable physical prediction under intricate conditions (e.g., irregular mesh or geometry, complex boundary conditions, diverse PDE parameters, etc.). To this end, we propose a new graph learning approach, namely, Physics-encoded Message Passing Graph Network (PhyMPGN), to model spatiotemporal PDE systems on irregular meshes given small training datasets. Specifically, we incorporate a GNN into a numerical integrator to approximate the temporal marching of spatiotemporal dynamics for a given PDE system. Considering that many physical phenomena are governed by diffusion processes, we further design a learnable Laplace block, which encodes the discrete Laplace-Beltrami operator, to aid and guide the GNN learning in a physically feasible solution space. A boundary condition padding strategy is also designed to improve the model convergence and accuracy. Extensive experiments demonstrate that PhyMPGN is capable of accurately predicting various types of spatiotemporal dynamics on coarse unstructured meshes, consistently achieves the state-of-the-art results, and outperforms other baselines with considerable gains.

replace Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL

Authors: Ghada Sokar, Johan Obando-Ceron, Aaron Courville, Hugo Larochelle, Pablo Samuel Castro

Abstract: The use of deep neural networks in reinforcement learning (RL) often suffers from performance degradation as model size increases. While soft mixtures of experts (SoftMoEs) have recently shown promise in mitigating this issue for online RL, the reasons behind their effectiveness remain largely unknown. In this work we provide an in-depth analysis identifying the key factors driving this performance gain. We discover the surprising result that tokenizing the encoder output, rather than the use of multiple experts, is what is behind the efficacy of SoftMoEs. Indeed, we demonstrate that even with an appropriately scaled single expert, we are able to maintain the performance gains, largely thanks to tokenization.

replace A Prospect-Theoretic Policy Gradient Algorithm for Behavioral Alignment in Reinforcement Learning

Authors: Olivier Lepel, Anas Barakat

Abstract: Classical reinforcement learning (RL) typically assumes rational decision-making based on expected utility theory. However, this model has been shown to be empirically inconsistent with actual human preferences, as evidenced in psychology and behavioral economics. Cumulative Prospect Theory (CPT) provides a more nuanced model for human-based decision-making, capturing diverse attitudes and perceptions toward risk, gains, and losses. While prior work has integrated CPT with RL to solve a CPT policy optimization problem, the understanding and practical impact of this formulation remain limited. We revisit the CPT-RL framework, offering new theoretical insights into the nature of optimal policies. We further derive a novel policy gradient theorem for CPT objectives, generalizing the foundational result in standard RL. Building on this theorem, we design a model-free policy gradient algorithm for solving the CPT-RL problem and demonstrate its performance through simulations. Notably, our algorithm scales better to larger state spaces compared to existing zeroth-order methods. This work advances the integration of behavioral decision-making into RL.

replace Formation of Representations in Neural Networks

Authors: Liu Ziyin, Isaac Chuang, Tomer Galanti, Tomaso Poggio

Abstract: Understanding neural representations will help open the black box of neural networks and advance our scientific understanding of modern AI systems. However, how complex, structured, and transferable representations emerge in modern neural networks has remained a mystery. Building on previous results, we propose the Canonical Representation Hypothesis (CRH), which posits a set of six alignment relations to universally govern the formation of representations in most hidden layers of a neural network. Under the CRH, the latent representations (R), weights (W), and neuron gradients (G) become mutually aligned during training. This alignment implies that neural networks naturally learn compact representations, where neurons and weights are invariant to task-irrelevant transformations. We then show that the breaking of CRH leads to the emergence of reciprocal power-law relations between R, W, and G, which we refer to as the Polynomial Alignment Hypothesis (PAH). We present a minimal-assumption theory proving that the balance between gradient noise and regularization is crucial for the emergence of the canonical representation. The CRH and PAH lead to an exciting possibility of unifying major key deep learning phenomena, including neural collapse and the neural feature ansatz, in a single framework.

replace Optimization Proxies using Limited Labeled Data and Training Time -- A Semi-Supervised Bayesian Neural Network Approach

Authors: Parikshit Pareek, Abhijith Jayakumar, Kaarthik Sundar, Deepjyoti Deka, Sidhant Misra

Abstract: Constrained optimization problems arise in various engineering systems such as inventory management and power grids. Standard deep neural network (DNN) based machine learning proxies are ineffective in practical settings where labeled data is scarce and training times are limited. We propose a semi-supervised Bayesian Neural Networks (BNNs) based optimization proxy for this complex regime, wherein training commences in a sandwiched fashion, alternating between a supervised learning step for minimizing cost, and an unsupervised learning step for enforcing constraint feasibility. We show that the proposed semi-supervised BNN outperforms DNN architectures on important non-convex constrained optimization problems from energy network operations, achieving up to a tenfold reduction in expected maximum equality gap and halving the inequality gaps. Further, the BNN's ability to provide posterior samples is leveraged to construct practically meaningful probabilistic confidence bounds on performance using a limited validation data, unlike prior methods.

replace Towards Scalable General Utility Reinforcement Learning: Occupancy Approximation, Sample Complexity and Global Optimality

Authors: Anas Barakat, Souradip Chakraborty, Peihong Yu, Pratap Tokekar, Amrit Singh Bedi

Abstract: Reinforcement learning with general utilities has recently gained attention thanks to its ability to unify several problems, including imitation learning, pure exploration, and safe reinforcement learning. However, prior work for solving this general problem in a unified way has only focused on the tabular setting. This is restrictive when considering larger state-action spaces because of the need to estimate occupancy measures during policy optimization. In this work, we address this issue and propose to approximate occupancy measures within a function approximation class using maximum likelihood estimation (MLE). We propose a simple policy gradient algorithm where an actor updates the policy parameters to maximize the general utility objective whereas a critic approximates the occupancy measure using MLE. We provide a statistical complexity analysis showing that our occupancy measure estimation error only scales with the dimension of our function approximation class rather than the size of the state action space. Under suitable assumptions, we establish first order stationarity and global optimality performance bounds for the proposed algorithm for nonconcave and concave general utilities respectively. We complement our methodological and theoretical findings with promising empirical results showing the scalability potential of our approach compared to existing tabular count-based approaches.

replace When Graph meets Multimodal: Benchmarking and Meditating on Multimodal Attributed Graphs Learning

Authors: Hao Yan, Chaozhuo Li, Jun Yin, Zhigang Yu, Weihao Han, Mingzheng Li, Zhengxin Zeng, Hao Sun, Senzhang Wang

Abstract: Multimodal Attributed Graphs (MAGs) are ubiquitous in real-world applications, encompassing extensive knowledge through multimodal attributes attached to nodes (e.g., texts and images) and topological structure representing node interactions. Despite its potential to advance diverse research fields like social networks and e-commerce, MAG representation learning (MAGRL) remains underexplored due to the lack of standardized datasets and evaluation frameworks. In this paper, we first propose MAGB, a comprehensive MAG benchmark dataset, featuring curated graphs from various domains with both textual and visual attributes. Based on MAGB dataset, we further systematically evaluate two mainstream MAGRL paradigms: $\textit{GNN-as-Predictor}$, which integrates multimodal attributes via Graph Neural Networks (GNNs), and $\textit{VLM-as-Predictor}$, which harnesses Vision Language Models (VLMs) for zero-shot reasoning. Extensive experiments on MAGB reveal following critical insights: $\textit{(i)}$ Modality significances fluctuate drastically with specific domain characteristics. $\textit{(ii)}$ Multimodal embeddings can elevate the performance ceiling of GNNs. However, intrinsic biases among modalities may impede effective training, particularly in low-data scenarios. $\textit{(iii)}$ VLMs are highly effective at generating multimodal embeddings that alleviate the imbalance between textual and visual attributes. These discoveries, which illuminate the synergy between multimodal attributes and graph topologies, contribute to reliable benchmarks, paving the way for future MAG research. The MAGB dataset and evaluation pipeline are publicly available at https://github.com/sktsherlock/MAGB.

URLs: https://github.com/sktsherlock/MAGB.

replace Asymptotic Analysis of Sample-averaged Q-learning

Authors: Saunak Kumar Panda, Ruiqi Liu, Yisha Xiang

Abstract: Reinforcement learning (RL) has emerged as a key approach for training agents in complex and uncertain environments. Incorporating statistical inference in RL algorithms is essential for understanding and managing uncertainty in model performance. This paper introduces a generalized framework for time-varying batch-averaged Q-learning, termed sample-averaged Q-learning (SA-QL), which extends traditional single-sample Q-learning by aggregating samples of rewards and next states to better account for data variability and uncertainty. We leverage the functional central limit theorem (FCLT) to establish a novel framework that provides insights into the asymptotic normality of the sample-averaged algorithm under mild conditions. Additionally, we develop a random scaling method for interval estimation, enabling the construction of confidence intervals without requiring extra hyperparameters. Extensive numerical experiments across classic stochastic OpenAI Gym environments, including windy gridworld and slippery frozenlake, demonstrate how different batch scheduling strategies affect learning efficiency, coverage rates, and confidence interval widths. This work establishes a unified theoretical foundation for sample-averaged Q-learning, providing insights into effective batch scheduling and statistical inference for RL algorithms.

replace Progressive Mixed-Precision Decoding for Efficient LLM Inference

Authors: Hao Mark Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos I. Venieris

Abstract: In spite of the great potential of large language models (LLMs) across various tasks, their deployment on resource-constrained devices remains challenging due to their excessive computational and memory demands. Quantization has emerged as an effective solution by storing weights in reduced precision. However, utilizing low precisions (i.e.~2/3-bit) to substantially alleviate the memory-boundedness of LLM decoding, still suffers from prohibitive performance drop. In this work, we argue that existing approaches fail to explore the diversity in computational patterns, redundancy, and sensitivity to approximations of the different phases of LLM inference, resorting to a uniform quantization policy throughout. Instead, we propose a novel phase-aware method that selectively allocates precision during different phases of LLM inference, achieving both strong context extraction during prefill and efficient memory bandwidth utilization during decoding. To further address the memory-boundedness of the decoding phase, we introduce Progressive Mixed-Precision Decoding (PMPD), a technique that enables the gradual lowering of precision deeper in the generated sequence, together with a spectrum of precision-switching schedulers that dynamically drive the precision-lowering decisions in either task-adaptive or prompt-adaptive manner. Extensive evaluation across diverse language tasks shows that when targeting Nvidia GPUs, PMPD achieves 1.4$-$12.2$\times$ speedup in matrix-vector multiplications over fp16 models, while when targeting an LLM-optimized NPU, our approach delivers a throughput gain of 3.8$-$8.0$\times$ over fp16 models and up to 1.54$\times$ over uniform quantization approaches while preserving the output quality.

replace Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Authors: Xiyue Peng, Hengquan Guo, Jiawei Zhang, Dongqing Zou, Ziyu Shao, Honghao Wei, Xin Liu

Abstract: Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives, training separate preference models for helpfulness and safety, while framing safety as a constraint within a constrained Markov Decision Process (CMDP) framework. This paper identifies a potential issue when using the widely adopted expected safety constraints for LLM safety alignment, termed "safety compensation", where the constraints are satisfied on expectation, but individual prompts may trade off safety, resulting in some responses being overly restrictive while others remain unsafe. To address this issue, we propose Rectified Policy Optimization (RePO), which replaces the expected safety constraint with critical safety constraints imposed on every prompt. At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt, thereby enhancing safety across nearly all prompts. Our experiments demonstrate that RePO outperforms strong baseline methods and significantly enhances LLM safety alignment.

replace Generator Matching: Generative modeling with arbitrary Markov processes

Authors: Peter Holderrieth, Marton Havasi, Jason Yim, Neta Shaul, Itai Gat, Tommi Jaakkola, Brian Karrer, Ricky T. Q. Chen, Yaron Lipman

Abstract: We introduce Generator Matching, a modality-agnostic framework for generative modeling using arbitrary Markov processes. Generators characterize the infinitesimal evolution of a Markov process, which we leverage for generative modeling in a similar vein to flow matching: we construct conditional generators which generate single data points, then learn to approximate the marginal generator which generates the full data distribution. We show that Generator Matching unifies various generative modeling methods, including diffusion models, flow matching and discrete diffusion models. Furthermore, it expands the design space to new and unexplored Markov processes such as jump processes. Finally, Generator Matching enables the construction of superpositions of Markov generative models and enables the construction of multimodal models in a rigorous manner. We empirically validate our method on image and multimodal generation, e.g. showing that superposition with a jump process improves performance.

replace Theoretical characterisation of the Gauss-Newton conditioning in Neural Networks

Authors: Jim Zhao, Sidak Pal Singh, Aurelien Lucchi

Abstract: The Gauss-Newton (GN) matrix plays an important role in machine learning, most evident in its use as a preconditioning matrix for a wide family of popular adaptive methods to speed up optimization. Besides, it can also provide key insights into the optimization landscape of neural networks. In the context of deep neural networks, understanding the GN matrix involves studying the interaction between different weight matrices as well as the dependencies introduced by the data, thus rendering its analysis challenging. In this work, we take a first step towards theoretically characterizing the conditioning of the GN matrix in neural networks. We establish tight bounds on the condition number of the GN in deep linear networks of arbitrary depth and width, which we also extend to two-layer ReLU networks. We expand the analysis to further architectural components, such as residual connections and convolutional layers. Finally, we empirically validate the bounds and uncover valuable insights into the influence of the analyzed architectural components.

replace NeuralDEM -- Real-time Simulation of Industrial Particulate Flows

Authors: Benedikt Alkin, Tobias Kronlachner, Samuele Papa, Stefan Pirker, Thomas Lichtenegger, Johannes Brandstetter

Abstract: Advancements in computing power have made it possible to numerically simulate large-scale fluid-mechanical and/or particulate systems, many of which are integral to core industrial processes. Among the different numerical methods available, the discrete element method (DEM) provides one of the most accurate representations of a wide range of physical systems involving granular and discontinuous materials. Consequently, DEM has become a widely accepted approach for tackling engineering problems connected to granular flows and powder mechanics. Additionally, DEM can be integrated with grid-based computational fluid dynamics (CFD) methods, enabling the simulation of chemical processes taking place, e.g., in fluidized beds. However, DEM is computationally intensive because of the intrinsic multiscale nature of particulate systems, restricting simulation duration or number of particles. Towards this end, NeuralDEM presents an end-to-end approach to replace slow numerical DEM routines with fast, adaptable deep learning surrogates. NeuralDEM is capable of picturing long-term transport processes across different regimes using macroscopic observables without any reference to microscopic model parameters. First, NeuralDEM treats the Lagrangian discretization of DEM as an underlying continuous field, while simultaneously modeling macroscopic behavior directly as additional auxiliary fields. Second, NeuralDEM introduces multi-branch neural operators scalable to real-time modeling of industrially-sized scenarios - from slow and pseudo-steady to fast and transient. Such scenarios have previously posed insurmountable challenges for deep learning models. Notably, NeuralDEM faithfully models coupled CFD-DEM fluidized bed reactors of 160k CFD cells and 500k DEM particles for trajectories of 28s. NeuralDEM will open many new doors to advanced engineering and much faster process cycles.

replace Machine learning for cerebral blood vessels' malformations

Authors: Irem Topal, Alexander Cherevko, Yuri Bugay, Maxim Shishlenin, Jean Barbier, Deniz Eroglu, \'Edgar Rold\'an, Roman Belousov

Abstract: Cerebral aneurysms and arteriovenous malformations are life-threatening hemodynamic pathologies of the brain. While surgical intervention is often essential to prevent fatal outcomes, it carries significant risks both during the procedure and in the postoperative period, making the management of these conditions highly challenging. Parameters of cerebral blood flow, routinely monitored during medical interventions or with modern noninvasive high-resolution imaging methods, could potentially be utilized in machine learning-assisted protocols for risk assessment and therapeutic prognosis. To this end, we developed a linear oscillatory model of blood velocity and pressure for clinical data acquired from neurosurgical operations. Using the method of Sparse Identification of Nonlinear Dynamics (SINDy), the parameters of our model can be reconstructed online within milliseconds from a short time series of the hemodynamic variables. The identified parameter values enable automated classification of the blood-flow pathologies by means of logistic regression, achieving an accuracy of 73 \%}. Our results demonstrate the potential of this model for both diagnostic and prognostic applications, providing a robust and interpretable framework for assessing cerebral blood vessel conditions.

replace On the Performance Analysis of Momentum Method: A Frequency Domain Perspective

Authors: Xianliang Li, Jun Luo, Zhiwei Zheng, Hanxiao Wang, Li Luo, Lingkun Wen, Linlong Wu, Sheng Xu

Abstract: Momentum-based optimizers are widely adopted for training neural networks. However, the optimal selection of momentum coefficients remains elusive. This uncertainty impedes a clear understanding of the role of momentum in stochastic gradient methods. In this paper, we present a frequency domain analysis framework that interprets the momentum method as a time-variant filter for gradients, where adjustments to momentum coefficients modify the filter characteristics. Our experiments support this perspective and provide a deeper understanding of the mechanism involved. Moreover, our analysis reveals the following significant findings: high-frequency gradient components are undesired in the late stages of training; preserving the original gradient in the early stages, and gradually amplifying low-frequency gradient components during training both enhance performance. Based on these insights, we propose Frequency Stochastic Gradient Descent with Momentum (FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering characteristic with an empirically effective dynamic magnitude response. Experimental results demonstrate the superiority of FSGDM over conventional momentum optimizers.

replace MALT: Improving Reasoning with Multi-Agent LLM Training

Authors: Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip H. S. Torr, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt

Abstract: Large Language Models (LLMs) often produce answers with a single chain-of-thought, which restricts their ability to explore reasoning paths or self-correct flawed outputs in complex tasks. In this paper, we introduce MALT (Multi-Agent LLM Training), a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps using a sequential pipeline of heterogeneous agents. During data generation, each agent is repeatedly sampled to form a multi-agent search tree, where final outputs are graded against ground-truth data. We then apply value iteration to propagate reward signals back to each role-conditioned model, automatically producing multi-agent post-training data without human or teacher-model supervision. Our off-policy approach allows each agent to specialize by learning from correct and incorrect trajectories, ultimately improving the end-to-end reasoning chain. On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with a relative improvement of 15.66%, 7.42%, and 9.40% respectively, making it an important advance towards multi-agent cooperative training.

replace Learning High-Degree Parities: The Crucial Role of the Initialization

Authors: Emmanuel Abbe, Elisabetta Cornacchia, Jan H\k{a}z{\l}a, Donald Kougang-Yombi

Abstract: Parities have become a standard benchmark for evaluating learning algorithms. Recent works show that regular neural networks trained by gradient descent can efficiently learn degree $k$ parities on uniform inputs for constant $k$, but fail to do so when $k$ and $d-k$ grow with $d$ (here $d$ is the ambient dimension). However, the case where $k=d-O_d(1)$ (almost-full parities), including the degree $d$ parity (the full parity), has remained unsettled. This paper shows that for gradient descent on regular neural networks, learnability depends on the initial weight distribution. On one hand, the discrete Rademacher initialization enables efficient learning of almost-full parities, while on the other hand, its Gaussian perturbation with large enough constant standard deviation $\sigma$ prevents it. The positive result for almost-full parities is shown to hold up to $\sigma=O(d^{-1})$, pointing to questions about a sharper threshold phenomenon. Unlike statistical query (SQ) learning, where a singleton function class like the full parity is trivially learnable, our negative result applies to a fixed function and relies on an initial gradient alignment measure of potential broader relevance to neural networks learning.

replace Enhancing operational wind downscaling capabilities over Canada: Application of a Conditional Wasserstein GAN methodology

Authors: Jorge Guevara, Victor Nascimento, Johannes Schmude, Daniel Salles, Simon Corbeil-L\'etourneau, Madalina Surcel, Dominique Brunet

Abstract: Wind downscaling is essential for improving the spatial resolution of weather forecasts, particularly in operational Numerical Weather Prediction (NWP). This study advances wind downscaling by extending the DownGAN framework introduced by Annau et al.,to operational datasets from the Global Deterministic Prediction System (GDPS) and High-Resolution Deterministic Prediction System (HRDPS), covering the entire Canadian domain. We enhance the model by incorporating high-resolution static covariates, such as HRDPS-derived topography, into a Conditional Wasserstein Generative Adversarial Network with Gradient Penalty, implemented using a UNET-based generator. Following the DownGAN framework, our methodology integrates low-resolution GDPS forecasts (15 km, 10-day horizon) and high-resolution HRDPS forecasts (2.5 km, 48-hour horizon) with Frequency Separation techniques adapted from computer vision. Through robust training and inference over the Canadian region, we demonstrate the operational scalability of our approach, achieving significant improvements in wind downscaling accuracy. Statistical validation highlights reductions in root mean square error (RMSE) and log spectral distance (LSD) metrics compared to the original DownGAN. High-resolution conditioning covariates and Frequency Separation strategies prove instrumental in enhancing model performance. This work underscores the potential for extending high-resolution wind forecasts beyond the 48-hour horizon, bridging the gap to the 10-day low resolution global forecast window.

replace Simple Guidance Mechanisms for Discrete Diffusion Models

Authors: Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P. de Almeida, Alexander Rush, Thomas Pierrot, Volodymyr Kuleshov

Abstract: Diffusion models for continuous data gained widespread adoption owing to their high quality generation and control mechanisms. However, controllable diffusion on discrete data faces challenges given that continuous guidance methods do not directly apply to discrete diffusion. Here, we provide a straightforward derivation of classifier-free and classifier-based guidance for discrete diffusion, as well as a new class of diffusion models that leverage uniform noise and that are more guidable because they can continuously edit their outputs. We improve the quality of these models with a novel continuous-time variational lower bound that yields state-of-the-art performance, especially in settings involving guidance or fast generation. Empirically, we demonstrate that our guidance mechanisms combined with uniform noise diffusion improve controllable generation relative to autoregressive and diffusion baselines on several discrete data domains, including genomic sequences, small molecule design, and discretized image generation.

replace A Unifying Information-theoretic Perspective on Evaluating Generative Models

Authors: Alexis Fox, Samarth Swarup, Abhijin Adiga

Abstract: Considering the difficulty of interpreting generative model output, there is significant current research focused on determining meaningful evaluation metrics. Several recent approaches utilize "precision" and "recall," borrowed from the classification domain, to individually quantify the output fidelity (realism) and output diversity (representation of the real data variation), respectively. With the increase in metric proposals, there is a need for a unifying perspective, allowing for easier comparison and clearer explanation of their benefits and drawbacks. To this end, we unify a class of kth-nearest-neighbors (kNN)-based metrics under an information-theoretic lens using approaches from kNN density estimation. Additionally, we propose a tri-dimensional metric composed of Precision Cross-Entropy (PCE), Recall Cross-Entropy (RCE), and Recall Entropy (RE), which separately measure fidelity and two distinct aspects of diversity, inter- and intra-class. Our domain-agnostic metric, derived from the information-theoretic concepts of entropy and cross-entropy, can be dissected for both sample- and mode-level analysis. Our detailed experimental results demonstrate the sensitivity of our metric components to their respective qualities and reveal undesirable behaviors of other metrics.

replace ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

Authors: Ziteng Wang, Jun Zhu, Jianfei Chen

Abstract: Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.

URLs: https://github.com/thu-ml/ReMoE.

replace Uncertainty Herding: One Active Learning Method for All Label Budgets

Authors: Wonho Bae, Gabriel L. Oliveira, Danica J. Sutherland

Abstract: Most active learning research has focused on methods which perform well when many labels are available, but can be dramatically worse than random selection when label budgets are small. Other methods have focused on the low-budget regime, but do poorly as label budgets increase. As the line between "low" and "high" budgets varies by problem, this is a serious issue in practice. We propose uncertainty coverage, an objective which generalizes a variety of low- and high-budget objectives, as well as natural, hyperparameter-light methods to smoothly interpolate between low- and high-budget regimes. We call greedy optimization of the estimate Uncertainty Herding; this simple method is computationally fast, and we prove that it nearly optimizes the distribution-level coverage. In experimental validation across a variety of active learning tasks, our proposal matches or beats state-of-the-art performance in essentially all cases; it is the only method of which we are aware that reliably works well in both low- and high-budget settings.

replace A Survey on Diffusion Models for Anomaly Detection

Authors: Jing Liu, Zhenchao Ma, Zepu Wang, Chenxuanyin Zou, Jiayang Ren, Zehua Wang, Liang Song, Bo Hu, Yang Liu, Victor C. M. Leung

Abstract: Diffusion models (DMs) have emerged as a powerful class of generative AI models, showing remarkable potential in anomaly detection (AD) tasks across various domains, such as cybersecurity, fraud detection, healthcare, and manufacturing. The intersection of these two fields, termed diffusion models for anomaly detection (DMAD), offers promising solutions for identifying deviations in increasingly complex and high-dimensional data. In this survey, we review recent advances in DMAD research. We begin by presenting the fundamental concepts of AD and DMs, followed by a comprehensive analysis of classic DM architectures including DDPMs, DDIMs, and Score SDEs. We further categorize existing DMAD methods into reconstruction-based, density-based, and hybrid approaches, providing detailed examinations of their methodological innovations. We also explore the diverse tasks across different data modalities, encompassing image, time series, video, and multimodal data analysis. Furthermore, we discuss critical challenges and emerging research directions, including computational efficiency, model interpretability, robustness enhancement, edge-cloud collaboration, and integration with large language models. The collection of DMAD research papers and resources is available at https://github.com/fdjingliu/DMAD.

URLs: https://github.com/fdjingliu/DMAD.

replace HadamRNN: Binary and Sparse Ternary Orthogonal RNNs

Authors: Armand Foucault (IMT, ANITI), Franck Mamalet (ANITI), Fran\c{c}ois Malgouyres (IMT)

Abstract: Binary and sparse ternary weights in neural networks enable faster computations and lighter representations, facilitating their use on edge devices with limited computational power. Meanwhile, vanilla RNNs are highly sensitive to changes in their recurrent weights, making the binarization and ternarization of these weights inherently challenging. To date, no method has successfully achieved binarization or ternarization of vanilla RNN weights. We present a new approach leveraging the properties of Hadamard matrices to parameterize a subset of binary and sparse ternary orthogonal matrices. This method enables the training of orthogonal RNNs (ORNNs) with binary and sparse ternary recurrent weights, effectively creating a specific class of binary and sparse ternary vanilla RNNs. The resulting ORNNs, called HadamRNN and lock-HadamRNN, are evaluated on benchmarks such as the copy task, permuted and sequential MNIST tasks, and IMDB dataset. Despite binarization or sparse ternarization, these RNNs maintain performance levels comparable to state-of-the-art full-precision models, highlighting the effectiveness of our approach. Notably, our approach is the first solution with binary recurrent weights capable of tackling the copy task over 1000 timesteps.

replace Omni-Mol: Exploring Universal Convergent Space for Omni-Molecular Tasks

Authors: Chengxin Hu, Hao Li, Yihe Yuan, Zezheng Song, Haixin Wang

Abstract: Building generalist models has recently demonstrated remarkable capabilities in diverse scientific domains. Within the realm of molecular learning, several studies have explored unifying diverse tasks across diverse domains. However, negative conflicts and interference between molecules and knowledge from different domain may have a worse impact in threefold. First, conflicting molecular representations can lead to optimization difficulties for the models. Second, mixing and scaling up training data across diverse tasks is inherently challenging. Third, the computational cost of refined pretraining is prohibitively high. To address these limitations, this paper presents Omni-Mol, a scalable and unified LLM-based framework for direct instruction tuning. Omni-Mol builds on three key components to tackles conflicts: (1) a unified encoding mechanism for any task input; (2) an active-learning-driven data selection strategy that significantly reduces dataset size; (3) a novel design of the adaptive gradient stabilization module and anchor-and-reconcile MoE framework that ensures stable convergence. Experimentally, Omni-Mol achieves state-of-the-art performance across 15 molecular tasks, demonstrates the presence of scaling laws in the molecular domain, and is supported by extensive ablation studies and analyses validating the effectiveness of its design. The dataset, code and weights of the powerful AI-driven chemistry generalist are open-sourced.

replace Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs

Authors: Thierry Bossy, Julien Vignoud, Tahseen Rabbani, Juan R. Troncoso Pastoriza, Martin Jaggi

Abstract: Federated learning (FL) is a popular paradigm for collaborative training which avoids direct data exposure between clients. However, data privacy issues still remain: FL-trained large language models are capable of memorizing and completing phrases and sentences contained in training data when given with their prefixes. Thus, it is possible for adversarial and honest-but-curious clients to recover training data of other participants simply through targeted prompting. In this work, we demonstrate that a popular and simple fine-tuning strategy, low-rank adaptation (LoRA), reduces memorization during FL up to a factor of 10. We study this effect by performing a medical question-answering fine-tuning task and injecting multiple replicas of out-of-distribution sensitive sequences drawn from an external clinical dataset. We observe a reduction in memorization for a wide variety of Llama 2 and 3 models, and find that LoRA can reduce memorization in centralized learning as well. Furthermore, we show that LoRA can be combined with other privacy-preserving techniques such as gradient clipping and Gaussian noising, secure aggregation, and Goldfish loss to further improve record-level privacy while maintaining performance.

replace Removing Neural Signal Artifacts with Autoencoder-Targeted Adversarial Transformers (AT-AT)

Authors: Benjamin J. Choi

Abstract: Electromyogenic (EMG) noise is a major contamination source in EEG data that can impede accurate analysis of brain-specific neural activity. Recent literature on EMG artifact removal has moved beyond traditional linear algorithms in favor of machine learning-based systems. However, existing deep learning-based filtration methods often have large compute footprints and prohibitively long training times. In this study, we present a new machine learning-based system for filtering EMG interference from EEG data using an autoencoder-targeted adversarial transformer (AT-AT). By leveraging the lightweight expressivity of an autoencoder to determine optimal time-series transformer application sites, our AT-AT architecture achieves a >90% model size reduction compared to published artifact removal models. The addition of adversarial training ensures that filtered signals adhere to the fundamental characteristics of EEG data. We trained AT-AT using published neural data from 67 subjects and found that the system was able to achieve comparable test performance to larger models; AT-AT posted a mean reconstructive correlation coefficient above 0.95 at an initial signal-to-noise ratio (SNR) of 2 dB and 0.70 at -7 dB SNR. Further research generalizing these results to broader sample sizes beyond these isolated test cases will be crucial; while outside the scope of this study, we also include results from a real-world deployment of AT-AT in the Appendix.

replace Geometric Machine Learning on EEG Signals

Authors: Benjamin J. Choi

Abstract: Brain-computer interfaces (BCIs) offer transformative potential, but decoding neural signals presents significant challenges. The core premise of this paper is built around demonstrating methods to elucidate the underlying low-dimensional geometric structure present in high-dimensional brainwave data in order to assist in downstream BCI-related neural classification tasks. We demonstrate two pipelines related to electroencephalography (EEG) signal processing: (1) a preliminary pipeline removing noise from individual EEG channels, and (2) a downstream manifold learning pipeline uncovering geometric structure across networks of EEG channels. We conduct preliminary validation using two EEG datasets and situate our demonstration in the context of the BCI-relevant imagined digit decoding problem. Our preliminary pipeline uses an attention-based EEG filtration network to extract clean signal from individual EEG channels. Our primary pipeline uses a fast Fourier transform, a Laplacian eigenmap, a discrete analog of Ricci flow via Ollivier's notion of Ricci curvature, and a graph convolutional network to perform dimensionality reduction on high-dimensional multi-channel EEG data in order to enable regularizable downstream classification. Our system achieves competitive performance with existing signal processing and classification benchmarks; we demonstrate a mean test correlation coefficient of >0.95 at 2 dB on semi-synthetic neural denoising and a downstream EEG-based classification accuracy of 0.97 on distinguishing digit- versus non-digit- thoughts. Results are preliminary and our geometric machine learning pipeline should be validated by more extensive follow-up studies; generalizing these results to larger inter-subject sample sizes, different hardware systems, and broader use cases will be crucial.

replace Hyperparameters in Score-Based Membership Inference Attacks

Authors: Gauri Pradhan, Joonas J\"alk\"o, Marlon Tobaben, Antti Honkela

Abstract: Membership Inference Attacks (MIAs) have emerged as a valuable framework for evaluating privacy leakage by machine learning models. Score-based MIAs are distinguished, in particular, by their ability to exploit the confidence scores that the model generates for particular inputs. Existing score-based MIAs implicitly assume that the adversary has access to the target model's hyperparameters, which can be used to train the shadow models for the attack. In this work, we demonstrate that the knowledge of target hyperparameters is not a prerequisite for MIA in the transfer learning setting. Based on this, we propose a novel approach to select the hyperparameters for training the shadow models for MIA when the attacker has no prior knowledge about them by matching the output distributions of target and shadow models. We demonstrate that using the new approach yields hyperparameters that lead to an attack near indistinguishable in performance from an attack that uses target hyperparameters to train the shadow models. Furthermore, we study the empirical privacy risk of unaccounted use of training data for hyperparameter optimization (HPO) in differentially private (DP) transfer learning. We find no statistically significant evidence that performing HPO using training data would increase vulnerability to MIA.

replace Applying Deep Learning to Ads Conversion Prediction in Last Mile Delivery Marketplace

Authors: Di Li, Xiaochang Miao, Huiyu Song, Chao Chu, Hao Xu, Mandar Rahurkar

Abstract: Deep neural networks (DNNs) have revolutionized web-scale ranking systems, enabling breakthroughs in capturing complex user behaviors and driving performance gains. At DoorDash, we first harnessed this transformative power by transitioning our homepage Ads ranking system from traditional tree based models to cutting edge multi task DNNs. This evolution sparked advancements in data foundations, model design, training efficiency, evaluation rigor, and online serving, delivering substantial business impact and reshaping our approach to machine learning. In this paper, we talk about our problem driven journey, from identifying the right problems and crafting targeted solutions to overcoming the complexity of developing and scaling a deep learning recommendation system. Through our successes and learned lessons, we aim to share insights and practical guidance to teams pursuing similar advancements in machine learning systems.

replace K-Edit: Language Model Editing with Contextual Knowledge Awareness

Authors: Elan Markowitz, Anil Ramakrishna, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan

Abstract: As the world changes, we need to be able to update our models and correct false information without costly retraining. Knowledge-based model editing enables precise modifications to the weights of large language models in order to modify the information encoded within. Recent approaches have seen success in enabling recall of edited information for thousands of edits at once. However, these approaches fail to produce edits that account for associated contextual information. We present K-Edit, an effective approach to generating contextually consistent knowledge edits. By using knowledge graphs, which maintain contextual consistency when an edge is edited, we are able to generate additional \textit{contextual edits} that ensure consistency of related information in the language model. Our experiments demonstrate significant improvements in multi-hop question answering while maintaining the general effectiveness and scalability of model edits.

replace Explainable Artificial Intelligence Model for Evaluating Shear Strength Parameters of Municipal Solid Waste Across Diverse Compositional Profiles

Authors: Parichat Suknark, Sompote Youwaib, Tipok Kitkobsin, Sirintornthep Towprayoon, Chart Chiemchaisri, Komsilp Wangyao

Abstract: Accurate prediction of shear strength parameters in Municipal Solid Waste (MSW) remains a critical challenge in geotechnical engineering due to the heterogeneous nature of waste materials and their temporal evolution through degradation processes. This paper presents a novel explainable artificial intelligence (XAI) framework for evaluating cohesion and friction angle across diverse MSW compositional profiles. The proposed model integrates a multi-layer perceptron architecture with SHAP (SHapley Additive exPlanations) analysis to provide transparent insights into how specific waste components influence strength characteristics. Training data encompassed large-scale direct shear tests across various waste compositions and degradation states. The model demonstrated superior predictive accuracy compared to traditional gradient boosting methods, achieving mean absolute percentage errors of 7.42% and 14.96% for friction angle and cohesion predictions, respectively. Through SHAP analysis, the study revealed that fibrous materials and particle size distribution were primary drivers of shear strength variation, with food waste and plastics showing significant but non-linear effects. The model's explainability component successfully quantified these relationships, enabling evidence-based recommendations for waste management practices. This research bridges the gap between advanced machine learning and geotechnical engineering practice, offering a reliable tool for rapid assessment of MSW mechanical properties while maintaining interpretability for engineering decision-making.

replace A Survey of Graph Transformers: Architectures, Theories and Applications

Authors: Chaohao Yuan, Kangfei Zhao, Ercan Engin Kuruoglu, Liang Wang, Tingyang Xu, Wenbing Huang, Deli Zhao, Hong Cheng, Yu Rong

Abstract: Graph Transformers (GTs) have demonstrated a strong capability in modeling graph structures by addressing the intrinsic limitations of graph neural networks (GNNs), such as over-smoothing and over-squashing. Recent studies have proposed diverse architectures, enhanced explainability, and practical applications for Graph Transformers. In light of these rapid developments, we conduct a comprehensive review of Graph Transformers, covering aspects such as their architectures, theoretical foundations, and applications within this survey. We categorize the architecture of Graph Transformers according to their strategies for processing structural information, including graph tokenization, positional encoding, structure-aware attention and model ensemble. Furthermore, from the theoretical perspective, we examine the expressivity of Graph Transformers in various discussed architectures and contrast them with other advanced graph learning algorithms to discover the connections. Furthermore, we provide a summary of the practical applications where Graph Transformers have been utilized, such as molecule, protein, language, vision, traffic, brain and material data. At the end of this survey, we will discuss the current challenges and prospective directions in Graph Transformers for potential future research.

replace BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning

Authors: Haiteng Zhao, Chang Ma, Fangzhi Xu, Lingpeng Kong, Zhi-Hong Deng

Abstract: The applications of large language models (LLMs) in various biological domains have been explored recently, but their reasoning ability in complex biological systems, such as pathways, remains underexplored, which is crucial for predicting biological phenomena, formulating hypotheses, and designing experiments. This work explores the potential of LLMs in pathway reasoning. We introduce BioMaze, a dataset with 5.1K complex pathway problems derived from real research, covering various biological contexts including natural dynamic changes, disturbances, additional intervention conditions, and multi-scale research targets. Our evaluation of methods such as CoT and graph-augmented reasoning, shows that LLMs struggle with pathway reasoning, especially in perturbed systems. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation, enabling a more effective approach to handling the complexities of biological systems in a scientifically aligned manner. The dataset and code are available at https://github.com/zhao-ht/BioMaze.

URLs: https://github.com/zhao-ht/BioMaze.

replace Sustainable Greenhouse Management: A Comparative Analysis of Recurrent and Graph Neural Networks

Authors: Emiliano Seri, Marcello Petitta, Cristina Cornaro

Abstract: The integration of photovoltaic (PV) systems into greenhouses not only optimizes land use but also enhances sustainable agricultural practices by enabling dual benefits of food production and renewable energy generation. However, accurate prediction of internal environmental conditions is crucial to ensure optimal crop growth while maximizing energy production. This study introduces a novel application of Spatio-Temporal Graph Neural Networks (STGNNs) to greenhouse microclimate modeling, comparing their performance with traditional Recurrent Neural Networks (RNNs). While RNNs excel at temporal pattern recognition, they cannot explicitly model the directional relationships between environmental variables. Our STGNN approach addresses this limitation by representing these relationships as directed graphs, enabling the model to capture both spatial dependencies and their directionality. Using high-frequency data collected at 15-minute intervals from a greenhouse in Volos, Greece, we demonstrate that RNNs achieve exceptional accuracy in winter conditions (R^2 = 0.985) but show limitations during summer cooling system operation. Though STGNNs currently show lower performance (winter R^2 = 0.947), their architecture offers greater potential for integrating additional variables such as PV generation and crop growth indicators.

replace Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models

Authors: Xu Chu, Zhixin Zhang, Tianyu Jia, Yujie Jin

Abstract: Aligning language models with human preferences is critical for real-world deployment, but existing methods often require large amounts of high-quality human annotations. Aiming at a data-efficient alignment method, we propose Stackelberg Game Preference Optimization (SGPO), a framework that models alignment as a two-player Stackelberg game, where a policy (leader) optimizes against a worst-case preference distribution (follower) within an $\epsilon$-Wasserstein ball, ensuring robustness to (self-)annotation noise and distribution shifts. SGPO guarantees $O(\epsilon)$-bounded regret, unlike Direct Preference Optimization (DPO), which suffers from linear regret growth in the distribution mismatch. We instantiate SGPO with the Stackelberg Self-Annotated Preference Optimization (SSAPO) algorithm, which iteratively self-annotates preferences and adversarially reweights synthetic annotated preferences. Using only 2K seed preferences, from the UltraFeedback dataset, i.e., 1/30 of human labels in the dataset, our method achieves 35.82% GPT-4 win-rate with Mistral-7B and 40.12% with Llama3-8B-Instruct within three rounds of SSAPO.

replace Bayesian Computation in Deep Learning

Authors: Wenlong Chen, Bolian Li, Ruqi Zhang, Yingzhen Li

Abstract: This review paper is intended for the 2nd edition of the Handbook of Markov chain Monte Carlo. We provide an introduction to approximate inference techniques as Bayesian computation methods applied to deep learning models. We organize the chapter by presenting popular computational methods for Bayesian neural networks and deep generative models, explaining their unique challenges in posterior inference as well as the solutions.

replace What is the Alignment Objective of GRPO?

Authors: Milan Vojnovic, Se-Young Yun

Abstract: In this note, we examine the aggregation of preferences achieved by the Group Policy Optimisation (GRPO) algorithm, a reinforcement learning method used to train advanced artificial intelligence models such as DeepSeek-R1-Zero and DeepSeekMath. The GRPO algorithm trains a policy using a reward preference model, which is computed by sampling a set of outputs for a given context, observing the corresponding rewards, and applying shift-and-scale normalisation to these reward values. Additionally, it incorporates a penalty function to discourage deviations from a reference policy. We present a framework that enables us to characterise the stationary policies of the GRPO algorithm. This analysis reveals that the aggregation of preferences differs fundamentally from standard logarithmic pooling, which is implemented by other approaches such as RLHF. The precise form of preference aggregation arises from the way the reward preference model is defined and from the penalty function, which we show to essentially correspond to the reverse Kullback-Leibler (KL) divergence between the aggregation policy and the reference policy. Interestingly, we demonstrate that for groups of size two, the reward preference model corresponds to pairwise comparison preferences, similar to those in other alignment methods based on pairwise comparison feedback. We provide explicit characterisations of the aggregate preference for binary questions, for groups of size two, and in the limit of large group size. This provides insights into the dependence of the aggregate preference on parameters such as the regularisation constant and the confidence margin of question answers. Finally, we discuss the aggregation of preferences obtained by modifying the GRPO algorithm to use direct KL divergence as the penalty or to use rewards without scale normalisation.

replace Reward Shaping to Mitigate Reward Hacking in RLHF

Authors: Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao

Abstract: Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. While reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests three key design principles: (1) RL reward is ideally bounded, (2) RL benefits from rapid initial growth followed by gradual convergence, and (3) RL reward is best formulated as a function of centered reward. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model itself as the signal for reinforcement learning. We evaluated PAR on two base models, Gemma2-2B and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. Code is available at https://github.com/PorUna-byte/PAR.

URLs: https://github.com/PorUna-byte/PAR.

replace BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction

Authors: Ruifeng Tan, Weixiang Hong, Jiayue Tang, Xibin Lu, Ruijun Ma, Xiang Zheng, Jia Li, Jiaqiang Huang, Tong-Yi Zhang

Abstract: Battery Life Prediction (BLP), which relies on time series data produced by battery degradation tests, is crucial for battery utilization, optimization, and production. Despite impressive advancements, this research area faces three key challenges. Firstly, the limited size of existing datasets impedes insights into modern battery life data. Secondly, most datasets are restricted to small-capacity lithium-ion batteries tested under a narrow range of diversity in labs, raising concerns about the generalizability of findings. Thirdly, inconsistent and limited benchmarks across studies obscure the effectiveness of baselines and leave it unclear if models popular in other time series fields are effective for BLP. To address these challenges, we propose BatteryLife, a comprehensive dataset and benchmark for BLP. BatteryLife integrates 16 datasets, offering a 2.4 times sample size compared to the previous largest dataset, and provides the most diverse battery life resource with batteries from 8 formats, 80 chemical systems, 12 operating temperatures, and 646 charge/discharge protocols, including both laboratory and industrial tests. Notably, BatteryLife is the first to release battery life datasets of zinc-ion batteries, sodium-ion batteries, and industry-tested large-capacity lithium-ion batteries. With the comprehensive dataset, we revisit the effectiveness of baselines popular in this and other time series fields. Furthermore, we propose CyclePatch, a plug-in technique that can be employed in a series of neural networks. Extensive benchmarking of 18 methods reveals that models popular in other time series fields can be unsuitable for BLP, and CyclePatch consistently improves model performance establishing state-of-the-art benchmarks. Moreover, BatteryLife evaluates model performance across aging conditions and domains. BatteryLife is available at https://github.com/Ruifeng-Tan/BatteryLife.

URLs: https://github.com/Ruifeng-Tan/BatteryLife.

replace INFO-SEDD: Continuous Time Markov Chains as Scalable Information Metrics Estimators

Authors: Alberto Foresti, Giulio Franzese, Pietro Michiardi

Abstract: Information-theoretic quantities play a crucial role in understanding non-linear relationships between random variables and are widely used across scientific disciplines. However, estimating these quantities remains an open problem, particularly in the case of high-dimensional discrete distributions. Current approaches typically rely on embedding discrete data into a continuous space and applying neural estimators originally designed for continuous distributions, a process that may not fully capture the discrete nature of the underlying data. We consider Continuous-Time Markov Chains (CTMCs), stochastic processes on discrete state-spaces which have gained popularity due to their generative modeling applications. In this work, we introduce INFO-SEDD, a novel method for estimating information-theoretic quantities of discrete data, including mutual information and entropy. Our approach requires the training of a single parametric model, offering significant computational and memory advantages. Additionally, it seamlessly integrates with pretrained networks, allowing for efficient reuse of pretrained generative models. To evaluate our approach, we construct a challenging synthetic benchmark. Our experiments demonstrate that INFO-SEDD is robust and outperforms neural competitors that rely on embedding techniques. Moreover, we validate our method on a real-world task: estimating the entropy of an Ising model. Overall, INFO-SEDD outperforms competing methods and shows scalability to high-dimensional scenarios, paving the way for new applications where estimating MI between discrete distribution is the focus. The promising results in this complex, high-dimensional scenario highlight INFO-SEDD as a powerful new estimator in the toolkit for information-theoretical analysis.

replace CryptoPulse: Short-Term Cryptocurrency Forecasting with Dual-Prediction and Cross-Correlated Market Indicators

Authors: Amit Kumar, Taoran Ji

Abstract: Cryptocurrencies fluctuate in markets with high price volatility, posing significant challenges for investors. To aid in informed decision-making, systems predicting cryptocurrency market movements have been developed, typically focusing on historical patterns. However, these methods often overlook three critical factors influencing market dynamics: 1) the macro investing environment, reflected in major cryptocurrency fluctuations affecting collaborative investor behaviors; 2) overall market sentiment, heavily influenced by news impacting investor strategies; and 3) technical indicators, offering insights into overbought or oversold conditions, momentum, and market trends, which are crucial for short-term price movements. This paper proposes a dual prediction mechanism that forecasts the next day's closing price by incorporating macroeconomic fluctuations, technical indicators, and individual cryptocurrency price changes. Additionally, a novel refinement mechanism enhances predictions through market sentiment-based rescaling and fusion. Experiments demonstrate that the proposed model achieves state-of-the-art performance, consistently outperforming ten comparison methods.

replace-cross FBChain: A Blockchain-based Federated Learning Model with Efficiency and Secure Communication

Authors: Yang Li, Chunhe Xia, Tianbo Wang

Abstract: Privacy and security in the parameter transmission process of federated learning are currently among the most prominent concerns. However, there are two thorny problems caused by unprotected communication methods: "parameter-leakage" and "inefficient-communication". This article proposes Blockchain-based Federated Learning (FBChain) model for federated learning parameter communication to overcome the above two problems. First, we utilize the immutability of blockchain to store the global model and hash value of local model parameters in case of tampering during the communication process, protect data privacy by encrypting parameters, and verify data consistency by comparing the hash values of local parameters, thus addressing the "parameter-leakage" problem. Second, the Proof of Weighted Link Speed (PoWLS) consensus algorithm comprehensively selects nodes with the higher weighted link speed to aggregate global model and package blocks, thereby solving the "inefficient-communication" problem. Experimental results demonstrate the effectiveness of our proposed FBChain model and its ability to improve model communication efficiency in federated learning.

replace-cross A Systematic Review of Low-Rank and Local Low-Rank Matrix Approximation in Big Data Medical Imaging

Authors: Sisipho Hamlomo, Marcellin Atemkeng, Yusuf Brima, Chuneeta Nunhokee, Jeremy Baxter

Abstract: The large volume and complexity of medical imaging datasets are bottlenecks for storage, transmission, and processing. To tackle these challenges, the application of low-rank matrix approximation (LRMA) and its derivative, local LRMA (LLRMA) has demonstrated potential. A detailed analysis of the literature identifies LRMA and LLRMA methods applied to various imaging modalities, and the challenges and limitations associated with existing LRMA and LLRMA methods are addressed. We note a significant shift towards a preference for LLRMA in the medical imaging field since 2015, demonstrating its potential and effectiveness in capturing complex structures in medical data compared to LRMA. Acknowledging the limitations of shallow similarity methods used with LLRMA, we suggest advanced semantic image segmentation for similarity measure, explaining in detail how it can be used to measure similar patches and its feasibility. We note that LRMA and LLRMA are mainly applied to unstructured medical data, and we propose extending their application to different medical data types, including structured and semi-structured. This paper also discusses how LRMA and LLRMA can be applied to regular data with missing entries and the impact of inaccuracies in predicting missing values and their effects. We discuss the impact of patch size and propose the use of random search (RS) to determine the optimal patch size. To enhance feasibility, a hybrid approach using Bayesian optimization and RS is proposed, which could improve the application of LRMA and LLRMA in medical imaging.

replace-cross ThermoHands: A Benchmark for 3D Hand Pose Estimation from Egocentric Thermal Images

Authors: Fangqiang Ding, Yunzhou Zhu, Xiangyu Wen, Gaowen Liu, Chris Xiaoxuan Lu

Abstract: Designing egocentric 3D hand pose estimation systems that can perform reliably in complex, real-world scenarios is crucial for downstream applications. Previous approaches using RGB or NIR imagery struggle in challenging conditions: RGB methods are susceptible to lighting variations and obstructions like handwear, while NIR techniques can be disrupted by sunlight or interference from other NIR-equipped devices. To address these limitations, we present ThermoHands, the first benchmark focused on thermal image-based egocentric 3D hand pose estimation, demonstrating the potential of thermal imaging to achieve robust performance under these conditions. The benchmark includes a multi-view and multi-spectral dataset collected from 28 subjects performing hand-object and hand-virtual interactions under diverse scenarios, accurately annotated with 3D hand poses through an automated process. We introduce a new baseline method, TherFormer, utilizing dual transformer modules for effective egocentric 3D hand pose estimation in thermal imagery. Our experimental results highlight TherFormer's leading performance and affirm thermal imaging's effectiveness in enabling robust 3D hand pose estimation in adverse conditions.

replace-cross Transfer Learning for T-Cell Response Prediction

Authors: Josua Stadelmaier (University of T\"ubingen), Brandon Malone (NEC OncoImmunity), Ralf Eggeling (University of T\"ubingen)

Abstract: We study the prediction of T-cell response for specific given peptides, which could, among other applications, be a crucial step towards the development of personalized cancer vaccines. It is a challenging task due to limited, heterogeneous training data featuring a multi-domain structure; such data entail the danger of shortcut learning, where models learn general characteristics of peptide sources, such as the source organism, rather than specific peptide characteristics associated with T-cell response. Using a transformer model for T-cell response prediction, we show that the danger of inflated predictive performance is not merely theoretical but occurs in practice. Consequently, we propose a domain-aware evaluation scheme. We then study different transfer learning techniques to deal with the multi-domain structure and shortcut learning. We demonstrate a per-source fine tuning approach to be effective across a wide range of peptide sources and further show that our final model is competitive with existing state-of-the-art approaches for predicting T-cell responses for human peptides.

replace-cross The Impact of Unstated Norms in Bias Analysis of Language Models

Authors: Farnaz Kohankhaki, D. B. Emerson, Jacob-Junqi Tian, Laleh Seyyed-Kalantari, Faiza Khan Khattak

Abstract: Bias in large language models (LLMs) has many forms, from overt discrimination to implicit stereotypes. Counterfactual bias evaluation is a widely used approach to quantifying bias and often relies on template-based probes that explicitly state group membership. It measures whether the outcome of a task performed by an LLM is invariant to a change in group membership. In this work, we find that template-based probes can lead to unrealistic bias measurements. For example, LLMs appear to mistakenly cast text associated with White race as negative at higher rates than other groups. We hypothesize that this arises artificially via a mismatch between commonly unstated norms, in the form of markedness, in the pretraining text of LLMs (e.g., Black president vs. president) and templates used for bias measurement (e.g., Black president vs. White president). The findings highlight the potential misleading impact of varying group membership through explicit mention in counterfactual bias quantification.

replace-cross New methods to compute the generalized chi-square distribution

Authors: Abhranil Das

Abstract: We present four new mathematical methods, two exact and two approximate, along with open-source software, to compute the cdf, pdf and inverse cdf of the generalized chi-square distribution. Some methods are geared for speed, while others are designed to be accurate far into the tails, using which we can also measure large values of the discriminability index $d'$ between multivariate normal distributions. We compare the accuracy and speed of these and previous methods, characterize their advantages and limitations, and identify the best methods to use in different cases.

replace-cross A Dataset and Framework for Learning State-invariant Object Representations

Authors: Rohan Sarkar, Avinash Kak

Abstract: We add one more invariance - the state invariance - to the more commonly used other invariances for learning object representations for recognition and retrieval. By state invariance, we mean robust with respect to changes in the structural form of the objects, such as when an umbrella is folded, or when an item of clothing is tossed on the floor. In this work, we present a novel dataset, ObjectsWithStateChange, which captures state and pose variations in the object images recorded from arbitrary viewpoints. We believe that this dataset will facilitate research in fine-grained object recognition and retrieval of 3D objects that are capable of state changes. The goal of such research would be to train models capable of learning discriminative object embeddings that remain invariant to state changes while also staying invariant to transformations induced by changes in viewpoint, pose, illumination, etc. A major challenge in this regard is that instances of different objects (both within and across different categories) under various state changes may share similar visual characteristics and therefore may be close to one another in the learned embedding space, which would make it more difficult to discriminate between them. To address this, we propose a curriculum learning strategy that progressively selects object pairs with smaller inter-object distances in the learned embedding space during the training phase. This approach gradually samples harder-to-distinguish examples of visually similar objects, both within and across different categories. Our ablation related to the role played by curriculum learning indicates an improvement in object recognition accuracy of 7.9% and retrieval mAP of 9.2% over the state-of-the-art on our new dataset, as well as three other challenging multi-view datasets such as ModelNet40, ObjectPI, and FG3D.

replace-cross EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion

Authors: Guangyao Zhai, Evin P{\i}nar \"Ornek, Dave Zhenyu Chen, Ruotong Liao, Yan Di, Nassir Navab, Federico Tombari, Benjamin Busam

Abstract: We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes on scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by associating each node with a denoising process and enables collaborative information exchange, enhancing controllable and consistent generation aware of global constraints. This is achieved through an information echo scheme in both shape and layout branches. At every denoising step, all processes share their denoising data with an information exchange unit that combines these updates using graph convolution. The scheme ensures that the denoising processes are influenced by a holistic understanding of the scene graph, facilitating the generation of globally coherent scenes. The resulting scenes can be manipulated during inference by editing the input scene graph and sampling the noise in the diffusion model. Extensive experiments validate our approach, which maintains scene controllability and surpasses previous methods in generation fidelity. Moreover, the generated scenes are of high quality and thus directly compatible with off-the-shelf texture generation. Code and trained models are open-sourced.

replace-cross A Polynomial-Time Approximation for Pairwise Fair $k$-Median Clustering

Authors: Sayan Bandyapadhyay, Eden Chlamt\'a\v{c}, Zachary Friggstad, Mahya Jamshidian, Yury Makarychev, Ali Vakilian

Abstract: In this work, we study pairwise fair clustering with $\ell \ge 2$ groups, where for every cluster $C$ and every group $i \in [\ell]$, the number of points in $C$ from group $i$ must be at most $t$ times the number of points in $C$ from any other group $j \in [\ell]$, for a given integer $t$. To the best of our knowledge, only bi-criteria approximation and exponential-time algorithms follow for this problem from the prior work on fair clustering problems when $\ell > 2$. In our work, focusing on the $\ell > 2$ case, we design the first polynomial-time $O(k^2\cdot \ell \cdot t)$-approximation for this problem with $k$-median cost that does not violate the fairness constraints. We complement our algorithmic result by providing hardness of approximation results, which show that our problem even when $\ell=2$ is almost as hard as the popular uniform capacitated $k$-median, for which no polynomial-time algorithm with an approximation factor of $o(\log k)$ is known.

replace-cross Topic Classification of Case Law Using a Large Language Model and a New Taxonomy for UK Law: AI Insights into Summary Judgment

Authors: Holli Sargeant, Ahmed Izzidien, Felix Steffek

Abstract: This paper addresses a critical gap in legal analytics by developing and applying a novel taxonomy for topic classification of summary judgment cases in the United Kingdom. Using a curated dataset of summary judgment cases, we use the Large Language Model Claude 3 Opus to explore functional topics and trends. We find that Claude 3 Opus correctly classified the topic with an accuracy of 87.13% and an F1 score of 0.87. The analysis reveals distinct patterns in the application of summary judgments across various legal domains. As case law in the United Kingdom is not originally labelled with keywords or a topic filtering option, the findings not only refine our understanding of the thematic underpinnings of summary judgments but also illustrate the potential of combining traditional and AI-driven approaches in legal classification. Therefore, this paper provides a new and general taxonomy for UK law. The implications of this work serve as a foundation for further research and policy discussions in the field of judicial administration and computational legal research methodologies.

replace-cross On Conformal Isometry of Grid Cells: Learning Distance-Preserving Position Embedding

Authors: Dehong Xu, Ruiqi Gao, Wen-Hao Zhang, Xue-Xin Wei, Ying Nian Wu

Abstract: This paper investigates the conformal isometry hypothesis as a potential explanation for the hexagonal periodic patterns in grid cell response maps. We posit that grid cell activities form a high-dimensional vector in neural space, encoding the agent's position in 2D physical space. As the agent moves, this vector rotates within a 2D manifold in the neural space, driven by a recurrent neural network. The conformal hypothesis proposes that this neural manifold is a conformal isometric embedding of 2D physical space, where local physical distance is preserved by the embedding up to a scaling factor (or unit of metric). Such distance-preserving position embedding is indispensable for path planning in navigation, especially planning local straight path segments. We conduct numerical experiments to show that this hypothesis leads to the hexagonal grid firing patterns by learning maximally distance-preserving position embedding, agnostic to the choice of the recurrent neural network. Furthermore, we present a theoretical explanation of why hexagon periodic patterns emerge by minimizing our loss function by showing that hexagon flat torus is maximally distance preserving.

replace-cross Efficient mid-term forecasting of hourly electricity load using generalized additive models

Authors: Monika Zimmermann, Florian Ziel

Abstract: Accurate mid-term (weeks to one year) hourly electricity load forecasts are essential for strategic decision-making in power plant operation, ensuring supply security and grid stability, planning and building energy storage systems, and energy trading. While numerous models effectively predict short-term (hours to a few days) hourly load, mid-term forecasting solutions remain scarce. In mid-term load forecasting, capturing the multifaceted characteristics of load, including daily, weekly and annual seasonal patterns, as well as autoregressive effects, weather and holiday impacts, and socio-economic non-stationarities, presents significant modeling challenges. To address these challenges, we propose a novel forecasting method using Generalized Additive Models (GAMs) built from interpretable P-splines that is enhanced with autoregressive post-processing. This model incorporates smoothed temperatures, Error-Trend-Seasonal (ETS) modeled and persistently forecasted non-stationary socio-economic states, a nuanced representation of effects from vacation periods, fixed date and weekday holidays, and seasonal information as inputs. The proposed model is evaluated using load data from 24 European countries over more than 9 years (2015-2024). This analysis demonstrates that the model not only has significantly enhanced forecasting accuracy compared to state-of-the-art methods but also offers valuable insights into the influence of individual components on predicted load, given its full interpretability. Achieving performance akin to day-ahead Transmission System Operator (TSO) forecasts, with computation times of just a few seconds for several years of hourly data, underscores the potential of the model for practical application in the power system industry.

replace-cross Pursuing Feature Separation based on Neural Collapse for Out-of-Distribution Detection

Authors: Yingwen Wu, Ruiji Yu, Xinwen Cheng, Zhengbao He, Xiaolin Huang

Abstract: In the open world, detecting out-of-distribution (OOD) data, whose labels are disjoint with those of in-distribution (ID) samples, is important for reliable deep neural networks (DNNs). To achieve better detection performance, one type of approach proposes to fine-tune the model with auxiliary OOD datasets to amplify the difference between ID and OOD data through a separation loss defined on model outputs. However, none of these studies consider enlarging the feature disparity, which should be more effective compared to outputs. The main difficulty lies in the diversity of OOD samples, which makes it hard to describe their feature distribution, let alone design losses to separate them from ID features. In this paper, we neatly fence off the problem based on an aggregation property of ID features named Neural Collapse (NC). NC means that the penultimate features of ID samples within a class are nearly identical to the last layer weight of the corresponding class. Based on this property, we propose a simple but effective loss called Separation Loss, which binds the features of OOD data in a subspace orthogonal to the principal subspace of ID features formed by NC. In this way, the features of ID and OOD samples are separated by different dimensions. By optimizing the feature separation loss rather than purely enlarging output differences, our detection achieves SOTA performance on CIFAR10, CIFAR100 and ImageNet benchmarks without any additional data augmentation or sampling, demonstrating the importance of feature separation in OOD detection. Code is available at https://github.com/Wuyingwen/Pursuing-Feature-Separation-for-OOD-Detection.

URLs: https://github.com/Wuyingwen/Pursuing-Feature-Separation-for-OOD-Detection.

replace-cross Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity

Authors: Yan Yang, Bin Gao, Ya-xiang Yuan

Abstract: Bilevel reinforcement learning (RL), which features intertwined two-level problems, has attracted growing interest recently. The inherent non-convexity of the lower-level RL problem is, however, to be an impediment to developing bilevel optimization methods. By employing the fixed point equation associated with the regularized RL, we characterize the hyper-gradient via fully first-order information, thus circumventing the assumption of lower-level convexity. This, remarkably, distinguishes our development of hyper-gradient from the general AID-based bilevel frameworks since we take advantage of the specific structure of RL problems. Moreover, we design both model-based and model-free bilevel reinforcement learning algorithms, facilitated by access to the fully first-order hyper-gradient. Both algorithms enjoy the convergence rate $O(\epsilon^{-1})$. To extend the applicability, a stochastic version of the model-free algorithm is proposed, along with results on its iteration and sample complexity. In addition, numerical experiments demonstrate that the hyper-gradient indeed serves as an integration of exploitation and exploration.

replace-cross Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference

Authors: Anton Xue, Avishree Khare, Rajeev Alur, Surbhi Goel, Eric Wong

Abstract: We study how to subvert large language models (LLMs) from following prompt-specified rules. We first formalize rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if $P$ and $Q$, then $R$" for some propositions $P$, $Q$, and $R$. Next, we prove that although small transformers can faithfully follow such rules, maliciously crafted prompts can still mislead both theoretical constructions and models learned from data. Furthermore, we demonstrate that popular attack algorithms on LLMs find adversarial prompts and induce attention patterns that align with our theory. Our novel logic-based framework provides a foundation for studying LLMs in rule-based settings, enabling a formal analysis of tasks like logical reasoning and jailbreak attacks.

replace-cross To Switch or Not to Switch? Balanced Policy Switching in Offline Reinforcement Learning

Authors: Tao Ma, Xuzhi Yang, Zoltan Szabo

Abstract: Reinforcement learning (RL) -- finding the optimal behaviour (also referred to as policy) maximizing the collected long-term cumulative reward -- is among the most influential approaches in machine learning with a large number of successful applications. In several decision problems, however, one faces the possibility of policy switching -- changing from the current policy to a new one -- which incurs a non-negligible cost, and in the decision one is limited to using historical data without the availability for further online interaction. Despite the inevitable importance of this offline learning scenario, to our best knowledge, very little effort has been made to tackle the key problem of balancing between the gain and the cost of switching in a flexible and principled way. Leveraging ideas from the area of optimal transport, we initialize the systematic study of policy switching in offline RL. We establish fundamental properties and design a Net Actor-Critic algorithm for the proposed novel switching formulation. Numerical experiments demonstrate the efficiency of our approach on multiple robot control benchmarks of the Gymnasium and traffic light control from SUMO-RL.

replace-cross Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Authors: Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis B\'ethune, Hadi Pour Ansari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi

Abstract: Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not reflected yet in existing datasets which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC) that describes an image using a labeled graph structure, with nodes of various types. The nodes in GBC are created through a two-stage process: first, identifying and describing entity nodes; second, linking these nodes by highlighting \textit{compositions} and \textit{relations} among them. Since \textit{all} GBC nodes hold plain text descriptions, GBC retains the flexibility found in natural language, but can also encode hierarchical information in its edges. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and object detection models, by building a new dataset GBC10M that gathers GBC annotations for about 10M images of the CC12M dataset. Through CLIP training on GBC10M, we show that leveraging GBC nodes' annotations -- particularly those in composition and relation nodes -- significantly boosts the model's performance across various benchmarks compared to when other annotations are used. To further explore the opportunities provided by GBC, we also investigate the use of GBC as middleware for text-to-image generation, and show the extra benefits of incorporating the graph structure in this task. Our code and datasets are released at https://github.com/apple/ml-gbc and https://huggingface.co/graph-based-captions.

URLs: https://github.com/apple/ml-gbc, https://huggingface.co/graph-based-captions.

replace-cross Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation

Authors: Tz-Ying Wu, Kyle Min, Subarna Tripathi, Nuno Vasconcelos

Abstract: Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage the egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation for each video frame/text feature using the basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, it models context fusion and cross-modal transfer in an efficient fashion. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), largely improving over baselines and reaching the performance of full fine-tuning.

replace-cross Persistent Sampling: Enhancing the Efficiency of Sequential Monte Carlo

Authors: Minas Karamanis, Uro\v{s} Seljak

Abstract: Sequential Monte Carlo (SMC) samplers are powerful tools for Bayesian inference but suffer from high computational costs due to their reliance on large particle ensembles for accurate estimates. We introduce persistent sampling (PS), an extension of SMC that systematically retains and reuses particles from all prior iterations to construct a growing, weighted ensemble. By leveraging multiple importance sampling and resampling from a mixture of historical distributions, PS mitigates the need for excessively large particle counts, directly addressing key limitations of SMC such as particle impoverishment and mode collapse. Crucially, PS achieves this without additional likelihood evaluations-weights for persistent particles are computed using cached likelihood values. This framework not only yields more accurate posterior approximations but also produces marginal likelihood estimates with significantly lower variance, enhancing reliability in model comparison. Furthermore, the persistent ensemble enables efficient adaptation of transition kernels by leveraging a larger, decorrelated particle pool. Experiments on high-dimensional Gaussian mixtures, hierarchical models, and non-convex targets demonstrate that PS consistently outperforms standard SMC and related variants, including recycled and waste-free SMC, achieving substantial reductions in mean squared error for posterior expectations and evidence estimates, all at reduced computational cost. PS thus establishes itself as a robust, scalable, and efficient alternative for complex Bayesian inference tasks.

replace-cross The Role of Explainable AI in Revolutionizing Human Health Monitoring: A Review

Authors: Abdullah Alharthi, Ahmed Alqurashi, Turki Alharbi, Mohammed Alammar, Nasser Aldosari, Houssem Bouchekara, Yusuf Shaaban, Mohammad Shoaib Shahriar, Abdulrahman Al Ayidh

Abstract: The complex nature of disease mechanisms and the variability of patient symptoms pose significant challenges in developing effective diagnostic tools. Although machine learning (ML) has made substantial advances in medical diagnosis, the decision-making processes of these models often lack transparency, potentially jeopardizing patient outcomes. This review aims to highlight the role of Explainable AI (XAI) in addressing the interpretability issues of ML models in healthcare, with a focus on chronic conditions such as Parkinson's, stroke, depression, cancer, heart disease, and Alzheimer's disease. A comprehensive literature search was conducted across multiple databases to identify studies that applied XAI techniques in healthcare. The search focused on XAI algorithms used in diagnosing and monitoring chronic diseases. The review identified the application of nine trending XAI algorithms, each evaluated for their advantages and limitations in various healthcare contexts. The findings underscore the importance of transparency in ML models, which is crucial for improving trust and outcomes in clinical practice. While XAI provides significant potential to bridge the gap between complex ML models and clinical practice, challenges such as scalability, validation, and clinician acceptance remain. The review also highlights areas requiring further research, particularly in integrating XAI into healthcare systems. The study concludes that XAI methods offer a promising path forward for enhancing human health monitoring and patient care, though significant challenges must be addressed to fully realize their potential in clinical settings.

replace-cross Transformers Handle Endogeneity in In-Context Linear Regression

Authors: Haodong Liang, Krishnakumar Balasubramanian, Lifeng Lai

Abstract: We explore the capability of transformers to address endogeneity in in-context linear regression. Our main finding is that transformers inherently possess a mechanism to handle endogeneity effectively using instrumental variables (IV). First, we demonstrate that the transformer architecture can emulate a gradient-based bi-level optimization procedure that converges to the widely used two-stage least squares $(\textsf{2SLS})$ solution at an exponential rate. Next, we propose an in-context pretraining scheme and provide theoretical guarantees showing that the global minimizer of the pre-training loss achieves a small excess loss. Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the $\textsf{2SLS}$ method, in the presence of endogeneity.

replace-cross Exploring QUIC Dynamics: A Large-Scale Dataset for Encrypted Traffic Analysis

Authors: Barak Gahtan, Robert J. Shahla, Alex M. Bronstein, Reuven Cohen

Abstract: The increasing adoption of the QUIC transport protocol has transformed encrypted web traffic, necessitating new methodologies for network analysis. However, existing datasets lack the scope, metadata, and decryption capabilities required for robust benchmarking in encrypted traffic research. We introduce VisQUIC, a large-scale dataset of 100,000 labeled QUIC traces from over 44,000 websites, collected over four months. Unlike prior datasets, VisQUIC provides SSL keys for controlled decryption, supports multiple QUIC implementations (Chromium QUIC, Facebooks mvfst, Cloudflares quiche), and introduces a novel image-based representation that enables machine learning-driven encrypted traffic analysis. The dataset includes standardized benchmarking tools, ensuring reproducibility. To demonstrate VisQUICs utility, we present a benchmarking task for estimating HTTP/3 responses in encrypted QUIC traffic, achieving 97% accuracy using only observable packet features. By publicly releasing VisQUIC, we provide an open foundation for advancing encrypted traffic analysis, QUIC security research, and network monitoring.

replace-cross Taxonomy Tree Generation from Citation Graph

Authors: Yuntong Hu, Zhuofeng Li, Zheng Zhang, Chen Ling, Raasikh Kanjiani, Boxin Zhao, Liang Zhao

Abstract: Constructing taxonomies from citation graphs is essential for organizing scientific knowledge, facilitating literature reviews, and identifying emerging research trends. However, manual taxonomy construction is labor-intensive, time-consuming, and prone to human biases, often overlooking pivotal but less-cited papers. In this paper, to enable automatic hierarchical taxonomy generation from citation graphs, we propose HiGTL (Hierarchical Graph Taxonomy Learning), a novel end-to-end framework guided by human-provided instructions or preferred topics. Specifically, we propose a hierarchical citation graph clustering method that recursively groups related papers based on both textual content and citation structure, ensuring semantically meaningful and structurally coherent clusters. Additionally, we develop a novel taxonomy node verbalization strategy that iteratively generates central concepts for each cluster, leveraging a pre-trained large language model (LLM) to maintain semantic consistency across hierarchical levels. To further enhance performance, we design a joint optimization framework that fine-tunes both the clustering and concept generation modules, aligning structural accuracy with the quality of generated taxonomies. Extensive experiments demonstrate that HiGTL effectively produces coherent, high-quality taxonomies.

replace-cross Training-free Diffusion Model Alignment with Sampling Demons

Authors: Po-Hung Yeh, Kuang-Huei Lee, Jun-Cheng Chen

Abstract: Aligning diffusion models with user preferences has been a key challenge. Existing methods for aligning diffusion models either require retraining or are limited to differentiable reward functions. To address these limitations, we propose a stochastic optimization approach, dubbed Demon, to guide the denoising process at inference time without backpropagation through reward functions or model retraining. Our approach works by controlling noise distribution in denoising steps to concentrate density on regions corresponding to high rewards through stochastic optimization. We provide comprehensive theoretical and empirical evidence to support and validate our approach, including experiments that use non-differentiable sources of rewards such as Visual-Language Model (VLM) APIs and human judgements. To the best of our knowledge, the proposed approach is the first inference-time, backpropagation-free preference alignment method for diffusion models. Our method can be easily integrated with existing diffusion models without further training. Our experiments show that the proposed approach significantly improves the average aesthetics scores for text-to-image generation. Implementation is available at https://github.com/aiiu-lab/DemonSampling.

URLs: https://github.com/aiiu-lab/DemonSampling.

replace-cross Round and Round We Go! What makes Rotary Positional Encodings useful?

Authors: Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veli\v{c}kovi\'c

Abstract: Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most popular types of encoding used today in LLMs are Rotary Positional Encodings (RoPE), that rotate the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma greatly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry semantic information. We mathematically prove interesting behaviours of RoPE and conduct experiments to verify our findings, proposing a modification of RoPE that fixes some highlighted issues and improves performance. We believe that this work represents an interesting step in better understanding PEs in LLMs, which we believe holds crucial value for scaling LLMs to large sizes and context lengths.

replace-cross Strategic Classification With Externalities

Authors: Safwan Hossain, Evi Micha, Yiling Chen, Ariel Procaccia

Abstract: We propose a new variant of the strategic classification problem: a principal reveals a classifier, and $n$ agents report their (possibly manipulated) features to be classified. Motivated by real-world applications, our model crucially allows the manipulation of one agent to affect another; that is, it explicitly captures inter-agent externalities. The principal-agent interactions are formally modeled as a Stackelberg game, with the resulting agent manipulation dynamics captured as a simultaneous game. We show that under certain assumptions, the pure Nash Equilibrium of this agent manipulation game is unique and can be efficiently computed. Leveraging this result, PAC learning guarantees are established for the learner: informally, we show that it is possible to learn classifiers that minimize loss on the distribution, even when a random number of agents are manipulating their way to a pure Nash Equilibrium. We also comment on the optimization of such classifiers through gradient-based approaches. This work sets the theoretical foundations for a more realistic analysis of classifiers that are robust against multiple strategic actors interacting in a common environment.

replace-cross Herald: A Natural Language Annotated Lean 4 Dataset

Authors: Guoxiong Gao, Yutong Wang, Jiedong Jiang, Qi Gao, Zihan Qin, Tianyi Xu, Bin Dong

Abstract: Verifiable formal languages like Lean have profoundly impacted mathematical reasoning, particularly through the use of large language models (LLMs) for automated reasoning. A significant challenge in training LLMs for these formal languages is the lack of parallel datasets that align natural language with formal language proofs. To address this challenge, this paper introduces a novel framework for translating the Mathlib4 corpus (a unified library of mathematics in formal language Lean 4) into natural language. Building upon this, we employ a dual augmentation strategy that combines tactic-based and informal-based approaches, leveraging the Lean-jixia system, a Lean 4 analyzer. We present the results of this pipeline on Mathlib4 as Herald (Hierarchy and Retrieval-based Translated Lean Dataset). We also propose the Herald Translator, which is fine-tuned on Herald. Herald translator achieves a 93.2% accuracy (Pass@128) on formalizing statements in the miniF2F-test and a 22.5% accuracy on our internal graduate-level textbook dataset, outperforming InternLM2-Math-Plus-7B (74.0% and 7.5%) and TheoremLlama (50.1% and 4.0%). Furthermore, we propose a section-level translation framework for real-world applications. As a direct application of Herald translator, we have successfully translated a template section in the Stack project, marking a notable progress in the automatic formalization of graduate-level mathematical literature. Our model, along with the datasets, are open-sourced to the public.

replace-cross Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace

Authors: Jinluan Yang, Anke Tang, Didi Zhu, Zhengyu Chen, Li Shen, Fei Wu

Abstract: Model merging has gained significant attention as a cost-effective approach to integrate multiple single-task fine-tuned models into a unified one that can perform well on multiple tasks. However, existing model merging techniques primarily focus on resolving conflicts between task-specific models, they often overlook potential security threats, particularly the risk of backdoor attacks in the open-source model ecosystem. In this paper, we first investigate the vulnerabilities of existing model merging methods to backdoor attacks, identifying two critical challenges: backdoor succession and backdoor transfer. To address these issues, we propose a novel Defense-Aware Merging (DAM) approach that simultaneously mitigates task interference and backdoor vulnerabilities. Specifically, DAM employs a meta-learning-based optimization method with dual masks to identify a shared and safety-aware subspace for model merging. These masks are alternately optimized: the Task-Shared mask identifies common beneficial parameters across tasks, aiming to preserve task-specific knowledge while reducing interference, while the Backdoor-Detection mask isolates potentially harmful parameters to neutralize security threats. This dual-mask design allows us to carefully balance the preservation of useful knowledge and the removal of potential vulnerabilities. Compared to existing merging methods, DAM achieves a more favorable balance between performance and security, reducing the attack success rate by 2-10 percentage points while sacrificing only about 1% in accuracy. Furthermore, DAM exhibits robust performance and broad applicability across various types of backdoor attacks and the number of compromised models involved in the merging process. Our codes and models are available at https://github.com/Yangjinluan/DAM.

URLs: https://github.com/Yangjinluan/DAM.

replace-cross Improving Neuron-level Interpretability with White-box Language Models

Authors: Hao Bai, Yi Ma

Abstract: Neurons in auto-regressive language models like GPT-2 can be interpreted by analyzing their activation patterns. Recent studies have shown that techniques such as dictionary learning, a form of post-hoc sparse coding, enhance this neuron-level interpretability. In our research, we are driven by the goal to fundamentally improve neural network interpretability by embedding sparse coding directly within the model architecture, rather than applying it as an afterthought. In our study, we introduce a white-box transformer-like architecture named Coding RAte TransformEr (CRATE), explicitly engineered to capture sparse, low-dimensional structures within data distributions. Our comprehensive experiments showcase significant improvements (up to 103% relative improvement) in neuron-level interpretability across a variety of evaluation metrics. Detailed investigations confirm that this enhanced interpretability is steady across different layers irrespective of the model size, underlining CRATE's robust performance in enhancing neural network interpretability. Further analysis shows that CRATE's increased interpretability comes from its enhanced ability to consistently and distinctively activate on relevant tokens. These findings point towards a promising direction for creating white-box foundation models that excel in neuron-level interpretation.

replace-cross High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws

Authors: M. Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, Samet Oymak

Abstract: A growing number of machine learning scenarios rely on knowledge distillation where one uses the output of a surrogate model as labels to supervise the training of a target model. In this work, we provide a sharp characterization of this process for ridgeless, high-dimensional regression, under two settings: (i) model shift, where the surrogate model is arbitrary, and (ii) distribution shift, where the surrogate model is the solution of empirical risk minimization with out-of-distribution data. In both cases, we characterize the precise risk of the target model through non-asymptotic bounds in terms of sample size and data distribution under mild conditions. As a consequence, we identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion. In the context of weak-to-strong (W2S) generalization, this has the interpretation that (i) W2S training, with the surrogate as the weak model, can provably outperform training with strong labels under the same data budget, but (ii) it is unable to improve the data scaling law. We validate our results on numerical experiments both on ridgeless regression and on neural network architectures.

replace-cross The Effects of Multi-Task Learning on ReLU Neural Network Functions

Authors: Julia Nakhleh, Joseph Shenouda, Robert D. Nowak

Abstract: This paper studies the properties of solutions to multi-task shallow ReLU neural network learning problems, wherein the network is trained to fit a dataset with minimal sum of squared weights. Remarkably, the solutions learned for each individual task resemble those obtained by solving a kernel regression problem, revealing a novel connection between neural networks and kernel methods. It is known that single-task neural network learning problems are equivalent to a minimum norm interpolation problem in a non-Hilbertian Banach space, and that the solutions of such problems are generally non-unique. In contrast, we prove that the solutions to univariate-input, multi-task neural network interpolation problems are almost always unique, and coincide with the solution to a minimum-norm interpolation problem in a Sobolev (Reproducing Kernel) Hilbert Space. We also demonstrate a similar phenomenon in the multivariate-input case; specifically, we show that neural network learning problems with large numbers of tasks are approximately equivalent to an $\ell^2$ (Hilbert space) minimization problem over a fixed kernel determined by the optimal neurons.

replace-cross Diffusion Models as Cartoonists: The Curious Case of High Density Regions

Authors: Rafa{\l} Karczewski, Markus Heinonen, Vikas Garg

Abstract: We investigate what kind of images lie in the high-density regions of diffusion models. We introduce a theoretical mode-tracking process capable of pinpointing the exact mode of the denoising distribution, and we propose a practical high-density sampler that consistently generates images of higher likelihood than usual samplers. Our empirical findings reveal the existence of significantly higher likelihood samples that typical samplers do not produce, often manifesting as cartoon-like drawings or blurry images depending on the noise level. Curiously, these patterns emerge in datasets devoid of such examples. We also present a novel approach to track sample likelihoods in diffusion SDEs, which remarkably incurs no additional computational cost.

replace-cross Expansion Quantization Network: An Efficient Micro-emotion Annotation and Detection Framework

Authors: Jingyi Zhou, Senlin Luo, Haofan Chen

Abstract: Text emotion detection constitutes a crucial foundation for advancing artificial intelligence from basic comprehension to the exploration of emotional reasoning. Most existing emotion detection datasets rely on manual annotations, which are associated with high costs, substantial subjectivity, and severe label imbalances. This is particularly evident in the inadequate annotation of micro-emotions and the absence of emotional intensity representation, which fail to capture the rich emotions embedded in sentences and adversely affect the quality of downstream task completion. By proposing an all-labels and training-set label regression method, we map label values to energy intensity levels, thereby fully leveraging the learning capabilities of machine models and the interdependencies among labels to uncover multiple emotions within samples. This led to the establishment of the Emotion Quantization Network (EQN) framework for micro-emotion detection and annotation. Using five commonly employed sentiment datasets, we conducted comparative experiments with various models, validating the broad applicability of our framework within NLP machine learning models. Based on the EQN framework, emotion detection and annotation are conducted on the GoEmotions dataset. A comprehensive comparison with the results from Google literature demonstrates that the EQN framework possesses a high capability for automatic detection and annotation of micro-emotions. The EQN framework is the first to achieve automatic micro-emotion annotation with energy-level scores, providing strong support for further emotion detection analysis and the quantitative research of emotion computing.

replace-cross Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation

Authors: Qiao Qiao, Yuepei Li, Qing Wang, Kang Zhou, Qi Li

Abstract: Knowledge graph completion (KGC) is a task of inferring missing triples based on existing Knowledge Graphs (KGs). Both structural and semantic information are vital for successful KGC. However, existing methods only use either the structural knowledge from the KG embeddings or the semantic information from pre-trained language models (PLMs), leading to suboptimal model performance. Moreover, since PLMs are not trained on KGs, directly using PLMs to encode triples may be inappropriate. To overcome these limitations, we propose a novel framework called Bridge, which jointly encodes structural and semantic information of KGs. Specifically, we strategically encode entities and relations separately by PLMs to better utilize the semantic knowledge of PLMs and enable structured representation learning via a structural learning principle. Furthermore, to bridge the gap between KGs and PLMs, we employ a self-supervised representation learning method called BYOL to fine-tune PLMs with two different views of a triple. Unlike BYOL, which uses augmentation methods to create two semantically similar views of the same image, potentially altering the semantic information. We strategically separate the triple into two parts to create different views, thus avoiding semantic alteration. Experiments demonstrate that Bridge outperforms the SOTA models on three benchmark datasets.

replace-cross LongSafety: Enhance Safety for Long-Context LLMs

Authors: Mianqiu Huang, Xiaoran Liu, Shaojun Zhou, Mozhi Zhang, Qipeng Guo, Linyang Li, Chenkun Tan, Yang Gao, Pengyu Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xipeng Qiu, Xuanjing Huang

Abstract: Recent advancements in model architectures and length extrapolation techniques have significantly extended the context length of large language models (LLMs), paving the way for their application in increasingly complex tasks. However, despite the growing capabilities of long-context LLMs, the safety issues in long-context scenarios remain underexplored. While safety alignment in short context has been widely studied, the safety concerns of long-context LLMs have not been adequately addressed. In this work, we introduce \textbf{LongSafety}, a comprehensive safety alignment dataset for long-context LLMs, containing 10 tasks and 17k samples, with an average length of 40.9k tokens. Our experiments demonstrate that training with LongSafety can enhance long-context safety performance while enhancing short-context safety and preserving general capabilities. Furthermore, we demonstrate that long-context safety does not equal long-context alignment with short-context safety data and LongSafety has generalizing capabilities in context length and long-context safety scenarios.

replace-cross CHAI for LLMs: Improving Code-Mixed Translation in Large Language Models through Reinforcement Learning with AI Feedback

Authors: Wenbo Zhang, Aditya Majumdar, Amulya Yadav

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various NLP tasks but struggle with code-mixed (or code-switched) language understanding. For example, prior work benchmarking the performance of multilingual LLMs on code-mixed translation tasks has demonstrated that current state-of-the-art multilingual LLMs are ineffective in dealing with code-mixed languages. However, the question of how to improve the capability of multilingual LLMs to handle code-mixed language has not received any attention to date. In this paper, we tackle this research gap by proposing CHAI, a novel general-purpose framework for improving the ability of multilingual LLMs to handle code-mixed languages. CHAI relies on three novel contributions made in this paper. First, we explore the ability of LLMs to provide accurate annotations for code-mixed translation tasks. Second, we leverage this ability of LLMs as annotators to generate preference data for code-mixed translation tasks at scale, which are then used within a reinforcement learning from AI feedback (RLAIF) procedure to improve LLMs' capability on code-mixed tasks. Third, we conduct a rigorous experimental evaluation across various real-world datasets and settings. Our analysis shows that CHAI-powered LLMs outperform state-of-the-art open-source LLMs by 25.66% (in terms of win rate adjudicated by human annotators) in code-mixed translation tasks. This work represents a first step towards developing more inclusive code-mixed LLMs.

replace-cross Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer

Authors: Shitong Shao, Zikai Zhou, Tian Ye, Lichen Bai, Zhiqiang Xu, Zeke Xie

Abstract: Text-to-image diffusion models (DMs) develop at an unprecedented pace, supported by thorough theoretical exploration and empirical analysis. Unfortunately, the discrepancy between DMs and autoregressive models (ARMs) complicates the path toward achieving the goal of unified vision and language generation. Recently, the masked generative Transformer (MGT) serves as a promising intermediary between DM and ARM by predicting randomly masked image tokens (i.e., masked image modeling), combining the efficiency of DM with the discrete token nature of ARM. However, we find that the comprehensive analyses regarding the inference for MGT are virtually non-existent, and thus we aim to present positive design choices to fill this gap. We propose and redesign a set of enhanced inference techniques tailored for MGT, providing a detailed analysis of their performance. Additionally, we explore several DM-based approaches aimed at accelerating the sampling process on MGT. Extensive experiments and empirical analyses on the recent SOTA MGT, such as MaskGIT and Meissonic lead to concrete and effective design choices, and these design choices can be merged to achieve further performance gains. For instance, in terms of enhanced inference, we achieve winning rates of approximately 70% compared to vanilla sampling on HPS v2 with Meissonic-1024x1024.

replace-cross Student-Informed Teacher Training

Authors: Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza

Abstract: Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limited observations, e.g., in a robot navigation task, the teacher might have access to distances to nearby obstacles, while the student only receives visual observations of the scene. However, privileged imitation learning faces a key challenge: the student might be unable to imitate the teacher's behavior due to partial observability. This problem arises because the teacher is trained without considering if the student is capable of imitating the learned behavior. To address this teacher-student asymmetry, we propose a framework for joint training of the teacher and student policies, encouraging the teacher to learn behaviors that can be imitated by the student despite the latters' limited access to information and its partial observability. Based on the performance bound in imitation learning, we add (i) the approximated action difference between teacher and student as a penalty term to the reward function of the teacher, and (ii) a supervised teacher-student alignment step. We motivate our method with a maze navigation task and demonstrate its effectiveness on complex vision-based quadrotor flight and manipulation tasks.

replace-cross Improved Physics-informed neural networks loss function regularization with a variance-based term

Authors: John M. Hanna, Hugues Talbot, Irene E. Vignon-Clementel

Abstract: In machine learning and statistical modeling, the mean square or absolute error is commonly used as an error metric, also called a "loss function." While effective in reducing the average error, this approach may fail to address localized outliers, leading to significant inaccuracies in regions with sharp gradients or discontinuities. This issue is particularly evident in physics-informed neural networks (PINNs), where such localized errors are expected and affect the overall solution. To overcome this limitation, we propose a novel loss function that combines the mean and the standard deviation of the chosen error metric. By minimizing this combined loss function, the method ensures a more uniform error distribution and reduces the impact of localized high-error regions. The proposed loss function is easy to implement and tested on problems of varying complexity: the 1D Poisson equation, the unsteady Burgers' equation, 2D linear elastic solid mechanics, and 2D steady Navier-Stokes equations. Results demonstrate improved solution quality and lower maximum error compared to the standard mean-based loss, with minimal impact on computational time.

replace-cross A stochastic first-order method with multi-extrapolated momentum for highly smooth unconstrained optimization

Authors: Chuan He

Abstract: In this paper, we consider an unconstrained stochastic optimization problem where the objective function exhibits high-order smoothness. Specifically, we propose a new stochastic first-order method (SFOM) with multi-extrapolated momentum, in which multiple extrapolations are performed in each iteration, followed by a momentum update based on these extrapolations. We demonstrate that the proposed SFOM can accelerate optimization by exploiting the high-order smoothness of the objective function $f$. Assuming that the $p$th-order derivative of $f$ is Lipschitz continuous for some $p\ge2$, and under additional mild assumptions, we establish that our method achieves a sample complexity of $\widetilde{\mathcal{O}}(\epsilon^{-(3p+1)/p})$ for finding a point $x$ such that $\mathbb{E}[\|\nabla f(x)\|]\le\epsilon$. To the best of our knowledge, this is the first SFOM to leverage arbitrary-order smoothness of the objective function for acceleration, resulting in a sample complexity that improves upon the best-known results without assuming the mean-squared smoothness condition. Preliminary numerical experiments validate the practical performance of our method and support our theoretical findings.

replace-cross Detection Augmented Bandit Procedures for Piecewise Stationary MABs: A Modular Approach

Authors: Yu-Han Huang, Argyrios Gerogiannis, Subhonmesh Bose, Venugopal V. Veeravalli

Abstract: Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary environments, where the reward distributions associated with the arms do not change with time. In many applications, however, the environment is more accurately modeled as being nonstationary. In this work, piecewise stationary MAB (PS-MAB) environments are investigated, in which the reward distributions associated with a subset of the arms change at some change-points and remain stationary between change-points. Our focus is on the asymptotic analysis of PS-MABs, for which practical algorithms based on change detection (CD) have been previously proposed. Our goal is to modularize the design and analysis of such CD-based Bandit (CDB) procedures. To this end, we identify the requirements for stationary bandit algorithms and change detectors in a CDB procedure that are needed for the modularization. We assume that the rewards are sub-Gaussian. Under this assumption and a condition on the separation of the change-points, we show that the analysis of CDB procedures can indeed be modularized, so that regret bounds can be obtained in a unified manner for various combinations of change detectors and bandit algorithms. Through this analysis, we develop new modular CDB procedures that are order-optimal. We compare the performance of our modular CDB procedures with various other methods in simulations.

replace-cross Do generative video models understand physical principles?

Authors: Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, Robert Geirhos

Abstract: AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics -- or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.

URLs: https://physics-iq.github.io;, https://github.com/google-deepmind/physics-IQ-benchmark.

replace-cross Dreamweaver: Learning Compositional World Models from Pixels

Authors: Junyeob Baek, Yi-Fu Wu, Gautam Singh, Sungjin Ahn

Abstract: Humans have an innate ability to decompose their perceptions of the world into objects and their attributes, such as colors, shapes, and movement patterns. This cognitive process enables us to imagine novel futures by recombining familiar concepts. However, replicating this ability in artificial intelligence systems has proven challenging, particularly when it comes to modeling videos into compositional concepts and generating unseen, recomposed futures without relying on auxiliary data, such as text, masks, or bounding boxes. In this paper, we propose Dreamweaver, a neural architecture designed to discover hierarchical and compositional representations from raw videos and generate compositional future simulations. Our approach leverages a novel Recurrent Block-Slot Unit (RBSU) to decompose videos into their constituent objects and attributes. In addition, Dreamweaver uses a multi-future-frame prediction objective to capture disentangled representations for dynamic concepts more effectively as well as static concepts. In experiments, we demonstrate our model outperforms current state-of-the-art baselines for world modeling when evaluated under the DCI framework across multiple datasets. Furthermore, we show how the modularized concept representations of our model enable compositional imagination, allowing the generation of novel videos by recombining attributes from previously seen objects. cun-bjy.github.io/dreamweaver-website

replace-cross Learning to Double Guess: An Active Perception Approach for Estimating the Center of Mass of Arbitrary Objects

Authors: Shengmiao Jin, Yuchen Mo, Wenzhen Yuan

Abstract: Manipulating arbitrary objects in unstructured environments is a significant challenge in robotics, primarily due to difficulties in determining an object's center of mass. This paper introduces U-GRAPH: Uncertainty-Guided Rotational Active Perception with Haptics, a novel framework to enhance the center of mass estimation using active perception. Traditional methods often rely on single interaction and are limited by the inherent inaccuracies of Force-Torque (F/T) sensors. Our approach circumvents these limitations by integrating a Bayesian Neural Network (BNN) to quantify uncertainty and guide the robotic system through multiple, information-rich interactions via grid search and a neural network that scores each action. We demonstrate the remarkable generalizability and transferability of our method with training on a small dataset with limited variation yet still perform well on unseen complex real-world objects.

replace-cross MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding

Authors: Pengyi Li, Irina Abdullaeva, Alexander Gambashidze, Andrey Kuznetsov, Ivan Oseledets

Abstract: Modern Video Large Language Models (VLLMs) often rely on uniform frame sampling for video understanding, but this approach frequently fails to capture critical information due to frame redundancy and variations in video content. We propose MaxInfo, a training-free method based on the maximum volume principle, which selects and retains the most representative frames from the input video. By maximizing the geometric volume formed by selected embeddings, MaxInfo ensures that the chosen frames cover the most informative regions of the embedding space, effectively reducing redundancy while preserving diversity. This method enhances the quality of input representations and improves long video comprehension performance across benchmarks. For instance, MaxInfo achieves a 3.28% improvement on LongVideoBench and a 6.4% improvement on EgoSchema for LLaVA-Video-7B. It also achieves a 3.47% improvement for LLaVA-Video-72B. The approach is simple to implement and works with existing VLLMs without the need for additional training, making it a practical and effective alternative to traditional uniform sampling methods.

replace-cross Inversion of Magnetic Data using Learned Dictionaries and Scale Space

Authors: Shadab Ahamed, Simon Ghyselincks, Pablo Chang Huang Arias, Julian Kloiber, Yasin Ranjbar, Jingrong Tang, Niloufar Zakariaei, Eldad Haber

Abstract: Magnetic data inversion is an important tool in geophysics, used to infer subsurface magnetic susceptibility distributions from surface magnetic field measurements. This inverse problem is inherently ill-posed, characterized by non-unique solutions, depth ambiguity, and sensitivity to noise. Traditional inversion approaches rely on predefined regularization techniques to stabilize solutions, limiting their adaptability to complex or diverse geological scenarios. In this study, we propose an approach that integrates variable dictionary learning and scale-space methods to address these challenges. Our method employs learned dictionaries, allowing for adaptive representation of complex subsurface features that are difficult to capture with predefined bases. Additionally, we extend classical variational inversion by incorporating multi-scale representations through a scale-space framework, enabling the progressive introduction of structural detail while mitigating overfitting. We implement both fixed and dynamic dictionary learning techniques, with the latter introducing iteration-dependent dictionaries for enhanced flexibility. Using a synthetic dataset to simulate geological scenarios, we demonstrate significant improvements in reconstruction accuracy and robustness compared to conventional variational and dictionary-based methods. Our results highlight the potential of learned dictionaries, especially when coupled with scale-space dynamics, to improve model recovery and noise handling. These findings underscore the promise of our data-driven approach for advance magnetic data inversion and its applications in geophysical exploration, environmental assessment, and mineral prospecting. The code is publicly available at: https://github.com/ahxmeds/magnetic-inversion-dictionary.git.

URLs: https://github.com/ahxmeds/magnetic-inversion-dictionary.git.

replace-cross A Survey on Video Analytics in Cloud-Edge-Terminal Collaborative Systems

Authors: Linxiao Gong, Hao Yang, Gaoyun Fang, Bobo Ju, Juncen Guo, Xiaoguang Zhu, Xiping Hu, Yan Wang, Peng Sun, Azzedine Boukerche

Abstract: The explosive growth of video data has driven the development of distributed video analytics in cloud-edge-terminal collaborative (CETC) systems, enabling efficient video processing, real-time inference, and privacy-preserving analysis. Among multiple advantages, CETC systems can distribute video processing tasks and enable adaptive analytics across cloud, edge, and terminal devices, leading to breakthroughs in video surveillance, autonomous driving, and smart cities. In this survey, we first analyze fundamental architectural components, including hierarchical, distributed, and hybrid frameworks, alongside edge computing platforms and resource management mechanisms. Building upon these foundations, edge-centric approaches emphasize on-device processing, edge-assisted offloading, and edge intelligence, while cloud-centric methods leverage powerful computational capabilities for complex video understanding and model training. Our investigation also covers hybrid video analytics incorporating adaptive task offloading and resource-aware scheduling techniques that optimize performance across the entire system. Beyond conventional approaches, recent advances in large language models and multimodal integration reveal both opportunities and challenges in platform scalability, data protection, and system reliability. Future directions also encompass explainable systems, efficient processing mechanisms, and advanced video analytics, offering valuable insights for researchers and practitioners in this dynamic field.

replace-cross Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng

Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

replace-cross Graceful forgetting: Memory as a process

Authors: Alain de Cheveign\'e

Abstract: A rational theory of memory is proposed to explain how we can accommodate unbounded sensory input within bounded storage space. Memory is stored as statistics, organized into complex structures that are constantly summarized and compressed to make room for new input. This process, driven by space constraints, is guided by heuristics that optimize the memory for future needs. Sensory input is rapidly encoded as simple statistics that are more slowly elaborated into more abstract constructs. This theory differs from previous accounts of memory by (a) its reliance on statistics, (b) its use of heuristics to guide the choice of statistics, and (c) the emphasis on memory as a process that is intensive, complex, and expensive. The theory is intended as an aid to make sense of our extensive knowledge of memory, and bring us closer to an understanding of memory in functional and mechanistic terms.

replace-cross NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation

Authors: Zhiyuan Liu, Yanchen Luo, Han Huang, Enzhi Zhang, Sihang Li, Junfeng Fang, Yaorui Shi, Xiang Wang, Kenji Kawaguchi, Tat-Seng Chua

Abstract: 3D molecule generation is crucial for drug discovery and material design. While prior efforts focus on 3D diffusion models for their benefits in modeling continuous 3D conformers, they overlook the advantages of 1D SELFIES-based Language Models (LMs), which can generate 100% valid molecules and leverage the billion-scale 1D molecule datasets. To combine these advantages for 3D molecule generation, we propose a foundation model -- NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation. NExT-Mol uses an extensively pretrained molecule LM for 1D molecule generation, and subsequently predicts the generated molecule's 3D conformers with a 3D diffusion model. We enhance NExT-Mol's performance by scaling up the LM's model size, refining the diffusion neural architecture, and applying 1D to 3D transfer learning. Notably, our 1D molecule LM significantly outperforms baselines in distributional similarity while ensuring validity, and our 3D diffusion model achieves leading performances in conformer prediction. Given these improvements in 1D and 3D modeling, NExT-Mol achieves a 26% relative improvement in 3D FCD for de novo 3D generation on GEOM-DRUGS, and a 13% average relative gain for conditional 3D generation on QM9-2014. Our codes and pretrained checkpoints are available at https://github.com/acharkq/NExT-Mol.

URLs: https://github.com/acharkq/NExT-Mol.

replace-cross LUME: LLM Unlearning with Multitask Evaluations

Authors: Anil Ramakrishna, Yixin Wan, Xiaomeng Jin, Kai-Wei Chang, Zhiqi Bu, Bhanukiran Vinzamuri, Volkan Cevher, Mingyi Hong, Rahul Gupta

Abstract: Unlearning aims to remove copyrighted, sensitive, or private content from large language models (LLMs) without a full retraining. In this work, we develop a multi-task unlearning benchmark (LUME) which features three tasks: (1) unlearn synthetically generated creative short novels, (2) unlearn synthetic biographies with sensitive information, and (3) unlearn a collection of public biographies. We further release two fine-tuned LLMs of 1B and 7B parameter sizes as the target models. We conduct detailed evaluations of several recently proposed unlearning algorithms and present results on carefully crafted metrics to understand their behavior and limitations.

replace-cross CoT-ICL Lab: A Petri Dish for Studying Chain-of-Thought Learning from In-Context Demonstrations

Authors: Vignesh Kothapalli, Hamed Firooz, Maziar Sanjabi

Abstract: We introduce CoT-ICL Lab, a framework and methodology to generate synthetic tokenized datasets and systematically study chain-of-thought (CoT) in-context learning (ICL) in language models. CoT-ICL Lab allows fine grained control over the complexity of in-context examples by decoupling (1) the causal structure involved in chain token generation from (2) the underlying token processing functions. We train decoder-only transformers (up to 700M parameters) on these datasets and show that CoT accelerates the accuracy transition to higher values across model sizes. In particular, we find that model depth is crucial for leveraging CoT with limited in-context examples, while more examples help shallow models match deeper model performance. Additionally, limiting the diversity of token processing functions throughout training improves causal structure learning via ICL. We also interpret these transitions by analyzing transformer embeddings and attention maps. Overall, CoT-ICL Lab serves as a simple yet powerful testbed for theoretical and empirical insights into ICL and CoT in language models.

replace-cross PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

Authors: Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan G\'omez-Luna, Huawei Li, Xiaowei Li, Ying Wang, Onur Mutlu

Abstract: Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM model relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) one-size-fits-all approach of designing PIM units inefficient due to a large degree of heterogeneity even in memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly-used LLMs show that PAPI achieves 1.8$\times$ and 11.1$\times$ speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.

replace-cross Image Fusion for Cross-Domain Sequential Recommendation

Authors: Wangyu Wu, Siqi Song, Xianglin Qiu, Xiaowei Huang, Fei Ma, Jimin Xiao

Abstract: Cross-Domain Sequential Recommendation (CDSR) aims to predict future user interactions based on historical interactions across multiple domains. The key challenge in CDSR is effectively capturing cross-domain user preferences by fully leveraging both intra-sequence and inter-sequence item interactions. In this paper, we propose a novel method, Image Fusion for Cross-Domain Sequential Recommendation (IFCDSR), which incorporates item image information to better capture visual preferences. Our approach integrates a frozen CLIP model to generate image embeddings, enriching original item embeddings with visual data from both intra-sequence and inter-sequence interactions. Additionally, we employ a multiple attention layer to capture cross-domain interests, enabling joint learning of single-domain and cross-domain user preferences. To validate the effectiveness of IFCDSR, we re-partitioned four e-commerce datasets and conducted extensive experiments. Results demonstrate that IFCDSR significantly outperforms existing methods.

replace-cross TLOB: A Novel Transformer Model with Dual Attention for Stock Price Trend Prediction with Limit Order Book Data

Authors: Leonardo Berti, Gjergji Kasneci

Abstract: Stock Price Trend Prediction (SPTP) based on Limit Order Book (LOB) data is a fundamental challenge in financial markets. Despite advances in deep learning, existing models fail to generalize across different market conditions and struggle to reliably predict short-term trends. Surprisingly, by adapting a simple MLP-based architecture to LOB, we show that we surpass SoTA performance; thus, challenging the necessity of complex architectures. Unlike past work that shows robustness issues, we propose TLOB, a transformer-based model that uses a dual attention mechanism to capture spatial and temporal dependencies in LOB data. This allows it to adaptively focus on the market microstructure, making it particularly effective for longer-horizon predictions and volatile market conditions. We also introduce a new labeling method that improves on previous ones, removing the horizon bias. We evaluate TLOB's effectiveness using the established FI-2010 benchmark, which exceeds the state-of-the-art by an average of 3.7 F1-score(\%). Additionally, TLOB shows improvements on Tesla and Intel with a 1.3 and 7.7 increase in F1-score(\%), respectively. Additionally, we empirically show how stock price predictability has declined over time (-6.68 absolute points in F1-score(\%)), highlighting the growing market efficiencies. Predictability must be considered in relation to transaction costs, so we experimented with defining trends using an average spread, reflecting the primary transaction cost. The resulting performance deterioration underscores the complexity of translating trend classification into profitable trading strategies. We argue that our work provides new insights into the evolving landscape of stock price trend prediction and sets a strong foundation for future advancements in financial AI. We release the code at https://github.com/LeonardoBerti00/TLOB.

URLs: https://github.com/LeonardoBerti00/TLOB.

replace-cross OptionZero: Planning with Learned Options

Authors: Po-Wei Huang, Pei-Chiun Peng, Hung Guei, Ti-Rong Wu

Abstract: Planning with options -- a sequence of primitive actions -- has been shown effective in reinforcement learning within complex environments. Previous studies have focused on planning with predefined options or learned options through expert demonstration data. Inspired by MuZero, which learns superhuman heuristics without any human knowledge, we propose a novel approach, named OptionZero. OptionZero incorporates an option network into MuZero, providing autonomous discovery of options through self-play games. Furthermore, we modify the dynamics network to provide environment transitions when using options, allowing searching deeper under the same simulation constraints. Empirical experiments conducted in 26 Atari games demonstrate that OptionZero outperforms MuZero, achieving a 131.58% improvement in mean human-normalized score. Our behavior analysis shows that OptionZero not only learns options but also acquires strategic skills tailored to different game characteristics. Our findings show promising directions for discovering and using options in planning. Our code is available at https://rlg.iis.sinica.edu.tw/papers/optionzero.

URLs: https://rlg.iis.sinica.edu.tw/papers/optionzero.

replace-cross Multi-View Contrastive Network (MVCNet) for Motor Imagery Classification

Authors: Ziwei Wang, Siyang Li, Xiaoqing Chen, Wei Li, Dongrui Wu

Abstract: Objective: An electroencephalography (EEG)-based brain-computer interface (BCI) serves as a direct communication pathway between the human brain and an external device. While supervised learning has been extensively explored for motor imagery (MI) EEG classification, small data quantity has been a key factor limiting the performance of deep feature learning. Methods: This paper proposes a knowledge-driven time-space-frequency based multi-view contrastive network (MVCNet) for MI EEG decoding in BCIs. MVCNet integrates knowledge from the time, space, and frequency domains into the training process through data augmentations from multiple views, fostering more discriminative feature learning of the characteristics of EEG data. We introduce a cross-view contrasting module to learn from different augmented views and a cross-model contrasting module to enhance the consistency of features extracted between knowledge-guided and data-driven models. Results: The combination of EEG data augmentation strategies was systematically investigated for more informative supervised contrastive learning. Experiments on four public MI datasets and three different architectures demonstrated that MVCNet outperformed 10 existing approaches. Significance: Our approach can significantly boost EEG classification performance beyond designated networks, showcasing the potential to enhance the feature learning process for better EEG decoding.

replace-cross Say Less, Mean More: Leveraging Pragmatics in Retrieval-Augmented Generation

Authors: Haris Riaz, Ellen Riloff, Mihai Surdeanu

Abstract: We propose a simple, unsupervised method that injects pragmatic principles in retrieval-augmented generation (RAG) frameworks such as Dense Passage Retrieval to enhance the utility of retrieved contexts. Our approach first identifies which sentences in a pool of documents retrieved by RAG are most relevant to the question at hand, cover all the topics addressed in the input question and no more, and then highlights these sentences within their context, before they are provided to the LLM, without truncating or altering the context in any other way. We show that this simple idea brings consistent improvements in experiments on three question answering tasks (ARC-Challenge, PubHealth and PopQA) using five different LLMs. It notably enhances relative accuracy by up to 19.7% on PubHealth and 10% on ARC-Challenge compared to a conventional RAG system.

replace-cross Spatial-RAG: Spatial Retrieval Augmented Generation for Real-World Spatial Reasoning Questions

Authors: Dazhou Yu, Riyang Bao, Gengchen Mai, Liang Zhao

Abstract: Spatial reasoning remains a challenge for Large Language Models (LLMs), which struggle with spatial data retrieval and reasoning. We propose Spatial Retrieval-Augmented Generation (Spatial-RAG), a framework that extends RAG to spatial tasks by integrating sparse spatial retrieval (spatial databases) and dense semantic retrieval (LLM-based similarity). A multi-objective ranking strategy balances spatial constraints and semantic relevance, while an LLM-guided generator ensures coherent responses. Experiments on a real-world tourism dataset show that Spatial-RAG significantly improves spatial question answering, bridging the gap between LLMs and spatial intelligence.

replace-cross Kanana: Compute-efficient Bilingual Language Models

Authors: Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo

Abstract: We introduce Kanana, a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters with 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.

replace-cross Forecasting intermittent time series with Gaussian Processes and Tweedie likelihood

Authors: Stefano Damato, Dario Azzimonti, Giorgio Corani

Abstract: We introduce the use of Gaussian Processes (GPs) for the probabilistic forecasting of intermittent time series. The model is trained in a Bayesian framework that accounts for the uncertainty about the latent function and marginalizes it out when making predictions. We couple the latent GP variable with two types of forecast distributions: the negative binomial (NegBinGP) and the Tweedie distribution (TweedieGP). While the negative binomial has already been used in forecasting intermittent time series, this is the first time in which a fully parameterized Tweedie density is used for intermittent time series. We properly evaluate the Tweedie density, which is both zero-inflated and heavy tailed, avoiding simplifying assumptions made in existing models. We test our models on thousands of intermittent count time series. Results show that our models provide consistently better probabilistic forecasts than the competitors. In particular, TweedieGP obtains the best estimates of the highest quantiles, thus showing that it is more flexible than NegBinGP.

replace-cross RetinaRegen: A Hybrid Model for Readability and Detail Restoration in Fundus Images

Authors: Yuhan Tang, Yudian Wang, Weizhen Li, Ye Yue, Chengchang Pan, Honggang Qi

Abstract: Fundus image quality is crucial for diagnosing eye diseases, but real-world conditions often result in blurred or unreadable images, increasing diagnostic uncertainty. To address these challenges, this study proposes RetinaRegen, a hybrid model for retinal image restoration that integrates a readability classifi-cation model, a Diffusion Model, and a Variational Autoencoder (VAE). Ex-periments on the SynFundus-1M dataset show that the proposed method achieves a PSNR of 27.4521, an SSIM of 0.9556, and an LPIPS of 0.1911 for the readability labels of the optic disc (RO) region. These results demonstrate superior performance in restoring key regions, offering an effective solution to enhance fundus image quality and support clinical diagnosis.