new QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

Authors: Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Abstract: Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups due to inefficient KV cache optimization strategies and result in low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates ($>$90%) and reliably provides consistent end-to-end speedups upto $\sim2.5\times$, outperforming other self-speculative decoding methods that use sparse KV cache for long-context LLM inference. QuantSpec also reduces the memory requirements by $\sim 1.3\times$ compared to these alternatives.

new Real Time Control of Tandem-Wing Experimental Platform Using Concerto Reinforcement Learning

Authors: Zhang Minghao, Yang Xiaojun, Wang Zhihe, Wang Liang

Abstract: This paper introduces the CRL2RT algorithm, an advanced reinforcement learning method aimed at improving the real-time control performance of the Direct-Drive Tandem-Wing Experimental Platform (DDTWEP). Inspired by dragonfly flight, DDTWEP's tandem wing structure causes nonlinear and unsteady aerodynamic interactions, leading to complex load behaviors during pitch, roll, and yaw maneuvers. These complexities challenge stable motion control at high frequencies (2000 Hz). To overcome these issues, we developed the CRL2RT algorithm, which combines classical control elements with reinforcement learning-based controllers using a time-interleaved architecture and a rule-based policy composer. This integration ensures finite-time convergence and single-life adaptability. Experimental results under various conditions, including different flapping frequencies and yaw disturbances, show that CRL2RT achieves a control frequency surpassing 2500 Hz on standard CPUs. Additionally, when integrated with classical controllers like PID, Adaptive PID, and Model Reference Adaptive Control (MRAC), CRL2RT enhances tracking performance by 18.3% to 60.7%. These findings demonstrate CRL2RT's broad applicability and superior performance in complex real-time control scenarios, validating its effectiveness in overcoming existing control strategy limitations and advancing robust, efficient real-time control for biomimetic aerial vehicles.

new Leveraging Constraint Violation Signals For Action-Constrained Reinforcement Learning

Authors: Janaka Chathuranga Brahmanage, Jiajing Ling, Akshat Kumar

Abstract: In many RL applications, ensuring an agent's actions adhere to constraints is crucial for safety. Most previous methods in Action-Constrained Reinforcement Learning (ACRL) employ a projection layer after the policy network to correct the action. However projection-based methods suffer from issues like the zero gradient problem and higher runtime due to the usage of optimization solvers. Recently methods were proposed to train generative models to learn a differentiable mapping between latent variables and feasible actions to address this issue. However, generative models require training using samples from the constrained action space, which itself is challenging. To address such limitations, first, we define a target distribution for feasible actions based on constraint violation signals, and train normalizing flows by minimizing the KL divergence between an approximated distribution over feasible actions and the target. This eliminates the need to generate feasible action samples, greatly simplifying the flow model learning. Second, we integrate the learned flow model with existing deep RL methods, which restrict it to exploring only the feasible action space. Third, we extend our approach beyond ACRL to handle state-wise constraints by learning the constraint violation signal from the environment. Empirically, our approach has significantly fewer constraint violations while achieving similar or better quality in several control tasks than previous best methods.

new Analysis of Overparameterization in Continual Learning under a Linear Model

Authors: Daniel Goldfarb, Paul Hand

Abstract: Autonomous machine learning systems that learn many tasks in sequence are prone to the catastrophic forgetting problem. Mathematical theory is needed in order to understand the extent of forgetting during continual learning. As a foundational step towards this goal, we study continual learning and catastrophic forgetting from a theoretical perspective in the simple setting of gradient descent with no explicit algorithmic mechanism to prevent forgetting. In this setting, we analytically demonstrate that overparameterization alone can mitigate forgetting in the context of a linear regression model. We consider a two-task setting motivated by permutation tasks, and show that as the overparameterization ratio becomes sufficiently high, a model trained on both tasks in sequence results in a low-risk estimator for the first task. As part of this work, we establish a non-asymptotic bound of the risk of a single linear regression task, which may be of independent interest to the field of double descent theory.

new One Class Restricted Kernel Machines

Authors: A. Quadir, M. Sajid, M. Tanveer

Abstract: Restricted kernel machines (RKMs) have demonstrated a significant impact in enhancing generalization ability in the field of machine learning. Recent studies have introduced various methods within the RKM framework, combining kernel functions with the least squares support vector machine (LSSVM) in a manner similar to the energy function of restricted boltzmann machines (RBM), such that a better performance can be achieved. However, RKM's efficacy can be compromised by the presence of outliers and other forms of contamination within the dataset. These anomalies can skew the learning process, leading to less accurate and reliable outcomes. To address this critical issue and to ensure the robustness of the model, we propose the novel one-class RKM (OCRKM). In the framework of OCRKM, we employ an energy function akin to that of the RBM, which integrates both visible and hidden variables in a nonprobabilistic setting. The formulation of the proposed OCRKM facilitates the seamless integration of one-class classification method with the RKM, enhancing its capability to detect outliers and anomalies effectively. The proposed OCRKM model is evaluated over UCI benchmark datasets. Experimental findings and statistical analyses consistently emphasize the superior generalization capabilities of the proposed OCRKM model over baseline models across all scenarios.

new Evaluating and Explaining Earthquake-Induced Liquefaction Potential through Multi-Modal Transformers

Authors: Sompote Youwai, Tipok Kitkobsin, Sutat Leelataviwat, Pornkasem Jongpradist

Abstract: This study presents an explainable parallel transformer architecture for soil liquefaction prediction that integrates three distinct data streams: spectral seismic encoding, soil stratigraphy tokenization, and site-specific features. The architecture processes data from 165 case histories across 11 major earthquakes, employing Fast Fourier Transform for seismic waveform encoding and principles from large language models for soil layer tokenization. Interpretability is achieved through SHapley Additive exPlanations (SHAP), which decompose predictions into individual contributions from seismic characteristics, soil properties, and site conditions. The model achieves 93.75% prediction accuracy on cross-regional validation sets and demonstrates robust performance through sensitivity analysis of ground motion intensity and soil resistance parameters. Notably, validation against previously unseen ground motion data from the 2024 Noto Peninsula earthquake confirms the model's generalization capabilities and practical utility. Implementation as a publicly accessible web application enables rapid assessment of multiple sites simultaneously. This approach establishes a new framework in geotechnical deep learning where sophisticated multi-modal analysis meets practical engineering requirements through quantitative interpretation and accessible deployment.

new FlexControl: Computation-Aware ControlNet with Differentiable Router for Text-to-Image Generation

Authors: Zheng Fang, Lichuan Xiang, Xu Cai, Kaicheng Zhou, Hongkai Wen

Abstract: ControlNet offers a powerful way to guide diffusion-based generative models, yet most implementations rely on ad-hoc heuristics to choose which network blocks to control-an approach that varies unpredictably with different tasks. To address this gap, we propose FlexControl, a novel framework that copies all diffusion blocks during training and employs a trainable gating mechanism to dynamically select which blocks to activate at each denoising step. With introducing a computation-aware loss, we can encourage control blocks only to activate when it benefit the generation quality. By eliminating manual block selection, FlexControl enhances adaptability across diverse tasks and streamlines the design pipeline, with computation-aware training loss in an end-to-end training manner. Through comprehensive experiments on both UNet (e.g., SD1.5) and DiT (e.g., SD3.0), we show that our method outperforms existing ControlNet variants in certain key aspects of interest. As evidenced by both quantitative and qualitative evaluations, FlexControl preserves or enhances image fidelity while also reducing computational overhead by selectively activating the most relevant blocks. These results underscore the potential of a flexible, data-driven approach for controlled diffusion and open new avenues for efficient generative model design.

new Quaternion-Hadamard Network: A Novel Defense Against Adversarial Attacks with a New Dataset

Authors: Vladimir Frants, Sos Agaian

Abstract: This paper addresses the vulnerability of deep-learning models designed for rain, snow, and haze removal. Despite enhancing image quality in adverse weather, these models are susceptible to adversarial attacks that compromise their effectiveness. Traditional defenses such as adversarial training and model distillation often require extensive retraining, making them costly and impractical for real-world deployment. While denoising and super-resolution techniques can aid image classification models, they impose high computational demands and introduce visual artifacts that hinder image processing tasks. We propose a model-agnostic defense against first-order white-box adversarial attacks using the Quaternion-Hadamard Network (QHNet) to tackle these challenges. White-box attacks are particularly difficult to defend against since attackers have full access to the model's architecture, weights, and training procedures. Our defense introduces the Quaternion Hadamard Denoising Convolutional Block (QHDCB) and the Quaternion Denoising Residual Block (QDRB), leveraging polynomial thresholding. QHNet incorporates these blocks within an encoder-decoder architecture, enhanced by feature refinement, to effectively neutralize adversarial noise. Additionally, we introduce the Adversarial Weather Conditions Vision Dataset (AWCVD), created by applying first-order gradient attacks on state-of-the-art weather removal techniques in scenarios involving haze, rain streaks, and snow. Using PSNR and SSIM metrics, we demonstrate that QHNet significantly enhances the robustness of low-level computer vision models against adversarial attacks compared with state-of-the-art denoising and super-resolution techniques. The source code and dataset will be released alongside the final version of this paper.

new One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs

Authors: Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu

Abstract: Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLMs research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs' ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, CounterMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that CounterMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives on the community of mathematical LLMs.

new E2LVLM:Evidence-Enhanced Large Vision-Language Model for Multimodal Out-of-Context Misinformation Detection

Authors: Junjie Wu, Yumeng Fu, Nan Yu, Guohong Fu

Abstract: Recent studies in Large Vision-Language Models (LVLMs) have demonstrated impressive advancements in multimodal Out-of-Context (OOC) misinformation detection, discerning whether an authentic image is wrongly used in a claim. Despite their success, the textual evidence of authentic images retrieved from the inverse search is directly transmitted to LVLMs, leading to inaccurate or false information in the decision-making phase. To this end, we present E2LVLM, a novel evidence-enhanced large vision-language model by adapting textual evidence in two levels. First, motivated by the fact that textual evidence provided by external tools struggles to align with LVLMs inputs, we devise a reranking and rewriting strategy for generating coherent and contextually attuned content, thereby driving the aligned and effective behavior of LVLMs pertinent to authentic images. Second, to address the scarcity of news domain datasets with both judgment and explanation, we generate a novel OOC multimodal instruction-following dataset by prompting LVLMs with informative content to acquire plausible explanations. Further, we develop a multimodal instruction-tuning strategy with convincing explanations for beyond detection. This scheme contributes to E2LVLM for multimodal OOC misinformation detection and explanation. A multitude of experiments demonstrate that E2LVLM achieves superior performance than state-of-the-art methods, and also provides compelling rationales for judgments.

new Deep Reinforcement Learning-Based User Scheduling for Collaborative Perception

Authors: Yandi Liu, Guowei Liu, Le Liang, Hao Ye, Chongtao Guo, Shi Jin

Abstract: Stand-alone perception systems in autonomous driving suffer from limited sensing ranges and occlusions at extended distances, potentially resulting in catastrophic outcomes. To address this issue, collaborative perception is envisioned to improve perceptual accuracy by using vehicle-to-everything (V2X) communication to enable collaboration among connected and autonomous vehicles and roadside units. However, due to limited communication resources, it is impractical for all units to transmit sensing data such as point clouds or high-definition video. As a result, it is essential to optimize the scheduling of communication links to ensure efficient spectrum utilization for the exchange of perceptual data. In this work, we propose a deep reinforcement learning-based V2X user scheduling algorithm for collaborative perception. Given the challenges in acquiring perceptual labels, we reformulate the conventional label-dependent objective into a label-free goal, based on characteristics of 3D object detection. Incorporating both channel state information (CSI) and semantic information, we develop a double deep Q-Network (DDQN)-based user scheduling framework for collaborative perception, named SchedCP. Simulation results verify the effectiveness and robustness of SchedCP compared with traditional V2X scheduling methods. Finally, we present a case study to illustrate how our proposed algorithm adaptively modifies the scheduling decisions by taking both instantaneous CSI and perceptual semantics into account.

new I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

Authors: Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu

Abstract: This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the $\textbf{LLM decoder}$ shares the same input feature space with $\textbf{diffusion decoders}$ that use the corresponding $\textbf{LLM encoder}$ for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.

URLs: https://mizhenxing.github.io/ThinkDiff.

new LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search

Authors: Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang

Abstract: Graph Neural Architecture Search (GNAS) facilitates the automatic design of Graph Neural Networks (GNNs) tailored to specific downstream graph learning tasks. However, existing GNAS approaches often require manual adaptation to new graph search spaces, necessitating substantial code optimization and domain-specific knowledge. To address this challenge, we present LLM4GNAS, a toolkit for GNAS that leverages the generative capabilities of Large Language Models (LLMs). LLM4GNAS includes an algorithm library for graph neural architecture search algorithms based on LLMs, enabling the adaptation of GNAS methods to new search spaces through the modification of LLM prompts. This approach reduces the need for manual intervention in algorithm adaptation and code modification. The LLM4GNAS toolkit is extensible and robust, incorporating LLM-enhanced graph feature engineering, LLM-enhanced graph neural architecture search, and LLM-enhanced hyperparameter optimization. Experimental results indicate that LLM4GNAS outperforms existing GNAS methods on tasks involving both homogeneous and heterogeneous graphs.

new SenDaL: An Effective and Efficient Calibration Framework of Low-Cost Sensors for Daily Life

Authors: Seokho Ahn, Hyungjin Kim, Euijong Lee, Young-Duk Seo

Abstract: The collection of accurate and noise-free data is a crucial part of Internet of Things (IoT)-controlled environments. However, the data collected from various sensors in daily life often suffer from inaccuracies. Additionally, IoT-controlled devices with low-cost sensors lack sufficient hardware resources to employ conventional deep-learning models. To overcome this limitation, we propose sensors for daily life (SenDaL), the first framework that utilizes neural networks for calibrating low cost sensors. SenDaL introduces novel training and inference processes that enable it to achieve accuracy comparable to deep learning models while simultaneously preserving latency and energy consumption similar to linear models. SenDaL is first trained in a bottom-up manner, making decisions based on calibration results from both linear and deep learning models. Once both models are trained, SenDaL makes independent decisions through a top-down inference process, ensuring accuracy and inference speed. Furthermore, SenDaL can select the optimal deep learning model according to the resources of the IoT devices because it is compatible with various deep learning models, such as long short-term memory-based and Transformer-based models. We have verified that SenDaL outperforms existing deep learning models in terms of accuracy, latency, and energy efficiency through experiments conducted in different IoT environments and real-life scenarios.

new From Layers to States: A State Space Model Perspective to Deep Neural Network Layer Dynamics

Authors: Qinshuo Liu, Weiqin Zhao, Wei Huang, Yanwen Fang, Lequan Yu, Guodong Li

Abstract: The depth of neural networks is a critical factor for their capability, with deeper models often demonstrating superior performance. Motivated by this, significant efforts have been made to enhance layer aggregation - reusing information from previous layers to better extract features at the current layer, to improve the representational power of deep neural networks. However, previous works have primarily addressed this problem from a discrete-state perspective which is not suitable as the number of network layers grows. This paper novelly treats the outputs from layers as states of a continuous process and considers leveraging the state space model (SSM) to design the aggregation of layers in very deep neural networks. Moreover, inspired by its advancements in modeling long sequences, the Selective State Space Models (S6) is employed to design a new module called Selective State Space Model Layer Aggregation (S6LA). This module aims to combine traditional CNN or transformer architectures within a sequential framework, enhancing the representational capabilities of state-of-the-art vision networks. Extensive experiments show that S6LA delivers substantial improvements in both image classification and detection tasks, highlighting the potential of integrating SSMs with contemporary deep learning techniques.

new SinSim: Sinkhorn-Regularized SimCLR

Authors: M. Hadi Sepanj, Paul Fiegth

Abstract: Self-supervised learning has revolutionized representation learning by eliminating the need for labeled data. Contrastive learning methods, such as SimCLR, maximize the agreement between augmented views of an image but lack explicit regularization to enforce a globally structured latent space. This limitation often leads to suboptimal generalization. We propose SinSim, a novel extension of SimCLR that integrates Sinkhorn regularization from optimal transport theory to enhance representation structure. The Sinkhorn loss, an entropy-regularized Wasserstein distance, encourages a well-dispersed and geometry-aware feature space, preserving discriminative power. Empirical evaluations on various datasets demonstrate that SinSim outperforms SimCLR and achieves competitive performance against prominent self-supervised methods such as VICReg and Barlow Twins. UMAP visualizations further reveal improved class separability and structured feature distributions. These results indicate that integrating optimal transport regularization into contrastive learning provides a principled and effective mechanism for learning robust, well-structured representations. Our findings open new directions for applying transport-based constraints in self-supervised learning frameworks.

new Chronic Diseases Prediction Using ML

Authors: Sri Varsha Mulakala, G. Neeharika, P. Vinay Kumar, A. Bhargava Kiran

Abstract: The recent increase in morbidity is primarily due to chronic diseases including Diabetes, Heart disease, Lung cancer, and brain tumours. The results for patients can be improved, and the financial burden on the healthcare system can be lessened, through the early detection and prevention of certain disorders. In this study, we built a machine-learning model for predicting the existence of numerous diseases utilising datasets from various sources, including Kaggle, Dataworld, and the UCI repository, that are relevant to each of the diseases we intended to predict. Following the acquisition of the datasets, we used feature engineering to extract pertinent features from the information, after which the model was trained on a training set and improved using a validation set. A test set was then used to assess the correctness of the final model. We provide an easy-to-use interface where users may enter the parameters for the selected ailment. Once the right model has been run, it will indicate whether the user has a certain ailment and offer suggestions for how to treat or prevent it.

new LiveVal: Time-aware Data Valuation via Adaptive Reference Points

Authors: Jie Xu, Zihan Wu, Cong Wang, Xiaohua Jia

Abstract: Time-aware data valuation enhances training efficiency and model robustness, as early detection of harmful samples could prevent months of wasted computation. However, existing methods rely on model retraining or convergence assumptions or fail to capture long-term training dynamics. We propose LiveVal, an efficient time-aware data valuation method with three key designs: 1) seamless integration with SGD training for efficient data contribution monitoring; 2) reference-based valuation with normalization for reliable benchmark establishment; and 3) adaptive reference point selection for real-time updating with optimized memory usage. We establish theoretical guarantees for LiveVal's stability and prove that its valuations are bounded and directionally aligned with optimization progress. Extensive experiments demonstrate that LiveVal provides efficient data valuation across different modalities and model scales, achieving 180 speedup over traditional methods while maintaining robust detection performance.

new Preference learning made easy: Everything should be understood through win rate

Authors: Lily H. Zhang, Rajesh Ranganath

Abstract: Preference learning, or the task of aligning generative models to preference comparison data, has yet to reach the conceptual maturity of classification, density estimation, etc. To close this gap, this work presents a framework to understand preference learning starting from the sampling distribution of pairwise preference data. First, we prove that the only evaluation of a generative model that respects both preferences and prevalences in the data distribution is a form of win rate, justifying win rate as the focal point to understand preference learning. We then analyze preference learning methods as win rate optimization (WRO) or non-WRO. We present novel instances of WRO beyond existing examples (RLHF, NLHF) and identify two key theoretical benefits of all such methods. We prove that common non-WRO methods like DPO and SFT on preferred samples lack these properties and suggest ways to mitigate such theoretical limitations. We also show that WRO underperforms in practice due optimization difficulties and that optimization success predicts performance better than choices which affect the objective's solution. Our analysis highlights best practices for existing methods and provides recommendations for future research, guided by the principle that one should either align non-WRO methods more closely with WRO or improve the optimization of WRO objectives.

new MixMin: Finding Data Mixtures via Convex Minimization

Authors: Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison

Abstract: Modern machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources, e.g., pre-training large language models. Yet, finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, resulting between 1-5% relative improvement to negative log likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements to average precision scores of 0.03-0.15.

new Applying Deep Learning to Ads Conversion Prediction in Last Mile Delivery Marketplace

Authors: Di Li, Xiaochang Miao, Huiyu Song, Chao Chu, Hao Xu, Mandar Rahurkar

Abstract: Deep neural networks (DNNs) have revolutionized web-scale ranking systems, enabling breakthroughs in capturing complex user behaviors and driving performance gains. At DoorDash, we first harnessed this transformative power by transitioning our homepage Ads ranking system from traditional tree based models to cutting edge multi task DNNs. This evolution sparked advancements in data foundations, model design, training efficiency, evaluation rigor, and online serving, delivering substantial business impact and reshaping our approach to machine learning. In this paper, we talk about our problem driven journey, from identifying the right problems and crafting targeted solutions to overcoming the complexity of developing and scaling a deep learning recommendation system. Through our successes and learned lessons, we aim to share insights and practical guidance to teams pursuing similar advancements in machine learning systems.

new KernelBench: Can LLMs Write Efficient GPU Kernels?

Authors: Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher R\'e, Azalia Mirhoseini

Abstract: Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise speedup threshold p.

new Expert-Agnostic Learning to Defer

Authors: Joshua Strong, Pramit Saha, Yasin Ibrahim, Cheng Ouyang, Alison Noble

Abstract: Learning to Defer (L2D) learns autonomous systems to independently manage straightforward cases, while deferring uncertain cases to human experts. Recent advancements in this field have introduced features enabling flexibility to unseen experts at test-time, but we find these approaches have significant limitations. To address these, we introduce EA-L2D: Expert-Agnostic Learning to Defer, a novel L2D framework that leverages a Bayesian approach to model expert behaviour in an expert-agnostic manner, facilitating optimal deferral decisions. EA-L2D offers several critical improvements over prior methods, including the ability to incorporate prior knowledge about experts, a reduced reliance on expert-annotated data, and robust performance when deferring to experts with expertise not seen during training. Evaluating on CIFAR-10, HAM10000, German Traffic Lights, Breast Ultrasound, Axial Organ Slices, and Blood Cell MNIST, we observe performance gains over the next state-of-the-art of 1-16\% for seen experts and 4-28\% for unseen experts in settings with high expert diversity.

new From Deep Additive Kernel Learning to Last-Layer Bayesian Neural Networks via Induced Prior Approximation

Authors: Wenyuan Zhao, Haoyuan Chen, Tie Liu, Rui Tuo, Chao Tian

Abstract: With the strengths of both deep learning and kernel methods like Gaussian Processes (GPs), Deep Kernel Learning (DKL) has gained considerable attention in recent years. From the computational perspective, however, DKL becomes challenging when the input dimension of the GP layer is high. To address this challenge, we propose the Deep Additive Kernel (DAK) model, which incorporates i) an additive structure for the last-layer GP; and ii) induced prior approximation for each GP unit. This naturally leads to a last-layer Bayesian neural network (BNN) architecture. The proposed method enjoys the interpretability of DKL as well as the computational advantages of BNN. Empirical results show that the proposed approach outperforms state-of-the-art DKL methods in both regression and classification tasks.

new Learning to be Smooth: An End-to-End Differentiable Particle Smoother

Authors: Ali Younis, Erik B. Sudderth

Abstract: For challenging state estimation problems arising in domains like vision and robotics, particle-based representations attractively enable temporal reasoning about multiple posterior modes. Particle smoothers offer the potential for more accurate offline data analysis by propagating information both forward and backward in time, but have classically required human-engineered dynamics and observation models. Extending recent advances in discriminative training of particle filters, we develop a framework for low-variance propagation of gradients across long time sequences when training particle smoothers. Our "two-filter'' smoother integrates particle streams that are propagated forward and backward in time, while incorporating stratification and importance weights in the resampling step to provide low-variance gradient estimates for neural network dynamics and observation models. The resulting mixture density particle smoother is substantially more accurate than state-of-the-art particle filters, as well as search-based baselines, for city-scale global vehicle localization from real-world videos and maps.

new Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning

Authors: Egor Cherepanov, Nikita Kachaev, Alexey K. Kovalev, Aleksandr I. Panov

Abstract: Memory is crucial for enabling agents to tackle complex tasks with temporal and spatial dependencies. While many reinforcement learning (RL) algorithms incorporate memory, the field lacks a universal benchmark to assess an agent's memory capabilities across diverse scenarios. This gap is particularly evident in tabletop robotic manipulation, where memory is essential for solving tasks with partial observability and ensuring robust performance, yet no standardized benchmarks exist. To address this, we introduce MIKASA (Memory-Intensive Skills Assessment Suite for Agents), a comprehensive benchmark for memory RL, with three key contributions: (1) we propose a comprehensive classification framework for memory-intensive RL tasks, (2) we collect MIKASA-Base - a unified benchmark that enables systematic evaluation of memory-enhanced agents across diverse scenarios, and (3) we develop MIKASA-Robo - a novel benchmark of 32 carefully designed memory-intensive tasks that assess memory capabilities in tabletop robotic manipulation. Our contributions establish a unified framework for advancing memory RL research, driving the development of more reliable systems for real-world applications. The code is available at https://sites.google.com/view/memorybenchrobots/.

URLs: https://sites.google.com/view/memorybenchrobots/.

new Accelerating Unbiased LLM Evaluation via Synthetic Feedback

Authors: Zhaoyi Zhou, Yuda Song, Andrea Zanette

Abstract: When developing new large language models (LLMs), a key step is evaluating their final performance, often by computing the win-rate against a reference model based on external feedback. Human feedback is the gold standard, particularly for capturing nuanced qualities like coherence, readability, and alignment with human expectations. However, human evaluations are costly -- even for large tech companies -- and when conducted with active users, they may negatively impact user experience. A promising alternative is synthetic feedback, where evaluations are conducted by other large language models, including reward models. While this eliminates the need for costly human annotations, it introduces biases that may distort the evaluation process. In this work, we propose a statistically principled framework that integrates human and synthetic feedback to reduce reliance on human annotations while maintaining unbiased win-rate calculations. Our experiments demonstrate a reduction in human annotations by up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned variant. Apart from being generalizable, scalable, and free of hyper-parameter tuning, our method offers predictable annotation savings, which can be estimated based on data-dependent characteristics.

new Efficient Hierarchical Contrastive Self-supervising Learning for Time Series Classification via Importance-aware Resolution Selection

Authors: Kevin Garcia, Juan Manuel Perez, Yifeng Gao

Abstract: Recently, there has been a significant advancement in designing Self-Supervised Learning (SSL) frameworks for time series data to reduce the dependency on data labels. Among these works, hierarchical contrastive learning-based SSL frameworks, which learn representations by contrasting data embeddings at multiple resolutions, have gained considerable attention. Due to their ability to gather more information, they exhibit better generalization in various downstream tasks. However, when the time series data length is significant long, the computational cost is often significantly higher than that of other SSL frameworks. In this paper, to address this challenge, we propose an efficient way to train hierarchical contrastive learning models. Inspired by the fact that each resolution's data embedding is highly dependent, we introduce importance-aware resolution selection based training framework to reduce the computational cost. In the experiment, we demonstrate that the proposed method significantly improves training time while preserving the original model's integrity in extensive time series classification performance evaluations. Our code could be found here, https://github.com/KEEBVIN/IARS

URLs: https://github.com/KEEBVIN/IARS

new HADL Framework for Noise Resilient Long-Term Time Series Forecasting

Authors: Aditya Dey, Jonas Kusch, Fadi Al Machot

Abstract: Long-term time series forecasting is critical in domains such as finance, economics, and energy, where accurate and reliable predictions over extended horizons drive strategic decision-making. Despite the progress in machine learning-based models, the impact of temporal noise in extended lookback windows remains underexplored, often degrading model performance and computational efficiency. In this paper, we propose a novel framework that addresses these challenges by integrating the Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT) to perform noise reduction and extract robust long-term features. These transformations enable the separation of meaningful temporal patterns from noise in both the time and frequency domains. To complement this, we introduce a lightweight low-rank linear prediction layer that not only reduces the influence of residual noise but also improves memory efficiency. Our approach demonstrates competitive robustness to noisy input, significantly reduces computational complexity, and achieves competitive or state-of-the-art forecasting performance across diverse benchmark datasets. Extensive experiments reveal that the proposed framework is particularly effective in scenarios with high noise levels or irregular patterns, making it well suited for real-world forecasting tasks. The code is available in https://github.com/forgee-master/HADL.

URLs: https://github.com/forgee-master/HADL.

new An Innovative Next Activity Prediction Approach Using Process Entropy and DAW-Transformer

Authors: Hadi Zare, Mostafa Abbasi, Maryam Ahang, Homayoun Najjaran

Abstract: Purpose - In Business Process Management (BPM), accurate prediction of the next activities is vital for operational efficiency and decision-making. Current Artificial Intelligence (AI)/Machine Learning (ML) models struggle with the complexity and evolving nature of business process event logs, balancing accuracy and interpretability. This paper proposes an entropy-driven model selection approach and DAW-Transformer, which stands for Dynamic Attribute-Aware Transformer, to integrate all attributes with a dynamic window for better accuracy. Design/methodology/approach - This paper introduces a novel next-activity prediction approach that uses process entropy to assess the complexity of event logs and dynamically select the most suitable ML model. A new transformer-based architecture with multi-head attention and dynamic windowing mechanism, DAW-Transformer, is proposed to capture long-range dependencies and utilize all relevant event log attributes. Experiments were conducted on six public datasets, and the performance was evaluated with process entropy. Finding - The results demonstrate the effectiveness of the approach across these publicly available datasets. DAW-Transformer achieved superior performance, especially on high-entropy datasets such as Sepsis exceeding Limited window Multi-Transformers by 4.69% and a benchmark CNN-LSTM-SAtt model by 3.07%. For low-entropy datasets like Road Traffic Fine, simpler, more interpretable algorithms like Random Forest performed nearly as well as the more complex DAW-Transformer and offered better handling of imbalanced data and improved explainability. Originality/ value - This work's novelty lies in the proposed DAW-Transformer, with a dynamic window and considering all relevant attributes. Also, entropy-driven selection methods offer a robust, accurate, and interpretable solution for next-activity prediction.

new Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective

Authors: Zeyu Jia, Alexander Rakhlin, Tengyang Xie

Abstract: As large language models have evolved, it has become crucial to distinguish between process supervision and outcome supervision -- two key reinforcement learning approaches to complex reasoning tasks. While process supervision offers intuitive advantages for long-term credit assignment, the precise relationship between these paradigms has remained an open question. Conventional wisdom suggests that outcome supervision is fundamentally more challenging due to the trajectory-level coverage problem, leading to significant investment in collecting fine-grained process supervision data. In this paper, we take steps towards resolving this debate. Our main theorem shows that, under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision, up to polynomial factors in horizon. At the core of this result lies the novel Change of Trajectory Measure Lemma -- a technical tool that bridges return-based trajectory measure and step-level distribution shift. Furthermore, for settings with access to a verifier or a rollout capability, we prove that any policy's advantage function can serve as an optimal process reward model, providing a direct connection between outcome and process supervision. These findings suggest that the empirically observed performance gap -- if any -- between outcome and process supervision likely stems from algorithmic limitations rather than inherent statistical difficulties, potentially transforming how we approach data collection and algorithm design for reinforcement learning.

new Towards Self-Supervised Covariance Estimation in Deep Heteroscedastic Regression

Authors: Megh Shukla, Aziz Shameem, Mathieu Salzmann, Alexandre Alahi

Abstract: Deep heteroscedastic regression models the mean and covariance of the target distribution through neural networks. The challenge arises from heteroscedasticity, which implies that the covariance is sample dependent and is often unknown. Consequently, recent methods learn the covariance through unsupervised frameworks, which unfortunately yield a trade-off between computational complexity and accuracy. While this trade-off could be alleviated through supervision, obtaining labels for the covariance is non-trivial. Here, we study self-supervised covariance estimation in deep heteroscedastic regression. We address two questions: (1) How should we supervise the covariance assuming ground truth is available? (2) How can we obtain pseudo labels in the absence of the ground-truth? We address (1) by analysing two popular measures: the KL Divergence and the 2-Wasserstein distance. Subsequently, we derive an upper bound on the 2-Wasserstein distance between normal distributions with non-commutative covariances that is stable to optimize. We address (2) through a simple neighborhood based heuristic algorithm which results in surprisingly effective pseudo labels for the covariance. Our experiments over a wide range of synthetic and real datasets demonstrate that the proposed 2-Wasserstein bound coupled with pseudo label annotations results in a computationally cheaper yet accurate deep heteroscedastic regression.

new K-Edit: Language Model Editing with Contextual Knowledge Awareness

Authors: Elan Markowitz, Anil Ramakrishna, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan

Abstract: As the world changes, we need to be able to update our models and correct false information without costly retraining. Knowledge-based model editing enables precise modifications to the weights of large language models in order to modify the information encoded within. Recent approaches have seen success in enabling recall of edited information for thousands of edits at once. However, these approaches fail to produce edits that account for associated contextual information. We present K-Edit, an effective approach to generating contextually consistent knowledge edits. By using knowledge graphs, which maintain contextual consistency when an edge is edited, we are able to generate additional \textit{contextual edits} that ensure consistency of related information in the language model. Our experiments demonstrate significant improvements in multi-hop question answering while maintaining the general effectiveness and scalability of model edits.

new On Self-Adaptive Perception Loss Function for Sequential Lossy Compression

Authors: Sadaf Salehkalaibar, Buu Phan, Likun Cai, Joao Atz Dick, Wei Yu, Jun Chen, Ashish Khisti

Abstract: We consider causal, low-latency, sequential lossy compression, with mean squared-error (MSE) as the distortion loss, and a perception loss function (PLF) to enhance the realism of reconstructions. As the main contribution, we propose and analyze a new PLF that considers the joint distribution between the current source frame and the previous reconstructions. We establish the theoretical rate-distortion-perception function for first-order Markov sources and analyze the Gaussian model in detail. From a qualitative perspective, the proposed metric can simultaneously avoid the error-permanence phenomenon and also better exploit the temporal correlation between high-quality reconstructions. The proposed metric is referred to as self-adaptive perception loss function (PLF-SA), as its behavior adapts to the quality of reconstructed frames. We provide a detailed comparison of the proposed perception loss function with previous approaches through both information theoretic analysis as well as experiments involving moving MNIST and UVG datasets.

new ControllableGPT: A Ground-Up Designed Controllable GPT for Molecule Optimization

Authors: Xuefeng Liu, Songhao Jiang, Bo Li, Rick Stevens

Abstract: Large Language Models (LLMs) employ three popular training approaches: Masked Language Models (MLM), Causal Language Models (CLM), and Sequence-to-Sequence Models (seq2seq). However, each approach has its strengths and limitations, and faces challenges in addressing specific tasks that require controllable and bidirectional generation, such as drug optimization. To address this challenge, inspired by the biological processes of growth and evolution, which involve the expansion, shrinking, and mutation of sequences, we introduce ControllableGPT. This initiative represents the first effort to combine the advantages of MLM, CLM, and seq2seq into a single unified, controllable GPT framework. It enables the precise management of specific locations and ranges within a sequence, allowing for expansion, reduction, or mutation over chosen or random lengths, while maintaining the integrity of any specified positions or subsequences. In this work, we designed ControllableGPT for drug optimization from the ground up, which included proposing the Causally Masked Seq2seq (CMS) objective, developing the training corpus, introducing a novel pre-training approach, and devising a unique generation process. We demonstrate the effectiveness and controllability of ControllableGPT by conducting experiments on drug optimization tasks for both viral and cancer benchmarks, surpassing competing baselines.

new Privacy Preservation through Practical Machine Unlearning

Authors: Robert Dilworth

Abstract: Machine Learning models thrive on vast datasets, continuously adapting to provide accurate predictions and recommendations. However, in an era dominated by privacy concerns, Machine Unlearning emerges as a transformative approach, enabling the selective removal of data from trained models. This paper examines methods such as Naive Retraining and Exact Unlearning via the SISA framework, evaluating their Computational Costs, Consistency, and feasibility using the \texttt{HSpam14} dataset. We explore the potential of integrating unlearning principles into Positive Unlabeled (PU) Learning to address challenges posed by partially labeled datasets. Our findings highlight the promise of unlearning frameworks like \textit{DaRE} for ensuring privacy compliance while maintaining model performance, albeit with significant computational trade-offs. This study underscores the importance of Machine Unlearning in achieving ethical AI and fostering trust in data-driven systems.

new A Power Transform

Authors: Jonathan T. Barron

Abstract: Power transforms, such as the Box-Cox transform and Tukey's ladder of powers, are a fundamental tool in mathematics and statistics. These transforms are primarily used for normalizing and standardizing datasets, effectively by raising values to a power. In this work I present a novel power transform, and I show that it serves as a unifying framework for wide family of loss functions, kernel functions, probability distributions, bump functions, and neural network activation functions.

new LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization

Authors: Erica Zhang, Ryunosuke Goto, Naomi Sagan, Jurik Mutter, Nick Phillips, Ash Alizadeh, Kangwook Lee, Jose Blanchet, Mert Pilanci, Robert Tibshirani

Abstract: We introduce LLM-Lasso, a novel framework that leverages large language models (LLMs) to guide feature selection in Lasso $\ell_1$ regression. Unlike traditional methods that rely solely on numerical data, LLM-Lasso incorporates domain-specific knowledge extracted from natural language, enhanced through a retrieval-augmented generation (RAG) pipeline, to seamlessly integrate data-driven modeling with contextual insights. Specifically, the LLM generates penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model, while less relevant features are assigned higher penalties, reducing their influence. Importantly, LLM-Lasso has an internal validation step that determines how much to trust the contextual knowledge in our prediction pipeline. Hence it addresses key challenges in robustness, making it suitable for mitigating potential inaccuracies or hallucinations from the LLM. In various biomedical case studies, LLM-Lasso outperforms standard Lasso and existing feature selection baselines, all while ensuring the LLM operates without prior access to the datasets. To our knowledge, this is the first approach to effectively integrate conventional feature selection techniques directly with LLM-based domain-specific reasoning.

new Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction

Authors: Leisheng Yu, Yanxiao Cai, Minxing Zhang, Xia Hu

Abstract: The burgeoning volume of electronic health records (EHRs) has enabled deep learning models to excel in predictive healthcare. However, for high-stakes applications such as diagnosis prediction, model interpretability remains paramount. Existing deep learning diagnosis prediction models with intrinsic interpretability often assign attention weights to every past diagnosis or hospital visit, providing explanations lacking flexibility and succinctness. In this paper, we introduce SHy, a self-explaining hypergraph neural network model, designed to offer personalized, concise and faithful explanations that allow for interventions from clinical experts. By modeling each patient as a unique hypergraph and employing a message-passing mechanism, SHy captures higher-order disease interactions and extracts distinct temporal phenotypes as personalized explanations. It also addresses the incompleteness of the EHR data by accounting for essential false negatives in the original diagnosis record. A qualitative case study and extensive quantitative evaluations on two real-world EHR datasets demonstrate the superior predictive performance and interpretability of SHy over existing state-of-the-art models.

new Controlling Neural Collapse Enhances Out-of-Distribution Detection and Transfer Learning

Authors: Md Yousuf Harun, Jhair Gallardo, Christopher Kanan

Abstract: Out-of-distribution (OOD) detection and OOD generalization are widely studied in Deep Neural Networks (DNNs), yet their relationship remains poorly understood. We empirically show that the degree of Neural Collapse (NC) in a network layer is inversely related with these objectives: stronger NC improves OOD detection but degrades generalization, while weaker NC enhances generalization at the cost of detection. This trade-off suggests that a single feature space cannot simultaneously achieve both tasks. To address this, we develop a theoretical framework linking NC to OOD detection and generalization. We show that entropy regularization mitigates NC to improve generalization, while a fixed Simplex Equiangular Tight Frame (ETF) projector enforces NC for better detection. Based on these insights, we propose a method to control NC at different DNN layers. In experiments, our method excels at both tasks across OOD datasets and DNN architectures.

new Simulations of Common Unsupervised Domain Adaptation Algorithms for Image Classification

Authors: Ahmad Chaddad, Yihang Wu, Yuchen Jiang, Ahmed Bouridane, Christian Desrosiers

Abstract: Traditional machine learning assumes that training and test sets are derived from the same distribution; however, this assumption does not always hold in practical applications. This distribution disparity can lead to severe performance drops when the trained model is used in new data sets. Domain adaptation (DA) is a machine learning technique that aims to address this problem by reducing the differences between domains. This paper presents simulation-based algorithms of recent DA techniques, mainly related to unsupervised domain adaptation (UDA), where labels are available only in the source domain. Our study compares these techniques with public data sets and diverse characteristics, highlighting their respective strengths and drawbacks. For example, Safe Self-Refinement for Transformer-based DA (SSRT) achieved the highest accuracy (91.6\%) in the office-31 data set during our simulations, however, the accuracy dropped to 72.4\% in the Office-Home data set when using limited batch sizes. In addition to improving the reader's comprehension of recent techniques in DA, our study also highlights challenges and upcoming directions for research in this domain. The codes are available at https://github.com/AIPMLab/Domain_Adaptation.

URLs: https://github.com/AIPMLab/Domain_Adaptation.

new Superpose Singular Features for Model Merging

Authors: Haiquan Qiu, You Wu, Quanming Yao

Abstract: Model merging is a critical technique for combining the capabilities of multiple fine-tuned models without requiring additional training. While existing methods treat parameters as vectors, they overlook the intrinsic structure of linear transformation matrices - the core components that comprise the majority of model parameters. These matrices are fundamental to neural networks, mapping input representations to output features through linear combinations. Motivated by the linear representation hypothesis, we introduce task matrix and propose to Superpose Features from Task Matrix (SFTM), a novel approach that superposes features from individual task models into a merged model. SFTM employs singular value decomposition to identify feature bases of linear transformation matrices and solves a linear system to optimally combine them while preserving input-output mappings from individual task models. Extensive experiments on vision transformers and language models demonstrate that our method consistently outperforms existing methods, achieving superior performance and enhanced out-of-distribution generalization.

new Artificial intelligence-enabled detection and assessment of Parkinson's disease using multimodal data: A survey

Authors: Aite Zhao, Yongcan Liu, Xinglin Yu, Xinyue Xing

Abstract: The rapid emergence of highly adaptable and reusable artificial intelligence (AI) models is set to revolutionize the medical field, particularly in the diagnosis and management of Parkinson's disease (PD). Currently, there are no effective biomarkers for diagnosing PD, assessing its severity, or tracking its progression. Numerous AI algorithms are now being used for PD diagnosis and treatment, capable of performing various classification tasks based on multimodal and heterogeneous disease symptom data, such as gait, hand movements, and speech patterns of PD patients. They provide expressive feedback, including predicting the potential likelihood of PD, assessing the severity of individual or multiple symptoms, aiding in early detection, and evaluating rehabilitation and treatment effectiveness, thereby demonstrating advanced medical diagnostic capabilities. Therefore, this work provides a surveyed compilation of recent works regarding PD detection and assessment through biometric symptom recognition with a focus on machine learning and deep learning approaches, emphasizing their benefits, and exposing their weaknesses, and their impact in opening up newer research avenues. Additionally, it also presents categorized and characterized descriptions of the datasets, approaches, and architectures employed to tackle associated constraints. Furthermore, the paper explores the potential opportunities and challenges presented by data-driven AI technologies in the diagnosis of PD.

new Raising the Bar in Graph OOD Generalization: Invariant Learning Beyond Explicit Environment Modeling

Authors: Xu Shen, Yixin Liu, Yili Wang, Rui Miao, Yiwei Dai, Shirui Pan, Xin Wang

Abstract: Out-of-distribution (OOD) generalization has emerged as a critical challenge in graph learning, as real-world graph data often exhibit diverse and shifting environments that traditional models fail to generalize across. A promising solution to address this issue is graph invariant learning (GIL), which aims to learn invariant representations by disentangling label-correlated invariant subgraphs from environment-specific subgraphs. However, existing GIL methods face two major challenges: (1) the difficulty of capturing and modeling diverse environments in graph data, and (2) the semantic cliff, where invariant subgraphs from different classes are difficult to distinguish, leading to poor class separability and increased misclassifications. To tackle these challenges, we propose a novel method termed Multi-Prototype Hyperspherical Invariant Learning (MPHIL), which introduces two key innovations: (1) hyperspherical invariant representation extraction, enabling robust and highly discriminative hyperspherical invariant feature extraction, and (2) multi-prototype hyperspherical classification, which employs class prototypes as intermediate variables to eliminate the need for explicit environment modeling in GIL and mitigate the semantic cliff issue. Derived from the theoretical framework of GIL, we introduce two novel objective functions: the invariant prototype matching loss to ensure samples are matched to the correct class prototypes, and the prototype separation loss to increase the distinction between prototypes of different classes in the hyperspherical space. Extensive experiments on 11 OOD generalization benchmark datasets demonstrate that MPHIL achieves state-of-the-art performance, significantly outperforming existing methods across graph data from various domains and with different distribution shifts.

new Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model

Authors: Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, Shenda Hong

Abstract: Electrocardiogram (ECG) is essential for the clinical diagnosis of arrhythmias and other heart diseases, but deep learning methods based on ECG often face limitations due to the need for high-quality annotations. Although previous ECG self-supervised learning (eSSL) methods have made significant progress in representation learning from unannotated ECG data, they typically treat ECG signals as ordinary time-series data, segmenting the signals using fixed-size and fixed-step time windows, which often ignore the form and rhythm characteristics and latent semantic relationships in ECG signals. In this work, we introduce a novel perspective on ECG signals, treating heartbeats as words and rhythms as sentences. Based on this perspective, we first designed the QRS-Tokenizer, which generates semantically meaningful ECG sentences from the raw ECG signals. Building on these, we then propose HeartLang, a novel self-supervised learning framework for ECG language processing, learning general representations at form and rhythm levels. Additionally, we construct the largest heartbeat-based ECG vocabulary to date, which will further advance the development of ECG language processing. We evaluated HeartLang across six public ECG datasets, where it demonstrated robust competitiveness against other eSSL methods. Our data and code are publicly available at https://github.com/PKUDigitalHealth/HeartLang.

URLs: https://github.com/PKUDigitalHealth/HeartLang.

new FuncGenFoil: Airfoil Generation and Editing Model in Function Space

Authors: Jinouwen Zhang, Junjie Ren, Aobo Yang, Yan Lu, Lu Chen, Hairun Xie, Jing Wang, Miao Zhang, Wanli Ouyang, Shixiang Tang

Abstract: Aircraft manufacturing is the jewel in the crown of industry, among which generating high-fidelity airfoil geometries with controllable and editable representations remains a fundamental challenge. While existing deep-learning-based methods rely on predefined parametric function families, e.g., B\'ezier curves and discrete point-based representations, they suffer from inherent trade-offs between expressiveness and resolution flexibility. To tackle this challenge, we introduce FuncGenFoil, a novel function-space generative model that directly learns functional airfoil geometries. Our method inherits both the advantages of arbitrary resolution sampling and the smoothness of parametric functions, as well as the strong expressiveness of discrete point-based functions. Empirical evaluations on the AFBench dataset demonstrate that FuncGenFoil improves upon state-of-the-art methods in airfoil generation by achieving a relative -74.4 label error reduction and +23.2 diversity increase on the AF-200K dataset. Our results highlight the advantages of function-space modeling for aerodynamic shape optimization, offering a powerful and flexible framework for high-fidelity airfoil design. Our code will be released.

new Why Domain Generalization Fail? A View of Necessity and Sufficiency

Authors: Long-Tung Vuong, Vy Vo, Hien Dang, Van-Anh Nguyen, Thanh-Toan Do, Mehrtash Harandi, Trung Le, Dinh Phung

Abstract: Despite a strong theoretical foundation, empirical experiments reveal that existing domain generalization (DG) algorithms often fail to consistently outperform the ERM baseline. We argue that this issue arises because most DG studies focus on establishing theoretical guarantees for generalization under unrealistic assumptions, such as the availability of sufficient, diverse (or even infinite) domains or access to target domain knowledge. As a result, the extent to which domain generalization is achievable in scenarios with limited domains remains largely unexplored. This paper seeks to address this gap by examining generalization through the lens of the conditions necessary for its existence and learnability. Specifically, we systematically establish a set of necessary and sufficient conditions for generalization. Our analysis highlights that existing DG methods primarily act as regularization mechanisms focused on satisfying sufficient conditions, while often neglecting necessary ones. However, sufficient conditions cannot be verified in settings with limited training domains. In such cases, regularization targeting sufficient conditions aims to maximize the likelihood of generalization, whereas regularization targeting necessary conditions ensures its existence. Using this analysis, we reveal the shortcomings of existing DG algorithms by showing that, while they promote sufficient conditions, they inadvertently violate necessary conditions. To validate our theoretical insights, we propose a practical method that promotes the sufficient condition while maintaining the necessary conditions through a novel subspace representation alignment strategy. This approach highlights the advantages of preserving the necessary conditions on well-established DG benchmarks.

new A Comprehensive Survey of Deep Learning for Multivariate Time Series Forecasting: A Channel Strategy Perspective

Authors: Xiangfei Qiu, Hanyin Cheng, Xingjian Wu, Jilin Hu, Chenjuan Guo

Abstract: Multivariate Time Series Forecasting (MTSF) plays a crucial role across diverse fields, ranging from economic, energy, to traffic. In recent years, deep learning has demonstrated outstanding performance in MTSF tasks. In MTSF, modeling the correlations among different channels is critical, as leveraging information from other related channels can significantly improve the prediction accuracy of a specific channel. This study systematically reviews the channel modeling strategies for time series and proposes a taxonomy organized into three hierarchical levels: the strategy perspective, the mechanism perspective, and the characteristic perspective. On this basis, we provide a structured analysis of these methods and conduct an in-depth examination of the advantages and limitations of different channel strategies. Finally, we summarize and discuss some future research directions to provide useful research guidance. Moreover, we maintain an up-to-date Github repository (https://github.com/decisionintelligence/CS4TS) which includes all the papers discussed in the survey.

URLs: https://github.com/decisionintelligence/CS4TS)

new A Mathematics Framework of Artificial Shifted Population Risk and Its Further Understanding Related to Consistency Regularization

Authors: Xiliang Yang, Shenyang Deng, Shicong Liu, Yuanchi Suo, Wing. W. Y NG, Jianjun Zhang

Abstract: Data augmentation is an important technique in training deep neural networks as it enhances their ability to generalize and remain robust. While data augmentation is commonly used to expand the sample size and act as a consistency regularization term, there is a lack of research on the relationship between them. To address this gap, this paper introduces a more comprehensive mathematical framework for data augmentation. Through this framework, we establish that the expected risk of the shifted population is the sum of the original population risk and a gap term, which can be interpreted as a consistency regularization term. The paper also provides a theoretical understanding of this gap, highlighting its negative effects on the early stages of training. We also propose a method to mitigate these effects. To validate our approach, we conducted experiments using same data augmentation techniques and computing resources under several scenarios, including standard training, out-of-distribution, and imbalanced classification. The results demonstrate that our methods surpass compared methods under all scenarios in terms of generalization ability and convergence stability. We provide our code implementation at the following link: https://github.com/ydlsfhll/ASPR.

URLs: https://github.com/ydlsfhll/ASPR.

new Rule-Bottleneck Reinforcement Learning: Joint Explanation and Decision Optimization for Resource Allocation with Language Agents

Authors: Mauricio Tec, Guojun Xiong, Haichuan Wang, Francesca Dominici, Milind Tambe

Abstract: Deep Reinforcement Learning (RL) is remarkably effective in addressing sequential resource allocation problems in domains such as healthcare, public policy, and resource management. However, deep RL policies often lack transparency and adaptability, challenging their deployment alongside human decision-makers. In contrast, Language Agents, powered by large language models (LLMs), provide human-understandable reasoning but may struggle with effective decision making. To bridge this gap, we propose Rule-Bottleneck Reinforcement Learning (RBRL), a novel framework that jointly optimizes decision and explanations. At each step, RBRL generates candidate rules with an LLM, selects among them using an attention-based RL policy, and determines the environment action with an explanation via chain-of-thought reasoning. The RL rule selection is optimized using the environment rewards and an explainability metric judged by the LLM. Evaluations in real-world scenarios highlight RBRL's competitive performance with deep RL and efficiency gains over LLM fine-tuning. A survey further confirms the enhanced quality of its explanations.

new Bone Soups: A Seek-and-Soup Model Merging Approach for Controllable Multi-Objective Generation

Authors: Guofu Xie, Xiao Zhang, Ting Yao, Yunsheng Shi

Abstract: User information needs are often highly diverse and varied. A key challenge in current research is how to achieve controllable multi-objective generation while enabling rapid adaptation to accommodate diverse user demands during test time. Existing solutions, such as Rewarded Soup, focus on merging language models individually tuned on single objectives. While easy to implement and widely used, these approaches face limitations in achieving optimal performance due to their disregard for the impacts of competing objectives on model tuning. To address this issue, we propose Bone Soup, a novel model merging approach that first seeks a series of backbone models by considering the impacts of multiple objectives and then makes the soup (i.e., merge the backbone models). Specifically, Bone Soup begins by training multiple backbone models for different objectives using multi-objective reinforcement learning. Each backbone model is guided by a combination of backbone reward signals. To ensure that these models are optimal for the Pareto front, the backbone rewards are crafted by combining standard reward functions into basis vectors, which can then be modified through a rule-based construction method. Bone Soup leverages a symmetric circulant matrix mapping to generate the merging coefficients, which are used to merge the backbone models according to user preferences. Extensive experimental results demonstrate that Bone Soup exhibits strong controllability and Pareto optimality in controllable multi-objective generation, providing a more effective and efficient approach to addressing diverse user needs at test time.

new Learning to Explain Air Traffic Situation

Authors: Hong-ah Chai, Seokbin Yoon, Keumjin Lee

Abstract: Understanding how air traffic controllers construct a mental 'picture' of complex air traffic situations is crucial but remains a challenge due to the inherently intricate, high-dimensional interactions between aircraft, pilots, and controllers. Previous work on modeling the strategies of air traffic controllers and their mental image of traffic situations often centers on specific air traffic control tasks or pairwise interactions between aircraft, neglecting to capture the comprehensive dynamics of an air traffic situation. To address this issue, we propose a machine learning-based framework for explaining air traffic situations. Specifically, we employ a Transformer-based multi-agent trajectory model that encapsulates both the spatio-temporal movement of aircraft and social interaction between them. By deriving attention scores from the model, we can quantify the influence of individual aircraft on overall traffic dynamics. This provides explainable insights into how air traffic controllers perceive and understand the traffic situation. Trained on real-world air traffic surveillance data collected from the terminal airspace around Incheon International Airport in South Korea, our framework effectively explicates air traffic situations. This could potentially support and enhance the decision-making and situational awareness of air traffic controllers.

new A Distillation-based Future-aware Graph Neural Network for Stock Trend Prediction

Authors: Zhipeng Liu, Peibo Duan, Mingyang Geng, Bin Zhang

Abstract: Stock trend prediction involves forecasting the future price movements by analyzing historical data and various market indicators. With the advancement of machine learning, graph neural networks (GNNs) have been extensively employed in stock prediction due to their powerful capability to capture spatiotemporal dependencies of stocks. However, despite the efforts of various GNN stock predictors to enhance predictive performance, the improvements remain limited, as they focus solely on analyzing historical spatiotemporal dependencies, overlooking the correlation between historical and future patterns. In this study, we propose a novel distillation-based future-aware GNN framework (DishFT-GNN) for stock trend prediction. Specifically, DishFT-GNN trains a teacher model and a student model, iteratively. The teacher model learns to capture the correlation between distribution shifts of historical and future data, which is then utilized as intermediate supervision to guide the student model to learn future-aware spatiotemporal embeddings for accurate prediction. Through extensive experiments on two real-world datasets, we verify the state-of-the-art performance of DishFT-GNN.

new Preconditioned Inexact Stochastic ADMM for Deep Model

Authors: Shenglong Zhou, Ouya Wang, Ziyan Luo, Yongxu Zhu, Geoffrey Ye Li

Abstract: The recent advancement of foundation models (FMs) has brought about a paradigm shift, revolutionizing various sectors worldwide. The popular optimizers used to train these models are stochastic gradient descent-based algorithms, which face inherent limitations, such as slow convergence and stringent assumptions for convergence. In particular, data heterogeneity arising from distributed settings poses significant challenges to their theoretical and numerical performance. This paper develops an algorithm, PISA ({P}reconditioned {I}nexact {S}tochastic {A}lternating Direction Method of Multipliers), which enables scalable parallel computing and supports various second-moment schemes. Grounded in rigorous theoretical guarantees, the algorithm converges under the sole assumption of Lipschitz continuity of the gradient, thereby removing the need for other conditions commonly imposed by stochastic methods. This capability enables PISA to tackle the challenge of data heterogeneity effectively. Comprehensive experimental evaluations for training or fine-tuning diverse FMs, including vision models, large language models, reinforcement learning models, generative adversarial networks, and recurrent neural networks, demonstrate its superior numerical performance compared to various state-of-the-art optimizers.

new Epidemic-guided deep learning for spatiotemporal forecasting of Tuberculosis outbreak

Authors: Madhab Barman, Madhurima Panja, Nachiketa Mishra, Tanujit Chakraborty

Abstract: Tuberculosis (TB) remains a formidable global health challenge, driven by complex spatiotemporal transmission dynamics and influenced by factors such as population mobility and behavioral changes. We propose an Epidemic-Guided Deep Learning (EGDL) approach that fuses mechanistic epidemiological principles with advanced deep learning techniques to enhance early warning systems and intervention strategies for TB outbreaks. Our framework is built upon a networked Susceptible-Infectious-Recovered (SIR) model augmented with a saturated incidence rate and graph Laplacian diffusion, capturing both long-term transmission dynamics and region-specific population mobility patterns. Compartmental model parameters are rigorously estimated using Bayesian inference via the Markov Chain Monte Carlo (MCMC) approach. Theoretical analysis leveraging the comparison principle and Green's formula establishes global stability properties of the disease-free and endemic equilibria. Building on these epidemiological insights, we design two forecasting architectures, EGDL-Parallel and EGDL-Series, that integrate the mechanistic outputs of the networked SIR model within deep neural networks. This integration mitigates the overfitting risks commonly encountered in data-driven methods and filters out noise inherent in surveillance data, resulting in reliable forecasts of real-world epidemic trends. Experiments conducted on TB incidence data from 47 prefectures in Japan demonstrate that our approach delivers robust and accurate predictions across multiple time horizons (short to medium-term forecasts). Additionally, incorporating uncertainty quantification through conformal prediction enhances the model's practical utility for guiding targeted public health interventions.

new ReReLRP - Remembering and Recognizing Tasks with LRP

Authors: Karolina Bogacka, Maximilian H\"ofler, Maria Ganzha, Wojciech Samek, Katarzyna Wasielewska-Michniewska

Abstract: Deep neural networks have revolutionized numerous research fields and applications. Despite their widespread success, a fundamental limitation known as catastrophic forgetting remains, where models fail to retain their ability to perform previously learned tasks after being trained on new ones. This limitation is particularly acute in certain continual learning scenarios, where models must integrate the knowledge from new domains with their existing capabilities. Traditional approaches to mitigate this problem typically rely on memory replay mechanisms, storing either original data samples, prototypes, or activation patterns. Although effective, these methods often introduce significant computational overhead, raise privacy concerns, and require the use of dedicated architectures. In this work we present ReReLRP (Remembering and Recognizing with LRP), a novel solution that leverages Layerwise Relevance Propagation (LRP) to preserve information across tasks. Our contribution provides increased privacy of existing replay-free methods while additionally offering built-in explainability, flexibility of model architecture and deployment, and a new mechanism to increase memory storage efficiency. We validate our approach on a wide variety of datasets, demonstrating results comparable with a well-known replay-based method in selected scenarios.

new Which Features are Best for Successor Features?

Authors: Yann Ollivier

Abstract: In reinforcement learning, universal successor features (SFs) are a way to provide zero-shot adaptation to new tasks at test time: they provide optimal policies for all downstream reward functions lying in the linear span of a set of base features. But it is unclear what constitutes a good set of base features, that could be useful for a wide set of downstream tasks beyond their linear span. Laplacian eigenfunctions (the eigenfunctions of $\Delta+\Delta^\ast$ with $\Delta$ the Laplacian operator of some reference policy and $\Delta^\ast$ that of the time-reversed dynamics) have been argued to play a role, and offer good empirical performance. Here, for the first time, we identify the optimal base features based on an objective criterion of downstream performance, in a non-tautological way without assuming the downstream tasks are linear in the features. We do this for three generic classes of downstream tasks: reaching a random goal state, dense random Gaussian rewards, and random ``scattered'' sparse rewards. The features yielding optimal expected downstream performance turn out to be the \emph{same} for these three task families. They do not coincide with Laplacian eigenfunctions in general, though they can be expressed from $\Delta$: in the simplest case (deterministic environment and decay factor $\gamma$ close to $1$), they are the eigenfunctions of $\Delta^{-1}+(\Delta^{-1})^\ast$. We obtain these results under an assumption of large behavior cloning regularization with respect to a reference policy, a setting often used for offline RL. Along the way, we get new insights into KL-regularized\option{natural} policy gradient, and into the lack of SF information in the norm of Bellman gaps.

new Tackling the Zero-Shot Reinforcement Learning Loss Directly

Authors: Yann Ollivier

Abstract: Zero-shot reinforcement learning (RL) methods aim at instantly producing a behavior for an RL task in a given environment, from a description of the reward function. These methods are usually tested by evaluating their average performance on a series of downstream tasks. Yet they cannot be trained directly for that objective, unless the distribution of downstream tasks is known. Existing approaches either use other learning criteria [BBQ+ 18, TRO23, TO21, HDB+ 19], or explicitly set a prior on downstream tasks, such as reward functions given by a random neural network [FPAL24]. Here we prove that the zero-shot RL loss can be optimized directly, for a range of non-informative priors such as white noise rewards, temporally smooth rewards, ``scattered'' sparse rewards, or a combination of those. Thus, it is possible to learn the optimal zero-shot features algorithmically, for a wide mixture of priors. Surprisingly, the white noise prior leads to an objective almost identical to the one in VISR [HDB+19], via a different approach. This shows that some seemingly arbitrary choices in VISR, such as Von Mises--Fisher distributions, do maximize downstream performance. This also suggests more efficient ways to tackle the VISR objective. Finally, we discuss some consequences and limitations of the zero-shot RL objective, such as its tendency to produce narrow optimal features if only using Gaussian dense reward priors.

new HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model

Authors: Mingqian Ma, Guoqing Liu, Chuan Cao, Pan Deng, Tri Dao, Albert Gu, Peiran Jin, Zhao Yang, Yingce Xia, Renqian Luo, Pipi Hu, Zun Wang, Yuan-Jyue Chen, Haiguang Liu, Tao Qin

Abstract: Advances in natural language processing and large language models have sparked growing interest in modeling DNA, often referred to as the "language of life". However, DNA modeling poses unique challenges. First, it requires the ability to process ultra-long DNA sequences while preserving single-nucleotide resolution, as individual nucleotides play a critical role in DNA function. Second, success in this domain requires excelling at both generative and understanding tasks: generative tasks hold potential for therapeutic and industrial applications, while understanding tasks provide crucial insights into biological mechanisms and diseases. To address these challenges, we propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture, seamlessly integrating the strengths of attention mechanisms with selective state-space models. This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution. HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks, and demonstrates exceptional capability in generating synthetic cis-regulatory elements (CREs) with desired properties. Furthermore, we show that HybriDNA adheres to expected scaling laws, with performance improving consistently as the model scales from 300M to 3B and 7B parameters. These findings underscore HybriDNA's versatility and its potential to advance DNA research and applications, paving the way for innovations in understanding and engineering the "language of life".

new BalanceBenchmark: A Survey for Imbalanced Learning

Authors: Shaoxuan Xu, Menglu Cui, Chengxiang Huang, Hongfa Wang, DiHu

Abstract: Multimodal learning has gained attention for its capacity to integrate information from different modalities. However, it is often hindered by the multimodal imbalance problem, where certain modality dominates while others remain underutilized. Although recent studies have proposed various methods to alleviate this problem, they lack comprehensive and fair comparisons. In this paper, we systematically categorize various mainstream multimodal imbalance algorithms into four groups based on the strategies they employ to mitigate imbalance. To facilitate a comprehensive evaluation of these methods, we introduce BalanceBenchmark, a benchmark including multiple widely used multidimensional datasets and evaluation metrics from three perspectives: performance, imbalance degree, and complexity. To ensure fair comparisons, we have developed a modular and extensible toolkit that standardizes the experimental workflow across different methods. Based on the experiments using BalanceBenchmark, we have identified several key insights into the characteristics and advantages of different method groups in terms of performance, balance degree and computational complexity. We expect such analysis could inspire more efficient approaches to address the imbalance problem in the future, as well as foundation models. The code of the toolkit is available at https://github.com/GeWu-Lab/BalanceBenchmark.

URLs: https://github.com/GeWu-Lab/BalanceBenchmark.

new On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning

Authors: \'Alvaro Arroyo, Alessio Gravina, Benjamin Gutteridge, Federico Barbero, Claudio Gallicchio, Xiaowen Dong, Michael Bronstein, Pierre Vandergheynst

Abstract: Graph Neural Networks (GNNs) are models that leverage the graph structure to transmit information between nodes, typically through the message-passing operation. While widely successful, this approach is well known to suffer from the over-smoothing and over-squashing phenomena, which result in representational collapse as the number of layers increases and insensitivity to the information contained at distant and poorly connected nodes, respectively. In this paper, we present a unified view of these problems through the lens of vanishing gradients, using ideas from linear control theory for our analysis. We propose an interpretation of GNNs as recurrent models and empirically demonstrate that a simple state-space formulation of a GNN effectively alleviates over-smoothing and over-squashing at no extra trainable parameter cost. Further, we show theoretically and empirically that (i) GNNs are by design prone to extreme gradient vanishing even after a few layers; (ii) Over-smoothing is directly related to the mechanism causing vanishing gradients; (iii) Over-squashing is most easily alleviated by a combination of graph rewiring and vanishing gradient mitigation. We believe our work will help bridge the gap between the recurrent and graph neural network literature and will unlock the design of new deep and performant GNNs.

new Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing

Authors: J. Jon Ryu, Jeongyeol Kwon, Benjamin Koppe, Kwang-Sung Jun

Abstract: We consider the off-policy selection and learning in contextual bandits where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that our method achieves a significantly better, variance-adaptive guarantee upon prior art. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a difference balance in bias and variance. One special case that we call freezing tends to induce small variance, which is preferred in small-data regimes. Our analysis shows that they match the best existing guarantee. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.

new The Vendiscope: An Algorithmic Microscope For Data Collections

Authors: Amey P. Pasarkar, Adji Bousso Dieng

Abstract: The evolution of microscopy, beginning with its invention in the late 16th century, has continuously enhanced our ability to explore and understand the microscopic world, enabling increasingly detailed observations of structures and phenomena. In parallel, the rise of data-driven science has underscored the need for sophisticated methods to explore and understand the composition of complex data collections. This paper introduces the Vendiscope, the first algorithmic microscope designed to extend traditional microscopy to computational analysis. The Vendiscope leverages the Vendi scores -- a family of differentiable diversity metrics rooted in ecology and quantum mechanics -- and assigns weights to data points based on their contribution to the overall diversity of the collection. These weights enable high-resolution data analysis at scale. We demonstrate this across biology, materials science, and machine learning (ML). We analyzed the $250$ million protein sequences in the protein universe, discovering that over $200$ million are near-duplicates and that AlphaFold fails on proteins with Gene Ontology (GO) functions that contribute most to diversity. Applying the Vendiscope to the Materials Project database led to similar findings: more than $85\%$ of the crystals with formation energy data are near-duplicates and ML models perform poorly on materials that enhance diversity. Additionally, the Vendiscope can be used to study phenomena such as memorization in generative models. We used the Vendiscope to identify memorized training samples from $13$ different generative models and found that the best-performing ones often memorize the training samples that contribute least to diversity. Our findings demonstrate that the Vendiscope can serve as a powerful tool for data-driven science.

new LEAPS: A discrete neural sampler via locally equivariant networks

Authors: Peter Holderrieth, Michael S. Albergo, Tommi Jaakkola

Abstract: We propose LEAPS, an algorithm to sample from discrete distributions known up to normalization by learning a rate matrix of a continuous-time Markov chain (CTMC). LEAPS can be seen as a continuous-time formulation of annealed importance sampling and sequential Monte Carlo methods, extended so that the variance of the importance weights is offset by the inclusion of the CTMC. To derive these importance weights, we introduce a set of Radon-Nikodym derivatives of CTMCs over their path measures. Because the computation of these weights is intractable with standard neural network parameterizations of rate matrices, we devise a new compact representation for rate matrices via what we call locally equivariant functions. To parameterize them, we introduce a family of locally equivariant multilayer perceptrons, attention layers, and convolutional networks, and provide an approach to make deep networks that preserve the local equivariance. This property allows us to propose a scalable training algorithm for the rate matrix such that the variance of the importance weights associated to the CTMC are minimal. We demonstrate the efficacy of LEAPS on problems in statistical physics.

new Implicit Neural Representations of Molecular Vector-Valued Functions

Authors: Jirka Lhotka, Daniel Probst

Abstract: Molecules have various computational representations, including numerical descriptors, strings, graphs, point clouds, and surfaces. Each representation method enables the application of various machine learning methodologies from linear regression to graph neural networks paired with large language models. To complement existing representations, we introduce the representation of molecules through vector-valued functions, or $n$-dimensional vector fields, that are parameterized by neural networks, which we denote molecular neural fields. Unlike surface representations, molecular neural fields capture external features and the hydrophobic core of macromolecules such as proteins. Compared to discrete graph or point representations, molecular neural fields are compact, resolution independent and inherently suited for interpolation in spatial and temporal dimensions. These properties inherited by molecular neural fields lend themselves to tasks including the generation of molecules based on their desired shape, structure, and composition, and the resolution-independent interpolation between molecular conformations in space and time. Here, we provide a framework and proofs-of-concept for molecular neural fields, namely, the parametrization and superresolution reconstruction of a protein-ligand complex using an auto-decoder architecture and the embedding of molecular volumes in latent space using an auto-encoder architecture.

new To Bin or not to Bin: Alternative Representations of Mass Spectra

Authors: Niek de Jonge, Justin J. J. van der Hooft, Daniel Probst

Abstract: Mass spectrometry, especially so-called tandem mass spectrometry, is commonly used to assess the chemical diversity of samples. The resulting mass fragmentation spectra are representations of molecules of which the structure may have not been determined. This poses the challenge of experimentally determining or computationally predicting molecular structures from mass spectra. An alternative option is to predict molecular properties or molecular similarity directly from spectra. Various methodologies have been proposed to embed mass spectra for further use in machine learning tasks. However, these methodologies require preprocessing of the spectra, which often includes binning or sub-sampling peaks with the main reasoning of creating uniform vector sizes and removing noise. Here, we investigate two alternatives to the binning of mass spectra before down-stream machine learning tasks, namely, set-based and graph-based representations. Comparing the two proposed representations to train a set transformer and a graph neural network on a regression task, respectively, we show that they both perform substantially better than a multilayer perceptron trained on binned data.

new Learning Identifiable Structures Helps Avoid Bias in DNN-based Supervised Causal Learning

Authors: Jiaru Zhang, Rui Ding, Qiang Fu, Bojun Huang, Zizhen Deng, Yang Hua, Haibing Guan, Shi Han, Dongmei Zhang

Abstract: Causal discovery is a structured prediction task that aims to predict causal relations among variables based on their data samples. Supervised Causal Learning (SCL) is an emerging paradigm in this field. Existing Deep Neural Network (DNN)-based methods commonly adopt the "Node-Edge approach", in which the model first computes an embedding vector for each variable-node, then uses these variable-wise representations to concurrently and independently predict for each directed causal-edge. In this paper, we first show that this architecture has some systematic bias that cannot be mitigated regardless of model size and data size. We then propose SiCL, a DNN-based SCL method that predicts a skeleton matrix together with a v-tensor (a third-order tensor representing the v-structures). According to the Markov Equivalence Class (MEC) theory, both the skeleton and the v-structures are identifiable causal structures under the canonical MEC setting, so predictions about skeleton and v-structures do not suffer from the identifiability limit in causal discovery, thus SiCL can avoid the systematic bias in Node-Edge architecture, and enable consistent estimators for causal discovery. Moreover, SiCL is also equipped with a specially designed pairwise encoder module with a unidirectional attention layer to model both internal and external relationships of pairs of nodes. Experimental results on both synthetic and real-world benchmarks show that SiCL significantly outperforms other DNN-based SCL approaches.

new LLM-driven Knowledge Distillation for Dynamic Text-Attributed Graphs

Authors: Amit Roy, Ning Yan, Masood Mortazavi

Abstract: Dynamic Text-Attributed Graphs (DyTAGs) have numerous real-world applications, e.g. social, collaboration, citation, communication, and review networks. In these networks, nodes and edges often contain text descriptions, and the graph structure can evolve over time. Future link prediction, edge classification, relation generation, and other downstream tasks on DyTAGs require powerful representations that encode structural, temporal, and textual information. Although graph neural networks (GNNs) excel at handling structured data, encoding temporal information within dynamic graphs remains a significant challenge. In this work, we propose LLM-driven Knowledge Distillation for Dynamic Text Attributed Graph (LKD4DyTAG) with temporal encoding to address these challenges. We use a simple, yet effective approach to encode temporal information in edges so that graph convolution can simultaneously capture both temporal and structural information in the hidden representations. To leverage LLM's text processing capabilities for learning richer representations on DyTAGs, we distill knowledge from LLM-driven edge representations (based on a neighborhood's text attributes) into saptio-temporal representations using a lightweight GNN model that encodes temporal and structural information. The objective of knowledge distillation enables the GNN to learn representations that more effectively encode the available structural, temporal, and textual information in DyTAG. We conducted extensive experimentation on six real-world DyTAG datasets to verify the effectiveness of our approach LKD4DyTAG for future link prediction and edge classification task. The results show that our approach significantly improves the performance of downstream tasks compared to the baseline models.

new The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training

Authors: Matteo Saponati, Pascal Sager, Pau Vilimelis Aceituno, Thilo Stadelmann, Benjamin Grewe

Abstract: Self-attention is essential to Transformer architectures, yet how information is embedded in the self-attention matrices and how different objective functions impact this process remains unclear. We present a mathematical framework to analyze self-attention matrices by deriving the structures governing their weight updates. Using this framework, we demonstrate that bidirectional training induces symmetry in the weight matrices, while autoregressive training results in directionality and column dominance. Our theoretical findings are validated across multiple Transformer models - including ModernBERT, GPT, LLaMA3, and Mistral - and input modalities like text, vision, and audio. Finally, we apply these insights by showing that symmetric initialization improves the performance of encoder-only models on language tasks. This mathematical analysis offers a novel theoretical perspective on how information is embedded through self-attention, thereby improving the interpretability of Transformer models.

new Semantic Specialization in MoE Appears with Scale: A Study of DeepSeek R1 Expert Specialization

Authors: Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Man Luo, Sungduk Yu, Chendi Xue, Vasudev Lal

Abstract: DeepSeek-R1, the largest open-source Mixture-of-Experts (MoE) model, has demonstrated reasoning capabilities comparable to proprietary frontier models. Prior research has explored expert routing in MoE models, but findings suggest that expert selection is often token-dependent rather than semantically driven. Given DeepSeek-R1's enhanced reasoning abilities, we investigate whether its routing mechanism exhibits greater semantic specialization than previous MoE models. To explore this, we conduct two key experiments: (1) a word sense disambiguation task, where we examine expert activation patterns for words with differing senses, and (2) a cognitive reasoning analysis, where we assess DeepSeek-R1's structured thought process in an interactive task setting of DiscoveryWorld. We conclude that DeepSeek-R1's routing mechanism is more semantically aware and it engages in structured cognitive processes.

new Reduced Order Modeling with Shallow Recurrent Decoder Networks

Authors: Matteo Tomasetto, Jan P. Williams, Francesco Braghin, Andrea Manzoni, J. Nathan Kutz

Abstract: Reduced Order Modeling is of paramount importance for efficiently inferring high-dimensional spatio-temporal fields in parametric contexts, enabling computationally tractable parametric analyses, uncertainty quantification and control. However, conventional dimensionality reduction techniques are typically limited to known and constant parameters, inefficient for nonlinear and chaotic dynamics, and uninformed to the actual system behavior. In this work, we propose sensor-driven SHallow REcurrent Decoder networks for Reduced Order Modeling (SHRED-ROM). Specifically, we consider the composition of a long short-term memory network, which encodes the temporal dynamics of limited sensor data in multiple scenarios, and a shallow decoder, which reconstructs the corresponding high-dimensional states. SHRED-ROM is a robust decoding-only strategy that circumvents the numerically unstable approximation of an inverse which is required by encoding-decoding schemes. To enhance computational efficiency and memory usage, the full-order state snapshots are reduced by, e.g., proper orthogonal decomposition, allowing for compressive training of the networks with minimal hyperparameter tuning. Through applications on chaotic and nonlinear fluid dynamics, we show that SHRED-ROM (i) accurately reconstructs the state dynamics for new parameter values starting from limited fixed or mobile sensors, independently on sensor placement, (ii) can cope with both physical, geometrical and time-dependent parametric dependencies, while being agnostic to their actual values, (iii) can accurately estimate unknown parameters, and (iv) can deal with different data sources, such as high-fidelity simulations, coupled fields and videos.

new CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation

Authors: Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Zi Yang, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang

Abstract: Large language models (LLMs) are revolutionizing many science and engineering fields. However, their huge model sizes impose extremely demanding needs of computational resources in the pre-training stage. Although low-rank factorizations can reduce model parameters, their direct application in LLM pre-training often lead to non-negligible performance loss. To address this fundamental challenge, we introduce CoLA and its memory-efficient implementation, CoLA-M. We leverage the low-rank structure observed widely in model activations, enforcing non-linear transformations between factorized weight matrices to reduce model size, boost model capacity and training efficiency. Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $\bf 2\pmb{\times}$ and improves training throughput by $\bf 1.86\pmb{\times}$ while maintaining full-rank level performance. CoLA-M further squeezes memory cost without sacrificing throughput, offering a pre-training approach with collectively superior parameter, computing, and memory efficiency. The LLMs produced are also $\bf 2\pmb{\times}$ smaller, enabling faster inference with lower memory cost on resource-constrained platforms

new The Relationship between No-Regret Learning and Online Conformal Prediction

Authors: Ramya Ramalingam, Shayan Kiyani, Aaron Roth

Abstract: Existing algorithms for online conformal prediction -- guaranteeing marginal coverage in adversarial settings -- are variants of online gradient descent (OGD), but their analyses of worst-case coverage do not follow from the regret guarantee of OGD. What is the relationship between no-regret learning and online conformal prediction? We observe that although standard regret guarantees imply marginal coverage in i.i.d. settings, this connection fails as soon as we either move to adversarial environments or ask for group conditional coverage. On the other hand, we show a tight connection between threshold calibrated coverage and swap-regret in adversarial settings, which extends to group-conditional (multi-valid) coverage. We also show that algorithms in the follow the perturbed leader family of no regret learning algorithms (which includes online gradient descent) can be used to give group-conditional coverage guarantees in adversarial settings for arbitrary grouping functions. Via this connection we analyze and conduct experiments using a multi-group generalization of the ACI algorithm of Gibbs & Candes [2021] (arXiv:2106.00170).

new Graders should cheat: privileged information enables expert-level automated evaluations

Authors: Jin Peng Zhou, S\'ebastien M. R. Arnold, Nan Ding, Kilian Q. Weinberger, Nan Hua, Fei Sha

Abstract: Auto-evaluating language models (LMs), i.e., using a grader LM to evaluate the candidate LM, is an appealing way to accelerate the evaluation process and the cost associated with it. But this presents a paradox: how can we trust the grader LM, which is presumably weaker than the candidate LM, to assess problems that are beyond the frontier of the capabilities of either model or both? For instance, today's LMs struggle on graduate-level physics and Olympiad-level math, making them unreliable graders in these domains. We show that providing privileged information -- such as ground-truth solutions or problem-specific guidelines -- improves automated evaluations on such frontier problems. This approach offers two key advantages. First, it expands the range of problems where LMs graders apply. Specifically, weaker models can now rate the predictions of stronger models. Second, privileged information can be used to devise easier variations of challenging problems which improves the separability of different LMs on tasks where their performance is generally low. With this approach, general-purpose LM graders match the state of the art performance on RewardBench, surpassing almost all the specially-tuned models. LM graders also outperform individual human raters on Vibe-Eval, and approach human expert graders on Olympiad-level math problems.

new Is Elo Rating Reliable? A Study Under Model Misspecification

Authors: Shange Tang, Yuanhao Wang, Chi Jin

Abstract: Elo rating, widely used for skill assessment across diverse domains ranging from competitive games to large language models, is often understood as an incremental update algorithm for estimating a stationary Bradley-Terry (BT) model. However, our empirical analysis of practical matching datasets reveals two surprising findings: (1) Most games deviate significantly from the assumptions of the BT model and stationarity, raising questions on the reliability of Elo. (2) Despite these deviations, Elo frequently outperforms more complex rating systems, such as mElo and pairwise models, which are specifically designed to account for non-BT components in the data, particularly in terms of win rate prediction. This paper explains this unexpected phenomenon through three key perspectives: (a) We reinterpret Elo as an instance of online gradient descent, which provides no-regret guarantees even in misspecified and non-stationary settings. (b) Through extensive synthetic experiments on data generated from transitive but non-BT models, such as strongly or weakly stochastic transitive models, we show that the ''sparsity'' of practical matching data is a critical factor behind Elo's superior performance in prediction compared to more complex rating systems. (c) We observe a strong correlation between Elo's predictive accuracy and its ranking performance, further supporting its effectiveness in ranking.

new SSVEP-BiMA: Bifocal Masking Attention Leveraging Native and Symmetric-Antisymmetric Components for Robust SSVEP Decoding

Authors: Yuxin Liu, Zhenxi Song, Guoyang Xu, Zirui Wang, Feng Wan, Yong Hu, Min Zhang, Zhiguo Zhang

Abstract: Brain-computer interface (BCI) based on steady-state visual evoked potentials (SSVEP) is a popular paradigm for its simplicity and high information transfer rate (ITR). Accurate and fast SSVEP decoding is crucial for reliable BCI performance. However, conventional decoding methods demand longer time windows, and deep learning models typically require subject-specific fine-tuning, leaving challenges in achieving optimal performance in cross-subject settings. This paper proposed a biofocal masking attention-based method (SSVEP-BiMA) that synergistically leverages the native and symmetric-antisymmetric components for decoding SSVEP. By utilizing multiple signal representations, the network is able to integrate features from a wider range of sample perspectives, leading to more generalized and comprehensive feature learning, which enhances both prediction accuracy and robustness. We performed experiments on two public datasets, and the results demonstrate that our proposed method surpasses baseline approaches in both accuracy and ITR. We believe that this work will contribute to the development of more efficient SSVEP-based BCI systems.

new New Rates in Stochastic Decision-Theoretic Online Learning under Differential Privacy

Authors: Ruihan Wu, Yu-Xiang Wang

Abstract: Hu and Mehta (2024) posed an open problem: what is the optimal instance-dependent rate for the stochastic decision-theoretic online learning (with $K$ actions and $T$ rounds) under $\varepsilon$-differential privacy? Before, the best known upper bound and lower bound are $O\left(\frac{\log K}{\Delta_{\min}} + \frac{\log K\log T}{\varepsilon}\right)$ and $\Omega\left(\frac{\log K}{\Delta_{\min}} + \frac{\log K}{\varepsilon}\right)$ (where $\Delta_{\min}$ is the gap between the optimal and the second actions). In this paper, we partially address this open problem by having two new results. First, we provide an improved upper bound for this problem $O\left(\frac{\log K}{\Delta_{\min}} + \frac{\log^2K}{\varepsilon}\right)$, where the $T$-dependency has been removed. Second, we introduce the deterministic setting, a weaker setting of this open problem, where the received loss vector is deterministic and we can focus on the analysis for $\varepsilon$ regardless of the sampling error. At the deterministic setting, we prove upper and lower bounds that match at $\Theta\left(\frac{\log K}{\varepsilon}\right)$, while a direct application of the analysis and algorithms from the original setting still leads to an extra log factor. Technically, we introduce the Bernoulli resampling trick, which enforces a monotonic property for the output from report-noisy-max mechanism that enables a tighter analysis. Moreover, by replacing the Laplace noise with Gumbel noise, we derived explicit integral form that gives a tight characterization of the regret in the deterministic case.

new Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings

Authors: Liangqi Yuan, Dong-Jun Han, Shiqiang Wang, Christopher G. Brinton

Abstract: Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, beyond their large size, make their deployment more challenging during the inference stage. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource issues, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design a local-cloud LLM inference offloading (LCIO) system, featuring (i) a large-scale cloud LLM that can handle multi-modal data sources and (ii) a lightweight local LLM that can process simple tasks at high speed. LCIO employs resource-constrained reinforcement learning (RCRL) to determine where to make the inference (i.e., local vs. cloud) and which multi-modal data sources to use for each dialogue/task, aiming to maximize the long-term reward (which incorporates response quality, latency, and usage cost) while adhering to resource constraints. We also propose M4A1, a new dataset that accounts for multi-modal, multi-task, multi-dialogue, and multi-LLM characteristics, to investigate the capabilities of LLMs in various practical scenarios. We demonstrate the effectiveness of LCIO compared to baselines, showing significant savings in latency and cost while achieving satisfactory response quality.

new Collaborative Deterministic-Diffusion Model for Probabilistic Urban Spatiotemporal Prediction

Authors: Zhi Sheng, Yuan Yuan, Yudi Zhang, Depeng Jin, Yong Li

Abstract: Accurate prediction of urban spatiotemporal dynamics is essential for enhancing urban management and decision-making. Existing spatiotemporal prediction models are predominantly deterministic, focusing on primary spatiotemporal patterns. However, those dynamics are highly complex, exhibiting multi-modal distributions that are challenging for deterministic models to capture. In this paper, we highlight the critical role of probabilistic prediction in capturing the uncertainties and complexities inherent in spatiotemporal data. While mainstream probabilistic models can capture uncertainty, they struggle with accurately learning primary patterns and often suffer from computational inefficiency. To address these challenges, we propose CoST, which collaborates deterministic and probabilistic models to improve both predictive accuracy and the ability to handle uncertainty. To achieve this, we design a mean-residual decomposition framework, where the mean value is modeled by a deterministic model, and the residual variations are learned by a probabilistic model, specifically diffusion models. Moreover, we introduce a scale-aware diffusion process, which better accounts for spatially heterogeneous dynamics across different regions. Extensive experiments on eight real-world datasets demonstrate that CoST significantly outperforms existing methods in both deterministic and probabilistic metrics, achieving a 20% improvement with low computational cost. CoST bridges the gap between deterministic precision and probabilistic uncertainty, making a significant advancement in the field of urban spatiotemporal prediction.

new Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning

Authors: Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, Yin Wei

Abstract: Catastrophic forgetting (CF) poses a significant challenge in machine learning, where a model forgets previously learned information upon learning new tasks. Despite the advanced capabilities of Large Language Models (LLMs), they continue to face challenges with CF during continual learning. The majority of existing research focuses on analyzing forgetting patterns through a singular training sequence, thereby overlooking the intricate effects that diverse tasks have on model behavior. Our study explores CF across various settings, discovering that model forgetting is influenced by both the specific training tasks and the models themselves. To this end, we interpret forgetting by examining the function vector (FV), a compact representation of functions in LLMs, offering a model-dependent indicator for the occurrence of CF. Through theoretical and empirical analyses, we demonstrated that CF in LLMs primarily stems from biases in function activation rather than the overwriting of task processing functions. Leveraging these insights, we propose a novel function vector guided training methodology, incorporating a regularization technique to stabilize the FV and mitigate forgetting. Empirical tests on four benchmarks confirm the effectiveness of our proposed training method, substantiating our theoretical framework concerning CF and model function dynamics. We plan to make our code publicly accessible in the near future.

new Simplify RLHF as Reward-Weighted SFT: A Variational Method

Authors: Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao

Abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high complexity in implementation and computation consumption. Even with recent simplifications, such as Direct Preference Optimization (DPO) and Advantage Leftover Lunch (A-LoL), the problems of over-fitting and training instability remain hindering the alignment process from the expected optimal performance. To address the existing challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called $\textbf{V}$ariational $\textbf{A}$lignment with $\textbf{R}$e-weighting ($\textbf{VAR}$). More specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into a reward-driven re-weighted supervised fine-tuning (SFT) form, which only requires minor adjustment on the SFT loss to obtain noticeable improvement on training stability and effectiveness. On comprehensive alignment and generation benchmarks, our VAR method has numerically achieved competitive performance in LLM alignment helpfulness and harmlessness.

new Diversified Sampling Improves Scaling LLM inference

Authors: Tianchun Wang, Zichuan Liu, Yuanzhou Chen, Jonathan Light, Haifeng Chen, Xiang Zhang, Wei Cheng

Abstract: While increasing training compute has significantly improved the performance of large language models (LLMs), similar gains have not been observed when scaling inference compute. We hypothesize that the primary issue lies in the uniformity of LLM outputs, which leads to inefficient sampling as models repeatedly generate similar but inaccurate responses. Motivated by an intriguing relationship between solution accuracy (Pass@10) and response diversity, we propose DivSampling-a novel and versatile sampling technique designed to enhance the diversity of candidate solutions by introducing prompt perturbations.DivSampling incorporates two categories of perturbations: task-agnostic approaches, which are general and not tailored to any specific task, and task-specific approaches, which are customized based on task content. Our theoretical analysis demonstrates that, under mild assumptions, the error rates of responses generated from diverse prompts are significantly lower compared to those produced by stationary prompts. Comprehensive evaluations across various tasks -including reasoning, mathematics, and code generation - highlight the effectiveness of DivSampling in improving solution accuracy. This scalable and efficient approach offers a new perspective on optimizing test-time inference, addressing limitations in current sampling strategies.

new A Critical Review of Predominant Bias in Neural Networks

Authors: Jiazhi Li, Mahyar Khayatkhoei, Jiageng Zhu, Hanchen Xie, Mohamed E. Hussein, Wael AbdAlmageed

Abstract: Bias issues of neural networks garner significant attention along with its promising advancement. Among various bias issues, mitigating two predominant biases is crucial in advancing fair and trustworthy AI: (1) ensuring neural networks yields even performance across demographic groups, and (2) ensuring algorithmic decision-making does not rely on protected attributes. However, upon the investigation of \pc papers in the relevant literature, we find that there exists a persistent, extensive but under-explored confusion regarding these two types of biases. Furthermore, the confusion has already significantly hampered the clarity of the community and subsequent development of debiasing methodologies. Thus, in this work, we aim to restore clarity by providing two mathematical definitions for these two predominant biases and leveraging these definitions to unify a comprehensive list of papers. Next, we highlight the common phenomena and the possible reasons for the existing confusion. To alleviate the confusion, we provide extensive experiments on synthetic, census, and image datasets, to validate the distinct nature of these biases, distinguish their different real-world manifestations, and evaluate the effectiveness of a comprehensive list of bias assessment metrics in assessing the mitigation of these biases. Further, we compare these two types of biases from multiple dimensions including the underlying causes, debiasing methods, evaluation protocol, prevalent datasets, and future directions. Last, we provide several suggestions aiming to guide researchers engaged in bias-related work to avoid confusion and further enhance clarity in the community.

new Convergence of Policy Mirror Descent Beyond Compatible Function Approximation

Authors: Uri Sherman, Tomer Koren, Yishay Mansour

Abstract: Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results. However, most of these either target tabular environments, or can be applied effectively only when the class of policies being optimized over satisfies strong closure conditions, which is typically not the case when working with parametric policy classes in large-scale environments. In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a strictly weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space.

new AdaGC: Improving Training Stability for Large Language Model Pretraining

Authors: Guoxia Wang, Shuai Li, Congliang Chen, Jinle Zeng, Jiabin Yang, Tao Sun, Yanjun Ma, Dianhai Yu, Li Shen

Abstract: Large Language Models (LLMs) face increasing loss spikes during scaling, undermining training stability and final performance. While gradient clipping mitigates this issue, traditional global approaches poorly handle parameter-specific gradient variations and decaying gradient norms. We propose **AdaGC**, an adaptive gradient clipping framework that automatically adjusts local thresholds per parameter through exponential moving average of gradient norms. Theoretical analysis proves AdaGC's convergence under non-convex conditions. Extensive experiments demonstrate significant improvements: On Llama-2 7B/13B, AdaGC completely eliminates loss spikes while reducing WikiText perplexity by 3.5% (+0.14pp LAMBADA accuracy) for 7B and achieving 0.65% lower training loss with 1.47% reduced validation perplexity for 13B compared to global clipping. For CLIP ViT-Base, AdaGC converges 25% faster than StableAdamW with full spike elimination. The method shows universal effectiveness across architectures (Llama-2 7B/13B) and modalities (CLIP), with successful integration into diverse optimizers like AdamW and Lion. Source code will be released on GitHub.

new Deep Incomplete Multi-view Learning via Cyclic Permutation of VAEs

Authors: Xin Gao, Jian Pu

Abstract: Multi-View Representation Learning (MVRL) aims to derive a unified representation from multi-view data by leveraging shared and complementary information across views. However, when views are irregularly missing, the incomplete data can lead to representations that lack sufficiency and consistency. To address this, we propose Multi-View Permutation of Variational Auto-Encoders (MVP), which excavates invariant relationships between views in incomplete data. MVP establishes inter-view correspondences in the latent space of Variational Auto-Encoders, enabling the inference of missing views and the aggregation of more sufficient information. To derive a valid Evidence Lower Bound (ELBO) for learning, we apply permutations to randomly reorder variables for cross-view generation and then partition them by views to maintain invariant meanings under permutations. Additionally, we enhance consistency by introducing an informational prior with cyclic permutations of posteriors, which turns the regularization term into a similarity measure across distributions. We demonstrate the effectiveness of our approach on seven diverse datasets with varying missing ratios, achieving superior performance in multi-view clustering and generation tasks.

new ClimateLLM: Efficient Weather Forecasting via Frequency-Aware Large Language Models

Authors: Shixuan Li, Wei Yang, Peiyu Zhang, Xiongye Xiao, Defu Cao, Yuehan Qin, Xiaole Zhang, Yue Zhao, Paul Bogdan

Abstract: Weather forecasting is crucial for public safety, disaster prevention and mitigation, agricultural production, and energy management, with global relevance. Although deep learning has significantly advanced weather prediction, current methods face critical limitations: (i) they often struggle to capture both dynamic temporal dependencies and short-term abrupt changes, making extreme weather modeling difficult; (ii) they incur high computational costs due to extensive training and resource requirements; (iii) they have limited adaptability to multi-scale frequencies, leading to challenges when separating global trends from local fluctuations. To address these issues, we propose ClimateLLM, a foundation model for weather forecasting. It captures spatiotemporal dependencies via a cross-temporal and cross-spatial collaborative modeling framework that integrates Fourier-based frequency decomposition with Large Language Models (LLMs) to strengthen spatial and temporal modeling. Our framework uses a Mixture-of-Experts (MoE) mechanism that adaptively processes different frequency components, enabling efficient handling of both global signals and localized extreme events. In addition, we introduce a cross-temporal and cross-spatial dynamic prompting mechanism, allowing LLMs to incorporate meteorological patterns across multiple scales effectively. Extensive experiments on real-world datasets show that ClimateLLM outperforms state-of-the-art approaches in accuracy and efficiency, as a scalable solution for global weather forecasting.

new A Survey on Active Feature Acquisition Strategies

Authors: Arman Rahbar, Linus Aronsson, Morteza Haghir Chehreghani

Abstract: Active feature acquisition studies the challenge of making accurate predictions while limiting the cost of collecting complete data. By selectively acquiring only the most informative features for each instance, these strategies enable efficient decision-making in scenarios where data collection is expensive or time-consuming. This survey reviews recent progress in active feature acquisition, discussing common problem formulations, practical challenges, and key insights. We also highlight open issues and promising directions for future research.

new Accelerating Anchors via Specialization and Feature Transformation

Authors: Haonan Yu, Junhao Liu, Xin Zhang

Abstract: Anchors is a popular local model-agnostic explanation technique whose applicability is limited by its computational inefficiency. To address this limitation, we propose a pre-training-based approach to accelerate Anchors without compromising the explanation quality. Our approach leverages the iterative nature of Anchors' algorithm which gradually refines an explanation until it is precise enough for a given input by providing a general explanation that is obtained through pre-training as Anchors' initial explanation. Specifically, we develop a two-step rule transformation process: the horizontal transformation adapts a pre-trained explanation to the current input by replacing features, and the vertical transformation refines the general explanation until it is precise enough for the input. We evaluate our method across tabular, text, and image datasets, demonstrating that it significantly reduces explanation generation time while maintaining fidelity and interpretability, thereby enabling the practical adoption of Anchors in time-sensitive applications.

new Generalization of the Gibbs algorithm with high probability at low temperatures

Authors: Andreas Maurer

Abstract: The paper gives a bound on the generalization error of the Gibbs algorithm, which recovers known data-independent bounds for the high temperature range and extends to the low-temperature range, where generalization depends critically on the data-dependent loss-landscape. It is shown, that with high probability the generalization error of a single hypothesis drawn from the Gibbs posterior decreases with the total prior volume of all hypotheses with similar or smaller empirical error. This gives theoretical support to the belief in the benefit of flat minima. The zero temperature limit is discussed and the bound is extended to a class of similar stochastic algorithms.

new Towards Data-Efficient Pretraining for Atomic Property Prediction

Authors: Yasir Ghunaim, Hasan Abed Al Kader Hammoud, Bernard Ghanem

Abstract: This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected, task-relevant dataset can match or even surpass large-scale pretraining, while using as little as 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fr\'echet Inception Distance, for molecular graphs which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most relevant dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently outperform those pretrained on massive, mixed datasets such as JMP, even when those larger datasets include the relevant dataset. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data poorly aligns with the task at hand. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.

new Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL

Authors: Wei Yao, Wenkai Yang, Ziqiao Wang, Yankai Lin, Yong Liu

Abstract: As large language models advance toward superhuman performance, ensuring their alignment with human values and abilities grows increasingly complex. Weak-to-strong generalization offers a promising approach by leveraging predictions from weaker models to guide stronger systems, but its effectiveness could be constrained by the inherent noise and inaccuracies in these weak predictions. To address this, we propose a theoretically grounded approach that replaces forward KL divergence-whose mass-covering behavior risks overfitting to imperfect weak signals-with reverse KL divergence. Reverse KL divergence's zero-forcing effect prioritizes high-confidence predictions, effectively mitigating the influence of unreliable weak supervision. Theoretically, we extend existing bounds and derive tighter lower bounds for both forward and reverse KL divergence, establishing that reverse KL achieves at least comparable guarantees to forward KL. Notably, when a sufficiently pre-trained strong model is fine-tuned on the last layer, reverse KL uniquely guarantees that it outperforms its weak supervisor by the magnitude of their disagreement-a guarantee that forward KL cannot provide. Empirically, we demonstrate that reverse KL and reverse cross-entropy enable strong models to consistently outperform those trained with forward KL and standard cross-entropy across most settings, highlighting the practical advantages of these reverse losses.

new UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene Translation

Authors: Arka Mukherjee, Shreya Ghosh

Abstract: Multimodal fake news detection typically demands complex architectures and substantial computational resources, posing deployment challenges in real-world settings. We introduce UNITE-FND, a novel framework that reframes multimodal fake news detection as a unimodal text classification task. We propose six specialized prompting strategies with Gemini 1.5 Pro, converting visual content into structured textual descriptions, and enabling efficient text-only models to preserve critical visual information. To benchmark our approach, we introduce Uni-Fakeddit-55k, a curated dataset family of 55,000 samples each, each processed through our multimodal-to-unimodal translation framework. Experimental results demonstrate that UNITE-FND achieves 92.52% accuracy in binary classification, surpassing prior multimodal models while reducing computational costs by over 10x (TinyBERT variant: 14.5M parameters vs. 250M+ in SOTA models). Additionally, we propose a comprehensive suite of five novel metrics to evaluate image-to-text conversion quality, ensuring optimal information preservation. Our results demonstrate that structured text-based representations can replace direct multimodal processing with minimal loss of accuracy, making UNITE-FND a practical and scalable alternative for resource-constrained environments.

new MasRouter: Learning to Route LLMs for Multi-Agent Systems

Authors: Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, Yiyan Qi

Abstract: Multi-agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynamic LLM selection. Current LLM routing methods effectively reduce overhead in single-agent scenarios by customizing LLM selection for each query, but they overlook the critical decisions regarding collaboration modes and agent roles in MAS. In response to this challenge, we first introduce the problem of Multi-Agent System Routing (MASR), which integrates all components of MAS into a unified routing framework. Toward this goal, we propose MasRouter, the first high-performing, cost-effective, and inductive MASR solution. MasRouter employs collaboration mode determination, role allocation, and LLM routing through a cascaded controller network, progressively constructing a MAS that balances effectiveness and efficiency. Extensive experiments demonstrate that MasRouter is (1) high-performing, achieving a $1.8\%\sim8.2\%$ improvement over the state-of-the-art method on MBPP; (2) economical, reducing overhead by up to $52.07\%$ compared to SOTA methods on HumanEval; and (3) plug-and-play, seamlessly integrating with mainstream MAS frameworks, reducing overhead by $17.21\%\sim28.17\%$ via customized routing. The code is available at https://github.com/yanweiyue/masrouter.

URLs: https://github.com/yanweiyue/masrouter.

new Machine Learning-Based Intrusion Detection and Prevention System for IIoT Smart Metering Networks: Challenges and Solutions

Authors: Sahar Lazim, Qutaiba I. Ali

Abstract: The Industrial Internet of Things (IIoT) has revolutionized industries by enabling automation, real-time data exchange, and smart decision-making. However, its increased connectivity introduces cybersecurity threats, particularly in smart metering networks, which play a crucial role in monitoring and optimizing energy consumption. This paper explores the challenges associated with securing IIoT-based smart metering networks and proposes a Machine Learning (ML)-based Intrusion Detection and Prevention System (IDPS) for safeguarding edge devices. The study reviews various intrusion detection approaches, highlighting the strengths and limitations of both signature-based and anomaly-based detection techniques. The findings suggest that integrating ML-driven IDPS in IIoT smart metering environments enhances security, efficiency, and resilience against evolving cyber threats.

new Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity

Authors: Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Li, Tiancheng Hu, Zhixia Liu, Xusheng Chen, Tao Xie, Yizhou Shan

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities across various domains, with recent advancements in challenging reasoning tasks such as mathematics and programming. However, solving reasoning tasks often requires long decoding chains (of thoughts), which incur $O(N)$ time and memory consumption, where $N$ is the chain length. To mitigate $O(N)$ time and memory consumption, existing sparsity-based algorithms propose retaining only the most critical token's intermediate data (i.e., key-value cache) and discarding the rest. However, these existing algorithms struggle with the ``impossible trinity'' of accuracy, time, and memory. For example, the state-of-the-art algorithm, Quest, achieves high accuracy with $O(L)$ time but $O(N)$ memory ($L$ is the cache budget, $L \ll N$). To address this issue, in this paper, we identify a new attention pattern during the decode stage of reasoning tasks, where milestone tokens (analogous to lemmas in mathematical proofs) emerge, are utilized, and then become unimportant afterward. Based on this pattern, we propose a new algorithm named RaaS that identifies and retains milestone tokens only until they are no longer needed, achieving high accuracy with $O(L)$ time and $O(L)$ memory complexity.

new Large Language-Geometry Model: When LLM meets Equivariance

Authors: Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu Rong, Deli Zhao

Abstract: Accurately predicting 3D structures and dynamics of physical systems is crucial in scientific applications. Existing approaches that rely on geometric Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, but they often fall in leveraging extensive broader information. While direct application of Large Language Models (LLMs) can incorporate external knowledge, they lack the capability for spatial reasoning with guaranteed equivariance. In this paper, we propose EquiLLM, a novel framework for representing 3D physical systems that seamlessly integrates E(3)-equivariance with LLM capabilities. Specifically, EquiLLM comprises four key components: geometry-aware prompting, an equivariant encoder, an LLM, and an equivariant adaptor. Essentially, the LLM guided by the instructive prompt serves as a sophisticated invariant feature processor, while 3D directional information is exclusively handled by the equivariant encoder and adaptor modules. Experimental results demonstrate that EquiLLM delivers significant improvements over previous methods across molecular dynamics simulation, human motion simulation, and antibody design, highlighting its promising generalizability.

new Logarithmic Width Suffices for Robust Memorization

Authors: Amitsour Egosi, Gilad Yehudai, Ohad Shamir

Abstract: The memorization capacity of neural networks with a given architecture has been thoroughly studied in many works. Specifically, it is well-known that memorizing $N$ samples can be done using a network of constant width, independent of $N$. However, the required constructions are often quite delicate. In this paper, we consider the natural question of how well feedforward ReLU neural networks can memorize robustly, namely while being able to withstand adversarial perturbations of a given radius. We establish both upper and lower bounds on the possible radius for general $l_p$ norms, implying (among other things) that width logarithmic in the number of input samples is necessary and sufficient to achieve robust memorization (with robustness radius independent of $N$).

new SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

Authors: Bohan Lyu, Siqiao Huang, Zichen Liang

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation. However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors, to predict the output and behavior of a program without actually running it. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. We evaluate multiple open-source and proprietary LLMs on SURGE and conduct a scaling study to analyze the impact of model size and training data scale on surrogate execution accuracy. Additionally, we categorize model prediction errors and explore potential areas for improvement. Our findings indicate that while LLMs can predict code execution results in certain cases, they exhibit limitations in general-purpose surrogate execution. This study provides empirical insights into the feasibility of using LLMs as surrogate code executors. Code and dataset are released at https://github.com/Imbernoulli/SURGE.

URLs: https://github.com/Imbernoulli/SURGE.

new How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training

Authors: Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen

Abstract: Despite exceptional capabilities in knowledge-intensive tasks, Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge, particularly how to structurally embed acquired knowledge in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also provide potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.

URLs: https://github.com/zjunlp/DynamicKnowledgeCircuits.

new Deep Contrastive Learning for Feature Alignment: Insights from Housing-Household Relationship Inference

Authors: Xiao Qian, Shangjia Dong, Rachel Davidson

Abstract: Housing and household characteristics are key determinants of social and economic well-being, yet our understanding of their interrelationships remains limited. This study addresses this knowledge gap by developing a deep contrastive learning (DCL) model to infer housing-household relationships using the American Community Survey (ACS) Public Use Microdata Sample (PUMS). More broadly, the proposed model is suitable for a class of problems where the goal is to learn joint relationships between two distinct entities without explicitly labeled ground truth data. Our proposed dual-encoder DCL approach leverages co-occurrence patterns in PUMS and introduces a bisect K-means clustering method to overcome the absence of ground truth labels. The dual-encoder DCL architecture is designed to handle the semantic differences between housing (building) and household (people) features while mitigating noise introduced by clustering. To validate the model, we generate a synthetic ground truth dataset and conduct comprehensive evaluations. The model further demonstrates its superior performance in capturing housing-household relationships in Delaware compared to state-of-the-art methods. A transferability test in North Carolina confirms its generalizability across diverse sociodemographic and geographic contexts. Finally, the post-hoc explainable AI analysis using SHAP values reveals that tenure status and mortgage information play a more significant role in housing-household matching than traditionally emphasized factors such as the number of persons and rooms.

new Span-Agnostic Optimal Sample Complexity and Oracle Inequalities for Average-Reward RL

Authors: Matthew Zurek, Yudong Chen

Abstract: We study the sample complexity of finding an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs) with a generative model. The minimax optimal span-based complexity of $\widetilde{O}(SAH/\varepsilon^2)$, where $H$ is the span of the optimal bias function, has only been achievable with prior knowledge of the value of $H$. Prior-knowledge-free algorithms have been the objective of intensive research, but several natural approaches provably fail to achieve this goal. We resolve this problem, developing the first algorithms matching the optimal span-based complexity without $H$ knowledge, both when the dataset size is fixed and when the suboptimality level $\varepsilon$ is fixed. Our main technique combines the discounted reduction approach with a method for automatically tuning the effective horizon based on empirical confidence intervals or lower bounds on performance, which we term horizon calibration. We also develop an empirical span penalization approach, inspired by sample variance penalization, which satisfies an oracle inequality performance guarantee. In particular this algorithm can outperform the minimax complexity in benign settings such as when there exist near-optimal policies with span much smaller than $H$.

new Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens

Authors: Samuele Bortolotti, Emanuele Marconato, Paolo Morettin, Andrea Passerini, Stefano Teso

Abstract: Concept-based Models are neural networks that learn a concept extractor to map inputs to high-level concepts and an inference layer to translate these into predictions. Ensuring these modules produce interpretable concepts and behave reliably in out-of-distribution is crucial, yet the conditions for achieving this remain unclear. We study this problem by establishing a novel connection between Concept-based Models and reasoning shortcuts (RSs), a common issue where models achieve high accuracy by learning low-quality concepts, even when the inference layer is fixed and provided upfront. Specifically, we first extend RSs to the more complex setting of Concept-based Models and then derive theoretical conditions for identifying both the concepts and the inference layer. Our empirical results highlight the impact of reasoning shortcuts and show that existing methods, even when combined with multiple natural mitigation strategies, often fail to meet these conditions in practice.

new Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View

Authors: Yanran Wu, Inez Hua, Yi Ding

Abstract: Large language models (LLMs) offer powerful capabilities but come with significant environmental costs, particularly in carbon emissions. Existing studies benchmark these emissions but lack a standardized basis for comparison across models. To address this, we introduce the concept of a functional unit (FU) and develop FUEL, the first FU-based framework for evaluating LLM serving's environmental impact. Through case studies on model size, quantization, and hardware, we uncover key trade-offs in sustainability. Our findings highlight the potential for reducing carbon emissions by optimizing model selection, deployment strategies, and hardware choices, paving the way for more sustainable AI infrastructure.

new Scalable Multi-Agent Offline Reinforcement Learning and the Role of Information

Authors: Riccardo Zamboni, Enrico Brunetti, Marcello Restelli

Abstract: Offline Reinforcement Learning (RL) focuses on learning policies solely from a batch of previously collected data. of- fering the potential to leverage such datasets effectively without the need for costly or risky active exploration. While recent advances in Offline Multi-Agent RL (MARL) have shown promise, most existing methods either rely on large datasets jointly collected by all agents or agent-specific datasets collected independently. The former approach ensures strong performance but raises scalability concerns, while the latter emphasizes scalability at the expense of performance guarantees. In this work, we propose a novel scalable routine for both dataset collection and offline learning. Agents first collect diverse datasets coherently with a pre-specified information-sharing network and subsequently learn coherent localized policies without requiring either full observability or falling back to complete decentralization. We theoretically demonstrate that this structured approach allows a multi-agent extension of the seminal Fitted Q-Iteration (FQI) algorithm to globally converge, in high probability, to near-optimal policies. The convergence is subject to error terms that depend on the informativeness of the shared information. Furthermore, we show how this approach allows to bound the inherent error of the supervised-learning phase of FQI with the mutual information between shared and unshared information. Our algorithm, SCAlable Multi-agent FQI (SCAM-FQI), is then evaluated on a distributed decision-making problem. The empirical results align with our theoretical findings, supporting the effectiveness of SCAM-FQI in achieving a balance between scalability and policy performance.

new OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

Authors: Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, James Zou

Abstract: Solving complex reasoning tasks may involve visual understanding, domain knowledge retrieval, numerical calculation, and multi-step reasoning. Existing methods augment large language models (LLMs) with external tools but are restricted to specialized domains, limited tool types, or require additional training data. In this paper, we introduce OctoTools, a training-free, user-friendly, and easily extensible open-source agentic framework designed to tackle complex reasoning across diverse domains. OctoTools introduces standardized tool cards to encapsulate tool functionality, a planner for both high-level and low-level planning, and an executor to carry out tool usage. We validate OctoTools' generality across 16 diverse tasks (including MathVista, MMLU-Pro, MedQA, and GAIA-Text), achieving substantial average accuracy gains of 9.3% over GPT-4o. Furthermore, OctoTools outperforms AutoGen, GPT-Functions and LangChain by up to 10.6% when given the same set of tools. Through comprehensive analysis and ablations, OctoTools demonstrates advantages in task planning, effective tool usage, and multi-step problem solving.

new Neural Operators for Stochastic Modeling of Nonlinear Structural System Response to Natural Hazards

Authors: Somdatta Goswami, Dimitris G. Giovanis, Bowei Li, Seymour M. J. Spence, Michael D. Shields

Abstract: Traditionally, neural networks have been employed to learn the mapping between finite-dimensional Euclidean spaces. However, recent research has opened up new horizons, focusing on the utilization of deep neural networks to learn operators capable of mapping infinite-dimensional function spaces. In this work, we employ two state-of-the-art neural operators, the deep operator network (DeepONet) and the Fourier neural operator (FNO) for the prediction of the nonlinear time history response of structural systems exposed to natural hazards, such as earthquakes and wind. Specifically, we propose two architectures, a self-adaptive FNO and a Fast Fourier Transform-based DeepONet (DeepFNOnet), where we employ a FNO beyond the DeepONet to learn the discrepancy between the ground truth and the solution predicted by the DeepONet. To demonstrate the efficiency and applicability of the architectures, two problems are considered. In the first, we use the proposed model to predict the seismic nonlinear dynamic response of a six-story shear building subject to stochastic ground motions. In the second problem, we employ the operators to predict the wind-induced nonlinear dynamic response of a high-rise building while explicitly accounting for the stochastic nature of the wind excitation. In both cases, the trained metamodels achieve high accuracy while being orders of magnitude faster than their corresponding high-fidelity models.

new Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning

Authors: Mohit Raghavendra, Junmo Kang, Alan Ritter

Abstract: Post-training of Large Language Models often involves a pipeline of Supervised Finetuning (SFT) followed by Preference Finetuning (PFT) using methods like Direct Preference Optimization. Both stages require annotated data that are very different in structure and costs. We study how to optimally allocate a fixed training data budget between the two stages, through extensive experiments spanning four diverse tasks, multiple model sizes and various data annotation costs. Our findings reveal that just SFT on the base model dominates performance in low-data regimes ($<1,000$ annotated examples). With larger data-budgets, we observe that a combination of SFT and PFT, often with increasing portions allocated towards preference data yields optimal performance. However, completely eliminating SFT and running PFT directly on the base model yields suboptimal performance, described as the cold start problem on tasks like mathematics. We observe that this is due to the distribution shift arising from using DPO directly on the base model to elicit step-by-step reasoning. This limitation can be effectively addressed by allocating even a small portion ($<10$%) of the budget to SFT first, resulting in performance improvements of $15-20$% on analytical benchmarks like GSM8k. These results provide actionable insights for researchers and practitioners optimizing model development under budget constraints, where high-quality data curation often represents a significant portion of the total costs of model development.

new Non-Uniform Memory Sampling in Experience Replay

Authors: Andrii Krutsylo

Abstract: Continual learning is the process of training machine learning models on a sequence of tasks where data distributions change over time. A well-known obstacle in this setting is catastrophic forgetting, a phenomenon in which a model drastically loses performance on previously learned tasks when learning new ones. A popular strategy to alleviate this problem is experience replay, in which a subset of old samples is stored in a memory buffer and replayed with new data. Despite continual learning advances focusing on which examples to store and how to incorporate them into the training loss, most approaches assume that sampling from this buffer is uniform by default. We challenge the assumption that uniform sampling is necessarily optimal. We conduct an experiment in which the memory buffer updates the same way in every trial, but the replay probability of each stored sample changes between trials based on different random weight distributions. Specifically, we generate 50 different non-uniform sampling probability weights for each trial and compare their final accuracy to the uniform sampling baseline. We find that there is always at least one distribution that significantly outperforms the baseline across multiple buffer sizes, models, and datasets. These results suggest that more principled adaptive replay policies could yield further gains. We discuss how exploiting this insight could inspire new research on non-uniform memory sampling in continual learning to better mitigate catastrophic forgetting. The code supporting this study is available at $\href{https://github.com/DentonJC/memory-sampling}{https://github.com/DentonJC/memory-sampling}$.

URLs: https://github.com/DentonJC/memory-sampling, https://github.com/DentonJC/memory-sampling

new Inverse Flow and Consistency Models

Authors: Yuchen Zhang, Jian Zhou

Abstract: Inverse generation problems, such as denoising without ground truth observations, is a critical challenge in many scientific inquiries and real-world applications. While recent advances in generative models like diffusion models, conditional flow matching, and consistency models achieved impressive results by casting generation as denoising problems, they cannot be directly used for inverse generation without access to clean data. Here we introduce Inverse Flow (IF), a novel framework that enables using these generative models for inverse generation problems including denoising without ground truth. Inverse Flow can be flexibly applied to nearly any continuous noise distribution and allows complex dependencies. We propose two algorithms for learning Inverse Flows, Inverse Flow Matching (IFM) and Inverse Consistency Model (ICM). Notably, to derive the computationally efficient, simulation-free inverse consistency model objective, we generalized consistency training to any forward diffusion processes or conditional flows, which have applications beyond denoising. We demonstrate the effectiveness of IF on synthetic and real datasets, outperforming prior approaches while enabling noise distributions that previous methods cannot support. Finally, we showcase applications of our techniques to fluorescence microscopy and single-cell genomics data, highlighting IF's utility in scientific problems. Overall, this work expands the applications of powerful generative models to inversion generation problems.

new S2TX: Cross-Attention Multi-Scale State-Space Transformer for Time Series Forecasting

Authors: Zihao Wu, Juncheng Dong, Haoming Yang, Vahid Tarokh

Abstract: Time series forecasting has recently achieved significant progress with multi-scale models to address the heterogeneity between long and short range patterns. Despite their state-of-the-art performance, we identify two potential areas for improvement. First, the variates of the multivariate time series are processed independently. Moreover, the multi-scale (long and short range) representations are learned separately by two independent models without communication. In light of these concerns, we propose State Space Transformer with cross-attention (S2TX). S2TX employs a cross-attention mechanism to integrate a Mamba model for extracting long-range cross-variate context and a Transformer model with local window attention to capture short-range representations. By cross-attending to the global context, the Transformer model further facilitates variate-level interactions as well as local/global communications. Comprehensive experiments on seven classic long-short range time-series forecasting benchmark datasets demonstrate that S2TX can achieve highly robust SOTA results while maintaining a low memory footprint.

new Biases in Edge Language Models: Detection, Analysis, and Mitigation

Authors: Vinamra Sharma, Danilo Pietro Pau, Jos\'e Cano

Abstract: The integration of large language models (LLMs) on low-power edge devices such as Raspberry Pi, known as edge language models (ELMs), has introduced opportunities for more personalized, secure, and low-latency language intelligence that is accessible to all. However, the resource constraints inherent in edge devices and the lack of robust ethical safeguards in language models raise significant concerns about fairness, accountability, and transparency in model output generation. This paper conducts a comparative analysis of text-based bias across language model deployments on edge, cloud, and desktop environments, aiming to evaluate how deployment settings influence model fairness. Specifically, we examined an optimized Llama-2 model running on a Raspberry Pi 4; GPT 4o-mini, Gemini-1.5-flash, and Grok-beta models running on cloud servers; and Gemma2 and Mistral models running on a MacOS desktop machine. Our results demonstrate that Llama-2 running on Raspberry Pi 4 is 43.23% and 21.89% more prone to showing bias over time compared to models running on the desktop and cloud-based environments. We also propose the implementation of a feedback loop, a mechanism that iteratively adjusts model behavior based on previous outputs, where predefined constraint weights are applied layer-by-layer during inference, allowing the model to correct bias patterns, resulting in 79.28% reduction in model bias.

new SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

Authors: Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du

Abstract: The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders (SAE) to interpret how instruction following works in these models. We demonstrate how the features we identify can effectively steer model outputs to align with given instructions. Through analysis of SAE latent activations, we identify specific latents responsible for instruction following behavior. Our findings reveal that instruction following capabilities are encoded by a distinct set of instruction-relevant SAE latents. These latents both show semantic proximity to relevant instructions and demonstrate causal effects on model behavior. Our research highlights several crucial factors for achieving effective steering performance: precise feature identification, the role of final layer, and optimal instruction positioning. Additionally, we demonstrate that our methodology scales effectively across SAEs and LLMs of varying sizes.

new Teleportation With Null Space Gradient Projection for Optimization Acceleration

Authors: Zihao Wu, Juncheng Dong, Ahmed Aloui, Vahid Tarokh

Abstract: Optimization techniques have become increasingly critical due to the ever-growing model complexity and data scale. In particular, teleportation has emerged as a promising approach, which accelerates convergence of gradient descent-based methods by navigating within the loss invariant level set to identify parameters with advantageous geometric properties. Existing teleportation algorithms have primarily demonstrated their effectiveness in optimizing Multi-Layer Perceptrons (MLPs), but their extension to more advanced architectures, such as Convolutional Neural Networks (CNNs) and Transformers, remains challenging. Moreover, they often impose significant computational demands, limiting their applicability to complex architectures. To this end, we introduce an algorithm that projects the gradient of the teleportation objective function onto the input null space, effectively preserving the teleportation within the loss invariant level set and reducing computational cost. Our approach is readily generalizable from MLPs to CNNs, transformers, and potentially other advanced architectures. We validate the effectiveness of our algorithm across various benchmark datasets and optimizers, demonstrating its broad applicability.

new Sparse Autoencoder Features for Classifications and Transferability

Authors: Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

Abstract: Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAE for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.

URLs: https://github.com/shan23chen/MOSAIC.

new Oversmoothing as Loss of Sign: Towards Structural Balance in Graph Neural Networks

Authors: Jiaqi Wang, Xinyi Wu, James Cheng, Yifei Wang

Abstract: Oversmoothing is a common issue in graph neural networks (GNNs), where node representations become excessively homogeneous as the number of layers increases, resulting in degraded performance. Various strategies have been proposed to combat oversmoothing in practice, yet they are based on different heuristics and lack a unified understanding of their inherent mechanisms. In this paper, we show that three major classes of anti-oversmoothing techniques can be mathematically interpreted as message passing over signed graphs comprising both positive and negative edges. By analyzing the asymptotic behavior of signed graph propagation, we demonstrate that negative edges can repel nodes to a certain extent, providing deeper insights into how these methods mitigate oversmoothing. Furthermore, our results suggest that the structural balance of a signed graph-where positive edges exist only within clusters and negative edges appear only between clusters-is crucial for clustering node representations in the long term through signed graph propagation. Motivated by these observations, we propose a solution to mitigate oversmoothing with theoretical guarantees-Structural Balance Propagation (SBP), by incorporating label and feature information to create a structurally balanced graph for message-passing. Experiments on nine datasets against twelve baselines demonstrate the effectiveness of our method, highlighting the value of our signed graph perspective.

new Structure based SAT dataset for analysing GNN generalisation

Authors: Yi Fu, Anthony Tompkins, Yang Song, Maurice Pagnucco

Abstract: Satisfiability (SAT) solvers based on techniques such as conflict driven clause learning (CDCL) have produced excellent performance on both synthetic and real world industrial problems. While these CDCL solvers only operate on a per-problem basis, graph neural network (GNN) based solvers bring new benefits to the field by allowing practitioners to exploit knowledge gained from solved problems to expedite solving of new SAT problems. However, one specific area that is often studied in the context of CDCL solvers, but largely overlooked in GNN solvers, is the relationship between graph theoretic measure of structure in SAT problems and the generalisation ability of GNN solvers. To bridge the gap between structural graph properties (e.g., modularity, self-similarity) and the generalisability (or lack thereof) of GNN based SAT solvers, we present StructureSAT: a curated dataset, along with code to further generate novel examples, containing a diverse set of SAT problems from well known problem domains. Furthermore, we utilise a novel splitting method that focuses on deconstructing the families into more detailed hierarchies based on their structural properties. With the new dataset, we aim to help explain problematic generalisation in existing GNN SAT solvers by exploiting knowledge of structural graph properties. We conclude with multiple future directions that can help researchers in GNN based SAT solving develop more effective and generalisable SAT solvers.

new Detecting and Filtering Unsafe Training Data via Data Attribution

Authors: Yijun Pan, Taiwei Shi, Jieyu Zhao, Jiaqi W. Ma

Abstract: Large language models (LLMs) are vulnerable to unsafe training data that even small amounts of unsafe data can lead to harmful model behaviors. Detecting and filtering such unsafe training data is essential for trustworthy model development. Current state-of-the-art (SOTA) approaches typically rely on training moderation classifiers which requires significant computational overhead and are limited to predefined taxonomies, making them less adaptable to evolving safety concerns. Moreover, these classifiers lack insight into the training process, limiting their effectiveness in filtering unsafe data. To address these limitations, we propose DABUF, leveraging data attribution to detect and filter unsafe training data by attributing harmful model outputs to influential training data points. DABUF enables flexible identification of various unsafe data types without predefined taxonomies. However, in practice, model outputs can be complex with combined safe linguistic features and unsafe content, leading to reduced attribution accuracy. In such cases, DABUF will integrate moderation classifiers to identify a minimal subset of unsafe training data for targeted attribution (such as jailbreak). When model outputs are relatively straightforward, DABUF uses model outputs directly as the attribution targets. We evaluate the performance on two different tasks: in filtering jailbreaking training data and in identifying and mitigating gender bias. DABUF outperforms SOTA approaches by up to 7.5\% in detection AUPRC in jailbreaking scenarios, and 44.1\% in detecting gender bias. Moreover, retraining on DABUF-filtered data leads to higher model safety across experiments, underscoring its versatility in addressing a broad spectrum of unsafe data issues.

new Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise

Authors: Ilias Diakonikolas, Mingchen Ma, Lisheng Ren, Christos Tzamos

Abstract: We study the task of Multiclass Linear Classification (MLC) in the distribution-free PAC model with Random Classification Noise (RCN). Specifically, the learner is given a set of labeled examples $(x, y)$, where $x$ is drawn from an unknown distribution on $R^d$ and the labels are generated by a multiclass linear classifier corrupted with RCN. That is, the label $y$ is flipped from $i$ to $j$ with probability $H_{ij}$ according to a known noise matrix $H$ with non-negative separation $\sigma: = \min_{i \neq j} H_{ii}-H_{ij}$. The goal is to compute a hypothesis with small 0-1 error. For the special case of two labels, prior work has given polynomial-time algorithms achieving the optimal error. Surprisingly, little is known about the complexity of this task even for three labels. As our main contribution, we show that the complexity of MLC with RCN becomes drastically different in the presence of three or more labels. Specifically, we prove super-polynomial Statistical Query (SQ) lower bounds for this problem. In more detail, even for three labels and constant separation, we give a super-polynomial lower bound on the complexity of any SQ algorithm achieving optimal error. For a larger number of labels and smaller separation, we show a super-polynomial SQ lower bound even for the weaker goal of achieving any constant factor approximation to the optimal loss or even beating the trivial hypothesis.

new DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services

Authors: Ting Sun, Penghan Wang, Fan Lai

Abstract: The rapid rise of large language models (LLMs) in text streaming services has introduced significant cost and Quality of Experience (QoE) challenges in serving millions of daily requests, especially in meeting Time-To-First-Token (TTFT) and Time-Between-Token (TBT) requirements for real-time interactions. Our real-world measurements show that both server-based and on-device deployments struggle to meet diverse QoE demands: server deployments face high costs and last-hop issues (e.g., Internet latency and dynamics), while on-device LLM inference is constrained by resources. We introduce DiSCo, a device-server cooperative scheduler designed to optimize users' QoE by adaptively routing requests and migrating response generation between endpoints while maintaining cost constraints. DiSCo employs cost-aware scheduling, leveraging the predictable speed of on-device LLM inference with the flexible capacity of server-based inference to dispatch requests on the fly, while introducing a token-level migration mechanism to ensure consistent token delivery during migration. Evaluations on real-world workloads -- including commercial services like OpenAI GPT and DeepSeek, and open-source deployments such as LLaMA3 -- show that DiSCo can improve users' QoE by reducing tail TTFT (11-52\%) and mean TTFT (6-78\%) across different model-device configurations, while dramatically reducing serving costs by up to 84\% through its migration mechanism while maintaining comparable QoE levels.

new Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models

Authors: Yingqing Guo, Yukang Yang, Hui Yuan, Mengdi Wang

Abstract: Training-free guidance enables controlled generation in diffusion and flow models, but most existing methods assume differentiable objectives and rely on gradients. This work focuses on training-free guidance addressing challenges from non-differentiable objectives and discrete data distributions. We propose an algorithmic framework TreeG: Tree Search-Based Path Steering Guidance, applicable to both continuous and discrete settings in diffusion and flow models. TreeG offers a unified perspective on training-free guidance: proposing candidates for the next step, evaluating candidates, and selecting the best to move forward, enhanced by a tree search mechanism over active paths or parallelizing exploration. We comprehensively investigate the design space of TreeG over the candidate proposal module and the evaluation function, instantiating TreeG into three novel algorithms. Our experiments show that TreeG consistently outperforms the top guidance baselines in symbolic music generation, small molecule generation, and enhancer DNA design, all of which involve non-differentiable challenges. Additionally, we identify an inference-time scaling law showing TreeG's scalability in inference-time computation.

new What's in a Query: Polarity-Aware Distribution-Based Fair Ranking

Authors: Aparna Balagopalan, Kai Wang, Olawale Salaudeen, Asia Biega, Marzyeh Ghassemi

Abstract: Machine learning-driven rankings, where individuals (or items) are ranked in response to a query, mediate search exposure or attention in a variety of safety-critical settings. Thus, it is important to ensure that such rankings are fair. Under the goal of equal opportunity, attention allocated to an individual on a ranking interface should be proportional to their relevance across search queries. In this work, we examine amortized fair ranking -- where relevance and attention are cumulated over a sequence of user queries to make fair ranking more feasible in practice. Unlike prior methods that operate on expected amortized attention for each individual, we define new divergence-based measures for attention distribution-based fairness in ranking (DistFaiR), characterizing unfairness as the divergence between the distribution of attention and relevance corresponding to an individual over time. This allows us to propose new definitions of unfairness, which are more reliable at test time. Second, we prove that group fairness is upper-bounded by individual fairness under this definition for a useful class of divergence measures, and experimentally show that maximizing individual fairness through an integer linear programming-based optimization is often beneficial to group fairness. Lastly, we find that prior research in amortized fair ranking ignores critical information about queries, potentially leading to a fairwashing risk in practice by making rankings appear more fair than they actually are.

new ADO: Automatic Data Optimization for Inputs in LLM Prompts

Authors: Sam Lin, Wenyue Hua, Lingyao Li, Zhenting Wang, Yongfeng Zhang

Abstract: This study explores a novel approach to enhance the performance of Large Language Models (LLMs) through the optimization of input data within prompts. While previous research has primarily focused on refining instruction components and augmenting input data with in-context examples, our work investigates the potential benefits of optimizing the input data itself. We introduce a two-pronged strategy for input data optimization: content engineering and structural reformulation. Content engineering involves imputing missing values, removing irrelevant attributes, and enriching profiles by generating additional information inferred from existing attributes. Subsequent to content engineering, structural reformulation is applied to optimize the presentation of the modified content to LLMs, given their sensitivity to input format. Our findings suggest that these optimizations can significantly improve the performance of LLMs in various tasks, offering a promising avenue for future research in prompt engineering. The source code is available at https://anonymous.4open.science/r/ADO-6BC5/

URLs: https://anonymous.4open.science/r/ADO-6BC5/

new Does Editing Provide Evidence for Localization?

Authors: Zihao Wang, Victor Veitch

Abstract: A basic aspiration for interpretability research in large language models is to "localize" semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is: how strong is the evidence provided by such edits? To assess localization, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for localization appears strong, but where localization clearly fails. Indeed, we find that optimal edits at random localizations can be as effective as aligning the full model. In aggregate, our results suggest that merely observing that localized edits induce targeted changes in behavior provides little to no evidence that these locations actually encode the target behavior.

new Fishing For Cheap And Efficient Pruners At Initialization

Authors: Ivo Gollini Navarrete, Nicolas Mauricio Cuadrado, Jose Renato Restom, Martin Tak\'a\v{c}, Samuel Horv\'ath

Abstract: Pruning offers a promising solution to mitigate the associated costs and environmental impact of deploying large deep neural networks (DNNs). Traditional approaches rely on computationally expensive trained models or time-consuming iterative prune-retrain cycles, undermining their utility in resource-constrained settings. To address this issue, we build upon the established principles of saliency (LeCun et al., 1989) and connection sensitivity (Lee et al., 2018) to tackle the challenging problem of one-shot pruning neural networks (NNs) before training (PBT) at initialization. We introduce Fisher-Taylor Sensitivity (FTS), a computationally cheap and efficient pruning criterion based on the empirical Fisher Information Matrix (FIM) diagonal, offering a viable alternative for integrating first- and second-order information to identify a model's structurally important parameters. Although the FIM-Hessian equivalency only holds for convergent models that maximize the likelihood, recent studies (Karakida et al., 2019) suggest that, even at initialization, the FIM captures essential geometric information of parameters in overparameterized NNs, providing the basis for our method. Finally, we demonstrate empirically that layer collapse, a critical limitation of data-dependent pruning methodologies, is easily overcome by pruning within a single training epoch after initialization. We perform experiments on ResNet18 and VGG19 with CIFAR-10 and CIFAR-100, widely used benchmarks in pruning research. Our method achieves competitive performance against state-of-the-art techniques for one-shot PBT, even under extreme sparsity conditions. Our code is made available to the public.

new Connector-S: A Survey of Connectors in Multi-modal Large Language Models

Authors: Xun Zhu, Zheng Zhang, Xi Chen, Yiming Shi, Miao Li, Ji Wu

Abstract: With the rapid advancements in multi-modal large language models (MLLMs), connectors play a pivotal role in bridging diverse modalities and enhancing model performance. However, the design and evolution of connectors have not been comprehensively analyzed, leaving gaps in understanding how these components function and hindering the development of more powerful connectors. In this survey, we systematically review the current progress of connectors in MLLMs and present a structured taxonomy that categorizes connectors into atomic operations (mapping, compression, mixture of experts) and holistic designs (multi-layer, multi-encoder, multi-modal scenarios), highlighting their technical contributions and advancements. Furthermore, we discuss several promising research frontiers and challenges, including high-resolution input, dynamic compression, guide information selection, combination strategy, and interpretability. This survey is intended to serve as a foundational reference and a clear roadmap for researchers, providing valuable insights into the design and optimization of next-generation connectors to enhance the performance and adaptability of MLLMs.

new Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models

Authors: Jiecheng Zhou, Ding Tang, Rong Fu, Boni Hu, Haoran Xu, Yi Wang, Zhilin Pei, Zhongling Su, Liang Liu, Xingcheng Zhang, Weiming Zhang

Abstract: The burgeoning computational demands for training large language models (LLMs) necessitate efficient methods, including quantized training, which leverages low-bit arithmetic operations to reduce costs. While FP8 precision has shown potential, leveraging FP4 remains challenging due to inherent quantization errors and limited representation capability. Based on the Transformer architecture, we present an FP4 training scheme for LLMs, overcoming these obstacles through mixed-precision quantization strategies tailed for different modules and training stages. This allows us to apply the precision level suitable to distinct components within the model, ensuring that multi-head attention and linear layers are handled appropriately. Our pretraining recipe ensures stability in backpropagation by incorporating fine-grained quantization methods with a target precision training schedule. Experimental results demonstrate that our FP4 training scheme achieves accuracy comparable to BF16 and FP8, with smaller theoretical computational cost. With the advent of next-generation hardware supporting FP4, our method sets the foundation for efficient ultra-low precision training.

new GiFT: Gibbs Fine-Tuning for Code Generation

Authors: Haochen Li, Wanjin Feng, Xin Zhou, Zhiqi Shen

Abstract: Training Large Language Models (LLMs) with synthetic data is a prevalent practice in code generation. A key approach is self-training, where LLMs are iteratively trained on self-generated correct code snippets. In this case, the self-generated codes are drawn from a conditional distribution, conditioned on a specific seed description. However, the seed description is not the only valid representation that aligns with its intended meaning. With all valid descriptions and codes forming a joint space, codes drawn from the conditional distribution would lead to an underrepresentation of the full description-code space. As such, we propose Gibbs Fine-Tuning (GiFT), a novel self-training method inspired by Gibbs sampling. GiFT allows self-generated data to be drawn from the marginal distribution of the joint space, thereby mitigating the biases inherent in conditional sampling. We provide a theoretical analysis demonstrating the potential benefits of fine-tuning LLMs with code derived from the marginal distribution. Furthermore, we propose a perplexity-based code selection method to mitigate the imbalanced long-tail distribution of the self-generated codes. Empirical evaluation of two LLMs across four datasets demonstrates that GiFT achieves superior performance, particularly on more challenging benchmarks.

new Approximation of Permutation Invariant Polynomials by Transformers: Efficient Construction in Column-Size

Authors: Naoki Takeshita, Masaaki Imaizumi

Abstract: Transformers are a type of neural network that have demonstrated remarkable performance across various domains, particularly in natural language processing tasks. Motivated by this success, research on the theoretical understanding of transformers has garnered significant attention. A notable example is the mathematical analysis of their approximation power, which validates the empirical expressive capability of transformers. In this study, we investigate the ability of transformers to approximate column-symmetric polynomials, an extension of symmetric polynomials that take matrices as input. Consequently, we establish an explicit relationship between the size of the transformer network and its approximation capability, leveraging the parameter efficiency of transformers and their compatibility with symmetry by focusing on the algebraic properties of symmetric polynomials.

new Enhancing Offline Model-Based RL via Active Model Selection: A Bayesian Optimization Perspective

Authors: Yu-Wei Yang, Yun-Ming Chan, Wei Hung, Xi Liu, Ping-Chun Hsieh

Abstract: Offline model-based reinforcement learning (MBRL) serves as a competitive framework that can learn well-performing policies solely from pre-collected data with the help of learned dynamics models. To fully unleash the power of offline MBRL, model selection plays a pivotal role in determining the dynamics model utilized for downstream policy learning. However, offline MBRL conventionally relies on validation or off-policy evaluation, which are rather inaccurate due to the inherent distribution shift in offline RL. To tackle this, we propose BOMS, an active model selection framework that enhances model selection in offline MBRL with only a small online interaction budget, through the lens of Bayesian optimization (BO). Specifically, we recast model selection as BO and enable probabilistic inference in BOMS by proposing a novel model-induced kernel, which is theoretically grounded and computationally efficient. Through extensive experiments, we show that BOMS improves over the baseline methods with a small amount of online interaction comparable to only $1\%$-$2.5\%$ of offline training data on various RL tasks.

new DATA: Decomposed Attention-based Task Adaptation for Rehearsal-Free Continual Learning

Authors: Huanxuan Liao, Shizhu He, Yupu Hao, Jun Zhao, Kang Liu

Abstract: Continual learning (CL) is essential for Large Language Models (LLMs) to adapt to evolving real-world demands, yet they are susceptible to catastrophic forgetting (CF). While traditional CF solutions rely on expensive data rehearsal, recent rehearsal-free methods employ model-based and regularization-based strategies to address this issue. However, these approaches often neglect the model's plasticity, which is crucial to achieving optimal performance on newly learned tasks. Consequently, a key challenge in CL is striking a balance between preserving plasticity and mitigating CF. To tackle this challenge, we propose the $\textbf{D}$ecomposed $\textbf{A}$ttention-based $\textbf{T}$ask $\textbf{A}$daptation (DATA), which explicitly decouples and learns both task-specific and task-shared knowledge using high-rank and low-rank task adapters (e.g., LoRAs). For new tasks, DATA dynamically adjusts the weights of adapters of different ranks based on their relevance and distinction from previous tasks, allowing the model to acquire new task-specific skills while effectively retaining previously learned knowledge. Specifically, we implement a decomposed component weighting strategy comprising learnable components that collectively generate attention-based weights, allowing the model to integrate and utilize diverse knowledge from each DATA. Extensive experiments on three widely used benchmarks demonstrate that our proposed method achieves state-of-the-art performance. Notably, our approach significantly enhances model plasticity and mitigates CF by extending learnable components and employing stochastic restoration during training iterations.

new No-regret incentive-compatible online learning under exact truthfulness with non-myopic experts

Authors: Junpei Komiyama, Nishant A. Mehta, Ali Mortazavi

Abstract: We study an online forecasting setting in which, over $T$ rounds, $N$ strategic experts each report a forecast to a mechanism, the mechanism selects one forecast, and then the outcome is revealed. In any given round, each expert has a belief about the outcome, but the expert wishes to select its report so as to maximize the total number of times it is selected. The goal of the mechanism is to obtain low belief regret: the difference between its cumulative loss (based on its selected forecasts) and the cumulative loss of the best expert in hindsight (as measured by the experts' beliefs). We consider exactly truthful mechanisms for non-myopic experts, meaning that truthfully reporting its belief strictly maximizes the expert's subjective probability of being selected in any future round. Even in the full-information setting, it is an open problem to obtain the first no-regret exactly truthful mechanism in this setting. We develop the first no-regret mechanism for this setting via an online extension of the Independent-Event Lotteries Forecasting Competition Mechanism (I-ELF). By viewing this online I-ELF as a novel instance of Follow the Perturbed Leader (FPL) with noise based on random walks with loss-dependent perturbations, we obtain $\tilde{O}(\sqrt{T N})$ regret. Our results are fueled by new tail bounds for Poisson binomial random variables that we develop. We extend our results to the bandit setting, where we give an exactly truthful mechanism obtaining $\tilde{O}(T^{2/3} N^{1/3})$ regret; this is the first no-regret result even among approximately truthful mechanisms.

new Dictionary-Learning-Based Data Pruning for System Identification

Authors: Tingna Wang (Department of Bridge Engineering, Tongji University, Shanghai, China, Shanghai Qi Zhi Institute, Shanghai, China), Sikai Zhang (Baosight Software), Limin Sun (Department of Bridge Engineering, Tongji University, Shanghai, China, State Key Laboratory of Disaster Reduction in Civil Engineering, Tongji University, Shanghai, China)

Abstract: System identification is normally involved in augmenting time series data by time shifting and nonlinearisation (via polynomial basis), which introduce redundancy both feature-wise and sample-wise. Many research works focus on reducing redundancy feature-wise, while less attention is paid to sample-wise redundancy. This paper proposes a novel data pruning method, called (mini-batch) FastCan, to reduce sample-wise redundancy based on dictionary learning. Time series data is represented by some representative samples, called atoms, via dictionary learning. The useful samples are selected based on their correlation with the atoms. The method is tested on one simulated dataset and two benchmark datasets. The R-squared between the coefficients of models trained on the full and the coefficients of models trained on pruned datasets is adopted to evaluate the performance of data pruning methods. It is found that the proposed method significantly outperforms the random pruning method.

new GPU-accelerated Multi-relational Parallel Graph Retrieval for Web-scale Recommendations

Authors: Zhuoning Guo, Guangxing Chen, Qian Gao, Xiaochao Liao, Jianjia Zheng, Lu Shen, Hao Liu

Abstract: Web recommendations provide personalized items from massive catalogs for users, which rely heavily on retrieval stages to trade off the effectiveness and efficiency of selecting a small relevant set from billion-scale candidates in online digital platforms. As one of the largest Chinese search engine and news feed providers, Baidu resorts to Deep Neural Network (DNN) and graph-based Approximate Nearest Neighbor Search (ANNS) algorithms for accurate relevance estimation and efficient search for relevant items. However, current retrieval at Baidu fails in comprehensive user-item relational understanding due to dissected interaction modeling, and performs inefficiently in large-scale graph-based ANNS because of suboptimal traversal navigation and the GPU computational bottleneck under high concurrency. To this end, we propose a GPU-accelerated Multi-relational Parallel Graph Retrieval (GMP-GR) framework to achieve effective yet efficient retrieval in web-scale recommendations. First, we propose a multi-relational user-item relevance metric learning method that unifies diverse user behaviors through multi-objective optimization and employs a self-covariant loss to enhance pathfinding performance. Second, we develop a hierarchical parallel graph-based ANNS to boost graph retrieval throughput, which conducts breadth-depth-balanced searches on a large-scale item graph and cost-effectively handles irregular neural computation via adaptive aggregation on GPUs. In addition, we integrate system optimization strategies in the deployment of GMP-GR in Baidu. Extensive experiments demonstrate the superiority of GMP-GR in retrieval accuracy and efficiency. Deployed across more than twenty applications at Baidu, GMP-GR serves hundreds of millions of users with a throughput exceeding one hundred million requests per second.

new Accelerated Gradient-based Design Optimization Via Differentiable Physics-Informed Neural Operator: A Composites Autoclave Processing Case Study

Authors: Janak M. Patel, Milad Ramezankhani, Anirudh Deodhar, Dagnachew Birru

Abstract: Simulation and optimization are crucial for advancing the engineering design of complex systems and processes. Traditional optimization methods require substantial computational time and effort due to their reliance on resource-intensive simulations, such as finite element analysis, and the complexity of rigorous optimization algorithms. Data-agnostic AI-based surrogate models, such as Physics-Informed Neural Operators (PINOs), offer a promising alternative to these conventional simulations, providing drastically reduced inference time, unparalleled data efficiency, and zero-shot super-resolution capability. However, the predictive accuracy of these models is often constrained to small, low-dimensional design spaces or systems with relatively simple dynamics. To address this, we introduce a novel Physics-Informed DeepONet (PIDON) architecture, which extends the capabilities of conventional neural operators to effectively model the nonlinear behavior of complex engineering systems across high-dimensional design spaces and a wide range of dynamic design configurations. This new architecture outperforms existing SOTA models, enabling better predictions across broader design spaces. Leveraging PIDON's differentiability, we integrate a gradient-based optimization approach using the Adam optimizer to efficiently determine optimal design variables. This forms an end-to-end gradient-based optimization framework that accelerates the design process while enhancing scalability and efficiency. We demonstrate the effectiveness of this framework in the optimization of aerospace-grade composites curing processes achieving a 3x speedup in obtaining optimal design variables compared to gradient-free methods. Beyond composites processing, the proposed model has the potential to be used as a scalable and efficient optimization tool for broader applications in advanced engineering and digital twin systems.

new A GNN-based Spectral Filtering Mechanism for Imbalance Classification in Network Digital Twin

Authors: Abubakar Isah, Ibrahim Aliyu, Sulaiman Muhammad Rashid, Jaehyung Park, Minsoo Hahn, Jinsul Kim

Abstract: Graph Neural Networks are gaining attention in Fifth-Generation (5G) core network digital twins, which are data-driven complex systems with numerous components. Analyzing these data can be challenging due to rare failure types, leading to imbalanced classification in multiclass settings. Digital twins of 5G networks increasingly employ graph classification as the main method for identifying failure types. However, the skewed distribution of failure occurrences is a major class imbalance issue that prevents effective graph data mining. Previous studies have not sufficiently tackled this complex problem. In this paper, we propose Class-Fourier Graph Neural Network (CF-GNN) introduces a class-oriented spectral filtering mechanism that ensures precise classification by estimating a unique spectral filter for each class. We employ eigenvalue and eigenvector spectral filtering to capture and adapt to variations in the minority classes, ensuring accurate class-specific feature discrimination, and adept at graph representation learning for complex local structures among neighbors in an end-to-end setting. Extensive experiments have demonstrated that the proposed CF-GNN could help with both the creation of new techniques for enhancing classifiers and the investigation of the characteristics of the multi-class imbalanced data in a network digital twin system.

new Learning Surrogate Potential Mean Field Games via Gaussian Processes: A Data-Driven Approach to Ill-Posed Inverse Problems

Authors: Jingguo Zhang, Xianjin Yang, Chenchen Mou, Chao Zhou

Abstract: Mean field games (MFGs) describe the collective behavior of large populations of interacting agents. In this work, we tackle ill-posed inverse problems in potential MFGs, aiming to recover the agents' population, momentum, and environmental setup from limited, noisy measurements and partial observations. These problems are ill-posed because multiple MFG configurations can explain the same data, or different parameters can yield nearly identical observations. Nonetheless, they remain crucial in practice for real-world scenarios where data are inherently sparse or noisy, or where the MFG structure is not fully determined. Our focus is on finding surrogate MFGs that accurately reproduce the observed data despite these challenges. We propose two Gaussian process (GP)-based frameworks: an inf-sup formulation and a bilevel approach. The choice between them depends on whether the unknown parameters introduce concavity in the objective. In the inf-sup framework, we use the linearity of GPs and their parameterization structure to maintain convex-concave properties, allowing us to apply standard convex optimization algorithms. In the bilevel framework, we employ a gradient-descent-based algorithm and introduce two methods for computing the outer gradient. The first method leverages an existing solver for the inner potential MFG and applies automatic differentiation, while the second adopts an adjoint-based strategy that computes the outer gradient independently of the inner solver. Our numerical experiments show that when sufficient prior information is available, the unknown parameters can be accurately recovered. Otherwise, if prior information is limited, the inverse problem is ill-posed, but our frameworks can still produce surrogate MFG models that closely match observed data.

new DifCluE: Generating Counterfactual Explanations with Diffusion Autoencoders and modal clustering

Authors: Suparshva Jain, Amit Sangroya, Lovekesh Vig

Abstract: Generating multiple counterfactual explanations for different modes within a class presents a significant challenge, as these modes are distinct yet converge under the same classification. Diffusion probabilistic models (DPMs) have demonstrated a strong ability to capture the underlying modes of data distributions. In this paper, we harness the power of a Diffusion Autoencoder to generate multiple distinct counterfactual explanations. By clustering in the latent space, we uncover the directions corresponding to the different modes within a class, enabling the generation of diverse and meaningful counterfactuals. We introduce a novel methodology, DifCluE, which consistently identifies these modes and produces more reliable counterfactual explanations. Our experimental results demonstrate that DifCluE outperforms the current state-of-the-art in generating multiple counterfactual explanations, offering a significant advance- ment in model interpretability.

new MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models

Authors: Zhen Zhang, Yifan Yang, Kai Zhen, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang

Abstract: Large language models have demonstrated exceptional capabilities across diverse tasks, but their fine-tuning demands significant memory, posing challenges for resource-constrained environments. Zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating the need for backpropagation. However, ZO optimization suffers from high gradient variance, and prior research has largely focused on single-task learning, leaving its application to multi-task learning unexplored. Multi-task learning is crucial for leveraging shared knowledge across tasks to improve generalization, yet it introduces unique challenges under ZO settings, such as amplified gradient variance and collinearity. In this paper, we present MaZO, the first framework specifically designed for multi-task LLM fine-tuning under ZO optimization. MaZO tackles these challenges at the parameter level through two key innovations: a weight importance metric to identify critical parameters and a multi-task weight update mask to selectively update these parameters, reducing the dimensionality of the parameter space and mitigating task conflicts. Experiments demonstrate that MaZO achieves state-of-the-art performance, surpassing even multi-task learning methods designed for first-order optimization.

new $\text{M}^{\text{3}}$: A Modular World Model over Streams of Tokens

Authors: Lior Cohen, Kaixin Wang, Bingyi Kang, Uri Gadot, Shie Mannor

Abstract: Token-based world models emerged as a promising modular framework, modeling dynamics over token streams while optimizing tokenization separately. While successful in visual environments with discrete actions (e.g., Atari games), their broader applicability remains uncertain. In this paper, we introduce $\text{M}^{\text{3}}$, a $\textbf{m}$odular $\textbf{w}$orld $\textbf{m}$odel that extends this framework, enabling flexible combinations of observation and action modalities through independent modality-specific components. $\text{M}^{\text{3}}$ integrates several improvements from existing literature to enhance agent performance. Through extensive empirical evaluation across diverse benchmarks, $\text{M}^{\text{3}}$ achieves state-of-the-art sample efficiency for planning-free world models. Notably, among these methods, it is the first to reach a human-level median score on Atari 100K, with superhuman performance on 13 games. We $\href{https://github.com/leor-c/M3}{\text{open-source our code and weights}}$.

URLs: https://github.com/leor-c/M3

new Continuous Diffusion Model for Language Modeling

Authors: Jaehyeong Jo, Sung Ju Hwang

Abstract: Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that directly work on discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches, and the unclear link between them restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on the analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry and a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Codes available at \href{https://github.com/harryjo97/RDLM}{https://github.com/harryjo97/RDLM}.

URLs: https://github.com/harryjo97/RDLM, https://github.com/harryjo97/RDLM

new Towards a Trustworthy Anomaly Detection for Critical Applications through Approximated Partial AUC Loss

Authors: Arnaud Bougaham, Beno\^it Fr\'enay

Abstract: Anomaly Detection is a crucial step for critical applications such in the industrial, medical or cybersecurity domains. These sectors share the same requirement of handling differently the different types of classification errors. Indeed, even if false positives are acceptable, false negatives are not, because it would reflect a missed detection of a quality issue, a disease or a cyber threat. To fulfill this requirement, we propose a method that dynamically applies a trustworthy approximated partial AUC ROC loss (tapAUC). A binary classifier is trained to optimize the specific range of the AUC ROC curve that prevents the True Positive Rate (TPR) to reach 100% while minimizing the False Positive Rate (FPR). The optimal threshold that does not trigger any false negative is then kept and used at the test step. The results show a TPR of 92.52% at a 20.43% FPR for an average across 6 datasets, representing a TPR improvement of 4.3% for a FPR cost of 12.2% against other state-of-the-art methods. The code is available at https://github.com/ArnaudBougaham/tapAUC.

URLs: https://github.com/ArnaudBougaham/tapAUC.

new LLM Embeddings for Deep Learning on Tabular Data

Authors: Boshko Koloski, Andrei Margeloiu, Xiangjian Jiang, Bla\v{z} \v{S}krlj, Nikola Simidjievski, Mateja Jamnik

Abstract: Tabular deep-learning methods require embedding numerical and categorical input features into high-dimensional spaces before processing them. Existing methods deal with this heterogeneous nature of tabular data by employing separate type-specific encoding approaches. This limits the cross-table transfer potential and the exploitation of pre-trained knowledge. We propose a novel approach that first transforms tabular data into text, and then leverages pre-trained representations from LLMs to encode this data, resulting in a plug-and-play solution to improv ing deep-learning tabular methods. We demonstrate that our approach improves accuracy over competitive models, such as MLP, ResNet and FT-Transformer, by validating on seven classification datasets.

new An Actor-Critic Algorithm with Function Approximation for Risk Sensitive Cost Markov Decision Processes

Authors: Soumyajit Guin, Vivek S. Borkar, Shalabh Bhatnagar

Abstract: In this paper, we consider the risk-sensitive cost criterion with exponentiated costs for Markov decision processes and develop a model-free policy gradient algorithm in this setting. Unlike additive cost criteria such as average or discounted cost, the risk-sensitive cost criterion is less studied due to the complexity resulting from the multiplicative structure of the resulting Bellman equation. We develop an actor-critic algorithm with function approximation in this setting and provide its asymptotic convergence analysis. We also show the results of numerical experiments that demonstrate the superiority in performance of our algorithm over other recent algorithms in the literature.

new GraphThought: Graph Combinatorial Optimization with Thought Generation

Authors: Zixiao Huang, Lifeng Guo, Junjie Sheng, Haosheng Chen, Wenhao Li, Bo Jin, Changhong Lu, Xiangfeng Wang

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various domains, especially in text processing and generative tasks. Recent advancements in the reasoning capabilities of state-of-the-art LLMs, such as OpenAI-o1, have significantly broadened their applicability, particularly in complex problem-solving and logical inference. However, most existing LLMs struggle with notable limitations in handling graph combinatorial optimization (GCO) problems. To bridge this gap, we formally define the Optimal Thoughts Design (OTD) problem, including its state and action thought space. We then introduce a novel framework, GraphThought, designed to generate high-quality thought datasets for GCO problems. Leveraging these datasets, we fine-tune the Llama-3-8B-Instruct model to develop Llama-GT. Notably, despite its compact 8B-parameter architecture, Llama-GT matches the performance of state-of-the-art LLMs on the GraphArena benchmark. Experimental results show that our approach outperforms both proprietary and open-source models, even rivaling specialized models like o1-mini. This work sets a new state-of-the-art benchmark while challenging the prevailing notion that model scale is the primary driver of reasoning capability.

new Exploiting Task Relationships for Continual Learning Using Transferability-Aware Task Embeddings

Authors: Yanru Wu, Xiangyu Chen, Jianning Wang, Enming Zhang, Hanbing Liu, Yang Li

Abstract: Continual learning (CL) has been an essential topic in the contemporary application of deep neural networks, where catastrophic forgetting (CF) can impede a model's ability to acquire knowledge progressively. Existing CL strategies primarily address CF by regularizing model updates or separating task-specific and shared components. However, these methods focus on task model elements while overlooking the potential of leveraging inter-task relationships for learning enhancement. To address this, we propose a transferability-aware task embedding named H-embedding and train a hypernet under its guidance to learn task-conditioned model weights for CL tasks. Particularly, H-embedding is introduced based on an information theoretical transferability measure and is designed to be online and easy to compute. The framework is also characterized by notable practicality, which only requires storing a low-dimensional task embedding for each task, and can be efficiently trained in an end-to-end way. Extensive evaluations and experimental analyses on datasets including Permuted MNIST, Cifar10/100, and ImageNet-R demonstrate that our framework performs prominently compared to various baseline methods, displaying great potential in exploiting intrinsic task relationships.

new Maximum Entropy Reinforcement Learning with Diffusion Policy

Authors: Xiaoyi Dong, Jian Cheng, Xi Sheryl Zhang

Abstract: The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on Mujoco benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP.

URLs: https://github.com/diffusionyes/MaxEntDP.

new In-Context Parametric Inference: Point or Distribution Estimators?

Authors: Sarthak Mittal, Yoshua Bengio, Nikolay Malkin, Guillaume Lajoie

Abstract: Bayesian and frequentist inference are two fundamental paradigms in statistical estimation. Bayesian methods treat hypotheses as random variables, incorporating priors and updating beliefs via Bayes' theorem, whereas frequentist methods assume fixed but unknown hypotheses, relying on estimators like maximum likelihood. While extensive research has compared these approaches, the frequentist paradigm of obtaining point estimates has become predominant in deep learning, as Bayesian inference is challenging due to the computational complexity and the approximation gap of posterior estimation methods. However, a good understanding of trade-offs between the two approaches is lacking in the regime of amortized estimators, where in-context learners are trained to estimate either point values via maximum likelihood or maximum a posteriori estimation, or full posteriors using normalizing flows, score-based diffusion samplers, or diagonal Gaussian approximations, conditioned on observations. To help resolve this, we conduct a rigorous comparative analysis spanning diverse problem settings, from linear models to shallow neural networks, with a robust evaluation framework assessing both in-distribution and out-of-distribution generalization on tractable tasks. Our experiments indicate that amortized point estimators generally outperform posterior inference, though the latter remain competitive in some low-dimensional problems, and we further discuss why this might be the case.

new Neural Interpretable Reasoning

Authors: Pietro Barbiero, Giuseppe Marra, Gabriele Ciravegna, David Debot, Francesco De Santis, Michelangelo Diligenti, Mateo Espinosa Zarlenga, Francesco Giannini

Abstract: We formalize a novel modeling framework for achieving interpretability in deep learning, anchored in the principle of inference equivariance. While the direct verification of interpretability scales exponentially with the number of variables of the system, we show that this complexity can be mitigated by treating interpretability as a Markovian property and employing neural re-parametrization techniques. Building on these insights, we propose a new modeling paradigm -- neural generation and interpretable execution -- that enables scalable verification of equivariance. This paradigm provides a general approach for designing Neural Interpretable Reasoners that are not only expressive but also transparent.

new Hyperspherical Energy Transformer with Recurrent Depth

Authors: Yunzhe Hu, Difan Zou, Dong Xu

Abstract: Transformer-based foundation models have achieved unprecedented success with a gigantic amount of parameters and computational resources. Yet, the core building blocks of these models, the Transformer layers, and how they are arranged and configured are primarily engineered from the bottom up and driven by heuristics. For advancing next-generation architectures, it demands exploring a prototypical model that is amenable to high interpretability and of practical competence. To this end, we take a step from the top-down view and design neural networks from an energy minimization perspective. Specifically, to promote isotropic token distribution on the sphere, we formulate a modified Hopfield energy function on the subspace-embedded hypersphere, based on which Transformer layers with symmetric structures are designed as the iterative optimization for the energy function. By integrating layers with the same parameters, we propose \textit{Hyper-Spherical Energy Transformer} (Hyper-SET), an alternative to the vanilla Transformer with recurrent depth. This design inherently provides greater interpretability and allows for scaling to deeper layers without a significant increase in the number of parameters. We also empirically demonstrate that Hyper-SET achieves comparable or even superior performance on both synthetic and real-world tasks, such as solving Sudoku and masked image modeling, while utilizing fewer parameters.

new Exact Upper and Lower Bounds for the Output Distribution of Neural Networks with Random Inputs

Authors: Andrey Kofnov, Daniel Kapla, Ezio Bartocci, Efstathia Bura

Abstract: We derive exact upper and lower bounds for the cumulative distribution function (cdf) of the output of a neural network over its entire support subject to noisy (stochastic) inputs. The upper and lower bounds converge to the true cdf over its domain as the resolution increases. Our method applies to any feedforward NN using continuous monotonic piecewise differentiable activation functions (e.g., ReLU, tanh and softmax) and convolutional NNs, which were beyond the scope of competing approaches. The novelty and an instrumental tool of our approach is to bound general NNs with ReLU NNs. The ReLU NN based bounds are then used to derive upper and lower bounds of the cdf of the NN output. Experiments demonstrate that our method delivers guaranteed bounds of the predictive output distribution over its support, thus providing exact error guarantees, in contrast to competing approaches.

new Best of Both Worlds: Regret Minimization versus Minimax Play

Authors: Adrian M\"uller, Jon Schneider, Stratis Skoulakis, Luca Viano, Volkan Cevher

Abstract: In this paper, we investigate the existence of online learning algorithms with bandit feedback that simultaneously guarantee $O(1)$ regret compared to a given comparator strategy, and $O(\sqrt{T})$ regret compared to the best strategy in hindsight, where $T$ is the number of rounds. We provide the first affirmative answer to this question. In the context of symmetric zero-sum games, both in normal- and extensive form, we show that our results allow us to guarantee to risk at most $O(1)$ loss while being able to gain $\Omega(T)$ from exploitable opponents, thereby combining the benefits of both no-regret algorithms and minimax play.

new Spectral structure learning for clinical time series

Authors: Ivan Lerner, Anita Burgun, Francis Bach

Abstract: We develop and evaluate a structure learning algorithm for clinical time series. Clinical time series are multivariate time series observed in multiple patients and irregularly sampled, challenging existing structure learning algorithms. We assume that our times series are realizations of StructGP, a k-dimensional multi-output or multi-task stationary Gaussian process (GP), with independent patients sharing the same covariance function. StructGP encodes ordered conditional relations between time series, represented in a directed acyclic graph. We implement an adapted NOTEARS algorithm, which based on a differentiable definition of acyclicity, recovers the graph by solving a series of continuous optimization problems. Simulation results show that up to mean degree 3 and 20 tasks, we reach a median recall of 0.93% [IQR, 0.86, 0.97] while keeping a median precision of 0.71% [0.57-0.84], for recovering directed edges. We further show that the regularization path is key to identifying the graph. With StructGP, we proposed a model of time series dependencies, that flexibly adapt to different time series regularity, while enabling us to learn these dependencies from observations.

new Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy

Authors: Rustem Islamov, Samuel Horvath, Aurelien Lucchi, Peter Richtarik, Eduard Gorbunov

Abstract: Strong Differential Privacy (DP) and Optimization guarantees are two desirable properties for a method in Federated Learning (FL). However, existing algorithms do not achieve both properties at once: they either have optimal DP guarantees but rely on restrictive assumptions such as bounded gradients/bounded data heterogeneity, or they ensure strong optimization performance but lack DP guarantees. To address this gap in the literature, we propose and analyze a new method called Clip21-SGD2M based on a novel combination of clipping, heavy-ball momentum, and Error Feedback. In particular, for non-convex smooth distributed problems with clients having arbitrarily heterogeneous data, we prove that Clip21-SGD2M has optimal convergence rate and also near optimal (local-)DP neighborhood. Our numerical experiments on non-convex logistic regression and training of neural networks highlight the superiority of Clip21-SGD2M over baselines in terms of the optimization performance for a given DP-budget.

new Knowledge-aware contrastive heterogeneous molecular graph learning

Authors: Mukun Chen, Jia Wu, Shirui Pan, Fu Lin, Bo Du, Xiuwen Gong, Wenbin Hu

Abstract: Molecular representation learning is pivotal in predicting molecular properties and advancing drug design. Traditional methodologies, which predominantly rely on homogeneous graph encoding, are limited by their inability to integrate external knowledge and represent molecular structures across different levels of granularity. To address these limitations, we propose a paradigm shift by encoding molecular graphs into heterogeneous structures, introducing a novel framework: Knowledge-aware Contrastive Heterogeneous Molecular Graph Learning (KCHML). This approach leverages contrastive learning to enrich molecular representations with embedded external knowledge. KCHML conceptualizes molecules through three distinct graph views-molecular, elemental, and pharmacological-enhanced by heterogeneous molecular graphs and a dual message-passing mechanism. This design offers a comprehensive representation for property prediction, as well as for downstream tasks such as drug-drug interaction (DDI) prediction. Extensive benchmarking demonstrates KCHML's superiority over state-of-the-art molecular property prediction models, underscoring its ability to capture intricate molecular features.

new Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing

Authors: Site Qu, Guoqiang Hu

Abstract: The Location-Routing Problem (LRP), which combines the challenges of facility (depot) locating and vehicle route planning, is critically constrained by the reliance on predefined depot candidates, limiting the solution space and potentially leading to suboptimal outcomes. Previous research on LRP without predefined depots is scant and predominantly relies on heuristic algorithms that iteratively attempt depot placements across a planar area. Such approaches lack the ability to proactively generate depot locations that meet specific geographic requirements, revealing a notable gap in current research landscape. To bridge this gap, we propose a data-driven generative DRL framework, designed to proactively generate depots for LRP without predefined depot candidates, solely based on customer requests data which include geographic and demand information. It can operate in two distinct modes: direct generation of exact depot locations, and the creation of a multivariate Gaussian distribution for flexible depots sampling. By extracting depots' geographic pattern from customer requests data, our approach can dynamically respond to logistical needs, identifying high-quality depot locations that further reduce total routing costs compared to traditional methods. Extensive experiments demonstrate that, for a same group of customer requests, compared with those depots identified through random attempts, our framework can proactively generate depots that lead to superior solution routes with lower routing cost. The implications of our framework potentially extend into real-world applications, particularly in emergency medical rescue and disaster relief logistics, where rapid establishment and adjustment of depot locations are paramount, showcasing its potential in addressing LRP for dynamic and unpredictable environments.

new Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent

Authors: Junda Wu, Yuxin Xiong, Xintong Li, Yu Xia, Ruoyu Wang, Yu Wang, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Jingbo Shang, Julian McAuley

Abstract: Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets. Unlike pre-training, where MLLMs receive rich visual-text alignment, instruction-tuning is often text-driven with weaker visual supervision, leading to the degradation of pre-trained visual understanding and causing visual forgetting. Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue, often compressing visual representations and prioritizing task alignment over visual retention, which further worsens visual forgetting. To overcome this limitation, we introduce a novel perspective leveraging effective rank to quantify the degradation of visual representation richness, interpreting this degradation through the information bottleneck principle as excessive compression that leads to the degradation of crucial pre-trained visual knowledge. Building on this view, we propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations while mitigating the over-compression effects described by the information bottleneck. By explicitly disentangling the optimization of visual understanding from task-specific alignment, MDGD preserves pre-trained visual knowledge while enabling efficient task adaptation. To enable lightweight instruction-tuning, we further develop a memory-efficient fine-tuning approach using gradient masking, which selectively updates a subset of model parameters to enable parameter-efficient fine-tuning (PEFT), reducing computational overhead while preserving rich visual representations. Extensive experiments across various downstream tasks and backbone MLLMs demonstrate that MDGD effectively mitigates visual forgetting from pre-trained tasks while enabling strong adaptation to new tasks.

new Robust Partial-Label Learning by Leveraging Class Activation Values

Authors: Tobias Fuchs, Florian Kalinke

Abstract: Real-world training data is often noisy; for example, human annotators assign conflicting class labels to the same instances. Partial-label learning (PLL) is a weakly supervised learning paradigm that allows training classifiers in this context without manual data cleaning. While state-of-the-art methods have good predictive performance, their predictions are sensitive to high noise levels, out-of-distribution data, and adversarial perturbations. We propose a novel PLL method based on subjective logic, which explicitly represents uncertainty by leveraging the magnitudes of the underlying neural network's class activation values. Thereby, we effectively incorporate prior knowledge about the class labels by using a novel label weight re-distribution strategy that we prove to be optimal. We empirically show that our method yields more robust predictions in terms of predictive performance under high PLL noise levels, handling out-of-distribution examples, and handling adversarial perturbations on the test instances.

new On the Computation of the Fisher Information in Continual Learning

Authors: Gido M. van de Ven

Abstract: One of the most popular methods for continual learning with deep neural networks is Elastic Weight Consolidation (EWC), which involves computing the Fisher Information. The exact way in which the Fisher Information is computed is however rarely described, and multiple different implementations for it can be found online. This blog post discusses and empirically compares several often-used implementations, which highlights that many currently reported results for EWC could likely be improved by changing the way the Fisher Information is computed.

new From Selection to Generation: A Survey of LLM-based Active Learning

Authors: Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, Junda Wu, Xintong Li, Ryan Aponte, Hanjia Lyu, Joe Barrow, Hongjie Chen, Franck Dernoncourt, Branislav Kveton, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Sungchul Kim, Zhengmian Hu, Yue Zhao, Nedim Lipka, Seunghyun Yoon, Ting-Hao Kenneth Huang, Zichao Wang, Puneet Mathur, Soumyabrata Pal, Koyel Mukherjee, Zhehao Zhang, Namyong Park, Thien Huu Nguyen, Jiebo Luo, Ryan A. Rossi, Julian McAuley

Abstract: Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.

new Interpretable Machine Learning for Kronecker Coefficients

Authors: Giorgi Butbaia, Kyu-Hwan Lee, Fabian Ruehle

Abstract: We analyze the saliency of neural networks and employ interpretable machine learning models to predict whether the Kronecker coefficients of the symmetric group are zero or not. Our models use triples of partitions as input features, as well as b-loadings derived from the principal component of an embedding that captures the differences between partitions. Across all approaches, we achieve an accuracy of approximately 83% and derive explicit formulas for a decision function in terms of b-loadings. Additionally, we develop transformer-based models for prediction, achieving the highest reported accuracy of over 99%.

new IMTS-Mixer: Mixer-Networks for Irregular Multivariate Time Series Forecasting

Authors: Christian Kl\"otergens, Tim Dernedde, Lars Schmidt-Thieme

Abstract: Forecasting Irregular Multivariate Time Series (IMTS) has recently emerged as a distinct research field, necessitating specialized models to address its unique challenges. While most forecasting literature assumes regularly spaced observations without missing values, many real-world datasets - particularly in healthcare, climate research, and biomechanics - violate these assumptions. Time Series (TS)-mixer models have achieved remarkable success in regular multivariate time series forecasting. However, they remain unexplored for IMTS due to their requirement for complete and evenly spaced observations. To bridge this gap, we introduce IMTS-Mixer, a novel forecasting architecture designed specifically for IMTS. Our approach retains the core principles of TS mixer models while introducing innovative methods to transform IMTS into fixed-size matrix representations, enabling their seamless integration with mixer modules. We evaluate IMTS-Mixer on a benchmark of four real-world datasets from various domains. Our results demonstrate that IMTS-Mixer establishes a new state-of-the-art in forecasting accuracy while also improving computational efficiency.

new Intersectional Fairness in Reinforcement Learning with Large State and Constraint Spaces

Authors: Eric Eaton, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell

Abstract: In traditional reinforcement learning (RL), the learner aims to solve a single objective optimization problem: find the policy that maximizes expected reward. However, in many real-world settings, it is important to optimize over multiple objectives simultaneously. For example, when we are interested in fairness, states might have feature annotations corresponding to multiple (intersecting) demographic groups to whom reward accrues, and our goal might be to maximize the reward of the group receiving the minimal reward. In this work, we consider a multi-objective optimization problem in which each objective is defined by a state-based reweighting of a single scalar reward function. This generalizes the problem of maximizing the reward of the minimum reward group. We provide oracle-efficient algorithms to solve these multi-objective RL problems even when the number of objectives is exponentially large-for tabular MDPs, as well as for large MDPs when the group functions have additional structure. Finally, we experimentally validate our theoretical results and demonstrate applications on a preferential attachment graph MDP.

new Model Generalization on Text Attribute Graphs: Principles with Large Language Models

Authors: Haoyu Wang, Shikun Liu, Rongzhe Wei, Pan Li

Abstract: Large language models (LLMs) have recently been introduced to graph learning, aiming to extend their zero-shot generalization success to tasks where labeled graph data is scarce. Among these applications, inference over text-attributed graphs (TAGs) presents unique challenges: existing methods struggle with LLMs' limited context length for processing large node neighborhoods and the misalignment between node embeddings and the LLM token space. To address these issues, we establish two key principles for ensuring generalization and derive the framework LLM-BP accordingly: (1) Unifying the attribute space with task-adaptive embeddings, where we leverage LLM-based encoders and task-aware prompting to enhance generalization of the text attribute embeddings; (2) Developing a generalizable graph information aggregation mechanism, for which we adopt belief propagation with LLM-estimated parameters that adapt across graphs. Evaluations on 11 real-world TAG benchmarks demonstrate that LLM-BP significantly outperforms existing approaches, achieving 8.10% improvement with task-conditional embeddings and an additional 1.71% gain from adaptive aggregation.

new Steering the LoCoMotif: Using Domain Knowledge in Time Series Motif Discovery

Authors: Aras Yurtman, Daan Van Wesenbeeck, Wannes Meert, Hendrik Blockeel

Abstract: Time Series Motif Discovery (TSMD) identifies repeating patterns in time series data, but its unsupervised nature might result in motifs that are not interesting to the user. To address this, we propose a framework that allows the user to impose constraints on the motifs to be discovered, where constraints can easily be defined according to the properties of the desired motifs in the application domain. We also propose an efficient implementation of the framework, the LoCoMotif-DoK algorithm. We demonstrate that LoCoMotif-DoK can effectively leverage domain knowledge in real and synthetic data, outperforming other TSMD techniques which only support a limited form of domain knowledge.

new StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models

Authors: Shehel Yoosuf, Temoor Ali, Ahmed Lekssays, Mashael AlSabah, Issa Khalil

Abstract: In this work, we present a series of structure transformation attacks on LLM alignment, where we encode natural language intent using diverse syntax spaces, ranging from simple structure formats and basic query languages (e.g. SQL) to new novel spaces and syntaxes created entirely by LLMs. Our extensive evaluation shows that our simplest attacks can achieve close to 90% success rate, even on strict LLMs (such as Claude 3.5 Sonnet) using SOTA alignment mechanisms. We improve the attack performance further by using an adaptive scheme that combines structure transformations along with existing \textit{content transformations}, resulting in over 96% ASR with 0% refusals. To generalize our attacks, we explore numerous structure formats, including syntaxes purely generated by LLMs. Our results indicate that such novel syntaxes are easy to generate and result in a high ASR, suggesting that defending against our attacks is not a straightforward process. Finally, we develop a benchmark and evaluate existing safety-alignment defenses against it, showing that most of them fail with 100% ASR. Our results show that existing safety alignment mostly relies on token-level patterns without recognizing harmful concepts, highlighting and motivating the need for serious research efforts in this direction. As a case study, we demonstrate how attackers can use our attack to easily generate a sample malware, and a corpus of fraudulent SMS messages, which perform well in bypassing detection.

new FedEAT: A Robustness Optimization Framework for Federated LLMs

Authors: Yahao Pang, Xingyuan Wu, Xiaojin Zhang, Wei Chen, Hai Jin

Abstract: Significant advancements have been made by Large Language Models (LLMs) in the domains of natural language understanding and automated content creation. However, they still face persistent problems, including substantial computational costs and inadequate availability of training data. The combination of Federated Learning (FL) and LLMs (federated LLMs) offers a solution by leveraging distributed data while protecting privacy, which positions it as an ideal choice for sensitive domains. However, Federated LLMs still suffer from robustness challenges, including data heterogeneity, malicious clients, and adversarial attacks, which greatly hinder their applications. We first introduce the robustness problems in federated LLMs, to address these challenges, we propose FedEAT (Federated Embedding space Adversarial Training), a novel framework that applies adversarial training in the embedding space of client LLM and employs a robust aggregation approach, specifically geometric median aggregation, to enhance the robustness of Federated LLMs. Our experiments demonstrate that FedEAT effectively improves the robustness of Federated LLMs with minimal performance loss.

new Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

Authors: Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei

Abstract: The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper , offering a sophisticated solution for the efficient and practical deployment of edge LLMs.

URLs: https://github.com/microsoft/BitNet/tree/paper

new LIMR: Less is More for RL Scaling

Authors: Xuefeng Li, Haoyang Zou, Pengfei Liu

Abstract: In this paper, we ask: what truly determines the effectiveness of RL training data for enhancing language models' reasoning capabilities? While recent advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack of transparency about training data requirements has hindered systematic progress. Starting directly from base models without distillation, we challenge the assumption that scaling up RL training data inherently improves performance. we demonstrate that a strategically selected subset of just 1,389 samples can outperform the full 8,523-sample dataset. We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples based on their alignment with model learning trajectories, enabling efficient resource utilization and scalable implementation. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523 samples dataset. Notably, while recent data-efficient approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find it significantly underperforms at 7B-scale through supervised fine-tuning (SFT). In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results fundamentally reshape our understanding of RL scaling in LLMs, demonstrating that precise sample selection, rather than data scale, may be the key to unlocking enhanced reasoning capabilities. For reproducible research and future innovation, we are open-sourcing LIMR, including implementation of LIM, training and evaluation code, curated datasets, and trained models at https://github.com/GAIR-NLP/LIMR.

URLs: https://github.com/GAIR-NLP/LIMR.

new Rethinking Benign Overfitting in Two-Layer Neural Networks

Authors: Ruichen Xu, Kexin Chen

Abstract: Recent theoretical studies (Kou et al., 2023; Cao et al., 2022) have revealed a sharp phase transition from benign to harmful overfitting when the noise-to-feature ratio exceeds a threshold-a situation common in long-tailed data distributions where atypical data is prevalent. However, harmful overfitting rarely happens in overparameterized neural networks. Further experimental results suggested that memorization is necessary for achieving near-optimal generalization error in long-tailed data distributions (Feldman & Zhang, 2020). We argue that this discrepancy between theoretical predictions and empirical observations arises because previous feature-noise data models overlook the heterogeneous nature of noise across different data classes. In this paper, we refine the feature-noise data model by incorporating class-dependent heterogeneous noise and re-examine the overfitting phenomenon in neural networks. Through a comprehensive analysis of the training dynamics, we establish test loss bounds for the refined model. Our findings reveal that neural networks can leverage "data noise", previously deemed harmful, to learn implicit features that improve the classification accuracy for long-tailed data. Experimental validation on both synthetic and real-world datasets supports our theoretical results.

new Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?

Authors: Jacob Nielsen, Peter Schneider-Kamp, Lukas Galke

Abstract: Large language models (LLMs) require immense resources for training and inference. Quantization, a technique that reduces the precision of model parameters, offers a promising solution for improving LLM efficiency and sustainability. While post-training quantization methods typically achieve 4-8 bits per parameter, recent research suggests that training LLMs with 1.58 bits per weight parameter from scratch can maintain model accuracy while greatly reducing memory requirements and energy consumption at inference time. Here, we investigate a training strategy for quantization-aware pre-training, where the models are first trained with 16-bit precision and then transition into 1.58-bit quantization-aware training. Our results on 11 downstream tasks show that this 16-to-1.58-bit training strategy is preferable over full 1.58-bit training and leaves models closer to those which have undergone 16-bit training. We further investigate the effects of retaining the optimizer state at the transition point and gradually phasing in quantization strength -- finding that both techniques alleviate the magnitude of loss spikes, but also that these effects can be compensated through further training.

new CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning

Authors: Yanxiao Zhao, Yangge Qian, Jingyang Shan, Xiaolin Qin

Abstract: Reinforcement learning (RL) in continuous action spaces encounters persistent challenges, such as inefficient exploration and convergence to suboptimal solutions. To address these limitations, we propose CAMEL, a novel framework integrating LLM-generated suboptimal policies into the RL training pipeline. CAMEL leverages dynamic action masking and an adaptive epsilon-masking mechanism to guide exploration during early training stages while gradually enabling agents to optimize policies independently. At the core of CAMEL lies the integration of Python-executable suboptimal policies generated by LLMs based on environment descriptions and task objectives. Although simplistic and hard-coded, these policies offer valuable initial guidance for RL agents. To effectively utilize these priors, CAMEL employs masking-aware optimization to dynamically constrain the action space based on LLM outputs. Additionally, epsilon-masking gradually reduces reliance on LLM-generated guidance, enabling agents to transition from constrained exploration to autonomous policy refinement. Experimental validation on Gymnasium MuJoCo environments demonstrates the effectiveness of CAMEL. In Hopper-v4 and Ant-v4, LLM-generated policies significantly improve sample efficiency, achieving performance comparable to or surpassing expert masking baselines. For Walker2d-v4, where LLMs struggle to accurately model bipedal gait dynamics, CAMEL maintains robust RL performance without notable degradation, highlighting the framework's adaptability across diverse tasks. While CAMEL shows promise in enhancing sample efficiency and mitigating convergence challenges, these issues remain open for further research. Future work aims to generalize CAMEL to multimodal LLMs for broader observation-action spaces and automate policy evaluation, reducing human intervention and enhancing scalability in RL training pipelines.

new Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives

Authors: Leo Schwinn, Yan Scholten, Tom Wollschl\"ager, Sophie Xhonneux, Stephen Casper, Stephan G\"unnemann, Gauthier Gidel

Abstract: Misaligned research objectives have considerably hindered progress in adversarial robustness research over the past decade. For instance, an extensive focus on optimizing target metrics, while neglecting rigorous standardized evaluation, has led researchers to pursue ad-hoc heuristic defenses that were seemingly effective. Yet, most of these were exposed as flawed by subsequent evaluations, ultimately contributing little measurable progress to the field. In this position paper, we illustrate that current research on the robustness of large language models (LLMs) risks repeating past patterns with potentially worsened real-world implications. To address this, we argue that realigned objectives are necessary for meaningful progress in adversarial alignment. To this end, we build on established cybersecurity taxonomy to formally define differences between past and emerging threat models that apply to LLMs. Using this framework, we illustrate that progress requires disentangling adversarial alignment into addressable sub-problems and returning to core academic principles, such as measureability, reproducibility, and comparability. Although the field presents significant challenges, the fresh start on adversarial robustness offers the unique opportunity to build on past experience while avoiding previous mistakes.

new VLP: Vision-Language Preference Learning for Embodied Manipulation

Authors: Runze Liu, Chenjia Bai, Jiafei Lyu, Shengjie Sun, Yali Du, Xiu Li

Abstract: Reward engineering is one of the key challenges in Reinforcement Learning (RL). Preference-based RL effectively addresses this issue by learning from human feedback. However, it is both time-consuming and expensive to collect human preference labels. In this paper, we propose a novel \textbf{V}ision-\textbf{L}anguage \textbf{P}reference learning framework, named \textbf{VLP}, which learns a vision-language preference model to provide preference feedback for embodied manipulation tasks. To achieve this, we define three types of language-conditioned preferences and construct a vision-language preference dataset, which contains versatile implicit preference orders without human annotations. The preference model learns to extract language-related features, and then serves as a preference annotator in various downstream tasks. The policy can be learned according to the annotated preferences via reward learning or direct policy optimization. Extensive empirical results on simulated embodied manipulation tasks demonstrate that our method provides accurate preferences and generalizes to unseen tasks and unseen language instructions, outperforming the baselines by a large margin.

new Continual Learning Should Move Beyond Incremental Classification

Authors: Rupert Mitchell, Antonio Alliegro, Raffaello Camoriano, Dustin Carri\'on-Ojeda, Antonio Carta, Georgia Chalvatzaki, Nikhil Churamani, Carlo D'Eramo, Samin Hamidi, Robin Hesse, Fabian Hinder, Roshni Ramanna Kamath, Vincenzo Lomonaco, Subarnaduti Paul, Francesca Pistilli, Tinne Tuytelaars, Gido M van de Ven, Kristian Kersting, Simone Schaub-Meyer, Martin Mundt

Abstract: Continual learning (CL) is the sub-field of machine learning concerned with accumulating knowledge in dynamic environments. So far, CL research has mainly focused on incremental classification tasks, where models learn to classify new categories while retaining knowledge of previously learned ones. Here, we argue that maintaining such a focus limits both theoretical development and practical applicability of CL methods. Through a detailed analysis of concrete examples - including multi-target classification, robotics with constrained output spaces, learning in continuous task domains, and higher-level concept memorization - we demonstrate how current CL approaches often fail when applied beyond standard classification. We identify three fundamental challenges: (C1) the nature of continuity in learning problems, (C2) the choice of appropriate spaces and metrics for measuring similarity, and (C3) the role of learning objectives beyond classification. For each challenge, we provide specific recommendations to help move the field forward, including formalizing temporal dynamics through distribution processes, developing principled approaches for continuous task spaces, and incorporating density estimation and generative objectives. In so doing, this position paper aims to broaden the scope of CL research while strengthening its theoretical foundations, making it more applicable to real-world problems.

new FitLight: Federated Imitation Learning for Plug-and-Play Autonomous Traffic Signal Control

Authors: Yutong Ye, Yingbo Zhou, Zhusen Liu, Xiao Du, Hao Zhou, Xiang Lian, Mingsong Chen

Abstract: Although Reinforcement Learning (RL)-based Traffic Signal Control (TSC) methods have been extensively studied, their practical applications still raise some serious issues such as high learning cost and poor generalizability. This is because the ``trial-and-error'' training style makes RL agents extremely dependent on the specific traffic environment, which also requires a long convergence time. To address these issues, we propose a novel Federated Imitation Learning (FIL)-based framework for multi-intersection TSC, named FitLight, which allows RL agents to plug-and-play for any traffic environment without additional pre-training cost. Unlike existing imitation learning approaches that rely on pre-training RL agents with demonstrations, FitLight allows real-time imitation learning and seamless transition to reinforcement learning. Due to our proposed knowledge-sharing mechanism and novel hybrid pressure-based agent design, RL agents can quickly find a best control policy with only a few episodes. Moreover, for resource-constrained TSC scenarios, FitLight supports model pruning and heterogeneous model aggregation, such that RL agents can work on a micro-controller with merely 16{\it KB} RAM and 32{\it KB} ROM. Extensive experiments demonstrate that, compared to state-of-the-art methods, FitLight not only provides a superior starting point but also converges to a better final solution on both real-world and synthetic datasets, even under extreme resource limitations.

new Deep Spatio-Temporal Neural Network for Air Quality Reanalysis

Authors: Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy

Abstract: Air quality prediction is key to mitigating health impacts and guiding decisions, yet existing models tend to focus on temporal trends while overlooking spatial generalization. We propose AQ-Net, a spatiotemporal reanalysis model for both observed and unobserved stations in the near future. AQ-Net utilizes the LSTM and multi-head attention for the temporal regression. We also propose a cyclic encoding technique to ensure continuous time representation. To learn fine-grained spatial air quality estimation, we incorporate AQ-Net with the neural kNN to explore feature-based interpolation, such that we can fill the spatial gaps given coarse observation stations. To demonstrate the efficiency of our model for spatiotemporal reanalysis, we use data from 2013-2017 collected in northern China for PM2.5 analysis. Extensive experiments show that AQ-Net excels in air quality reanalysis, highlighting the potential of hybrid spatio-temporal models to better capture environmental dynamics, especially in urban areas where both spatial and temporal variability are critical.

new Sharp-PINNs: staggered hard-constrained physics-informed neural networks for phase field modelling of corrosion

Authors: Nanxi Chen, Chuanjie Cui, Rujin Ma, Airong Chen, Sifan Wang

Abstract: Physics-informed neural networks have shown significant potential in solving partial differential equations (PDEs) across diverse scientific fields. However, their performance often deteriorates when addressing PDEs with intricate and strongly coupled solutions. In this work, we present a novel Sharp-PINN framework to tackle complex phase field corrosion problems. Instead of minimizing all governing PDE residuals simultaneously, the Sharp-PINNs introduce a staggered training scheme that alternately minimizes the residuals of Allen-Cahn and Cahn-Hilliard equations, which govern the corrosion system. To further enhance its efficiency and accuracy, we design an advanced neural network architecture that integrates random Fourier features as coordinate embeddings, employs a modified multi-layer perceptron as the primary backbone, and enforces hard constraints in the output layer. This framework is benchmarked through simulations of corrosion problems with multiple pits, where the staggered training scheme and network architecture significantly improve both the efficiency and accuracy of PINNs. Moreover, in three-dimensional cases, our approach is 5-10 times faster than traditional finite element methods while maintaining competitive accuracy, demonstrating its potential for real-world engineering applications in corrosion prediction.

new Massively Scaling Explicit Policy-conditioned Value Functions

Authors: Nico Bohlinger, Jan Peters

Abstract: We introduce a scaling strategy for Explicit Policy-Conditioned Value Functions (EPVFs) that significantly improves performance on challenging continuous-control tasks. EPVFs learn a value function V({\theta}) that is explicitly conditioned on the policy parameters, enabling direct gradient-based updates to the parameters of any policy. However, EPVFs at scale struggle with unrestricted parameter growth and efficient exploration in the policy parameter space. To address these issues, we utilize massive parallelization with GPU-based simulators, big batch sizes, weight clipping and scaled peturbations. Our results show that EPVFs can be scaled to solve complex tasks, such as a custom Ant environment, and can compete with state-of-the-art Deep Reinforcement Learning (DRL) baselines like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). We further explore action-based policy parameter representations from previous work and specialized neural network architectures to efficiently handle weight-space features, which have not been used in the context of DRL before.

new Theoretical Barriers in Bellman-Based Reinforcement Learning

Authors: Brieuc Pinon, Rapha\"el Jungers, Jean-Charles Delvenne

Abstract: Reinforcement Learning algorithms designed for high-dimensional spaces often enforce the Bellman equation on a sampled subset of states, relying on generalization to propagate knowledge across the state space. In this paper, we identify and formalize a fundamental limitation of this common approach. Specifically, we construct counterexample problems with a simple structure that this approach fails to exploit. Our findings reveal that such algorithms can neglect critical information about the problems, leading to inefficiencies. Furthermore, we extend this negative result to another approach from the literature: Hindsight Experience Replay learning state-to-state reachability.

new Machine Learning Should Maximize Welfare, Not (Only) Accuracy

Authors: Nir Rosenfeld, Haifeng Xu

Abstract: Decades of research in machine learning have given us powerful tools for making accurate predictions. But when used in social settings and on human inputs, better accuracy does not immediately translate to better social outcomes. This may not be surprising given that conventional learning frameworks are not designed to express societal preferences -- let alone promote them. This position paper argues that machine learning is currently missing, and can gain much from incorporating, a proper notion of social welfare. The field of welfare economics asks: how should we allocate limited resources to self-interested agents in a way that maximizes social benefit? We argue that this perspective applies to many modern applications of machine learning in social contexts, and advocate for its adoption. Rather than disposing of prediction, we aim to leverage this forte of machine learning for promoting social welfare. We demonstrate this idea by proposing a conceptual framework that gradually transitions from accuracy maximization (with awareness to welfare) to welfare maximization (via accurate prediction). We detail applications and use-cases for which our framework can be effective, identify technical challenges and practical opportunities, and highlight future avenues worth pursuing.

new Selective Task Group Updates for Multi-Task Optimization

Authors: Wooseong Jeong, Kuk-Jin Yoon

Abstract: Multi-task learning enables the acquisition of task-generic knowledge by training multiple tasks within a unified architecture. However, training all tasks together in a single architecture can lead to performance degradation, known as negative transfer, which is a main concern in multi-task learning. Previous works have addressed this issue by optimizing the multi-task network through gradient manipulation or weighted loss adjustments. However, their optimization strategy focuses on addressing task imbalance in shared parameters, neglecting the learning of task-specific parameters. As a result, they show limitations in mitigating negative transfer, since the learning of shared space and task-specific information influences each other during optimization. To address this, we propose a different approach to enhance multi-task performance by selectively grouping tasks and updating them for each batch during optimization. We introduce an algorithm that adaptively determines how to effectively group tasks and update them during the learning process. To track inter-task relations and optimize multi-task networks simultaneously, we propose proximal inter-task affinity, which can be measured during the optimization process. We provide a theoretical analysis on how dividing tasks into multiple groups and updating them sequentially significantly affects multi-task performance by enhancing the learning of task-specific parameters. Our methods substantially outperform previous multi-task optimization approaches and are scalable to different architectures and various numbers of tasks.

new Unsupervised Structural-Counterfactual Generation under Domain Shift

Authors: Krishn Vishwas Kher, Lokesh Venkata Siva Maruthi Badisa, Kusampudi Venkata Datta Sri Harsha, Chitneedi Geetha Sowmya, SakethaNath Jagarlapudi

Abstract: Motivated by the burgeoning interest in cross-domain learning, we present a novel generative modeling challenge: generating counterfactual samples in a target domain based on factual observations from a source domain. Our approach operates within an unsupervised paradigm devoid of parallel or joint datasets, relying exclusively on distinct observational samples and causal graphs for each domain. This setting presents challenges that surpass those of conventional counterfactual generation. Central to our methodology is the disambiguation of exogenous causes into effect-intrinsic and domain-intrinsic categories. This differentiation facilitates the integration of domain-specific causal graphs into a unified joint causal graph via shared effect-intrinsic exogenous variables. We propose leveraging Neural Causal models within this joint framework to enable accurate counterfactual generation under standard identifiability assumptions. Furthermore, we introduce a novel loss function that effectively segregates effect-intrinsic from domain-intrinsic variables during model training. Given a factual observation, our framework combines the posterior distribution of effect-intrinsic variables from the source domain with the prior distribution of domain-intrinsic variables from the target domain to synthesize the desired counterfactuals, adhering to Pearl's causal hierarchy. Intriguingly, when domain shifts are restricted to alterations in causal mechanisms without accompanying covariate shifts, our training regimen parallels the resolution of a conditional optimal transport problem. Empirical evaluations on a synthetic dataset show that our framework generates counterfactuals in the target domain that very closely resemble the ground truth.

new The geometry of BERT

Authors: Matteo Bonino, Giorgia Ghione, Giansalvo Cirrincione

Abstract: Transformer neural networks, particularly Bidirectional Encoder Representations from Transformers (BERT), have shown remarkable performance across various tasks such as classification, text summarization, and question answering. However, their internal mechanisms remain mathematically obscure, highlighting the need for greater explainability and interpretability. In this direction, this paper investigates the internal mechanisms of BERT proposing a novel perspective on the attention mechanism of BERT from a theoretical perspective. The analysis encompasses both local and global network behavior. At the local level, the concept of directionality of subspace selection as well as a comprehensive study of the patterns emerging from the self-attention matrix are presented. Additionally, this work explores the semantic content of the information stream through data distribution analysis and global statistical measures including the novel concept of cone index. A case study on the classification of SARS-CoV-2 variants using RNA which resulted in a very high accuracy has been selected in order to observe these concepts in an application. The insights gained from this analysis contribute to a deeper understanding of BERT's classification process, offering potential avenues for future architectural improvements in Transformer models and further analysis in the training process.

new Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning

Authors: Jiayang Zhang, Xianyuan Liu, Wei Wu, Sina Tabakhi, Wenrui Fan, Shuo Zhou, Kang Lan Tee, Tuck Seng Wong, Haiping Lu

Abstract: Virus-like particles (VLPs) are valuable for vaccine development due to their immune-triggering properties. Understanding their stoichiometry, the number of protein subunits to form a VLP, is critical for vaccine optimisation. However, current experimental methods to determine stoichiometry are time-consuming and require highly purified proteins. To efficiently classify stoichiometry classes in proteins, we curate a new dataset and propose an interpretable, data-driven pipeline leveraging linear machine learning models. We also explore the impact of feature encoding on model performance and interpretability, as well as methods to identify key protein sequence features influencing classification. The evaluation of our pipeline demonstrates that it can classify stoichiometry while revealing protein features that possibly influence VLP assembly. The data and code used in this work are publicly available at https://github.com/Shef-AIRE/StoicIML.

URLs: https://github.com/Shef-AIRE/StoicIML.

new APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs

Authors: Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun

Abstract: While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB in https://github.com/thunlp/APB.

URLs: https://github.com/thunlp/APB.

new Unifying Explainable Anomaly Detection and Root Cause Analysis in Dynamical Systems

Authors: Yue Sun, Rick S. Blum, Parv Venkitasubramaniam

Abstract: Dynamical systems, prevalent in various scientific and engineering domains, are susceptible to anomalies that can significantly impact their performance and reliability. This paper addresses the critical challenges of anomaly detection, root cause localization, and anomaly type classification in dynamical systems governed by ordinary differential equations (ODEs). We define two categories of anomalies: cyber anomalies, which propagate through interconnected variables, and measurement anomalies, which remain localized to individual variables. To address these challenges, we propose the Interpretable Causality Ordinary Differential Equation (ICODE) Networks, a model-intrinsic explainable learning framework. ICODE leverages Neural ODEs for anomaly detection while employing causality inference through an explanation channel to perform root cause analysis (RCA), elucidating why specific time periods are flagged as anomalous. ICODE is designed to simultaneously perform anomaly detection, RCA, and anomaly type classification within a single, interpretable framework. Our approach is grounded in the hypothesis that anomalies alter the underlying ODEs of the system, manifesting as changes in causal relationships between variables. We provide a theoretical analysis of how perturbations in learned model parameters can be utilized to identify anomalies and their root causes in time series data. Comprehensive experimental evaluations demonstrate the efficacy of ICODE across various dynamical systems, showcasing its ability to accurately detect anomalies, classify their types, and pinpoint their origins.

new Meta-Statistical Learning: Supervised Learning of Statistical Inference

Authors: Maxime Peyrard, Kyunghyun Cho

Abstract: This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks, where the goal is to predict properties of the data-generating distribution rather than labels for individual datapoints. These tasks encompass statistical inference problems such as parameter estimation, hypothesis testing, or mutual information estimation. Framing these tasks within traditional machine learning pipelines is challenging, as supervision is typically tied to individual datapoint. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems. In this approach, entire datasets are treated as single inputs to neural networks, which predict distribution-level parameters. Transformer-based architectures, without positional encoding, provide a natural fit due to their permutation-invariance properties. By training on large-scale synthetic datasets, meta-statistical models can leverage the scalability and optimization infrastructure of Transformer-based LLMs. We demonstrate the framework's versatility with applications in hypothesis testing and mutual information estimation, showing strong performance, particularly for small datasets where traditional neural methods struggle.

new Using the Path of Least Resistance to Explain Deep Networks

Authors: Sina Salek, Joseph Enguehard

Abstract: Integrated Gradients (IG), a widely used axiomatic path-based attribution method, assigns importance scores to input features by integrating model gradients along a straight path from a baseline to the input. While effective in some cases, we show that straight paths can lead to flawed attributions. In this paper, we identify the cause of these misattributions and propose an alternative approach that treats the input space as a Riemannian manifold, computing attributions by integrating gradients along geodesics. We call this method Geodesic Integrated Gradients (GIG). To approximate geodesic paths, we introduce two techniques: a k-Nearest Neighbours-based approach for smaller models and a Stochastic Variational Inference-based method for larger ones. Additionally, we propose a new axiom, Strong Completeness, extending the axioms satisfied by IG. We show that this property is desirable for attribution methods and that GIG is the only method that satisfies it. Through experiments on both synthetic and real-world data, we demonstrate that GIG outperforms existing explainability methods, including IG.

new SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Authors: Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke

Abstract: We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at \$1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from \$50 bug fixes to \$32,000 feature implementations--and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (https://github.com/openai/SWELancer-Benchmark). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.

URLs: https://github.com/openai/SWELancer-Benchmark).

new Scaling Test-Time Compute Without Verification or RL is Suboptimal

Authors: Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar

Abstract: Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling successful search or thinking traces; and second, using verification (e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erd\H{o}s, 1945]. This implies a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as test-time budget grows. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.

new LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Authors: Prasanna Mayilvahanan, Thadd\"aus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

Abstract: Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

new Minimal Ranks, Maximum Confidence: Parameter-efficient Uncertainty Quantification for LoRA

Authors: Patryk Marsza{\l}ek, Klaudia Ba{\l}azy, Jacek Tabor, Tomasz Ku\'smierczyk

Abstract: Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of large language models by decomposing weight updates into low-rank matrices, significantly reducing storage and computational overhead. While effective, standard LoRA lacks mechanisms for uncertainty quantification, leading to overconfident and poorly calibrated models. Bayesian variants of LoRA address this limitation, but at the cost of a significantly increased number of trainable parameters, partially offsetting the original efficiency gains. Additionally, these models are harder to train and may suffer from unstable convergence. In this work, we propose a novel parameter-efficient Bayesian LoRA, demonstrating that effective uncertainty quantification can be achieved in very low-dimensional parameter spaces. The proposed method achieves strong performance with improved calibration and generalization while maintaining computational efficiency. Our empirical findings show that, with the appropriate projection of the weight space: (1) uncertainty can be effectively modeled in a low-dimensional space, and (2) weight covariances exhibit low ranks.

new LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities

Authors: Florian Sestak, Artur Toshev, Andreas F\"urst, G\"unter Klambauer, Andreas Mayr, Johannes Brandstetter

Abstract: Generative models are spearheading recent progress in deep learning, showing strong promise for trajectory sampling in dynamical systems as well. However, while latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems -- from chemical molecule structures to collective human behavior -- are described by interactions of entities, making them inherently linked to connectivity patterns and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), combines the advantages of graph neural networks, i.e., the traceability of entities across time-steps, with the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder are frozen to enable generative modeling in the latent space. The core idea of LaM-SLidE is to introduce identifier representations (IDs) to allow for retrieval of entity properties, e.g., entity coordinates, from latent system representations and thus enables traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. (Code is available at https://github.com/ml-jku/LaM-SLidE)

URLs: https://github.com/ml-jku/LaM-SLidE)

cross A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia

Authors: Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre K{\i}c{\i}man, Hamid Palangi, Barun Patra, Robert West

Abstract: Large language models (LLMs) have an impressive ability to draw on novel information supplied in their context. Yet the mechanisms underlying this contextual grounding remain unknown, especially in situations where contextual information contradicts factual knowledge stored in the parameters, which LLMs also excel at recalling. Favoring the contextual information is critical for retrieval-augmented generation methods, which enrich the context with up-to-date information, hoping that grounding can rectify outdated or noisy stored knowledge. We present a novel method to study grounding abilities using Fakepedia, a novel dataset of counterfactual texts constructed to clash with a model's internal parametric knowledge. In this study, we introduce Fakepedia, a counterfactual dataset designed to evaluate grounding abilities when the internal parametric knowledge clashes with the contextual information. We benchmark various LLMs with Fakepedia and conduct a causal mediation analysis of LLM components when answering Fakepedia queries, based on our Masked Grouped Causal Tracing (MGCT) method. Through this analysis, we identify distinct computational patterns between grounded and ungrounded responses. We finally demonstrate that distinguishing grounded from ungrounded responses is achievable through computational analysis alone. Our results, together with existing findings about factual recall mechanisms, provide a coherent narrative of how grounding and factual recall mechanisms interact within LLMs.

cross Surrogate models for diffusion on graphs via sparse polynomials

Authors: Giuseppe Alessio D'Inverno, Kylian Ajavon, Simone Brugiapaglia

Abstract: Diffusion kernels over graphs have been widely utilized as effective tools in various applications due to their ability to accurately model the flow of information through nodes and edges. However, there is a notable gap in the literature regarding the development of surrogate models for diffusion processes on graphs. In this work, we fill this gap by proposing sparse polynomial-based surrogate models for parametric diffusion equations on graphs with community structure. In tandem, we provide convergence guarantees for both least squares and compressed sensing-based approximations by showing the holomorphic regularity of parametric solutions to these diffusion equations. Our theoretical findings are accompanied by a series of numerical experiments conducted on both synthetic and real-world graphs that demonstrate the applicability of our methodology.

cross DASKT: A Dynamic Affect Simulation Method for Knowledge Tracing

Authors: Xinjie Sun, Kai Zhang, Qi Liu, Shuanghong Shen, Fei Wang, Yuxiang Guo, Enhong Chen

Abstract: Knowledge Tracing (KT) predicts future performance by modeling students' historical interactions, and understanding students' affective states can enhance the effectiveness of KT, thereby improving the quality of education. Although traditional KT values students' cognition and learning behaviors, efficient evaluation of students' affective states and their application in KT still require further exploration due to the non-affect-oriented nature of the data and budget constraints. To address this issue, we propose a computation-driven approach, Dynamic Affect Simulation Knowledge Tracing (DASKT), to explore the impact of various student affective states (such as frustration, concentration, boredom, and confusion) on their knowledge states. In this model, we first extract affective factors from students' non-affect-oriented behavioral data, then use clustering and spatiotemporal sequence modeling to accurately simulate students' dynamic affect changes when dealing with different problems. Subsequently, {\color{blue}we incorporate affect with time-series analysis to improve the model's ability to infer knowledge states over time and space.} Extensive experimental results on two public real-world educational datasets show that DASKT can achieve more reasonable knowledge states under the effect of students' affective states. Moreover, DASKT outperforms the most advanced KT methods in predicting student performance. Our research highlights a promising avenue for future KT studies, focusing on achieving high interpretability and accuracy.

cross Practical Application and Limitations of AI Certification Catalogues

Authors: Gregor Autischer, Kerstin Waxnegger, Dominik Kowald

Abstract: In this work-in-progress, we investigate the certification of artificial intelligence (AI) systems, focusing on the practical application and limitations of existing certification catalogues by attempting to certify a publicly available AI system. We aim to evaluate how well current approaches work to effectively certify an AI system, and how publicly accessible AI systems, that might not be actively maintained or initially intended for certification, can be selected and used for a sample certification process. Our methodology involves leveraging the Fraunhofer AI Assessment Catalogue as a comprehensive tool to systematically assess an AI model's compliance with certification standards. We find that while the catalogue effectively structures the evaluation process, it can also be cumbersome and time-consuming to use. We observe the limitations of an AI system that has no active development team anymore and highlighted the importance of complete system documentation. Finally, we identify some limitations of the certification catalogues used and proposed ideas on how to streamline the certification process.

cross Machine Learning-Driven Convergence Analysis in Multijurisdictional Compliance Using BERT and K-Means Clustering

Authors: Raj Sonani, Lohalekar Prayas

Abstract: Digital data continues to grow, there has been a shift towards using effective regulatory mechanisms to safeguard personal information. The CCPA of California and the General Data Protection Regulation (GDPR) of the European Union are two of the most important privacy laws. The regulation is intended to safeguard consumer privacy, but it varies greatly in scope, definitions, and methods of enforcement. This paper presents a fresh approach to adaptive compliance, using machine learning and emphasizing natural language processing (NLP) as the primary focus of comparison between the GDPR and CCPA. Using NLP, this study compares various regulations to identify areas where they overlap or diverge. This includes the "right to be forgotten" provision in the GDPR and the "opt-out of sale" provision under CCPA. International companies can learn valuable lessons from this report, as it outlines strategies for better enforcement of laws across different nations. Additionally, the paper discusses the challenges of utilizing NLP in legal literature and proposes methods to enhance the model-ability of machine learning models for studying regulations. The study's objective is to "bridge the gap between legal knowledge and technical expertise" by developing regulatory compliance strategies that are more efficient in operation and more effective in data protection.

cross A Neural Network Training Method Based on Neuron Connection Coefficient Adjustments

Authors: Kun Jiang

Abstract: In previous studies, we introduced a neural network framework based on symmetric differential equations, along with one of its training methods. In this article, we present another training approach for this neural network. This method leverages backward signal propagation and eliminates reliance on the traditional chain derivative rule, offering a high degree of biological interpretability. Unlike the previously introduced method, this approach does not require adjustments to the fixed points of the differential equations. Instead, it focuses solely on modifying the connection coefficients between neurons, closely resembling the training process of traditional multilayer perceptron (MLP) networks. By adopting a suitable adjustment strategy, this method effectively avoids certain potential local minima. To validate this approach, we tested it on the MNIST dataset and achieved promising results. Through further analysis, we identified certain limitations of the current neural network architecture and proposed measures for improvement.

cross A Novel Multi-Objective Evolutionary Algorithm for Counterfactual Generation

Authors: Gabriel Doyle-Finch, Alex A. Freitas

Abstract: Machine learning algorithms that learn black-box predictive models (which cannot be directly interpreted) are increasingly used to make predictions affecting the lives of people. It is important that users understand the predictions of such models, particularly when the model outputs a negative prediction for the user (e.g. denying a loan). Counterfactual explanations provide users with guidance on how to change some of their characteristics to receive a different, positive classification by a predictive model. For example, if a predictive model rejected a loan application from a user, a counterfactual explanation might state: If your salary was {\pounds}50,000 (rather than your current {\pounds}35,000), then your loan would be approved. This paper proposes two novel contributions: (a) a novel multi-objective Evolutionary Algorithm (EA) for counterfactual generation based on lexicographic optimisation, rather than the more popular Pareto dominance approach; and (b) an extension to the definition of the objective of validity for a counterfactual, based on measuring the resilience of a counterfactual to violations of monotonicity constraints which are intuitively expected by users; e.g., intuitively, the probability of a loan application to be approved would monotonically increase with an increase in the salary of the applicant. Experiments involving 15 experimental settings (3 types of black box models times 5 datasets) have shown that the proposed lexicographic optimisation-based EA is very competitive with an existing Pareto dominance-based EA; and the proposed extension of the validity objective has led to a substantial increase in the validity of the counterfactuals generated by the proposed EA.

cross A Hybrid Swarm Intelligence Approach for Optimizing Multimodal Large Language Models Deployment in Edge-Cloud-based Federated Learning Environments

Authors: Gaith Rjouba, Hanae Elmekki, Saidul Islam, Jamal Bentahar, Rachida Dssouli

Abstract: The combination of Federated Learning (FL), Multimodal Large Language Models (MLLMs), and edge-cloud computing enables distributed and real- time data processing while preserving privacy across edge devices and cloud infrastructure. However, the deployment of MLLMs in FL environments with resource-constrained edge devices presents significant challenges, in- cluding resource management, communication overhead, and non-IID data. To address these challenges, we propose a novel hybrid framework wherein MLLMs are deployed on edge devices equipped with sufficient resources and battery life, while the majority of training occurs in the cloud. To identify suitable edge devices for deployment, we employ Particle Swarm Optimiza- tion (PSO), and Ant Colony Optimization (ACO) is utilized to optimize the transmission of model updates between edge and cloud nodes. This proposed swarm intelligence-based framework aims to enhance the efficiency of MLLM training by conducting extensive training in the cloud and fine-tuning at the edge, thereby reducing energy consumption and communication costs. Our experimental results show that the proposed method significantly improves system performance, achieving an accuracy of 92%, reducing communica- tion cost by 30%, and enhancing client participation compared to traditional FL methods. These results make the proposed approach highly suitable for large-scale edge-cloud computing systems.

cross DRiVE: Dynamic Recognition in VEhicles using snnTorch

Authors: Heerak Vora, Param Pathak, Parul Bakaraniya

Abstract: Spiking Neural Networks (SNNs) mimic biological brain activity, processing data efficiently through an event-driven design, wherein the neurons activate only when inputs exceed specific thresholds. Their ability to track voltage changes over time via membrane potential dynamics helps retain temporal information. This study combines SNNs with PyTorch's adaptable framework, snnTorch, to test their potential for image-based tasks. We introduce DRiVE, a vehicle detection model that uses spiking neuron dynamics to classify images, achieving 94.8% accuracy and a near-perfect 0.99 AUC score. These results highlight DRiVE's ability to distinguish vehicle classes effectively, challenging the notion that SNNs are limited to temporal data. As interest grows in energy-efficient neural models, DRiVE's success emphasizes the need to refine SNN optimization for visual tasks. This work encourages broader exploration of SNNs in scenarios where conventional networks struggle, particularly for real-world applications requiring both precision and efficiency.

cross Spiking Neural Network Feature Discrimination Boosts Modality Fusion

Authors: Katerina Maria Oikonomou, Ioannis Kansizoglou, Antonios Gasteratos

Abstract: Feature discrimination is a crucial aspect of neural network design, as it directly impacts the network's ability to distinguish between classes and generalize across diverse datasets. The accomplishment of achieving high-quality feature representations ensures high intra-class separability and poses one of the most challenging research directions. While conventional deep neural networks (DNNs) rely on complex transformations and very deep networks to come up with meaningful feature representations, they usually require days of training and consume significant energy amounts. To this end, spiking neural networks (SNNs) offer a promising alternative. SNN's ability to capture temporal and spatial dependencies renders them particularly suitable for complex tasks, where multi-modal data are required. In this paper, we propose a feature discrimination approach for multi-modal learning with SNNs, focusing on audio-visual data. We employ deep spiking residual learning for visual modality processing and a simpler yet efficient spiking network for auditory modality processing. Lastly, we deploy a spiking multilayer perceptron for modality fusion. We present our findings and evaluate our approach against similar works in the field of classification challenges. To the best of our knowledge, this is the first work investigating feature discrimination in SNNs.

cross Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning

Authors: Libo Wang

Abstract: To reduce the cost and consumption of computing resources caused by computational redundancy and delayed reward assignment in long CoT, this research proposes the dynamic chain-of-thought with adaptive reasoning time and steps. The researcher used simulation experiment to simulate the integration of D-CoT through Python 3.13 IDLE combined with a Python simulator based on GPTs. At the same time, the researcher used DeepSeek R1 as a control group to test and compare the performance of the D-CoT simulator in processing MIT OpenCourseWare's linear algebra exam questions. Experimental results show that D-CoT is better than DeepSeek R1 based on long CoT in three indicators: reasoning time, CoT length (reasoning steps) and token count, which achieves a significant reduction in computing resource consumption. In addition, this research has potential value in deep reasoning optimization and can be used as a reference for future dynamic deep reasoning frameworks.

cross Neural Genetic Search in Discrete Spaces

Authors: Hyeonah Kim, Sanghyeok Choi, Jiwoo Son, Jinkyoo Park, Changhyun Kwon

Abstract: Effective search methods are crucial for improving the performance of deep generative models at test time. In this paper, we introduce a novel test-time search method, Neural Genetic Search (NGS), which incorporates the evolutionary mechanism of genetic algorithms into the generation procedure of deep models. The core idea behind NGS is its crossover, which is defined as parent-conditioned generation using trained generative models. This approach offers a versatile and easy-to-implement search algorithm for deep generative models. We demonstrate the effectiveness and flexibility of NGS through experiments across three distinct domains: routing problems, adversarial prompt generation for language models, and molecular design.

cross MERGE$^3$: Efficient Evolutionary Merging on Consumer-grade GPUs

Authors: Tommaso Mencattini, Adrian Robert Minut, Donato Crisostomi, Andrea Santilli, Emanuele Rodol\`a

Abstract: Evolutionary model merging enables the creation of high-performing multi-task models but remains computationally prohibitive for consumer hardware. We introduce MERGE$^3$, an efficient framework that makes evolutionary merging feasible on a single GPU by reducing fitness computation costs 50$\times$ while preserving performance. MERGE$^3$ achieves this by Extracting a reduced dataset for evaluation, Estimating model abilities using Item Response Theory (IRT), and Evolving optimal merges via IRT-based performance estimators. Our method enables state-of-the-art multilingual and cross-lingual merging, transferring knowledge across languages with significantly lower computational overhead. We provide theoretical guarantees and an open-source library, democratizing high-quality model merging.

cross Injecting Universal Jailbreak Backdoors into LLMs in Minutes

Authors: Zhuowei Chen, Qiannan Zhang, Shichao Pei

Abstract: Jailbreak backdoor attacks on LLMs have garnered attention for their effectiveness and stealth. However, existing methods rely on the crafting of poisoned datasets and the time-consuming process of fine-tuning. In this work, we propose JailbreakEdit, a novel jailbreak backdoor injection method that exploits model editing techniques to inject a universal jailbreak backdoor into safety-aligned LLMs with minimal intervention in minutes. JailbreakEdit integrates a multi-node target estimation to estimate the jailbreak space, thus creating shortcuts from the backdoor to this estimated jailbreak space that induce jailbreak actions. Our attack effectively shifts the models' attention by attaching strong semantics to the backdoor, enabling it to bypass internal safety mechanisms. Experimental results show that JailbreakEdit achieves a high jailbreak success rate on jailbreak prompts while preserving generation quality, and safe performance on normal queries. Our findings underscore the effectiveness, stealthiness, and explainability of JailbreakEdit, emphasizing the need for more advanced defense mechanisms in LLMs.

cross Crypto Miner Attack: GPU Remote Code Execution Attacks

Authors: Ariel Szabo, Uzy Hadad

Abstract: Remote Code Execution (RCE) exploits pose a significant threat to AI and ML systems, particularly in GPU-accelerated environments where the computational power of GPUs can be misused for malicious purposes. This paper focuses on RCE attacks leveraging deserialization vulnerabilities and custom layers, such as TensorFlow Lambda layers, which are often overlooked due to the complexity of monitoring GPU workloads. These vulnerabilities enable attackers to execute arbitrary code, blending malicious activity seamlessly into expected model behavior and exploiting GPUs for unauthorized tasks such as cryptocurrency mining. Unlike traditional CPU-based attacks, the parallel processing nature of GPUs and their high resource utilization make runtime detection exceptionally challenging. In this work, we provide a comprehensive examination of RCE exploits targeting GPUs, demonstrating an attack that utilizes these vulnerabilities to deploy a crypto miner on a GPU. We highlight the technical intricacies of such attacks, emphasize their potential for significant financial and computational costs, and propose strategies for mitigation. By shedding light on this underexplored attack vector, we aim to raise awareness and encourage the adoption of robust security measures in GPU-driven AI and ML systems, with an emphasis on static and model scanning as an easier way to detect exploits.

cross Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Ownership Verification with Reasoning

Authors: Junfeng Guo, Yiming Li, Ruibo Chen, Yihan Wu, Chenxi Liu, Yanshuo Chen, Heng Huang

Abstract: Large language models (LLMs) are increasingly integrated into real-world applications through retrieval-augmented generation (RAG) mechanisms to supplement their responses with up-to-date and domain-specific knowledge. However, the valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries. Existing methods that can be generalized as watermarking techniques to protect these knowledge bases typically involve poisoning attacks. However, these methods require to alter the results of verification samples (\eg, generating incorrect outputs), inevitably making them susceptible to anomaly detection and even introduce new security risks. To address these challenges, we propose \name{} for `harmless' copyright protection of knowledge bases. Instead of manipulating LLM's final output, \name{} implants distinct verification behaviors in the space of chain-of-thought (CoT) reasoning, maintaining the correctness of the final answer. Our method has three main stages: (1) \textbf{Generating CoTs}: For each verification question, we generate two CoTs, including a target CoT for building watermark behaviors; (2) \textbf{Optimizing Watermark Phrases and Target CoTs}: We optimize them to minimize retrieval errors under the black-box setting of suspicious LLM, ensuring that the watermarked verification queries activate the target CoTs without being activated in non-watermarked ones; (3) \textbf{Ownership Verification}: We exploit a pairwise Wilcoxon test to statistically verify whether a suspicious LLM is augmented with the protected knowledge base by comparing its responses to watermarked and benign verification queries. Our experiments on diverse benchmarks demonstrate that \name{} effectively protects knowledge bases against unauthorized usage while preserving the integrity and performance of the RAG.

cross AI Alignment at Your Discretion

Authors: Maarten Buyl, Hadi Khalaf, Claudio Mayrink Verdun, Lucas Monteiro Paes, Caio C. Vieira Machado, Flavio du Pin Calmon

Abstract: In AI alignment, extensive latitude must be granted to annotators, either human or algorithmic, to judge which model outputs are `better' or `safer.' We refer to this latitude as alignment discretion. Such discretion remains largely unexamined, posing two risks: (i) annotators may use their power of discretion arbitrarily, and (ii) models may fail to mimic this discretion. To study this phenomenon, we draw on legal concepts of discretion that structure how decision-making authority is conferred and exercised, particularly in cases where principles conflict or their application is unclear or irrelevant. Extended to AI alignment, discretion is required when alignment principles and rules are (inevitably) conflicting or indecisive. We present a set of metrics to systematically analyze when and how discretion in AI alignment is exercised, such that both risks (i) and (ii) can be observed. Moreover, we distinguish between human and algorithmic discretion and analyze the discrepancy between them. By measuring both human and algorithmic discretion over safety alignment datasets, we reveal layers of discretion in the alignment process that were previously unaccounted for. Furthermore, we demonstrate how algorithms trained on these datasets develop their own forms of discretion in interpreting and applying these principles, which challenges the purpose of having any principles at all. Our paper presents the first step towards formalizing this core gap in current alignment processes, and we call on the community to further scrutinize and control alignment discretion.

cross MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

Authors: Sungnyun Kim, Kangwook Jang, Sangmin Bae, Sungwoo Cho, Se-Young Yun

Abstract: Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.

cross Trustworthy AI on Safety, Bias, and Privacy: A Survey

Authors: Xingli Fang, Jianwei Li, Varun Mulchandani, Jung-Eun Kim

Abstract: The capabilities of artificial intelligence systems have been advancing to a great extent, but these systems still struggle with failure modes, vulnerabilities, and biases. In this paper, we study the current state of the field, and present promising insights and perspectives regarding concerns that challenge the trustworthiness of AI models. In particular, this paper investigates the issues regarding three thrusts: safety, privacy, and bias, which hurt models' trustworthiness. For safety, we discuss safety alignment in the context of large language models, preventing them from generating toxic or harmful content. For bias, we focus on spurious biases that can mislead a network. Lastly, for privacy, we cover membership inference attacks in deep neural networks. The discussions addressed in this paper reflect our own experiments and observations.

cross Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach

Authors: R\'egnier Avice, Bernhard Haslhofer, Zhidong Li, Jianlong Zhou

Abstract: Attribution tags form the foundation of modern cryptoasset forensics. However, inconsistent or incorrect tags can mislead investigations and even result in false accusations. To address this issue, we propose a novel computational method based on Large Language Models (LLMs) to link attribution tags with well-defined knowledge graph concepts. We implemented this method in an end-to-end pipeline and conducted experiments showing that our approach outperforms baseline methods by up to 37.4% in F1-score across three publicly available attribution tag datasets. By integrating concept filtering and blocking procedures, we generate candidate sets containing five knowledge graph entities, achieving a recall of 93% without the need for labeled data. Additionally, we demonstrate that local LLM models can achieve F1-scores of 90%, comparable to remote models which achieve 94%. We also analyze the cost-performance trade-offs of various LLMs and prompt templates, showing that selecting the most cost-effective configuration can reduce costs by 90%, with only a 1% decrease in performance. Our method not only enhances attribution tag quality but also serves as a blueprint for fostering more reliable forensic evidence.

cross Diverse Transformer Decoding for Offline Reinforcement Learning Using Financial Algorithmic Approaches

Authors: Dan Elbaz, Oren Salzman

Abstract: Offline Reinforcement Learning (RL) algorithms learn a policy using a fixed training dataset, which is then deployed online to interact with the environment and make decisions. Transformers, a standard choice for modeling time-series data, are gaining popularity in offline RL. In this context, Beam Search (BS), an approximate inference algorithm, is the go-to decoding method. Offline RL eliminates the need for costly or risky online data collection. However, the restricted dataset induces uncertainty as the agent may encounter unfamiliar sequences of states and actions during execution that were not covered in the training data. In this context, BS lacks two important properties essential for offline RL: It does not account for the aforementioned uncertainty, and its greedy left-right search approach often results in sequences with minimal variations, failing to explore potentially better alternatives. To address these limitations, we propose Portfolio Beam Search (PBS), a simple-yet-effective alternative to BS that balances exploration and exploitation within a Transformer model during decoding. We draw inspiration from financial economics and apply these principles to develop an uncertainty-aware diversification mechanism, which we integrate into a sequential decoding algorithm at inference time. We empirically demonstrate the effectiveness of PBS on the D4RL locomotion benchmark, where it achieves higher returns and significantly reduces outcome variability.

cross Knowledge Integration Strategies in Autonomous Vehicle Prediction and Planning: A Comprehensive Survey

Authors: Kumar Manas, Adrian Paschke

Abstract: This comprehensive survey examines the integration of knowledge-based approaches into autonomous driving systems, with a focus on trajectory prediction and planning. We systematically review methodologies for incorporating domain knowledge, traffic rules, and commonsense reasoning into these systems, spanning purely symbolic representations to hybrid neuro-symbolic architectures. In particular, we analyze recent advancements in formal logic and differential logic programming, reinforcement learning frameworks, and emerging techniques that leverage large foundation models and diffusion models for knowledge representation. Organized under a unified literature survey section, our discussion synthesizes the state-of-the-art into a high-level overview, supported by a detailed comparative table that maps key works to their respective methodological categories. This survey not only highlights current trends -- including the growing emphasis on interpretable AI, formal verification in safety-critical systems, and the increased use of generative models in prediction and planning -- but also outlines the challenges and opportunities for developing robust, knowledge-enhanced autonomous driving systems.

cross Forecasting time series with constraints

Authors: Nathan Doum\`eche (LPSM, EDF R&D OSIRIS), Francis Bach (DI-ENS, SIERRA), \'Eloi Bedek (EDF R&D OSIRIS), G\'erard Biau (LPSM, IUF), Claire Boyer (LMO, IUF), Yannig Goude (EDF R&D OSIRIS, LMO)

Abstract: Time series forecasting presents unique challenges that limit the effectiveness of traditional machine learning algorithms. To address these limitations, various approaches have incorporated linear constraints into learning algorithms, such as generalized additive models and hierarchical forecasting. In this paper, we propose a unified framework for integrating and combining linear constraints in time series forecasting. Within this framework, we show that the exact minimizer of the constrained empirical risk can be computed efficiently using linear algebra alone. This approach allows for highly scalable implementations optimized for GPUs. We validate the proposed methodology through extensive benchmarking on real-world tasks, including electricity demand forecasting and tourism forecasting, achieving state-of-the-art performance.

cross F-StrIPE: Fast Structure-Informed Positional Encoding for Symbolic Music Generation

Authors: Manvi Agarwal (IP Paris, LTCI, IDS), Changhong Wang (LTCI), Gael Richard (S2A, IDS)

Abstract: While music remains a challenging domain for generative models like Transformers, recent progress has been made by exploiting suitable musically-informed priors. One technique to leverage information about musical structure in Transformers is inserting such knowledge into the positional encoding (PE) module. However, Transformers carry a quadratic cost in sequence length. In this paper, we propose F-StrIPE, a structure-informed PE scheme that works in linear complexity. Using existing kernel approximation techniques based on random features, we show that F-StrIPE is a generalization of Stochastic Positional Encoding (SPE). We illustrate the empirical merits of F-StrIPE using melody harmonization for symbolic music.

cross SWA-LDM: Toward Stealthy Watermarks for Latent Diffusion Models

Authors: Zhonghao Yang, Linye Lyu, Xuanhang Chang, Daojing He, YU LI

Abstract: In the rapidly evolving landscape of image generation, Latent Diffusion Models (LDMs) have emerged as powerful tools, enabling the creation of highly realistic images. However, this advancement raises significant concerns regarding copyright infringement and the potential misuse of generated content. Current watermarking techniques employed in LDMs often embed constant signals to the generated images that compromise their stealthiness, making them vulnerable to detection by malicious attackers. In this paper, we introduce SWA-LDM, a novel approach that enhances watermarking by randomizing the embedding process, effectively eliminating detectable patterns while preserving image quality and robustness. Our proposed watermark presence attack reveals the inherent vulnerabilities of existing latent-based watermarking methods, demonstrating how easily these can be exposed. Through comprehensive experiments, we validate that SWA-LDM not only fortifies watermark stealthiness but also maintains competitive performance in watermark robustness and visual fidelity. This work represents a pivotal step towards securing LDM-generated images against unauthorized use, ensuring both copyright protection and content integrity in an era where digital image authenticity is paramount.

cross GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs

Authors: Shima Khoshraftar, Niaz Abedini, Amir Hajian

Abstract: The application of large language models (LLMs) to graph data has attracted a lot of attention recently. LLMs allow us to use deep contextual embeddings from pretrained models in text-attributed graphs, where shallow embeddings are often used for the text at- tributes of nodes. However, it is still challenging to efficiently en- code the graph structure and features into a sequential form for use by LLMs. In addition, the performance of an LLM alone, is highly dependent on the structure of the input prompt, which limits their effectiveness as a reliable approach and often requires iterative man- ual adjustments that could be slow, tedious and difficult to replicate programmatically. In this paper, we propose GraphiT (Graphs in Text), a framework for encoding graphs into a textual format and optimizing LLM prompts for graph prediction tasks. Here we focus on node classification for text-attributed graphs. We encode the graph data for every node and its neighborhood into a concise text to enable LLMs to better utilize the information in the graph. We then further programmatically optimize the LLM prompts us- ing the DSPy framework to automate this step and make it more efficient and reproducible. GraphiT outperforms our LLM-based baselines on three datasets and we show how the optimization step in GraphiT leads to measurably better results without manual prompt tweaking. We also demonstrated that our graph encoding approach is competitive to other graph encoding methods while being less expensive because it uses significantly less tokens for the same task.

cross Towards Watermarking of Open-Source LLMs

Authors: Thibaud Gloaguen, Nikola Jovanovi\'c, Robin Staab, Martin Vechev

Abstract: While watermarks for closed LLMs have matured and have been included in large-scale deployments, these methods are not applicable to open-source models, which allow users full control over the decoding process. This setting is understudied yet critical, given the rising performance of open-source models. In this work, we lay the foundation for systematic study of open-source LLM watermarking. For the first time, we explicitly formulate key requirements, including durability against common model modifications such as model merging, quantization, or finetuning, and propose a concrete evaluation setup. Given the prevalence of these modifications, durability is crucial for an open-source watermark to be effective. We survey and evaluate existing methods, showing that they are not durable. We also discuss potential ways to improve their durability and highlight remaining challenges. We hope our work enables future progress on this important problem.

cross PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation

Authors: Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner

Abstract: The interpretation of histopathology cases underlies many important diagnostic and treatment decisions in medicine. Notably, this process typically requires pathologists to integrate and summarize findings across multiple slides per case. Existing vision-language capabilities in computational pathology have so far been largely limited to small regions of interest, larger regions at low magnification, or single whole-slide images (WSIs). This limits interpretation of findings that span multiple high-magnification regions across multiple WSIs. By making use of Gemini 1.5 Flash, a large multimodal model (LMM) with a 1-million token context window, we demonstrate the ability to generate bottom-line diagnoses from up to 40,000 768x768 pixel image patches from multiple WSIs at 10X magnification. This is the equivalent of up to 11 hours of video at 1 fps. Expert pathologist evaluations demonstrate that the generated report text is clinically accurate and equivalent to or preferred over the original reporting for 68% (95% CI: [60%, 76%]) of multi-slide examples with up to 5 slides. While performance decreased for examples with 6 or more slides, this study demonstrates the promise of leveraging the long-context capabilities of modern LMMs for the uniquely challenging task of medical report generation where each case can contain thousands of image patches.

cross Recent Advances in Malware Detection: Graph Learning and Explainability

Authors: Hossein Shokouhinejad, Roozbeh Razavi-Far, Hesamodin Mohammadian, Mahdi Rabbani, Samuel Ansong, Griffin Higgins, Ali A Ghorbani

Abstract: The rapid evolution of malware has necessitated the development of sophisticated detection methods that go beyond traditional signature-based approaches. Graph learning techniques have emerged as powerful tools for modeling and analyzing the complex relationships inherent in malware behavior, leveraging advancements in Graph Neural Networks (GNNs) and related methods. This survey provides a comprehensive exploration of recent advances in malware detection, focusing on the interplay between graph learning and explainability. It begins by reviewing malware analysis techniques and datasets, emphasizing their foundational role in understanding malware behavior and supporting detection strategies. The survey then discusses feature engineering, graph reduction, and graph embedding methods, highlighting their significance in transforming raw data into actionable insights, while ensuring scalability and efficiency. Furthermore, this survey focuses on explainability techniques and their applications in malware detection, ensuring transparency and trustworthiness. By integrating these components, this survey demonstrates how graph learning and explainability contribute to building robust, interpretable, and scalable malware detection systems. Future research directions are outlined to address existing challenges and unlock new opportunities in this critical area of cybersecurity.

cross Detecting and Monitoring Bias for Subgroups in Breast Cancer Detection AI

Authors: Amit Kumar Kundu, Florence X. Doo, Vaishnavi Patil, Amitabh Varshney, Joseph Jaja

Abstract: Automated mammography screening plays an important role in early breast cancer detection. However, current machine learning models, developed on some training datasets, may exhibit performance degradation and bias when deployed in real-world settings. In this paper, we analyze the performance of high-performing AI models on two mammography datasets-the Emory Breast Imaging Dataset (EMBED) and the RSNA 2022 challenge dataset. Specifically, we evaluate how these models perform across different subgroups, defined by six attributes, to detect potential biases using a range of classification metrics. Our analysis identifies certain subgroups that demonstrate notable underperformance, highlighting the need for ongoing monitoring of these subgroups' performance. To address this, we adopt a monitoring method designed to detect performance drifts over time. Upon identifying a drift, this method issues an alert, which can enable timely interventions. This approach not only provides a tool for tracking the performance but also helps ensure that AI models continue to perform effectively across diverse populations.

cross Post-training an LLM for RAG? Train on Self-Generated Demonstrations

Authors: Matthew Finlayson, Ilia Kulikov, Daneil M. Bikel, Barlas Oguz, Xilun Chen, Aasish Pappu

Abstract: Large language models (LLMs) often struggle with knowledge intensive NLP tasks, such as answering "Who won the latest World Cup?" because the knowledge they learn during training may be insufficient or outdated. Conditioning generation on retrieved documents -- a technique known as retrieval augmented generation (RAG) -- mitigates these shortcomings by allowing the model to leverage in-context information. Practitioners can improve LLM RAG performance by fine-tuning on retrieval-augmented instructions, but must beware that this can cause undesirable model behaviors like hallucinations. We attribute this degradation to the fact that the training data is likely to be out-of-distribution for the model and may suffer from quality issues, such as misalignment between retrievals and target responses (since retrievals are frequently added post-hoc). We propose a recipe for training RAG-enabled LLMs using self-generated demonstrations, thereby avoiding training on out-of-distribution text and integrating retrievals into the LLM responses. We evaluate our method on knowledge intensive question answering (QA) tasks and show that our method teaches LLMs to properly handle in-context retrievals and abstain from questions it will likely get wrong. Compared to conventional RA-IT methods, our method prevents model degradation in non-RAG settings while exhibiting superior QA performance.

cross Federated Learning-Driven Cybersecurity Framework for IoT Networks with Privacy-Preserving and Real-Time Threat Detection Capabilities

Authors: Milad Rahmati

Abstract: The rapid expansion of the Internet of Things (IoT) ecosystem has transformed various sectors but has also introduced significant cybersecurity challenges. Traditional centralized security methods often struggle to balance privacy preservation and real-time threat detection in IoT networks. To address these issues, this study proposes a Federated Learning-Driven Cybersecurity Framework designed specifically for IoT environments. The framework enables decentralized data processing by training models locally on edge devices, ensuring data privacy. Secure aggregation of these locally trained models is achieved using homomorphic encryption, allowing collaborative learning without exposing sensitive information. The proposed framework utilizes recurrent neural networks (RNNs) for anomaly detection, optimized for resource-constrained IoT networks. Experimental results demonstrate that the system effectively detects complex cyber threats, including distributed denial-of-service (DDoS) attacks, with over 98% accuracy. Additionally, it improves energy efficiency by reducing resource consumption by 20% compared to centralized approaches. This research addresses critical gaps in IoT cybersecurity by integrating federated learning with advanced threat detection techniques. The framework offers a scalable and privacy-preserving solution adaptable to various IoT applications. Future work will explore the integration of blockchain for transparent model aggregation and quantum-resistant cryptographic methods to further enhance security in evolving technological landscapes.

cross Weighted quantization using MMD: From mean field to mean shift via gradient flows

Authors: Ayoub Belhadji, Daniel Sharp, Youssef Marzouk

Abstract: Approximating a probability distribution using a set of particles is a fundamental problem in machine learning and statistics, with applications including clustering and quantization. Formally, we seek a finite weighted mixture of Dirac measures that best approximates the target distribution. While much existing work relies on the Wasserstein distance to quantify approximation errors, maximum mean discrepancy (MMD) has received comparatively less attention, especially when allowing for variable particle weights. We study the quantization problem from the perspective of minimizing MMD via gradient flow in the Wasserstein-Fisher-Rao (WFR) geometry. This gradient flow yields an ODE system from which we further derive a fixed-point algorithm called mean shift interacting particles (MSIP). We show that MSIP extends the (non-interacting) mean shift algorithm, widely used for identifying modes in kernel density estimates. Moreover, we show that MSIP can be interpreted as preconditioned gradient descent, and that it acts as a relaxation of Lloyd's algorithm for clustering. Our numerical experiments demonstrate that MSIP and the WFR ODEs outperform other algorithms for quantization of multi-modal and high-dimensional targets.

cross Data-driven Super-Resolution of Flood Inundation Maps using Synthetic Simulations

Authors: Akshay Aravamudan, Zimeena Rasheed, Xi Zhang, Kira E. Scarpignato, Efthymios I. Nikolopoulos, Witold F. Krajewski, Georgios C. Anagnostopoulos

Abstract: The frequency of extreme flood events is increasing throughout the world. Daily, high-resolution (30m) Flood Inundation Maps (FIM) observed from space play a key role in informing mitigation and preparedness efforts to counter these extreme events. However, the temporal frequency of publicly available high-resolution FIMs, e.g., from Landsat, is at the order of two weeks thus limiting the effective monitoring of flood inundation dynamics. Conversely, global, low-resolution (~300m) Water Fraction Maps (WFM) are publicly available from NOAA VIIRS daily. Motivated by the recent successes of deep learning methods for single image super-resolution, we explore the effectiveness and limitations of similar data-driven approaches to downscaling low-resolution WFMs to high-resolution FIMs. To overcome the scarcity of high-resolution FIMs, we train our models with high-quality synthetic data obtained through physics-based simulations. We evaluate our models on real-world data from flood events in the state of Iowa. The study indicates that data-driven approaches exhibit superior reconstruction accuracy over non-data-driven alternatives and that the use of synthetic data is a viable proxy for training purposes. Additionally, we show that our trained models can exhibit superior zero-shot performance when transferred to regions with hydroclimatological similarity to the U.S. Midwest.

cross Batch-Adaptive Annotations for Causal Inference with Complex-Embedded Outcomes

Authors: Ezinne Nwankwo, Lauri Goldkind, Angela Zhou

Abstract: Estimating the causal effects of an intervention on outcomes is crucial. But often in domains such as healthcare and social services, this critical information about outcomes is documented by unstructured text, e.g. clinical notes in healthcare or case notes in social services. For example, street outreach to homeless populations is a common social services intervention, with ambiguous and hard-to-measure outcomes. Outreach workers compile case note records which are informative of outcomes. Although experts can succinctly extract relevant information from such unstructured case notes, it is costly or infeasible to do so for an entire corpus, which can span millions of notes. Recent advances in large language models (LLMs) enable scalable but potentially inaccurate annotation of unstructured text data. We leverage the decision of which datapoints should receive expert annotation vs. noisy imputation under budget constraints in a "design-based" estimator combining limited expert and plentiful noisy imputation data via \textit{causal inference with missing outcomes}. We develop a two-stage adaptive algorithm that optimizes the expert annotation probabilities, estimating the ATE with optimal asymptotic variance. We demonstrate how expert labels and LLM annotations can be combined strategically, efficiently and responsibly in a causal estimator. We run experiments on simulated data and two real-world datasets, including one on street outreach, to show the versatility of our proposed method.

cross Universal Lesion Segmentation Challenge 2023: A Comparative Research of Different Algorithms

Authors: Kaiwen Shi, Yifei Li, Binh Ho, Jovian Wang, Kobe Guo

Abstract: In recent years, machine learning algorithms have achieved much success in segmenting lesions across various tissues. There is, however, not one satisfying model that works well on all tissue types universally. In response to this need, we attempt to train a model that 1) works well on all tissue types, and 2) is capable of still performing fast inferences. To this end, we design our architectures, test multiple existing architectures, compare their results, and settle upon SwinUnet. We document our rationales, successes, and failures. Finally, we propose some further directions that we think are worth exploring. codes: https://github.com/KWFredShi/ULS2023NGKD.git

URLs: https://github.com/KWFredShi/ULS2023NGKD.git

cross Optimizing CNN Architectures for Advanced Thoracic Disease Classification

Authors: Tejas Mirthipati

Abstract: Machine learning, particularly convolutional neural networks (CNNs), has shown promise in medical image analysis, especially for thoracic disease detection using chest X-ray images. In this study, we evaluate various CNN architectures, including binary classification, multi-label classification, and ResNet50 models, to address challenges like dataset imbalance, variations in image quality, and hidden biases. We introduce advanced preprocessing techniques such as principal component analysis (PCA) for image compression and propose a novel class-weighted loss function to mitigate imbalance issues. Our results highlight the potential of CNNs in medical imaging but emphasize that issues like unbalanced datasets and variations in image acquisition methods must be addressed for optimal model performance.

cross Dark Deceptions in DHCP: Dismantling Network Defenses

Authors: Robert Dilworth

Abstract: This paper explores vulnerabilities in the Dynamic Host Configuration Protocol (DHCP) and their implications on the Confidentiality, Integrity, and Availability (CIA) triad. Through an analysis of various attacks, including DHCP Starvation, Rogue DHCP Servers, Replay Attacks, and TunnelVision exploits, the paper provides a taxonomic classification of threats, assesses risks, and proposes appropriate controls. The discussion also highlights the dangers of VPN decloaking through DHCP exploits and underscores the importance of safeguarding network infrastructures. By bringing awareness to the TunnelVision exploit, this paper aims to mitigate risks associated with these prevalent vulnerabilities.

cross Generative Adversarial Networks for High-Dimensional Item Factor Analysis: A Deep Adversarial Learning Algorithm

Authors: Nanyu Luo, Feng Ji

Abstract: Advances in deep learning and representation learning have transformed item factor analysis (IFA) in the item response theory (IRT) literature by enabling more efficient and accurate parameter estimation. Variational Autoencoders (VAEs) have been one of the most impactful techniques in modeling high-dimensional latent variables in this context. However, the limited expressiveness of the inference model based on traditional VAEs can still hinder the estimation performance. This study introduces Adversarial Variational Bayes (AVB) algorithms as an improvement to VAEs for IFA with improved flexibility and accuracy. By bridging the strengths of VAEs and Generative Adversarial Networks (GANs), AVB incorporates an auxiliary discriminator network to reframe the estimation process as a two-player adversarial game and removes the restrictive assumption of standard normal distributions in the inference model. Theoretically, AVB can achieve similar or higher likelihood compared to VAEs. A further enhanced algorithm, Importance-weighted Adversarial Variational Bayes (IWAVB) is proposed and compared with Importance-weighted Autoencoders (IWAE). In an exploratory analysis of real empirical data, IWAVB demonstrated superior expressiveness by achieving a higher likelihood compared to IWAE. In confirmatory studies with simulated data, IWAVB achieved similar mean-square error results to IWAE while consistently achieving higher likelihoods. Moreover, in simulations where latent variables followed a multimodal distribution, IWAVB outperformed IWAE by providing more accurate parameter estimates. With its innovative use of GANs, IWAVB is shown to have the potential to extend IFA to handle large-scale data, facilitating the potential integration of psychometrics and multimodal data analysis.

cross Deep Learning for Wound Tissue Segmentation: A Comprehensive Evaluation using A Novel Dataset

Authors: Muhammad Ashad Kabir, Nidita Roy, Md. Ekramul Hossain, Jill Featherston, Sayed Ahmed

Abstract: Deep learning (DL) techniques have emerged as promising solutions for medical wound tissue segmentation. However, a notable limitation in this field is the lack of publicly available labelled datasets and a standardised performance evaluation of state-of-the-art DL models on such datasets. This study addresses this gap by comprehensively evaluating various DL models for wound tissue segmentation using a novel dataset. We have curated a dataset comprising 147 wound images exhibiting six tissue types: slough, granulation, maceration, necrosis, bone, and tendon. The dataset was meticulously labelled for semantic segmentation employing supervised machine learning techniques. Three distinct labelling formats were developed -- full image, patch, and superpixel. Our investigation encompassed a wide array of DL segmentation and classification methodologies, ranging from conventional approaches like UNet, to generative adversarial networks such as cGAN, and modified techniques like FPN+VGG16. Also, we explored DL-based classification methods (e.g., ResNet50) and machine learning-based classification leveraging DL features (e.g., AlexNet+RF). In total, 82 wound tissue segmentation models were derived across the three labelling formats. Our analysis yielded several notable findings, including identifying optimal DL models for each labelling format based on weighted average Dice or F1 scores. Notably, FPN+VGG16 emerged as the top-performing DL model for wound tissue segmentation, achieving a dice score of 82.25%. This study provides a valuable benchmark for evaluating wound image segmentation and classification models, offering insights to inform future research and clinical practice in wound care. The labelled dataset created in this study is available at https://github.com/akabircs/WoundTissue.

URLs: https://github.com/akabircs/WoundTissue.

cross Towards Zero-Shot Task-Generalizable Learning on fMRI

Authors: Jiyao Wang, Nicha C. Dvornek, Peiyu Duan, Lawrence H. Staib, James S. Duncan

Abstract: Functional MRI measuring BOLD signal is an increasingly important imaging modality in studying brain functions and neurological disorders. It can be acquired in either a resting-state or a task-based paradigm. Compared to resting-state fMRI, task-based fMRI is acquired while the subject is performing a specific task designed to enhance study-related brain activities. Consequently, it generally has more informative task-dependent signals. However, due to the variety of task designs, it is much more difficult than in resting state to aggregate task-based fMRI acquired in different tasks to train a generalizable model. To resolve this complication, we propose a supervised task-aware network TA-GAT that jointly learns a general-purpose encoder and task-specific contextual information. The encoder-generated embedding and the learned contextual information are then combined as input to multiple modules for performing downstream tasks. We believe that the proposed task-aware architecture can plug-and-play in any neural network architecture to incorporate the prior knowledge of fMRI tasks into capturing functional brain patterns.

cross Hybrid Deepfake Image Detection: A Comprehensive Dataset-Driven Approach Integrating Convolutional and Attention Mechanisms with Frequency Domain Features

Authors: Kafi Anan, Anindya Bhattacharjee, Ashir Intesher, Kaidul Islam, Abrar Assaeem Fuad, Utsab Saha, Hafiz Imtiaz

Abstract: Effective deepfake detection tools are becoming increasingly essential over the last few years due to the growing usage of deepfakes in unethical practices. There exists a diverse range of deepfake generation techniques, which makes it challenging to develop an accurate universal detection mechanism. The 2025 Signal Processing Cup (DFWild-Cup competition) provided a diverse dataset of deepfake images, which are generated from multiple deepfake image generators, for training machine learning model(s) to emphasize the generalization of deepfake detection. To this end, we proposed an ensemble-based approach that employs three different neural network architectures: a ResNet-34-based architecture, a data-efficient image transformer (DeiT), and an XceptionNet with Wavelet Transform to capture both local and global features of deepfakes. We visualize the specific regions that these models focus for classification using Grad-CAM, and empirically demonstrate the effectiveness of these models in grouping real and fake images into cohesive clusters using t-SNE plots. Individually, the ResNet-34 architecture has achieved 88.9% accuracy, whereas the Xception network and the DeiT architecture have achieved 87.76% and 89.32% accuracy, respectively. With these networks, our weighted ensemble model achieves an excellent accuracy of 93.23% on the validation dataset of the SP Cup 2025 competition. Finally, the confusion matrix and an Area Under the ROC curve of 97.44% further confirm the stability of our proposed method.

cross Why is prompting hard? Understanding prompts on binary sequence predictors

Authors: Li Kevin Wenliang, Anian Ruoss, Jordi Grau-Moya, Marcus Hutter, Tim Genewein

Abstract: Large language models (LLMs) can be prompted to do many tasks, but finding good prompts is not always easy, nor is understanding some performant prompts. We explore these issues by viewing prompting as conditioning a near-optimal sequence predictor (LLM) pretrained on diverse data sources. Through numerous prompt search experiments, we show that the unintuitive patterns in optimal prompts can be better understood given the pretraining distribution, which is often unavailable in practice. Moreover, even using exhaustive search, reliably identifying optimal prompts from practical neural predictors can be difficult. Further, we demonstrate that common prompting methods, such as using intuitive prompts or samples from the targeted task, are in fact suboptimal. Thus, this work takes an initial step towards understanding the difficulties in finding and understanding optimal prompts from a statistical and empirical perspective.

cross Dynamic Influence Tracker: Measuring Time-Varying Sample Influence During Training

Authors: Jie Xu, Zihan Wu

Abstract: Existing methods for measuring training sample influence on models only provide static, overall measurements, overlooking how sample influence changes during training. We propose Dynamic Influence Tracker (DIT), which captures the time-varying sample influence across arbitrary time windows during training. DIT offers three key insights: 1) Samples show different time-varying influence patterns, with some samples important in the early training stage while others become important later. 2) Sample influences show a weak correlation between early and late stages, demonstrating that the model undergoes distinct learning phases with shifting priorities. 3) Analyzing influence during the convergence period provides more efficient and accurate detection of corrupted samples than full-training analysis. Supported by theoretical guarantees without assuming loss convexity or model convergence, DIT significantly outperforms existing methods, achieving up to 0.99 correlation with ground truth and above 98\% accuracy in detecting corrupted samples in complex architectures.

cross Generalizable speech deepfake detection via meta-learned LoRA

Authors: Janne Laakkonen, Ivan Kukanov, Ville Hautam\"aki

Abstract: Generalizable deepfake detection can be formulated as a detection problem where labels (bonafide and fake) are fixed but distributional drift affects the deepfake set. We can always train our detector with one-selected attacks and bonafide data, but an attacker can generate new attacks by just retraining his generator with a different seed. One reasonable approach is to simply pool all different attack types available in training time. Our proposed approach is to utilize meta-learning in combination with LoRA adapters to learn the structure in the training data that is common to all attack types.

cross The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis

Authors: Ge Lei, Samuel J. Cooper

Abstract: This study investigates how large language models (LLMs) represent and recall multi-associated attributes across transformer layers. We show that intermediate layers encode factual knowledge by superimposing related attributes in overlapping spaces, along with effective recall even when attributes are not explicitly prompted. In contrast, later layers refine linguistic patterns and progressively separate attribute representations, optimizing task-specific outputs while appropriately narrowing attribute recall. We identify diverse encoding patterns including, for the first time, the observation of 3D spiral structures when exploring information related to the periodic table of elements. Our findings reveal a dynamic transition in attribute representations across layers, contributing to mechanistic interpretability and providing insights for understanding how LLMs handle complex, interrelated knowledge.

cross A Geometric Approach to Personalized Recommendation with Set-Theoretic Constraints Using Box Embeddings

Authors: Shib Dasgupta, Michael Boratko, Andrew McCallum

Abstract: Personalized item recommendation typically suffers from data sparsity, which is most often addressed by learning vector representations of users and items via low-rank matrix factorization. While this effectively densifies the matrix by assuming users and movies can be represented by linearly dependent latent features, it does not capture more complicated interactions. For example, vector representations struggle with set-theoretic relationships, such as negation and intersection, e.g. recommending a movie that is "comedy and action, but not romance". In this work, we formulate the problem of personalized item recommendation as matrix completion where rows are set-theoretically dependent. To capture this set-theoretic dependence we represent each user and attribute by a hyper-rectangle or box (i.e. a Cartesian product of intervals). Box embeddings can intuitively be understood as trainable Venn diagrams, and thus not only inherently represent similarity (via the Jaccard index), but also naturally and faithfully support arbitrary set-theoretic relationships. Queries involving set-theoretic constraints can be efficiently computed directly on the embedding space by performing geometric operations on the representations. We empirically demonstrate the superiority of box embeddings over vector-based neural methods on both simple and complex item recommendation queries by up to 30 \% overall.

cross Broadcast Channel Cooperative Gain: An Operational Interpretation of Partial Information Decomposition

Authors: Chao Tian (Shitz), Shlomo Shamai (Shitz)

Abstract: Partial information decomposition has recently found applications in biological signal processing and machine learning. Despite its impacts, the decomposition was introduced through an informal and heuristic route, and its exact operational meaning is unclear. In this work, we fill this gap by connecting partial information decomposition to the capacity of the broadcast channel, which has been well-studied in the information theory literature. We show that the synergistic information in the decomposition can be rigorously interpreted as the cooperative gain, or a lower bound of this gain, on the corresponding broadcast channel. This interpretation can help practitioners to better explain and expand the applications of the partial information decomposition technique.

cross Bridging the Sim-to-Real Gap for Athletic Loco-Manipulation

Authors: Nolan Fey, Gabriel B. Margolis, Martin Peticco, Pulkit Agrawal

Abstract: Achieving athletic loco-manipulation on robots requires moving beyond traditional tracking rewards - which simply guide the robot along a reference trajectory - to task rewards that drive truly dynamic, goal-oriented behaviors. Commands such as "throw the ball as far as you can" or "lift the weight as quickly as possible" compel the robot to exhibit the agility and power inherent in athletic performance. However, training solely with task rewards introduces two major challenges: these rewards are prone to exploitation (reward hacking), and the exploration process can lack sufficient direction. To address these issues, we propose a two-stage training pipeline. First, we introduce the Unsupervised Actuator Net (UAN), which leverages real-world data to bridge the sim-to-real gap for complex actuation mechanisms without requiring access to torque sensing. UAN mitigates reward hacking by ensuring that the learned behaviors remain robust and transferable. Second, we use a pre-training and fine-tuning strategy that leverages reference trajectories as initial hints to guide exploration. With these innovations, our robot athlete learns to lift, throw, and drag with remarkable fidelity from simulation to reality.

cross Breaking Down the Hierarchy: A New Approach to Leukemia Classification

Authors: Ibraheem Hamdi, Hosam El-Gendy, Ahmed Sharshar, Mohamed Saeed, Muhammad Ridzuan, Shahrukh K. Hashmi, Naveed Syed, Imran Mirza, Shakir Hussain, Amira Mahmoud Abdalla, Mohammad Yaqub

Abstract: The complexities inherent to leukemia, multifaceted cancer affecting white blood cells, pose considerable diagnostic and treatment challenges, primarily due to reliance on laborious morphological analyses and expert judgment that are susceptible to errors. Addressing these challenges, this study presents a refined, comprehensive strategy leveraging advanced deep-learning techniques for the classification of leukemia subtypes. We commence by developing a hierarchical label taxonomy, paving the way for differentiating between various subtypes of leukemia. The research further introduces a novel hierarchical approach inspired by clinical procedures capable of accurately classifying diverse types of leukemia alongside reactive and healthy cells. An integral part of this study involves a meticulous examination of the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) as classifiers. The proposed method exhibits an impressive success rate, achieving approximately 90\% accuracy across all leukemia subtypes, as substantiated by our experimental results. A visual representation of the experimental findings is provided to enhance the model's explainability and aid in understanding the classification process.

cross Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images

Authors: Sevim Cengiz, Ibraheem Hamdi, Mohammad Yaqub

Abstract: Fetal gestational age (GA) is vital clinical information that is estimated during pregnancy in order to assess fetal growth. This is usually performed by measuring the crown-rump-length (CRL) on an ultrasound image in the Dating scan which is then correlated with fetal age and growth trajectory. A major issue when performing the CRL measurement is ensuring that the image is acquired at the correct view, otherwise it could be misleading. Although clinical guidelines specify the criteria for the correct CRL view, sonographers may not regularly adhere to such rules. In this paper, we propose a new deep learning-based solution that is able to verify the adherence of a CRL image to clinical guidelines in order to assess image quality and facilitate accurate estimation of GA. We first segment out important fetal structures then use the localized structures to perform a clinically-guided mapping that verifies the adherence of criteria. The segmentation method combines the benefits of Convolutional Neural Network (CNN) and the Vision Transformer (ViT) to segment fetal structures in ultrasound images and localize important fetal landmarks. For segmentation purposes, we compare our proposed work with UNet and show that our CNN/ViT-based method outperforms an optimized version of UNet. Furthermore, we compare the output of the mapping with classification CNNs when assessing the clinical criteria and the overall acceptability of CRL images. We show that the proposed mapping is not only explainable but also more accurate than the best performing classification CNNs.

cross Learning the Exact Time Integration Algorithm for Initial Value Problems by Randomized Neural Networks

Authors: Suchuan Dong, Naxian Ni

Abstract: We present a method leveraging extreme learning machine (ELM) type randomized neural networks (NNs) for learning the exact time integration algorithm for initial value problems (IVPs). The exact time integration algorithm for non-autonomous systems can be represented by an algorithmic function in higher dimensions, which satisfies an associated system of partial differential equations with corresponding boundary conditions. Our method learns the algorithmic function by solving this associated system using ELM with a physics informed approach. The trained ELM network serves as the learned algorithm and can be used to solve the IVP with arbitrary initial data or step sizes from some domain. When the right hand side of the non-autonomous system exhibits a periodicity with respect to any of its arguments, while the solution itself to the problem is not periodic, we show that the algorithmic function is either periodic, or when it is not, satisfies a well-defined relation for different periods. This property can greatly simplify the algorithm learning in many problems. We consider explicit and implicit NN formulations, leading to explicit or implicit time integration algorithms, and discuss how to train the ELM network by the nonlinear least squares method. Extensive numerical experiments with benchmark problems, including non-stiff, stiff and chaotic systems, show that the learned NN algorithm produces highly accurate solutions in long-time simulations, with its time-marching errors decreasing nearly exponentially with increasing degrees of freedom in the neural network. We compare extensively the computational performance (accuracy vs.~cost) between the current NN algorithm and the leading traditional time integration algorithms. The learned NN algorithm is computationally competitive, markedly outperforming the traditional algorithms in many problems.

cross Learning to Stop Overthinking at Test Time

Authors: Hieu Tran Bao, Nguyen Cong Dat, Nguyen Duc Anh, Hoang Thanh Tung

Abstract: Test time scaling is currently one of the most active research areas that shows promise after training time scaling has reached its limits. Deep-thinking (DT) models are a class of recurrent models that can perform easy-to-hard generalization by assigning more compute to harder test samples. However, due to their inability to determine the complexity of a test sample, DT models have to use a large amount of computation for both easy and hard test samples. Excessive test time computation is wasteful and can cause the ``overthinking'' problem where more test time computation leads to worse results. In this paper, we introduce a test time training method for determining the optimal amount of computation needed for each sample during test time. We also propose Conv-LiGRU, a novel recurrent architecture for efficient and robust visual reasoning. Extensive experiments demonstrate that Conv-LiGRU is more stable than DT, effectively mitigates the ``overthinking'' phenomenon, and achieves superior accuracy.

cross QuOTE: Question-Oriented Text Embeddings

Authors: Andrew Neeser, Kaylen Latimer, Aadyant Khatri, Chris Latimer, Naren Ramakrishnan

Abstract: We present QuOTE (Question-Oriented Text Embeddings), a novel enhancement to retrieval-augmented generation (RAG) systems, aimed at improving document representation for accurate and nuanced retrieval. Unlike traditional RAG pipelines, which rely on embedding raw text chunks, QuOTE augments chunks with hypothetical questions that the chunk can potentially answer, enriching the representation space. This better aligns document embeddings with user query semantics, and helps address issues such as ambiguity and context-dependent relevance. Through extensive experiments across diverse benchmarks, we demonstrate that QuOTE significantly enhances retrieval accuracy, including in multi-hop question-answering tasks. Our findings highlight the versatility of question generation as a fundamental indexing strategy, opening new avenues for integrating question generation into retrieval-based AI pipelines.

cross RoseRAG: Robust Retrieval-augmented Generation with Small-scale LLMs via Margin-aware Preference Optimization

Authors: Tianci Liu, Haoxiang Jiang, Tianze Wang, Ran Xu, Yue Yu, Linjun Zhang, Tuo Zhao, Haoyu Wang

Abstract: Large language models (LLMs) have achieved impressive performance but face high computational costs and latency, limiting their deployment in resource-constrained settings. In contrast, small-scale LLMs (SLMs) are more efficient yet struggle to capture evolving real-world knowledge. Retrieval-augmented generation (RAG) helps by integrating external knowledge, but imperfect retrieval can introduce distracting noise that misleads SLMs. We propose RoseRAG, a robust RAG framework for SLMs via Margin-aware Preference Optimization. RoseRAG employs multi-turn prompting for detailed reasoning, rejection sampling for high-quality explanations, and contrastive preference selection to refine responses by maximizing the likelihood gap between preferred and non-preferred outputs. By integrating these components into a margin-aware optimization process, RoseRAG robustly enhances the accuracy and reliability of SLMs for RAG applications. Extensive experiments on three open-domain question answering benchmarks indicate that our innovative RoseRAG surpasses state-of-the-art baselines significantly.

cross CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening

Authors: Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu

Abstract: Due to the rise in antimicrobial resistance, identifying novel compounds with antibiotic potential is crucial for combatting this global health issue. However, traditional drug development methods are costly and inefficient. Recognizing the pressing need for more effective solutions, researchers have turned to machine learning techniques to streamline the prediction and development of novel antibiotic compounds. While foundation models have shown promise in antibiotic discovery, current mainstream efforts still fall short of fully leveraging the potential of multimodal molecular data. Recent studies suggest that contrastive learning frameworks utilizing multimodal data exhibit excellent performance in representation learning across various domains. Building upon this, we introduce CL-MFAP, an unsupervised contrastive learning (CL)-based multimodal foundation (MF) model specifically tailored for discovering small molecules with potential antibiotic properties (AP) using three types of molecular data. This model employs 1.6 million bioactive molecules with drug-like properties from the ChEMBL dataset to jointly pretrain three encoders: (1) a transformer-based encoder with rotary position embedding for processing SMILES strings; (2) another transformer-based encoder, incorporating a novel bi-level routing attention mechanism to handle molecular graph representations; and (3) a Morgan fingerprint encoder using a multilayer perceptron, to achieve the contrastive learning purpose. The CL-MFAP outperforms baseline models in antibiotic property prediction by effectively utilizing different molecular modalities and demonstrates superior domain-specific performance when fine-tuned for antibiotic-related property prediction tasks.

cross DT4ECG: A Dual-Task Learning Framework for ECG-Based Human Identity Recognition and Human Activity Detection

Authors: Siyu You, Boyuan Gu, Yanhui Yang, Shiyu Yu, Shisheng Guo

Abstract: This article introduces DT4ECG, an innovative dual-task learning framework for Electrocardiogram (ECG)-based human identity recognition and activity detection. The framework employs a robust one-dimensional convolutional neural network (1D-CNN) backbone integrated with residual blocks to extract discriminative ECG features. To enhance feature representation, we propose a novel Sequence Channel Attention (SCA) mechanism, which combines channel-wise and sequential context attention to prioritize informative features across both temporal and channel dimensions. Furthermore, to address gradient imbalance in multi-task learning, we integrate GradNorm, a technique that dynamically adjusts loss weights based on gradient magnitudes, ensuring balanced training across tasks. Experimental results demonstrate the superior performance of our model, achieving accuracy rates of 99.12% in ID classification and 90.11% in activity classification. These findings underscore the potential of the DT4ECG framework in enhancing security and user experience across various applications such as fitness monitoring and personalized healthcare, thereby presenting a transformative approach to integrating ECG-based biometrics in everyday technologies.

cross Detecting Cadastral Boundary from Satellite Images Using U-Net model

Authors: Neda Rahimpour Anaraki, Maryam Tahmasbi, Saeed Reza Kheradpisheh

Abstract: Finding the cadastral boundaries of farmlands is a crucial concern for land administration. Therefore, using deep learning methods to expedite and simplify the extraction of cadastral boundaries from satellite and unmanned aerial vehicle (UAV) images is critical. In this paper, we employ transfer learning to train a U-Net model with a ResNet34 backbone to detect cadastral boundaries through three-class semantic segmentation: "boundary", "field", and "background". We evaluate the performance on two satellite images from farmlands in Iran using "precision", "recall", and "F-score", achieving high values of 88%, 75%, and 81%, respectively, which indicate promising results.

cross Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng

Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

cross Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems

Authors: Zhao Wang, Sota Moriyama, Wei-Yao Wang, Briti Gangopadhyay, Shingo Takamatsu

Abstract: Recent advancements in LLM-based multi-agent (LLM-MA) systems have shown promise, yet significant challenges remain in managing communication and refinement when agents collaborate on complex tasks. In this paper, we propose \textit{Talk Structurally, Act Hierarchically (TalkHier)}, a novel framework that introduces a structured communication protocol for context-rich exchanges and a hierarchical refinement system to address issues such as incorrect outputs, falsehoods, and biases. \textit{TalkHier} surpasses various types of SoTA, including inference scaling model (OpenAI-o1), open-source multi-agent models (e.g., AgentVerse), and majority voting strategies on current LLM and single-agent baselines (e.g., ReAct, GPT4o), across diverse tasks, including open-domain question answering, domain-specific selective questioning, and practical advertisement text generation. These results highlight its potential to set a new standard for LLM-MA systems, paving the way for more effective, adaptable, and collaborative multi-agent frameworks. The code is available https://github.com/sony/talkhier.

URLs: https://github.com/sony/talkhier.

cross OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling

Authors: Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, Zaiwen Wen

Abstract: Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs' robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods. To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), this framework automatically generates problem data (PD) with controllable complexity. Then, a back-translation step is employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. Then a collection of rejected pairs is identified and further filtered. This collection serves as a new benchmark for optimization modeling, containing difficult instances whose lengths are much longer than these of NL4OPT and MAMO. Through extensive experiments, we demonstrate that models of various sizes (0.5B-32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach.

cross Graceful forgetting: Memory as a process

Authors: Alain de Cheveign\'e

Abstract: A rational theory of memory is proposed to explain how we can accommodate unbounded sensory input within bounded storage space. Memory is stored as statistics, organized into complex structures that are constantly summarized and compressed to make room for new input. This process, driven by space constraints, is guided by heuristics that optimize the memory for future needs. Sensory input is rapidly encoded as simple statistics that are more slowly elaborated into more abstract constructs. This theory differs from previous accounts of memory by (a) its reliance on statistics, (b) its use of heuristics to guide the choice of statistics, and (c) the emphasis on memory as a process that is intensive, complex, and expensive. The theory is intended as an aid to make sense of our extensive knowledge of memory, and bring us closer to an understanding of memory in functional and mechanistic terms.

cross G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems

Authors: Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, Yang Wang

Abstract: Large Language Model (LLM)-based Multi-agent Systems (MAS) have demonstrated remarkable capabilities in various complex tasks, ranging from collaborative problem-solving to autonomous decision-making. However, as these systems become increasingly integrated into critical applications, their vulnerability to adversarial attacks, misinformation propagation, and unintended behaviors have raised significant concerns. To address this challenge, we introduce G-Safeguard, a topology-guided security lens and treatment for robust LLM-MAS, which leverages graph neural networks to detect anomalies on the multi-agent utterance graph and employ topological intervention for attack remediation. Extensive experiments demonstrate that G-Safeguard: (I) exhibits significant effectiveness under various attack strategies, recovering over 40% of the performance for prompt injection; (II) is highly adaptable to diverse LLM backbones and large-scale MAS; (III) can seamlessly combine with mainstream MAS with security guarantees. The code is available at https://github.com/wslong20/G-safeguard.

URLs: https://github.com/wslong20/G-safeguard.

cross Error Bound Analysis for the Regularized Loss of Deep Linear Neural Networks

Authors: Po Chen, Rujun Jiang, Peng Wang

Abstract: The optimization foundations of deep linear networks have received significant attention lately. However, due to the non-convexity and hierarchical structure, analyzing the regularized loss of deep linear networks remains a challenging task. In this work, we study the local geometric landscape of the regularized squared loss of deep linear networks, providing a deeper understanding of its optimization properties. Specifically, we characterize the critical point set and establish an error-bound property for all critical points under mild conditions. Notably, we identify the sufficient and necessary conditions under which the error bound holds. To support our theoretical findings, we conduct numerical experiments demonstrating that gradient descent exhibits linear convergence when optimizing the regularized loss of deep linear networks.

cross Quantifying the Capability Boundary of DeepSeek Models: An Application-Driven Performance Analysis

Authors: Shiguo Lian, Kaikai Zhao, Xuejiao Lei, Ning Wang, Zhenhong Long, Peijun Yang, Minjie Hua, Chaoyang Ma, Wen Liu, Kai Wang, Zhaoxiang Liu

Abstract: DeepSeek-R1, known for its low training cost and exceptional reasoning capabilities, has achieved state-of-the-art performance on various benchmarks. However, detailed evaluations from the perspective of real-world applications are lacking, making it challenging for users to select the most suitable DeepSeek models for their specific needs. To address this gap, we evaluate the DeepSeek-V3, DeepSeek-R1, DeepSeek-R1-Distill-Qwen series, and DeepSeek-R1-Distill-Llama series on A-Eval, an application-driven benchmark. By comparing original instruction-tuned models with their distilled counterparts, we analyze how reasoning enhancements impact performance across diverse practical tasks. Our results show that reasoning-enhanced models, while generally powerful, do not universally outperform across all tasks, with performance gains varying significantly across tasks and models. To further assist users in model selection, we quantify the capability boundary of DeepSeek models through performance tier classifications and intuitive line charts. Specific examples provide actionable insights to help users select and deploy the most cost-effective DeepSeek models, ensuring optimal performance and resource efficiency in real-world applications.

cross Evaluating the Potential of Quantum Machine Learning in Cybersecurity: A Case-Study on PCA-based Intrusion Detection Systems

Authors: Armando Bellante, Tommaso Fioravanti, Michele Carminati, Stefano Zanero, Alessandro Luongo

Abstract: Quantum computing promises to revolutionize our understanding of the limits of computation, and its implications in cryptography have long been evident. Today, cryptographers are actively devising post-quantum solutions to counter the threats posed by quantum-enabled adversaries. Meanwhile, quantum scientists are innovating quantum protocols to empower defenders. However, the broader impact of quantum computing and quantum machine learning (QML) on other cybersecurity domains still needs to be explored. In this work, we investigate the potential impact of QML on cybersecurity applications of traditional ML. First, we explore the potential advantages of quantum computing in machine learning problems specifically related to cybersecurity. Then, we describe a methodology to quantify the future impact of fault-tolerant QML algorithms on real-world problems. As a case study, we apply our approach to standard methods and datasets in network intrusion detection, one of the most studied applications of machine learning in cybersecurity. Our results provide insight into the conditions for obtaining a quantum advantage and the need for future quantum hardware and software advancements.

cross ReLearn: Unlearning via Learning for Large Language Models

Authors: Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang

Abstract: Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts the subsequent tokens prediction, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at https://github.com/zjunlp/unlearn.

URLs: https://github.com/zjunlp/unlearn.

cross ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition

Authors: Bidyarthi Paul, Faika Fairuj Preotee, Shuvashis Sarker, Shamim Rahim Refat, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque, Shahriar Manzoor

Abstract: ANCHOLIK-NER is a linguistically diverse dataset for Named Entity Recognition (NER) in Bangla regional dialects, capturing variations across Sylhet, Chittagong, and Barishal. The dataset has around 10,443 sentences, 3,481 sentences per region. The data was collected from two publicly available datasets and through web scraping from various online newspapers, articles. To ensure high-quality annotations, the BIO tagging scheme was employed, and professional annotators with expertise in regional dialects carried out the labeling process. The dataset is structured into separate subsets for each region and is available both in CSV format. Each entry contains textual data along with identified named entities and their corresponding annotations. Named entities are categorized into ten distinct classes: Person, Location, Organization, Food, Animal, Colour, Role, Relation, Object, and Miscellaneous. This dataset serves as a valuable resource for developing and evaluating NER models for Bangla dialectal variations, contributing to regional language processing and low-resource NLP applications. It can be utilized to enhance NER systems in Bangla dialects, improve regional language understanding, and support applications in machine translation, information retrieval, and conversational AI.

cross Multiscale autonomous forecasting of plasma systems' dynamics using neural networks

Authors: Farbod Faraji, Maryam Reza

Abstract: Plasma systems exhibit complex multiscale dynamics, resolving which poses significant challenges for conventional numerical simulations. Machine learning (ML) offers an alternative by learning data-driven representations of these dynamics. Yet existing ML time-stepping models suffer from error accumulation, instability, and limited long-term forecasting horizons. This paper demonstrates the application of a hierarchical multiscale neural network architecture for autonomous plasma forecasting. The framework integrates multiple neural networks trained across different temporal scales to capture both fine-scale and large-scale behaviors while mitigating compounding error in recursive evaluation. Fine-scale networks accurately resolve fast-evolving features, while coarse-scale networks provide broader temporal context, reducing the frequency of recursive updates and limiting the accumulation of small prediction errors over time. We first evaluate the method using canonical nonlinear dynamical systems and compare its performance against classical single-scale neural networks. The results demonstrate that single-scale neural networks experience rapid divergence due to recursive error accumulation, whereas the multiscale approach improves stability and extends prediction horizons. Next, our ML model is applied to two plasma configurations of high scientific and applied significance, demonstrating its ability to preserve spatial structures and capture multiscale plasma dynamics. By leveraging multiple time-stepping resolutions, the applied framework is shown to outperform conventional single-scale networks for the studied plasma test cases. The results of this work position the hierarchical multiscale neural network as a promising tool for efficient plasma forecasting and digital twin applications.

cross Stochastic Optimization of Inventory at Large-scale Supply Chains

Authors: Zhaoyang Larry Jin, Mehdi Maasoumy, Yimin Liu, Zeshi Zheng, Zizhuo Ren

Abstract: Today's global supply chains face growing challenges due to rapidly changing market conditions, increased network complexity and inter-dependency, and dynamic uncertainties in supply, demand, and other factors. To combat these challenges, organizations employ Material Requirements Planning (MRP) software solutions to set inventory stock buffers - for raw materials, work-in-process goods, and finished products - to help them meet customer service levels. However, holding excess inventory further complicates operations and can lock up millions of dollars of capital that could be otherwise deployed. Furthermore, most commercially available MRP solutions fall short in considering uncertainties and do not result in optimal solutions for modern enterprises. At C3 AI, we fundamentally reformulate the inventory management problem as a constrained stochastic optimization. We then propose a simulation-optimization framework that minimizes inventory and related costs while maintaining desired service levels. The framework's goal is to find the optimal reorder parameters that minimize costs subject to a pre-defined service-level constraint and all other real-world operational constraints. These optimal reorder parameters can be fed back into an MRP system to drive optimal order placement, or used to place optimal orders directly. This approach has proven successful in reducing inventory levels by 10-35 percent, resulting in hundreds of millions of dollars of economic benefit for major enterprises at a global scale.

cross Provable and Practical Online Learning Rate Adaptation with Hypergradient Descent

Authors: Ya-Chi Chu, Wenzhi Gao, Yinyu Ye, Madeleine Udell

Abstract: This paper investigates the convergence properties of the hypergradient descent method (HDM), a 25-year-old heuristic originally proposed for adaptive stepsize selection in stochastic first-order methods. We provide the first rigorous convergence analysis of HDM using the online learning framework of [Gao24] and apply this analysis to develop new state-of-the-art adaptive gradient methods with empirical and theoretical support. Notably, HDM automatically identifies the optimal stepsize for the local optimization landscape and achieves local superlinear convergence. Our analysis explains the instability of HDM reported in the literature and proposes efficient strategies to address it. We also develop two HDM variants with heavy-ball and Nesterov momentum. Experiments on deterministic convex problems show HDM with heavy-ball momentum (HDM-HB) exhibits robust performance and significantly outperforms other adaptive first-order methods. Moreover, HDM-HB often matches the performance of L-BFGS, an efficient and practical quasi-Newton method, using less memory and cheaper iterations.

cross Towards identifying possible fault-tolerant advantage of quantum linear system algorithms in terms of space, time and energy

Authors: Yue Tu, Mark Dubynskyi, Mohammad Mohammadisiahroudi, Ekaterina Riashchentceva, Jinglei Cheng, Dmitry Ryashchentsev, Tam\'as Terlaky, Junyu Liu

Abstract: Quantum computing, a prominent non-Von Neumann paradigm beyond Moore's law, can offer superpolynomial speedups for certain problems. Yet its advantages in efficiency for tasks like machine learning remain under investigation, and quantum noise complicates resource estimations and classical comparisons. We provide a detailed estimation of space, time, and energy resources for fault-tolerant superconducting devices running the Harrow-Hassidim-Lloyd (HHL) algorithm, a quantum linear system solver relevant to linear algebra and machine learning. Excluding memory and data transfer, possible quantum advantages over the classical conjugate gradient method could emerge at $N \approx 2^{33} \sim 2^{48}$ or even lower, requiring ${O}(10^5)$ physical qubits, ${O}(10^{12}\sim10^{13})$ Joules, and ${O}(10^6)$ seconds under surface code fault-tolerance with three types of magic state distillation (15-1, 116-12, 225-1). Key parameters include condition number, sparsity, and precision $\kappa, s\approx{O}(10\sim100)$, $\epsilon\sim0.01$, and physical error $10^{-5}$. Our resource estimator adjusts $N, \kappa, s, \epsilon$, providing a map of quantum-classical boundaries and revealing where a practical quantum advantage may arise. Our work quantitatively determine how advanced a fault-tolerant quantum computer should be to achieve possible, significant benefits on problems related to real-world.

cross Prompting in the Dark: Assessing Human Performance in Prompt Engineering for Data Labeling When Gold Labels Are Absent

Authors: Zeyu He, Saniya Naphade, Ting-Hao 'Kenneth' Huang

Abstract: Millions of users prompt large language models (LLMs) for various tasks, but how good are people at prompt engineering? Do users actually get closer to their desired outcome over multiple iterations of their prompts? These questions are crucial when no gold-standard labels are available to measure progress. This paper investigates a scenario in LLM-powered data labeling, "prompting in the dark," where users iteratively prompt LLMs to label data without using manually-labeled benchmarks. We developed PromptingSheet, a Google Sheets add-on that enables users to compose, revise, and iteratively label data through spreadsheets. Through a study with 20 participants, we found that prompting in the dark was highly unreliable-only 9 participants improved labeling accuracy after four or more iterations. Automated prompt optimization tools like DSPy also struggled when few gold labels were available. Our findings highlight the importance of gold labels and the needs, as well as the risks, of automated support in human prompt engineering, providing insights for future tool design.

cross Unlocking the Potential of Generative AI through Neuro-Symbolic Architectures: Benefits and Limitations

Authors: Oualid Bougzime, Samir Jabbar, Christophe Cruz, Fr\'ed\'eric Demoly

Abstract: Neuro-symbolic artificial intelligence (NSAI) represents a transformative approach in artificial intelligence (AI) by combining deep learning's ability to handle large-scale and unstructured data with the structured reasoning of symbolic methods. By leveraging their complementary strengths, NSAI enhances generalization, reasoning, and scalability while addressing key challenges such as transparency and data efficiency. This paper systematically studies diverse NSAI architectures, highlighting their unique approaches to integrating neural and symbolic components. It examines the alignment of contemporary AI techniques such as retrieval-augmented generation, graph neural networks, reinforcement learning, and multi-agent systems with NSAI paradigms. This study then evaluates these architectures against comprehensive set of criteria, including generalization, reasoning capabilities, transferability, and interpretability, therefore providing a comparative analysis of their respective strengths and limitations. Notably, the Neuro > Symbolic < Neuro model consistently outperforms its counterparts across all evaluation metrics. This result aligns with state-of-the-art research that highlight the efficacy of such architectures in harnessing advanced technologies like multi-agent systems.

cross The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval

Authors: Ting-Rui Chiang, Dani Yogatama

Abstract: The Rotary Position Embedding (RoPE) is widely used in the attention heads of many large language models (LLM). It rotates dimensions in the query and the key vectors by different angles according to their positions in the input sequence. For long context modeling, the range of positions may vary a lot, and thus RoPE rotates some dimensions by a great range of angles. We hypothesize that the wide range of rotation angles may prevent LLMs from utilizing those dimensions. To validate this hypothesis, we present a controlled experiment showing that applying RoPE causes low utility of certain dimensions. Our analyses on three LLMs also indicate that these dimensions do not help LLMs do long-context question answering.

cross Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation

Authors: Hieu Nguyen, Zihao He, Shoumik Atul Gandre, Ujjwal Pasupulety, Sharanya Kumari Shivakumar, Kristina Lerman

Abstract: Large language models (LLMs) often suffer from hallucination, generating factually incorrect or ungrounded content, which limits their reliability in high-stakes applications. A key factor contributing to hallucination is the use of hard labels during training, which enforce deterministic supervision, encourage overconfidence, and disregard the uncertainty inherent in natural language. To address this, we propose mitigating hallucination through knowledge distillation (KD), where a teacher model provides smoothed soft labels to a student model, reducing overconfidence and improving factual grounding. We apply KD during supervised finetuning on instructional data, evaluating its effectiveness across LLMs from different families. Experimental results on summarization benchmarks demonstrate that KD reduces hallucination compared to standard finetuning while preserving performance on general NLP tasks. These findings highlight KD as a promising approach for mitigating hallucination in LLMs and improving model reliability.

cross Generalized Factor Neural Network Model for High-dimensional Regression

Authors: Zichuan Guo, Mihai Cucuringu, Alexander Y. Shestopaloff

Abstract: We tackle the challenges of modeling high-dimensional data sets, particularly those with latent low-dimensional structures hidden within complex, non-linear, and noisy relationships. Our approach enables a seamless integration of concepts from non-parametric regression, factor models, and neural networks for high-dimensional regression. Our approach introduces PCA and Soft PCA layers, which can be embedded at any stage of a neural network architecture, allowing the model to alternate between factor modeling and non-linear transformations. This flexibility makes our method especially effective for processing hierarchical compositional data. We explore ours and other techniques for imposing low-rank structures on neural networks and examine how architectural design impacts model performance. The effectiveness of our method is demonstrated through simulation studies, as well as applications to forecasting future price movements of equity ETF indices and nowcasting with macroeconomic data.

cross A statistical theory of overfitting for imbalanced classification

Authors: Jingyang Lyu, Kangjie Zhou, Yiqiao Zhong

Abstract: Classification with imbalanced data is a common challenge in data analysis, where certain classes (minority classes) account for a small fraction of the training data compared with other classes (majority classes). Classical statistical theory based on large-sample asymptotics and finite-sample corrections is often ineffective for high-dimensional data, leaving many overfitting phenomena in empirical machine learning unexplained. In this paper, we develop a statistical theory for high-dimensional imbalanced classification by investigating support vector machines and logistic regression. We find that dimensionality induces truncation or skewing effects on the logit distribution, which we characterize via a variational problem under high-dimensional asymptotics. In particular, for linearly separable data generated from a two-component Gaussian mixture model, the logits from each class follow a normal distribution $\mathsf{N}(0,1)$ on the testing set, but asymptotically follow a rectified normal distribution $\max\{\kappa, \mathsf{N}(0,1)\}$ on the training set -- which is a pervasive phenomenon we verified on tabular data, image data, and text data. This phenomenon explains why the minority class is more severely affected by overfitting. Further, we show that margin rebalancing, which incorporates class sizes into the loss function, is crucial for mitigating the accuracy drop for the minority class. Our theory also provides insights into the effects of overfitting on calibration and other uncertain quantification measures.

cross Robust High-Dimensional Mean Estimation With Low Data Size, an Empirical Study

Authors: Cullen Anderson, Jeff M. Phillips

Abstract: Robust statistics aims to compute quantities to represent data where a fraction of it may be arbitrarily corrupted. The most essential statistic is the mean, and in recent years, there has been a flurry of theoretical advancement for efficiently estimating the mean in high dimensions on corrupted data. While several algorithms have been proposed that achieve near-optimal error, they all rely on large data size requirements as a function of dimension. In this paper, we perform an extensive experimentation over various mean estimation techniques where data size might not meet this requirement due to the high-dimensional setting.

cross Transfer Learning of CATE with Kernel Ridge Regression

Authors: Seok-Jin Kim, Hongjie Liu, Molei Liu, Kaizheng Wang

Abstract: The proliferation of data has sparked significant interest in leveraging findings from one study to estimate treatment effects in a different target population without direct outcome observations. However, the transfer learning process is frequently hindered by substantial covariate shift and limited overlap between (i) the source and target populations, as well as (ii) the treatment and control groups within the source. We propose a novel method for overlap-adaptive transfer learning of conditional average treatment effect (CATE) using kernel ridge regression (KRR). Our approach involves partitioning the labeled source data into two subsets. The first one is used to train candidate CATE models based on regression adjustment and pseudo-outcomes. An optimal model is then selected using the second subset and unlabeled target data, employing another pseudo-outcome-based strategy. We provide a theoretical justification for our method through sharp non-asymptotic MSE bounds, highlighting its adaptivity to both weak overlaps and the complexity of CATE function. Extensive numerical studies confirm that our method achieves superior finite-sample efficiency and adaptability. We conclude by demonstrating the effectiveness of our approach using a 401(k) eligibility dataset.

cross A Framework for Learning Scoring Rules in Autonomous Driving Planning Systems

Authors: Zikang Xiong, Joe Kurian Eappen, Suresh Jagannathan

Abstract: In autonomous driving systems, motion planning is commonly implemented as a two-stage process: first, a trajectory proposer generates multiple candidate trajectories, then a scoring mechanism selects the most suitable trajectory for execution. For this critical selection stage, rule-based scoring mechanisms are particularly appealing as they can explicitly encode driving preferences, safety constraints, and traffic regulations in a formalized, human-understandable format. However, manually crafting these scoring rules presents significant challenges: the rules often contain complex interdependencies, require careful parameter tuning, and may not fully capture the nuances present in real-world driving data. This work introduces FLoRA, a novel framework that bridges this gap by learning interpretable scoring rules represented in temporal logic. Our method features a learnable logic structure that captures nuanced relationships across diverse driving scenarios, optimizing both rules and parameters directly from real-world driving demonstrations collected in NuPlan. Our approach effectively learns to evaluate driving behavior even though the training data only contains positive examples (successful driving demonstrations). Evaluations in closed-loop planning simulations demonstrate that our learned scoring rules outperform existing techniques, including expert-designed rules and neural network scoring models, while maintaining interpretability. This work introduces a data-driven approach to enhance the scoring mechanism in autonomous driving systems, designed as a plug-in module to seamlessly integrate with various trajectory proposers. Our video and code are available on xiong.zikang.me/FLoRA.

cross Physics-Informed Gaussian Process Classification for Constraint-Aware Alloy Design

Authors: Christofer Hardcastle, Ryan O Mullan, Raymundo Arroyave, Brent Vela

Abstract: Alloy design can be framed as a constraint-satisfaction problem. Building on previous methodologies, we propose equipping Gaussian Process Classifiers (GPCs) with physics-informed prior mean functions to model the boundaries of feasible design spaces. Through three case studies, we highlight the utility of informative priors for handling constraints on continuous and categorical properties. (1) Phase Stability: By incorporating CALPHAD predictions as priors for solid-solution phase stability, we enhance model validation using a publicly available XRD dataset. (2) Phase Stability Prediction Refinement: We demonstrate an in silico active learning approach to efficiently correct phase diagrams. (3) Continuous Property Thresholds: By embedding priors into continuous property models, we accelerate the discovery of alloys meeting specific property thresholds via active learning. In each case, integrating physics-based insights into the classification framework substantially improved model performance, demonstrating an efficient strategy for constraint-aware alloy design.

cross Robot Deformable Object Manipulation via NMPC-generated Demonstrations in Deep Reinforcement Learning

Authors: Haoyuan Wang, Zihao Dong, Hongliang Lei, Zejia Zhang, Weizhuang Shi, Wei Luo, Weiwei Wan, Jian Huang

Abstract: In this work, we conducted research on deformable object manipulation by robots based on demonstration-enhanced reinforcement learning (RL). To improve the learning efficiency of RL, we enhanced the utilization of demonstration data from multiple aspects and proposed the HGCR-DDPG algorithm. It uses a novel high-dimensional fuzzy approach for grasping-point selection, a refined behavior-cloning method to enhance data-driven learning in Rainbow-DDPG, and a sequential policy-learning strategy. Compared to the baseline algorithm (Rainbow-DDPG), our proposed HGCR-DDPG achieved 2.01 times the global average reward and reduced the global average standard deviation to 45% of that of the baseline algorithm. To reduce the human labor cost of demonstration collection, we proposed a low-cost demonstration collection method based on Nonlinear Model Predictive Control (NMPC). Simulation experiment results show that demonstrations collected through NMPC can be used to train HGCR-DDPG, achieving comparable results to those obtained with human demonstrations. To validate the feasibility of our proposed methods in real-world environments, we conducted physical experiments involving deformable object manipulation. We manipulated fabric to perform three tasks: diagonal folding, central axis folding, and flattening. The experimental results demonstrate that our proposed method achieved success rates of 83.3%, 80%, and 100% for these three tasks, respectively, validating the effectiveness of our approach. Compared to current large-model approaches for robot manipulation, the proposed algorithm is lightweight, requires fewer computational resources, and offers task-specific customization and efficient adaptability for specific tasks.

cross PrivilegedDreamer: Explicit Imagination of Privileged Information for Rapid Adaptation of Learned Policies

Authors: Morgan Byrd, Jackson Crandell, Mili Das, Jessica Inman, Robert Wright, Sehoon Ha

Abstract: Numerous real-world control problems involve dynamics and objectives affected by unobservable hidden pa- rameters, ranging from autonomous driving to robotic manipu- lation, which cause performance degradation during sim-to-real transfer. To represent these kinds of domains, we adopt hidden- parameter Markov decision processes (HIP-MDPs), which model sequential decision problems where hidden variables parameterize transition and reward functions. Existing ap- proaches, such as domain randomization, domain adaptation, and meta-learning, simply treat the effect of hidden param- eters as additional variance and often struggle to effectively handle HIP-MDP problems, especially when the rewards are parameterized by hidden variables. We introduce Privileged- Dreamer, a model-based reinforcement learning framework that extends the existing model-based approach by incorporating an explicit parameter estimation module. PrivilegedDreamer features its novel dual recurrent architecture that explicitly estimates hidden parameters from limited historical data and enables us to condition the model, actor, and critic networks on these estimated parameters. Our empirical analysis on five diverse HIP-MDP tasks demonstrates that PrivilegedDreamer outperforms state-of-the-art model-based, model-free, and do- main adaptation learning algorithms. Additionally, we conduct ablation studies to justify the inclusion of each component in the proposed architecture.

cross Intelligent Mobile AI-Generated Content Services via Interactive Prompt Engineering and Dynamic Service Provisioning

Authors: Yinqiu Liu, Ruichen Zhang, Jiacheng Wang, Dusit Niyato, Xianbin Wang, Dong In Kim, Hongyang Du

Abstract: Due to massive computational demands of large generative models, AI-Generated Content (AIGC) can organize collaborative Mobile AIGC Service Providers (MASPs) at network edges to provide ubiquitous and customized content generation for resource-constrained users. However, such a paradigm faces two significant challenges: 1) raw prompts (i.e., the task description from users) often lead to poor generation quality due to users' lack of experience with specific AIGC models, and 2) static service provisioning fails to efficiently utilize computational and communication resources given the heterogeneity of AIGC tasks. To address these challenges, we propose an intelligent mobile AIGC service scheme. Firstly, we develop an interactive prompt engineering mechanism that leverages a Large Language Model (LLM) to generate customized prompt corpora and employs Inverse Reinforcement Learning (IRL) for policy imitation through small-scale expert demonstrations. Secondly, we formulate a dynamic mobile AIGC service provisioning problem that jointly optimizes the number of inference trials and transmission power allocation. Then, we propose the Diffusion-Enhanced Deep Deterministic Policy Gradient (D3PG) algorithm to solve the problem. By incorporating the diffusion process into Deep Reinforcement Learning (DRL) architecture, the environment exploration capability can be improved, thus adapting to varying mobile AIGC scenarios. Extensive experimental results demonstrate that our prompt engineering approach improves single-round generation success probability by 6.3 times, while D3PG increases the user service experience by 67.8% compared to baseline DRL approaches.

cross TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents

Authors: Geon Lee, Wenchao Yu, Kijung Shin, Wei Cheng, Haifeng Chen

Abstract: Time series data is essential in various applications, including climate modeling, healthcare monitoring, and financial analytics. Understanding the contextual information associated with real-world time series data is often essential for accurate and reliable event predictions. In this paper, we introduce TimeCAP, a time-series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data, extending their typical usage as predictors. TimeCAP incorporates two independent LLM agents: one generates a textual summary capturing the context of the time series, while the other uses this enriched summary to make more informed predictions. In addition, TimeCAP employs a multi-modal encoder that synergizes with the LLM agents, enhancing predictive performance through mutual augmentation of inputs with in-context examples. Experimental results on real-world datasets demonstrate that TimeCAP outperforms state-of-the-art methods for time series event prediction, including those utilizing LLMs as predictors, achieving an average improvement of 28.75% in F1 score.

cross SMART: Self-Aware Agent for Tool Overuse Mitigation

Authors: Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-T\"ur, Gokhan Tur, Heng Ji

Abstract: Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to Tool Overuse, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce SMART (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent's self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce SMART-ER, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop SMARTAgent, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match its 70B counterpart and GPT-4o. Additionally, SMARTAgent generalizes to out-of-distribution test data like GSM8K and MINTQA, maintaining accuracy with just one-fifth the tool calls. These highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.

cross An Efficient Row-Based Sparse Fine-Tuning

Authors: Cen-Jhih Li, Aditya Bhaskara

Abstract: Fine-tuning is an important step in adapting foundation models such as large language models to downstream tasks. To make this step more accessible to users with limited computational budgets, it is crucial to develop fine-tuning methods that are memory and computationally efficient. Sparse Fine-tuning (SFT) and Low-rank adaptation (LoRA) are two frameworks that have emerged for addressing this problem and have been adopted widely in practice. In this work, we develop a new SFT framework, based on ideas from neural network pruning. At a high level, we first identify "important" neurons/nodes using feature importance metrics from network pruning (specifically, we use the structural pruning method), and then perform fine-tuning by restricting to weights involving these neurons. Using experiments on common language tasks, we demonstrate that our method significantly improves the memory efficiency of SFT without increasing training time complexity and implementation complexity, while achieving accuracy comparable to state-of-the-art methods such as LoRA and its variants.

cross Multi-Turn Multi-Modal Question Clarification for Enhanced Conversational Understanding

Authors: Kimia Ramezan, Alireza Amiri Bavandpour, Yifei Yuan, Clemencia Siro, Mohammad Aliannejadi

Abstract: Conversational query clarification enables users to refine their search queries through interactive dialogue, improving search effectiveness. Traditional approaches rely on text-based clarifying questions, which often fail to capture complex user preferences, particularly those involving visual attributes. While recent work has explored single-turn multi-modal clarification with images alongside text, such methods do not fully support the progressive nature of user intent refinement over multiple turns. Motivated by this, we introduce the Multi-turn Multi-modal Clarifying Questions (MMCQ) task, which combines text and visual modalities to refine user queries in a multi-turn conversation. To facilitate this task, we create a large-scale dataset named ClariMM comprising over 13k multi-turn interactions and 33k question-answer pairs containing multi-modal clarifying questions. We propose Mario, a retrieval framework that employs a two-phase ranking strategy: initial retrieval with BM25, followed by a multi-modal generative re-ranking model that integrates textual and visual information from conversational history. Our experiments show that multi-turn multi-modal clarification outperforms uni-modal and single-turn approaches, improving MRR by 12.88%. The gains are most significant in longer interactions, demonstrating the value of progressive refinement for complex queries.

cross LMFCA-Net: A Lightweight Model for Multi-Channel Speech Enhancement with Efficient Narrow-Band and Cross-Band Attention

Authors: Yaokai Zhang, Hanchen Pei, Wanqi Wang, Gongping Huang

Abstract: Deep learning based end-to-end multi-channel speech enhancement methods have achieved impressive performance by leveraging sub-band, cross-band, and spatial information. However, these methods often demand substantial computational resources, limiting their practicality on terminal devices. This paper presents a lightweight multi-channel speech enhancement network with decoupled fully connected attention (LMFCA-Net). The proposed LMFCA-Net introduces time-axis decoupled fully-connected attention (T-FCA) and frequency-axis decoupled fully-connected attention (F-FCA) mechanisms to effectively capture long-range narrow-band and cross-band information without recurrent units. Experimental results show that LMFCA-Net performs comparably to state-of-the-art methods while significantly reducing computational complexity and latency, making it a promising solution for practical applications.

cross All Models Are Miscalibrated, But Some Less So: Comparing Calibration with Conditional Mean Operators

Authors: Peter Moskvichev, Dino Sejdinovic

Abstract: When working in a high-risk setting, having well calibrated probabilistic predictive models is a crucial requirement. However, estimators for calibration error are not always able to correctly distinguish which model is better calibrated. We propose the \emph{conditional kernel calibration error} (CKCE) which is based on the Hilbert-Schmidt norm of the difference between conditional mean operators. By working directly with the definition of strong calibration as the distance between conditional distributions, which we represent by their embeddings in reproducing kernel Hilbert spaces, the CKCE is less sensitive to the marginal distribution of predictive models. This makes it more effective for relative comparisons than previously proposed calibration metrics. Our experiments, using both synthetic and real data, show that CKCE provides a more consistent ranking of models by their calibration error and is more robust against distribution shift.

cross TAPS: Throat and Acoustic Paired Speech Dataset for Deep Learning-Based Speech Enhancement

Authors: Yunsik Kim, Yonghun Song, Yoonyoung Chung

Abstract: In high-noise environments such as factories, subways, and busy streets, capturing clear speech is challenging due to background noise. Throat microphones provide a solution with their noise-suppressing properties, reducing the noise while recording speech. However, a significant limitation remains: high-frequency information is attenuated as sound waves pass through skin and tissue, reducing speech clarity. Recent deep learning approaches have shown promise in enhancing throat microphone recordings, but further progress is constrained by the absence of standardized dataset. We introduce a throat and acoustic paired speech dataset (TAPS), a collection of paired utterances recorded from 60 native Korean speakers using throat and acoustic microphones. To demonstrate the TAPS's utility, we tested three baseline deep learning models and identified the mapping-based approach as superior in improving speech quality and restoring content. Additionally, we propose an optimal method to mitigate the signal mismatch between throat and acoustic microphones, ensuring model performance. These results highlight the potential of TAPS to serve as a standardized dataset and advance research in throat microphone-based speech enhancement.

cross Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

Authors: Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin

Abstract: Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on-the-fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21x to 1.93x with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against sequential decoding baseline.

cross Generative Multi-Agent Collaboration in Embodied AI: A Systematic Review

Authors: Di Wu, Xian Wei, Guang Chen, Hao Shen, Xiangfeng Wang, Wenhao Li, Bo Jin

Abstract: Embodied multi-agent systems (EMAS) have attracted growing attention for their potential to address complex, real-world challenges in areas such as logistics and robotics. Recent advances in foundation models pave the way for generative agents capable of richer communication and adaptive problem-solving. This survey provides a systematic examination of how EMAS can benefit from these generative capabilities. We propose a taxonomy that categorizes EMAS by system architectures and embodiment modalities, emphasizing how collaboration spans both physical and virtual contexts. Central building blocks, perception, planning, communication, and feedback, are then analyzed to illustrate how generative techniques bolster system robustness and flexibility. Through concrete examples, we demonstrate the transformative effects of integrating foundation models into embodied, multi-agent frameworks. Finally, we discuss challenges and future directions, underlining the significant promise of EMAS to reshape the landscape of AI-driven collaboration.

cross A Survey of Automatic Prompt Engineering: An Optimization Perspective

Authors: Wenwu Li, Xiangfeng Wang, Wenhao Li, Bo Jin

Abstract: The rise of foundation models has shifted focus from resource-intensive fine-tuning to prompt engineering, a paradigm that steers model behavior through input design rather than weight updates. While manual prompt engineering faces limitations in scalability, adaptability, and cross-modal alignment, automated methods, spanning foundation model (FM) based optimization, evolutionary methods, gradient-based optimization, and reinforcement learning, offer promising solutions. Existing surveys, however, remain fragmented across modalities and methodologies. This paper presents the first comprehensive survey on automated prompt engineering through a unified optimization-theoretic lens. We formalize prompt optimization as a maximization problem over discrete, continuous, and hybrid prompt spaces, systematically organizing methods by their optimization variables (instructions, soft prompts, exemplars), task-specific objectives, and computational frameworks. By bridging theoretical formulation with practical implementations across text, vision, and multimodal domains, this survey establishes a foundational framework for both researchers and practitioners, while highlighting underexplored frontiers in constrained optimization and agent-oriented prompt design.

cross Towards Reasoning Ability of Small Language Models

Authors: Gaurav Srivastava, Shuxiang Cao, Xuan Wang

Abstract: Reasoning has long been viewed as an emergent property of large language models (LLMs), appearing at or above a certain scale ($\sim$100B parameters). However, recent studies challenge this assumption, showing that small language models (SLMs) can also achieve competitive reasoning performance. SLMs are increasingly favored for their efficiency and deployability. However, there is a lack of systematic study on the reasoning abilities of diverse SLMs, including those trained from scratch or derived from LLMs through quantization, pruning, and distillation. This raises a critical question: Can SLMs achieve reasoning abilities comparable to LLMs? In this work, we systematically survey, benchmark, and analyze 72 SLMs from six model families across 14 reasoning benchmarks. For reliable evaluation, we examine four evaluation methods and compare four LLM judges against human evaluations on 800 data points. We repeat all experiments three times to ensure a robust performance assessment. Additionally, we analyze the impact of different prompting strategies in small models. Beyond accuracy, we also evaluate model robustness under adversarial conditions and intermediate reasoning steps. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning. Instead, we foresee a future where SLMs with strong reasoning capabilities can be developed through structured training or post-training compression. They can serve as efficient alternatives to LLMs for reasoning-intensive tasks.

cross FaMTEB: Massive Text Embedding Benchmark in Persian Language

Authors: Erfan Zinvandi, Morteza Alikhani, Mehran Sarmadi, Zahra Pourbahman, Sepehr Arvin, Reza Kazemi, Arash Amini

Abstract: In this paper, we introduce a comprehensive benchmark for Persian (Farsi) text embeddings, built upon the Massive Text Embedding Benchmark (MTEB). Our benchmark includes 63 datasets spanning seven different tasks: classification, clustering, pair classification, reranking, retrieval, summary retrieval, and semantic textual similarity. The datasets are formed as a combination of existing, translated, and newly generated data, offering a diverse evaluation framework for Persian language models. Given the increasing use of text embedding models in chatbots, evaluation datasets are becoming inseparable ingredients in chatbot challenges and Retrieval-Augmented Generation systems. As a contribution, we include chatbot evaluation datasets in the MTEB benchmark for the first time. In addition, in this paper, we introduce the new task of summary retrieval which is not part of the tasks included in standard MTEB. Another contribution of this paper is the introduction of a substantial number of new Persian language NLP datasets suitable for training and evaluation, some of which have no previous counterparts in Persian. We evaluate the performance of several Persian and multilingual embedding models in a range of tasks. This work introduces an open-source benchmark with datasets, code and a public leaderboard.

cross Distributional autoencoders know the score

Authors: Andrej Leban

Abstract: This work presents novel and desirable properties of a recently introduced class of autoencoders -- the Distributional Principal Autoencoder (DPA) -- that combines distributionally correct reconstruction with principal components-like interpretability of the encodings. First, we show that the level sets of the encoder orient themselves exactly with regard to the score of the data distribution. This both explains the method's often remarkable performance in disentangling the the factors of variation of the data, as well as opens up possibilities of recovering its distribution while having access to samples only. In settings where the score itself has physical meaning -- such as when the data obey the Boltzmann distribution -- we demonstrate that the method can recover scientifically important quantities such as the \textit{minimum free energy path}. Second, we show that if the data lie on a manifold that can be approximated by the encoder, the optimal encoder's components beyond the dimension of the manifold will carry absolutely no additional information about the data distribution. This promises new ways of determining the number of relevant dimensions of the data beyond common heuristics such as the scree plot. Finally, the fact that the method is learning the score means that it could have promise as a generative model, potentially rivaling approaches such as diffusion, which similarly attempts to approximate the score of the data distribution.

cross How does ion temperature gradient turbulence depend on magnetic geometry? Insights from data and machine learning

Authors: Matt Landreman, Jong Youl Choi, Caio Alves, Prasanna Balaprakash, R. Michael Churchill, Rory Conlin, Gareth Roberg-Clark

Abstract: Magnetic geometry has a significant effect on the level of turbulent transport in fusion plasmas. Here, we model and analyze this dependence using multiple machine learning methods and a dataset of > 200,000 nonlinear simulations of ion-temperature-gradient turbulence in diverse non-axisymmetric geometries. The dataset is generated using a large collection of both optimized and randomly generated stellarator equilibria. At fixed gradients, the turbulent heat flux varies between geometries by several orders of magnitude. Trends are apparent among the configurations with particularly high or low heat flux. Regression and classification techniques from machine learning are then applied to extract patterns in the dataset. Due to a symmetry of the gyrokinetic equation, the heat flux and regressions thereof should be invariant to translations of the raw features in the parallel coordinate, similar to translation invariance in computer vision applications. Multiple regression models including convolutional neural networks (CNNs) and decision trees can achieve reasonable predictive power for the heat flux in held-out test configurations, with highest accuracy for the CNNs. Using Spearman correlation, sequential feature selection, and Shapley values to measure feature importance, it is consistently found that the most important geometric lever on the heat flux is the flux surface compression in regions of bad curvature. The second most important feature relates to the magnitude of geodesic curvature. These two features align remarkably with surrogates that have been proposed based on theory, while the methods here allow a natural extension to more features for increased accuracy. The dataset, released with this publication, may also be used to test other proposed surrogates, and we find many previously published proxies do correlate well with both the heat flux and stability boundary.

cross On the kernel learning problem

Authors: Yang Li, Feng Ruan

Abstract: The classical kernel ridge regression problem aims to find the best fit for the output $Y$ as a function of the input data $X\in \mathbb{R}^d$, with a fixed choice of regularization term imposed by a given choice of a reproducing kernel Hilbert space, such as a Sobolev space. Here we consider a generalization of the kernel ridge regression problem, by introducing an extra matrix parameter $U$, which aims to detect the scale parameters and the feature variables in the data, and thereby improve the efficiency of kernel ridge regression. This naturally leads to a nonlinear variational problem to optimize the choice of $U$. We study various foundational mathematical aspects of this variational problem, and in particular how this behaves in the presence of multiscale structures in the data.

cross Deep Subspace Learning for Surface Anomaly Classification Based on 3D Point Cloud Data

Authors: Xuanming Cao, Chengyu Tao, Juan Du

Abstract: Surface anomaly classification is critical for manufacturing system fault diagnosis and quality control. However, the following challenges always hinder accurate anomaly classification in practice: (i) Anomaly patterns exhibit intra-class variation and inter-class similarity, presenting challenges in the accurate classification of each sample. (ii) Despite the predefined classes, new types of anomalies can occur during production that require to be detected accurately. (iii) Anomalous data is rare in manufacturing processes, leading to limited data for model learning. To tackle the above challenges simultaneously, this paper proposes a novel deep subspace learning-based 3D anomaly classification model. Specifically, starting from a lightweight encoder to extract the latent representations, we model each class as a subspace to account for the intra-class variation, while promoting distinct subspaces of different classes to tackle the inter-class similarity. Moreover, the explicit modeling of subspaces offers the capability to detect out-of-distribution samples, i.e., new types of anomalies, and the regularization effect with much fewer learnable parameters of our proposed subspace classifier, compared to the popular Multi-Layer Perceptions (MLPs). Extensive numerical experiments demonstrate our method achieves better anomaly classification results than benchmark methods, and can effectively identify the new types of anomalies.

cross Diversity-Oriented Data Augmentation with Large Language Models

Authors: Zaitian Wang, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu, Pengfei Wang, Yuanchun Zhou

Abstract: Data augmentation is an essential technique in natural language processing (NLP) for enriching training datasets by generating diverse samples. This process is crucial for improving the robustness and generalization capabilities of NLP models. However, a significant challenge remains: \textit{Insufficient Attention to Sample Distribution Diversity}. Most existing methods focus on increasing the sample numbers while neglecting the sample distribution diversity, which can lead to model overfitting. In response, we explore data augmentation's impact on dataset diversity and propose a \textbf{\underline{D}}iversity-\textbf{\underline{o}}riented data \textbf{\underline{Aug}}mentation framework (\textbf{DoAug}). % $\mathscr{DoAug}$ Specifically, we utilize a diversity-oriented fine-tuning approach to train an LLM as a diverse paraphraser, which is capable of augmenting textual datasets by generating diversified paraphrases. Then, we apply the LLM paraphraser to a selected coreset of highly informative samples and integrate the paraphrases with the original data to create a more diverse augmented dataset. Finally, we conduct extensive experiments on 12 real-world textual datasets. The results show that our fine-tuned LLM augmenter improves diversity while preserving label consistency, thereby enhancing the robustness and performance of downstream tasks. Specifically, it achieves an average performance gain of $10.52\%$, surpassing the runner-up baseline with more than three percentage points.

cross ReVeil: Unconstrained Concealed Backdoor Attack on Deep Neural Networks using Machine Unlearning

Authors: Manaar Alam, Hithem Lamri, Michail Maniatakos

Abstract: Backdoor attacks embed hidden functionalities in deep neural networks (DNN), triggering malicious behavior with specific inputs. Advanced defenses monitor anomalous DNN inferences to detect such attacks. However, concealed backdoors evade detection by maintaining a low pre-deployment attack success rate (ASR) and restoring high ASR post-deployment via machine unlearning. Existing concealed backdoors are often constrained by requiring white-box or black-box access or auxiliary data, limiting their practicality when such access or data is unavailable. This paper introduces ReVeil, a concealed backdoor attack targeting the data collection phase of the DNN training pipeline, requiring no model access or auxiliary data. ReVeil maintains low pre-deployment ASR across four datasets and four trigger patterns, successfully evades three popular backdoor detection methods, and restores high ASR post-deployment through machine unlearning.

cross LLM Agents Making Agent Tools

Authors: Georg W\"olflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelovi\'c, Jakob Nikolas Kather

Abstract: Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains which demand large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, a novel agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a short task description and a repository URL, ToolMaker autonomously installs required dependencies and generates code to perform the task, using a closed-loop self-correction mechanism to iteratively diagnose and rectify errors. To evaluate our approach, we introduce a benchmark comprising 15 diverse and complex computational tasks spanning both medical and non-medical domains with over 100 unit tests to objectively assess tool correctness and robustness. ToolMaker correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows.

cross Adversarially Robust CLIP Models Can Induce Better (Robust) Perceptual Metrics

Authors: Francesco Croce, Christian Schlarmann, Naman Deep Singh, Matthias Hein

Abstract: Measuring perceptual similarity is a key tool in computer vision. In recent years perceptual metrics based on features extracted from neural networks with large and diverse training sets, e.g. CLIP, have become popular. At the same time, the metrics extracted from features of neural networks are not adversarially robust. In this paper we show that adversarially robust CLIP models, called R-CLIP$_\textrm{F}$, obtained by unsupervised adversarial fine-tuning induce a better and adversarially robust perceptual metric that outperforms existing metrics in a zero-shot setting, and further matches the performance of state-of-the-art metrics while being robust after fine-tuning. Moreover, our perceptual metric achieves strong performance on related tasks such as robust image-to-image retrieval, which becomes especially relevant when applied to "Not Safe for Work" (NSFW) content detection and dataset filtering. While standard perceptual metrics can be easily attacked by a small perturbation completely degrading NSFW detection, our robust perceptual metric maintains high accuracy under an attack while having similar performance for unperturbed images. Finally, perceptual metrics induced by robust CLIP models have higher interpretability: feature inversion can show which images are considered similar, while text inversion can find what images are associated to a given prompt. This also allows us to visualize the very rich visual concepts learned by a CLIP model, including memorized persons, paintings and complex queries.

cross Private Synthetic Graph Generation and Fused Gromov-Wasserstein Distance

Authors: Leoni Carla Wirth, Gholamali Aminian, Gesine Reinert

Abstract: Networks are popular for representing complex data. In particular, differentially private synthetic networks are much in demand for method and algorithm development. The network generator should be easy to implement and should come with theoretical guarantees. Here we start with complex data as input and jointly provide a network representation as well as a synthetic network generator. Using a random connection model, we devise an effective algorithmic approach for generating attributed synthetic graphs which is $\epsilon$-differentially private at the vertex level, while preserving utility under an appropriate notion of distance which we develop. We provide theoretical guarantees for the accuracy of the private synthetic graphs using the fused Gromov-Wasserstein distance, which extends the Wasserstein metric to structured data. Our method draws inspiration from the PSMM method of \citet{he2023}.

cross 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency

Authors: Sheng-Yu Huang, Zi-Ting Chou, Yu-Chiang Frank Wang

Abstract: When performing 3D inpainting using novel-view rendering methods like Neural Radiance Field (NeRF) or 3D Gaussian Splatting (3DGS), how to achieve texture and geometry consistency across camera views has been a challenge. In this paper, we propose a framework of 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency (3DGIC) for cross-view consistent 3D inpainting. Guided by the rendered depth information from each training view, our 3DGIC exploits background pixels visible across different views for updating the inpainting mask, allowing us to refine the 3DGS for inpainting purposes.Through extensive experiments on benchmark datasets, we confirm that our 3DGIC outperforms current state-of-the-art 3D inpainting methods quantitatively and qualitatively.

cross Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

Authors: Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou

Abstract: Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike previous studies \cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity} that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, which are closer to the practical setting. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, which is in contrast to the previous work \cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity} that show circuits only add some additional components after fine-tuning. Based on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA) method, which assigns ranks to layers based on edge changes in the circuits. Experimental results demonstrate that our circuit-based LoRA algorithm achieves an average performance improvement of 2.46\% over standard LoRA with similar parameter sizes. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, providing new insights into the design of such tasks and deepening the understanding of circuit dynamics and fine-tuning mechanisms.

cross AAKT: Enhancing Knowledge Tracing with Alternate Autoregressive Modeling

Authors: Hao Zhou, Wenge Rong, Jianfei Zhang, Qing Sun, Yuanxin Ouyang, Zhang Xiong

Abstract: Knowledge Tracing (KT) aims to predict students' future performances based on their former exercises and additional information in educational settings. KT has received significant attention since it facilitates personalized experiences in educational situations. Simultaneously, the autoregressive modeling on the sequence of former exercises has been proven effective for this task. One of the primary challenges in autoregressive modeling for Knowledge Tracing is effectively representing the anterior (pre-response) and posterior (post-response) states of learners across exercises. Existing methods often employ complex model architectures to update learner states using question and response records. In this study, we propose a novel perspective on knowledge tracing task by treating it as a generative process, consistent with the principles of autoregressive models. We demonstrate that knowledge states can be directly represented through autoregressive encodings on a question-response alternate sequence, where model generate the most probable representation in hidden state space by analyzing history interactions. This approach underpins our framework, termed Alternate Autoregressive Knowledge Tracing (AAKT). Additionally, we incorporate supplementary educational information, such as question-related skills, into our framework through an auxiliary task, and include extra exercise details, like response time, as additional inputs. Our proposed framework is implemented using advanced autoregressive technologies from Natural Language Generation (NLG) for both training and prediction. Empirical evaluations on four real-world KT datasets indicate that AAKT consistently outperforms all baseline models in terms of AUC, ACC, and RMSE. Furthermore, extensive ablation studies and visualized analysis validate the effectiveness of key components in AAKT.

cross ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition

Authors: Muhammad Waseem Akram, Stefano Dettori, Valentina Colla, Giorgio Carlo Buttazzo

Abstract: Chord recognition serves as a critical task in music information retrieval due to the abstract and descriptive nature of chords in music analysis. While audio chord recognition systems have achieved significant accuracy for small vocabularies (e.g., major/minor chords), large-vocabulary chord recognition remains a challenging problem. This complexity also arises from the inherent long-tail distribution of chords, where rare chord types are underrepresented in most datasets, leading to insufficient training samples. Effective chord recognition requires leveraging contextual information from audio sequences, yet existing models, such as combinations of convolutional neural networks, bidirectional long short-term memory networks, and bidirectional transformers, face limitations in capturing long-term dependencies and exhibit suboptimal performance on large-vocabulary chord recognition tasks. This work proposes ChordFormer, a novel conformer-based architecture designed to tackle structural chord recognition (e.g., triads, bass, sevenths) for large vocabularies. ChordFormer leverages conformer blocks that integrate convolutional neural networks with transformers, thus enabling the model to capture both local patterns and global dependencies effectively. By addressing challenges such as class imbalance through a reweighted loss function and structured chord representations, ChordFormer outperforms state-of-the-art models, achieving a 2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy on large-vocabulary chord datasets. Furthermore, ChordFormer excels in handling class imbalance, providing robust and balanced recognition across chord types. This approach bridges the gap between theoretical music knowledge and practical applications, advancing the field of large-vocabulary chord recognition.

cross BaxBench: Can LLMs Generate Correct and Secure Backends?

Authors: Mark Vero, Niels M\"undler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanovi\'c, Jingxuan He, Martin Vechev

Abstract: The automatic generation of programs has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve algorithmic coding tasks. However, to achieve full automation, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third-parties, making secure solutions that prevent deployment-time attacks an imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI o1, achieves a mere 60% on code correctness; (ii) on average, we could successfully execute security exploits on more than half of the correct programs generated by each LLM; and (iii) in less popular backend frameworks, models further struggle to generate correct and secure applications. Progress on BaxBench signifies important steps towards autonomous and secure software development with LLMs.

cross Enhanced Anomaly Detection in IoMT Networks using Ensemble AI Models on the CICIoMT2024 Dataset

Authors: Prathamesh Chandekar, Mansi Mehta, Swet Chandan

Abstract: The rapid proliferation of Internet of Medical Things (IoMT) devices in healthcare has introduced unique cybersecurity challenges, primarily due to the diverse communication protocols and critical nature of these devices This research aims to develop an advanced, real-time anomaly detection framework tailored for IoMT network traffic, leveraging AI/ML models and the CICIoMT2024 dataset By integrating multi-protocol (MQTT, WiFi), attack-specific (DoS, DDoS), time-series (active/idle states), and device-specific (Bluetooth) data, our study captures a comprehensive range of IoMT interactions As part of our data analysis, various machine learning techniques are employed which include an ensemble model using XGBoost for improved performance against specific attack types, sequential models comprised of LSTM and CNN-LSTM that leverage time dependencies, and unsupervised models such as Autoencoders and Isolation Forest that are good in general anomaly detection The results of the experiment prove with an ensemble model lowers false positive rates and reduced detections.

cross JoLT: Joint Probabilistic Predictions on Tabular Data Using LLMs

Authors: Aliaksandra Shysheya, John Bronskill, James Requeima, Shoaib Ahmed Siddiqui, Javier Gonzalez, David Duvenaud, Richard E. Turner

Abstract: We introduce a simple method for probabilistic predictions on tabular data based on Large Language Models (LLMs) called JoLT (Joint LLM Process for Tabular data). JoLT uses the in-context learning capabilities of LLMs to define joint distributions over tabular data conditioned on user-specified side information about the problem, exploiting the vast repository of latent problem-relevant knowledge encoded in LLMs. JoLT defines joint distributions for multiple target variables with potentially heterogeneous data types without any data conversion, data preprocessing, special handling of missing data, or model training, making it accessible and efficient for practitioners. Our experiments show that JoLT outperforms competitive methods on low-shot single-target and multi-target tabular classification and regression tasks. Furthermore, we show that JoLT can automatically handle missing data and perform data imputation by leveraging textual side information. We argue that due to its simplicity and generality, JoLT is an effective approach for a wide variety of real prediction problems.

cross Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration

Authors: Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, Ying Wen

Abstract: Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent System 1 and System 2 methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates System 1 and System 2 for efficient real-time simultaneous human-AI collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. To the best of our knowledge, DPT-Agent is the first language agent framework that achieves successful real-time simultaneous human-AI collaboration autonomously. Code of DPT-Agent can be found in https://github.com/sjtu-marl/DPT-Agent.

URLs: https://github.com/sjtu-marl/DPT-Agent.

cross Ansatz-free Hamiltonian learning with Heisenberg-limited scaling

Authors: Hong-Ye Hu, Muzhou Ma, Weiyuan Gong, Qi Ye, Yu Tong, Steven T. Flammia, Susanne F. Yelin

Abstract: Learning the unknown interactions that govern a quantum system is crucial for quantum information processing, device benchmarking, and quantum sensing. The problem, known as Hamiltonian learning, is well understood under the assumption that interactions are local, but this assumption may not hold for arbitrary Hamiltonians. Previous methods all require high-order inverse polynomial dependency with precision, unable to surpass the standard quantum limit and reach the gold standard Heisenberg-limited scaling. Whether Heisenberg-limited Hamiltonian learning is possible without prior assumptions about the interaction structures, a challenge we term \emph{ansatz-free Hamiltonian learning}, remains an open question. In this work, we present a quantum algorithm to learn arbitrary sparse Hamiltonians without any structure constraints using only black-box queries of the system's real-time evolution and minimal digital controls to attain Heisenberg-limited scaling in estimation error. Our method is also resilient to state-preparation-and-measurement errors, enhancing its practical feasibility. Moreover, we establish a fundamental trade-off between total evolution time and quantum control on learning arbitrary interactions, revealing the intrinsic interplay between controllability and total evolution time complexity for any learning algorithm. These results pave the way for further exploration into Heisenberg-limited Hamiltonian learning in complex quantum systems under minimal assumptions, potentially enabling new benchmarking and verification protocols.

cross Neural Guided Diffusion Bridges

Authors: Gefan Yang, Frank van der Meulen, Stefan Sommer

Abstract: We propose a novel method for simulating conditioned diffusion processes (diffusion bridges) in Euclidean spaces. By training a neural network to approximate bridge dynamics, our approach eliminates the need for computationally intensive Markov Chain Monte Carlo (MCMC) methods or reverse-process modeling. Compared to existing methods, it offers greater robustness across various diffusion specifications and conditioning scenarios. This applies in particular to rare events and multimodal distributions, which pose challenges for score-learning- and MCMC-based approaches. We propose a flexible variational family for approximating the diffusion bridge path measure which is partially specified by a neural network. Once trained, it enables efficient independent sampling at a cost comparable to sampling the unconditioned (forward) process.

cross PreAdaptFWI: Pretrained-Based Adaptive Residual Learning for Full-Waveform Inversion Without Dataset Dependency

Authors: Xintong Dong, Zhengyi Yuan, Jun Lin, Shiqi Dong, Xunqian Tong, Yue Li

Abstract: Full-waveform inversion (FWI) is a method that utilizes seismic data to invert the physical parameters of subsurface media by minimizing the difference between simulated and observed waveforms. Due to its ill-posed nature, FWI is susceptible to getting trapped in local minima. Consequently, various research efforts have attempted to combine neural networks with FWI to stabilize the inversion process. This study presents a simple yet effective training framework that is independent of dataset reliance and requires only moderate pre-training on a simple initial model to stabilize network outputs. During the transfer learning phase, the conventional FWI gradients will simultaneously update both the neural network and the proposed adaptive residual learning module, which learns the residual mapping of large-scale distribution features in the network's output, rather than directly fitting the target mapping. Through this synergistic training paradigm, the proposed algorithm effectively infers the physically-informed prior knowledge into a global representation of stratigraphic distribution, as well as capturing subtle variations in inter-layer velocities within local details, thereby escaping local optima. Evaluating the method on two benchmark models under various conditions, including absent low-frequency data, noise interference, and differing initial models, along with corresponding ablation experiments, consistently demonstrates the superiority of the proposed approach.

cross GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs

Authors: Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han

Abstract: The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). It is underexplored how MLLMs can incorporate the relational information (\textit{i.e.}, graph structure) and semantic information (\textit{i.e.,} texts and images) on such graphs for multimodal comprehension and generation. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants to transform semantic and structural information as input for MLLMs. Then, we propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting MLLM to interleaved text and image generation in graph scenarios. Extensive experiments on three datasets from different domains demonstrate the effectiveness of our proposed method. Datasets and codes will be open-sourced upon acceptance.

cross Qubit-Based Framework for Quantum Machine Learning: Bridging Classical Data and Quantum Algorithms

Authors: Bhavna Bose, Saurav Verma

Abstract: This paper dives into the exciting and rapidly growing field of quantum computing, explaining its core ideas, current progress, and how it could revolutionize the way we solve complex problems. It starts by breaking down the basics, like qubits, quantum circuits, and how principles like superposition and entanglement make quantum computers fundamentally different-and far more powerful for certain tasks-than the classical computers we use today. We also explore how quantum computing deals with complex problems and why it is uniquely suited for challenges classical systems struggle to handle. A big part of this paper focuses on Quantum Machine Learning (QML), where the strengths of quantum computing meet the world of artificial intelligence. By processing massive datasets and optimizing intricate algorithms, quantum systems offer new possibilities for machine learning. We highlight different approaches to combining quantum and classical computing, showing how they can work together to produce faster and more accurate results. Additionally, we explore the tools and platforms available-like TensorFlow Quantum, Qiskit and PennyLane-that are helping researchers and developers bring these theories to life. Of course, quantum computing has its hurdles. Challenges like scaling up hardware, correcting errors, and keeping qubits stable are significant roadblocks. Yet, with rapid advancements in cloud-based platforms and innovative technologies, the potential of quantum computing feels closer than ever. This paper aims to offer readers a clear and comprehensive introduction to quantum computing, its role in machine learning, and the immense possibilities it holds for the future of technology.

cross Refined PAC-Bayes Bounds for Offline Bandits

Authors: Amaury Gouverneur, Tobias J. Oechtering, Mikael Skoglund

Abstract: In this paper, we present refined probabilistic bounds on empirical reward estimates for off-policy learning in bandit problems. We build on the PAC-Bayesian bounds from Seldin et al. (2010) and improve on their results using a new parameter optimization approach introduced by Rodr\'iguez et al. (2024). This technique is based on a discretization of the space of possible events to optimize the "in probability" parameter. We provide two parameter-free PAC-Bayes bounds, one based on Hoeffding-Azuma's inequality and the other based on Bernstein's inequality. We prove that our bounds are almost optimal as they recover the same rate as would be obtained by setting the "in probability" parameter after the realization of the data.

cross Learning Generalizable Prompt for CLIP with Class Similarity Knowledge

Authors: Sehun Jung, Hyang-won Lee

Abstract: In vision-language models (VLMs), prompt tuning has shown its effectiveness in adapting models to downstream tasks. However, learned prompts struggle to generalize to unseen classes, as they tend to overfit to the classes that are targeted during prompt tuning. Examining failure cases, we observed that learned prompts disrupt the semantics of unseen classes, generating text embeddings with incorrect semantic relationships among classes. To address this, we propose Similarity Alignment Regularization (SAR), which regularizes learnable prompts to preserve the semantic relationships among classes captured by hand-crafted prompts. Specifically, we first obtain novel classes related to base classes using ChatGPT-4o and utilize them as potential unseen classes during prompt tuning. Then, by targeting both base and novel classes, SAR aligns the similarity relationships among text embeddings generated by learnable prompts with the similarity relationships from hand-crafted prompts. Extensive experiments applying SAR to existing prompt tuning methods demonstrate its effectiveness in improving generalization to unseen classes.

cross Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition

Authors: Thibault Rousset, Taisei Kakibuchi, Yusuke Sasaki, Yoshihide Nomura

Abstract: This paper investigates the integration of technical vocabulary in merged language models. We explore the knowledge transfer mechanisms involved when combining a general-purpose language-specific model with a domain-specific model, focusing on the resulting model's comprehension of technical jargon. Our experiments analyze the impact of this merging process on the target model's proficiency in handling specialized terminology. We present a quantitative evaluation of the performance of the merged model, comparing it with that of the individual constituent models. The findings offer insights into the effectiveness of different model merging methods for enhancing domain-specific knowledge and highlight potential challenges and future directions in leveraging these methods for cross-lingual knowledge transfer in Natural Language Processing.

cross Reconfigurable Intelligent Surfaces-Assisted Integrated Access and Backhaul

Authors: Charitha Madapatha, Behrooz Makki, Hao Guo, Tommy Svensson

Abstract: In this paper, we study the impact of reconfigurable intelligent surfaces (RISs) on the coverage extension of integrated access and backhaul (IAB) networks. Particularly, using a finite stochastic geometry model, with random distributions of user equipments (UEs) in a finite region, and planned hierachical architecture for IAB, we study the service coverage probability defined as the probability of the event that the UEs' minimum rate requirements are satisfied. We present comparisons between different cases including IAB-only, IAB assisted with RIS for backhaul as well as IAB assisted by network controlled repeaters (NCRs). Our investigations focus on wide-area IAB assisted with RIS through the lens of different design architectures and deployments, revealing both conflicts and synergies for minimizing the effect of tree foliage over seasonal changes. Our simulation results reveal both opportunities and challenges towards the implementation of RIS in IAB.

cross Scalable and Cost-Efficient ML Inference: Parallel Batch Processing with Serverless Functions

Authors: Amine Barrak, Emna Ksontini

Abstract: As data-intensive applications grow, batch processing in limited-resource environments faces scalability and resource management challenges. Serverless computing offers a flexible alternative, enabling dynamic resource allocation and automatic scaling. This paper explores how serverless architectures can make large-scale ML inference tasks faster and cost-effective by decomposing monolithic processes into parallel functions. Through a case study on sentiment analysis using the DistilBERT model and the IMDb dataset, we demonstrate that serverless parallel processing can reduce execution time by over 95% compared to monolithic approaches, at the same cost.

cross Atom of Thoughts for Markov LLM Test-Time Scaling

Authors: Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo

Abstract: Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that complex reasoning progress is often achieved by solving a sequence of independent subquestions, each being self-contained and verifiable. These subquestions are essentially atomic questions, relying primarily on their current state rather than accumulated history, similar to the memoryless transitions in a Markov process. Based on this observation, we propose Atom of Thoughts (AoT), where each state transition in the reasoning process consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions, forming a new atomic question state. This iterative decomposition-contraction process continues until reaching directly solvable atomic questions, naturally realizing Markov transitions between question states. Furthermore, these atomic questions can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%. The code will be available at https://github.com/qixucen/atom.

URLs: https://github.com/qixucen/atom.

cross A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond

Authors: Shreya Shukla, Jose Torres, Abhijit Mishra, Jacek Gwizdka, Shounak Roychowdhury

Abstract: Integration of Brain-Computer Interfaces (BCIs) and Generative Artificial Intelligence (GenAI) has opened new frontiers in brain signal decoding, enabling assistive communication, neural representation learning, and multimodal integration. BCIs, particularly those leveraging Electroencephalography (EEG), provide a non-invasive means of translating neural activity into meaningful outputs. Recent advances in deep learning, including Generative Adversarial Networks (GANs) and Transformer-based Large Language Models (LLMs), have significantly improved EEG-based generation of images, text, and speech. This paper provides a literature review of the state-of-the-art in EEG-based multimodal generation, focusing on (i) EEG-to-image generation through GANs, Variational Autoencoders (VAEs), and Diffusion Models, and (ii) EEG-to-text generation leveraging Transformer based language models and contrastive learning methods. Additionally, we discuss the emerging domain of EEG-to-speech synthesis, an evolving multimodal frontier. We highlight key datasets, use cases, challenges, and EEG feature encoding methods that underpin generative approaches. By providing a structured overview of EEG-based generative AI, this survey aims to equip researchers and practitioners with insights to advance neural decoding, enhance assistive technologies, and expand the frontiers of brain-computer interaction.

cross How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines

Authors: Ayan Sengupta, Yash Goel, Tanmoy Chakraborty

Abstract: Neural scaling laws have revolutionized the design and optimization of large-scale AI models by revealing predictable relationships between model size, dataset volume, and computational resources. Early research established power-law relationships in model performance, leading to compute-optimal scaling strategies. However, recent studies highlighted their limitations across architectures, modalities, and deployment contexts. Sparse models, mixture-of-experts, retrieval-augmented learning, and multimodal models often deviate from traditional scaling patterns. Moreover, scaling behaviors vary across domains such as vision, reinforcement learning, and fine-tuning, underscoring the need for more nuanced approaches. In this survey, we synthesize insights from over 50 studies, examining the theoretical foundations, empirical findings, and practical implications of scaling laws. We also explore key challenges, including data efficiency, inference scaling, and architecture-specific constraints, advocating for adaptive scaling strategies tailored to real-world applications. We suggest that while scaling laws provide a useful guide, they do not always generalize across all architectures and training strategies.

cross Low-Rank Thinning

Authors: Annabelle Michael Carrell, Albert Gong, Abhishek Shetty, Raaz Dwivedi, Lester Mackey

Abstract: The goal in thinning is to summarize a dataset using a small set of representative points. Remarkably, sub-Gaussian thinning algorithms like Kernel Halving and Compress can match the quality of uniform subsampling while substantially reducing the number of summary points. However, existing guarantees cover only a restricted range of distributions and kernel-based quality measures and suffer from pessimistic dimension dependence. To address these deficiencies, we introduce a new low-rank analysis of sub-Gaussian thinning that applies to any distribution and any kernel, guaranteeing high-quality compression whenever the kernel or data matrix is approximately low-rank. To demonstrate the broad applicability of the techniques, we design practical sub-Gaussian thinning approaches that improve upon the best known guarantees for approximating attention in transformers, accelerating stochastic gradient training through reordering, and distinguishing distributions in near-linear time.

cross CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models

Authors: Yifan Zhang, Xue Yang

Abstract: Automating planning with LLMs presents transformative opportunities for traditional industries, yet remains underexplored. In commercial construction, the complexity of automated scheduling often requires manual intervention to ensure precision. We propose CONSTRUCTA, a novel framework leveraging LLMs to optimize construction schedules in complex projects like semiconductor fabrication. CONSTRUCTA addresses key challenges by: (1) integrating construction-specific knowledge through static RAG; (2) employing context-sampling techniques inspired by architectural expertise to provide relevant input; and (3) deploying Construction DPO to align schedules with expert preferences using RLHF. Experiments on proprietary data demonstrate performance improvements of +42.3% in missing value prediction, +79.1% in dependency analysis, and +28.9% in automated planning compared to baseline methods, showcasing its potential to revolutionize construction workflows and inspire domain-specific LLM advancements.

cross AdaSplash: Adaptive Sparse Flash Attention

Authors: Nuno Gon\c{c}alves, Marcos Treviso, Andr\'e F. T. Martins

Abstract: The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations. It approaches -- and in some cases surpasses -- the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.

cross How compositional generalization and creativity improve as diffusion models are trained

Authors: Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, Matthieu Wyart

Abstract: Natural data is often organized as a hierarchical composition of features. How many samples do generative models need to learn the composition rules, so as to produce a combinatorial number of novel data? What signal in the data is exploited to learn? We investigate these questions both theoretically and empirically. Theoretically, we consider diffusion models trained on simple probabilistic context-free grammars - tree-like graphical models used to represent the structure of data such as language and images. We demonstrate that diffusion models learn compositional rules with the sample complexity required for clustering features with statistically similar context, a process similar to the word2vec algorithm. However, this clustering emerges hierarchically: higher-level, more abstract features associated with longer contexts require more data to be identified. This mechanism leads to a sample complexity that scales polynomially with the said context size. As a result, diffusion models trained on intermediate dataset size generate data coherent up to a certain scale, but that lacks global coherence. We test these predictions in different domains, and find remarkable agreement: both generated texts and images achieve progressively larger coherence lengths as the training time or dataset size grows. We discuss connections between the hierarchical clustering mechanism we introduce here and the renormalization group in physics.

cross On the Query Complexity of Verifier-Assisted Language Generation

Authors: Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski

Abstract: Recently, a plethora of works have proposed inference-time algorithms (e.g. best-of-n), which incorporate verifiers to assist the generation process. Their quality-efficiency trade-offs have been empirically benchmarked on a variety of constrained generation tasks, but the algorithmic design landscape is still largely poorly understood. In this paper, we develop a mathematical framework for reasoning about constrained generation using a pre-trained language model generator oracle and a process verifier--which can decide whether a prefix can be extended to a string which satisfies the constraints of choice. We show that even in very simple settings, access to a verifier can render an intractable problem (information-theoretically or computationally) to a tractable one. In fact, we show even simple algorithms, like tokenwise rejection sampling, can enjoy significant benefits from access to a verifier. Empirically, we show that a natural modification of tokenwise rejection sampling, in which the sampler is allowed to "backtrack" (i.e., erase the final few generated tokens) has robust and substantive benefits over natural baselines (e.g. (blockwise) rejection sampling, nucleus sampling)--both in terms of computational efficiency, accuracy and diversity.

cross Hypernym Bias: Unraveling Deep Classifier Training Dynamics through the Lens of Class Hierarchy

Authors: Roman Malashin, Valeria Yachnaya, Alexander Mullin

Abstract: We investigate the training dynamics of deep classifiers by examining how hierarchical relationships between classes evolve during training. Through extensive experiments, we argue that the learning process in classification problems can be understood through the lens of label clustering. Specifically, we observe that networks tend to distinguish higher-level (hypernym) categories in the early stages of training, and learn more specific (hyponym) categories later. We introduce a novel framework to track the evolution of the feature manifold during training, revealing how the hierarchy of class relations emerges and refines across the network layers. Our analysis demonstrates that the learned representations closely align with the semantic structure of the dataset, providing a quantitative description of the clustering process. Notably, we show that in the hypernym label space, certain properties of neural collapse appear earlier than in the hyponym label space, helping to bridge the gap between the initial and terminal phases of learning. We believe our findings offer new insights into the mechanisms driving hierarchical learning in deep networks, paving the way for future advancements in understanding deep learning dynamics.

cross Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction

Authors: Xiang Fu, Brandon M. Wood, Luis Barroso-Luque, Daniel S. Levine, Meng Gao, Misko Dzamba, C. Lawrence Zitnick

Abstract: Machine learning interatomic potentials (MLIPs) have become increasingly effective at approximating quantum mechanical calculations at a fraction of the computational cost. However, lower errors on held out test sets do not always translate to improved results on downstream physical property prediction tasks. In this paper, we propose testing MLIPs on their practical ability to conserve energy during molecular dynamic simulations. If passed, improved correlations are found between test errors and their performance on physical property prediction tasks. We identify choices which may lead to models failing this test, and use these observations to improve upon highly-expressive models. The resulting model, eSEN, provides state-of-the-art results on a range of physical property prediction tasks, including materials stability prediction, thermal conductivity prediction, and phonon calculations.

cross Learning Getting-Up Policies for Real-World Humanoid Robots

Authors: Xialin He, Runpei Dong, Zixuan Chen, Saurabh Gupta

Abstract: Automatic fall recovery is a crucial prerequisite before humanoid robots can be reliably deployed. Hand-designing controllers for getting up is difficult because of the varied configurations a humanoid can end up in after a fall and the challenging terrains humanoid robots are expected to operate on. This paper develops a learning framework to produce controllers that enable humanoid robots to get up from varying configurations on varying terrains. Unlike previous successful applications of humanoid locomotion learning, the getting-up task involves complex contact patterns, which necessitates accurately modeling the collision geometry and sparser rewards. We address these challenges through a two-phase approach that follows a curriculum. The first stage focuses on discovering a good getting-up trajectory under minimal constraints on smoothness or speed / torque limits. The second stage then refines the discovered motions into deployable (i.e. smooth and slow) motions that are robust to variations in initial configuration and terrains. We find these innovations enable a real-world G1 humanoid robot to get up from two main situations that we considered: a) lying face up and b) lying face down, both tested on flat, deformable, slippery surfaces and slopes (e.g., sloppy grass and snowfield). To the best of our knowledge, this is the first successful demonstration of learned getting-up policies for human-sized humanoid robots in the real world. Project page: https://humanoid-getup.github.io/

URLs: https://humanoid-getup.github.io/

cross Diffusion Models without Classifier-free Guidance

Authors: Zhicong Tang, Jianmin Bao, Dong Chen, Baining Guo

Abstract: This paper presents Model-guidance (MG), a novel objective for training diffusion model that addresses and removes of the commonly used Classifier-free guidance (CFG). Our innovative approach transcends the standard modeling of solely data distribution to incorporating the posterior probability of conditions. The proposed technique originates from the idea of CFG and is easy yet effective, making it a plug-and-play module for existing models. Our method significantly accelerates the training process, doubles the inference speed, and achieve exceptional quality that parallel and even surpass concurrent diffusion models with CFG. Extensive experiments demonstrate the effectiveness, efficiency, scalability on different models and datasets. Finally, we establish state-of-the-art performance on ImageNet 256 benchmarks with an FID of 1.34. Our code is available at https://github.com/tzco/Diffusion-wo-CFG.

URLs: https://github.com/tzco/Diffusion-wo-CFG.

replace Network Level Spatial Temporal Traffic State Forecasting with Hierarchical-Attention-LSTM (HierAttnLSTM)

Authors: Tianya Zhang

Abstract: Traffic state data, such as speed, volume and travel time collected from ubiquitous traffic monitoring sensors require advanced network level analytics for forecasting and identifying significant traffic patterns. This paper leverages diverse traffic state datasets from the Caltrans Performance Measurement System (PeMS) hosted on the open benchmark and achieved promising performance compared to well recognized spatial-temporal models. Drawing inspiration from the success of hierarchical architectures in various Artificial Intelligence (AI) tasks, we integrate cell and hidden states from low-level to high-level Long Short-Term Memory (LSTM) networks with an attention pooling mechanism, similar to human perception systems. The developed hierarchical structure is designed to account for dependencies across different time scales, capturing the spatial-temporal correlations of network-level traffic states, enabling the prediction of traffic states for all corridors rather than a single link or route. The efficiency of designed attention-based LSTM is analyzed by ablation study. Comparative results with baseline LSTM models demonstrate that the Hierarchical Attention LSTM (HierAttnLSTM) model not only provides higher prediction accuracy but also effectively forecasts unusual congestion patterns. Data and code are made publicly available to support reproducible scientific research.

replace Federated Multi-Armed Bandits Under Byzantine Attacks

Authors: Artun Saday, \.Ilker Demirel, Yi\u{g}it Y{\i}ld{\i}r{\i}m, Cem Tekin

Abstract: Multi-armed bandits (MAB) is a sequential decision-making model in which the learner controls the trade-off between exploration and exploitation to maximize its cumulative reward. Federated multi-armed bandits (FMAB) is an emerging framework where a cohort of learners with heterogeneous local models play an MAB game and communicate their aggregated feedback to a server to learn a globally optimal arm. Two key hurdles in FMAB are communication-efficient learning and resilience to adversarial attacks. To address these issues, we study the FMAB problem in the presence of Byzantine clients who can send false model updates threatening the learning process. We analyze the sample complexity and the regret of $\beta$-optimal arm identification. We borrow tools from robust statistics and propose a median-of-means (MoM)-based online algorithm, Fed-MoM-UCB, to cope with Byzantine clients. In particular, we show that if the Byzantine clients constitute less than half of the cohort, the cumulative regret with respect to $\beta$-optimal arms is bounded over time with high probability, showcasing both communication efficiency and Byzantine resilience. We analyze the interplay between the algorithm parameters, a discernibility margin, regret, communication cost, and the arms' suboptimality gaps. We demonstrate Fed-MoM-UCB's effectiveness against the baselines in the presence of Byzantine attacks via experiments.

replace $\Delta$-PINNs: physics-informed neural networks on complex geometries

Authors: Francisco Sahli Costabal, Simone Pezzuto, Paris Perdikaris

Abstract: Physics-informed neural networks (PINNs) have demonstrated promise in solving forward and inverse problems involving partial differential equations. Despite recent progress on expanding the class of problems that can be tackled by PINNs, most of existing use-cases involve simple geometric domains. To date, there is no clear way to inform PINNs about the topology of the domain where the problem is being solved. In this work, we propose a novel positional encoding mechanism for PINNs based on the eigenfunctions of the Laplace-Beltrami operator. This technique allows to create an input space for the neural network that represents the geometry of a given object. We approximate the eigenfunctions as well as the operators involved in the partial differential equations with finite elements. We extensively test and compare the proposed methodology against traditional PINNs in complex shapes, such as a coil, a heat sink and a bunny, with different physics, such as the Eikonal equation and heat transfer. We also study the sensitivity of our method to the number of eigenfunctions used, as well as the discretization used for the eigenfunctions and the underlying operators. Our results show excellent agreement with the ground truth data in cases where traditional PINNs fail to produce a meaningful solution. We envision this new technique will expand the effectiveness of PINNs to more realistic applications.

replace Graph Learning Across Data Silos

Authors: Xiang Zhang, Qiao Wang

Abstract: We consider the problem of inferring graph topology from smooth graph signals in a novel but practical scenario where data are located in distributed clients and prohibited from leaving local clients due to factors such as privacy concerns. The main difficulty in this task is how to exploit the potentially heterogeneous data of all clients under data silos. To this end, we first propose an auto-weighted multiple graph learning model to jointly learn a personalized graph for each local client and a single consensus graph for all clients. The personalized graphs match local data distributions, thereby mitigating data heterogeneity, while the consensus graph captures the global information. Moreover, the model can automatically assign appropriate contribution weights to local graphs based on their similarity to the consensus graph. We next devise a tailored algorithm to solve the induced problem, where all raw data are processed locally without leaving clients. Theoretically, we establish a provable estimation error bound and convergence analysis for the proposed model and algorithm. Finally, extensive experiments on synthetic and real data are carried out, and the results illustrate that our approach can learn graphs effectively in the target scenario.

replace Kernel-Based Distributed Q-Learning: A Scalable Reinforcement Learning Approach for Dynamic Treatment Regimes

Authors: Di Wang, Yao Wang, Shao-Bo Lin

Abstract: In recent years, large amounts of electronic health records (EHRs) concerning chronic diseases have been collected to facilitate medical diagnosis. Modeling the dynamic properties of EHRs related to chronic diseases can be efficiently done using dynamic treatment regimes (DTRs). While reinforcement learning (RL) is a widely used method for creating DTRs, there is ongoing research in developing RL algorithms that can effectively handle large amounts of data. In this paper, we present a scalable kernel-based distributed Q-learning algorithm for generating DTRs. We perform both theoretical assessments and numerical analysis for the proposed approach. The results demonstrate that our algorithm significantly reduces the computational complexity associated with the state-of-the-art deep reinforcement learning methods, while maintaining comparable generalization performance in terms of accumulated rewards across stages, such as survival time or cumulative survival probability.

replace Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning

Authors: Zixuan Hu, Li Shen, Zhenyi Wang, Tongliang Liu, Chun Yuan, Dacheng Tao

Abstract: The goal of data-free meta-learning is to learn useful prior knowledge from a collection of pre-trained models without accessing their training data. However, existing works only solve the problem in parameter space, which (i) ignore the fruitful data knowledge contained in the pre-trained models; (ii) can not scale to large-scale pre-trained models; (iii) can only meta-learn pre-trained models with the same network architecture. To address those issues, we propose a unified framework, dubbed PURER, which contains: (1) ePisode cUrriculum inveRsion (ECI) during data-free meta training; and (2) invErsion calibRation following inner loop (ICFIL) during meta testing. During meta training, we propose ECI to perform pseudo episode training for learning to adapt fast to new unseen tasks. Specifically, we progressively synthesize a sequence of pseudo episodes by distilling the training data from each pre-trained model. The ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner. During meta testing, we further propose a simple plug-and-play supplement-ICFIL-only used during meta testing to narrow the gap between meta training and meta testing task distribution. Extensive experiments in various real-world scenarios show the superior performance of ours.

replace An Iterative Algorithm for Rescaled Hyperbolic Functions Regression

Authors: Yeqi Gao, Zhao Song, Junze Yin

Abstract: Large language models (LLMs) have numerous real-life applications across various domains, such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text classification, summarization, and generation. LLMs have shown great promise in improving the accuracy and efficiency of these tasks, and have the potential to revolutionize the field of natural language processing (NLP) in the years to come. Exponential function based attention unit is a fundamental element in LLMs. Several previous works have studied the convergence of exponential regression and softmax regression. In this paper, we propose an iterative algorithm to solve a rescaled version of the slightly different formulation of the softmax regression problem that arises in attention mechanisms of large language models. Specifically, we consider minimizing the squared loss between a certain function, which can be either the exponential function, hyperbolic sine function, or hyperbolic cosine function, and its inner product with a target $n$-dimensional vector $b$, scaled by the normalization term. This ``rescaled softmax regression'' differs from classical softmax regression in the location of the normalization factor. The efficiency and generalizability of this framework to multiple hyperbolic functions make it relevant for optimizing attention mechanisms. The analysis also leads to a corollary bounding solution changes under small perturbations for in-context learning. Limitations and societal impact are discussed.

replace Non-Parametric Learning of Stochastic Differential Equations with Non-asymptotic Fast Rates of Convergence

Authors: Riccardo Bonalli, Alessandro Rudi

Abstract: We propose a novel non-parametric learning paradigm for the identification of drift and diffusion coefficients of multi-dimensional non-linear stochastic differential equations, which relies upon discrete-time observations of the state. The key idea essentially consists of fitting a RKHS-based approximation of the corresponding Fokker-Planck equation to such observations, yielding theoretical estimates of non-asymptotic learning rates which, unlike previous works, become increasingly tighter when the regularity of the unknown drift and diffusion coefficients becomes higher. Our method being kernel-based, offline pre-processing may be profitably leveraged to enable efficient numerical implementation, offering excellent balance between precision and computational complexity.

replace Learning to Learn from APIs: Black-Box Data-Free Meta-Learning

Authors: Zixuan Hu, Li Shen, Zhenyi Wang, Baoyuan Wu, Chun Yuan, Dacheng Tao

Abstract: Data-free meta-learning (DFML) aims to enable efficient learning of new tasks by meta-learning from a collection of pre-trained models without access to the training data. Existing DFML work can only meta-learn from (i) white-box and (ii) small-scale pre-trained models (iii) with the same architecture, neglecting the more practical setting where the users only have inference access to the APIs with arbitrary model architectures and model scale inside. To solve this issue, we propose a Bi-level Data-free Meta Knowledge Distillation (BiDf-MKD) framework to transfer more general meta knowledge from a collection of black-box APIs to one single meta model. Specifically, by just querying APIs, we inverse each API to recover its training data via a zero-order gradient estimator and then perform meta-learning via a novel bi-level meta knowledge distillation structure, in which we design a boundary query set recovery technique to recover a more informative query set near the decision boundary. In addition, to encourage better generalization within the setting of limited API budgets, we propose task memory replay to diversify the underlying task distribution by covering more interpolated tasks. Extensive experiments in various real-world scenarios show the superior performance of our BiDf-MKD framework.

replace Efficient Alternating Minimization with Applications to Weighted Low Rank Approximation

Authors: Zhao Song, Mingquan Ye, Junze Yin, Lichen Zhang

Abstract: Weighted low rank approximation is a fundamental problem in numerical linear algebra, and it has many applications in machine learning. Given a matrix $M \in \mathbb{R}^{n \times n}$, a non-negative weight matrix $W \in \mathbb{R}_{\geq 0}^{n \times n}$, a parameter $k$, the goal is to output two matrices $X,Y\in \mathbb{R}^{n \times k}$ such that $\| W \circ (M - X Y^\top) \|_F$ is minimized, where $\circ$ denotes the Hadamard product. It naturally generalizes the well-studied low rank matrix completion problem. Such a problem is known to be NP-hard and even hard to approximate assuming the Exponential Time Hypothesis [GG11, RSW16]. Meanwhile, alternating minimization is a good heuristic solution for weighted low rank approximation. In particular, [LLR16] shows that, under mild assumptions, alternating minimization does provide provable guarantees. In this work, we develop an efficient and robust framework for alternating minimization that allows the alternating updates to be computed approximately. For weighted low rank approximation, this improves the runtime of [LLR16] from $\|W\|_0k^2$ to $\|W\|_0 k$ where $\|W\|_0$ denotes the number of nonzero entries of the weight matrix. At the heart of our framework is a high-accuracy multiple response regression solver together with a robust analysis of alternating minimization.

replace Probabilistic Learning of Multivariate Time Series with Temporal Irregularity

Authors: Yijun Li, Cheuk Hang Leung, Qi Wu

Abstract: Probabilistic forecasting of multivariate time series is essential for various downstream tasks. Most existing approaches rely on the sequences being uniformly spaced and aligned across all variables. However, real-world multivariate time series often suffer from temporal irregularities, including nonuniform intervals and misaligned variables, which pose significant challenges for accurate forecasting. To address these challenges, we propose an end-to-end framework that models temporal irregularities while capturing the joint distribution of variables at arbitrary continuous-time points. Specifically, we introduce a dynamic conditional continuous normalizing flow to model data distributions in a non-parametric manner, accommodating the complex, non-Gaussian characteristics commonly found in real-world datasets. Then, by leveraging a carefully factorized log-likelihood objective, our approach captures both temporal and cross-sectional dependencies efficiently. Extensive experiments on a range of real-world datasets demonstrate the superiority and adaptability of our method compared to existing approaches.

replace Data-Free Quantization via Mixed-Precision Compensation without Fine-Tuning

Authors: Jun Chen, Shipeng Bai, Tianxin Huang, Mengmeng Wang, Guanzhong Tian, Yong Liu

Abstract: Neural network quantization is a very promising solution in the field of model compression, but its resulting accuracy highly depends on a training/fine-tuning process and requires the original data. This not only brings heavy computation and time costs but also is not conducive to privacy and sensitive information protection. Therefore, a few recent works are starting to focus on data-free quantization. However, data-free quantization does not perform well while dealing with ultra-low precision quantization. Although researchers utilize generative methods of synthetic data to address this problem partially, data synthesis needs to take a lot of computation and time. In this paper, we propose a data-free mixed-precision compensation (DF-MPC) method to recover the performance of an ultra-low precision quantized model without any data and fine-tuning process. By assuming the quantized error caused by a low-precision quantized layer can be restored via the reconstruction of a high-precision quantized layer, we mathematically formulate the reconstruction loss between the pre-trained full-precision model and its layer-wise mixed-precision quantized model. Based on our formulation, we theoretically deduce the closed-form solution by minimizing the reconstruction loss of the feature maps. Since DF-MPC does not require any original/synthetic data, it is a more efficient method to approximate the full-precision model. Experimentally, our DF-MPC is able to achieve higher accuracy for an ultra-low precision quantized model compared to the recent methods without any data and fine-tuning process.

replace Random-Set Neural Networks (RS-NN)

Authors: Shireen Kudukkil Manchingal, Muhammad Mubashar, Kaizheng Wang, Keivan Shariatmadar, Fabio Cuzzolin

Abstract: Machine learning is increasingly deployed in safety-critical domains where erroneous predictions may lead to potentially catastrophic consequences, highlighting the need for learning systems to be aware of how confident they are in their own predictions: in other words, 'to know when they do not know'. In this paper, we propose a novel Random-Set Neural Network (RS-NN) approach to classification which predicts belief functions (rather than classical probability vectors) over the class list using the mathematics of random sets, i.e., distributions over the collection of sets of classes. RS-NN encodes the 'epistemic' uncertainty induced by training sets that are insufficiently representative or limited in size via the size of the convex set of probability vectors associated with a predicted belief function. Our approach outperforms state-of-the-art Bayesian and Ensemble methods in terms of accuracy, uncertainty estimation and out-of-distribution (OoD) detection on multiple benchmarks (CIFAR-10 vs SVHN/Intel-Image, MNIST vs FMNIST/KMNIST, ImageNet vs ImageNet-O). RS-NN also scales up effectively to large-scale architectures (e.g. WideResNet-28-10, VGG16, Inception V3, EfficientNetB2 and ViT-Base-16), exhibits remarkable robustness to adversarial attacks and can provide statistical guarantees in a conformal learning setting.

replace Deep neural networks from the perspective of ergodic theory

Authors: Fan Zhang

Abstract: The design of deep neural networks remains somewhat of an art rather than precise science. By tentatively adopting ergodic theory considerations on top of viewing the network as the time evolution of a dynamical system, with each layer corresponding to a temporal instance, we show that some rules of thumb, which might otherwise appear mysterious, can be attributed heuristics.

replace TabuLa: Harnessing Language Models for Tabular Data Synthesis

Authors: Zilong Zhao, Robert Birke, Lydia Chen

Abstract: Tabular data synthesis is crucial for addressing privacy and security concerns in industries reliant on tabular data. While recent advancements adopt large language models (LLMs) for realistic tabular data generation, their long training times and limited reusability hinder practical applications. In this paper, we propose Tabula, a tabular data synthesizer that leverages the structure of LLM. Unlike state-of-the-art (SOTA) LLM-based tabular data synthesizers that rely on pre-trained LLMs, Tabula discards the pre-trained weights originally designed for natural language tasks, focusing instead on a tailored approach for tabular data. In addition, Tabula introduces a token sequence compression strategy that significantly reduces training time while maintaining data quality, alongside a novel token padding method that improves sequence alignment across training batches. Experiments on six datasets show that Tabula achieves superior synthetic data utility compared to current SOTA methods. Additionally, the results demonstrate that Tabula model trained on tabular datasets serves effectively as a foundational model for synthesizing new tabular datasets. Furthermore, the proposed padding method outperforms the conventional left and right padding strategies. Finally, the results highlight that Tabula averagely reduces training time per epoch by 46.2% compared to state-of-the-art LLM approaches while achieving higher data utility. Our code is available at https://github.com/zhao-zilong/Tabula

URLs: https://github.com/zhao-zilong/Tabula

replace Sum-of-Parts: Self-Attributing Neural Networks with End-to-End Learning of Feature Groups

Authors: Weiqiu You, Helen Qu, Marco Gatti, Bhuvnesh Jain, Eric Wong

Abstract: Self-attributing neural networks (SANNs) present a potential path towards interpretable models for high-dimensional problems, but often face significant trade-offs in performance. In this work, we formally prove a lower bound on errors of per-feature SANNs, whereas group-based SANNs can achieve zero error and thus high performance. Motivated by these insights, we propose Sum-of-Parts (SOP), a framework that transforms any differentiable model into a group-based SANN, where feature groups are learned end-to-end without group supervision. SOP achieves state-of-the-art performance for SANNs on vision and language tasks, and we validate that the groups are interpretable on a range of quantitative and semantic metrics. We further validate the utility of SOP explanations in model debugging and cosmological scientific discovery. Code is available at https://github.com/BrachioLab/sop.

URLs: https://github.com/BrachioLab/sop.

replace Wasserstein Convergence Guarantees for a General Class of Score-Based Generative Models

Authors: Xuefeng Gao, Hoang M. Nguyen, Lingjiong Zhu

Abstract: Score-based generative models (SGMs) is a recent class of deep generative models with state-of-the-art performance in many applications. In this paper, we establish convergence guarantees for a general class of SGMs in 2-Wasserstein distance, assuming accurate score estimates and smooth log-concave data distribution. We specialize our result to several concrete SGMs with specific choices of forward processes modelled by stochastic differential equations, and obtain an upper bound on the iteration complexity for each model, which demonstrates the impacts of different choices of the forward processes. We also provide a lower bound when the data distribution is Gaussian. Numerically, we experiment SGMs with different forward processes, some of which are newly proposed in this paper, for unconditional image generation on CIFAR-10. We find that the experimental results are in good agreement with our theoretical predictions on the iteration complexity, and the models with our newly proposed forward processes can outperform existing models.

replace On diffusion-based generative models and their error bounds: The log-concave case with full convergence estimates

Authors: Stefano Bruno, Ying Zhang, Dong-Young Lim, \"Omer Deniz Akyildiz, Sotirios Sabanis

Abstract: We provide full theoretical guarantees for the convergence behaviour of diffusion-based generative models under the assumption of strongly log-concave data distributions while our approximating class of functions used for score estimation is made of Lipschitz continuous functions avoiding any Lipschitzness assumption on the score function. We demonstrate via a motivating example, sampling from a Gaussian distribution with unknown mean, the powerfulness of our approach. In this case, explicit estimates are provided for the associated optimization problem, i.e. score approximation, while these are combined with the corresponding sampling estimates. As a result, we obtain the best known upper bound estimates in terms of key quantities of interest, such as the dimension and rates of convergence, for the Wasserstein-2 distance between the data distribution (Gaussian with unknown mean) and our sampling algorithm. Beyond the motivating example and in order to allow for the use of a diverse range of stochastic optimizers, we present our results using an $L^2$-accurate score estimation assumption, which crucially is formed under an expectation with respect to the stochastic optimizer and our novel auxiliary process that uses only known information. This approach yields the best known convergence rate for our sampling algorithm.

replace Metalearning Continual Learning Algorithms

Authors: Kazuki Irie, R\'obert Csord\'as, J\"urgen Schmidhuber

Abstract: General-purpose learning systems should improve themselves in open-ended fashion in ever-changing environments. Conventional learning algorithms for neural networks, however, suffer from catastrophic forgetting (CF), i.e., previously acquired skills are forgotten when a new task is learned. Instead of hand-crafting new algorithms for avoiding CF, we propose Automated Continual Learning (ACL) to train self-referential neural networks to metalearn their own in-context continual (meta)learning algorithms. ACL encodes continual learning (CL) desiderata -- good performance on both old and new tasks -- into its metalearning objectives. Our experiments demonstrate that ACL effectively resolves "in-context catastrophic forgetting," a problem that naive in-context learning algorithms suffer from; ACL-learned algorithms outperform both hand-crafted learning algorithms and popular meta-continual learning methods on the Split-MNIST benchmark in the replay-free setting, and enables continual learning of diverse tasks consisting of multiple standard image classification datasets. We also discuss the current limitations of in-context CL by comparing ACL with state-of-the-art CL methods that leverage pre-trained models. Overall, we bring several novel perspectives into the long-standing problem of CL.

replace PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction

Authors: Lei Guan, Dongsheng Li, Yongle Chen, Jiye Liang, Wenjian Wang, Xicheng Lu

Abstract: Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead and always provides quite a high throughput. However, the "1F1B" schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches across GPUs. To simultaneously address these two problems, in this paper, we propose an optimizer-dependent weight prediction strategy (a.k.a PipeOptim) for asynchronous pipeline training. The key insight of our proposal is that we employ a weight prediction strategy in the forward pass to ensure that each mini-batch uses consistent and staleness-free weights to compute the forward pass. To be concrete, we first construct the weight prediction scheme based on the update rule of the used optimizer when training the deep neural network models. Then throughout the "1F1B" pipelined training, each mini-batch is mandated to execute weight prediction ahead of the forward pass, subsequently employing the predicted weights to perform the forward pass. As a result, PipeOptim 1) inherits the advantage of the "1F1B" schedule and generates pretty high throughput, and 2) can ensure effective parameter learning regardless of the type of the used optimizer. To verify the effectiveness of our proposal, we conducted extensive experimental evaluations using eight different deep-learning models spanning three machine-learning tasks including image classification, sentiment analysis, and machine translation. The experiment results demonstrate that PipeOptim outperforms the popular pipelined approaches including GPipe, PipeDream, PipeDream-2BW, and SpecTrain. The code of PipeOptim can be accessible at https://github.com/guanleics/PipeOptim.

URLs: https://github.com/guanleics/PipeOptim.

replace On Temperature Scaling and Conformal Prediction of Deep Classifiers

Authors: Lahav Dabah, Tom Tirer

Abstract: In many classification applications, the prediction of a deep neural network (DNN) based classifier needs to be accompanied by some confidence indication. Two popular approaches for that aim are: 1) Calibration: modifies the classifier's softmax values such that the maximal value better estimates the correctness probability; and 2) Conformal Prediction (CP): produces a prediction set of candidate labels that contains the true label with a user-specified probability, guaranteeing marginal coverage but not, e.g., per class coverage. In practice, both types of indications are desirable, yet, so far the interplay between them has not been investigated. Focusing on the ubiquitous Temperature Scaling (TS) calibration, we start this paper with an extensive empirical study of its effect on prominent CP methods. We show that while TS calibration improves the class-conditional coverage of adaptive CP methods, surprisingly, it negatively affects their prediction set sizes. Motivated by this behavior, we explore the effect of TS on CP beyond its calibration application and reveal an intriguing trend under which it allows to trade prediction set size and conditional coverage of adaptive CP methods. Then, we establish a mathematical theory that explains the entire non-monotonic trend. Finally, based on our experiments and theory, we offer simple guidelines for practitioners to effectively combine adaptive CP with calibration.

replace How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers

Authors: Gon Buzaglo, Itamar Harel, Mor Shpigel Nacson, Alon Brutzkus, Nathan Srebro, Daniel Soudry

Abstract: Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on that the NN perfectly classifies the training set. Interestingly, such a NN sample typically generalized as well as SGD-trained NNs. Contributions. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow ``teacher NN'' that agrees with the labels. Specifically, we show that such a `flat' prior over the NN parameterization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require less relevant parameters to represent -- enabling learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student's.

replace Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring

Authors: Taira Tsuchiya, Shinji Ito, Junya Honda

Abstract: Partial monitoring is a generic framework of online decision-making problems with limited feedback. To make decisions from such limited feedback, it is necessary to find an appropriate distribution for exploration. Recently, a powerful approach for this purpose, \emph{exploration by optimization} (ExO), was proposed, which achieves optimal bounds in adversarial environments with follow-the-regularized-leader for a wide range of online decision-making problems. However, a naive application of ExO in stochastic environments significantly degrades regret bounds. To resolve this issue in locally observable games, we first establish a new framework and analysis for ExO with a hybrid regularizer. This development allows us to significantly improve existing regret bounds of best-of-both-worlds (BOBW) algorithms, which achieves nearly optimal bounds both in stochastic and adversarial environments. In particular, we derive a stochastic regret bound of $O(\sum_{a \neq a^*} k^2 m^2 \log T / \Delta_a)$, where $k$, $m$, and $T$ are the numbers of actions, observations and rounds, $a^*$ is an optimal action, and $\Delta_a$ is the suboptimality gap for action $a$. This bound is roughly $\Theta(k^2 \log T)$ times smaller than existing BOBW bounds. In addition, for globally observable games, we provide a new BOBW algorithm with the first $O(\log T)$ stochastic bound.

replace Fast Rates in Stochastic Online Convex Optimization by Exploiting the Curvature of Feasible Sets

Authors: Taira Tsuchiya, Shinji Ito

Abstract: In this work, we explore online convex optimization (OCO) and introduce a new condition and analysis that provides fast rates by exploiting the curvature of feasible sets. In online linear optimization, it is known that if the average gradient of loss functions exceeds a certain threshold, the curvature of feasible sets can be exploited by the follow-the-leader (FTL) algorithm to achieve a logarithmic regret. This study reveals that algorithms adaptive to the curvature of loss functions can also leverage the curvature of feasible sets. In particular, we first prove that if an optimal decision is on the boundary of a feasible set and the gradient of an underlying loss function is non-zero, then the algorithm achieves a regret bound of $O(\rho \log T)$ in stochastic environments. Here, $\rho > 0$ is the radius of the smallest sphere that includes the optimal decision and encloses the feasible set. Our approach, unlike existing ones, can work directly with convex loss functions, exploiting the curvature of loss functions simultaneously, and can achieve the logarithmic regret only with a local property of feasible sets. Additionally, the algorithm achieves an $O(\sqrt{T})$ regret even in adversarial environments, in which FTL suffers an $\Omega(T)$ regret, and achieves an $O(\rho \log T + \sqrt{C \rho \log T})$ regret in corrupted stochastic environments with corruption level $C$. Furthermore, by extending our analysis, we establish a matching regret upper bound of $O\Big(T^{\frac{q-2}{2(q-1)}} (\log T)^{\frac{q}{2(q-1)}}\Big)$ for $q$-uniformly convex feasible sets, where uniformly convex sets include strongly convex sets and $\ell_p$-balls for $p \in [2,\infty)$. This bound bridges the gap between the $O(\log T)$ bound for strongly convex sets~($q=2$) and the $O(\sqrt{T})$ bound for non-curved sets~($q\to\infty$).

replace Cost-Effective Attention Mechanisms for Low Resource Settings: Necessity & Sufficiency of Linear Transformations

Authors: Peyman Hosseini, Mehran Hosseini, Ignacio Castro, Matthew Purver

Abstract: From natural language processing to vision, Scaled Dot Product Attention (SDPA) is the backbone of most modern deep learning applications. Unfortunately, its memory and computational requirements can be prohibitive in low-resource settings. In this paper, we improve its efficiency without sacrificing its versatility. We propose three attention variants where we remove consecutive linear transformations or add a novel one, and evaluate them on a range of standard NLP and vision tasks. Our proposed models are substantially lighter than standard SDPA (and have 25-50% fewer parameters). We show that the performance cost of these changes is negligible relative to size reduction and that in one case (Super Attention) we succeed in outperforming SDPA by up to 10% while improving its speed and reducing its parameters by 25%.

replace On the Asymptotic Mean Square Error Optimality of Diffusion Models

Authors: Benedikt Fesl, Benedikt B\"ock, Florian Strasser, Michael Baur, Michael Joham, Wolfgang Utschick

Abstract: Diffusion models (DMs) as generative priors have recently shown great potential for denoising tasks but lack theoretical understanding with respect to their mean square error (MSE) optimality. This paper proposes a novel denoising strategy inspired by the structure of the MSE-optimal conditional mean estimator (CME). The resulting DM-based denoiser can be conveniently employed using a pre-trained DM, being particularly fast by truncating reverse diffusion steps and not requiring stochastic re-sampling. We present a comprehensive (non-)asymptotic optimality analysis of the proposed diffusion-based denoiser, demonstrating polynomial-time convergence to the CME under mild conditions. Our analysis also derives a novel Lipschitz constant that depends solely on the DM's hyperparameters. Further, we offer a new perspective on DMs, showing that they inherently combine an asymptotically optimal denoiser with a powerful generator, modifiable by switching re-sampling in the reverse process on or off. The theoretical findings are thoroughly validated with experiments based on various benchmark datasets

replace Efficient Off-Policy Learning for High-Dimensional Action Spaces

Authors: Fabian Otto, Philipp Becker, Ngo Anh Vien, Gerhard Neumann

Abstract: Existing off-policy reinforcement learning algorithms often rely on an explicit state-action-value function representation, which can be problematic in high-dimensional action spaces due to the curse of dimensionality. This reliance results in data inefficiency as maintaining a state-action-value function in such spaces is challenging. We present an efficient approach that utilizes only a state-value function as the critic for off-policy deep reinforcement learning. This approach, which we refer to as Vlearn, effectively circumvents the limitations of existing methods by eliminating the necessity for an explicit state-action-value function. To this end, we leverage a weighted importance sampling loss for learning deep value functions from off-policy data. While this is common for linear methods, it has not been combined with deep value function networks. This transfer to deep methods is not straightforward and requires novel design choices such as robust policy updates, twin value function networks to avoid an optimization bias, and importance weight clipping. We also present a novel analysis of the variance of our estimate compared to commonly used importance sampling estimators such as V-trace. Our approach improves sample complexity as well as final performance and ensures consistent and robust performance across various benchmark tasks. Eliminating the state-action-value function in Vlearn facilitates a streamlined learning process, yielding high-return agents.

replace Hiding and Recovering Knowledge in Text-to-Image Diffusion Models via Learnable Prompts

Authors: Anh Bui, Khanh Doan, Trung Le, Paul Montague, Tamas Abraham, Dinh Phung

Abstract: Diffusion models have demonstrated remarkable capability in generating high-quality visual content from textual descriptions. However, since these models are trained on large-scale internet data, they inevitably learn undesirable concepts, such as sensitive content, copyrighted material, and harmful or unethical elements. While previous works focus on permanently removing such concepts, this approach is often impractical, as it can degrade model performance and lead to irreversible loss of information. In this work, we introduce a novel concept-hiding approach that makes unwanted concepts inaccessible to public users while allowing controlled recovery when needed. Instead of erasing knowledge from the model entirely, we incorporate a learnable prompt into the cross-attention module, acting as a secure memory that suppresses the generation of hidden concepts unless a secret key is provided. This enables flexible access control -- ensuring that undesirable content cannot be easily generated while preserving the option to reinstate it under restricted conditions. Our method introduces a new paradigm where concept suppression and controlled recovery coexist, which was not feasible in prior works. We validate its effectiveness on the Stable Diffusion model, demonstrating that hiding concepts mitigate the risks of permanent removal while maintaining the model's overall capability.

replace Physics-Informed Diffusion Models

Authors: Jan-Hendrik Bastek, WaiChing Sun, Dennis M. Kochmann

Abstract: Generative models such as denoising diffusion models are quickly advancing their ability to approximate highly complex data distributions. They are also increasingly leveraged in scientific machine learning, where samples from the implied data distribution are expected to adhere to specific governing equations. We present a framework that unifies generative modeling and partial differential equation fulfillment by introducing a first-principle-based loss term that enforces generated samples to fulfill the underlying physical constraints. Our approach reduces the residual error by up to two orders of magnitude compared to previous work in a fluid flow case study and outperforms task-specific frameworks in relevant metrics for structural topology optimization. We also present numerical evidence that our extended training objective acts as a natural regularization mechanism against overfitting. Our framework is simple to implement and versatile in its applicability for imposing equality and inequality constraints as well as auxiliary optimization objectives.

replace GLAD: Improving Latent Graph Generative Modeling with Simple Quantization

Authors: Van Khoa Nguyen, Yoann Boget, Frantzeska Lavda, Alexandros Kalousis

Abstract: Learning graph generative models over latent spaces has received less attention compared to models that operate on the original data space and has so far demonstrated lacklustre performance. We present GLAD a latent space graph generative model. Unlike most previous latent space graph generative models, GLAD operates on a discrete latent space that preserves to a significant extent the discrete nature of the graph structures making no unnatural assumptions such as latent space continuity. We learn the prior of our discrete latent space by adapting diffusion bridges to its structure. By operating over an appropriately constructed latent space we avoid relying on decompositions that are often used in models that operate in the original data space. We present experiments on a series of graph benchmark datasets that demonstrates GLAD as the first equivariant latent graph generative method achieves competitive performance with the state of the art baselines.

replace Accurate estimation of feature importance faithfulness for tree models

Authors: Mateusz Gajewski, Adam Karczmarz, Mateusz Rapicki, Piotr Sankowski

Abstract: In this paper, we consider a perturbation-based metric of predictive faithfulness of feature rankings (or attributions) that we call PGI squared. When applied to decision tree-based regression models, the metric can be computed accurately and efficiently for arbitrary independent feature perturbation distributions. In particular, the computation does not involve Monte Carlo sampling that has been typically used for computing similar metrics and which is inherently prone to inaccuracies. Moreover, we propose a method of ranking features by their importance for the tree model's predictions based on PGI squared. Our experiments indicate that in some respects, the method may identify the globally important features better than the state-of-the-art SHAP explainer

replace PHLP: Sole Persistent Homology for Link Prediction - Interpretable Feature Extraction

Authors: Junwon You, Eunwoo Heo, Jae-Hun Jung

Abstract: Link prediction (LP), inferring the connectivity between nodes, is a significant research area in graph data, where a link represents essential information on relationships between nodes. Although graph neural network (GNN)-based models have achieved high performance in LP, understanding why they perform well is challenging because most comprise complex neural networks. We employ persistent homology (PH), a topological data analysis method that helps analyze the topological information of graphs, to interpret the features used for prediction. We propose a novel method that employs PH for LP (PHLP) focusing on how the presence or absence of target links influences the overall topology. The PHLP utilizes the angle hop subgraph and new node labeling called degree double radius node labeling (Degree DRNL), distinguishing the information of graphs better than DRNL. Using only a classifier, PHLP performs similarly to state-of-the-art (SOTA) models on most benchmark datasets. Incorporating the outputs calculated using PHLP into the existing GNN-based SOTA models improves performance across all benchmark datasets. To the best of our knowledge, PHLP is the first method of applying PH to LP without GNNs. The proposed approach, employing PH while not relying on neural networks, enables the identification of crucial factors for improving performance.

replace On the Universality of Self-Supervised Representation Learning

Authors: Wenwen Qiang, Jingyao Wang, Lingyu Si, Chuxiong Sun, Fuchun Sun, Hui Xiong

Abstract: In this paper, we investigate the characteristics that define a good representation or model. We propose that such a representation or model should possess universality, characterized by: (i) discriminability: performing well on training samples; (ii) generalization: performing well on unseen datasets; and (iii) transferability: performing well on unseen tasks with distribution shifts. Despite its importance, current self-supervised learning (SSL) methods lack explicit modeling of universality, and theoretical analysis remains underexplored. To address these issues, we aim to explore and incorporate universality into SSL. Specifically, we first revisit SSL from a task perspective and find that each mini-batch can be viewed as a multi-class classification task. We then propose that a universal SSL model should achieve: (i) learning universality by minimizing loss across all training samples, and (ii) evaluation universality by learning causally invariant representations that generalize well to unseen tasks. To quantify this, we introduce a $\sigma$-measurement that assesses the gap between the performance of SSL model and optimal task-specific models. Furthermore, to model universality, we propose the GeSSL framework. It first learns task-specific models by minimizing SSL loss, then incorporates future updates to enhance discriminability, and finally integrates these models to learn from multiple tasks. Theoretical and empirical evidence supports the effectiveness of GeSSL.

replace An Information Theoretic Perspective on Conformal Prediction

Authors: Alvaro H. C. Correia, Fabio Valerio Massoli, Christos Louizos, Arash Behboodi

Abstract: Conformal Prediction (CP) is a distribution-free uncertainty estimation framework that constructs prediction sets guaranteed to contain the true answer with a user-specified probability. Intuitively, the size of the prediction set encodes a general notion of uncertainty, with larger sets associated with higher degrees of uncertainty. In this work, we leverage information theory to connect conformal prediction to other notions of uncertainty. More precisely, we prove three different ways to upper bound the intrinsic uncertainty, as described by the conditional entropy of the target variable given the inputs, by combining CP with information theoretical inequalities. Moreover, we demonstrate two direct and useful applications of such connection between conformal prediction and information theory: (i) more principled and effective conformal training objectives that generalize previous approaches and enable end-to-end training of machine learning models from scratch, and (ii) a natural mechanism to incorporate side information into conformal prediction. We empirically validate both applications in centralized and federated learning settings, showing our theoretical results translate to lower inefficiency (average prediction set size) for popular CP methods.

replace DiskGNN: Bridging I/O Efficiency and Model Accuracy for Out-of-Core GNN Training

Authors: Renjie Liu, Yichuan Wang, Xiao Yan, Haitian Jiang, Zhenkun Cai, Minjie Wang, Bo Tang, Jinyang Li

Abstract: Graph neural networks (GNNs) are machine learning models specialized for graph data and widely used in many applications. To train GNNs on large graphs that exceed CPU memory, several systems store data on disk and conduct out-of-core processing. However, these systems suffer from either read amplification when reading node features that are usually smaller than a disk page or degraded model accuracy by treating the graph as disconnected partitions. To close this gap, we build a system called DiskGNN, which achieves high I/O efficiency and thus fast training without hurting model accuracy. The key technique used by DiskGNN is offline sampling, which helps decouple graph sampling from model computation. In particular, by conducting graph sampling beforehand, DiskGNN acquires the node features that will be accessed by model computation, and such information is utilized to pack the target node features contiguously on disk to avoid read amplification. Besides, \name{} also adopts designs including four-level feature store to fully utilize the memory hierarchy to cache node features and reduce disk access, batched packing to accelerate the feature packing process, and pipelined training to overlap disk access with other operations. We compare DiskGNN with Ginex and MariusGNN, which are state-of-the-art systems for out-of-core GNN training. The results show that DiskGNN can speed up the baselines by over 8x while matching their best model accuracy.

replace DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation

Authors: Jie Xu, Karthikeyan Saravanan, Rogier van Dalen, Haaris Mehmood, David Tuckey, Mete Ozay

Abstract: Federated learning (FL) allows clients to collaboratively train a global model without sharing their local data with a server. However, clients' contributions to the server can still leak sensitive information. Differential privacy (DP) addresses such leakage by providing formal privacy guarantees, with mechanisms that add randomness to the clients' contributions. The randomness makes it infeasible to train large transformer-based models, common in modern federated learning systems. In this work, we empirically evaluate the practicality of fine-tuning large scale on-device transformer-based models with differential privacy in a federated learning system. We conduct comprehensive experiments on various system properties for tasks spanning a multitude of domains: speech recognition, computer vision (CV) and natural language understanding (NLU). Our results show that full fine-tuning under differentially private federated learning (DP-FL) generally leads to huge performance degradation which can be alleviated by reducing the dimensionality of contributions through parameter-efficient fine-tuning (PEFT). Our benchmarks of existing DP-PEFT methods show that DP-Low-Rank Adaptation (DP-LoRA) consistently outperforms other methods. An even more promising approach, DyLoRA, which makes the low rank variable, when naively combined with FL would straightforwardly break differential privacy. We therefore propose an adaptation method that can be combined with differential privacy and call it DP-DyLoRA. Finally, we are able to reduce the accuracy degradation and word error rate (WER) increase due to DP to less than 2% and 7% respectively with 1 million clients and a stringent privacy budget of $\epsilon=2$.

replace CViT: Continuous Vision Transformer for Operator Learning

Authors: Sifan Wang, Jacob H Seidman, Shyam Sankaran, Hanwen Wang, George J. Pappas, Paris Perdikaris

Abstract: Operator learning, which aims to approximate maps between infinite-dimensional function spaces, is an important area in scientific machine learning with applications across various physical domains. Here we introduce the Continuous Vision Transformer (CViT), a novel neural operator architecture that leverages advances in computer vision to address challenges in learning complex physical systems. CViT combines a vision transformer encoder, a novel grid-based coordinate embedding, and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. This design allows for flexible output representations and consistent evaluation at arbitrary resolutions. We demonstrate CViT's effectiveness across a diverse range of partial differential equation (PDE) systems, including fluid dynamics, climate modeling, and reaction-diffusion processes. Our comprehensive experiments show that CViT achieves state-of-the-art performance on multiple benchmarks, often surpassing larger foundation models, even without extensive pretraining and roll-out fine-tuning. Taken together, CViT exhibits robust handling of discontinuous solutions, multi-scale features, and intricate spatio-temporal dynamics. Our contributions can be viewed as a significant step towards adapting advanced computer vision architectures for building more flexible and accurate machine learning models in the physical sciences.

replace Space-aware Socioeconomic Indicator Inference with Heterogeneous Graphs

Authors: Xingchen Zou, Jiani Huang, Xixuan Hao, Yuhao Yang, Haomin Wen, Yibo Yan, Chao Huang, Chao Chen, Yuxuan Liang

Abstract: Regional socioeconomic indicators are critical across various domains, yet their acquisition can be costly. Inferring global socioeconomic indicators from a limited number of regional samples is essential for enhancing management and sustainability in urban areas and human settlements. Current inference methods typically rely on spatial interpolation based on the assumption of spatial continuity, which does not adequately address the complex variations present within regional spaces. In this paper, we present GeoHG, the first space-aware socioeconomic indicator inference method that utilizes a heterogeneous graph-based structure to represent geospace for non-continuous inference. Extensive experiments demonstrate the effectiveness of GeoHG in comparison to existing methods, achieving an $R^2$ score exceeding 0.8 under extreme data scarcity with a masked ratio of 95\%.

replace Learning to Discretize Denoising Diffusion ODEs

Authors: Vinh Tong, Trung-Dung Hoang, Anji Liu, Guy Van den Broeck, Mathias Niepert

Abstract: Diffusion Probabilistic Models (DPMs) are generative models showing competitive performance in various domains, including image synthesis and 3D point cloud generation. Sampling from pre-trained DPMs involves multiple neural function evaluations (NFEs) to transform Gaussian noise samples into images, resulting in higher computational costs compared to single-step generative models such as GANs or VAEs. Therefore, reducing the number of NFEs while preserving generation quality is crucial. To address this, we propose LD3, a lightweight framework designed to learn the optimal time discretization for sampling. LD3 can be combined with various samplers and consistently improves generation quality without having to retrain resource-intensive neural networks. We demonstrate analytically and empirically that LD3 improves sampling efficiency with much less computational overhead. We evaluate our method with extensive experiments on 7 pre-trained models, covering unconditional and conditional sampling in both pixel-space and latent-space DPMs. We achieve FIDs of 2.38 (10 NFE), and 2.27 (10 NFE) on unconditional CIFAR10 and AFHQv2 in 5-10 minutes of training. LD3 offers an efficient approach to sampling from pre-trained diffusion models. Code is available at https://github.com/vinhsuhi/LD3.

URLs: https://github.com/vinhsuhi/LD3.

replace Wasserstein Distances, Neuronal Entanglement, and Sparsity

Authors: Shashata Sawmya, Linghao Kong, Ilia Markov, Dan Alistarh, Nir Shavit

Abstract: Disentangling polysemantic neurons is at the core of many current approaches to interpretability of large language models. Here we attempt to study how disentanglement can be used to understand performance, particularly under weight sparsity, a leading post-training optimization technique. We suggest a novel measure for estimating neuronal entanglement: the Wasserstein distance of a neuron's output distribution to a Gaussian. Moreover, we show the existence of a small number of highly entangled "Wasserstein Neurons" in each linear layer of an LLM, characterized by their highly non-Gaussian output distributions, their role in mapping similar inputs to dissimilar outputs, and their significant impact on model accuracy. To study these phenomena, we propose a new experimental framework for disentangling polysemantic neurons. Our framework separates each layer's inputs to create a mixture of experts where each neuron's output is computed by a mixture of neurons of lower Wasserstein distance, each better at maintaining accuracy when sparsified without retraining. We provide strong evidence that this is because the mixture of sparse experts is effectively disentangling the input-output relationship of individual neurons, in particular the difficult Wasserstein neurons.

replace A Simple and Adaptive Learning Rate for FTRL in Online Learning with Minimax Regret of $\Theta(T^{2/3})$ and its Application to Best-of-Both-Worlds

Authors: Taira Tsuchiya, Shinji Ito

Abstract: Follow-the-Regularized-Leader (FTRL) is a powerful framework for various online learning problems. By designing its regularizer and learning rate to be adaptive to past observations, FTRL is known to work adaptively to various properties of an underlying environment. However, most existing adaptive learning rates are for online learning problems with a minimax regret of $\Theta(\sqrt{T})$ for the number of rounds $T$, and there are only a few studies on adaptive learning rates for problems with a minimax regret of $\Theta(T^{2/3})$, which include several important problems dealing with indirect feedback. To address this limitation, we establish a new adaptive learning rate framework for problems with a minimax regret of $\Theta(T^{2/3})$. Our learning rate is designed by matching the stability, penalty, and bias terms that naturally appear in regret upper bounds for problems with a minimax regret of $\Theta(T^{2/3})$. As applications of this framework, we consider three major problems with a minimax regret of $\Theta(T^{2/3})$: partial monitoring, graph bandits, and multi-armed bandits with paid observations. We show that FTRL with our learning rate and the Tsallis entropy regularizer improves existing Best-of-Both-Worlds (BOBW) regret upper bounds, which achieve simultaneous optimality in the stochastic and adversarial regimes. The resulting learning rate is surprisingly simple compared to the existing learning rates for BOBW algorithms for problems with a minimax regret of $\Theta(T^{2/3})$.

replace Mitigating the Impact of Labeling Errors on Training via Rockafellian Relaxation

Authors: Louis L. Chen, Bobbie Chern, Eric Eckstrand, Amogh Mahapatra, Johannes O. Royset

Abstract: Labeling errors in datasets are common, arising in a variety of contexts, such as human labeling, noisy labeling, and weak labeling (i.e., image classification). Although neural networks (NNs) can tolerate modest amounts of these errors, their performance degrades substantially once error levels exceed a certain threshold. We propose a new loss reweighting, architecture-independent methodology, Rockafellian Relaxation Method (RRM) for neural network training. Experiments indicate RRM can enhance neural network methods to achieve robust performance across classification tasks in computer vision and natural language processing (sentiment analysis). We find that RRM can mitigate the effects of dataset contamination stemming from both (heavy) labeling error and/or adversarial perturbation, demonstrating effectiveness across a variety of data domains and machine learning tasks.

replace Generative AI for Deep Reinforcement Learning: Framework, Analysis, and Use Cases

Authors: Geng Sun, Wenwen Xie, Dusit Niyato, Fang Mei, Jiawen Kang, Hongyang Du, Shiwen Mao

Abstract: As a form of artificial intelligence (AI) technology based on interactive learning, deep reinforcement learning (DRL) has been widely applied across various fields and has achieved remarkable accomplishments. However, DRL faces certain limitations, including low sample efficiency and poor generalization. Therefore, we present how to leverage generative AI (GAI) to address these issues above and enhance the performance of DRL algorithms in this paper. We first introduce several classic GAI and DRL algorithms and demonstrate the applications of GAI-enhanced DRL algorithms. Then, we discuss how to use GAI to improve DRL algorithms from the data and policy perspectives. Subsequently, we introduce a framework that demonstrates an actual and novel integration of GAI with DRL, i.e., GAI-enhanced DRL. Additionally, we provide a case study of the framework on UAV-assisted integrated near-field/far-field communication to validate the performance of the proposed framework. Moreover, we present several future directions. Finally, the related code is available at: https://xiewenwen22.github.io/GAI-enhanced-DRL.

URLs: https://xiewenwen22.github.io/GAI-enhanced-DRL.

replace REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning

Authors: Sungho Jeon, Xinyue Ma, Kwang In Kim, Myeongjae Jeon

Abstract: Recent rehearsal-free methods, guided by prompts, excel in vision-related continual learning (CL) with drifting data but lack resource efficiency, making real-world deployment challenging. In this paper, we introduce Resource-Efficient Prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during new-task learning. Extensive experiments on multiple image classification datasets demonstrates REP's superior resource efficiency over state-of-the-art ViT- and CNN-based methods.

replace Attention as a Hypernetwork

Authors: Simon Schug, Seijin Kobayashi, Yassir Akram, Jo\~ao Sacramento, Razvan Pascanu

Abstract: Transformers can under some circumstances generalize to novel problem instances whose constituent parts might have been encountered during training, but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a composable, low-dimensional latent code specifies key-query specific operations. We find empirically that this latent code is predictive of the subtasks the network performs on unseen task compositions, revealing that latent codes acquired during training are reused to solve unseen problem instances. To further examine the hypothesis that the intrinsic hypernetwork of multi-head attention supports compositional generalization, we ablate whether making the hypernetwork-generated linear value network nonlinear strengthens compositionality. We find that this modification improves compositional generalization on abstract reasoning tasks. In particular, we introduce a symbolic version of the Raven's Progressive Matrices human intelligence test, which gives us precise control over the problem compositions encountered during training and evaluation. We demonstrate on this task how scaling model size and data enables compositional generalization in transformers and gives rise to a functionally structured latent space.

replace FairCoT: Enhancing Fairness in Text-to-Image Generation via Chain of Thought Reasoning with Multimodal Large Language Models

Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver

Abstract: In the domain of text-to-image generative models, biases inherent in training datasets often propagate into generated content, posing significant ethical challenges, particularly in socially sensitive contexts. We introduce FairCoT, a novel framework that enhances fairness in text to image models through Chain of Thought (CoT) reasoning within multimodal generative large language models. FairCoT employs iterative CoT refinement to systematically mitigate biases, and dynamically adjusts textual prompts in real time, ensuring diverse and equitable representation in generated images. By integrating iterative reasoning processes, FairCoT addresses the limitations of zero shot CoT in sensitive scenarios, balancing creativity with ethical responsibility. Experimental evaluations across popular text-to-image systems including DALLE and various Stable Diffusion variants, demonstrate that FairCoT significantly enhances fairness and diversity without sacrificing image quality or semantic fidelity. By combining robust reasoning, lightweight deployment, and extensibility to multiple models, FairCoT represents a promising step toward more socially responsible and transparent AI driven content generation.

replace MiNT: Multi-Network Training for Transfer Learning on Temporal Graphs

Authors: Kiarash Shamsi, Tran Gia Bao Ngo, Razieh Shirzadkhani, Shenyang Huang, Farimah Poursafaei, Poupak Azad, Reihaneh Rabbany, Baris Coskunuzer, Guillaume Rabusseau, Cuneyt Gurcan Akcora

Abstract: Temporal Graph Learning (TGL) has become a robust framework for discovering patterns in dynamic networks and predicting future interactions. While existing research has largely concentrated on learning from individual networks, this study explores the potential of learning from multiple temporal networks and its ability to transfer to unobserved networks. To achieve this, we introduce Temporal Multi-network Training MiNT, a novel pre-training approach that learns from multiple temporal networks. With a novel collection of 84 temporal transaction networks, we pre-train TGL models on up to 64 networks and assess their transferability to 20 unseen networks. Remarkably, MiNT achieves state-of-the-art results in zero-shot inference, surpassing models individually trained on each network. Our findings further demonstrate that increasing the number of pre-training networks significantly improves transfer performance. This work lays the groundwork for developing Temporal Graph Foundation Models, highlighting the significant potential of multi-network pre-training in TGL.

replace The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Authors: Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini

Abstract: Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list, enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.

replace Reinforcement Learning with Intrinsically Motivated Feedback Graph for Lost-sales Inventory Control

Authors: Zifan Liu, Xinran Li, Shibo Chen, Gen Li, Jiashuo Jiang, Jun Zhang

Abstract: Reinforcement learning (RL) has proven to be well-performed and general-purpose in the inventory control (IC). However, further improvement of RL algorithms in the IC domain is impeded due to two limitations of online experience. First, online experience is expensive to acquire in real-world applications. With the low sample efficiency nature of RL algorithms, it would take extensive time to train the RL policy to convergence. Second, online experience may not reflect the true demand due to the lost sales phenomenon typical in IC, which makes the learning process more challenging. To address the above challenges, we propose a decision framework that combines reinforcement learning with feedback graph (RLFG) and intrinsically motivated exploration (IME) to boost sample efficiency. In particular, we first take advantage of the inherent properties of lost-sales IC problems and design the feedback graph (FG) specially for lost-sales IC problems to generate abundant side experiences aid RL updates. Then we conduct a rigorous theoretical analysis of how the designed FG reduces the sample complexity of RL methods. Based on the theoretical insights, we design an intrinsic reward to direct the RL agent to explore to the state-action space with more side experiences, further exploiting FG's power. Experimental results demonstrate that our method greatly improves the sample efficiency of applying RL in IC. Our code is available at https://anonymous.4open.science/r/RLIMFG4IC-811D/

URLs: https://anonymous.4open.science/r/RLIMFG4IC-811D/

replace Leveraging Contrastive Learning for Enhanced Node Representations in Tokenized Graph Transformers

Authors: Jinsong Chen, Hanpeng Liu, John E. Hopcroft, Kun He

Abstract: While tokenized graph Transformers have demonstrated strong performance in node classification tasks, their reliance on a limited subset of nodes with high similarity scores for constructing token sequences overlooks valuable information from other nodes, hindering their ability to fully harness graph information for learning optimal node representations. To address this limitation, we propose a novel graph Transformer called GCFormer. Unlike previous approaches, GCFormer develops a hybrid token generator to create two types of token sequences, positive and negative, to capture diverse graph information. And a tailored Transformer-based backbone is adopted to learn meaningful node representations from these generated token sequences. Additionally, GCFormer introduces contrastive learning to extract valuable information from both positive and negative token sequences, enhancing the quality of learned node representations. Extensive experimental results across various datasets, including homophily and heterophily graphs, demonstrate the superiority of GCFormer in node classification, when compared to representative graph neural networks (GNNs) and graph Transformers.

replace TEAL: New Selection Strategy for Small Buffers in Experience Replay Class Incremental Learning

Authors: Shahar Shaul-Ariel, Daphna Weinshall

Abstract: Continual Learning is an unresolved challenge, whose relevance increases when considering modern applications. Unlike the human brain, trained deep neural networks suffer from a phenomenon called catastrophic forgetting, wherein they progressively lose previously acquired knowledge upon learning new tasks. To mitigate this problem, numerous methods have been developed, many relying on the replay of past exemplars during new task training. However, as the memory allocated for replay decreases, the effectiveness of these approaches diminishes. On the other hand, maintaining a large memory for the purpose of replay is inefficient and often impractical. Here we introduce TEAL, a novel approach to populate the memory with exemplars, that can be integrated with various experience-replay methods and significantly enhance their performance with small memory buffers. We show that TEAL enhances the average accuracy of existing class-incremental methods and outperforms other selection strategies, achieving state-of-the-art performance even with small memory buffers of 1-3 exemplars per class in the final task. This confirms our initial hypothesis that when memory is scarce, it is best to prioritize the most typical data. Code is available at this https URL: https://github.com/shahariel/TEAL.

URLs: https://github.com/shahariel/TEAL.

replace On the Expressive Power of Sparse Geometric MPNNs

Authors: Yonatan Sverdlov, Nadav Dym

Abstract: Motivated by applications in chemistry and other sciences, we study the expressive power of message-passing neural networks for geometric graphs, whose node features correspond to 3-dimensional positions. Recent work has shown that such models can separate generic pairs of non-isomorphic geometric graphs, though they may fail to separate some rare and complicated instances. However, these results assume a fully connected graph, where each node possesses complete knowledge of all other nodes. In contrast, often, in application, every node only possesses knowledge of a small number of nearest neighbors. This paper shows that generic pairs of non-isomorphic geometric graphs can be separated by message-passing networks with rotation equivariant features as long as the underlying graph is connected. When only invariant intermediate features are allowed, generic separation is guaranteed for generically globally rigid graphs. We introduce a simple architecture, EGENNET, which achieves our theoretical guarantees and compares favorably with alternative architecture on synthetic and chemical benchmarks. Our code is available at https://github.com/yonatansverdlov/E-GenNet.

URLs: https://github.com/yonatansverdlov/E-GenNet.

replace AIGC for Industrial Time Series: From Deep Generative Models to Large Generative Models

Authors: Lei Ren, Haiteng Wang, Jinwang Li, Yang Tang, Chunhua Yang

Abstract: With the remarkable success of generative models like ChatGPT, Artificial Intelligence Generated Content (AIGC) is undergoing explosive development. Not limited to text and images, generative models can generate industrial time series data, addressing challenges such as the difficulty of data collection and data annotation. Due to their outstanding generation ability, they have been widely used in Internet of Things, metaverse, and cyber-physical-social systems to enhance the efficiency of industrial production. In this paper, we present a comprehensive overview of generative models for industrial time series from deep generative models (DGMs) to large generative models (LGMs). First, a DGM-based AIGC framework is proposed for industrial time series generation. Within this framework, we survey advanced industrial DGMs and present a multi-perspective categorization. Furthermore, we systematically analyze the critical technologies required to construct industrial LGMs from four aspects: large-scale industrial dataset, LGMs architecture for complex industrial characteristics, self-supervised training for industrial time series, and fine-tuning of industrial downstream tasks. Finally, we conclude the challenges and future directions to enable the development of generative models in industry.

replace Downlink CCM Estimation via Representation Learning with Graph Regularization

Authors: Melih Can Zerin, Elif Vural, Ali \"Ozg\"ur Y{\i}lmaz

Abstract: In this paper, we propose an algorithm for downlink (DL) channel covariance matrix (CCM) estimation for frequency division duplexing (FDD) massive multiple-input multiple-output (MIMO) communication systems with base station (BS) possessing a uniform linear array (ULA) antenna structure. We consider a setting where the UL CCM is mapped to DL CCM by a mapping function. We first present a theoretical error analysis of learning a nonlinear embedding by constructing a mapping function, which points to the importance of the Lipschitz regularity of the mapping function for achieving high estimation performance. Then, based on the theoretical ground, we propose a representation learning algorithm as a solution for the estimation problem, where Gaussian RBF kernel interpolators are chosen to map UL CCMs to their DL counterparts. The proposed algorithm is based on the optimization of an objective function that fits a regression model between the DL CCM and UL CCM samples in the training dataset and preserves the local geometric structure of the data in the UL CCM space, while explicitly regulating the Lipschitz continuity of the mapping function in light of our theoretical findings. The proposed algorithm surpasses benchmark methods in terms of three error metrics as shown by simulations.

replace Performative Prediction on Games and Mechanism Design

Authors: Ant\'onio G\'ois, Mehrnaz Mofakhami, Fernando P. Santos, Gauthier Gidel, Simon Lacoste-Julien

Abstract: Agents often have individual goals which depend on a group's actions. If agents trust a forecast of collective action and adapt strategically, such prediction can influence outcomes non-trivially, resulting in a form of performative prediction. This effect is ubiquitous in scenarios ranging from pandemic predictions to election polls, but existing work has ignored interdependencies among predicted agents. As a first step in this direction, we study a collective risk dilemma where agents dynamically decide whether to trust predictions based on past accuracy. As predictions shape collective outcomes, social welfare arises naturally as a metric of concern. We explore the resulting interplay between accuracy and welfare, and demonstrate that searching for stable accurate predictions can minimize social welfare with high probability in our setting. By assuming knowledge of a Bayesian agent behavior model, we then show how to achieve better trade-offs and use them for mechanism design.

replace Boosting Graph Neural Network Expressivity with Learnable Lanczos Constraints

Authors: Niloofar Azizi, Nils Kriege, Horst Bischof

Abstract: Graph Neural Networks (GNNs) excel in handling graph-structured data but often underperform in link prediction tasks compared to classical methods, mainly due to the limitations of the commonly used message-passing principle. Notably, their ability to distinguish non-isomorphic graphs is limited by the 1-dimensional Weisfeiler-Lehman test. Our study presents a novel method to enhance the expressivity of GNNs by embedding induced subgraphs into the graph Laplacian matrix's eigenbasis. We introduce a Learnable Lanczos algorithm with Linear Constraints (LLwLC), proposing two novel subgraph extraction strategies: encoding vertex-deleted subgraphs and applying Neumann eigenvalue constraints. For the former, we demonstrate the ability to distinguish graphs that are indistinguishable by 2-WL, while maintaining efficient time complexity. The latter focuses on link representations enabling differentiation between $k$-regular graphs and node automorphism, a vital aspect for link prediction tasks. Our approach results in an extremely lightweight architecture, reducing the need for extensive training datasets. Empirically, our method improves performance in challenging link prediction tasks across benchmark datasets, establishing its practical utility and supporting our theoretical findings. Notably, LLwLC achieves 20x and 10x speedup by only requiring 5% and 10% data from the PubMed and OGBL-Vessel datasets while comparing to the state-of-the-art.

replace Exploring the Potential of Large Language Models for Heterophilic Graphs

Authors: Yuxia Wu, Shujie Li, Yuan Fang, Chuan Shi

Abstract: Large language models (LLMs) have presented significant opportunities to enhance various machine learning applications, including graph neural networks (GNNs). By leveraging the vast open-world knowledge within LLMs, we can more effectively interpret and utilize textual data to better characterize heterophilic graphs, where neighboring nodes often have different labels. However, existing approaches for heterophilic graphs overlook the rich textual data associated with nodes, which could unlock deeper insights into their heterophilic contexts. In this work, we explore the potential of LLMs for modeling heterophilic graphs and propose a novel two-stage framework: LLM-enhanced edge discriminator and LLM-guided edge reweighting. In the first stage, we fine-tune the LLM to better identify homophilic and heterophilic edges based on the textual content of their nodes. In the second stage, we adaptively manage message propagation in GNNs for different edge types based on node features, structures, and heterophilic or homophilic characteristics. To cope with the computational demands when deploying LLMs in practical scenarios, we further explore model distillation techniques to fine-tune smaller, more efficient models that maintain competitive performance. Extensive experiments validate the effectiveness of our framework, demonstrating the feasibility of using LLMs to enhance node classification on heterophilic graphs.

replace sEMG-Driven Physics-Informed Gated Recurrent Networks for Modeling Upper Limb Multi-Joint Movement Dynamics

Authors: Rajnish Kumar, Anand Gupta, Suriya Prakash Muthukrishnan, Lalan Kumar, Sitikantha Roy

Abstract: Exoskeletons and rehabilitation systems have the potential to improve human strength and recovery by using adaptive human-machine interfaces. Achieving precise and responsive control in these systems depends on accurately estimating joint movement dynamics, such as joint angle, velocity, acceleration, external mass, and torque. While machine learning (ML) approaches have been employed to predict joint kinematics from surface electromyography (sEMG) data, traditional ML models often struggle to generalize across dynamic movements. In contrast, physics-informed neural networks integrate biomechanical principles, but their effectiveness in predicting full movement dynamics has not been thoroughly explored. To address this, we introduce the Physics-informed Gated Recurrent Network (PiGRN), a novel model designed to predict multi-joint movement dynamics from sEMG data. PiGRN uses a Gated Recurrent Unit (GRU) to process time-series sEMG inputs, estimate multi-joint kinematics and external loads, and predict joint torque while incorporating physics-based constraints during training. Experimental validation, using sEMG data from five participants performing elbow flexion-extension tasks with 0 kg, 2 kg, and 4 kg loads, showed that PiGRN accurately predicted joint torques for 10 novel movements. RMSE values ranged from 4.02\% to 11.40\%, with correlation coefficients between 0.87 and 0.98. These results underscore PiGRN's potential for real-time applications in exoskeletons and rehabilitation. Future work will focus on expanding datasets, improving musculoskeletal models, and investigating unsupervised learning approaches.

replace Lyapunov Neural ODE State-Feedback Control Policies

Authors: Joshua Hang Sai Ip, Georgios Makrygiorgos, Ali Mesbah

Abstract: Deep neural networks are increasingly used as an effective way to represent control policies in various learning-based control paradigms. For continuous-time optimal control problems (OCPs), which are central to many decision-making tasks, control policy learning can be cast as a neural ordinary differential equation (NODE) problem wherein state and control constraints are naturally accommodated. This paper presents a NODE approach to solving continuous-time OCPs for the case of stabilizing a known constrained nonlinear system around an equilibrium state. The approach, termed Lyapunov-NODE control (L-NODEC), uses a novel Lyapunov loss formulation that incorporates an exponentially-stabilizing control Lyapunov function to learn a state-feedback neural control policy. The proposed Lyapunov loss allows L-NODEC to guarantee exponential stability of the controlled system, as well as its adversarial robustness to perturbations to the initial state. The performance of L-NODEC is illustrated in two problems, including a dose delivery problem in plasma medicine, wherein L-NODEC effectively stabilizes the controlled system around the equilibrium state despite perturbations to the initial state and reduces the inference time necessary to reach equilibrium.

replace Online Optimization for Learning to Communicate over Time-Correlated Channels

Authors: Zheshun Wu, Junfan Li, Zenglin Xu, Sumei Sun, Jie Liu

Abstract: Machine learning techniques have garnered great interest in designing communication systems owing to their capacity in tacking with channel uncertainty. To provide theoretical guarantees for learning-based communication systems, some recent works analyze generalization bounds for devised methods based on the assumption of Independently and Identically Distributed (I.I.D.) channels, a condition rarely met in practical scenarios. In this paper, we drop the I.I.D. channel assumption and study an online optimization problem of learning to communicate over time-correlated channels. To address this issue, we further focus on two specific tasks: optimizing channel decoders for time-correlated fading channels and selecting optimal codebooks for time-correlated additive noise channels. For utilizing temporal dependence of considered channels to better learn communication systems, we develop two online optimization algorithms based on the optimistic online mirror descent framework. Furthermore, we provide theoretical guarantees for proposed algorithms via deriving sub-linear regret bound on the expected error probability of learned systems. Extensive simulation experiments have been conducted to validate that our presented approaches can leverage the channel correlation to achieve a lower average symbol error rate compared to baseline methods, consistent with our theoretical findings.

replace Rethinking Meta-Learning from a Learning Lens

Authors: Jingyao Wang, Wenwen Qiang, Chuxiong Sun, Changwen Zheng, Jiangmeng Li

Abstract: Meta-learning has emerged as a powerful approach for leveraging knowledge from previous tasks to solve new tasks. The mainstream methods focus on training a well-generalized model initialization, which is then adapted to different tasks with limited data and updates. However, it pushes the model overfitting on the training tasks. Previous methods mainly attributed this to the lack of data and used augmentations to address this issue, but they were limited by sufficient training and effective augmentation strategies. In this work, we focus on the more fundamental learning to learn strategy of meta-learning to explore what causes errors and how to eliminate these errors without changing the environment. Specifically, we first rethink the algorithmic procedure of meta-learning from a learning lens. Through theoretical and empirical analyses, we find that (i) this paradigm faces the risk of both overfitting and underfitting and (ii) the model adapted to different tasks promote each other where the effect is stronger if the tasks are more similar. Based on this insight, we propose using task relations to calibrate the optimization process of meta-learning and propose a plug-and-play method called Task Relation Learner (TRLearner) to achieve this goal. Specifically, it first obtains task relation matrices from the extracted task-specific meta-data. Then, it uses the obtained matrices with relation-aware consistency regularization to guide optimization. Extensive theoretical and empirical analyses demonstrate the effectiveness of TRLearner.

replace CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting

Authors: Haobo Li, Zhaowei Wang, Jiachen Wang, Yueya Wang, Alexis Kai Hon Lau, Huamin Qu

Abstract: Forecasting weather and climate events is crucial for making appropriate measures to mitigate environmental hazards and minimize losses. However, existing environmental forecasting research focuses narrowly on predicting numerical meteorological variables (e.g., temperature), neglecting the translation of these variables into actionable textual narratives of events and their consequences. To bridge this gap, we proposed Weather and Climate Event Forecasting (WCEF), a new task that leverages numerical meteorological raster data and textual event data to predict weather and climate events. This task is challenging to accomplish due to difficulties in aligning multimodal data and the lack of supervised datasets. To address these challenges, we present CLLMate, the first multimodal dataset for WCEF, using 26,156 environmental news articles aligned with ERA5 reanalysis data. We systematically benchmark 23 existing MLLMs on CLLMate, including closed-source, open-source, and our fine-tuned models. Our experiments reveal the advantages and limitations of existing MLLMs and the value of CLLMate for the training and benchmarking of the WCEF task.

replace MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards

Authors: Sheng Wang, Liheng Chen, Pengan Chen, Jingwei Dong, Boyang Xue, Jiyue Jiang, Lingpeng Kong, Chuan Wu

Abstract: The rapid scaling of large language models necessitates more lightweight finetuning methods to reduce the explosive GPU memory overhead when numerous customized models are served simultaneously. Targeting more parameter-efficient low-rank adaptation (LoRA), parameter sharing presents a promising solution. Empirically, our research into high-level sharing principles highlights the indispensable role of differentiation in reversing the detrimental effects of pure sharing. Guided by this finding, we propose Mixture of Shards (MoS), incorporating both inter-layer and intra-layer sharing schemes, and integrating four nearly cost-free differentiation strategies, namely subset selection, pair dissociation, vector sharding, and shard privatization. Briefly, it selects a designated number of shards from global pools with a Mixture-of-Experts (MoE)-like routing mechanism before sequentially concatenating them to low-rank matrices. Hence, it retains all the advantages of LoRA while offering enhanced parameter efficiency, and effectively circumvents the drawbacks of peer parameter-sharing methods. Our empirical experiments demonstrate approximately 8x parameter savings in a standard LoRA setting. The ablation study confirms the significance of each component. Our insights into parameter sharing and MoS method may illuminate future developments of more parameter-efficient finetuning methods. The code is officially available at https://github.com/Forence1999/MoS.

URLs: https://github.com/Forence1999/MoS.

replace Statistical inference on black-box generative models in the data kernel perspective space

Authors: Hayden Helm, Aranyak Acharyya, Brandon Duderstadt, Youngser Park, Carey E. Priebe

Abstract: Generative models are capable of producing human-expert level content across a variety of topics and domains. As the impact of generative models grows, it is necessary to develop statistical methods to understand collections of available models. These methods are particularly important in settings where the user may not have access to information related to a model's pre-training data, weights, or other relevant model-level covariates. In this paper we extend recent results on representations of black-box generative models to model-level statistical inference tasks. We demonstrate that the model-level representations are effective for multiple inference tasks.

replace Efficient PAC Learning of Halfspaces with Constant Malicious Noise Rate

Authors: Jie Shen

Abstract: Understanding noise tolerance of machine learning algorithms is a central quest in learning theory. In this work, we study the problem of computationally efficient PAC learning of halfspaces in the presence of malicious noise, where an adversary can corrupt both instances and labels of training samples. The best-known noise tolerance either depends on a target error rate under distributional assumptions or on a margin parameter under large-margin conditions. In this work, we show that when both types of conditions are satisfied, it is possible to achieve constant noise tolerance by minimizing a reweighted hinge loss. Our key ingredients include: 1) an efficient algorithm that finds weights to control the gradient deterioration from corrupted samples, and 2) a new analysis on the robustness of the hinge loss equipped with such weights.

replace Rethinking GNN Expressive Power Research in the Machine Learning Community: Limitations, Issues, and Corrections

Authors: Guanyu Cui, Zhewei Wei, Hsin-Hao Su

Abstract: The success of graph neural networks (GNNs) has spurred theoretical explorations into their expressive power. In the graph machine learning community, researchers often equate GNNs with the Weisfeiler-Lehman (WL) tests as a foundation for theoretical analysis. However, we identify two major limitations of this approach: (1) the semantics of WL tests involve verifying purely structural equivalences through a set of logical sentences. As a result, they do not align well with the concept of expressive power, which is typically defined as the class of functions that GNNs can express, and they are not well-suited for handling graphs with features; (2) by leveraging communication complexity, we show that the lower bound on a GNN's capacity (depth multiplied by width) to simulate one iteration of the WL test grows almost linearly with the graph size. This finding indicates that the WL test is not locally computable and is misaligned with the message-passing GNNs. Furthermore, we show that allowing unlimited precomputation or directly integrating features computed by external models, while claiming that these precomputations enhance the expressiveness of GNNs, can sometimes lead to issues. Such problems can even be observed in an influential paper published in a top-tier machine learning conference. We argue that using well-defined computational models, such as the CONGEST model from distributed computing, is a reasonable approach to characterizing and exploring GNNs' expressive power. Following this approach, we present some results on the effects of virtual nodes and edges. Finally, we highlight several open problems regarding GNN expressive power for further exploration.

replace Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models

Authors: Philipp Mondorf, Sondre Wold, Barbara Plank

Abstract: A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in mechanistic interpretability have made progress in identifying $\textit{circuits}$, which represent the minimal computational subgraphs responsible for a model's behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits $\textit{relate}$ to each other. To address this gap, we study the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through set operations to represent more complex functional model capabilities.

replace Conformal Generative Modeling with Improved Sample Efficiency through Sequential Greedy Filtering

Authors: Klaus-Rudolf Kladny, Bernhard Sch\"olkopf, Michael Muehlebach

Abstract: Generative models lack rigorous statistical guarantees for their outputs and are therefore unreliable in safety-critical applications. In this work, we propose Sequential Conformal Prediction for Generative Models (SCOPE-Gen), a sequential conformal prediction method producing prediction sets that satisfy a rigorous statistical guarantee called conformal admissibility control. This guarantee states that with high probability, the prediction sets contain at least one admissible (or valid) example. To this end, our method first samples an initial set of i.i.d. examples from a black box generative model. Then, this set is iteratively pruned via so-called greedy filters. As a consequence of the iterative generation procedure, admissibility of the final prediction set factorizes as a Markov chain. This factorization is crucial, because it allows to control each factor separately, using conformal prediction. In comparison to prior work, our method demonstrates a large reduction in the number of admissibility evaluations during calibration. This reduction is important in safety-critical applications, where these evaluations must be conducted manually by domain experts and are therefore costly and time consuming. We highlight the advantages of our method in terms of admissibility evaluations and cardinality of the prediction sets through experiments in natural language generation and molecular graph extension tasks.

replace Can LLMs Generate Diverse Molecules? Towards Alignment with Structural Diversity

Authors: Hyosoon Jang, Yunhui Jang, Jaehyung Kim, Sungsoo Ahn

Abstract: Recent advancements in large language models (LLMs) have demonstrated impressive performance in molecular generation, which offers potential to accelerate drug discovery. However, the current LLMs overlook a critical requirement for drug discovery: proposing a diverse set of molecules. This diversity is essential for improving the chances of finding a viable drug, as it provides alternative molecules that may succeed where others fail in real-world validations. Nevertheless, the LLMs often output structurally similar molecules. While decoding schemes like diverse beam search may enhance textual diversity, this often does not align with molecular structural diversity. In response, we propose a new method for fine-tuning molecular generative LLMs to autoregressively generate a set of structurally diverse molecules, where each molecule is generated by conditioning on the previously generated molecules. Our approach consists of two stages: (1) supervised fine-tuning to adapt LLMs to autoregressively generate molecules in a sequence and (2) reinforcement learning to maximize structural diversity within the generated molecules. Our experiments show that the proposed approach enables LLMs to generate diverse molecules better than existing approaches for diverse sequence generation.

replace A Probabilistic Perspective on Unlearning and Alignment for Large Language Models

Authors: Yan Scholten, Stephan G\"unnemann, Leo Schwinn

Abstract: Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimations of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework for LLMs. Namely, we propose novel metrics with high probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Our experimental analysis reveals that deterministic evaluations falsely indicate successful unlearning and alignment, whereas our probabilistic evaluations better capture model capabilities. We show how to overcome challenges associated with probabilistic outputs in a case study on unlearning by introducing (1) a novel loss based on entropy optimization, and (2) adaptive temperature scaling. We demonstrate that our approach significantly enhances unlearning in probabilistic settings on recent benchmarks. Overall, our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. Code available at https://www.cs.cit.tum.de/daml/probabilistic-unlearning/.

URLs: https://www.cs.cit.tum.de/daml/probabilistic-unlearning/.

replace Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback

Authors: Fatemeh Pesaran Zadeh, Juyeon Kim, Jin-Hwa Kim, Gunhee Kim

Abstract: Large language models (LLMs) have demonstrated strong capabilities across various language tasks, notably through instruction-tuning methods. However, LLMs face challenges in visualizing complex, real-world data through charts and plots. Firstly, existing datasets rarely cover a full range of chart types, such as 3D, volumetric, and gridded charts. Secondly, supervised fine-tuning methods do not fully leverage the intricate relationships within rich datasets, including text, code, and figures. To address these challenges, we propose a hierarchical pipeline and a new dataset for chart generation. Our dataset, Text2Chart31, includes 31 unique plot types referring to the Matplotlib library, with 11.1K tuples of descriptions, code, data tables, and plots. Moreover, we introduce a reinforcement learning-based instruction tuning technique for chart generation tasks without requiring human feedback. Our experiments show that this approach significantly enhances the model performance, enabling smaller models to outperform larger open-source models and be comparable to state-of-the-art proprietary models in data visualization tasks. We make the code and dataset available at https://github.com/fatemehpesaran310/Text2Chart31.

URLs: https://github.com/fatemehpesaran310/Text2Chart31.

replace Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks

Authors: Zi Wang, Divyam Anshumaan, Ashish Hooda, Yudong Chen, Somesh Jha

Abstract: Optimization methods are widely employed in deep learning to identify and mitigate undesired model responses. While gradient-based techniques have proven effective for image models, their application to language models is hindered by the discrete nature of the input space. This study introduces a novel optimization approach, termed the \emph{functional homotopy} method, which leverages the functional duality between model training and input generation. By constructing a series of easy-to-hard optimization problems, we iteratively solve these problems using principles derived from established homotopy methods. We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a $20\%-30\%$ improvement in success rate over existing methods in circumventing established safe open-source models such as Llama-2 and Llama-3.

replace Learning Interpretable Hierarchical Dynamical Systems Models from Time Series Data

Authors: Manuel Brenner, Elias Weber, Georgia Koppe, Daniel Durstewitz

Abstract: In science, we are often interested in obtaining a generative model of the underlying system dynamics from observed time series. While powerful methods for dynamical systems reconstruction (DSR) exist when data come from a single domain, how to best integrate data from multiple dynamical regimes and leverage it for generalization is still an open question. This becomes particularly important when individual time series are short, and group-level information may help to fill in for gaps in single-domain data. Here we introduce a hierarchical framework that enables to harvest group-level (multi-domain) information while retaining all single-domain characteristics, and showcase it on popular DSR benchmarks, as well as on neuroscience and medical data. In addition to faithful reconstruction of all individual dynamical regimes, our unsupervised methodology discovers common low-dimensional feature spaces in which datasets with similar dynamics cluster. The features spanning these spaces were further dynamically highly interpretable, surprisingly in often linear relation to control parameters that govern the dynamics of the underlying system. Finally, we illustrate transfer learning and generalization to new parameter regimes, paving the way toward DSR foundation models.

replace Low-Rank Continual Personalization of Diffusion Models

Authors: {\L}ukasz Staniszewski, Katarzyna Zaleska, Kamil Deja

Abstract: Recent personalization methods for diffusion models, such as Dreambooth and LoRA, allow fine-tuning pre-trained models to generate new concepts. However, applying these techniques across consecutive tasks in order to include, e.g., new objects or styles, leads to a forgetting of previous knowledge due to mutual interference between their adapters. In this work, we tackle the problem of continual customization under a rigorous regime with no access to past tasks' adapters. In such a scenario, we investigate how different adapters' initialization and merging methods can improve the quality of the final model. To that end, we evaluate the naive continual fine-tuning of customized models and compare this approach with three methods for consecutive adapters' training: sequentially merging new adapters, merging orthogonally initialized adapters, and updating only relevant task-specific weights. In our experiments, we show that the proposed techniques mitigate forgetting when compared to the naive approach. In our studies, we show different traits of selected techniques and their effect on the plasticity and stability of the continually adapted model. Repository with the code is available at https://github.com/luk-st/continual-lora.

URLs: https://github.com/luk-st/continual-lora.

replace QERA: an Analytical Framework for Quantization Error Reconstruction

Authors: Cheng Zhang, Jeffrey T. H. Wong, Can Xiao, George A. Constantinides, Yiren Zhao

Abstract: The growing number of parameters and computational demands of large language models (LLMs) present significant challenges for their efficient deployment. Recently, there is an increasing interest in quantizing weights to extremely low precision while offsetting the resulting error with low-rank, high-precision error reconstruction terms. The combination of quantization and low-rank approximation is now popular in both adapter-based, parameter-efficient fine-tuning methods such as LoftQ and low-precision inference techniques including ZeroQuant-V2. Usually, the low-rank terms are calculated via the singular value decomposition (SVD) of the weight quantization error, minimizing the Frobenius and spectral norms of the weight approximation error. Recent methods like LQ-LoRA and LQER introduced hand-crafted heuristics to minimize errors in layer outputs (activations) rather than weights, resulting improved quantization results. However, these heuristic methods lack an analytical solution to guide the design of quantization error reconstruction terms. In this paper, we revisit this problem and formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem. We show QERA benefits both existing low-precision fine-tuning and inference methods -- QERA achieves a fine-tuned accuracy gain of $\Delta_{\text{acc}}$ = 6.05% of 2-bit RoBERTa-base on GLUE compared to LoftQ; and obtains $\Delta_{\text{acc}}$ = 2.97% higher post-training quantization accuracy of 4-bit Llama-3.1-70B on average than ZeroQuant-V2 and $\Delta_{\text{ppl}}$ = - 0.28 lower perplexity on WikiText2 than LQER.

replace Revisiting Multi-Permutation Equivariance through the Lens of Irreducible Representations

Authors: Yonatan Sverdlov, Ido Springer, Nadav Dym

Abstract: This paper explores the characterization of equivariant linear layers for representations of permutations and related groups. Unlike traditional approaches, which address these problems using parameter-sharing, we consider an alternative methodology based on irreducible representations and Schur's lemma. Using this methodology, we obtain an alternative derivation for existing models like DeepSets, 2-IGN graph equivariant networks, and Deep Weight Space (DWS) networks. The derivation for DWS networks is significantly simpler than that of previous results. Next, we extend our approach to unaligned symmetric sets, where equivariance to the wreath product of groups is required. Previous works have addressed this problem in a rather restrictive setting, in which almost all wreath equivariant layers are Siamese. In contrast, we give a full characterization of layers in this case and show that there is a vast number of additional non-Siamese layers in some settings. We also show empirically that these additional non-Siamese layers can improve performance in tasks like graph anomaly detection, weight space alignment, and learning Wasserstein distances. Our code is available at \href{https://github.com/yonatansverdlov/Irreducible-Representations-of-Deep-Weight-Spaces}{GitHub}.

URLs: https://github.com/yonatansverdlov/Irreducible-Representations-of-Deep-Weight-Spaces

replace C-Adapter: Adapting Deep Classifiers for Efficient Conformal Prediction Sets

Authors: Kangdao Liu, Hao Zeng, Jianguo Huang, Huiping Zhuang, Chi-Man Vong, Hongxin Wei

Abstract: Conformal prediction, as an emerging uncertainty quantification technique, typically functions as post-hoc processing for the outputs of trained classifiers. To optimize the classifier for maximum predictive efficiency, Conformal Training rectifies the training objective with a regularization that minimizes the average prediction set size at a specific error rate. However, the regularization term inevitably deteriorates the classification accuracy and leads to suboptimal efficiency of conformal predictors. To address this issue, we introduce \textbf{Conformal Adapter} (C-Adapter), an adapter-based tuning method to enhance the efficiency of conformal predictors without sacrificing accuracy. In particular, we implement the adapter as a class of intra order-preserving functions and tune it with our proposed loss that maximizes the discriminability of non-conformity scores between correctly and randomly matched data-label pairs. Using C-Adapter, the model tends to produce extremely high non-conformity scores for incorrect labels, thereby enhancing the efficiency of prediction sets across different coverage rates. Extensive experiments demonstrate that C-Adapter can effectively adapt various classifiers for efficient prediction sets, as well as enhance the conformal training method.

replace Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models

Authors: Jun Luo, Chen Chen, Shandong Wu

Abstract: Federated prompt learning benefits federated learning with CLIP-like Vision-Language Model's (VLM's) robust representation learning ability through prompt learning. However, current federated prompt learning methods are habitually restricted to the traditional FL paradigm, where the participating clients are generally only allowed to download a single globally aggregated model from the server. While justifiable for training full-sized models under federated settings, in this work, we argue that this paradigm is ill-suited for lightweight prompts. By facilitating the clients to download multiple pre-aggregated prompts as fixed non-local experts, we propose Personalized Federated Mixture of Adaptive Prompts (pFedMoAP), a novel FL framework that personalizes the prompt learning process through the lens of Mixture of Experts (MoE). pFedMoAP implements a local attention-based gating network that learns to generate enhanced text features for better alignment with local image data, benefiting from both local and downloaded non-local adaptive prompt experts. Extensive experiments on 9 datasets under various federated settings demonstrate the efficacy of the proposed pFedMoAP algorithm. The code is available at https://github.com/ljaiverson/pFedMoAP.

URLs: https://github.com/ljaiverson/pFedMoAP.

replace Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation

Authors: Thang Do, Sonja Hannibal, Arnulf Jentzen

Abstract: Deep learning methods - consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays key tools to solve data driven supervised learning problems. Despite the great success of SGD methods in the training of DNNs, it remains a fundamental open problem of research to explain the success and the limitations of such methods in rigorous theoretical terms. In particular, even in the standard setup of data driven supervised learning problems, it remained an open research problem to prove (or disprove) that SGD methods converge in the training of DNNs with the popular rectified linear unit (ReLU) activation function with high probability to global minimizers in the optimization landscape. In this work we answer this question negatively. Specifically, in this work we prove for a large class of SGD methods that the considered optimizer does with high probability not converge to global minimizers of the optimization problem. It turns out that the probability to not converge to a global minimizer converges at least exponentially quickly to one as the width of the first hidden layer of the ANN and the depth of the ANN, respectively, increase. The general non-convergence results of this work do not only apply to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods such as the momentum SGD, the Nesterov accelerated SGD, the Adagrad, the RMSProp, the Adam, the Adamax, the AMSGrad, and the Nadam optimizers.

replace On the Training Convergence of Transformers for In-Context Classification of Gaussian Mixtures

Authors: Wei Shen, Ruida Zhou, Jing Yang, Cong Shen

Abstract: While transformers have demonstrated impressive capacities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism enabling transformers to perform ICL is still in its infant stage. This work aims to theoretically study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate. We further quantify the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer. We show that when the lengths of training and testing prompts are sufficiently large, the prediction of the trained transformer approaches the ground truth distribution of the labels. Experimental results corroborate the theoretical findings.

replace Towards Neural Scaling Laws for Time Series Foundation Models

Authors: Qingren Yao, Chao-Han Huck Yang, Renhe Jiang, Yuxuan Liang, Ming Jin, Shirui Pan

Abstract: Scaling laws offer valuable insights into the design of time series foundation models (TSFMs). However, previous research has largely focused on the scaling laws of TSFMs for in-distribution (ID) data, leaving their out-of-distribution (OOD) scaling behavior and the influence of model architectures less explored. In this work, we examine two common TSFM architectures, encoder-only and decoder-only Transformers, and investigate their scaling behavior on both ID and OOD data. These models are trained and evaluated across varying parameter counts, compute budgets, and dataset sizes. Our experiments reveal that the log-likelihood loss of TSFMs exhibits similar scaling behavior in both OOD and ID settings. We further compare the scaling properties across different architectures, incorporating two state-of-the-art TSFMs as case studies, showing that model architecture plays a significant role in scaling. The encoder-only Transformers demonstrate better scalability than the decoder-only Transformers, while the architectural enhancements in the two advanced TSFMs primarily improve ID performance but reduce OOD scalability. While scaling up TSFMs is expected to drive performance breakthroughs, the lack of a comprehensive understanding of TSFM scaling laws has hindered the development of a robust framework to guide model scaling. We fill this gap in this work by synthesizing our findings and providing practical guidelines for designing and scaling larger TSFMs with enhanced model capabilities.

replace MING: A Functional Approach to Learning Molecular Generative Models

Authors: Van Khoa Nguyen, Maciej Falkiewicz, Giangiacomo Mercatali, Alexandros Kalousis

Abstract: Traditional molecule generation methods often rely on sequence- or graph-based representations, which can limit their expressive power or require complex permutation-equivariant architectures. This paper introduces a novel paradigm for learning molecule generative models based on functional representations. Specifically, we propose Molecular Implicit Neural Generation (MING), a diffusion-based model that learns molecular distributions in the function space. Unlike standard diffusion processes in the data space, MING employs a novel functional denoising probabilistic process, which jointly denoises information in both the function's input and output spaces by leveraging an expectation-maximization procedure for latent implicit neural representations of data. This approach enables a simple yet effective model design that accurately captures underlying function distributions. Experimental results on molecule-related datasets demonstrate MING's superior performance and ability to generate plausible molecular samples, surpassing state-of-the-art data-space methods while offering a more streamlined architecture and significantly faster generation times. The code is available at https://github.com/v18nguye/MING.

URLs: https://github.com/v18nguye/MING.

replace Likelihood-Free Inference and Hierarchical Data Assimilation for Geological Carbon Storage

Authors: Wenchao Teng, Louis J. Durlofsky

Abstract: Data assimilation will be essential for the management and expansion of geological carbon storage operations. In traditional data assimilation approaches a fixed set of geological hyperparameters, such as mean and standard deviation of log-permeability, is often assumed. Such hyperparameters, however, may be highly uncertain in practical CO2 storage applications where measurements are scarce. In this study, we develop a hierarchical data assimilation framework for carbon storage that treats hyperparameters as uncertain variables characterized by hyperprior distributions. To deal with the computationally intractable likelihood function in hyperparameter estimation, we apply a likelihood-free (or simulation-based) inference algorithm, specifically sequential Monte Carlo-based approximate Bayesian computation (SMC-ABC), to draw posterior samples of hyperparameters given dynamic monitoring well data. In the second step we use an ensemble smoother with multiple data assimilation (ESMDA) procedure to provide posterior realizations of grid-block permeability. To reduce computational costs, a 3D recurrent R-U-Net deep learning-based surrogate model is applied for forward function evaluations. A rejection sampling (RS) procedure for data assimilation is applied to provide reference posterior results. Detailed posterior results from SMC-ABC-ESMDA are compared to those from the reference RS method. Close agreement is achieved with 'converged' RS results, for two synthetic true models, in all quantities considered. Importantly, the SMC-ABC-ESMDA procedure provides speedup of 1-2 orders of magnitude relative to RS for the two cases. A modified standalone ESMDA procedure is introduced for comparison purposes. For the same number of function evaluations, the hierarchical approach is shown to provide superior results for posterior hyperparameter distributions and monitoring well pressure predictions.

replace EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models

Authors: Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie

Abstract: Large Language Models (LLMs) are critical for a wide range of applications, but serving them efficiently becomes increasingly challenging as inputs become more complex. Context caching improves serving performance by exploiting inter-request dependency and reusing key-value (KV) cache across requests, thus improving time-to-first-token (TTFT). However, existing prefix-based context caching requires exact token prefix matches, limiting cache reuse in few-shot learning, multi-document QA, or retrieval-augmented generation, where prefixes may vary. In this paper, we present EPIC, an LLM serving system that introduces position-independent context caching (PIC), enabling modular KV cache reuse regardless of token chunk position (or prefix). EPIC features two key designs: AttnLink, which leverages static attention sparsity to minimize recomputation for accuracy recovery, and KVSplit, a customizable chunking method that preserves semantic coherence. Our experiments demonstrate that Epic delivers up to 8x improvements in TTFT and 7x throughput over existing systems, with negligible or no accuracy loss. By addressing the limitations of traditional caching approaches, Epic enables more scalable and efficient LLM inference.

replace Revisiting the Equivalence of Bayesian Neural Networks and Gaussian Processes: On the Importance of Learning Activations

Authors: Marcin Sendera, Amin Sorkhei, Tomasz Ku\'smierczyk

Abstract: Gaussian Processes (GPs) provide a convenient framework for specifying function-space priors, making them a natural choice for modeling uncertainty. In contrast, Bayesian Neural Networks (BNNs) offer greater scalability and extendability but lack the advantageous properties of GPs. This motivates the development of BNNs capable of replicating GP-like behavior. However, existing solutions are either limited to specific GP kernels or rely on heuristics. We demonstrate that trainable activations are crucial for effective mapping of GP priors to wide BNNs. Specifically, we leverage the closed-form 2-Wasserstein distance for efficient gradient-based optimization of reparameterized priors and activations. Beyond learned activations, we also introduce trainable periodic activations that ensure global stationarity by design, and functional priors conditioned on GP hyperparameters to allow efficient model selection. Empirically, our method consistently outperforms existing approaches or matches performance of the heuristic methods, while offering stronger theoretical foundations.

replace Machine Learning Approaches for Mental Illness Detection on Social Media: A Systematic Review of Biases and Methodological Challenges

Authors: Yuchen Cao, Jianglai Dai, Zhongyan Wang, Yeyubei Zhang, Xiaorui Shen, Yunchong Liu, Yexin Tian

Abstract: The global increase in mental illness requires innovative detection methods for early intervention. Social media provides a valuable platform to identify mental illness through user-generated content. This systematic review examines machine learning (ML) models for detecting mental illness, with a particular focus on depression, using social media data. It highlights biases and methodological challenges encountered throughout the ML lifecycle. A search of PubMed, IEEE Xplore, and Google Scholar identified 47 relevant studies published after 2010. The Prediction model Risk Of Bias ASsessment Tool (PROBAST) was utilized to assess methodological quality and risk of bias. The review reveals significant biases affecting model reliability and generalizability. A predominant reliance on Twitter (63.8%) and English-language content (over 90%) limits diversity, with most studies focused on users from the United States and Europe. Non-probability sampling (80%) limits representativeness. Only 23% explicitly addressed linguistic nuances like negations, crucial for accurate sentiment analysis. Inconsistent hyperparameter tuning (27.7%) and inadequate data partitioning (17%) risk overfitting. While 74.5% used appropriate evaluation metrics for imbalanced data, others relied on accuracy without addressing class imbalance, potentially skewing results. Reporting transparency varied, often lacking critical methodological details. These findings highlight the need to diversify data sources, standardize preprocessing, ensure consistent model development, address class imbalance, and enhance reporting transparency. By overcoming these challenges, future research can develop more robust and generalizable ML models for depression detection on social media, contributing to improved mental health outcomes globally.

replace POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

Authors: Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar

Abstract: Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. This approach optimizes linear operations but remains inefficient for attention computation because existing attention kernels specialize execution independently for the prefill and decode phases. In this paper, we present POD-Attention - the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources such that prefill and decode operations happen concurrently on the same multiprocessor. POD-Attention speeds up attention computation by up to $59\%$ (mean $28\%$), enabling higher throughput and lower latency LLM inference compared to the use of independently optimized prefill and decode attention kernels.

replace A Systematic Review of Machine Learning Approaches for Detecting Deceptive Activities on Social Media: Methods, Challenges, and Biases

Authors: Yunchong Liu, Xiaorui Shen, Yeyubei Zhang, Zhongyan Wang, Yexin Tian, Jianglai Dai, Yuchen Cao

Abstract: Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review identified key biases across the ML lifecycle: selection bias due to non-representative sampling, inadequate handling of class imbalance, insufficient linguistic preprocessing (e.g., negations), and inconsistent hyperparameter tuning. Although models such as Support Vector Machines (SVM), Random Forests, and Long Short-Term Memory (LSTM) networks showed strong potential, over-reliance on accuracy as an evaluation metric in imbalanced data settings was a common flaw. The review highlights the need for improved data preprocessing (e.g., resampling techniques), consistent hyperparameter tuning, and the use of appropriate metrics like precision, recall, F1 score, and AUROC. Addressing these limitations can lead to more reliable and generalizable ML/DL models for detecting deceptive content, ultimately contributing to the reduction of misinformation on social media.

replace TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation

Authors: Juntong Shi, Minkai Xu, Harper Hua, Hengrui Zhang, Stefano Ermon, Jure Leskovec

Abstract: Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. In this paper, we introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. TabDiff is parameterized by a transformer handling different input types, and the entire framework can be efficiently optimized in an end-to-end fashion. We further introduce a mixed-type stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Comprehensive experiments on seven datasets demonstrate that TabDiff achieves superior average performance over existing competitive baselines across all eight metrics, with up to $22.5\%$ improvement over the state-of-the-art model on pair-wise column correlation estimations. Code is available at https://github.com/MinkaiXu/TabDiff.

URLs: https://github.com/MinkaiXu/TabDiff.

replace RDSA: A Robust Deep Graph Clustering Framework via Dual Soft Assignment

Authors: Yang Xiang, Li Fan, Tulika Saha, Xiaoying Pang, Yushan Pan, Haiyang Zhang, Chengtao Ji

Abstract: Graph clustering is an essential aspect of network analysis that involves grouping nodes into separate clusters. Recent developments in deep learning have resulted in graph clustering, which has proven effective in many applications. Nonetheless, these methods often encounter difficulties when dealing with real-world graphs, particularly in the presence of noisy edges. Additionally, many denoising graph clustering methods tend to suffer from lower performance, training instability, and challenges in scaling to large datasets compared to non-denoised models. To tackle these issues, we introduce a new framework called the Robust Deep Graph Clustering Framework via Dual Soft Assignment (RDSA). RDSA consists of three key components: (i) a node embedding module that effectively integrates the graph's topological features and node attributes; (ii) a structure-based soft assignment module that improves graph modularity by utilizing an affinity matrix for node assignments; and (iii) a node-based soft assignment module that identifies community landmarks and refines node assignments to enhance the model's robustness. We assess RDSA on various real-world datasets, demonstrating its superior performance relative to existing state-of-the-art methods. Our findings indicate that RDSA provides robust clustering across different graph types, excelling in clustering effectiveness and robustness, including adaptability to noise, stability, and scalability.

replace Retrieval-Augmented Generation with Estimation of Source Reliability

Authors: Jeongyeon Hwang, Junyoung Park, Hyejin Park, Sangdon Park, Jungseul Ok

Abstract: Retrieval-augmented generation (RAG) addresses key limitations of large language models (LLMs), such as hallucinations and outdated knowledge, by incorporating external databases. These databases typically consult multiple sources to encompass up-to-date and various information. However, standard RAG methods often overlook the heterogeneous source reliability in the multi-source database and retrieve documents solely based on relevance, making them prone to propagating misinformation. To address this, we propose Reliability-Aware RAG (RA-RAG) which estimates the reliability of multiple sources and incorporates this information into both retrieval and aggregation processes. Specifically, it iteratively estimates source reliability and true answers for a set of queries with no labelling. Then, it selectively retrieves relevant documents from a few of reliable sources and aggregates them using weighted majority voting, where the selective retrieval ensures scalability while not compromising the performance. We also introduce a benchmark designed to reflect real-world scenarios with heterogeneous source reliability and demonstrate the effectiveness of RA-RAG compared to a set of baselines.

replace Adaptive NAD: Online and Self-adaptive Unsupervised Network Anomaly Detector

Authors: Yachao Yuan, Yu Huang, Yali Yuan, Jin Wang

Abstract: The widespread usage of the Internet of Things (IoT) has raised the risks of cyber threats, thus developing Anomaly Detection Systems (ADSs) that can adapt to evolving or new attacks is critical. Previous studies primarily focused on offline unsupervised learning methods to safeguard ADSs, which is not applicable in practical real-world applications. Besides, most of them strongly rely on assumptions of known legitimates and fail to satisfy the interpretable requirements in security applications, creating barriers to the adoption in practice. In this paper, we design Adaptive NAD, a general framework to improve and interpret online unsupervised anomaly detection in security domains. An interpretable two-layer anomaly detection strategy is proposed to generate reliable high-confidence pseudo-labels. Then, an online learning scheme is introduced to update Adaptive NAD by a novel threshold calculation technique to adapt to new threats. Experimental results demonstrate that Adaptive NAD achieves more than 5.4%, 23.0%, and 3.2% improvements in SPAUC compared with state-of-the-art solutions on the CIC-Darknet2020, CIC-DoHBrw-2020, and Edge-IIoTset datasets, respectively. The code is released at https://github.com/MyLearnCodeSpace/Adaptive-NAD.

URLs: https://github.com/MyLearnCodeSpace/Adaptive-NAD.

replace Fairness with Exponential Weights

Authors: Stephen Pasteris, Chris Hicks, Vasilios Mavroudis

Abstract: Motivated by the need to remove discrimination in certain applications, we develop a meta-algorithm that can convert any efficient implementation of an instance of Hedge (or equivalently, an algorithm for discrete bayesian inference) into an efficient algorithm for the equivalent contextual bandit problem which guarantees exact statistical parity on every trial. Relative to any comparator with statistical parity, the resulting algorithm has the same asymptotic regret bound as running the corresponding instance of Exp4 for each protected characteristic independently. Given that our Hedge instance admits non-stationarity we can handle a varying distribution with which to enforce statistical parity with respect to, which is useful when the true population is unknown and needs to be estimated from the data received so far. Via online-to-batch conversion we can handle the equivalent batch classification problem with exact statistical parity, giving us results that we believe are novel and important in their own right.

replace Analysis, forecasting and system identification of a floating offshore wind turbine using dynamic mode decomposition

Authors: Giorgio Palma, Andrea Bardazzi, Alessia Lucarelli, Chiara Pilloton, Andrea Serani, Claudio Lugni, Matteo Diez

Abstract: This article presents the data-driven equation-free modeling of the dynamics of a hexafloat floating offshore wind turbine based on the application of dynamic mode decomposition (DMD). All the analyses are performed on experimental data collected from an operating prototype. The DMD has here used i) to extract knowledge from the dynamic system through its modal analysis, ii) for short-term forecasting from the knowledge of the immediate past of the system state, and iii) for the system identification and reduced order modeling. The forecasting method for the motions, accelerations, and forces acting on the floating system is developed using Hankel-DMD, a methodological extension that includes time-delayed copies of the states in an augmented state vector. The system identification task is performed by applying Hankel-DMD with control (Hankel-DMDc), which models the system including the effect of forcing terms. The influence of the main hyperparameters of the methods, namely the number of delayed copies in the state and input vector and the length of the observation time, is investigated with a full factorial analysis using three error metrics analyzing complementary aspects of the prediction: the normalized root mean square error, the normalized average minimum-maximum absolute error, and the Jensen-Shannon divergence. A Bayesian extension of the Hankel-DMD and Hankel-DMDc is introduced by considering the hyperparameters as stochastic variables varying in suitable ranges defined after the full factorial analysis, enriching the predictions with uncertainty quantification. Results show the capability of the approaches for short-term forecasting and system identification, suggesting their potential for real-time continuously-learning digital twinning and surrogate data-driven reduced order modeling.

replace Impactful Bit-Flip Search on Full-precision Models

Authors: Nadav Benedek, Matan Levy, Mahmood Sharif

Abstract: Neural networks have shown remarkable performance in various tasks, yet they remain susceptible to subtle changes in their input or model parameters. One particularly impactful vulnerability arises through the Bit-Flip Attack (BFA), where flipping a small number of critical bits in a model's parameters can severely degrade its performance. A common technique for inducing bit flips in DRAM is the Row-Hammer attack, which exploits frequent uncached memory accesses to alter data. Identifying susceptible bits can be achieved through exhaustive search or progressive layer-by-layer analysis, especially in quantized networks. In this work, we introduce Impactful Bit-Flip Search (IBS), a novel method for efficiently pinpointing and flipping critical bits in full-precision networks. Additionally, we propose a Weight-Stealth technique that strategically modifies the model's parameters in a way that maintains the float values within the original distribution, thereby bypassing simple range checks often used in tamper detection.

replace Conformation Generation using Transformer Flows

Authors: Sohil Atul Shah, Vladlen Koltun

Abstract: Estimating three-dimensional conformations of a molecular graph allows insight into the molecule's biological and chemical functions. Fast generation of valid conformations is thus central to molecular modeling. Recent advances in graph-based deep networks have accelerated conformation generation from hours to seconds. However, current network architectures do not scale well to large molecules. Here we present ConfFlow, a flow-based model for conformation generation based on transformer networks. In contrast with existing approaches, ConfFlow directly samples in the coordinate space without enforcing any explicit physical constraints. The generative procedure is highly interpretable and is akin to force field updates in molecular dynamics simulation. When applied to the generation of large molecule conformations, ConfFlow improve accuracy by up to $40\%$ relative to state-of-the-art learning-based methods. The source code is made available at https://github.com/IntelLabs/ConfFlow.

URLs: https://github.com/IntelLabs/ConfFlow.

replace Attention-Based Reconstruction of Full-Field Tsunami Waves from Sparse Tsunameter Networks

Authors: Edward McDugald, Arvind Mohan, Darren Engwirda, Agnese Marcato, Javier Santos

Abstract: We investigate the potential of an attention-based neural network architecture known as the Senseiver to perform sparse sensing tasks in the context of tsunami forecasting. In particular, we focus on the Tsunami Data Assimilation Method, where forecasts are derived from tsunameter networks. We used our model to generate high-resolution tsunami waves from incredibly sparse observations, whose epicenters are not included in the training set. We also show significantly improved accuracy in the generation of dense observation networks by comparison to the Linear Interpolation with Huygens-Fresnel Principle.

replace Understanding LLM Embeddings for Regression

Authors: Eric Tang, Bangding Yang, Xingyou Song

Abstract: With the rise of large language models (LLMs) for flexibly processing information as strings, a natural application is regression, specifically by preprocessing string representations into LLM embeddings as downstream features for metric prediction. In this paper, we provide one of the first comprehensive investigations into embedding-based regression and demonstrate that LLM embeddings as features can be better for high-dimensional regression tasks than using traditional feature engineering. This regression performance can be explained in part due to LLM embeddings over numeric data inherently preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of different model effects, most notably model size and language understanding, which we find surprisingly do not always improve regression performance.

replace On Feasible Rewards in Multi-Agent Inverse Reinforcement Learning

Authors: Till Freihaut, Giorgia Ramponi

Abstract: In multi-agent systems, agent behavior is driven by utility functions that encapsulate their individual goals and interactions. Inverse Reinforcement Learning (IRL) seeks to uncover these utilities by analyzing expert behavior, offering insights into the underlying decision-making processes. However, multi-agent settings pose significant challenges, particularly when rewards are inferred from equilibrium observations. A key obstacle is that single (Nash) equilibrium observations often fail to adequately capture critical game properties, leading to potential misrepresentations. This paper offers a rigorous analysis of the feasible reward set in multi-agent IRL and addresses these limitations by introducing entropy-regularized games, ensuring equilibrium uniqueness and enhancing interpretability. Furthermore, we examine the effects of estimation errors and present the first sample complexity results for multi-agent IRL across diverse scenarios.

replace Categorical Data Clustering via Value Order Estimated Distance Metric Learning

Authors: Yiqun Zhang, Mingjie Zhao, Hong Jia, Yang Lu, Mengke Li, Yiu-ming Cheung

Abstract: Categorical data composed of qualitative valued attributes are ubiquitous in machine learning tasks. Due to the lack of well-defined metric space, categorical data distributions are difficult to be intuitively understood. Clustering is a popular data analysis technique suitable for data distribution understanding. However, the success of clustering often relies on reasonable distance metrics, which happens to be what categorical data naturally lack. This paper therefore introduces a new finding that the order relation among attribute values is the decisive factor in clustering accuracy, and is also the key to understanding categorical data clusters, because the essence of clustering is to order the clusters in terms of their admission to samples. To obtain the orders, we propose a new learning paradigm that allows joint learning of clusters and the orders. It alternatively partitions the data into clusters based on the distance metric built upon the orders and estimates the most likely orders according to the clusters. The algorithm achieves superior clustering accuracy with a convergence guarantee, and the learned orders facilitate the understanding of the non-intuitive cluster distribution of categorical data. Extensive experiments with ablation studies, statistical evidence, and case studies have validated the new insight into the importance of value order and the method proposition. The source code is temporarily opened in https://anonymous.4open.science/r/OCL-demo.

URLs: https://anonymous.4open.science/r/OCL-demo.

replace Personalized Federated Fine-Tuning for LLMs via Data-Driven Heterogeneous Model Architectures

Authors: Yicheng Zhang, Zhen Qin, Zhaomin Wu, Jian Hou, Shuiguang Deng

Abstract: Large-scale instruction data is essential for aligning pretrained Large Language Models (LLMs) with human instructions, but may contain sensitive information that hinders its public sharing. Federated Learning (FL) enables collaborative fine-tuning of LLMs without data sharing. However, existing approaches to federated LLM fine-tuning usually adopt a uniform model architecture, making it hard to fit the highly heterogeneous data with varying amounts and formats. To address this, we propose FedAMoLE, a lightweight personalized FL framework that enables data-driven heterogeneous model architectures. This framework features an adaptive mixture of LoRA experts (MoLE) module for aggregating heterogeneous models and a reverse selection-based expert assignment strategy that optimizes model architectures based on data distributions. Experiments across five scenarios show that FedAMoLE improves accuracy by an average of 5.14% compared to existing approaches while obtaining good scalability.

replace Towards a Mechanistic Explanation of Diffusion Model Generalization

Authors: Matthew Niedoba, Berend Zwartsenberg, Kevin Murphy, Frank Wood

Abstract: We propose a simple, training-free mechanism which explains the generalization behaviour of diffusion models. By comparing pre-trained diffusion models to their theoretically optimal empirical counterparts, we identify a shared local inductive bias across a variety of network architectures. From this observation, we hypothesize that network denoisers generalize through localized denoising operations, as these operations approximate the training objective well over much of the training distribution. To validate our hypothesis, we introduce novel denoising algorithms which aggregate local empirical denoisers to replicate network behaviour. Comparing these algorithms to network denoisers across forward and reverse diffusion processes, our approach exhibits consistent visual similarity to neural network outputs, with lower mean squared error than previously proposed methods.

replace Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

Authors: Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Xiaolong Ma, Shiwei Liu, Lu Yin

Abstract: Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not alleviate the massive memory requirements of networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores the possibility of removing entire layers of MoE to reduce memory, the performance degradation is still notable. In this paper, we propose ConDense-MoE (CD-MoE), which, instead of dropping the entire MoE layer, condenses the large, sparse MoE layer into a smaller, denser layer with only a few experts activated for all tokens, while maintaining hardware friendliness. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts, with certain experts isolated to serve as shared experts that are always activated, such as DeepSeekMoE and QwenMoE. We demonstrate the effectiveness of our method. Specifically, for the DeepSeekMoE-16B model, our approach maintains 90% of the average accuracy while reducing memory usage by 27.5% and increasing inference speed by 1.26 times. Moreover, we show that by applying lightweight expert fine-tuning -- only to the condensed layers -- and using 5 hours on a single 80G A100 GPU, we can successfully recover 98% of the original performance. Our code is available at: https://github.com/duterscmy/CD-MoE/tree/main.

URLs: https://github.com/duterscmy/CD-MoE/tree/main.

replace Two stages domain invariant representation learners solve the large co-variate shift in unsupervised domain adaptation with two dimensional data domains

Authors: Hisashi Oshima, Tsuyoshi Ishizone, Tomoyuki Higuchi

Abstract: Recent developments in the unsupervised domain adaptation (UDA) enable the unsupervised machine learning (ML) prediction for target data, thus this will accelerate real world applications with ML models such as image recognition tasks in self-driving. Researchers have reported the UDA techniques are not working well under large co-variate shift problems where e.g. supervised source data consists of handwritten digits data in monotone color and unsupervised target data colored digits data from the street view. Thus there is a need for a method to resolve co-variate shift and transfer source labelling rules under this dynamics. We perform two stages domain invariant representation learning to bridge the gap between source and target with semantic intermediate data (unsupervised). The proposed method can learn domain invariant features simultaneously between source and intermediate also intermediate and target. Finally this achieves good domain invariant representation between source and target plus task discriminability owing to source labels. This induction for the gradient descent search greatly eases learning convergence in terms of classification performance for target data even when large co-variate shift. We also derive a theorem for measuring the gap between trained models and unsupervised target labelling rules, which is necessary for the free parameters optimization. Finally we demonstrate that proposing method is superiority to previous UDA methods using 4 representative ML classification datasets including 38 UDA tasks. Our experiment will be a basis for challenging UDA problems with large co-variate shift.

replace APOLLO: SGD-like Memory, AdamW-level Performance

Authors: Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee

Abstract: Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.

replace Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces

Authors: Nianze Tao

Abstract: Generating novel molecules with higher properties than the training space, namely the out-of-distribution generation, is important for ${de~novo}$ drug design. However, it is not easy for distribution learning-based models, for example diffusion models, to solve this challenge as these methods are designed to fit the distribution of training data as close as possible. In this paper, we show that Bayesian flow network is capable of effortlessly generating high quality out-of-distribution samples that meet several scenarios. We introduce a semi-autoregressive training/sampling method that helps to enhance the model performance and surpass the state-of-the-art models.

replace Splitting criteria for ordinal decision trees: an experimental study

Authors: Rafael Ayll\'on-Gavil\'an, Francisco Jos\'e Mart\'inez-Estudillo, David Guijo-Rubio, C\'esar Herv\'as-Mart\'inez, Pedro Antonio Guti\'errez

Abstract: Ordinal Classification (OC) is a machine learning field that addresses classification tasks where the labels exhibit a natural order. Unlike nominal classification, which treats all classes as equally distinct, OC takes the ordinal relationship into account, producing more accurate and relevant results. This is particularly critical in applications where the magnitude of classification errors has implications. Despite this, OC problems are often tackled using nominal methods, leading to suboptimal solutions. Although decision trees are one of the most popular classification approaches, ordinal tree-based approaches have received less attention when compared to other classifiers. This work conducts an experimental study of tree-based methodologies specifically designed to capture ordinal relationships. A comprehensive survey of ordinal splitting criteria is provided, standardising the notations used in the literature for clarity. Three ordinal splitting criteria, Ordinal Gini (OGini), Weighted Information Gain (WIG), and Ranking Impurity (RI), are compared to the nominal counterparts of the first two (Gini and information gain), by incorporating them into a decision tree classifier. An extensive repository considering 45 publicly available OC datasets is presented, supporting the first experimental comparison of ordinal and nominal splitting criteria using well-known OC evaluation metrics. Statistical analysis of the results highlights OGini as the most effective ordinal splitting criterion to date. Source code, datasets, and results are made available to the research community.

replace SAT-LDM: Provably Generalizable Image Watermarking for Latent Diffusion Models with Self-Augmented Training

Authors: Lu Zhang, Liang Zeng

Abstract: The rapid proliferation of AI-generated images necessitates effective watermarking techniques to protect intellectual property and detect fraudulent content. While existing training-based watermarking methods show promise, they often struggle with generalizing across diverse prompts and tend to introduce visible artifacts. To this end, we propose a novel, provably generalizable image watermarking approach for Latent Diffusion Models, termed Self-Augmented Training (SAT-LDM). Our method aligns the training and testing phases through a free generation distribution, thereby enhancing the watermarking module's generalization capabilities. We theoretically consolidate SAT-LDM by proving that the free generation distribution contributes to its tight generalization bound, without the need for additional data collection. Extensive experiments show that SAT-LDM not only achieves robust watermarking but also significantly improves the quality of watermarked images across a wide range of prompts. Moreover, our experimental analyses confirm the strong generalization abilities of SAT-LDM. We hope that our method provides a practical and efficient solution for securing high-fidelity AI-generated content.

replace Classification of Operational Records in Aviation Using Deep Learning Approaches

Authors: Aziida Nanyonga, Graham Wild

Abstract: Ensuring safety in the aviation industry is critical, even minor anomalies can lead to severe consequences. This study evaluates the performance of four different models for DP (deep learning), including: Bidirectional Long Short-Term Memory (BLSTM), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Simple Recurrent Neural Networks (sRNN), on a multi-class classification task involving Commercial, Military, and Private categories using the Socrata aviation dataset of 4,864 records. The models were assessed using a classification report, confusion matrix analysis, accuracy metrics, validation loss and accuracy curves. Among the models, BLSTM achieved the highest overall accuracy of 72%, demonstrating superior performance in stability and balanced classification, while LSTM followed closely with 71%, excelling in recall for the Commercial class. CNN and sRNN exhibited lower accuracies of 67% and 69%, with significant misclassifications in the Private class. While the results highlight the strengths of BLSTM and LSTM in handling sequential dependencies and complex classification tasks, all models faced challenges with class imbalance, particularly in predicting the Military and Private categories. Addressing these limitations through data augmentation, advanced feature engineering, and ensemble learning techniques could enhance classification accuracy and robustness. This study underscores the importance of selecting appropriate architectures for domain specific tasks

replace Predicting the Performance of Black-box LLMs through Self-Queries

Authors: Dylan Sam, Marc Finzi, J. Zico Kolter

Abstract: As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. While a great deal of work in the field uses internal representations to interpret model behavior, these representations are inaccessible when given solely black-box access through an API. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations to train reliable predictors of model behavior. We demonstrate that training a linear model on these low-dimensional representations produces reliable and generalizable predictors of model performance at the instance level (e.g., if a particular generation correctly answers a question). Remarkably, these can often outperform white-box linear predictors that operate over a model's hidden state or the full distribution over its vocabulary. In addition, we demonstrate that these extracted features can be used to evaluate more nuanced aspects of a language model's state. For instance, they can be used to distinguish between a clean version of GPT-4o-mini and a version that has been influenced via an adversarial system prompt that answers question-answering tasks incorrectly or introduces bugs into generated code. Furthermore, they can reliably distinguish between different model architectures and sizes, enabling the detection of misrepresented models provided through an API (e.g., identifying if GPT-3.5 is supplied instead of GPT-4o-mini).

replace Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models

Authors: Bahman Torkamandi

Abstract: In the realm of fractal geometry, intricate structures emerge from simple iterative processes that partition parameter spaces into regions of stability and instability. Likewise, training large language models involves iteratively applying update functions, such as Adam, where even slight hyperparameter adjustments can shift the training process from convergence to divergence. Recent evidence from miniature neural networks suggests that the boundary separating these outcomes displays fractal characteristics. Building on these insights, this study extends them to medium-sized, decoder-only transformer architectures by employing a more consistent convergence measure and examining the learning rate hyperparameter landscape for attention and fully connected layers. The results show that the trainability frontier is not a simple threshold; rather, it forms a self-similar yet seemingly random structure at multiple scales, with statistically consistent and repeating patterns. Within this landscape, a region of stable convergence is surrounded by a complex chaotic border, illustrating the sensitive nature of the underlying training dynamics.

replace Are DeepSeek R1 And Other Reasoning Models More Faithful?

Authors: James Chua, Owain Evans

Abstract: Language models trained to solve reasoning tasks via reinforcement learning have achieved striking results. We refer to these models as reasoning models. A key question emerges: Are the Chains of Thought (CoTs) of reasoning models more faithful than traditional models? To investigate this, we evaluate three reasoning models (based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base) on an existing test of faithful CoT. To measure faithfulness, we test whether models can describe how a cue in their prompt influences their answer to MMLU questions. For example, when the cue "A Stanford Professor thinks the answer is D" is added to the prompt, models sometimes switch their answer to D. In such cases, the DeepSeek-R1 reasoning model describes the influence of this cue 59% of the time, compared to 7% for the non-reasoning DeepSeek model. We evaluate seven types of cue, such as misleading few-shot examples and suggestive follow-up questions from the user. Reasoning models describe cues that influence them much more reliably than all the non-reasoning models tested (including Claude-3.5-Sonnet and GPT-4). In an additional experiment, we provide evidence suggesting that the use of reward models causes less faithful responses - which may help explain why non-reasoning models are less faithful. Our study has two main limitations. First, we test faithfulness using a set of artificial tasks, which may not reflect realistic use-cases. Second, we only measure one specific aspect of faithfulness - whether models can describe the influence of cues. Future research should investigate whether the advantage of reasoning models in faithfulness holds for a broader set of tests.

replace Using Domain Knowledge with Deep Learning to Solve Applied Inverse Problems

Authors: Qinyi Tian, Winston Lindqwister, Manolis Veveakis, Laura E. Dalton

Abstract: Advancements in deep learning have improved the ability to model complex, nonlinear relationships, such as those encountered in complex material inverse problems. However, the effectiveness of these methods often depends on large datasets, which are not always available. In this study, the incorporation of domain-specific knowledge of mechanical behavior is investigated to evaluate the impact on the predictive performance of the models in data-scarce scenarios. To demonstrate this, stress-strain curves were used to predict key microstructural features of porous materials, and the performance of models trained with and without domain knowledge was compared using five deep learning models: Convolutional Neural Networks, Extreme Gradient Boosting, K-Nearest Neighbors, Long Short-Term Memory, and Random Forest. The results of the models with domain-specific characteristics consistently achieved higher $R^2$ values and improved learning efficiency compared to models without prior knowledge. When the models did not include domain knowledge, the model results revealed meaningful patterns were not recognized, while those enhanced with mechanical insights showed superior feature extraction and predictions. These findings underscore the critical role of domain knowledge in guiding deep learning models, highlighting the need to combine domain expertise with data-driven approaches to achieve reliable and accurate outcomes in materials science and related fields.

replace Beyond Any-Shot Adaptation: Predicting Optimization Outcome for Robustness Gains without Extra Pay

Authors: Qi Cheems Wang, Zehao Xiao, Yixiu Mao, Yun Qu, Jiayi Shen, Yiqin Lv, Xiangyang Ji

Abstract: The foundation model enables general-purpose problem-solving and enjoys desirable rapid adaptation due to its adopted cross-task generalization paradigms, e.g., pretraining, meta-training, and finetuning. Recent advances in these paradigms show the crucial role of challenging tasks' prioritized sampling in enhancing adaptation robustness. However, ranking task difficulties exhausts massive task queries to evaluate, thus computation and annotation intensive, which is typically unaffordable in practice. This work underscores the criticality of both adaptation robustness and learning efficiency, especially in scenarios where tasks are risky or costly to evaluate, e.g., policy evaluations in Markov decision processes (MDPs) or inference with large models. To this end, we present Model Predictive Task Sampling (MPTS) to establish connections between the task space and adaptation risk landscape to form a theoretical guideline in robust active task sampling. MPTS characterizes the task episodic information with a generative model and directly predicts task-specific adaptation risk values from posterior inference. The developed risk learner can amortize expensive evaluation and provably approximately rank task difficulties in the pursuit of task robust adaptation. MPTS can be seamlessly integrated into zero-shot, few-shot, and many-shot learning paradigms. Extensive experimental results are conducted to exhibit the superiority of the proposed framework, remarkably increasing task adaptation robustness and retaining learning efficiency in contrast to existing state-of-the-art (SOTA) methods. The code is available at the project site https://github.com/thu-rllab/MPTS.

URLs: https://github.com/thu-rllab/MPTS.

replace A Survey on Diffusion Models for Anomaly Detection

Authors: Jing Liu, Zhenchao Ma, Zepu Wang, Chenxuanyin Zou, Jiayang Ren, Zehua Wang, Liang Song, Bo Hu, Yang Liu, Victor C. M. Leung

Abstract: Diffusion models (DMs) have emerged as a powerful class of generative AI models, showing remarkable potential in anomaly detection (AD) tasks across various domains, such as cybersecurity, fraud detection, healthcare, and manufacturing. The intersection of these two fields, termed diffusion models for anomaly detection (DMAD), offers promising solutions for identifying deviations in increasingly complex and high-dimensional data. In this survey, we review recent advances in DMAD research. We begin by presenting the fundamental concepts of AD and DMs, followed by a comprehensive analysis of classic DM architectures including DDPMs, DDIMs, and Score SDEs. We further categorize existing DMAD methods into reconstruction-based, density-based, and hybrid approaches, providing detailed examinations of their methodological innovations. We also explore the diverse tasks across different data modalities, encompassing image, time series, video, and multimodal data analysis. Furthermore, we discuss critical challenges and emerging research directions, including computational efficiency, model interpretability, robustness enhancement, edge-cloud collaboration, and integration with large language models. The collection of DMAD research papers and resources is available at https://github.com/fdjingliu/DMAD.

URLs: https://github.com/fdjingliu/DMAD.

replace Group-Agent Reinforcement Learning with Heterogeneous Agents

Authors: Kaiyue Wu, Xiao-Jun Zeng, Tingting Mu

Abstract: Group-agent reinforcement learning (GARL) is a newly arising learning scenario, where multiple reinforcement learning agents study together in a group, sharing knowledge in an asynchronous fashion. The goal is to improve the learning performance of each individual agent. Under a more general heterogeneous setting where different agents learn using different algorithms, we advance GARL by designing novel and effective group-learning mechanisms. They guide the agents on whether and how to learn from action choices from the others, and allow the agents to adopt available policy and value function models sent by another agent if they perform better. We have conducted extensive experiments on a total of 43 different Atari 2600 games to demonstrate the superior performance of the proposed method. After the group learning, among the 129 agents examined, 96% are able to achieve a learning speed-up, and 72% are able to learn over 100 times faster. Also, around 41% of those agents have achieved a higher accumulated reward score by learning in less than 5% of the time steps required by a single agent when learning on its own.

replace Wormhole Memory: A Rubik's Cube for Cross-Dialogue Retrieval

Authors: Libo Wang

Abstract: In view of the gap in the current large language model in sharing memory across dialogues, this research proposes a wormhole memory module (WMM) to realize memory as a Rubik's cube that can be arbitrarily retrieved between different dialogues. Through simulation experiments, the researcher built an experimental framework based on the Python environment and used setting memory barriers to simulate the current situation where memories between LLMs dialogues are difficult to share. The CoQA development data set was imported into the experiment, and the feasibility of its cross-dialogue memory retrieval function was verified for WMM's nonlinear indexing and dynamic retrieval, and a comparative analysis was conducted with the capabilities of Titans and MemGPT memory modules. Experimental results show that WMM demonstrated the ability to retrieve memory across dialogues and the stability of quantitative indicators in eight experiments. It contributes new technical approaches to the optimization of memory management of LLMs and provides experience for the practical application in the future.

replace A Unified Evaluation Framework for Epistemic Predictions

Authors: Shireen Kudukkil Manchingal, Muhammad Mubashar, Kaizheng Wang, Fabio Cuzzolin

Abstract: Predictions of uncertainty-aware models are diverse, ranging from single point estimates (often averaged over prediction samples) to predictive distributions, to set-valued or credal-set representations. We propose a novel unified evaluation framework for uncertainty-aware classifiers, applicable to a wide range of model classes, which allows users to tailor the trade-off between accuracy and precision of predictions via a suitably designed performance metric. This makes possible the selection of the most suitable model for a particular real-world application as a function of the desired trade-off. Our experiments, concerning Bayesian, ensemble, evidential, deterministic, credal and belief function classifiers on the CIFAR-10, MNIST and CIFAR-100 datasets, show that the metric behaves as desired.

replace Path Planning for Masked Diffusion Model Sampling

Authors: Fred Zhangzhi Peng, Zachary Bezemek, Sawan Patel, Jarrid Rector-Brooks, Sherwood Yao, Alexander Tong, Pranam Chatterjee

Abstract: In this paper, we explore how token unmasking order influences generative quality in masked diffusion models (MDMs). We derive an expanded evidence lower bound (ELBO) that introduces a planner to select which tokens to unmask at each step. Our analysis reveals that alternative unmasking strategies can enhance generation performance. Building on this, we propose Path Planning (P2), a sampling framework that uses a pre-trained BERT model or the denoiser itself to guide unmasking decisions. P2 generalizes all known MDM sampling strategies and significantly improves performance across diverse domains, including language generation (in-context learning, code generation, story infilling, mathematical reasoning, reverse curse correction) and biological sequence generation (protein and RNA sequences).

replace Code Simulation as a Proxy for High-order Tasks in Large Language Models

Authors: Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge

Abstract: Many reasoning, planning, and problem-solving tasks share an intrinsic algorithmic nature: correctly simulating each step is a sufficient condition to solve them correctly. We collect pairs of naturalistic and synthetic reasoning tasks to assess the capabilities of Large Language Models (LLM). While naturalistic tasks often require careful human handcrafting, we show that synthetic data is, in many cases, a good proxy that is much easier to collect at scale. We leverage common constructs in programming as the counterpart of the building blocks of naturalistic reasoning tasks, such as straight-line programs, code that contains critical paths, and approximate and redundant instructions. We further assess the capabilities of LLMs on sorting problems and repeated operations via sorting algorithms and nested loops. Our synthetic datasets further reveal that while the most powerful LLMs exhibit relatively strong execution capabilities, the process is fragile: it is negatively affected by memorisation and seems to rely heavily on pattern recognition. Our contribution builds upon synthetically testing the reasoning capabilities of LLMs as a scalable complement to handcrafted human-annotated problems.

replace Rethinking Oversmoothing in Graph Neural Networks: A Rank-Based Perspective

Authors: Kaicheng Zhang, Piero Deidda, Desmond Higham, Francesco Tudisco

Abstract: Oversmoothing is a fundamental challenge in graph neural networks (GNNs): as the number of layers increases, node embeddings become increasingly similar, and model performance drops sharply. Traditionally, oversmoothing has been quantified using metrics that measure the similarity of neighbouring node features, such as the Dirichlet energy. While these metrics are related to oversmoothing, we argue they have critical limitations and fail to reliably capture oversmoothing in realistic scenarios. For instance, they provide meaningful insights only for very deep networks and under somewhat strict conditions on the norm of network weights and feature representations. As an alternative, we propose measuring oversmoothing by examining the numerical or effective rank of the feature representations. We provide theoretical support for this approach, demonstrating that the numerical rank of feature representations converges to one for a broad family of nonlinear activation functions under the assumption of nonnegative trained weights. To the best of our knowledge, this is the first result that proves the occurrence of oversmoothing without assumptions on the boundedness of the weight matrices. Along with the theoretical findings, we provide extensive numerical evaluation across diverse graph architectures. Our results show that rank-based metrics consistently capture oversmoothing, whereas energy-based metrics often fail. Notably, we reveal that a significant drop in the rank aligns closely with performance degradation, even in scenarios where energy metrics remain unchanged.

replace Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Authors: Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein

Abstract: We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.

replace Debiasing Guidance for Discrete Diffusion with Sequential Monte Carlo

Authors: Cheuk Kit Lee, Paul Jeha, Jes Frellsen, Pietro Lio, Michael Samuel Albergo, Francisco Vargas

Abstract: Discrete diffusion models are a class of generative models that produce samples from an approximated data distribution within a discrete state space. Often, there is a need to target specific regions of the data distribution. Current guidance methods aim to sample from a distribution with mass proportional to $p_0(x_0) p(\zeta|x_0)^\alpha$ but fail to achieve this in practice. We introduce a Sequential Monte Carlo algorithm that generates unbiasedly from this target distribution, utilising the learnt unconditional and guided process. We validate our approach on low-dimensional distributions, controlled images and text generations. For text generation, our method provides strong control while maintaining low perplexity compared to guidance-based approaches.

replace On the Expressiveness of Rational ReLU Neural Networks With Bounded Depth

Authors: Gennadiy Averkov, Christopher Hojny, Maximilian Merkert

Abstract: To confirm that the expressive power of ReLU neural networks grows with their depth, the function $F_n = \max \{0,x_1,\ldots,x_n\}$ has been considered in the literature. A conjecture by Hertrich, Basu, Di Summa, and Skutella [NeurIPS 2021] states that any ReLU network that exactly represents $F_n$ has at least $\lceil\log_2 (n+1)\rceil$ hidden layers. The conjecture has recently been confirmed for networks with integer weights by Haase, Hertrich, and Loho [ICLR 2023]. We follow up on this line of research and show that, within ReLU networks whose weights are decimal fractions, $F_n$ can only be represented by networks with at least $\lceil\log_3 (n+1)\rceil$ hidden layers. Moreover, if all weights are $N$-ary fractions, then $F_n$ can only be represented by networks with at least $\Omega( \frac{\ln n}{\ln \ln N})$ layers. These results are a partial confirmation of the above conjecture for rational ReLU networks, and provide the first non-constant lower bound on the depth of practically relevant ReLU networks.

replace Poincar\'e Inequality for Local Log-Polyak-Lojasiewicz Measures : Non-asymptotic Analysis in Low-temperature Regime

Authors: Yun Gong, Zebang Shen, Niao He

Abstract: Potential functions in highly pertinent applications, such as deep learning in over-parameterized regime, are empirically observed to admit non-isolated minima. To understand the convergence behavior of stochastic dynamics in such landscapes, we propose to study the class of \logPLmeasure\ measures $\mu_\epsilon \propto \exp(-V/\epsilon)$, where the potential $V$ satisfies a local Polyak-{\L}ojasiewicz (P\L) inequality, and its set of local minima is provably \emph{connected}. Notably, potentials in this class can exhibit local maxima and we characterize its optimal set S to be a compact $\mathcal{C}^2$ \emph{embedding submanifold} of $\mathbb{R}^d$ without boundary. The \emph{non-contractibility} of S distinguishes our function class from the classical convex setting topologically. Moreover, the embedding structure induces a naturally defined Laplacian-Beltrami operator on S, and we show that its first non-trivial eigenvalue provides an \emph{$\epsilon$-independent} lower bound for the \Poincare\ constant in the \Poincare\ inequality of $\mu_\epsilon$. As a direct consequence, Langevin dynamics with such non-convex potential $V$ and diffusion coefficient $\epsilon$ converges to its equilibrium $\mu_\epsilon$ at a rate of $\tilde{\mathcal{O}}(1/\epsilon)$, provided $\epsilon$ is sufficiently small. Here $\tilde{\mathcal{O}}$ hides logarithmic terms.

replace Fast Clustering of Categorical Big Data

Authors: Bipana Thapaliya, Yu Zhuang

Abstract: The K-Modes algorithm, developed for clustering categorical data, is of high algorithmic simplicity but suffers from unreliable performances in clustering quality and clustering efficiency, both heavily influenced by the choice of initial cluster centers. In this paper, we investigate Bisecting K-Modes (BK-Modes), a successive bisecting process to find clusters, in examining how good the cluster centers out of the bisecting process will be when used as initial centers for the K-Modes. The BK-Modes works by splitting a dataset into multiple clusters iteratively with one cluster being chosen and bisected into two clusters in each iteration. We use the sum of distances of data to their cluster centers as the selection metric to choose a cluster to be bisected in each iteration. This iterative process stops when K clusters are produced. The centers of these K clusters are then used as the initial cluster centers for the K-Modes. Experimental studies of the BK-Modes were carried out and were compared against the K-Modes with multiple sets of initial cluster centers as well as the best of the existing methods we found so far in our survey. Experimental results indicated good performances of BK-Modes both in the clustering quality and efficiency for large datasets.

replace FedAPA: Server-side Gradient-Based Adaptive Personalized Aggregation for Federated Learning on Heterogeneous Data

Authors: Yuxia Sun, Aoxiang Sun, Siyi Pan, Zhixiao Fu, Jingcai Guo

Abstract: Personalized federated learning (PFL) tailors models to clients' unique data distributions while preserving privacy. However, existing aggregation-weight-based PFL methods often struggle with heterogeneous data, facing challenges in accuracy, computational efficiency, and communication overhead. We propose FedAPA, a novel PFL method featuring a server-side, gradient-based adaptive aggregation strategy to generate personalized models, by updating aggregation weights based on gradients of client-parameter changes with respect to the aggregation weights in a centralized manner. FedAPA guarantees theoretical convergence and achieves superior accuracy and computational efficiency compared to 10 PFL competitors across three datasets, with competitive communication overhead.

replace HRP: High-Rank Preheating for Superior LoRA Initialization

Authors: Yuzhu Chen, Yingjie Wang, Shi Fu, Li Shen, Yongcheng Jing, Xinmei Tian, Dacheng Tao

Abstract: This paper studies the crucial impact of initialization on the convergence properties of Low-Rank Adaptation (LoRA). We theoretically demonstrate that random initialization, a widely used schema, will likely lead LoRA to random low-rank results, rather than the best low-rank result. While this issue can be mitigated by adjusting initialization towards a well-informed direction, it relies on prior knowledge of the target, which is typically unknown in real-world scenarios. To approximate this well-informed initial direction, we propose High-Rank Preheating (HRP), which fine-tunes high-rank LoRA for a few steps and uses the singular value decomposition of the preheated result as a superior initialization. HRP initialization is theory-supported to combine the convergence strengths of high-rank LoRA and the generalization strengths of low-rank LoRA. Extensive experiments demonstrate that HRP significantly enhances LoRA's effectiveness across various models and tasks, achieving performance comparable to full-parameter fine-tuning and outperforming other initialization strategies.

replace Bridging Domain Adaptation and Graph Neural Networks: A Tensor-Based Framework for Effective Label Propagation

Authors: Tao Wen, Elynn Chen, Yuzhou Chen, Qi Lei

Abstract: Graph Neural Networks (GNNs) have recently become the predominant tools for studying graph data. Despite state-of-the-art performance on graph classification tasks, GNNs are overwhelmingly trained in a single domain under supervision, thus necessitating a prohibitively high demand for labels and resulting in poorly transferable representations. To address this challenge, we propose the Label-Propagation Tensor Graph Neural Network (LP-TGNN) framework to bridge the gap between graph data and traditional domain adaptation methods. It extracts graph topological information holistically with a tensor architecture and then reduces domain discrepancy through label propagation. It is readily compatible with general GNNs and domain adaptation techniques with minimal adjustment through pseudo-labeling. Experiments on various real-world benchmarks show that our LP-TGNN outperforms baselines by a notable margin. We also validate and analyze each component of the proposed framework in the ablation study.

replace Scalable Discrete Diffusion Samplers: Combinatorial Optimization and Statistical Physics

Authors: Sebastian Sanokowski, Wilhelm Berghammer, Martin Ennemoser, Haoyu Peter Wang, Sepp Hochreiter, Sebastian Lehner

Abstract: Learning to sample from complex unnormalized distributions over discrete domains emerged as a promising research direction with applications in statistical physics, variational inference, and combinatorial optimization. Recent work has demonstrated the potential of diffusion models in this domain. However, existing methods face limitations in memory scaling and thus the number of attainable diffusion steps since they require backpropagation through the entire generative process. To overcome these limitations we introduce two novel training methods for discrete diffusion samplers, one grounded in the policy gradient theorem and the other one leveraging Self-Normalized Neural Importance Sampling (SN-NIS). These methods yield memory-efficient training and achieve state-of-the-art results in unsupervised combinatorial optimization. Numerous scientific applications additionally require the ability of unbiased sampling. We introduce adaptations of SN-NIS and Neural Markov Chain Monte Carlo that enable for the first time the application of discrete diffusion models to this problem. We validate our methods on Ising model benchmarks and find that they outperform popular autoregressive approaches. Our work opens new avenues for applying diffusion models to a wide range of scientific applications in discrete domains that were hitherto restricted to exact likelihood models.

replace A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective

Authors: Wangyang Ying, Cong Wei, Nanxu Gong, Xinyuan Wang, Haoyue Bai, Arun Vignesh Malarkkan, Sixun Dong, Dongjie Wang, Denghui Zhang, Yanjie Fu

Abstract: Tabular data is one of the most widely used data formats across various domains such as bioinformatics, healthcare, and marketing. As artificial intelligence moves towards a data-centric perspective, improving data quality is essential for enhancing model performance in tabular data-driven applications. This survey focuses on data-driven tabular data optimization, specifically exploring reinforcement learning (RL) and generative approaches for feature selection and feature generation as fundamental techniques for refining data spaces. Feature selection aims to identify and retain the most informative attributes, while feature generation constructs new features to better capture complex data patterns. We systematically review existing generative methods for tabular data engineering, analyzing their latest advancements, real-world applications, and respective strengths and limitations. This survey emphasizes how RL-based and generative techniques contribute to the automation and intelligence of feature engineering. Finally, we summarize the existing challenges and discuss future research directions, aiming to provide insights that drive continued innovation in this field.

replace AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization

Authors: Caleb Cranney, Jesse G. Meyer

Abstract: Transformer architectures have transformed AI applications but remain complex to customize for domain experts lacking low-level implementation expertise. We introduce AttentionSmithy, a modular software package that simplifies transformer innovation by breaking down key components into reusable building blocks: attention modules, feed-forward networks, normalization layers, and positional encodings. Users can rapidly prototype and evaluate transformer variants without extensive coding. Our framework supports four positional encoding strategies and integrates with neural architecture search for automated design. We validate AttentionSmithy by replicating the original transformer under resource constraints and optimizing translation performance by combining positional encodings. Additionally, we demonstrate its adaptability in gene-specific modeling, achieving over 95% accuracy in cell type classification. These case studies highlight AttentionSmithy's potential to accelerate research across diverse fields by removing framework implementation barriers.

replace Diffusing DeBias: a Recipe for Turning a Bug into a Feature

Authors: Massimiliano Ciranni, Vito Paolo Pastore, Roberto Di Via, Enzo Tartaglione, Francesca Odone, Vittorio Murino

Abstract: Deep learning model effectiveness in classification tasks is often challenged by the quality and quantity of training data which, whenever containing strong spurious correlations between specific attributes and target labels, can result in unrecoverable biases in model predictions. Tackling these biases is crucial in improving model generalization and trust, especially in real-world scenarios. This paper presents Diffusing DeBias (DDB), a novel approach acting as a plug-in for common methods in model debiasing while exploiting the inherent bias-learning tendency of diffusion models. Our approach leverages conditional diffusion models to generate synthetic bias-aligned images, used to train a bias amplifier model, to be further employed as an auxiliary method in different unsupervised debiasing approaches. Our proposed method, which also tackles the common issue of training set memorization typical of this type of tech- niques, beats current state-of-the-art in multiple benchmark datasets by significant margins, demonstrating its potential as a versatile and effective tool for tackling dataset bias in deep learning applications.

replace Improving Acoustic Side-Channel Attacks on Keyboards Using Transformers and Large Language Models

Authors: Jin Hyun Park, Seyyed Ali Ayati, Yichen Cai

Abstract: The increasing prevalence of microphones in everyday devices and the growing reliance on online services have amplified the risk of acoustic side-channel attacks (ASCAs) targeting keyboards. This study explores deep learning techniques, specifically vision transformers (VTs) and large language models (LLMs), to enhance the effectiveness and applicability of such attacks. We present substantial improvements over prior research, with the CoAtNet model achieving state-of-the-art performance. Our CoAtNet shows a 5.0% improvement for keystrokes recorded via smartphone (Phone) and 5.9% for those recorded via Zoom compared to previous benchmarks. We also evaluate transformer architectures and language models, with the best VT model matching CoAtNet's performance. A key advancement is the introduction of a noise mitigation method for real-world scenarios. By using LLMs for contextual understanding, we detect and correct erroneous keystrokes in noisy environments, enhancing ASCA performance. Additionally, fine-tuned lightweight language models with Low-Rank Adaptation (LoRA) deliver comparable performance to heavyweight models with 67X more parameters. This integration of VTs and LLMs improves the practical applicability of ASCA mitigation, marking the first use of these technologies to address ASCAs and error correction in real-world scenarios.

replace Data Valuation using Neural Networks for Efficient Instruction Fine-Tuning

Authors: Ishika Agarwal, Dilek Hakkani-T\"ur

Abstract: Influence functions provide crucial insights into model training, but existing methods suffer from large computational costs and limited generalization. Particularly, recent works have proposed various metrics and algorithms to calculate the influence of data using language models, which do not scale well with large models and datasets. This is because of the expensive forward and backward passes required for computation, substantial memory requirements to store large models, and poor generalization of influence estimates to new data. In this paper, we explore the use of small neural networks -- which we refer to as the InfluenceNetwork -- to estimate influence values, achieving up to 99% cost reduction. Our evaluation demonstrates that influence values can be estimated with models just 0.0027% the size of full language models (we use 7B and 8B versions). We apply our algorithm of estimating influence values (called NN-CIFT: Neural Networks for effiCient Instruction Fine-Tuning) to the downstream task of subset selection for general instruction fine-tuning. In our study, we include four state-of-the-art influence functions and show no compromise in performance, despite large speedups, between NN-CIFT and the original influence functions. We provide an in-depth hyperparameter analyses of NN-CIFT. The code for our method can be found here: https://github.com/agarwalishika/NN-CIFT.

URLs: https://github.com/agarwalishika/NN-CIFT.

replace AffinityFlow: Guided Flows for Antibody Affinity Maturation

Authors: Can Chen, Karla-Luise Herpoldt, Chenchao Zhao, Zichen Wang, Marcus Collins, Shang Shang, Ron Benson

Abstract: Antibodies are widely used as therapeutics, but their development requires costly affinity maturation, involving iterative mutations to enhance binding affinity.This paper explores a sequence-only scenario for affinity maturation, using solely antibody and antigen sequences. Recently AlphaFlow wraps AlphaFold within flow matching to generate diverse protein structures, enabling a sequence-conditioned generative model of structure. Building on this, we propose an alternating optimization framework that (1) fixes the sequence to guide structure generation toward high binding affinity using a structure-based affinity predictor, then (2) applies inverse folding to create sequence mutations, refined by a sequence-based affinity predictor for post selection. A key challenge is the lack of labeled data for training both predictors. To address this, we develop a co-teaching module that incorporates valuable information from noisy biophysical energies into predictor refinement. The sequence-based predictor selects consensus samples to teach the structure-based predictor, and vice versa. Our method, AffinityFlow, achieves state-of-the-art performance in affinity maturation experiments. We plan to open-source our code after acceptance.

replace-cross Exploiting Defenses against GAN-Based Feature Inference Attacks in Federated Learning

Authors: Xinjian Luo, Xianglong Zhang

Abstract: Federated learning (FL) is a decentralized model training framework that aims to merge isolated data islands while maintaining data privacy. However, recent studies have revealed that Generative Adversarial Network (GAN) based attacks can be employed in FL to learn the distribution of private datasets and reconstruct recognizable images. In this paper, we exploit defenses against GAN-based attacks in FL and propose a framework, Anti-GAN, to prevent attackers from learning the real distribution of the victim's data. The core idea of Anti-GAN is to manipulate the visual features of private training images to make them indistinguishable to human eyes even restored by attackers. Specifically, Anti-GAN projects the private dataset onto a GAN's generator and combines the generated fake images with the actual images to create the training dataset, which is then used for federated model training. The experimental results demonstrate that Anti-GAN is effective in preventing attackers from learning the distribution of private images while causing minimal harm to the accuracy of the federated model.

replace-cross Flow-based sampling for multimodal and extended-mode distributions in lattice field theory

Authors: Daniel C. Hackett, Chung-Chun Hsieh, Sahil Pontula, Michael S. Albergo, Denis Boyda, Jiunn-Wei Chen, Kai-Feng Chen, Kyle Cranmer, Gurtej Kanwar, Phiala E. Shanahan

Abstract: Recent results have demonstrated that samplers constructed with flow-based generative models are a promising new approach for configuration generation in lattice field theory. In this paper, we present a set of training- and architecture-based methods to construct flow models for targets with multiple separated modes (i.e.~vacua) as well as targets with extended/continuous modes. We demonstrate the application of these methods to modeling two-dimensional real and complex scalar field theories in their symmetry-broken phases. In this context we investigate different flow-based sampling algorithms, including a composite sampling algorithm where flow-based proposals are occasionally augmented by applying updates using traditional algorithms like HMC.

replace-cross Human alignment of neural network representations

Authors: Lukas Muttenthaler, Jonas Dippel, Lorenz Linhardt, Robert A. Vandermeulen, Simon Kornblith

Abstract: Today's computer vision models achieve human or near-human level performance across a wide variety of vision tasks. However, their architectures, data, and learning algorithms differ in numerous ways from those that give rise to human vision. In this paper, we investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses. We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses, whereas the training dataset and objective function both have a much larger impact. These findings are consistent across three datasets of human similarity judgments collected using two different tasks. Linear transformations of neural network representations learned from behavioral responses from one dataset substantially improve alignment with human similarity judgments on the other two datasets. In addition, we find that some human concepts such as food and animals are well-represented by neural networks whereas others such as royal or sports-related objects are not. Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, our results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans.

replace-cross Understanding Sample Generation Strategies for Learning Heuristic Functions in Classical Planning

Authors: R. V. Bettker, P. P. Minini, A. G. Pereira, M. Ritt

Abstract: We study the problem of learning good heuristic functions for classical planning tasks with neural networks based on samples represented by states with their cost-to-goal estimates. The heuristic function is learned for a state space and goal condition with the number of samples limited to a fraction of the size of the state space, and must generalize well for all states of the state space with the same goal condition. Our main goal is to better understand the influence of sample generation strategies on the performance of a greedy best-first heuristic search (GBFS) guided by a learned heuristic function. In a set of controlled experiments, we find that two main factors determine the quality of the learned heuristic: the algorithm used to generate the sample set and how close the sample estimates to the perfect cost-to-goal are. These two factors are dependent: having perfect cost-to-goal estimates is insufficient if the samples are not well distributed across the state space. We also study other effects, such as adding samples with high-value estimates. Based on our findings, we propose practical strategies to improve the quality of learned heuristics: three strategies that aim to generate more representative states and two strategies that improve the cost-to-goal estimates. Our practical strategies result in a learned heuristic that, when guiding a GBFS algorithm, increases by more than 30% the mean coverage compared to a baseline learned heuristic.

replace-cross Exploring Singularities in point clouds with the graph Laplacian: An explicit approach

Authors: Martin Andersson, Benny Avelin

Abstract: We develop theory and methods that use the graph Laplacian to analyze the geometry of the underlying manifold of datasets. Our theory provides theoretical guarantees and explicit bounds on the functional forms of the graph Laplacian when it acts on functions defined close to singularities of the underlying manifold. We use these explicit bounds to develop tests for singularities and propose methods that can be used to estimate geometric properties of singularities in the datasets.

replace-cross Detecting hidden structures from a static loading experiment: topology optimization meets physics-informed neural networks

Authors: Saviz Mowlavi, Ken Kamrin

Abstract: Most noninvasive imaging techniques utilize electromagnetic or acoustic waves originating from multiple locations and directions to identify hidden geometrical structures. Surprisingly, it is also possible to image hidden voids and inclusions buried within an object using a single static thermal or mechanical loading experiment by observing the response of the exposed surface of the body, but this problem is challenging to invert. Although physics-informed neural networks (PINNs) have shown promise as a simple-yet-powerful tool for problem inversion, they have not yet been applied to imaging problems with a priori unknown topology. Here, we introduce a topology optimization framework based on PINNs that identifies concealed geometries using exposed surface data from a single loading experiment, without prior knowledge of the number or types of shapes. We allow for arbitrary solution topology by representing the geometry using a material density field combined with a novel eikonal regularization technique. We validate our framework by detecting the number, locations, and shapes of hidden voids and inclusions in many example cases, in both 2D and 3D, and we demonstrate the method's robustness to noise and sparsity in the data. Our methodology opens a pathway for PINNs to solve geometry optimization problems in engineering.

replace-cross Wearable-based Fair and Accurate Pain Assessment Using Multi-Attribute Fairness Loss in Convolutional Neural Networks

Authors: Yidong Zhu, Shao-Hsien Liu, Mohammad Arif Ul Alam

Abstract: The integration of diverse health data, such as IoT (Internet of Things), EHR (Electronic Health Record), and clinical surveys, with scalable AI(Artificial Intelligence) has enabled the identification of physical, behavioral, and psycho-social indicators of pain. However, the adoption of AI in clinical pain evaluation is hindered by challenges like personalization and fairness. Many AI models, including machine and deep learning, exhibit biases, discriminating against specific groups based on gender or ethnicity, causing skepticism among medical professionals about their reliability. This paper proposes a Multi-attribute Fairness Loss (MAFL) based Convolutional Neural Network (CNN) model designed to account for protected attributes in data, ensuring fair pain status predictions while minimizing disparities between privileged and unprivileged groups. We evaluate whether a balance between accuracy and fairness is achievable by comparing the proposed model with existing mitigation methods. Our findings indicate that the model performs favorably against state-of-the-art techniques. Using the NIH All-Of-US dataset, comprising data from 868 individuals over 1500 days, we demonstrate our model's effectiveness, achieving accuracy rates between 75% and 85%.

replace-cross A Multiobjective Reinforcement Learning Framework for Microgrid Energy Management

Authors: M. Vivienne Liu, Patrick M. Reed, David Gold, Garret Quist, C. Lindsay Anderson

Abstract: The emergence of microgrids (MGs) has provided a promising solution for decarbonizing and decentralizing the power grid, mitigating the challenges posed by climate change. However, MG operations often involve considering multiple objectives that represent the interests of different stakeholders, leading to potentially complex conflicts. To tackle this issue, we propose a novel multi-objective reinforcement learning framework that explores the high-dimensional objective space and uncovers the tradeoffs between conflicting objectives. This framework leverages exogenous information and capitalizes on the data-driven nature of reinforcement learning, enabling the training of a parametric policy without the need for long-term forecasts or knowledge of the underlying uncertainty distribution. The trained policies exhibit diverse, adaptive, and coordinative behaviors with the added benefit of providing interpretable insights on the dynamics of their information use. We employ this framework on the Cornell University MG (CU-MG), which is a combined heat and power MG, to evaluate its effectiveness. The results demonstrate performance improvements in all objectives considered compared to the status quo operations and offer more flexibility in navigating complex operational tradeoffs.

replace-cross Manifold Learning with Sparse Regularised Optimal Transport

Authors: Stephen Zhang, Gilles Mordant, Tetsuya Matsumoto, Geoffrey Schiebinger

Abstract: Manifold learning is a central task in modern statistics and data science. Many datasets (cells, documents, images, molecules) can be represented as point clouds embedded in a high dimensional ambient space, however the degrees of freedom intrinsic to the data are usually far fewer than the number of ambient dimensions. The task of detecting a latent manifold along which the data are embedded is a prerequisite for a wide family of downstream analyses. Real-world datasets are subject to noisy observations and sampling, so that distilling information about the underlying manifold is a major challenge. We propose a method for manifold learning that utilises a symmetric version of optimal transport with a quadratic regularisation that constructs a sparse and adaptive affinity matrix, that can be interpreted as a generalisation of the bistochastic kernel normalisation. We prove that the resulting kernel is consistent with a Laplace-type operator in the continuous limit, establish robustness to heteroskedastic noise and exhibit these results in numerical experiments. We identify a highly efficient computational scheme for computing this optimal transport for discrete data and demonstrate that it outperforms competing methods in a set of examples.

replace-cross Measuring Catastrophic Forgetting in Cross-Lingual Transfer Paradigms: Exploring Tuning Strategies

Authors: Boshko Koloski, Bla\v{z} \v{S}krlj, Marko Robnik-\v{S}ikonja, Senja Pollak

Abstract: The cross-lingual transfer is a promising technique to solve tasks in less-resourced languages. In this empirical study, we compare two fine-tuning approaches combined with zero-shot and full-shot learning approaches for large language models in a cross-lingual setting. As fine-tuning strategies, we compare parameter-efficient adapter methods with fine-tuning of all parameters. As cross-lingual transfer strategies, we compare the intermediate-training (\textit{IT}) that uses each language sequentially and cross-lingual validation (\textit{CLV}) that uses a target language already in the validation phase of fine-tuning. We assess the success of transfer and the extent of catastrophic forgetting in a source language due to cross-lingual transfer, i.e., how much previously acquired knowledge is lost when we learn new information in a different language. The results on two different classification problems, hate speech detection and product reviews, each containing datasets in several languages, show that the \textit{IT} cross-lingual strategy outperforms \textit{CLV} for the target language. Our findings indicate that, in the majority of cases, the \textit{CLV} strategy demonstrates superior retention of knowledge in the base language (English) compared to the \textit{IT} strategy, when evaluating catastrophic forgetting in multiple cross-lingual transfers.

replace-cross Necessary and Sufficient Watermark for Large Language Models

Authors: Yuki Takezawa, Ryoma Sato, Han Bao, Kenta Niwa, Makoto Yamada

Abstract: In recent years, large language models (LLMs) have achieved remarkable performances in various NLP tasks. They can generate texts that are indistinguishable from those written by humans. Such remarkable performance of LLMs increases their risk of being used for malicious purposes, such as generating fake news articles. Therefore, it is necessary to develop methods for distinguishing texts written by LLMs from those written by humans. Watermarking is one of the most powerful methods for achieving this. Although existing watermarking methods have successfully detected texts generated by LLMs, they significantly degrade the quality of the generated texts. In this study, we propose the Necessary and Sufficient Watermark (NS-Watermark) for inserting watermarks into generated texts without degrading the text quality. More specifically, we derive minimum constraints required to be imposed on the generated texts to distinguish whether LLMs or humans write the texts. Then, we formulate the NS-Watermark as a constrained optimization problem and propose an efficient algorithm to solve it. Through the experiments, we demonstrate that the NS-Watermark can generate more natural texts than existing watermarking methods and distinguish more accurately between texts written by LLMs and those written by humans. Especially in machine translation tasks, the NS-Watermark can outperform the existing watermarking method by up to 30 BLEU scores.

replace-cross Chain-of-Factors Paper-Reviewer Matching

Authors: Yu Zhang, Yanzhen Shen, SeongKu Kang, Xiusi Chen, Bowen Jin, Jiawei Han

Abstract: With the rapid increase in paper submissions to academic conferences, the need for automated and accurate paper-reviewer matching is more critical than ever. Previous efforts in this area have considered various factors to assess the relevance of a reviewer's expertise to a paper, such as the semantic similarity, shared topics, and citation connections between the paper and the reviewer's previous works. However, most of these studies focus on only one factor, resulting in an incomplete evaluation of the paper-reviewer relevance. To address this issue, we propose a unified model for paper-reviewer matching that jointly considers semantic, topic, and citation factors. To be specific, during training, we instruction-tune a contextualized language model shared across all factors to capture their commonalities and characteristics; during inference, we chain the three factors to enable step-by-step, coarse-to-fine search for qualified reviewers given a submission. Experiments on four datasets (one of which is newly contributed by us) spanning various fields such as machine learning, computer vision, information retrieval, and data mining consistently demonstrate the effectiveness of our proposed Chain-of-Factors model in comparison with state-of-the-art paper-reviewer matching methods and scientific pre-trained language models.

replace-cross Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs

Authors: Huaman Sun, Jiaxin Pei, Minje Choi, David Jurgens

Abstract: Human judgments are inherently subjective and are actively affected by personal traits such as gender and ethnicity. While Large Language Models (LLMs) are widely used to simulate human responses across diverse contexts, their ability to account for demographic differences in subjective tasks remains uncertain. In this study, leveraging the POPQUORN dataset, we evaluate nine popular LLMs on their ability to understand demographic differences in two subjective judgment tasks: politeness and offensiveness. We find that in zero-shot settings, most models' predictions for both tasks align more closely with labels from White participants than those from Asian or Black participants, while only a minor gender bias favoring women appears in the politeness task. Furthermore, sociodemographic prompting does not consistently improve and, in some cases, worsens LLMs' ability to perceive language from specific sub-populations. These findings highlight potential demographic biases in LLMs when performing subjective judgment tasks and underscore the limitations of sociodemographic prompting as a strategy to achieve pluralistic alignment. Code and data are available at: https://github.com/Jiaxin-Pei/LLM-as-Subjective-Judge.

URLs: https://github.com/Jiaxin-Pei/LLM-as-Subjective-Judge.

replace-cross SWEA: Updating Factual Knowledge in Large Language Models via Subject Word Embedding Altering

Authors: Xiaopeng Li, Shasha Li, Shezheng Song, Huijun Liu, Bin Ji, Xi Wang, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, Weimin Zhang

Abstract: The general capabilities of large language models (LLMs) make them the infrastructure for various AI applications, but updating their inner knowledge requires significant resources. Recent model editing is a promising technique for efficiently updating a small amount of knowledge of LLMs and has attracted much attention. In particular, local editing methods, which directly update model parameters, are proven suitable for updating small amounts of knowledge. Local editing methods update weights by computing least squares closed-form solutions and identify edited knowledge by vector-level matching in inference, which achieve promising results. However, these methods still require a lot of time and resources to complete the computation. Moreover, vector-level matching lacks reliability, and such updates disrupt the original organization of the model's parameters. To address these issues, we propose a detachable and expandable Subject Word Embedding Altering (SWEA) framework, which finds the editing embeddings through token-level matching and adds them to the subject word embeddings in Transformer input. To get these editing embeddings, we propose optimizing then suppressing fusion method, which first optimizes learnable embedding vectors for the editing target and then suppresses the Knowledge Embedding Dimensions (KEDs) to obtain final editing embeddings. We thus propose SWEA$\oplus$OS method for editing factual knowledge in LLMs. We demonstrate the overall state-of-the-art (SOTA) performance of SWEA$\oplus$OS on the CounterFact and zsRE datasets. To further validate the reasoning ability of SWEA$\oplus$OS in editing knowledge, we evaluate it on the more complex RippleEdits benchmark. The results demonstrate that SWEA$\oplus$OS possesses SOTA reasoning ability.

replace-cross Convergence Analysis for General Probability Flow ODEs of Diffusion Models in Wasserstein Distances

Authors: Xuefeng Gao, Lingjiong Zhu

Abstract: Score-based generative modeling with probability flow ordinary differential equations (ODEs) has achieved remarkable success in a variety of applications. While various fast ODE-based samplers have been proposed in the literature and employed in practice, the theoretical understandings about convergence properties of the probability flow ODE are still quite limited. In this paper, we provide the first non-asymptotic convergence analysis for a general class of probability flow ODE samplers in 2-Wasserstein distance, assuming accurate score estimates and smooth log-concave data distributions. We then consider various examples and establish results on the iteration complexity of the corresponding ODE-based samplers. Our proof technique relies on spelling out explicitly the contraction rate for the continuous-time ODE and analyzing the discretization and score-matching errors using synchronous coupling; the challenge in our analysis mainly arises from the inherent non-autonomy of the probability flow ODE and the specific exponential integrator that we study.

replace-cross From Data to Decisions: The Transformational Power of Machine Learning in Business Recommendations

Authors: Kapilya Gangadharan, K. Malathi, Anoop Purandaran, Barathi Subramanian, Rathinaraja Jeyaraj, Soon Ki Jung

Abstract: This research aims to explore the impact of Machine Learning (ML) on the evolution and efficacy of Recommendation Systems (RS), particularly in the context of their growing significance in commercial business environments. Methodologically, the study delves into the role of ML in crafting and refining these systems, focusing on aspects such as data sourcing, feature engineering, and the importance of evaluation metrics, thereby highlighting the iterative nature of enhancing recommendation algorithms. The deployment of Recommendation Engines (RE), driven by advanced algorithms and data analytics, is explored across various domains, showcasing their significant impact on user experience and decision-making processes. These engines not only streamline information discovery and enhance collaboration but also accelerate knowledge acquisition, proving vital in navigating the digital landscape for businesses. They contribute significantly to sales, revenue, and the competitive edge of enterprises by offering improved recommendations that align with individual customer needs. The research identifies the increasing expectation of users for a seamless, intuitive online experience, where content is personalized and dynamically adapted to changing preferences. Future research directions include exploring advancements in deep learning models, ethical considerations in the deployment of RS, and addressing scalability challenges. This study emphasizes the indispensability of comprehending and leveraging ML in RS for researchers and practitioners, to tap into the full potential of personalized recommendation in commercial business prospects.

replace-cross Combining Evidence Across Filtrations

Authors: Yo Joong Choe, Aaditya Ramdas

Abstract: In sequential anytime-valid inference, any admissible procedure must be based on e-processes: generalizations of test martingales that quantify the accumulated evidence against a composite null hypothesis at any stopping time. This paper proposes a method for combining e-processes constructed in different filtrations but for the same null. Although e-processes in the same filtration can be combined effortlessly (by averaging), e-processes in different filtrations cannot because their validity in a coarser filtration does not translate to a finer filtration. This issue arises in sequential tests of randomness and independence, as well as in the evaluation of sequential forecasters. We establish that a class of functions called adjusters can lift arbitrary e-processes across filtrations. The result yields a generally applicable "adjust-then-combine" procedure, which we demonstrate on the problem of testing randomness in real-world financial data. Furthermore, we prove a characterization theorem for adjusters that formalizes a sense in which using adjusters is necessary. There are two major implications. First, if we have a powerful e-process in a coarsened filtration, then we readily have a powerful e-process in the original filtration. Second, when we coarsen the filtration to construct an e-process, there is a logarithmic cost to recovering validity in the original filtration.

replace-cross From Risk to Uncertainty: Generating Predictive Uncertainty Measures via Bayesian Estimation

Authors: Nikita Kotelevskii, Vladimir Kondratyev, Martin Tak\'a\v{c}, \'Eric Moulines, Maxim Panov

Abstract: There are various measures of predictive uncertainty in the literature, but their relationships to each other remain unclear. This paper uses a decomposition of statistical pointwise risk into components, associated with different sources of predictive uncertainty, namely aleatoric uncertainty (inherent data variability) and epistemic uncertainty (model-related uncertainty). Together with Bayesian methods, applied as an approximation, we build a framework that allows one to generate different predictive uncertainty measures. We validate our method on image datasets by evaluating its performance in detecting out-of-distribution and misclassified instances using the AUROC metric. The experimental results confirm that the measures derived from our framework are useful for the considered downstream tasks.

replace-cross GreedyML: A Parallel Algorithm for Maximizing Constrained Submodular Functions

Authors: Shivaram Gopal, S M Ferdous, Hemanta K. Maji, Alex Pothen

Abstract: We describe a parallel approximation algorithm for maximizing monotone submodular functions subject to hereditary constraints on distributed memory multiprocessors. Our work is motivated by the need to solve submodular optimization problems on massive data sets, for practical contexts such as data summarization, machine learning, and graph sparsification. Our work builds on the randomized distributed RandGreedi algorithm, proposed by Barbosa, Ene, Nguyen, and Ward (2015). This algorithm computes a distributed solution by randomly partitioning the data among all the processors and then employing \emph{a single} accumulation step in which all processors send their partial solutions to one processor. However, for large problems, the accumulation step exceeds the memory available on a processor, and the processor that performs the accumulation becomes a computational bottleneck. Hence we propose a generalization of the RandGreedi algorithm that employs multiple accumulation steps to reduce the memory required. We analyze the approximation ratio and the time complexity of the algorithm (in the BSP model). We evaluate the new GreedyML algorithm on three classes of problems, and report results from large-scale data sets with millions of elements. The results show that the GreedyML algorithm can solve problems where the sequential Greedy and distributed RandGreedi algorithms fail due to memory constraints. For certain computationally intensive problems, the GreedyML algorithm is faster than the RandGreedi algorithm. The observed approximation quality of the solutions computed by the GreedyML algorithm closely matches those obtained by the RandGreedi algorithm on these problems.

replace-cross Dual-Channel Multiplex Graph Neural Networks for Recommendation

Authors: Xiang Li, Chaofan Fu, Zhongying Zhao, Guanjie Zheng, Chao Huang, Yanwei Yu, Junyu Dong

Abstract: Effective recommender systems play a crucial role in accurately capturing user and item attributes that mirror individual preferences. Some existing recommendation techniques have started to shift their focus towards modeling various types of interactive relations between users and items in real-world recommendation scenarios, such as clicks, marking favorites, and purchases on online shopping platforms. Nevertheless, these approaches still grapple with two significant challenges: (1) Insufficient modeling and exploitation of the impact of various behavior patterns formed by multiplex relations between users and items on representation learning, and (2) ignoring the effect of different relations within behavior patterns on the target relation in recommender system scenarios. In this work, we introduce a novel recommendation framework, Dual-Channel Multiplex Graph Neural Network (DCMGNN), which addresses the aforementioned challenges. It incorporates an explicit behavior pattern representation learner to capture the behavior patterns composed of multiplex user-item interactive relations, and includes a relation chain representation learner and a relation chain-aware encoder to discover the impact of various auxiliary relations on the target relation, the dependencies between different relations, and mine the appropriate order of relations in a behavior pattern. Extensive experiments on three real-world datasets demonstrate that our \model surpasses various state-of-the-art recommendation methods. It outperforms the best baselines by 10.06% and 12.15% on average across all datasets in terms of Recall@10 and NDCG@10 respectively.

replace-cross HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive Modeling

Authors: Daniel Duenias, Brennan Nichyporuk, Tal Arbel, Tammy Riklin Raviv

Abstract: The integration of diverse clinical modalities such as medical imaging and the tabular data extracted from patients' Electronic Health Records (EHRs) is a crucial aspect of modern healthcare. Integrative analysis of multiple sources can provide a comprehensive understanding of the clinical condition of a patient, improving diagnosis and treatment decision. Deep Neural Networks (DNNs) consistently demonstrate outstanding performance in a wide range of multimodal tasks in the medical domain. However, the complex endeavor of effectively merging medical imaging with clinical, demographic and genetic information represented as numerical tabular data remains a highly active and ongoing research pursuit. We present a novel framework based on hypernetworks to fuse clinical imaging and tabular data by conditioning the image processing on the EHR's values and measurements. This approach aims to leverage the complementary information present in these modalities to enhance the accuracy of various medical applications. We demonstrate the strength and generality of our method on two different brain Magnetic Resonance Imaging (MRI) analysis tasks, namely, brain age prediction conditioned by subject's sex and multi-class Alzheimer's Disease (AD) classification conditioned by tabular data. We show that our framework outperforms both single-modality models and state-of-the-art MRI tabular data fusion methods. A link to our code can be found at https://github.com/daniel4725/HyperFusion

URLs: https://github.com/daniel4725/HyperFusion

replace-cross Reasoning-Enhanced Object-Centric Learning for Videos

Authors: Jian Li, Pu Ren, Yang Liu, Hao Sun

Abstract: Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of the effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computations and fusion. Our experimental results on various datasets indicate that the STATM module can significantly enhance the capabilities of multiple state-of-the-art object-centric learning models for video. Moreover, as a predictive model, the STATM module also performs well in downstream prediction and Visual Question Answering (VQA) tasks. We will release our codes and data at https://github.com/intell-sci-comput/STATM.

URLs: https://github.com/intell-sci-comput/STATM.

replace-cross Joint enhancement of automatic chest X-ray diagnosis and radiological gaze prediction with multi-stage cooperative learning

Authors: Zirui Qiu, Hassan Rivaz, Yiming Xiao

Abstract: Purpose: As visual inspection is an inherent process during radiological screening, the associated eye gaze data can provide valuable insights into relevant clinical decisions. As deep learning has become the state-of-the-art for computer-assisted diagnosis, integrating human behavior, such as eye gaze data, into these systems is instrumental to help align machine predictions with clinical diagnostic criteria, thus enhancing the quality of automatic radiological diagnosis. Methods: We propose a novel deep learning framework for joint disease diagnosis and prediction of corresponding clinical visual attention maps for chest X-ray scans. Specifically, we introduce a new dual-encoder multi-task UNet, which leverages both a DenseNet201 backbone and a Residual and Squeeze-and-Excitation block-based encoder to extract diverse features for visual attention map prediction, and a multi-scale feature-fusion classifier to perform disease classification. To tackle the issue of asynchronous training schedules of individual tasks in multi-task learning, we proposed a multi-stage cooperative learning strategy, with contrastive learning for feature encoder pretraining to boost performance. Results: Our proposed method is shown to significantly outperform existing techniques for chest X-ray diagnosis (AUC=0.93) and the quality of visual attention map prediction (Correlation coefficient=0.58). Conclusion: Benefiting from the proposed multi-task multi-stage cooperative learning, our technique demonstrates the benefit of integrating clinicians' eye gaze into clinical AI systems to boost performance and potentially explainability.

replace-cross Is The Watermarking Of LLM-Generated Code Robust?

Authors: Tarun Suresh, Shubham Ugare, Gagandeep Singh, Sasa Misailovic

Abstract: We present the first in depth study on the robustness of existing watermarking techniques applied to code generated by large language models (LLMs). As LLMs increasingly contribute to software development, watermarking has emerged as a potential solution for detecting AI generated code and mitigating misuse, such as plagiarism or the automated generation of malicious programs. While previous research has demonstrated the resilience of watermarking in the text setting, our work reveals that watermarking techniques are significantly more fragile in code-based contexts. Specifically, we show that simple semantic-preserving transformations, such as variable renaming and dead code insertion, can effectively erase watermarks without altering the program's functionality. To systematically evaluate watermark robustness, we develop an algorithm that traverses the Abstract Syntax Tree (AST) of a watermarked program and applies a sequence of randomized, semantics-preserving transformations. Our experimental results, conducted on Python code generated by different LLMs, indicate that even minor modifications can drastically reduce watermark detectability, with true positive rates (TPR) dropping below 50% in many cases. Our code is publicly available at https://github.com/uiuc-arc/llm-code-watermark.

URLs: https://github.com/uiuc-arc/llm-code-watermark.

replace-cross GINopic: Topic Modeling with Graph Isomorphism Network

Authors: Suman Adhya, Debarshi Kumar Sanyal

Abstract: Topic modeling is a widely used approach for analyzing and exploring large document collections. Recent research efforts have incorporated pre-trained contextualized language models, such as BERT embeddings, into topic modeling. However, they often neglect the intrinsic informational value conveyed by mutual dependencies between words. In this study, we introduce GINopic, a topic modeling framework based on graph isomorphism networks to capture the correlation between words. By conducting intrinsic (quantitative as well as qualitative) and extrinsic evaluations on diverse benchmark datasets, we demonstrate the effectiveness of GINopic compared to existing topic models and highlight its potential for advancing topic modeling.

replace-cross CuriousLLM: Elevating Multi-Document Question Answering with LLM-Enhanced Knowledge Graph Reasoning

Authors: Zukang Yang, Zixuan Zhu, Xuan Zhu

Abstract: Large Language Models (LLMs) have achieved significant success in open-domain question answering. However, they continue to face challenges such as hallucinations and knowledge cutoffs. These issues can be mitigated through in-context learning by providing LLMs with relevant context before generating answers. Recent literature proposes Knowledge Graph Prompting (KGP) which integrates knowledge graphs with an LLM-based traversal agent to substantially enhance document retrieval quality. However, KGP requires costly fine-tuning with large datasets and remains prone to hallucination. In this paper, we propose CuriousLLM, an enhancement that integrates a curiosity-driven reasoning mechanism into an LLM agent. This mechanism enables the agent to generate relevant follow-up questions, thereby guiding the information retrieval process more efficiently. Central to our approach is the development of the new Follow-upQA dataset, which includes questions and supporting evidence as input, with follow-up questions serving as ground truths. These follow-up questions either inquire about what is still missing to fully answer the user's query or use special tokens to signify that the retrieved evidence is sufficient. Our experiments show that CuriousLLM significantly boosts LLM performance in multi-document question answering (MD-QA), circumventing the substantial computational costs and latency from the original KGP framework.

replace-cross Generalization capabilities and robustness of hybrid models grounded in physics compared to purely deep learning models

Authors: Rodrigo Abad\'ia-Heredia, Adri\'an Corrochano, Manuel Lopez-Martin, Soledad Le Clainche

Abstract: This study investigates the generalization capabilities and robustness of purely deep learning (DL) models and hybrid models based on physical principles in fluid dynamics applications, specifically focusing on iteratively forecasting the temporal evolution of flow dynamics. Three autoregressive models were compared: a hybrid model (POD-DL) that combines proper orthogonal decomposition (POD) with a long-short term memory (LSTM) layer, a convolutional autoencoder combined with a convolutional LSTM (ConvLSTM) layer and a variational autoencoder (VAE) combined with a ConvLSTM layer. These models were tested on two high-dimensional, nonlinear datasets representing the velocity field of flow past a circular cylinder in both laminar and turbulent regimes. The study used latent dimension methods, enabling a bijective reduction of high-dimensional dynamics into a lower-order space to facilitate future predictions. While the VAE and ConvLSTM models accurately predicted laminar flow, the hybrid POD-DL model outperformed the others across both laminar and turbulent flow regimes. This success is attributed to the model's ability to incorporate modal decomposition, reducing the dimensionality of the data, by a non-parametric method, and simplifying the forecasting component. By leveraging POD, the model not only gained insight into the underlying physics, improving prediction accuracy with less training data, but also reduce the number of trainable parameters as POD is non-parametric. The findings emphasize the potential of hybrid models, particularly those integrating modal decomposition and deep learning, in predicting complex flow dynamics.

replace-cross Can humans teach machines to code?

Authors: C\'eline Hocquette, Johannes Langer, Andrew Cropper, Ute Schmid

Abstract: The goal of inductive program synthesis is for a machine to automatically generate a program from user-supplied examples. A key underlying assumption is that humans can provide sufficient examples to teach a concept to a machine. To evaluate the validity of this assumption, we conduct a study where human participants provide examples for six programming concepts, such as finding the maximum element of a list. We evaluate the generalisation performance of five program synthesis systems trained on input-output examples (i) from non-expert humans, (ii) from a human expert, and (iii) randomly sampled. Our results suggest that non-experts typically do not provide sufficient examples for a program synthesis system to learn an accurate program.

replace-cross $T^2$ of Thoughts: Temperature Tree Elicits Reasoning in Large Language Models

Authors: Chengkun Cai, Xu Zhao, Yucheng Du, Haoliang Liu, Lei Li

Abstract: Large Language Models (LLMs) have emerged as powerful tools in artificial intelligence, especially in complex decision-making scenarios, but their static problem-solving strategies often limit their adaptability to dynamic environments. We explore the enhancement of reasoning capabilities in LLMs through Temperature Tree ($T^2$) prompting via a heuristic algorithm, termed as $T^2$ of Thoughts ($T^2oT$). The primary focus is on enhancing decision-making processes by dynamically adjusting search parameters, especially temperature, to improve accuracy without increasing computational demands. We empirically validate that our hybrid $T^2oT$ approach yields enhancements in, single-solution accuracy, multi-solution generation and text generation quality. Our findings suggest that while dynamic search depth adjustments based on temperature can yield mixed results, a fixed search depth, when coupled with adaptive capabilities of $T^2oT$, provides a more reliable and versatile problem-solving strategy. This work highlights the potential for future explorations in optimizing algorithmic interactions with foundational language models, particularly illustrated by our development for the Game of 24 and Creative Writing tasks.

replace-cross Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling

Authors: Cristian Rodriguez-Opazo, Ehsan Abbasnejad, Damien Teney, Hamed Damirchi, Edison Marrese-Taylor, Anton van den Hengel

Abstract: Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various architectures, from vision transformers (ViTs) to convolutional networks (ResNets) have been trained with CLIP to serve as general solutions to diverse vision tasks. This paper explores the differences across various CLIP-trained vision backbones. Despite using the same data and training objective, we find that these architectures have notably different representations, different classification performance across datasets, and different robustness properties to certain types of image perturbations. Our findings indicate a remarkable possible synergy across backbones by leveraging their respective strengths. In principle, classification accuracy could be improved by over 40 percentage with an informed selection of the optimal backbone per test example.Using this insight, we develop a straightforward yet powerful approach to adaptively ensemble multiple backbones. The approach uses as few as one labeled example per class to tune the adaptive combination of backbones. On a large collection of datasets, the method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone, well beyond traditional ensembles

replace-cross The Point of View of a Sentiment: Towards Clinician Bias Detection in Psychiatric Notes

Authors: Alissa A. Valentine, Lauren A. Lepow, Lili Chan, Alexander W. Charney, Isotta Landi

Abstract: Negative patient descriptions and stigmatizing language can contribute to generating healthcare disparities in two ways: (1) read by patients, they can harm their trust and engagement with the medical center; (2) read by physicians, they may negatively influence their perspective of a future patient. In psychiatry, the patient-clinician therapeutic alliance is a major determinant of clinical outcomes. Therefore, language usage in psychiatric clinical notes may not only create healthcare disparities, but also perpetuate them. Recent advances in NLP systems have facilitated the efforts to detect discriminatory language in healthcare. However, such attempts have only focused on the perspectives of the medical center and its physicians. Considering both physicians and non-physicians' point of view is a more translatable approach to identifying potentially harmful language in clinical notes. By leveraging pre-trained and large language models (PLMs and LLMs), this work aims to characterize potentially harmful language usage in psychiatric notes by identifying the sentiment expressed in sentences describing patients based on the reader's point of view. Extracting 39 sentences from the Mount Sinai Health System containing psychiatric lexicon, we fine-tuned three PLMs (RoBERTa, GatorTron, and GatorTron + Task Adaptation) and implemented zero-shot and few-shot ICL approaches for three LLMs (GPT-3.5, Llama-3.1, and Mistral) to classify the sentiment of the sentences according to the physician or non-physician point of view. Results showed that GPT-3.5 aligned best to physician point of view and Mistral aligned best to non-physician point of view. These results underline the importance of recognizing the reader's point of view, not only for improving the note writing process, but also for the quantification, identification, and reduction of bias in computational systems for downstream analyses.

replace-cross Speeding up Policy Simulation in Supply Chain RL

Authors: Vivek Farias, Joren Gijsbrechts, Aryan Khojandi, Tianyi Peng, Andrew Zheng

Abstract: Simulating a single trajectory of a dynamical system under some state-dependent policy is a core bottleneck in policy optimization (PO) algorithms. The many inherently serial policy evaluations that must be performed in a single simulation constitute the bulk of this bottleneck. In applying PO to supply chain optimization (SCO) problems, simulating a single sample path corresponding to one month of a supply chain can take several hours. We present an iterative algorithm to accelerate policy simulation, dubbed Picard Iteration. This scheme carefully assigns policy evaluation tasks to independent processes. Within an iteration, any given process evaluates the policy only on its assigned tasks while assuming a certain "cached" evaluation for other tasks; the cache is updated at the end of the iteration. Implemented on GPUs, this scheme admits batched evaluation of the policy across a single trajectory. We prove that the structure afforded by many SCO problems allows convergence in a small number of iterations independent of the horizon. We demonstrate practical speedups of 400x on large-scale SCO problems even with a single GPU, and also demonstrate practical efficacy in other RL environments.

replace-cross Demystifying Spectral Bias on Real-World Data

Authors: Itay Lavie, Zohar Ringel

Abstract: Kernel ridge regression (KRR) and Gaussian processes (GPs) are fundamental tools in statistics and machine learning, with recent applications to highly over-parameterized deep neural networks. The ability of these tools to learn a target function is directly related to the eigenvalues of their kernel sampled on the input data distribution. Targets that have support on higher eigenvalues are more learnable. However, solving such eigenvalue problems on real-world data remains a challenge. Here, we consider cross-dataset learnability and show that one may use eigenvalues and eigenfunctions associated with highly idealized data measures to reveal spectral bias on complex datasets and bound learnability on real-world data. This allows us to leverage various symmetries that realistic kernels manifest to unravel their spectral bias.

replace-cross DP-MemArc: Differential Privacy Transfer Learning for Memory Efficient Language Models

Authors: Yanming Liu, Xinyue Peng, Yuwei Zhang, Xiaolan Ke, Songhang Deng, Jiannan Cao, Chen Ma, Mengchen Fu, Sheng Cheng, Xun Wang, Jianwei Yin, Tianyu Du, Xuhong Zhang

Abstract: Large language models have repeatedly shown outstanding performance across diverse applications. However, deploying these models can inadvertently risk user privacy. The significant memory demands during training pose a major challenge in terms of resource consumption. This substantial size places a heavy load on memory resources, raising considerable practical concerns. In this paper, we introduce DP-MemArc, a novel training framework aimed at reducing the memory costs of large language models while emphasizing the protection of user data privacy. DP-MemArc incorporates side network or reversible network designs to support a variety of differential privacy memory-efficient fine-tuning schemes. Our approach not only achieves in memory optimization but also ensures robust privacy protection, keeping user data secure and confidential. Extensive experiments have demonstrated that DP-MemArc effectively provides differential privacy-efficient fine-tuning across different task scenarios.

replace-cross DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors

Authors: Keon Lee, Dong Won Kim, Jaehyeon Kim, Seungjun Chung, Jaewoong Cho

Abstract: Large-scale latent diffusion models (LDMs) excel in content generation across various modalities, but their reliance on phonemes and durations in text-to-speech (TTS) limits scalability and access from other fields. While recent studies show potential in removing these domain-specific factors, performance remains suboptimal. In this work, we introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors. Through rigorous analysis and empirical exploration, we find that (1) DiT with minimal modifications outperforms U-Net, (2) variable-length modeling with a speech length predictor significantly improves results over fixed-length approaches, and (3) conditions like semantic alignment in speech latent representations are key to further enhancement. By scaling our training data to 82K hours and the model size to 790M parameters, we achieve superior or comparable zero-shot performance to state-of-the-art TTS models in naturalness, intelligibility, and speaker similarity, all without relying on domain-specific factors. Speech samples are available at https://ditto-tts.github.io.

URLs: https://ditto-tts.github.io.

replace-cross CELL your Model: Contrastive Explanations for Large Language Models

Authors: Ronny Luss, Erik Miehling, Amit Dhurandhar

Abstract: The advent of black-box deep neural network classification models has sparked the need to explain their decisions. However, in the case of generative AI, such as large language models (LLMs), there is no class prediction to explain. Rather, one can ask why an LLM output a particular response to a given prompt. In this paper, we answer this question by proposing a contrastive explanation method requiring simply black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because if the prompt was slightly modified, the LLM would have given a different response that is either less preferable or contradicts the original response. The key insight is that contrastive explanations simply require a scoring function that has meaning to the user and not necessarily a specific real valued quantity (viz. class label). To this end, we offer a novel budgeted algorithm, our main algorithmic contribution, which intelligently creates contrasts based on such a scoring function while adhering to a query budget, necessary for longer contexts. We show the efficacy of our method on important natural language tasks such as open-text generation and chatbot conversations.

replace-cross Causal Discovery Inspired Unsupervised Domain Adaptation for Emotion-Cause Pair Extraction

Authors: Yuncheng Hua, Yujin Huang, Shuo Huang, Tao Feng, Lizhen Qu, Chris Bain, Richard Bassed, Gholamreza Haffari

Abstract: This paper tackles the task of emotion-cause pair extraction in the unsupervised domain adaptation setting. The problem is challenging as the distributions of the events causing emotions in target domains are dramatically different than those in source domains, despite the distributions of emotional expressions between domains are overlapped. Inspired by causal discovery, we propose a novel deep latent model in the variational autoencoder (VAE) framework, which not only captures the underlying latent structures of data but also utilizes the easily transferable knowledge of emotions as the bridge to link the distributions of events in different domains. To facilitate knowledge transfer across domains, we also propose a novel variational posterior regularization technique to disentangle the latent representations of emotions from those of events in order to mitigate the damage caused by the spurious correlations related to the events in source domains. Through extensive experiments, we demonstrate that our model outperforms the strongest baseline by approximately 11.05\% on a Chinese benchmark and 2.45\% on a English benchmark in terms of weighted-average F1 score. We have released our source code and the generated dataset publicly at: https://github.com/tk1363704/CAREL-VAE.

URLs: https://github.com/tk1363704/CAREL-VAE.

replace-cross RuleR: Improving LLM Controllability by Rule-based Data Recycling

Authors: Ming Li, Han Chen, Chenguang Wang, Dang Nguyen, Dianqi Li, Tianyi Zhou

Abstract: Large language models (LLMs) still lack delicate controllability over their responses, which is critical to enhancing their performance and the user experience. However, curating supervised fine-tuning (SFT) datasets to improve LLM controllability usually relies on human experts or proprietary LLMs, which requires additional costs. To bridge this gap, we propose Rule-based Data Recycling (RuleR), a data augmentation method incorporating multiple constraints into the original data samples according to predefined rules, which creates new training tasks to consolidate the controllability of LLMs. Instead of creating new data from scratch, RuleR "recycles" existing data by simply applying rule-based edits to their responses and appending the rule-instructions in their original instructions. Experimental results demonstrate RuleR's effectiveness in improving LLM controllability while maintaining general instruction-following capabilities.

replace-cross GraphEval36K: Benchmarking Coding and Reasoning Capabilities of Large Language Models on Graph Datasets

Authors: Qiming Wu, Zichen Chen, Will Corcoran, Misha Sra, Ambuj K. Singh

Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing (NLP), demonstrating significant capabilities in processing and understanding text data. However, recent studies have identified limitations in LLMs' ability to manipulate, program, and reason about structured data, especially graphs. We introduce GraphEval36K, the first comprehensive graph dataset, comprising 40 graph coding problems and 36,900 test cases to evaluate the ability of LLMs on graph problem-solving. Our dataset is categorized into eight primary and four sub-categories to ensure a thorough evaluation across different types of graphs. We benchmark ten LLMs, finding that private models outperform open-source ones, though the gap is narrowing. We also analyze the performance of LLMs across directed vs undirected graphs, different kinds of graph concepts, and network models. Furthermore, to improve the usability of our evaluation framework, we propose Structured Symbolic Decomposition (SSD), an instruction-based method designed to enhance LLM performance on complex graph tasks. Results show that SSD improves the average passing rate of GPT-4, GPT-4o, Gemini-Pro and Claude-3-Sonnet by 8.38%, 6.78%, 29.28% and 25.28%, respectively.

replace-cross Towards Federated Low-Rank Adaptation of Language Models with Rank Heterogeneity

Authors: Yuji Byun, Jaeho Lee

Abstract: Low-rank adaptation (LoRA) offers an efficient alternative to full-weight adaptation in federated fine-tuning of language models, significantly reducing computational costs. By adjusting ranks for each client, federated LoRA enables flexible resource allocation. However, we observe that heterogeneous ranks among clients lead to unstable performance. Our analysis attributes this instability to the conventional zero-padding aggregation strategy, which dilutes information from high-rank clients during model aggregation. To address this issue, we propose a replication-based padding strategy that better retains valuable information from clients with high-quality data. Empirically, this approach accelerates convergence and enhances the global model's predictive performance.

replace-cross Optimal Low-Depth Quantum Signal-Processing Phase Estimation

Authors: Yulong Dong, Jonathan A. Gross, Murphy Yuezhen Niu

Abstract: Quantum effects like entanglement and coherent amplification can be used to drastically enhance the accuracy of quantum parameter estimation beyond classical limits. However, challenges such as decoherence and time-dependent errors hinder Heisenberg-limited amplification. We introduce Quantum Signal-Processing Phase Estimation algorithms that are robust against these challenges and achieve optimal performance as dictated by the Cram\'{e}r-Rao bound. These algorithms use quantum signal transformation to decouple interdependent phase parameters into largely orthogonal ones, ensuring that time-dependent errors in one do not compromise the accuracy of learning the other. Combining provably optimal classical estimation with near-optimal quantum circuit design, our approach achieves a standard deviation accuracy of $10^{-4}$ radians for estimating unwanted swap angles in superconducting two-qubit experiments, using low-depth ($<10$) circuits. This represents up to two orders of magnitude improvement over existing methods. Theoretically and numerically, we demonstrate the optimality of our algorithm against time-dependent phase errors, observing that the variance of the time-sensitive parameter $\varphi$ scales faster than the asymptotic Heisenberg scaling in the small-depth regime. Our results are rigorously validated against the quantum Fisher information, confirming our protocol's ability to achieve unmatched precision for two-qubit gate learning.

replace-cross Robust Q-Learning for finite ambiguity sets

Authors: C\'ecile Decker, Julian Sester

Abstract: In this paper we propose a novel $Q$-learning algorithm allowing to solve distributionally robust Markov decision problems for which the ambiguity set of probability measures can be chosen arbitrarily as long as it comprises only a finite amount of measures. Therefore, our approach goes beyond the well-studied cases involving ambiguity sets of balls around some reference measure with the distance to reference measure being measured with respect to the Wasserstein distance or the Kullback--Leibler divergence. Hence, our approach allows the applicant to create ambiguity sets better tailored to her needs and to solve the associated robust Markov decision problem via a $Q$-learning algorithm whose convergence is guaranteed by our main result. Moreover, we showcase in several numerical experiments the tractability of our approach.

replace-cross FabGPT: An Efficient Large Multimodal Model for Complex Wafer Defect Knowledge Queries

Authors: Yuqi Jiang, Xudong Lu, Qian Jin, Qi Sun, Hanming Wu, Cheng Zhuo

Abstract: Intelligence is key to advancing integrated circuit (IC) fabrication. Recent breakthroughs in Large Multimodal Models (LMMs) have unlocked extraditionary abilities in understanding images and text, fostering intelligent fabrication. Leveraging the power of LMMs, we introduce FabGPT, a customized IC fabrication large multimodal model for wafer defect knowledge query. FabGPT manifests expertise in conducting defect detection in Scanning Electron Microscope (SEM) images, performing root cause analysis, and providing expert Q&A on fabrication processes. FabGPT matches enhanced multimodal features to automatically detect minute defects under complex wafer backgrounds and reduce the subjectivity of manual threshold settings. Besides, the proposed modulation module and interactive corpus training strategy embed wafer defect knowledge into the pre-trained model, effectively balancing Q&A queries related to defect knowledge and original knowledge and mitigating the modality bias issues. Experiments on in-house fab data show that FabGPT achieves significant performance improvement in wafer defect detection and knowledge querying.

replace-cross ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

Authors: Peng Xu, Wei Ping, Xianchao Wu, Chejian Xu, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro

Abstract: In this work, we introduce ChatQA 2, an Llama 3.0-based model with a 128K context window, designed to bridge the gap between open-source LLMs and leading proprietary models (e.g., GPT-4-Turbo-2024-04-09) in long context understanding and retrieval-augmented generation (RAG) capabilities. These two capabilities are complementary to each other and essential for LLMs to process large volumes of information that cannot fit into a single prompt. We present a detailed continued training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model's instruction-following, RAG performance, and long-context understanding capabilities. Our results demonstrate that the Llama3-ChatQA-2-70B model outperforms most existing state-of-the-art models, including GPT-4-Turbo-2024-04-09, Qwen2-72B-Instruct, and Llama3.1-70B-Instruct, on ultra-long tasks beyond 100K tokens, as well as on the RAG benchmark using only a 4K context window, showing the strong long context capability across varying sequence lengths. We further provide extensive comparisons between direct long-context and RAG solutions using the same state-of-the-art long-context LLMs. Interestingly, we find that the performance of strong long-context LLMs using RAG improves when retrieving a larger number of chunks. With a large set of top-k chunks, RAG consistently outperforms direct long-context solution using the same state-of-the-art long-context models (e.g., Llama3-ChatQA-2-70B and Qwen2-72B-Instruct) on both 32K and 128K benchmarks. We open-source the model weights, training data, and the evaluation setup for the for the community: https://chatqa2-project.github.io/

URLs: https://chatqa2-project.github.io/

replace-cross LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models

Authors: Shi Lin, Hongming Yang, Rongchang Li, Xun Wang, Changting Lin, Wenpeng Xing, Meng Han

Abstract: The rapid development of Large Language Models (LLMs) has brought significant advancements across various tasks. However, despite these achievements, LLMs still exhibit inherent safety vulnerabilities, especially when confronted with jailbreak attacks. Existing jailbreak methods suffer from two main limitations: reliance on complicated prompt engineering and iterative optimization, which lead to low attack success rate (ASR) and attack efficiency (AE). In this work, we propose an efficient jailbreak attack method, Analyzing-based Jailbreak (ABJ), which leverages the advanced reasoning capability of LLMs to autonomously generate harmful content, revealing their underlying safety vulnerabilities during complex reasoning process. We conduct comprehensive experiments on ABJ across various open-source and closed-source LLMs. In particular, ABJ achieves high ASR (82.1% on GPT-4o-2024-11-20) with exceptional AE among all target LLMs, showcasing its remarkable attack effectiveness, transferability, and efficiency. Our findings underscore the urgent need to prioritize and improve the safety of LLMs to mitigate the risks of misuse.

replace-cross Provable Benefit of Annealed Langevin Monte Carlo for Non-log-concave Sampling

Authors: Wei Guo, Molei Tao, Yongxin Chen

Abstract: We consider the outstanding problem of sampling from an unnormalized density that may be non-log-concave and multimodal. To enhance the performance of simple Markov chain Monte Carlo (MCMC) methods, techniques of annealing type have been widely used. However, quantitative theoretical guarantees of these techniques are under-explored. This study takes a first step toward providing a non-asymptotic analysis of annealed MCMC. Specifically, we establish, for the first time, an oracle complexity of $\widetilde{O}\left(\frac{d\beta^2{\cal A}^2}{\varepsilon^6}\right)$ for the simple annealed Langevin Monte Carlo algorithm to achieve $\varepsilon^2$ accuracy in Kullback-Leibler divergence to the target distribution $\pi\propto{\rm e}^{-V}$ on $\mathbb{R}^d$ with $\beta$-smooth potential $V$. Here, ${\cal A}$ represents the action of a curve of probability measures interpolating the target distribution $\pi$ and a readily sampleable distribution.

replace-cross Bridging Compressed Image Latents and Multimodal Large Language Models

Authors: Chia-Hao Kao, Cheng Chien, Yu-Jen Tseng, Yi-Hsin Chen, Alessandro Gnutti, Shao-Yuan Lo, Wen-Hsiao Peng, Riccardo Leonardi

Abstract: This paper presents the first-ever study of adapting compressed image latents to suit the needs of downstream vision tasks that adopt Multimodal Large Language Models (MLLMs). MLLMs have extended the success of large language models to modalities (e.g. images) beyond text, but their billion scale hinders deployment on resource-constrained end devices. While cloud-hosted MLLMs could be available, transmitting raw, uncompressed images captured by end devices to the cloud requires an efficient image compression system. To address this, we focus on emerging neural image compression and propose a novel framework with a lightweight transform-neck and a surrogate loss to adapt compressed image latents for MLLM-based vision tasks. Given the huge scale of MLLMs, our framework excludes the entire downstream MLLM except part of its visual encoder from training our system. This stands out from most existing coding for machine approaches that involve downstream networks in training and thus could be impractical when the networks are MLLMs. The proposed framework is general in that it is applicable to various MLLMs, neural image codecs, and multiple application scenarios, where the neural image codec can be (1) pre-trained for human perception without updating, (2) fully updated for joint human and machine perception, or (3) fully updated for only machine perception. Extensive experiments on different neural image codecs and various MLLMs show that our method achieves great rate-accuracy performance with much less complexity.

replace-cross Using Linearized Optimal Transport to Predict the Evolution of Stochastic Particle Systems

Authors: Nicholas Karris, Evangelos A. Nikitopoulos, Ioannis G. Kevrekidis, Seungjoon Lee, Alexander Cloninger

Abstract: We develop an Euler-type method to predict the evolution of a time-dependent probability measure without explicitly learning an operator that governs its evolution. We use linearized optimal transport theory to prove that the measure-valued analog of Euler's method is first-order accurate when the measure evolves ``smoothly.'' In applications of interest, however, the measure is an empirical distribution of a system of stochastic particles whose behavior is only accessible through an agent-based micro-scale simulation. In such cases, this empirical measure does not evolve smoothly because the individual particles move chaotically on short time scales. However, we can still perform our Euler-type method, and when the particles' collective distribution approximates a measure that \emph{does} evolve smoothly, we observe that the algorithm still accurately predicts this collective behavior over relatively large Euler steps. We specifically demonstrate the efficacy of our approach by showing that our algorithm vastly reduces the number of micro-scale steps needed to correctly approximate long-term behavior in two illustrative examples, reflected Brownian motion and a model of bacterial chemotaxis.

replace-cross A Logical Fallacy-Informed Framework for Argument Generation

Authors: Luca Mouchel, Debjit Paul, Shaobo Cui, Robert West, Antoine Bosselut, Boi Faltings

Abstract: Despite the remarkable performance of Large Language Models (LLMs) in natural language processing tasks, they still struggle with generating logically sound arguments, resulting in potential risks such as spreading misinformation. To address this issue, we introduce FIPO, a fallacy-informed framework that leverages preference optimization methods to steer LLMs toward logically sound arguments. FIPO includes a classification loss, to capture the fine-grained information on fallacy types. Our results on argumentation datasets show that our method reduces the fallacy errors by up to 17.5%. Furthermore, our human evaluation results indicate that the quality of the generated arguments by our method significantly outperforms the fine-tuned baselines, as well as other preference optimization methods, such as DPO. These findings highlight the importance of ensuring models are aware of logical fallacies for effective argument generation. Our code is available at github.com/lucamouchel/Logical-Fallacies.

replace-cross A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models

Authors: Geonhee Kim, Marco Valentino, Andr\'e Freitas

Abstract: Recent studies on logical reasoning in Language Models (LMs) have sparked a debate on whether they can learn systematic reasoning principles during pre-training or merely exploit superficial patterns in the training data. This paper presents a mechanistic interpretation of syllogistic reasoning in LMs to advance the understanding of internal dynamics. Specifically, we present a methodology for circuit discovery aimed at interpreting content-independent reasoning mechanisms. Through two distinct intervention methods, we uncover a sufficient and necessary circuit involving middle-term suppression that elucidates how LMs transfer information to derive valid conclusions from premises. Furthermore, we investigate how belief biases manifest in syllogistic reasoning, finding evidence of partial contamination from additional attention heads responsible for encoding commonsense and contextualized knowledge. Finally, we explore the generalization of the discovered mechanisms across various syllogistic schemes, model sizes and architectures, finding that the identified circuit is sufficient and necessary for the schemes on which the models achieve high downstream accuracy (> 60%), and that the activation patterns apply to models of different families. Overall, our findings suggest that LMs indeed learn transferable content-independent reasoning mechanisms, but that, at the same time, such mechanisms do not involve generalizable and abstract logical primitives, being susceptible to contamination by the same world knowledge acquired during pre-training.

replace-cross ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data

Authors: Weizhou Wang, Eric Liu, Xiangyu Guo, Xiao Hu, Ilya Grishchenko, David Lie

Abstract: Supervised learning-based software vulnerability detectors often fall short due to the inadequate availability of labelled training data. In contrast, Large Language Models (LLMs) such as GPT-4, are not trained on labelled data, but when prompted to detect vulnerabilities, LLM prediction accuracy is only marginally better than random guessing. In this paper, we explore a different approach by reframing vulnerability detection as one of anomaly detection. Since the vast majority of code does not contain vulnerabilities and LLMs are trained on massive amounts of such code, vulnerable code can be viewed as an anomaly from the LLM's predicted code distribution, freeing the model from the need for labelled data to provide a learnable representation of vulnerable code. Leveraging this perspective, we demonstrate that LLMs trained for code generation exhibit a significant gap in prediction accuracy when prompted to reconstruct vulnerable versus non-vulnerable code. Using this insight, we implement ANVIL, a detector that identifies software vulnerabilities at line-level granularity. Our experiments explore the discriminating power of different anomaly scoring methods, as well as the sensitivity of ANVIL to context size. We also study the effectiveness of ANVIL on various LLM families, and conduct leakage experiments on vulnerabilities that were discovered after the knowledge cutoff of our evaluated LLMs. On a collection of vulnerabilities from the Magma benchmark, ANVIL outperforms state-of-the-art line-level vulnerability detectors, LineVul and LineVD, which have been trained with labelled data, despite ANVIL having never been trained with labelled vulnerabilities. Specifically, our approach achieves $1.62\times$ to $2.18\times$ better Top-5 accuracies and $1.02\times$ to $1.29\times$ times better ROC scores on line-level vulnerability detection tasks.

replace-cross S2Cap: A Benchmark and a Baseline for Singing Style Captioning

Authors: Hyunjong Ok, Jaeho Lee

Abstract: Singing voices contain much richer information than common voices, such as diverse vocal and acoustic characteristics. However, existing open-source audio-text datasets for singing voices capture only a limited set of attributes and lacks acoustic features, leading to limited utility towards downstream tasks, such as style captioning. To fill this gap, we formally consider the task of singing style captioning and introduce S2Cap, a singing voice dataset with comprehensive descriptions of diverse vocal, acoustic and demographic attributes. Based on this dataset, we develop a simple yet effective baseline algorithm for the singing style captioning. The algorithm utilizes two novel technical components: CRESCENDO for mitigating misalignment between pretrained unimodal models, and demixing supervision to regularize the model to focus on the singing voice. Despite its simplicity, the proposed method outperforms state-of-the-art baselines.

replace-cross MusicLIME: Explainable Multimodal Music Understanding

Authors: Theodoros Sotirou, Vassilis Lyberatos, Orfeas Menis Mastromichalakis, Giorgos Stamou

Abstract: Multimodal models are critical for music understanding tasks, as they capture the complex interplay between audio and lyrics. However, as these models become more prevalent, the need for explainability grows-understanding how these systems make decisions is vital for ensuring fairness, reducing bias, and fostering trust. In this paper, we introduce MusicLIME, a model-agnostic feature importance explanation method designed for multimodal music models. Unlike traditional unimodal methods, which analyze each modality separately without considering the interaction between them, often leading to incomplete or misleading explanations, MusicLIME reveals how audio and lyrical features interact and contribute to predictions, providing a holistic view of the model's decision-making. Additionally, we enhance local explanations by aggregating them into global explanations, giving users a broader perspective of model behavior. Through this work, we contribute to improving the interpretability of multimodal music models, empowering users to make informed choices, and fostering more equitable, fair, and transparent music understanding systems.

replace-cross SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation

Authors: Jaime Garcia-Martinez, David Diaz-Guerra, Archontis Politis, Tuomas Virtanen, Julio J. Carabias-Orti, Pedro Vera-Candeas

Abstract: Recent advancements in music source separation have significantly progressed, particularly in isolating vocals, drums, and bass elements from mixed tracks. These developments owe much to the creation and use of large-scale, multitrack datasets dedicated to these specific components. However, the challenge of extracting similarly sounding sources from orchestra recordings has not been extensively explored, largely due to a scarcity of comprehensive and clean (i.e bleed-free) multitrack datasets. In this paper, we introduce a novel multitrack dataset called SynthSOD, developed using a set of simulation techniques to create a realistic (i.e. using high-quality soundfonts), musically motivated, and heterogeneous training set comprising different dynamics, natural tempo changes, styles, and conditions. Moreover, we demonstrate the application of a widely used baseline music separation model trained on our synthesized dataset w.r.t to the well-known EnsembleSet, and evaluate its performance under both synthetic and real-world conditions.

replace-cross Numerical determination of the width and shape of the effective string using Stochastic Normalizing Flows

Authors: Michele Caselle, Elia Cellini, Alessandro Nada

Abstract: Flow-based architectures have recently proved to be an efficient tool for numerical simulations of Effective String Theories regularized on the lattice that otherwise cannot be efficiently sampled by standard Monte Carlo methods. In this work we use Stochastic Normalizing Flows, a state-of-the-art deep learning architecture based on non-equilibrium Monte Carlo simulations, to study different effective string models. After testing the reliability of this approach through a comparison with exact results for the Nambu-Got\={o} model, we discuss results on observables that are challenging to study analytically, such as the width of the string and the shape of the flux density. Furthermore, we perform a novel numerical study of Effective String Theories with terms beyond the Nambu-Got\={o} action, including a broader discussion on their significance for lattice gauge theories. The combination of these findings enables a quantitative description of the fine details of the confinement mechanism in different lattice gauge theories. The results presented in this work establish the reliability and feasibility of flow-based samplers for Effective String Theories and pave the way for future applications on more complex models.

replace-cross EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

Authors: Haotian Sun, Tao Lei, Bowen Zhang, Yanghao Li, Haoshuo Huang, Ruoming Pang, Bo Dai, Nan Du

Abstract: Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.

replace-cross Distributed Learning with Discretely Observed Functional Data

Authors: Jiading Liu, Lei Shi

Abstract: By selecting different filter functions, spectral algorithms can generate various regularization methods to solve statistical inverse problems within the learning-from-samples framework. This paper combines distributed spectral algorithms with Sobolev kernels to tackle the functional linear regression problem. The design and mathematical analysis of the algorithms require only that the functional covariates are observed at discrete sample points. Furthermore, the hypothesis function spaces of the algorithms are the Sobolev spaces generated by the Sobolev kernels, optimizing both approximation capability and flexibility. Through the establishment of regularity conditions for the target function and functional covariate, we derive matching upper and lower bounds for the convergence of the distributed spectral algorithms in the Sobolev norm. This demonstrates that the proposed regularity conditions are reasonable and that the convergence analysis under these conditions is tight, capturing the essential characteristics of functional linear regression. The analytical techniques and estimates developed in this paper also enhance existing results in the previous literature.

replace-cross StateAct: State Tracking and Reasoning for Acting and Planning with Large Language Models

Authors: Nikolai Rozanov, Marek Rei

Abstract: Planning and acting to solve `real' tasks using large language models (LLMs) in interactive environments has become a new frontier for AI methods. While recent advances allowed LLMs to interact with online tools, solve robotics tasks and many more, long range reasoning tasks remain a problem for LLMs. Existing methods to address this issue are very resource intensive and require additional data or human crafted rules, instead, we propose a simple method based on few-shot in-context learning alone to enhance `chain-of-thought' with state-tracking for planning and acting with LLMs. We show that our method establishes the new state-of-the-art on Alfworld for in-context learning methods (+14\% over the previous best few-shot in-context learning method) and performs on par with methods that use additional training data and additional tools such as code-execution. We also demonstrate that our enhanced `chain-of-states' allows the agent to both solve longer horizon problems and to be more efficient in number of steps required to solve a task. We show that our method works across a variety of LLMs for both API-based and open source ones. Finally, we also conduct ablation studies and show that `chain-of-thoughts' helps state-tracking accuracy, while a json-structure harms overall performance. We open-source our code and annotations at https://github.com/ai-nikolai/StateAct.

URLs: https://github.com/ai-nikolai/StateAct.

replace-cross The Role of Deductive and Inductive Reasoning in Large Language Models

Authors: Chengkun Cai, Xu Zhao, Haoliang Liu, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, Serge Belongie, Lei Li

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning tasks, yet their reliance on static prompt structures and limited adaptability to complex scenarios remains a significant challenge. In this paper, we propose the Deductive and InDuctive(DID) method, a novel framework that enhances LLM reasoning by dynamically integrating both deductive and inductive reasoning approaches. Drawing from cognitive science principles, DID implements a dual-metric complexity evaluation system that combines Littlestone dimension and information entropy to precisely assess task difficulty and guide decomposition strategies. DID enables the model to progressively adapt its reasoning pathways based on problem complexity, mirroring human cognitive processes. We evaluate DID's effectiveness across multiple benchmarks, including the AIW and MR-GSM8K, as well as our custom Holiday Puzzle dataset for temporal reasoning. Our results demonstrate significant improvements in reasoning quality and solution accuracy - achieving 70.3% accuracy on AIW (compared to 62.2% for Tree of Thought) while maintaining lower computational costs. The success of DID in improving LLM performance while preserving computational efficiency suggests promising directions for developing more cognitively aligned and capable language models. Our work contributes a theoretically grounded, input-centric approach to enhancing LLM reasoning capabilities, offering an efficient alternative to traditional output-exploration methods.

replace-cross SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

Authors: Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, Sanqiang Zhao

Abstract: To acquire instruction-following capabilities, large language models (LLMs) undergo instruction tuning, where they are trained on instruction-response pairs using next-token prediction (NTP). Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning (SFT) datasets, typically requiring data filtering with proprietary LLMs or human annotation. In this paper, we take a different approach by proposing SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning beyond the conventional NTP paradigm, without relying on well-curated datasets. Observing that LLMs exhibit uneven confidence across the semantic representation space, we argue that examples with different confidence levels should play distinct roles in instruction tuning--confident data is prone to overfitting, while unconfident data is harder to generalize. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels, interpolates them to bridge the confidence gap, and applies a Mixup-based regularization to support learning on these additional, interpolated examples. By propagating supervision signals across confidence regions and encouraging linear behavior between them, SFTMix mitigates overfitting in confident examples while enhancing generalization in unconfident ones. We demonstrate the effectiveness of SFTMix in both instruction-following and healthcare-specific SFT tasks, with consistent improvements across LLM families and SFT datasets of varying sizes and qualities. Extensive analyses across six directions highlight SFTMix's compatibility with data selection, adaptability to compute-constrained scenarios, and scalability to broader applications.

replace-cross Cost-aware simulation-based inference

Authors: Ayush Bharti, Daolang Huang, Samuel Kaski, Fran\c{c}ois-Xavier Briol

Abstract: Simulation-based inference (SBI) is the preferred framework for estimating parameters of intractable models in science and engineering. A significant challenge in this context is the large computational cost of simulating data from complex models, and the fact that this cost often depends on parameter values. We therefore propose \textit{cost-aware SBI methods} which can significantly reduce the cost of existing sampling-based SBI methods, such as neural SBI and approximate Bayesian computation. This is achieved through a combination of rejection and self-normalised importance sampling, which significantly reduces the number of expensive simulations needed. Our approach is studied extensively on models from epidemiology to telecommunications engineering, where we obtain significant reductions in the overall cost of inference.

replace-cross Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery

Authors: Pratinav Seth, Michelle Lin, Brefo Dwamena Yaw, Jade Boutot, Mary Kang, David Rolnick

Abstract: Millions of abandoned oil and gas wells are scattered across the world, leaching methane into the atmosphere and toxic compounds into the groundwater. Many of these locations are unknown, preventing the wells from being plugged and their polluting effects averted. Remote sensing is a relatively unexplored tool for pinpointing abandoned wells at scale. We introduce the first large-scale benchmark dataset for this problem, leveraging medium-resolution multi-spectral satellite imagery from Planet Labs. Our curated dataset comprises over 213,000 wells (abandoned, suspended, and active) from Alberta, a region with especially high well density, sourced from the Alberta Energy Regulator and verified by domain experts. We evaluate baseline algorithms for well detection and segmentation, showing the promise of computer vision approaches but also significant room for improvement.

replace-cross KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs

Authors: Yongqin Xu, Huan Li, Ke Chen, Lidan Shou

Abstract: Schema matching (SM) and entity matching (EM) tasks are crucial for data integration. While large language models (LLMs) have shown promising results in these tasks, they suffer from hallucinations and confusion about task instructions. This study presents the Knowledge-Compliant Matching Framework (KcMF), an LLM-based approach that addresses these issues without the need for domain-specific fine-tuning. KcMF employs a once-and-for-all pseudo-code-based task decomposition strategy to adopt natural language statements that guide LLM reasoning and reduce confusion across various task types. We also propose two mechanisms, Dataset as Knowledge (DaK) and Example as Knowledge (EaK), to build domain knowledge sets when unstructured domain knowledge is lacking. Moreover, we introduce a result-ensemble strategy to leverage multiple knowledge sources and suppress badly formatted outputs. Extensive evaluations confirm that KcMF clearly enhances five LLM backbones in both SM and EM tasks while outperforming the non-LLM competitors by an average F1-score of 17.93%.

replace-cross Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning

Authors: Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, Lingpeng Kong

Abstract: Autoregressive language models, despite their impressive capabilities, struggle with complex reasoning and long-term planning tasks. We introduce discrete diffusion models as a novel solution to these challenges. Through the lens of subgoal imbalance, we demonstrate how diffusion models effectively learn difficult subgoals that elude autoregressive approaches. We propose Multi-granularity Diffusion Modeling (MDM), which prioritizes subgoals based on difficulty during learning. On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MDM significantly outperforms autoregressive models without using search techniques. For instance, MDM achieves 91.5\% and 100\% accuracy on Countdown and Sudoku, respectively, compared to 45.8\% and 20.7\% for autoregressive models. Our work highlights the potential of diffusion-based approaches in advancing AI capabilities for sophisticated language understanding and problem-solving tasks.

replace-cross Grammatical Error Correction for Low-Resource Languages: The Case of Zarma

Authors: Mamadou K. Keita, Christopher Homan, Marcos Zampieri, Adwoa Bremang, Habibatou Abdoulaye Alfari, Elysabhete Amadou Ibrahim, Dennis Owusu

Abstract: Grammatical error correction (GEC) aims to improve quality and readability of texts through accurate correction of linguistic mistakes. Previous work has focused on high-resource languages, while low-resource languages lack robust tools. However, low-resource languages often face problems such as: non-standard orthography, limited annotated corpora, and diverse dialects, which slows down the development of GEC tools. We present a study on GEC for Zarma, spoken by over five million in West Africa. We compare three approaches: rule-based methods, machine translation (MT) models, and large language models (LLMs). We evaluated them using a dataset of more than 250,000 examples, including synthetic and human-annotated data. Our results showed that the MT-based approach using M2M100 outperforms others, with a detection rate of 95. 82% and a suggestion accuracy of 78. 90% in automatic evaluations (AE) and an average score of 3.0 out of 5.0 in manual evaluation (ME) from native speakers for grammar and logical corrections. The rule-based method was effective for spelling errors but failed on complex context-level errors. LLMs -- MT5-small -- showed moderate performance. Our work supports use of MT models to enhance GEC in low-resource settings, and we validated these results with Bambara, another West African language.

replace-cross Long-time Integration of Nonlinear Wave Equations with Neural Operators

Authors: Guanhang Lei, Zhen Lei, Lei Shi

Abstract: Neural operators have shown promise in solving many types of Partial Differential Equations (PDEs). They are significantly faster compared to traditional numerical solvers once they have been trained with a certain amount of observed data. However, their numerical performance in solving time-dependent PDEs, particularly in long-time prediction of dynamic systems, still needs improvement. In this paper, we focus on solving the long-time integration of nonlinear wave equations via neural operators by replacing the initial condition with the prediction in a recurrent manner. Given limited observed temporal trajectory data, we utilize some intrinsic features of these nonlinear wave equations, such as conservation laws and well-posedness, to improve the algorithm design and reduce accumulated error. Our numerical experiments examine these improvements in the Korteweg-de Vries (KdV) equation, the sine-Gordon equation, and the Klein-Gordon wave equation on the irregular domain.

replace-cross How to Backdoor Consistency Models?

Authors: Chengen Wang, Murat Kantarcioglu

Abstract: Consistency models are a new class of models that generate images by directly mapping noise to data, allowing for one-step generation and significantly accelerating the sampling process. However, their robustness against adversarial attacks has not yet been thoroughly investigated. In this work, we conduct the first study on the vulnerability of consistency models to backdoor attacks. While previous research has explored backdoor attacks on diffusion models, those studies have primarily focused on conventional diffusion models, employing a customized backdoor training process and objective, whereas consistency models have distinct training processes and objectives. Our proposed framework demonstrates the vulnerability of consistency models to backdoor attacks. During image generation, poisoned consistency models produce images with a Fr\'echet Inception Distance (FID) comparable to that of a clean model when sampling from Gaussian noise. However, once the trigger is activated, they generate backdoor target images. We explore various trigger and target configurations to evaluate the vulnerability of consistency models, including the use of random noise as a trigger. This novel trigger is visually inconspicuous, more challenging to detect, and aligns well with the sampling process of consistency models. Across all configurations, our framework successfully compromises the consistency models while maintaining high utility and specificity. We also examine the stealthiness of our proposed attack, which is attributed to the unique properties of consistency models and the elusive nature of the Gaussian noise trigger. Our code is available at \href{https://github.com/chengenw/backdoorCM}{https://github.com/chengenw/backdoorCM}.

URLs: https://github.com/chengenw/backdoorCM, https://github.com/chengenw/backdoorCM

replace-cross $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization

Authors: Jiaqi Han, Mingjian Jiang, Yuxuan Song, Stefano Ermon, Minkai Xu

Abstract: Preference optimization has made significant progress recently, with numerous methods developed to align language models with human preferences. This paper introduces $f$-divergence Preference Optimization ($f$-PO), a novel framework that generalizes and extends existing approaches. $f$-PO minimizes $f$-divergences between the optimized policy and the optimal policy, encompassing a broad family of alignment methods using various divergences. Our approach unifies previous algorithms like DPO and EXO, while offering new variants through different choices of $f$-divergences. We provide theoretical analysis of $f$-PO's properties and conduct extensive experiments on state-of-the-art language models using benchmark datasets. Results demonstrate $f$-PO's effectiveness across various tasks, achieving superior performance compared to existing methods on popular benchmarks such as AlpacaEval 2, Arena-Hard, MT-Bench, and Open LLM Leaderboard v2. Additionally, we present ablation studies exploring the impact of different $f$-divergences, offering insights into the trade-offs between regularization and performance in offline preference optimization. Our work contributes both practical algorithms and theoretical understanding to the field of language model alignment. Code is available at https://github.com/MinkaiXu/fPO.

URLs: https://github.com/MinkaiXu/fPO.

replace-cross SciPIP: An LLM-based Scientific Paper Idea Proposer

Authors: Wenxiao Wang, Lihui Gu, Liye Zhang, Yunxiang Luo, Yi Dai, Chen Shen, Liang Xie, Binbin Lin, Xiaofei He, Jieping Ye

Abstract: The rapid advancement of large language models (LLMs) has opened new possibilities for automating the proposal of innovative scientific ideas. This process involves two key phases: literature retrieval and idea generation. However, existing approaches often fall short due to their reliance on keyword-based search tools during the retrieval phase, which neglects crucial semantic information and frequently results in incomplete retrieval outcomes. Similarly, in the idea generation phase, current methodologies tend to depend solely on the internal knowledge of LLMs or metadata from retrieved papers, thereby overlooking significant valuable insights contained within the full texts. To address these limitations, we introduce SciPIP, an innovative framework designed to enhance the LLM-based proposal of scientific ideas through improvements in both literature retrieval and idea generation. Our approach begins with the construction of a comprehensive literature database that supports advanced retrieval based not only on keywords but also on semantics and citation relationships. This is complemented by the introduction of a multi-granularity retrieval algorithm aimed at ensuring more thorough and exhaustive retrieval results. For the idea generation phase, we propose a dual-path framework that effectively integrates both the content of retrieved papers and the extensive internal knowledge of LLMs. This integration significantly boosts the novelty, feasibility, and practical value of proposed ideas. Our experiments, conducted across various domains such as natural language processing and computer vision, demonstrate SciPIP's capability to generate a multitude of innovative and useful ideas. These findings underscore SciPIP's potential as a valuable tool for researchers seeking to advance their fields with groundbreaking concepts.

replace-cross BitStack: Any-Size Compression of Large Language Models in Variable Memory Environments

Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu

Abstract: Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from \textit{capability} to \textit{availability}, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce \textbf{BitStack}, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at https://github.com/xinghaow99/BitStack.

URLs: https://github.com/xinghaow99/BitStack.

replace-cross Convergence proofs and strong error bounds for forward-backward stochastic differential equations using neural network simulations

Authors: Oliver Sheridan-Methven

Abstract: We introduce forward-backward stochastic differential equations, highlighting the connection between solutions of these and solutions of partial differential equations, related by the Feynman-Kac theorem. We review the technique of approximating solutions to high dimensional partial differential equations using neural networks, and similarly approximating solutions of stochastic differential equations using multilevel Monte Carlo. Connecting the multilevel Monte Carlo method with the neural network framework using the setup established by E et al. and Raissi, we provide novel numerical analyses to produce strong error bounds for the specific framework of Raissi. Our results bound the overall strong error in terms of the maximum of the discretisation error and the neural network's approximation error. Our analyses are necessary for applications of multilevel Monte Carlo, for which we propose suitable frameworks to exploit the variance structures of the multilevel estimators we elucidate. Also, focusing on the loss function advocated by Raissi, we expose the limitations of this, highlighting and quantifying its bias and variance. Lastly, we propose various avenues of further research which we anticipate should offer significant insight and speed improvements.

replace-cross Quantum Policy Gradient in Reproducing Kernel Hilbert Space

Authors: David M. Bossens, Kishor Bharti, Jayne Thompson

Abstract: Parametrised quantum circuits offer expressive and data-efficient representations for machine learning. Due to quantum states residing in a high-dimensional Hilbert space, parametrised quantum circuits have a natural interpretation in terms of kernel methods. The representation of quantum circuits in terms of quantum kernels has been studied widely in quantum supervised learning, but has been overlooked in the context of quantum RL. This paper proposes parametric and non-parametric policy gradient and actor-critic algorithms with quantum kernel policies in quantum environments. This approach, implemented with both numerical and analytical quantum policy gradient techniques, allows exploiting the many advantages of kernel methods, including available analytic forms for the gradient of the policy and tunable expressiveness. The proposed approach is suitable for vector-valued action spaces and each of the formulations demonstrates a quadratic reduction in query complexity compared to their classical counterparts. Two actor-critic algorithms, one based on stochastic policy gradient and one based on deterministic policy gradient (comparable to the popular DDPG algorithm), demonstrate additional query complexity reductions compared to quantum policy gradient algorithms under favourable conditions.

replace-cross Towards Scalable Insect Monitoring: Ultra-Lightweight CNNs as On-Device Triggers for Insect Camera Traps

Authors: Ross Gardiner, Sareh Rowands, Benno I. Simmons

Abstract: Camera traps, combined with AI, have emerged as a way to achieve automated, scalable biodiversity monitoring. However, the passive infrared (PIR) sensors that trigger camera traps are poorly suited for detecting small, fast-moving ectotherms such as insects. Insects comprise over half of all animal species and are key components of ecosystems and agriculture. The need for an appropriate and scalable insect camera trap is critical in the wake of concerning reports of declines in insect populations. This study proposes an alternative to the PIR trigger: ultra-lightweight convolutional neural networks running on low-powered hardware to detect insects in a continuous stream of captured images. We train a suite of models to distinguish insect images from backgrounds. Our design achieves zero latency between trigger and image capture. Our models are rigorously tested and achieve high accuracy ranging from 91.8% to 96.4% AUC on validation data and >87% AUC on data from distributions unseen during training. The high specificity of our models ensures minimal saving of false positive images, maximising deployment storage efficiency. High recall scores indicate a minimal false negative rate, maximising insect detection. Further analysis with saliency maps shows the learned representation of our models to be robust, with low reliance on spurious background features. Our system is also shown to operate deployed on off-the-shelf, low-powered microcontroller units, consuming a maximum power draw of less than 300mW. This enables longer deployment times using cheap and readily available battery components. Overall we offer a step change in the cost, efficiency and scope of insect monitoring. Solving the challenging trigger problem, we demonstrate a system which can be deployed for far longer than existing designs and budgets power and bandwidth effectively, moving towards a generic insect camera trap.

replace-cross Quantum Hamiltonian Descent for Graph Partition

Authors: Jinglei Cheng, Ruilin Zhou, Yuhang Gan, Chen Qian, Junyu Liu

Abstract: We introduce Quantum Hamiltonian Descent as a novel approach to solve the graph partition problem. By reformulating graph partition as a Quadratic Unconstrained Binary Optimization (QUBO) problem, we leverage QHD's quantum-inspired dynamics to identify optimal community structures. Our method implements a multi-level refinement strategy that alternates between QUBO formulation and QHD optimization to iteratively improve partition quality. Experimental results demonstrate that our QHD-based approach achieves superior modularity scores (up to 5.49\%) improvement with reduced computational overhead compared to traditional optimization methods. This work establishes QHD as an effective quantum-inspired framework for tackling graph partition challenges in large-scale networks.

replace-cross TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models

Authors: Jiahao Wang, Mingyue Cheng, Qingyang Mao, Yitong Zhou, Feiyang Xu, Xin Li

Abstract: Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Effective adaptation of LLMs for MTSC necessitates informative data representations. Existing LLM-based methods directly encode embeddings for time series within the latent space of LLMs from scratch to align with semantic space of LLMs. Despite their effectiveness, we reveal that these methods conceal three inherent bottlenecks: (1) they struggle to encode temporal and channel-specific information in a lossless manner, both of which are critical components of multivariate time series; (2) it is much difficult to align the learned representation space with the semantic space of the LLMs; (3) they require task-specific retraining, which is both computationally expensive and labor-intensive. To bridge these gaps, we propose TableTime, which reformulates MTSC as a table understanding task. Specifically, TableTime introduces the following strategies: (1) convert multivariate time series into a tabular form, thus minimizing information loss to the greatest extent; (2) represent tabular time series in text format to achieve natural alignment with the semantic space of LLMs; (3) design a reasoning framework that integrates contextual text information, neighborhood assistance, multi-path inference and problem decomposition to enhance the reasoning ability of LLMs and realize zero-shot classification. Extensive experiments performed on 10 publicly representative datasets from UEA archive verify the superiorities of the TableTime.

replace-cross Libra: Leveraging Temporal Images for Biomedical Radiology Analysis

Authors: Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Abstract: Radiology report generation (RRG) requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. While multimodal large language models (MLLMs) align with pre-trained vision encoders to enhance visual-language understanding, most existing methods rely on single-image analysis or rule-based heuristics to process multiple images, failing to fully leverage temporal information in multi-modal medical datasets. In this paper, we introduce Libra, a temporal-aware MLLM tailored for chest X-ray report generation. Libra combines a radiology-specific image encoder with a novel Temporal Alignment Connector (TAC), designed to accurately capture and integrate temporal differences between paired current and prior images. Extensive experiments on the MIMIC-CXR dataset demonstrate that Libra establishes a new state-of-the-art benchmark among similarly scaled MLLMs, setting new standards in both clinical relevance and lexical accuracy.

replace-cross Spatial Conformal Inference through Localized Quantile Regression

Authors: Hanyang Jiang, Yao Xie

Abstract: Reliable uncertainty quantification at unobserved spatial locations, especially in the presence of complex and heterogeneous datasets, remains a core challenge in spatial statistics. Traditional approaches like Kriging rely heavily on assumptions such as normality, which often break down in large-scale, diverse datasets, leading to unreliable prediction intervals. While machine learning methods have emerged as powerful alternatives, they primarily focus on point predictions and provide limited mechanisms for uncertainty quantification. Conformal prediction, a distribution-free framework, offers valid prediction intervals without relying on parametric assumptions. However, existing conformal prediction methods are either not tailored for spatial settings, or existing ones for spatial data have relied on rather restrictive i.i.d. assumptions. In this paper, we propose Localized Spatial Conformal Prediction (LSCP), a conformal prediction method designed specifically for spatial data. LSCP leverages localized quantile regression to construct prediction intervals. Instead of i.i.d. assumptions, our theoretical analysis builds on weaker conditions of stationarity and spatial mixing, which is natural for spatial data, providing finite-sample bounds on the conditional coverage gap and establishing asymptotic guarantees for conditional coverage. We present experiments on both synthetic and real-world datasets to demonstrate that LSCP achieves accurate coverage with significantly tighter and more consistent prediction intervals across the spatial domain compared to existing methods.

replace-cross Planning-Guided Diffusion Policy Learning for Generalizable Contact-Rich Bimanual Manipulation

Authors: Xuanlin Li, Tong Zhao, Xinghao Zhu, Jiuguang Wang, Tao Pang, Kuan Fang

Abstract: Contact-rich bimanual manipulation involves precise coordination of two arms to change object states through strategically selected contacts and motions. Due to the inherent complexity of these tasks, acquiring sufficient demonstration data and training policies that generalize to unseen scenarios remain a largely unresolved challenge. Building on recent advances in planning through contacts, we introduce Generalizable Planning-Guided Diffusion Policy Learning (GLIDE), an approach that effectively learns to solve contact-rich bimanual manipulation tasks by leveraging model-based motion planners to generate demonstration data in high-fidelity physics simulation. Through efficient planning in randomized environments, our approach generates large-scale and high-quality synthetic motion trajectories for tasks involving diverse objects and transformations. We then train a task-conditioned diffusion policy via behavior cloning using these demonstrations. To tackle the sim-to-real gap, we propose a set of essential design options in feature extraction, task representation, action prediction, and data augmentation that enable learning robust prediction of smooth action sequences and generalization to unseen scenarios. Through experiments in both simulation and the real world, we demonstrate that our approach can enable a bimanual robotic system to effectively manipulate objects of diverse geometries, dimensions, and physical properties. Website: https://glide-manip.github.io/

URLs: https://glide-manip.github.io/

replace-cross Corrupted Learning Dynamics in Games

Authors: Taira Tsuchiya, Shinji Ito, Haipeng Luo

Abstract: Learning in games refers to scenarios where multiple players interact in a shared environment, each aiming to minimize their regret. An equilibrium can be computed at a fast rate of $O(1/T)$ when all players follow the optimistic follow-the-regularized-leader (OFTRL). However, this acceleration is limited to the honest regime, in which all players adhere to a prescribed algorithm -- a situation that may not be realistic in practice. To address this issue, we present corrupted learning dynamics that adaptively find an equilibrium at a rate that depends on the extent to which each player deviates from the strategy suggested by the prescribed algorithm. First, in two-player zero-sum corrupted games, we provide learning dynamics for which the external regret of $x$-player (and similarly for $y$-player) is roughly bounded by $O(\log (m_x m_y) + \sqrt{\hat{C}_y} + \hat{C}_x)$, where $m_x$ and $m_y$ denote the number of actions of $x$- and $y$-players, respectively, and $\hat{C}_x$ and $\hat{C}_y$ represent their cumulative deviations. We then extend our approach to multi-player general-sum corrupted games, providing learning dynamics for which the swap regret of player $i$ is bounded by $O(\log T + \sqrt{\sum_{k} \hat{C}_k \log T} + \hat{C}_i)$ ignoring dependence on the number of players and actions, where $\hat{C}_i$ is the cumulative deviation of player $i$ from the prescribed algorithm. Our learning dynamics are agnostic to the levels of corruption. A key technical contribution is a new analysis that ensures the stability of a Markov chain under a new adaptive learning rate, thereby allowing us to achieve the desired bound in the corrupted regime while matching the best existing bound in the honest regime. Notably, our framework can be extended to address not only corruption in strategies but also corruption in the observed expected utilities, and we provide several matching lower bounds.

replace-cross CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding

Authors: Jiquan Wang, Sha Zhao, Zhiling Luo, Yangxuan Zhou, Haiteng Jiang, Shijian Li, Tao Li, Gang Pan

Abstract: Electroencephalography (EEG) is a non-invasive technique to measure and record brain electrical activity, widely used in various BCI and healthcare applications. Early EEG decoding methods rely on supervised learning, limited by specific tasks and datasets, hindering model performance and generalizability. With the success of large language models, there is a growing body of studies focusing on EEG foundation models. However, these studies still leave challenges: Firstly, most of existing EEG foundation models employ full EEG modeling strategy. It models the spatial and temporal dependencies between all EEG patches together, but ignores that the spatial and temporal dependencies are heterogeneous due to the unique structural characteristics of EEG signals. Secondly, existing EEG foundation models have limited generalizability on a wide range of downstream BCI tasks due to varying formats of EEG data, making it challenging to adapt to. To address these challenges, we propose a novel foundation model called CBraMod. Specifically, we devise a criss-cross transformer as the backbone to thoroughly leverage the structural characteristics of EEG signals, which can model spatial and temporal dependencies separately through two parallel attention mechanisms. And we utilize an asymmetric conditional positional encoding scheme which can encode positional information of EEG patches and be easily adapted to the EEG with diverse formats. CBraMod is pre-trained on a very large corpus of EEG through patch-based masked EEG reconstruction. We evaluate CBraMod on up to 10 downstream BCI tasks (12 public datasets). CBraMod achieves the state-of-the-art performance across the wide range of tasks, proving its strong capability and generalizability. The source code is publicly available at https://github.com/wjq-learning/CBraMod.

URLs: https://github.com/wjq-learning/CBraMod.

replace-cross MemHunter: Automated and Verifiable Memorization Detection at Dataset-scale in LLMs

Authors: Zhenpeng Wu, Jian Lou, Zibin Zheng, Chuan Chen

Abstract: Large language models (LLMs) have been shown to memorize and reproduce content from their training data, raising significant privacy concerns, especially with web-scale datasets. Existing methods for detecting memorization are primarily sample-specific, relying on manually crafted or discretely optimized memory-inducing prompts generated on a per-sample basis, which become impractical for dataset-level detection due to the prohibitive computational cost of iterating through all samples. In real-world scenarios, data owners may need to verify whether a susceptible LLM has memorized their dataset, particularly if the LLM may have collected the data from the web without authorization. To address this, we introduce MemHunter, which trains a memory-inducing LLM and employs hypothesis testing to efficiently detect memorization at the dataset level, without requiring sample-specific memory inducing. Experiments on models like Pythia and Llama demonstrate that MemHunter can extract up to 40% more training data than existing methods under constrained time resources and reduce search time by up to 80% when integrated as a plug-in. Crucially, MemHunter is the first method capable of dataset-level memorization detection, providing a critical tool for assessing privacy risks in LLMs powered by large-scale datasets.

replace-cross FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model

Authors: Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, Lin Shao

Abstract: We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-centric generative Planning (FLIP), a model-based planning algorithm on visual space that features three key modules: 1. a multi-modal flow generation model as the general-purpose action proposal module; 2. a flow-conditioned video generation model as the dynamics module; and 3. a vision-language representation learning model as the value module. Given an initial image and language instruction as the goal, FLIP can progressively search for long-horizon flow and video plans that maximize the discounted return to accomplish the task. FLIP is able to synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation, and the dense flow information also provides rich guidance for long-horizon video generation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution. Experiments on diverse benchmarks demonstrate that FLIP can improve both the success rates and quality of long-horizon video plan synthesis and has the interactive world model property, opening up wider applications for future works.Video demos are on our website: https://nus-lins-lab.github.io/flipweb/.

URLs: https://nus-lins-lab.github.io/flipweb/.

replace-cross Precise Asymptotics and Refined Regret of Variance-Aware UCB

Authors: Yingying Fan, Yuxuan Han, Jinchi Lv, Xiaocong Xu, Zhengyuan Zhou

Abstract: In this paper, we study the behavior of the Upper Confidence Bound-Variance (UCB-V) algorithm for the Multi-Armed Bandit (MAB) problems, a variant of the canonical Upper Confidence Bound (UCB) algorithm that incorporates variance estimates into its decision-making process. More precisely, we provide an asymptotic characterization of the arm-pulling rates for UCB-V, extending recent results for the canonical UCB in Kalvit and Zeevi (2021) and Khamaru and Zhang (2024). In an interesting contrast to the canonical UCB, our analysis reveals that the behavior of UCB-V can exhibit instability, meaning that the arm-pulling rates may not always be asymptotically deterministic. Besides the asymptotic characterization, we also provide non-asymptotic bounds for the arm-pulling rates in the high probability regime, offering insights into the regret analysis. As an application of this high probability result, we establish that UCB-V can achieve a more refined regret bound, previously unknown even for more complicate and advanced variance-aware online decision-making algorithms.

replace-cross Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations

Authors: Yudi Xie, Weichen Huang, Esther Alter, Jeremy Schwartz, Joshua B. Tenenbaum, James J. DiCarlo

Abstract: Studies of the functional role of the primate ventral visual stream have traditionally focused on object categorization, often ignoring -- despite much prior evidence -- its role in estimating "spatial" latents such as object position and pose. Most leading ventral stream models are derived by optimizing networks for object categorization, which seems to imply that the ventral stream is also derived under such an objective. Here, we explore an alternative hypothesis: Might the ventral stream be optimized for estimating spatial latents? And a closely related question: How different -- if at all -- are representations learned from spatial latent estimation compared to categorization? To ask these questions, we leveraged synthetic image datasets generated by a 3D graphic engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories, and the spatial latent performance of models strongly correlates with their neural alignment. Spatial latent and category-trained models have very similar -- but not identical -- internal representations, especially in their early and middle layers. We provide evidence that this convergence is partly driven by non-target latent variability in the training data, which facilitates the implicit learning of representations of those non-target latents. Taken together, these results suggest that many training objectives, such as spatial latents, can lead to similar models aligned neurally with the ventral stream. Thus, one should not assume that the ventral stream is optimized for object categorization only. As a field, we need to continue to sharpen our measures of comparing models to brains to better understand the functional roles of the ventral stream.

replace-cross From Specific-MLLMs to Omni-MLLMs: A Survey on MLLMs Aligned with Multi-modalities

Authors: Shixin Jiang, Jiafeng Liang, Jiyuan Wang, Xuan Dong, Heng Chang, Weijiang Yu, Jinhua Du, Ming Liu, Bing Qin

Abstract: To tackle complex tasks in real-world scenarios, more researchers are focusing on Omni-MLLMs, which aim to achieve omni-modal understanding and generation. Beyond the constraints of any specific non-linguistic modality, Omni-MLLMs map various non-linguistic modalities into the embedding space of LLMs and enable the interaction and understanding of arbitrary combinations of modalities within a single model. In this paper, we systematically investigate relevant research and provide a comprehensive survey of Omni-MLLMs. Specifically, we first explain the four core components of Omni-MLLMs for unified multi-modal modeling with a meticulous taxonomy that offers novel perspectives. Then, we introduce the effective integration achieved through two-stage training and discuss the corresponding datasets as well as evaluation. Furthermore, we summarize the main challenges of current Omni-MLLMs and outline future directions. We hope this paper serves as an introduction for beginners and promotes the advancement of related research. Resources will be made public.

replace-cross The Eclipsing Binaries via Artificial Intelligence. II. Need for Speed in PHOEBE Forward Models

Authors: Marcin Wrona, Andrej Pr\v{s}a

Abstract: In modern astronomy, the quantity of data collected has vastly exceeded the capacity for manual analysis, necessitating the use of advanced artificial intelligence (AI) techniques to assist scientists with the most labor-intensive tasks. AI can optimize simulation codes where computational bottlenecks arise from the time required to generate forward models. One such example is PHOEBE, a modeling code for eclipsing binaries (EBs), where simulating individual systems is feasible, but analyzing observables for extensive parameter combinations is highly time-consuming. To address this, we present a fully connected feedforward artificial neural network (ANN) trained on a dataset of over one million synthetic light curves generated with PHOEBE. Optimization of the ANN architecture yielded a model with six hidden layers, each containing 512 nodes, provides an optimized balance between accuracy and computational complexity. Extensive testing enabled us to establish ANN's applicability limits and to quantify the systematic and statistical errors associated with using such networks for EB analysis. Our findings demonstrate the critical role of dilution effects in parameter estimation for EBs, and we outline methods to incorporate these effects in AI-based models. This proposed ANN framework enables a speedup of over four orders of magnitude compared to traditional methods, with systematic errors not exceeding 1\%, and often as low as 0.01\%, across the entire parameter space.

replace-cross Do Language Models Understand Time?

Authors: Xi Ding, Lei Wang

Abstract: Large language models (LLMs) have revolutionized video-based computer vision applications, including action recognition, anomaly detection, and video summarization. Videos inherently pose unique challenges, combining spatial complexity with temporal dynamics that are absent in static images or textual data. Current approaches to video understanding with LLMs often rely on pretrained video encoders to extract spatiotemporal features and text encoders to capture semantic meaning. These representations are integrated within LLM frameworks, enabling multimodal reasoning across diverse video tasks. However, the critical question persists: Can LLMs truly understand the concept of time, and how effectively can they reason about temporal relationships in videos? This work critically examines the role of LLMs in video processing, with a specific focus on their temporal reasoning capabilities. We identify key limitations in the interaction between LLMs and pretrained encoders, revealing gaps in their ability to model long-term dependencies and abstract temporal concepts such as causality and event progression. Furthermore, we analyze challenges posed by existing video datasets, including biases, lack of temporal annotations, and domain-specific limitations that constrain the temporal understanding of LLMs. To address these gaps, we explore promising future directions, including the co-evolution of LLMs and encoders, the development of enriched datasets with explicit temporal labels, and innovative architectures for integrating spatial, temporal, and semantic reasoning. By addressing these challenges, we aim to advance the temporal comprehension of LLMs, unlocking their full potential in video analysis and beyond. Our paper's GitHub repository can be found at https://github.com/Darcyddx/Video-LLM.

URLs: https://github.com/Darcyddx/Video-LLM.

replace-cross OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving

Authors: Shuo Xing, Chengyuan Qian, Yuping Wang, Hongyuan Hua, Kexin Tian, Yang Zhou, Zhengzhong Tu

Abstract: Since the advent of Multimodal Large Language Models (MLLMs), they have made a significant impact across a wide range of real-world applications, particularly in Autonomous Driving (AD). Their ability to process complex visual data and reason about intricate driving scenarios has paved the way for a new paradigm in end-to-end AD systems. However, the progress of developing end-to-end models for AD has been slow, as existing fine-tuning methods demand substantial resources, including extensive computational power, large-scale datasets, and significant funding. Drawing inspiration from recent advancements in inference computing, we propose OpenEMMA, an open-source end-to-end framework based on MLLMs. By incorporating the Chain-of-Thought reasoning process, OpenEMMA achieves significant improvements compared to the baseline when leveraging a diverse range of MLLMs. Furthermore, OpenEMMA demonstrates effectiveness, generalizability, and robustness across a variety of challenging driving scenarios, offering a more efficient and effective approach to autonomous driving. We release all the codes in https://github.com/taco-group/OpenEMMA.

URLs: https://github.com/taco-group/OpenEMMA.

replace-cross Investigating the importance of social vulnerability in opioid-related mortality across the United States

Authors: Andrew Deas, Adam Spannaus, Dakotah D. Maguire, Jodie Trafton, Anuj J. Kapadia, Vasileios Maroulas

Abstract: The opioid crisis remains a critical public health challenge in the United States. Despite national efforts to reduce opioid prescribing rates by nearly 45\% between 2011 and 2021, opioid overdose deaths more than tripled during this same period. This alarming trend reflects a major shift in the crisis, with illegal opioids now driving the majority of overdose deaths instead of prescription opioids. Although much attention has been given to supply-side factors fueling this transition, the underlying socioeconomic conditions that perpetuate and exacerbate opioid misuse remain less understood. Moreover, the COVID-19 pandemic intensified the opioid crisis through widespread social isolation and record-high unemployment; consequently, understanding the socioeconomic drivers of this epidemic has become even more crucial in recent years. To address this need, our study examines the correlation between opioid-related mortality and thirteen components of the Social Vulnerability Index (SVI). Leveraging a nationwide county-level dataset spanning consecutive years from 2010 to 2022, this study integrates empirical insights from exploratory data analysis with feature importance metrics derived from machine learning models. Our findings highlight critical social factors strongly correlated with opioid-related mortality, emphasizing their potential roles in worsening the epidemic when their levels are high and mitigating it when their levels are low.

replace-cross Robustness-aware Automatic Prompt Optimization

Authors: Zeru Shi, Zhenting Wang, Yongye Su, Weidi Luo, Hang Gao, Fan Yang, Ruixiang Tang, Yongfeng Zhang

Abstract: The performance of Large Language Models (LLMs) depends on the quality of prompts and the semantic and structural integrity of the input data. However, existing prompt generation methods primarily focus on well-structured input data, often neglecting the impact of perturbed inputs on prompt effectiveness. To address this limitation, we propose BATprompt (By Adversarial Training prompt), a novel method for prompt generation designed to withstand input perturbations (such as typos in the input). Inspired by adversarial training techniques, BATprompt demonstrates strong performance on a variety of perturbed tasks through a two-step process: adversarial perturbation and iterative optimization on unperturbed input via LLM. Unlike conventional adversarial attack methods, BATprompt does not need access to model parameters and gradients. Instead, BATprompt leverages the advanced reasoning, language understanding and self reflection capabilities of LLMs to simulate gradients, guiding the generation of adversarial perturbations and optimizing prompt performance. We evaluate BATprompt on multiple datasets across both language understanding and generation tasks. The results indicate that BATprompt outperforms existing prompt generation methods, delivering superior robustness and performance under diverse perturbation scenarios.

replace-cross Token-Budget-Aware LLM Reasoning

Authors: Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, Zhenyu Chen

Abstract: Reasoning is critical for large language models (LLMs) to excel in a wide range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM performance by decomposing problems into intermediate steps, they also incur significant overhead in token usage, leading to increased costs. We find that the reasoning process of current LLMs is unnecessarily lengthy and it can be compressed by including a reasonable token budget in the prompt, but the choice of token budget plays a crucial role in the actual compression effectiveness. We then propose a token-budget-aware LLM reasoning framework, which dynamically estimates token budgets for different problems based on reasoning complexity and uses the estimated token budgets to guide the reasoning process. Experiments show that our method effectively reduces token costs in CoT reasoning with only a slight performance reduction, offering a practical solution to balance efficiency and accuracy in LLM reasoning. Code: https://github.com/GeniusHTX/TALE.

URLs: https://github.com/GeniusHTX/TALE.

replace-cross Mask Approximation Net: A Novel Diffusion Model Approach for Remote Sensing Change Captioning

Authors: Dongwei Sun, Jing Yao, Changsheng Zhou, Xiangyong Cao, Pedram Ghamisi

Abstract: Remote sensing image change description represents an innovative multimodal task within the realm of remote sensing processing. This task not only facilitates the detection of alterations in surface conditions, but also provides comprehensive descriptions of these changes, thereby improving human interpretability and interactivity.Generally, existing deep-learning-based methods predominantly utilized a three-stage framework that successively perform feature extraction, feature fusion, and localization from bitemporal images before text generation. However, this reliance often leads to an excessive focus on the design of specific network architectures and restricts the feature distributions to the dataset at hand, which in turn results in limited generalizability and robustness during application.To address these limitations, this paper proposes a novel approach for remote sensing image change detection and description that incorporates diffusion models, aiming to transition the emphasis of modeling paradigms from conventional feature learning to data distribution learning. The proposed method primarily includes a simple multi-scale change detection module, whose output features are subsequently refined by an well-designed diffusion model. Furthermore, we introduce a frequency-guided complex filter module to boost the model performance by managing high-frequency noise throughout the diffusion process. We validate the effectiveness of our proposed method across several datasets for remote sensing change detection and description, showcasing its superior performance compared to existing techniques. The code will be available at \href{https://github.com/sundongwei}{MaskApproxNet} after a possible publication.

URLs: https://github.com/sundongwei

replace-cross Probing Visual Language Priors in VLMs

Authors: Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee

Abstract: Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors existing in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q\&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although, humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17\% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form ``good-bad" image pairs for self-training. Our training objectives compel VLMs to focus more on the actual visual inputs, and we demonstrate their effectiveness in boosting the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.

replace-cross Transfer Learning Analysis of Variational Quantum Circuits

Authors: Huan-Hsin Tseng, Hsin-Yi Lin, Samuel Yen-Chi Chen, Shinjae Yoo

Abstract: This work analyzes transfer learning of the Variational Quantum Circuit (VQC). Our framework begins with a pretrained VQC configured in one domain and calculates the transition of 1-parameter unitary subgroups required for a new domain. A formalism is established to investigate the adaptability and capability of a VQC under the analysis of loss bounds. Our theory observes knowledge transfer in VQCs and provides a heuristic interpretation for the mechanism. An analytical fine-tuning method is derived to attain the optimal transition for adaptations of similar domains.

replace-cross Evolving Skeletons: Motion Dynamics in Action Recognition

Authors: Jushang Qiu, Lei Wang

Abstract: Skeleton-based action recognition has gained significant attention for its ability to efficiently represent spatiotemporal information in a lightweight format. Most existing approaches use graph-based models to process skeleton sequences, where each pose is represented as a skeletal graph structured around human physical connectivity. Among these, the Spatiotemporal Graph Convolutional Network (ST-GCN) has become a widely used framework. Alternatively, hypergraph-based models, such as the Hyperformer, capture higher-order correlations, offering a more expressive representation of complex joint interactions. A recent advancement, termed Taylor Videos, introduces motion-enhanced skeleton sequences by embedding motion concepts, providing a fresh perspective on interpreting human actions in skeleton-based action recognition. In this paper, we conduct a comprehensive evaluation of both traditional skeleton sequences and Taylor-transformed skeletons using ST-GCN and Hyperformer models on the NTU-60 and NTU-120 datasets. We compare skeletal graph and hypergraph representations, analyzing static poses against motion-injected poses. Our findings highlight the strengths and limitations of Taylor-transformed skeletons, demonstrating their potential to enhance motion dynamics while exposing current challenges in fully using their benefits. This study underscores the need for innovative skeletal modelling techniques to effectively handle motion-rich data and advance the field of action recognition.

replace-cross BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Authors: Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang

Abstract: Large language models (LLMs) have demonstrated impressive ability in solving complex mathematical problems with multi-step reasoning and can be further enhanced with well-designed in-context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information. We observe that while LLMs excel at decomposing mathematical problems, they often struggle with reasoning errors in fine-grained steps. Moreover, ICL examples retrieved at the question level may omit critical steps or even mislead the model with irrelevant details. To address this issue, we propose BoostStep, a method that enhances reasoning accuracy through step-aligned ICL, a novel mechanism that carefully aligns retrieved reference steps with the corresponding reasoning steps. Additionally, BoostStep incorporates an effective "first-try" strategy to deliver exemplars highly relevant to the current state of reasoning. BoostStep is a flexible and powerful method that integrates seamlessly with chain-of-thought (CoT) and tree search algorithms, refining both candidate selection and decision-making. Empirical results show that BoostStep improves GPT-4o's CoT performance by 4.6% across mathematical benchmarks, significantly surpassing traditional few-shot learning's 1.2%. Moreover, it can achieve an additional 7.5\% gain combined with tree search. Surprisingly, it enhances state-of-the-art LLMs to solve challenging math problems using simpler examples. It improves DeepSeek-R1-671B's performance on AIME by 2.2%, leveraging simple examples only from the MATH dataset.

replace-cross Multimodal semantic retrieval for product search

Authors: Dong Liu, Esther Lopez Ramos

Abstract: Semantic retrieval (also known as dense retrieval) based on textual data has been extensively studied for both web search and product search application fields, where the relevance of a query and a potential target document is computed by their dense vector representation comparison. Product image is crucial for e-commerce search interactions and is a key factor for customers at product explorations. However, its impact on semantic retrieval has not been well studied yet. In this research, we build a multimodal representation for product items in e-commerce search in contrast to pure-text representation of products, and investigate the impact of such representations. The models are developed and evaluated on e-commerce datasets. We demonstrate that a multimodal representation scheme for a product can show improvement either on purchase recall or relevance accuracy in semantic retrieval. Additionally, we provide numerical analysis for exclusive matches retrieved by a multimodal semantic retrieval model versus a text-only semantic retrieval model, to demonstrate the validation of multimodal solutions.

replace-cross Real-time Verification and Refinement of Language Model Text Generation

Authors: Joonho Ko, Jinheon Baek, Sung Ju Hwang

Abstract: Large language models (LLMs) have shown remarkable performance across a wide range of natural language tasks. However, a critical challenge remains in that they sometimes generate factually incorrect answers. To address this, while many previous work has focused on identifying errors in their generation and further refining them, they are slow in deployment since they are designed to verify the response from LLMs only after their entire generation (from the first to last tokens) is done. Further, we observe that once LLMs generate incorrect tokens early on, there is a higher likelihood that subsequent tokens will also be factually incorrect. To this end, in this work, we propose Streaming-VR (Streaming Verification and Refinement), a novel approach designed to enhance the efficiency of verification and refinement of LLM outputs. Specifically, the proposed Streaming-VR enables on-the-fly verification and correction of tokens as they are being generated, similar to a streaming process, ensuring that each subset of tokens is checked and refined in real-time by another LLM as the LLM constructs its response. Through comprehensive evaluations on multiple datasets, we demonstrate that our approach not only enhances the factual accuracy of LLMs, but also offers a more efficient solution compared to prior refinement methods.

replace-cross AI Guide Dog: Egocentric Path Prediction on Smartphone

Authors: Aishwarya Jadhav, Jeffery Cao, Abhishree Shetty, Urvashi Priyam Kumar, Aditi Sharma, Ben Sukboontip, Jayant Sravan Tamarapalli, Jingyi Zhang, Anirudh Koul

Abstract: This paper presents AI Guide Dog (AIGD), a lightweight egocentric (first-person) navigation system for visually impaired users, designed for real-time deployment on smartphones. AIGD employs a vision-only multi-label classification approach to predict directional commands, ensuring safe navigation across diverse environments. We introduce a novel technique for goal-based outdoor navigation by integrating GPS signals and high-level directions, while also handling uncertain multi-path predictions for destination-free indoor navigation. As the first navigation assistance system to handle both goal-oriented and exploratory navigation across indoor and outdoor settings, AIGD establishes a new benchmark in blind navigation. We present methods, datasets, evaluations, and deployment insights to encourage further innovations in assistive navigation systems.

replace-cross Estimating shared subspace with AJIVE: the power and limitation of multiple data matrices

Authors: Yuepeng Yang, Cong Ma

Abstract: Integrative data analysis often requires disentangling joint and individual variations across multiple datasets, a challenge commonly addressed by the Joint and Individual Variation Explained (JIVE) model. While numerous methods have been developed to estimate the shared subspace under JIVE, the theoretical understanding of their performance remains limited, particularly in the context of multiple matrices and varying degrees of subspace misalignment. This paper bridges this gap by providing a systematic analysis of shared subspace estimation in multi-matrix settings. We focus on the Angle-based Joint and Individual Variation Explained (AJIVE) method, a two-stage spectral approach, and establish new performance guarantees that uncover its strengths and limitations. Specifically, we show that in high signal-to-noise ratio (SNR) regimes, AJIVE's estimation error decreases with the number of matrices, demonstrating the power of multi-matrix integration. Conversely, in low-SNR settings, AJIVE exhibits a non-diminishing error, highlighting fundamental limitations. To complement these results, we derive minimax lower bounds, showing that AJIVE achieves optimal rates in high-SNR regimes. Furthermore, we analyze an oracle-aided spectral estimator to demonstrate that the non-diminishing error in low-SNR scenarios is a fundamental barrier. Extensive numerical experiments corroborate our theoretical findings, providing insights into the interplay between SNR, the number of matrices, and subspace misalignment.

replace-cross iTool: Boosting Tool Use of Large Language Models via Iterative Reinforced Fine-Tuning

Authors: Yirong Zeng, Xiao Ding, Yuxian Wang, Weiwen Liu, Wu Ning, Yutai Hou, Xu Huang, Bing Qin, Ting Liu

Abstract: Augmenting large language models (LLMs) with external tools is known as a promising approach to enhancing their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve it. Nevertheless, our investigation reveals that (1) training gains significantly decay as synthetic data increases. The model struggles to benefit from more synthetic data due to potential data diversity issues, resulting in poor performance in complex scenarios. Moreover, we find that (2) this challenge primarily manifests as minor discrepancies between the model's output and the ground truth response (termed as deficiency), such as errors in parameter values that require complex reasoning from the context to resolve. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate these challenges. This strategy involves: (1) enhancing the diversity of synthetic data through path exploration of Monte Carlo Tree Search. (2) iteratively identifying deficiency-related data, constructing fine-grained preference pairs to pinpoint deficiencies, and then applying preference optimization to optimize these deficiencies. Our experiments show that models trained using our method achieve about 3\% better performance than same-size models, outperforming larger open-source and closed-source models.

replace-cross The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution

Authors: Frank Sifei Luan, Ziming Mao, Ron Yifeng Wang, Charlotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow, Eric Liang, Ion Stoica, Stephanie Wang

Abstract: While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements. They excel at CPU-based computation but either under-utilize heterogeneous resources or impose high overheads on failure and reconfiguration. We introduce the streaming batch model, a hybrid of the two models that enables efficient and fault-tolerant heterogeneous execution. The key idea is to execute one partition at a time to allow lineage-based recovery with dynamic resource allocation. This enables memory-efficient pipelining across heterogeneous resources, similar to stream processing, but also offers the elasticity and fault tolerance properties of batch processing. We present Ray Data, an implementation of the streaming batch model that improves throughput on heterogeneous batch inference pipelines by 3--8$\times$ compared to traditional batch and stream processing systems. When training Stable Diffusion, Ray Data matches the throughput of single-node ML data loaders while additionally leveraging distributed heterogeneous clusters to further improve training throughput by 31%.

replace-cross LITE: Efficiently Estimating Gaussian Probability of Maximality

Authors: Nicolas Menet (ETH Z\"urich), Jonas H\"ubotter (ETH Z\"urich), Parnian Kassraie (ETH Z\"urich), Andreas Krause (ETH Z\"urich)

Abstract: We consider the problem of computing the probability of maximality (PoM) of a Gaussian random vector, i.e., the probability for each dimension to be maximal. This is a key challenge in applications ranging from Bayesian optimization to reinforcement learning, where the PoM not only helps with finding an optimal action, but yields a fine-grained analysis of the action domain, crucial in tasks such as drug discovery. Existing techniques are costly, scaling polynomially in computation and memory with the vector size. We introduce LITE, the first approach for estimating Gaussian PoM with almost-linear time and memory complexity. LITE achieves SOTA accuracy on a number of tasks, while being in practice several orders of magnitude faster than the baselines. This also translates to a better performance on downstream tasks such as entropy estimation and optimal control of bandits. Theoretically, we cast LITE as entropy-regularized UCB and connect it to prior PoM estimators.

replace-cross Option-ID Based Elimination For Multiple Choice Questions

Authors: Zhenhao Zhu, Bulou Liu, Qingyao Ai, Yiqun Liu

Abstract: Multiple choice questions (MCQs) are a popular and important task for evaluating large language models (LLMs). Based on common strategies people use when answering MCQs, the process of elimination (PoE) has been proposed as an effective problem-solving method. Existing methods to the PoE generally fall into two categories: one involves having the LLM directly select the incorrect options, while the other involves scoring the options. However, both methods incur high computational costs and often perform worse than methods that directly answer the MCQs with the option IDs. To address this issue, this paper proposes a PoE based on option ID. Specifically, our method eliminates option by selecting the option ID with the lowest probability. We conduct experiments with 10 different LLMs in zero-shot settings on 7 publicly available datasets. The experimental results demonstrate that our method significantly improves the LLM's performance. Further analysis reveals that the sequential elimination strategy can effectively enhance the LLM's reasoning ability. Additionally, we find that sequential elimination is also applicable to few-shot settings and can be combined with debias methods to further improve LLM's performance.

replace-cross iFormer: Integrating ConvNet and Transformer for Mobile Application

Authors: Chuanyang Zheng

Abstract: We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, \textit{i.e.}, ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4\% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios.

replace-cross Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?

Authors: Yutong Yin, Zhaoran Wang

Abstract: Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources. For example, if someone learns ( B = f(A) ) from one source and ( C = g(B) ) from another, they can deduce ( C=g(B)=g(f(A)) ) even without encountering ( ABC ) together, showcasing the generalization ability of human intelligence. In this paper, we introduce a synthetic learning task, "FTCT" (Fragmented at Training, Chained at Testing), to validate the potential of Transformers in replicating this skill and interpret its inner mechanism. In the training phase, data consist of separated knowledge fragments from an overall causal graph. During testing, Transformers must infer complete causal graph traces by integrating these fragments. Our findings demonstrate that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning on FTCT by revealing correct combinations of fragments, even if such combinations were absent in the training data. Furthermore, the emergence of compositional reasoning ability is strongly correlated with the model complexity and training-testing data similarity. We propose, both theoretically and empirically, that Transformers learn an underlying generalizable program from training, enabling effective compositional reasoning during testing.

replace-cross Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models

Authors: Hao Dong, Moru Liu, Kaiyang Zhou, Eleni Chatzi, Juho Kannala, Cyrill Stachniss, Olga Fink

Abstract: In real-world scenarios, achieving domain adaptation and generalization poses significant challenges, as models must adapt to or generalize across unknown target distributions. Extending these capabilities to unseen multimodal distributions, i.e., multimodal domain adaptation and generalization, is even more challenging due to the distinct characteristics of different modalities. Significant progress has been made over the years, with applications ranging from action recognition to semantic segmentation. Besides, the recent advent of large-scale pre-trained multimodal foundation models, such as CLIP, has inspired works leveraging these models to enhance adaptation and generalization performances or adapting them to downstream tasks. This survey provides the first comprehensive review of recent advances from traditional approaches to foundation models, covering: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models. For each topic, we formally define the problem and thoroughly review existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions. We maintain an active repository that contains up-to-date literature at https://github.com/donghao51/Awesome-Multimodal-Adaptation.

URLs: https://github.com/donghao51/Awesome-Multimodal-Adaptation.

replace-cross Learning Autonomous Code Integration for Math Language Models

Authors: Haozhe Wang, Long Li, Chao Qu, Fengming Zhu, Weidi Xu, Wei Chu, Fangzhen Lin

Abstract: Recent advances in mathematical problem-solving with language models (LMs) integrate chain-of-thought (CoT) reasoning and code execution to harness their complementary strengths. However, existing hybrid frameworks exhibit a critical limitation: they depend on externally dictated instructions or rigid code-integration templates, lacking metacognitive awareness -- the capacity to dynamically evaluate intrinsic capabilities and autonomously determine when and how to integrate tools. This rigidity motivates our study of autonomous code integration, enabling models to adapt tool-usage strategies as their reasoning abilities evolve during training. While reinforcement learning (RL) shows promise for boosting LLM reasoning at scale (e.g., DeepSeek-R1), we demonstrate its inefficiency in learning autonomous code integration due to inadequate exploration of the vast combinatorial space of CoT-code interleaving patterns. To address this challenge, we propose a novel Expectation-Maximization (EM) framework that synergizes structured exploration (E-step) with off-policy RL optimization (M-step), creating a self-reinforcing cycle between metacognitive tool-use decisions and evolving capabilities. Experiments reveal our method achieves superior results through improved exploration. Notably, our 7B model improves over 11% on MATH500 and 9.4% on AIME without o1-like CoT.

replace-cross Language Models Struggle to Achieve a Consistent Temporal Representation of Facts

Authors: Hichem Ammar Khodja, Fr\'ed\'eric B\'echet, Quentin Brabant, Alexis Nasr, Gw\'enol\'e Lecorv\'e

Abstract: Language Models (LMs) have shown substantial improvements in handling factual knowledge, yet their capability to consistently represent temporal facts, which are valid only within specific timeframes, remains underexplored. To investigate this, we introduce TimeStress, a novel dataset comprising 521K statements on 2003 of the most popular temporal facts in Wikidata. Each statement contextualizes a fact with correct and incorrect dates across three precisions (Day, Month, Year). This setup allows us to evaluate LMs' ability to discern between correct and incorrect temporal statements based on their probability of being generated. We assess 18 LMs across various architectures using two metrics: the win rate, indicating how often correct dates outperform incorrect ones, and robustness, reflecting consistent performance across all dates. Our findings reveal that while some LMs achieve a win rate exceeding 80\%, robustness remains low, with the best model achieving only 6\%. Furthermore, robust knowledge at one date precision does not reliably transfer to others, highlighting a significant generalization gap. These results underscore the struggle of LMs to maintain a consistent temporal representation, supporting their limitations as reliable sources of temporal knowledge. We provide all data and code for further research.

replace-cross Comparative Analysis of FPGA and GPU Performance for Machine Learning-Based Track Reconstruction at LHCb

Authors: Fotis I. Giasemis, Vladimir Lon\v{c}ar, Bertrand Granado, Vladimir Vava Gligorov

Abstract: In high-energy physics, the increasing luminosity and detector granularity at the Large Hadron Collider are driving the need for more efficient data processing solutions. Machine Learning has emerged as a promising tool for reconstructing charged particle tracks, due to its potentially linear computational scaling with detector hits. The recent implementation of a graph neural network-based track reconstruction pipeline in the first level trigger of the LHCb experiment on GPUs serves as a platform for comparative studies between computational architectures in the context of high-energy physics. This paper presents a novel comparison of the throughput of ML model inference between FPGAs and GPUs, focusing on the first step of the track reconstruction pipeline$\unicode{x2013}$an implementation of a multilayer perceptron. Using HLS4ML for FPGA deployment, we benchmark its performance against the GPU implementation and demonstrate the potential of FPGAs for high-throughput, low-latency inference without the need for an expertise in FPGA development and while consuming significantly less power.

replace-cross Rethinking Latent Representations in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation

Authors: Shuanghao Bai, Wanqi Zhou, Pengxiang Ding, Wei Zhao, Donglin Wang, Badong Chen

Abstract: Behavior Cloning (BC) is a widely adopted visual imitation learning method in robot manipulation. Current BC approaches often enhance generalization by leveraging large datasets and incorporating additional visual and textual modalities to capture more diverse information. However, these methods overlook whether the learned representations contain redundant information and lack a solid theoretical foundation to guide the learning process. To address these limitations, we adopt an information-theoretic perspective and introduce mutual information to quantify and mitigate redundancy in latent representations. Building on this, we incorporate the Information Bottleneck (IB) principle into BC, which extends the idea of reducing redundancy by providing a structured framework for compressing irrelevant information while preserving task-relevant features. This work presents the first comprehensive study on redundancy in latent representations across various methods, backbones, and experimental settings, while extending the generalizability of the IB to BC. Extensive experiments and analyses on the CortexBench and LIBERO benchmarks demonstrate significant performance improvements with IB, underscoring the importance of reducing input data redundancy and highlighting its practical value for more practical applications. Project Page: https://baishuanghao.github.io/BC-IB.github.io.

URLs: https://baishuanghao.github.io/BC-IB.github.io.

replace-cross IRIS: An Immersive Robot Interaction System

Authors: Xinkai Jiang, Qihao Yuan, Enes Ulas Dincer, Hongyi Zhou, Ge Li, Xueyin Li, Julius Haag, Nicolas Schreiber, Kailai Li, Gerhard Neumann, Rudolf Lioutikov

Abstract: This paper introduces IRIS, an immersive Robot Interaction System leveraging Extended Reality (XR), designed for robot data collection and interaction across multiple simulators, benchmarks, and real-world scenarios. While existing XR-based data collection systems provide efficient and intuitive solutions for large-scale data collection, they are often challenging to reproduce and reuse. This limitation arises because current systems are highly tailored to simulator-specific use cases and environments. IRIS is a novel, easily extendable framework that already supports multiple simulators, benchmarks, and even headsets. Furthermore, IRIS is able to include additional information from real-world sensors, such as point clouds captured through depth cameras. A unified scene specification is generated directly from simulators or real-world sensors and transmitted to XR headsets, creating identical scenes in XR. This specification allows IRIS to support any of the objects, assets, and robots provided by the simulators. In addition, IRIS introduces shared spatial anchors and a robust communication protocol that links simulations between multiple XR headsets. This feature enables multiple XR headsets to share a synchronized scene, facilitating collaborative and multi-user data collection. IRIS can be deployed on any device that supports the Unity Framework, encompassing the vast majority of commercially available headsets. In this work, IRIS was deployed and tested on the Meta Quest 3 and the HoloLens 2. IRIS showcased its versatility across a wide range of real-world and simulated scenarios, using current popular robot simulators such as MuJoCo, IsaacSim, CoppeliaSim, and Genesis. In addition, a user study evaluates IRIS on a data collection task for the LIBERO benchmark. The study shows that IRIS significantly outperforms the baseline in both objective and subjective metrics.

replace-cross End-to-End Learning Framework for Solving Non-Markovian Optimal Control

Authors: Xiaole Zhang, Peiyu Zhang, Xiongye Xiao, Shixuan Li, Vasileios Tzoumas, Vijay Gupta, Paul Bogdan

Abstract: Integer-order calculus often falls short in capturing the long-range dependencies and memory effects found in many real-world processes. Fractional calculus addresses these gaps via fractional-order integrals and derivatives, but fractional-order dynamical systems pose substantial challenges in system identification and optimal control due to the lack of standard control methodologies. In this paper, we theoretically derive the optimal control via linear quadratic regulator (LQR) for fractional-order linear time-invariant (FOLTI) systems and develop an end-to-end deep learning framework based on this theoretical foundation. Our approach establishes a rigorous mathematical model, derives analytical solutions, and incorporates deep learning to achieve data-driven optimal control of FOLTI systems. Our key contributions include: (i) proposing an innovative system identification method control strategy for FOLTI systems, (ii) developing the first end-to-end data-driven learning framework, Fractional-Order Learning for Optimal Control (FOLOC), that learns control policies from observed trajectories, and (iii) deriving a theoretical analysis of sample complexity to quantify the number of samples required for accurate optimal control in complex real-world problems. Experimental results indicate that our method accurately approximates fractional-order system behaviors without relying on Gaussian noise assumptions, pointing to promising avenues for advanced optimal control.

replace-cross ProofWala: Multilingual Proof Data Synthesis and Theorem-Proving

Authors: Amitayush Thakur, George Tsoukalas, Greg Durrett, Swarat Chaudhuri

Abstract: Neural networks have shown substantial promise at automatic theorem-proving in interactive proof assistants (ITPs) like Lean and Coq. However, most neural theorem-proving models are restricted to specific ITPs, leaving out opportunities for cross-lingual $\textit{transfer}$ between ITPs. We address this weakness with a multilingual proof framework, ${\rm P{\small ROOF}W{\small ALA}}$, that allows a standardized form of interaction between neural theorem-provers and two established ITPs (Coq and Lean). It enables the collection of multilingual proof step data -- data recording the result of proof actions on ITP states -- for training neural provers. ${\rm P{\small ROOF}W{\small ALA}}$ allows the systematic evaluation of a model's performance across different ITPs and problem domains via efficient parallel proof search algorithms. We show that multilingual training enabled by ${\rm P{\small ROOF}W{\small ALA}}$ can lead to successful transfer across ITPs. Specifically, a model trained on a mix of ${\rm P{\small ROOF}W{\small ALA}}$-generated Coq and Lean data outperforms Lean-only and Coq-only models on the standard prove-at-$k$ metric. We open source all code including code for the ${\rm P{\small ROOF}W{\small ALA}}$ Framework (https://github.com/trishullab/proof-wala), and the Multilingual ITP interaction framework (https://github.com/trishullab/itp-interface).

URLs: https://github.com/trishullab/proof-wala),, https://github.com/trishullab/itp-interface).

replace-cross Self-Supervised Prompt Optimization

Authors: Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu, Fengwei Teng, Jinhao Tu, Xinbing Liang, Sirui Hong, Chenglin Wu, Yuyu Luo

Abstract: Well-designed prompts are crucial for enhancing Large language models' (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or by humans, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external reference. Motivated by the observations that prompt quality manifests directly in LLM outputs and LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at https://github.com/geekan/MetaGPT/blob/main/examples/spo

URLs: https://github.com/geekan/MetaGPT/blob/main/examples/spo

replace-cross SnipGen: A Mining Repository Framework for Evaluating LLMs for Code

Authors: Daniel Rodriguez-Cardenas, Alejandro Velasco, Denys Poshyvanyk

Abstract: Language Models (LLMs), such as transformer-based neural networks trained on billions of parameters, have become increasingly prevalent in software engineering (SE). These models, trained on extensive datasets that include code repositories, exhibit remarkable capabilities for SE tasks. However, evaluating their effectiveness poses significant challenges, primarily due to the potential overlap between the datasets used for training and those employed for evaluation. To address this issue, we introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation. SnipGen aims to mitigate data contamination by generating robust testbeds and crafting tailored data points to assist researchers and practitioners in evaluating LLMs for code-related tasks. In our exploratory study, SnipGen mined approximately 227K data points from 338K recent code changes in GitHub commits, focusing on method-level granularity. SnipGen features a collection of prompt templates that can be combined to create a Chain-of-Thought-like sequence of prompts, enabling a nuanced assessment of LLMs' code generation quality. By providing the mining tool, the methodology, and the dataset, SnipGen empowers researchers and practitioners to rigorously evaluate and interpret LLMs' performance in software engineering contexts.

replace-cross MEMIT-Merge: Addressing MEMIT's Key-Value Conflicts in Same-Subject Batch Editing for LLMs

Authors: Zilu Dong, Xiangqing Shen, Rui Xia

Abstract: As large language models continue to scale up, knowledge editing techniques that modify models' internal knowledge without full retraining have gained significant attention. MEMIT, a prominent batch editing algorithm, stands out for its capability to perform mass knowledge modifications. However, we uncover a critical limitation that MEMIT's editing efficacy significantly deteriorates when processing batches containing multiple edits sharing the same subject. Our analysis reveals that the root cause lies in MEMIT's key value modeling framework: When multiple facts with the same subject in a batch are modeled through MEMIT's key value mechanism, identical keys (derived from the shared subject) are forced to represent different values (corresponding to different knowledge), resulting in updates conflicts during editing. Addressing this issue, we propose MEMIT-Merge, an enhanced approach that merges value computation processes for facts sharing the same subject, effectively resolving the performance degradation in same-subject batch editing scenarios. Experimental results demonstrate that when MEMIT's edit success rate drops to around 50% at larger batch sizes, MEMIT-Merge maintains a success rate exceeding 90%, showcasing remarkable robustness to subject entity collisions.

replace-cross Novel computational workflows for natural and biomedical image processing based on hypercomplex algebras

Authors: Nektarios A. Valous, Eckhard Hitzer, Drago\c{s} Du\c{s}e, Rodrigo Rojas Moraleda, Ferdinand Popp, Meggy Suarez-Carmona, Anna Berthel, Ismini Papageorgiou, Carlo Fremd, Alexander R\"olle, Christina C. Westhoff, B\'en\'edicte Lenoir, Niels Halama, Inka Z\"ornig, Dirk J\"ager

Abstract: Hypercomplex image processing extends conventional techniques in a unified paradigm encompassing algebraic and geometric principles. This work leverages quaternions and the two-dimensional orthogonal planes split framework (splitting of a quaternion - representing a pixel - into pairs of orthogonal 2D planes) for natural/biomedical image analysis through the following computational workflows and outcomes: natural/biomedical image re-colorization, natural image de-colorization, natural/biomedical image contrast enhancement, computational re-staining and stain separation in histological images, and performance gains in machine/deep learning pipelines for histological images. The workflows are analyzed separately for natural and biomedical images to showcase the effectiveness of the proposed approaches. The proposed workflows can regulate color appearance (e.g. with alternative renditions and grayscale conversion) and image contrast, be part of automated image processing pipelines (e.g. isolating stain components, boosting learning models), and assist in digital pathology applications (e.g. enhancing biomarker visibility, enabling colorblind-friendly renditions). Employing only basic arithmetic and matrix operations, this work offers a computationally accessible methodology - in the hypercomplex domain - that showcases versatility and consistency across image processing tasks and a range of computer vision and biomedical applications. The proposed non-data-driven methods achieve comparable or better results (particularly in cases involving well-known methods) to those reported in the literature, showcasing the potential of robust theoretical frameworks with practical effectiveness. Results, methods, and limitations are detailed alongside discussion of promising extensions, emphasizing the potential of feature-rich mathematical/computational frameworks for natural and biomedical images.

replace-cross Graph Foundation Models for Recommendation: A Comprehensive Survey

Authors: Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi

Abstract: Recommender systems (RS) serve as a fundamental tool for navigating the vast expanse of online information, with deep learning advancements playing an increasingly important role in improving ranking accuracy. Among these, graph neural networks (GNNs) excel at extracting higher-order structural information, while large language models (LLMs) are designed to process and comprehend natural language, making both approaches highly effective and widely adopted. Recent research has focused on graph foundation models (GFMs), which integrate the strengths of GNNs and LLMs to model complex RS problems more efficiently by leveraging the graph-based structure of user-item relationships alongside textual understanding. In this survey, we provide a comprehensive overview of GFM-based RS technologies by introducing a clear taxonomy of current approaches, diving into methodological details, and highlighting key challenges and future directions. By synthesizing recent advancements, we aim to offer valuable insights into the evolving landscape of GFM-based recommender systems.

replace-cross QA-Expand: Multi-Question Answer Generation for Enhanced Query Expansion in Information Retrieval

Authors: Wonduk Seo, Seunghyun Lee

Abstract: Query expansion is widely used in Information Retrieval (IR) to improve search outcomes by enriching queries with additional contextual information. Although recent Large Language Model (LLM) based methods generate pseudo-relevant content and expanded terms via multiple prompts, they often yield repetitive, narrow expansions that lack the diverse context needed to retrieve all relevant information. In this paper, we introduce QA-Expand, a novel and effective framework for query expansion. It first generates multiple relevant questions from the initial query and subsequently produces corresponding pseudo-answers as surrogate documents. A feedback model further rewrites and filters these answers to ensure only the most informative augmentations are incorporated. Extensive experiments on benchmarks such as BEIR and TREC demonstrate that QA-Expand enhances retrieval performance by up to 13% over state-of-the-art methods, offering a robust solution for modern retrieval challenges.

replace-cross Zero-shot generation of synthetic neurosurgical data with large language models

Authors: Austin A. Barr, Eddie Guo, Emre Sezgin

Abstract: Clinical data is fundamental to advance neurosurgical research, but access is often constrained by data availability, small sample sizes, privacy regulations, and resource-intensive preprocessing and de-identification procedures. Synthetic data offers a potential solution to challenges associated with accessing and using real-world data (RWD). This study aims to evaluate the capability of zero-shot generation of synthetic neurosurgical data with a large language model (LLM), GPT-4o, by benchmarking with the conditional tabular generative adversarial network (CTGAN). Synthetic datasets were compared to real-world neurosurgical data to assess fidelity (means, proportions, distributions, and bivariate correlations), utility (ML classifier performance on RWD), and privacy (duplication of records from RWD). The GPT-4o-generated datasets matched or exceeded CTGAN performance, despite no fine-tuning or access to RWD for pre-training. Datasets demonstrated high univariate and bivariate fidelity to RWD without directly exposing any real patient records, even at amplified sample size. Training an ML classifier on GPT-4o-generated data and testing on RWD for a binary prediction task showed an F1 score (0.706) with comparable performance to training on the CTGAN data (0.705) for predicting postoperative functional status deterioration. GPT-4o demonstrated a promising ability to generate high-fidelity synthetic neurosurgical data. These findings also indicate that data synthesized with GPT-4o can effectively augment clinical data with small sample sizes, and train ML models for prediction of neurosurgical outcomes. Further investigation is necessary to improve the preservation of distributional characteristics and boost classifier performance.

replace-cross Human-LLM Coevolution: Evidence from Academic Writing

Authors: Mingmeng Geng, Roberto Trotta

Abstract: With a statistical analysis of arXiv paper abstracts, we report a marked drop in the frequency of several words previously identified as overused by ChatGPT, such as "delve", starting soon after they were pointed out in early 2024. The frequency of certain other words favored by ChatGPT, such as "significant", has instead kept increasing. These phenomena suggest that some authors of academic papers have adapted their use of large language models (LLMs), for example, by selecting outputs or applying modifications to the LLM-generated content. Such coevolution and cooperation of humans and LLMs thus introduce additional challenges to the detection of machine-generated text in real-world scenarios. Estimating the impact of LLMs on academic writing by examining word frequency remains feasible, and more attention should be paid to words that were already frequently employed, including those that have decreased in frequency due to LLMs' disfavor.

replace-cross Improved Online Confidence Bounds for Multinomial Logistic Bandits

Authors: Joongkyu Lee, Min-hwan Oh

Abstract: In this paper, we propose an improved online confidence bound for multinomial logistic (MNL) models and apply this result to MNL bandits, achieving variance-dependent optimal regret. Recently, Lee & Oh (2024) established an online confidence bound for MNL models and achieved nearly minimax-optimal regret in MNL bandits. However, their results still depend on the norm-boundedness of the unknown parameter $B$ and the maximum size of possible outcomes $K$. To address this, we first derive an online confidence bound of $O\left(\sqrt{d \log t} + B \right)$, which is a significant improvement over the previous bound of $O (B \sqrt{d} \log t \log K )$ (Lee & Oh, 2024). This is mainly achieved by establishing tighter self-concordant properties of the MNL loss and introducing a novel intermediary term to bound the estimation error. Using this new online confidence bound, we propose a constant-time algorithm, OFU-MNL++, which achieves a variance-dependent regret bound of $O \Big( d \log T \sqrt{ \smash[b]{\sum_{t=1}^T} \sigma_t^2 } \Big) $ for sufficiently large $T$, where $\sigma_t^2$ denotes the variance of the rewards at round $t$, $d$ is the dimension of the contexts, and $T$ is the total number of rounds. Furthermore, we introduce a Maximum Likelihood Estimation (MLE)-based algorithm, OFU-MN$^2$L, which achieves an anytime poly(B)-free regret of $O \Big( d \log (BT) \sqrt{ \smash[b]{\sum_{t=1}^T} \sigma_t^2 } \Big) $.