new A New Perspective to Boost Performance Fairness for Medical Federated Learning

Authors: Yunlu Yan, Lei Zhu, Yuexiang Li, Xinxing Xu, Rick Siow Mong Goh, Yong Liu, Salman Khan, Chun-Mei Feng

Abstract: Improving the fairness of federated learning (FL) benefits healthy and sustainable collaboration, especially for medical applications. However, existing fair FL methods ignore the specific characteristics of medical FL applications, i.e., domain shift among the datasets from different hospitals. In this work, we propose Fed-LWR to improve performance fairness from the perspective of feature shift, a key issue influencing the performance of medical FL systems caused by domain shift. Specifically, we dynamically perceive the bias of the global model across all hospitals by estimating the layer-wise difference in feature representations between local and global models. To minimize global divergence, we assign higher weights to hospitals with larger differences. The estimated client weights help us to re-aggregate the local models per layer to obtain a fairer global model. We evaluate our method on two widely used federated medical image segmentation benchmarks. The results demonstrate that our method achieves better and fairer performance compared with several state-of-the-art fair FL methods.
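The layer-wise re-aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the feature differences are estimated or normalized, so the per-layer difference scores are taken as given and the proportional normalization in `layer_weights` is an assumption. The names `layer_weights` and `reaggregate` are hypothetical.

```python
from typing import Dict, List


def layer_weights(differences: List[float]) -> List[float]:
    """Turn one layer's per-client difference scores into aggregation weights.

    Assumption: weights are simply proportional to the scores, so clients
    with larger local/global differences receive larger weights.
    """
    total = sum(differences)
    if total <= 0:
        # Fall back to uniform weights when no difference is observed.
        return [1.0 / len(differences)] * len(differences)
    return [d / total for d in differences]


def reaggregate(
    local_models: List[Dict[str, List[float]]],
    differences: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Aggregate client parameters layer by layer with layer-specific weights.

    local_models: one dict per client, mapping layer name -> flat parameters.
    differences:  layer name -> one difference score per client.
    """
    global_model: Dict[str, List[float]] = {}
    for layer, scores in differences.items():
        weights = layer_weights(scores)
        size = len(local_models[0][layer])
        global_model[layer] = [
            sum(w * model[layer][i] for w, model in zip(weights, local_models))
            for i in range(size)
        ]
    return global_model
```

With two clients whose scores for a layer are 1.0 and 3.0, the second client contributes three quarters of that layer's aggregated parameters.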

new Deep Learning-driven Mobile Traffic Measurement Collection and Analysis

Authors: Yini Fang

Abstract: Modelling dynamic traffic patterns and especially the continuously changing dependencies between different base stations, which previous studies overlook, is challenging. Traditional algorithms struggle to process large volumes of data and to extract deep insights that help elucidate mobile traffic demands with fine granularity, as well as how these demands will evolve in the future. Therefore, in this thesis we harness the powerful hierarchical feature learning abilities of Deep Learning (DL) techniques in both spatial and temporal domains and develop solutions for precise city-scale mobile traffic analysis and forecasting. Firstly, we design Spider, a mobile traffic measurement collection and reconstruction framework with a view to reducing the cost of measurement collection and inferring traffic consumption with high accuracy, despite working with sparse information. In particular, we train a reinforcement learning agent to selectively sample subsets of target mobile coverage areas and tackle the large action space problem specific to this setting. We then introduce a lightweight neural network model to reconstruct the traffic consumption based on historical sparse measurements. Our proposed framework outperforms existing solutions on a real-world mobile traffic dataset. Secondly, we design SDGNet, a handover-aware graph neural network model for long-term mobile traffic forecasting. We model the cellular network as a graph, and leverage handover frequency to capture the dependencies between base stations across time. Handover information reflects user mobility such as daily commutes, which helps to increase forecast accuracy. We propose a dynamic graph convolution to extract features from both traffic consumption and handover data, and show that our model outperforms other benchmark graph models on a mobile traffic dataset collected by a major network operator.
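As an illustration of how handover frequency could define graph edges between base stations, the sketch below row-normalizes a matrix of handover counts into edge weights. The normalization scheme and the function name `handover_adjacency` are assumptions made for this example; the abstract does not describe how SDGNet actually builds its graph.

```python
from typing import List


def handover_adjacency(counts: List[List[int]]) -> List[List[float]]:
    """Row-normalize handover counts into edge weights.

    counts[i][j] is the number of handovers observed from base station i
    to base station j in one time window. Each row of the result sums to 1
    unless the station had no outgoing handovers, in which case it is all 0.
    """
    adjacency: List[List[float]] = []
    for row in counts:
        total = sum(row)
        adjacency.append([c / total if total > 0 else 0.0 for c in row])
    return adjacency
```

Recomputing this matrix for each time window gives a sequence of graphs whose edge weights change with user mobility.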

new Novel Development of LLM Driven mCODE Data Model for Improved Clinical Trial Matching to Enable Standardization and Interoperability in Oncology Research

Authors: Aarsh Shekhar, Mincheol Kim

Abstract: Each year, the lack of efficient data standardization and interoperability in cancer care contributes to the severe lack of timely and effective diagnosis, while constantly adding to the burden of cost, with cancer costs nationally reaching over $208 billion in 2023 alone. Traditional methods regarding clinical trial enrollment and clinical care in oncology are often manual, time-consuming, and lack a data-driven approach. This paper presents a novel framework to streamline standardization, interoperability, and exchange of cancer domains and enhance the integration of oncology-based EHRs across disparate healthcare systems. This paper utilizes advanced LLMs and Computer Engineering to streamline cancer clinical trials and discovery. By utilizing FHIR's resource-based approach and LLM-generated mCODE profiles, we ensure timely, accurate, and efficient sharing of patient information across disparate healthcare systems. Our methodology involves transforming unstructured patient treatment data, PDFs, free-text information, and progress notes into enriched mCODE profiles, facilitating seamless integration with our novel AI and ML-based clinical trial matching engine. The results of this study show a significant improvement in data standardization, with accuracy rates of our trained LLM peaking at over 92% with datasets consisting of thousands of patient data. Additionally, our LLM demonstrated an accuracy rate of 87% for SNOMED-CT, 90% for LOINC, and 84% for RxNorm codes. This exceeds the current status quo, with LLMs such as GPT-4 and Claude 3.5 peaking at an average of 77%. This paper underscores the potential of our standardization and interoperability framework, paving the way for more efficient and personalized cancer treatment.

new GNNRL-Smoothing: A Prior-Free Reinforcement Learning Model for Mesh Smoothing

Authors: Zhichao Wang, Xinhai Chen, Chunye Gong, Bo Yang, Liang Deng, Yufei Sun, Yufei Pang, Jie Liu

Abstract: Mesh smoothing methods can enhance mesh quality by eliminating distorted elements, leading to improved convergence in simulations. To balance the efficiency and robustness of the traditional mesh smoothing process, previous approaches have employed supervised learning and reinforcement learning to train intelligent smoothing models. However, these methods rely heavily on labeled datasets or prior knowledge to guide the models' learning. Furthermore, their limited capacity to enhance mesh connectivity often restricts the effectiveness of smoothing. In this paper, we first systematically analyze the learning mechanisms of recent intelligent smoothing methods and propose a prior-free reinforcement learning model for intelligent mesh smoothing. Our proposed model integrates graph neural networks with reinforcement learning to implement an intelligent node smoothing agent and introduces, for the first time, a mesh connectivity improvement agent. We formalize mesh optimization as a Markov Decision Process and successfully train both agents using Twin Delayed Deep Deterministic Policy Gradient and Double Dueling Deep Q-Network in the absence of any prior data or knowledge. We verify the proposed model on both 2D and 3D meshes. Experimental results demonstrate that our model achieves feature-preserving smoothing on complex 3D surface meshes. It also achieves state-of-the-art results among intelligent smoothing methods on 2D meshes and is 7.16 times faster than traditional optimization-based smoothing methods. Moreover, the connectivity improvement agent can effectively enhance the quality distribution of the mesh.

new Multidimensional Knowledge Graph Embeddings for International Trade Flow Analysis

Authors: Durgesh Nandini, Simon Bloethner, Mirco Schoenfeld, Mario Larch

Abstract: Understanding the complex dynamics of high-dimensional, contingent, and strongly nonlinear economic data, often shaped by multiplicative processes, poses significant challenges for traditional regression methods as such methods offer limited capacity to capture the structural changes they feature. To address this, we propose leveraging the potential of knowledge graph embeddings for economic trade data, in particular, to predict international trade relationships. We implement KonecoKG, a knowledge graph representation of economic trade data with multidimensional relationships using SDM-RDFizer, and transform the relationships into a knowledge graph embedding using AmpliGraph.

new Deep Learning and Machine Learning -- Python Data Structures and Mathematics Fundamental: From Theory to Practice

Authors: Silin Chen, Ziqian Bi, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Yizhu Wen, Tianyang Wang, Ming Li, Jintao Ren, Qian Niu, Ming Liu

Abstract: This book provides a comprehensive introduction to the foundational concepts of machine learning (ML) and deep learning (DL). It bridges the gap between theoretical mathematics and practical application, focusing on Python as the primary programming language for implementing key algorithms and data structures. The book covers a wide range of topics, including basic and advanced Python programming, fundamental mathematical operations, matrix operations, linear algebra, and optimization techniques crucial for training ML and DL models. Advanced subjects like neural networks, optimization algorithms, and frequency domain methods are also explored, along with real-world applications of large language models (LLMs) and artificial intelligence (AI) in big data management. Designed for both beginners and advanced learners, the book emphasizes the critical role of mathematical principles in developing scalable AI solutions. Practical examples and Python code are provided throughout, ensuring readers gain hands-on experience in applying theoretical knowledge to solve complex problems in ML, DL, and big data analytics.

new Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts

Authors: Sheryl Paul, Jyotirmoy V. Deshmukh

Abstract: Reinforcement learning (RL) has been successfully applied to solve the problem of finding obstacle-free paths for autonomous agents operating in stochastic and uncertain environments. However, when the underlying stochastic dynamics of the environment experience drastic distribution shifts, the optimal policy obtained in the trained environment may be sub-optimal or may entirely fail in helping find goal-reaching paths for the agent. Approaches like domain randomization and robust RL can provide robust policies, but typically assume minor (bounded) distribution shifts. For substantial distribution shifts, retraining (either with a warm-start policy or from scratch) is an alternative approach. In this paper, we develop a novel approach called Evolutionary Robust Policy Optimization (ERPO), an adaptive re-training algorithm inspired by evolutionary game theory (EGT). ERPO learns an optimal policy for the shifted environment iteratively using a temperature parameter that controls the trade-off between exploration and adherence to the old optimal policy. The policy update itself is an instantiation of the replicator dynamics used in EGT. We show that under fairly common sparsity assumptions on rewards in such environments, ERPO converges to the optimal policy in the shifted environment. We empirically demonstrate that for path finding tasks in a number of popular environments, ERPO outperforms several popular RL and deep RL algorithms (PPO, A3C, DQN) in many scenarios. This includes scenarios where the RL algorithms are allowed to train from scratch in the new environment, when they are retrained on the new environment, or when they are used in conjunction with domain randomization. ERPO shows faster policy adaptation, higher average rewards, and reduced computational costs in policy adaptation.
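For readers unfamiliar with replicator dynamics, the sketch below shows one generic discrete-time replicator step applied to the action probabilities of a single state. It is not the ERPO update: the abstract does not give the exact rule, so the exponential fitness and the way `temperature` enters are assumptions, and `replicator_step` is a hypothetical name.

```python
import math
from typing import List


def replicator_step(
    policy: List[float], values: List[float], temperature: float = 1.0
) -> List[float]:
    """One discrete-time replicator update for a single state.

    policy: current action probabilities (must sum to 1).
    values: estimated value of each action in the shifted environment.
    temperature: assumed knob; larger values keep the result closer to the
        old policy, smaller values move mass faster toward high-value actions.
    """
    best = max(values)
    # Exponential fitness keeps every term positive; subtracting the maximum
    # only rescales all fitness values and improves numerical stability.
    fitness = [math.exp((v - best) / temperature) for v in values]
    mean_fitness = sum(p * f for p, f in zip(policy, fitness))
    # Each action's share grows in proportion to its fitness relative to the
    # population average, which is the defining replicator property.
    return [p * f / mean_fitness for p, f in zip(policy, fitness)]
```

Actions with above-average fitness gain probability, and the result remains a valid distribution because the update divides by the mean fitness.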

new Prototype-Based Methods in Explainable AI and Emerging Opportunities in the Geosciences

Authors: Anushka Narayanan, Karianne J. Bergen

Abstract: Prototype-based methods are intrinsically interpretable XAI methods that produce predictions and explanations by comparing input data with a set of learned prototypical examples that are representative of the training data. In this work, we discuss a series of developments in the field of prototype-based XAI that show potential for scientific learning tasks, with a focus on the geosciences. We organize the prototype-based XAI literature into three themes: the development and visualization of prototypes, types of prototypes, and the use of prototypes in various learning tasks. We discuss how the authors use prototype-based methods, their novel contributions, and any limitations or challenges that may arise when adapting these methods for geoscientific learning tasks. We highlight differences between geoscientific data sets and the standard benchmarks used to develop XAI methods, and discuss how specific geoscientific applications may benefit from using or modifying existing prototype-based XAI techniques.

new Enhancing Deep Learning based RMT Data Inversion using Gaussian Random Field

Authors: Koustav Ghosal, Arun Singh, Samir Malakar, Shalivahan Srivastava, Deepak Gupta

Abstract: Deep learning (DL) methods have emerged as a powerful tool for the inversion of geophysical data. When applied to field data, these models often struggle without additional fine-tuning of the network. This is because they are built on the assumption that the statistical patterns in the training and test datasets are the same. To address this, we propose a DL-based inversion scheme for Radio Magnetotelluric data where the subsurface resistivity models are generated using Gaussian Random Fields (GRF). The network's generalization ability was tested with an out-of-distribution (OOD) dataset comprising a homogeneous background and various rectangular-shaped anomalous bodies. After end-to-end training with the GRF dataset, the pre-trained network successfully identified anomalies in the OOD dataset. Synthetic experiments confirmed that the GRF dataset enhances generalization compared to a homogeneous background OOD dataset. The network accurately recovered structures in a checkerboard resistivity model, and demonstrated robustness to noise, outperforming traditional gradient-based methods. Finally, the developed scheme is tested using exemplary field data from a waste site near Roorkee, India. The proposed scheme enhances generalization in a data-driven supervised learning framework, suggesting a promising direction for OOD generalization in DL methods.

new Evaluating Deep Learning Approaches for Predictions in Unmonitored Basins with Continental-scale Stream Temperature Models

Authors: Jared D. Willard, Fabio Ciulla, Helen Weierbach, Vipin Kumar, Charuleka Varadharajan

Abstract: The prediction of streamflows and other environmental variables in unmonitored basins is a grand challenge in hydrology. Recent machine learning (ML) models can harness vast datasets for accurate predictions at large spatial scales. However, there are open questions regarding model design and data needed for inputs and training to improve performance. This study explores these questions while demonstrating the ability of deep learning models to make accurate stream temperature predictions in unmonitored basins across the conterminous United States. First, we compare top-down models that utilize data from a large number of basins with bottom-up methods that transfer ML models built on local sites, reflecting traditional regionalization techniques. We also evaluate an intermediary grouped modeling approach that categorizes sites based on regional co-location or similarity of catchment characteristics. Second, we evaluate trade-offs between model complexity, prediction accuracy, and applicability for more target locations by systematically removing inputs. We then examine model performance when additional training data becomes available due to reductions in input requirements. Our results suggest that top-down models significantly outperform bottom-up and grouped models. Moreover, it is possible to get acceptable accuracy by reducing both dynamic and static inputs enabling predictions for more sites with lower model complexity and computational needs. From detailed error analysis, we determined that the models are more accurate for sites primarily controlled by air temperatures compared to locations impacted by groundwater and dams. By addressing these questions, this research offers a comprehensive perspective on optimizing ML model design for accurate predictions in unmonitored regions.

new Simultaneous Dimensionality Reduction for Extracting Useful Representations of Large Empirical Multimodal Datasets

Authors: Eslam Abdelaleem

Abstract: The quest for simplification in physics drives the exploration of concise mathematical representations for complex systems. This dissertation focuses on the concept of dimensionality reduction as a means to obtain low-dimensional descriptions from high-dimensional data, facilitating comprehension and analysis. We address the challenges posed by real-world data that defy conventional assumptions, such as complex interactions within neural systems or high-dimensional dynamical systems. Leveraging insights from both theoretical physics and machine learning, this work unifies diverse reduction methods under a comprehensive framework, the Deep Variational Multivariate Information Bottleneck. This framework enables the design of tailored reduction algorithms based on specific research questions. We explore and assert the efficacy of simultaneous reduction approaches over their independent reduction counterparts, demonstrating their superiority in capturing covariation between multiple modalities, while requiring less data. We also introduce novel techniques, such as the Deep Variational Symmetric Information Bottleneck, for general nonlinear simultaneous reduction. We show that the same principle of simultaneous reduction is the key to efficient estimation of mutual information. We show that our new method is able to discover the coordinates of high-dimensional observations of dynamical systems. Through analytical investigations and empirical validations, we shed light on the intricacies of dimensionality reduction methods, paving the way for enhanced data analysis across various domains. We underscore the potential of these methodologies to extract meaningful insights from complex datasets, driving advancements in fundamental research and applied sciences. As these methods evolve, they promise to deepen our understanding of complex systems and inform more effective data analysis strategies.

new Hypergraph Neural Networks Reveal Spatial Domains from Single-cell Transcriptomics Data

Authors: Mehrad Soltani, Luis Rueda

Abstract: The task of spatial clustering of transcriptomics data is of paramount importance. It enables the classification of tissue samples into diverse subpopulations of cells, which, in turn, facilitates the analysis of the biological functions of clusters, tissue reconstruction, and cell-cell interactions. Many approaches leverage gene expressions, spatial locations, and histological images to detect spatial domains; however, Graph Neural Networks (GNNs), the state-of-the-art models, are limited by the assumption of pairwise connections between nodes. In the case of domain detection in spatial transcriptomics, some cells are not directly related yet are grouped into the same domain, which shows that GNNs are unable to capture implicit connections among the cells. While graph edges connect only two nodes, hyperedges connect an arbitrary number of nodes, which lets Hypergraph Neural Networks (HGNNs) capture and utilize richer and more complex structural information than traditional GNNs. We use autoencoders, which are well-suited for unsupervised learning, to address the lack of ground-truth labels. Our model has demonstrated exceptional performance, achieving the highest iLISI score of 1.843 compared to other methods. This score indicates the greatest diversity of cell types identified by our method. Furthermore, our model outperforms other methods in downstream clustering, achieving the highest ARI value of 0.51 and Leiden score of 0.60.
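The difference between pairwise edges and hyperedges can be made concrete with an incidence matrix, the usual way of encoding a hypergraph. The sketch below is a generic illustration; how the paper constructs hyperedges from cells is not stated in the abstract, and `incidence_matrix` is a hypothetical helper.

```python
from typing import List


def incidence_matrix(num_nodes: int, hyperedges: List[List[int]]) -> List[List[int]]:
    """Build a node-by-hyperedge incidence matrix.

    Entry [v][e] is 1 when node v belongs to hyperedge e, else 0. An ordinary
    graph is the special case where every column contains exactly two 1s.
    """
    matrix = [[0] * len(hyperedges) for _ in range(num_nodes)]
    for edge_index, members in enumerate(hyperedges):
        for node in members:
            matrix[node][edge_index] = 1
    return matrix
```

For example, `incidence_matrix(4, [[0, 1, 2], [2, 3]])` places cells 0, 1, and 2 in one shared hyperedge, a grouping that a single pairwise edge cannot express.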

new Causal Order Discovery based on Monotonic SCMs

Authors: Ali Izadi, Martin Ester

Abstract: In this paper, we consider the problem of causal order discovery within the framework of monotonic Structural Causal Models (SCMs), which have gained attention for their potential to enable causal inference and causal discovery from observational data. While existing approaches either assume prior knowledge about the causal order or use complex optimization techniques to impose sparsity in the Jacobian of Triangular Monotonic Increasing maps, our work introduces a novel sequential procedure that directly identifies the causal order by iteratively detecting the root variable. This method eliminates the need for sparsity assumptions and the associated optimization challenges, enabling the identification of a unique SCM without the need for multiple independence tests to break the Markov equivalence class. We demonstrate the effectiveness of our approach in sequentially finding the root variable, comparing it to methods that maximize Jacobian sparsity.

new Recommendations for Comprehensive and Independent Evaluation of Machine Learning-Based Earth System Models

Authors: Paul A. Ullrich, Elizabeth A. Barnes, William D. Collins, Katherine Dagon, Shiheng Duan, Joshua Elms, Jiwoo Lee, L. Ruby Leung, Dan Lu, Maria J. Molina, Travis A. O'Brien

Abstract: Machine learning (ML) is a revolutionary technology with demonstrable applications across multiple disciplines. Within the Earth science community, ML has been most visible for weather forecasting, producing forecasts that rival modern physics-based models. Given the importance of deepening our understanding and improving predictions of the Earth system on all time scales, efforts are now underway to develop forecasting models into Earth-system models (ESMs), capable of representing all components of the coupled Earth system (or their aggregated behavior) and their response to external changes. Modeling the Earth system is a much more difficult problem than weather forecasting, not least because the model must represent the alternate (e.g., future) coupled states of the system for which there are no historical observations. Given that the physical principles that enable predictions about the response of the Earth system are often not explicitly coded in these ML-based models, demonstrating the credibility of ML-based ESMs thus requires us to build evidence of their consistency with the physical system. To this end, this paper puts forward five recommendations to enhance comprehensive, standardized, and independent evaluation of ML-based ESMs to strengthen their credibility and promote their wider use.

new TBBC: Predict True Bacteraemia in Blood Cultures via Deep Learning

Authors: Kira Sam

Abstract: Bacteraemia, a bloodstream infection with high morbidity and mortality rates, poses significant diagnostic challenges. Accurate diagnosis through blood cultures is resource-intensive. Developing a machine learning model to predict blood culture outcomes in emergency departments offers potential for improved diagnosis, reduced healthcare costs, and mitigated antibiotic use. This thesis aims to identify optimal machine learning techniques for predicting bacteraemia and develop a predictive model using data from St. Antonius Hospital's emergency department. Based on current literature, CatBoost and Random Forest were selected as the most promising techniques. Model optimization using Optuna prioritized sensitivity. The final Random Forest model achieved an ROC AUC of 0.78 and demonstrated 0.92 sensitivity on the test set. Notably, it accurately identified 36.02% of patients at low risk of bacteraemia, with only 0.85% false negatives. Implementation of this model in St. Antonius Hospital's emergency department could reduce blood cultures, healthcare costs, and antibiotic treatments. Future studies should focus on external validation, exploring advanced techniques, and addressing potential confounders to ensure model generalizability.

new EnergyPlus Room Simulator

Authors: Manuel Weber, Philipp Bogdain, Sophia Viktoria Weißenberger, Diana Marjanovic, Katharina Sammet, Jan Vellmer, Farzan Banihashemi, Peter Mandl

Abstract: Research towards energy optimization in buildings heavily relies on building-related data such as measured indoor climate factors. While data collection is a labor- and cost-intensive task, simulations are a cheap alternative to generate datasets of arbitrary sizes, particularly useful for data-intensive deep learning methods. In this paper, we present the tool EnergyPlus Room Simulator, which enables the simulation of indoor climate in a specific room of a building using the simulation software EnergyPlus. It allows users to alter room models and simulate various factors such as temperature, humidity, and CO2 concentration. In contrast to manually working with EnergyPlus, this tool enhances the simulation process by offering a convenient interface, including a user-friendly graphical user interface (GUI) as well as a REST API. The tool is intended to support scientific, building-related tasks such as occupancy detection on a room level by facilitating fast access to simulation data that may, for instance, be used for pre-training machine learning models.

new Air Quality Prediction with Physics-Informed Dual Neural ODEs in Open Systems

Authors: Jindong Tian, Yuxuan Liang, Ronghui Xu, Peng Chen, Chenjuan Guo, Aoying Zhou, Lujia Pan, Zhongwen Rao, Bin Yang

Abstract: Air pollution significantly threatens human health and ecosystems, necessitating effective air quality prediction to inform public policy. Traditional approaches are generally categorized into physics-based and data-driven models. Physics-based models usually struggle with high computational demands and closed-system assumptions, while data-driven models may overlook essential physical dynamics, confusing the capturing of spatiotemporal correlations. Although some physics-informed approaches combine the strengths of both models, they often face a mismatch between explicit physical equations and implicit learned representations. To address these challenges, we propose Air-DualODE, a novel physics-informed approach that integrates dual branches of Neural ODEs for air quality prediction. The first branch applies open-system physical equations to capture spatiotemporal dependencies for learning physics dynamics, while the second branch identifies the dependencies not addressed by the first in a fully data-driven way. These dual representations are temporally aligned and fused to enhance prediction accuracy. Our experimental results demonstrate that Air-DualODE achieves state-of-the-art performance in predicting pollutant concentrations across various spatial scales, thereby offering a promising solution for real-world air quality challenges.

new A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection

Authors: Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor

Abstract: This review paper explores recent advances in deep learning approaches for non-invasive cognitive impairment detection. We examine various non-invasive indicators of cognitive decline, including speech and language, facial, and motoric mobility. The paper provides an overview of relevant datasets, feature-extracting techniques, and deep-learning architectures applied to this domain. We have analyzed the performance of different methods across modalities and observed that speech and language-based methods generally achieved the highest detection performance. Studies combining acoustic and linguistic features tended to outperform those using a single modality. Facial analysis methods showed promise for visual modalities but were less extensively studied. Most papers focused on binary classification (impaired vs. non-impaired), with fewer addressing multi-class or regression tasks. Transfer learning and pre-trained language models emerged as popular and effective techniques, especially for linguistic analysis. Despite significant progress, several challenges remain, including data standardization and accessibility, model explainability, longitudinal analysis limitations, and clinical adaptation. Lastly, we propose future research directions, such as investigating language-agnostic speech analysis methods, developing multi-modal diagnostic systems, and addressing ethical considerations in AI-assisted healthcare. By synthesizing current trends and identifying key obstacles, this review aims to guide further development of deep learning-based cognitive impairment detection systems to improve early diagnosis and ultimately patient outcomes.

new Simmering: Sufficient is better than optimal for training neural networks

Authors: Irina Babayan, Hazhir Aliahmadi, Greg van Anders

Abstract: The broad range of neural network training techniques that invoke optimization but rely on ad hoc modification for validity suggests that optimization-based training is misguided. Shortcomings of optimization-based training are brought into particularly strong relief by the problem of overfitting, where naive optimization produces spurious outcomes. The broad success of neural networks for modelling physical processes has prompted advances that are based on inverting the direction of investigation and treating neural networks as if they were physical systems in their own right. These successes raise the question of whether broader, physical perspectives could motivate the construction of improved training algorithms. Here, we introduce simmering, a physics-based method that trains neural networks to generate weights and biases that are merely "good enough", but which, paradoxically, outperforms leading optimization-based approaches. Using classification and regression examples we show that simmering corrects neural networks that are overfit by Adam, and show that simmering avoids overfitting if deployed from the outset. Our results question optimization as a paradigm for neural network training, and leverage information-geometric arguments to point to the existence of classes of sufficient training algorithms that do not take optimization as their starting point.

new Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces

Authors: Avik Kar, Rahul Singh

Abstract: We study infinite-horizon average-reward reinforcement learning (RL) for Lipschitz MDPs and develop an algorithm, ZoRL, that discretizes the state-action space adaptively and zooms into promising regions of the state-action space. We show that its regret can be bounded as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = 2d_\mathcal{S} + d_z + 3$, $d_\mathcal{S}$ is the dimension of the state space, and $d_z$ is the zooming dimension. $d_z$ is a problem-dependent quantity, which allows us to conclude that if the MDP is benign, then its regret will be small. We note that the existing notion of zooming dimension for average-reward RL is defined in terms of policy coverings, and hence it can be huge when the policy class is rich even though the underlying MDP is simple, so that the regret upper bound is nearly $O(T)$. The zooming dimension proposed in the current work is bounded above by $d$, the dimension of the state-action space, and hence is truly adaptive, i.e., shows how to capture adaptivity gains for infinite-horizon average-reward RL. ZoRL outperforms other state-of-the-art algorithms in experiments, thereby demonstrating the gains arising due to adaptivity.
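To see how the stated bound behaves, the short sketch below evaluates the effective dimension and the resulting exponent of $T$ directly from the formulas in the abstract. The function names are hypothetical, and the example dimensions are chosen only for illustration; logarithmic factors hidden by the tilde notation are ignored.

```python
def effective_dimension(d_state: float, d_zoom: float) -> float:
    """d_eff = 2 * d_S + d_z + 3, as stated in the abstract."""
    return 2 * d_state + d_zoom + 3


def regret_exponent(d_state: float, d_zoom: float) -> float:
    """Exponent of T in the regret bound: 1 - 1 / d_eff."""
    return 1 - 1 / effective_dimension(d_state, d_zoom)
```

For instance, a one-dimensional state space with zooming dimension 2 gives an effective dimension of 7 and an exponent of 6/7, while a smaller zooming dimension lowers the exponent and so tightens the bound.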

new Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting

Authors: Mohamed Salim Aissi, Clement Romac, Thomas Carta, Sylvain Lamprier, Pierre-Yves Oudeyer, Olivier Sigaud, Laure Soulier, Nicolas Thome

Abstract: Reinforcement learning (RL) is a promising approach for aligning the knowledge of large language models (LLMs) with sequential decision-making tasks. However, few studies have thoroughly investigated how fine-tuning LLM agents with RL in a specific environment affects their capabilities. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. In addition, we analyze the source of this sensitivity by examining the model's internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs.

new Disentangling Genotype and Environment Specific Latent Features for Improved Trait Prediction using a Compositional Autoencoder

Authors: Anirudha Powadi, Talukder Zaki Jubery, Michael C. Tross, James C. Schnable, Baskar Ganapathysubramanian

Abstract: This study introduces a compositional autoencoder (CAE) framework designed to disentangle the complex interplay between genotypic and environmental factors in high-dimensional phenotype data to improve trait prediction in plant breeding and genetics programs. Traditional predictive methods, which use compact representations of high-dimensional data through handcrafted features or latent features from PCA or, more recently, autoencoders, do not separate genotype-specific and environment-specific factors. We hypothesize that disentangling these features into genotype-specific and environment-specific components can enhance predictive models. To test this, we developed a CAE that decomposes high-dimensional data into distinct genotype-specific and environment-specific latent features. Our CAE framework employs a hierarchical architecture within an autoencoder to effectively separate these entangled latent features. Applied to a maize diversity panel dataset, the CAE demonstrates superior modeling of environmental influences and 5-10 times improved predictive performance for key traits like Days to Pollen and Yield, compared to traditional methods, including standard autoencoders, PCA with regression, and Partial Least Squares Regression (PLSR). By disentangling latent features, the CAE provides a powerful tool for precision breeding and genetic research. This work significantly enhances trait prediction models, advancing agricultural and biological sciences.

new Prediction of Final Phosphorus Content of Steel in a Scrap-Based Electric Arc Furnace Using Artificial Neural Networks

Authors: Riadh Azzaz, Valentin Hurel, Patrice Menard, Mohammad Jahazi, Samira Ebrahimi Kahou, Elmira Moosavi-Khoonsari

Abstract: The scrap-based electric arc furnace process is expected to capture a significant share of the steel market in the future due to its potential for reducing environmental impacts through steel recycling. However, managing impurities, particularly phosphorus, remains a challenge. This study aims to develop a machine learning model to estimate the steel phosphorus content at the end of the process based on input parameters. Data were collected over two years from a steel plant, focusing on the chemical composition and weight of the scrap, the volume of oxygen injected, and process duration. After preprocessing the data, several machine learning models were evaluated, with the artificial neural network (ANN) emerging as the most effective. The best ANN model included four hidden layers. The model was trained for 500 epochs with a batch size of 50. The best model achieves a mean square error (MSE) of 0.000016, a root-mean-square error (RMSE) of 0.0049998, a coefficient of determination (R2) of 99.96%, and a correlation coefficient (r) of 99.98%. Notably, the model achieved a 100% hit rate for predicting phosphorus content within ±0.001 wt% (±10 ppm). These results demonstrate that the optimized ANN model offers accurate predictions for the steel final phosphorus content.

new Provable optimal transport with transformers: The essence of depth and prompt engineering

Authors: Hadi Daneshmand

Abstract: Can we establish provable performance guarantees for transformers? Establishing such theoretical guarantees is a milestone in developing trustworthy generative AI. In this paper, we take a step toward addressing this question by focusing on optimal transport, a fundamental problem at the intersection of combinatorial and continuous optimization. Leveraging the computational power of attention layers, we prove that a transformer with fixed parameters can effectively solve the optimal transport problem in Wasserstein-2 with entropic regularization for an arbitrary number of points. Consequently, the transformer can sort lists of arbitrary sizes up to an approximation factor. Our results rely on an engineered prompt that enables the transformer to implement gradient descent with adaptive stepsizes on the dual optimal transport. Combining the convergence analysis of gradient descent with Sinkhorn dynamics, we establish an explicit approximation bound for optimal transport with transformers, which improves as depth increases. Our findings provide novel insights into the essence of prompt engineering and depth for solving optimal transport. In particular, prompt engineering boosts the algorithmic expressivity of transformers, allowing them to implement an optimization method. With increasing depth, transformers can simulate several iterations of gradient descent.
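For readers unfamiliar with the underlying problem, the sketch below solves entropically regularized optimal transport with the standard Sinkhorn iterations and shows why a near-permutation transport plan amounts to approximate sorting. It is not the transformer construction from the paper; the function name, the regularization strength `eps`, and the toy points are assumptions made for illustration.

```python
import numpy as np


def sinkhorn_plan(x, y, eps=0.05, n_iters=500):
    """Entropic OT plan between uniform measures on 1-D point sets x and y."""
    cost = (x[:, None] - y[None, :]) ** 2        # squared-distance (Wasserstein-2) cost
    kernel = np.exp(-cost / eps)                 # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))            # uniform source weights
    b = np.full(len(y), 1.0 / len(y))            # uniform target weights
    u = np.ones(len(x))
    v = np.ones(len(y))
    for _ in range(n_iters):
        u = a / (kernel @ v)                     # rescale to match row marginals
        v = b / (kernel.T @ u)                   # rescale to match column marginals
    return u[:, None] * kernel * v[None, :]


# Transporting an unsorted list onto an increasing grid: the largest entry in
# each row of the plan indicates the approximate rank of that list element.
x = np.array([0.9, 0.1, 0.5])
y = np.array([0.0, 0.5, 1.0])
plan = sinkhorn_plan(x, y)
ranks = plan.argmax(axis=1)
print(ranks)
```

Smaller values of `eps` make the plan closer to a hard permutation, at the cost of slower and numerically more delicate iterations.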

new Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Authors: Xiyue Peng, Hengquan Guo, Jiawei Zhang, Dongqing Zou, Ziyu Shao, Honghao Wei, Xin Liu

Abstract: Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives, training separate preference models for helpfulness and safety, while framing safety as a constraint within a constrained Markov Decision Process (CMDP) framework. However, these methods can lead to "safety interference", where average-based safety constraints compromise the safety of some prompts in favor of others. To address this issue, we propose Rectified Policy Optimization (RePO), which replaces the average safety constraint with stricter (per prompt) safety constraints. At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt, thereby enhancing safety across nearly all prompts. Our experiments on Alpaca-7B demonstrate that RePO improves the safety alignment and reduces the safety interference compared to baseline methods. Code is available at https://github.com/pxyWaterMoon/RePO.

URLs: https://github.com/pxyWaterMoon/RePO
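The difference between an average-based safety constraint and a per-prompt (rectified) one can be seen in a two-prompt toy example. This is only a sketch of the idea described in the abstract, not the RePO implementation; the function names and the sign convention (a cost above 0 means an unsafe response) are assumptions.

```python
def average_violation(costs):
    """Average-based constraint: violated only if the mean safety cost exceeds 0."""
    return max(sum(costs) / len(costs), 0.0)


def rectified_violation(costs):
    """Per-prompt constraint: each violation is clipped at 0 before averaging."""
    return sum(max(c, 0.0) for c in costs) / len(costs)


# Hypothetical safety costs: one very safe prompt and one unsafe prompt.
costs = [-2.0, 1.0]
print(average_violation(costs))    # the safe prompt masks the unsafe one
print(rectified_violation(costs))  # the unsafe prompt is still penalized
```

Under the average-based constraint the very safe prompt offsets the unsafe one, which is the "safety interference" the abstract refers to; the rectified version keeps a nonzero penalty.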

new Privacy without Noisy Gradients: Slicing Mechanism for Generative Model Training

Authors: Kristjan Greenewald, Yuancheng Yu, Hao Wang, Kai Xu

Abstract: Training generative models with differential privacy (DP) typically involves injecting noise into gradient updates or adapting the discriminator's training procedure. As a result, such approaches often struggle with hyper-parameter tuning and convergence. We consider the slicing privacy mechanism that injects noise into random low-dimensional projections of the private data, and provide strong privacy guarantees for it. These noisy projections are used for training generative models. To enable optimizing generative models using this DP approach, we introduce the smoothed-sliced $f$-divergence and show it enjoys statistical consistency. Moreover, we present a kernel-based estimator for this divergence, circumventing the need for adversarial training. Extensive numerical experiments demonstrate that our approach can generate synthetic data of higher quality compared with baselines. Beyond performance improvement, our method, by sidestepping the need for noisy gradients, offers data scientists the flexibility to adjust generator architecture and hyper-parameters, run the optimization over any number of epochs, and even restart the optimization process -- all without incurring additional privacy costs.

new DualMAR: Medical-Augmented Representation from Dual-Expertise Perspectives

Authors: Pengfei Hu, Chang Lu, Fei Wang, Yue Ning

Abstract: Electronic Health Records (EHR) have revolutionized healthcare data management and prediction in the field of AI and machine learning. Accurate predictions of diagnoses and medications significantly mitigate health risks and provide guidance for preventive care. However, EHR-driven models often have a limited understanding of medical-domain knowledge and mostly rely on a single, simple ontology. In addition, due to the missing features and incomplete disease coverage of EHR, most studies focus only on basic analysis of conditions and medication. We propose DualMAR, a framework that enhances EHR prediction tasks through both individual observation data and public knowledge bases. First, we construct a bi-hierarchical Diagnosis Knowledge Graph (KG) using verified public clinical ontologies and augment this KG via Large Language Models (LLMs). Second, we design a new proxy-task learning scheme on lab results in EHR for pretraining, which further enhances KG representations and patient embeddings. By retrieving radial and angular coordinates in polar space, DualMAR enables accurate predictions based on rich hierarchical and semantic embeddings from the KG. Experiments also demonstrate that DualMAR outperforms state-of-the-art models, validating its effectiveness in EHR prediction and KG integration in medical domains.

new Understanding Adam Requires Better Rotation Dependent Assumptions

Authors: Lucas Maes, Tianyue H. Zhang, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Damien Scieur, Simon Lacoste-Julien, Charles Guille-Escuret

Abstract: Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We demonstrate that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature, evaluating their adequacy in explaining Adam's behavior across various rotation types. This work highlights the need for new, rotation-dependent theoretical frameworks to fully understand Adam's empirical success in modern machine learning tasks.

new Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements

Authors: Silvia Terragni, Hoang Cuong, Joachim Daiber, Pallavi Gudipati, Pablo N. Mendes

Abstract: Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators. However, there is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases. In this paper, we assess several LLMs and Multimodal Language Models (MLLMs) in terms of their alignment with human judgments across multiple multimodal search scenarios. Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context. Interestingly, in smaller models, the inclusion of a visual component may hinder performance rather than enhance it. These findings highlight the complexities involved in selecting the most appropriate model for practical applications.

new Global Graph Counterfactual Explanation: A Subgraph Mapping Approach

Authors: Yinhan He, Wendy Zheng, Yaochen Zhu, Jing Ma, Saumitra Mishra, Natraj Raman, Ninghao Liu, Jundong Li

Abstract: Graph Neural Networks (GNNs) have been widely deployed in various real-world applications. However, most GNNs are black-box models that lack explanations. One strategy to explain GNNs is through counterfactual explanation, which aims to find minimum perturbations on input graphs that change the GNN predictions. Existing works on GNN counterfactual explanations primarily concentrate on the local-level perspective (i.e., generating counterfactuals for each individual graph), which suffers from information overload and lacks insights into the broader cross-graph relationships. To address such issues, we propose GlobalGCE, a novel global-level graph counterfactual explanation method. GlobalGCE aims to identify a collection of subgraph mapping rules as counterfactual explanations for the target GNN. According to these rules, substituting certain significant subgraphs with their counterfactual subgraphs will change the GNN prediction to the desired class for most graphs (i.e., maximum coverage). Methodologically, we design a significant subgraph generator and a counterfactual subgraph autoencoder in our GlobalGCE, where the subgraphs and the rules can be effectively generated. Extensive experiments demonstrate the superiority of our GlobalGCE compared to existing baselines. Our code can be found at https://anonymous.4open.science/r/GlobalGCE-92E8.

URLs: https://anonymous.4open.science/r/GlobalGCE-92E8

new SAD: State-Action Distillation for In-Context Reinforcement Learning under Random Policies

Authors: Weiqin Chen, Santiago Paternain

Abstract: Pretrained foundation models (FMs) have exhibited extraordinary in-context learning performance, allowing zero-shot generalization to new tasks not encountered during pretraining. In the case of RL, in-context RL (ICRL) emerges when pretraining FMs on decision-making problems in an autoregressive-supervised manner. Nevertheless, current state-of-the-art ICRL algorithms, such as AD, DPT and DIT, impose stringent requirements on generating the pretraining dataset concerning the behavior (source) policies, context information, action labels, etc. Notably, these algorithms either demand optimal policies or require varying degrees of well-trained behavior policies for all environments during the generation of the pretraining dataset. This significantly hinders the application of ICRL to real-world scenarios, where acquiring optimal or well-trained policies for a substantial volume of real-world training environments can be both intractable and prohibitively expensive. To overcome this challenge, we introduce a novel approach, termed State-Action Distillation (SAD), that enables generating an effective pretraining dataset guided solely by random policies. In particular, SAD selects query states and corresponding action labels by distilling the outstanding state-action pairs from the entire state and action spaces using random policies within a trust horizon, and then inherits the classical autoregressive-supervised mechanism during pretraining. To the best of our knowledge, this is the first work that enables promising ICRL under (e.g., uniform) random policies and random contexts. We establish theoretical analyses regarding the performance guarantees of SAD. Moreover, our empirical results across multiple ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 180.86% in the offline evaluation and by 172.8% in the online evaluation.

new Resolving Domain Shift For Representations Of Speech In Non-Invasive Brain Recordings

Authors: Jeremiah Ridge, Oiwi Parker Jones

Abstract: Machine learning techniques have enabled researchers to leverage neuroimaging data to decode speech from brain activity, with some amazing recent successes achieved by applications built using invasive devices. However, research requiring surgical implants has a number of practical limitations. Non-invasive neuroimaging techniques provide an alternative but come with their own set of challenges, the limited scale of individual studies being among them. Without the ability to pool the recordings from different non-invasive studies, data on the order of magnitude needed to leverage deep learning techniques to their full potential remains out of reach. In this work, we focus on non-invasive data collected using magnetoencephalography (MEG). We leverage two different, leading speech decoding models to investigate how an adversarial domain adaptation framework augments their ability to generalize across datasets. We successfully improve the performance of both models when training across multiple datasets. To the best of our knowledge, this study is the first ever application of feature-level, deep learning based harmonization for MEG neuroimaging data. Our analysis additionally offers further evidence of the impact of demographic features on neuroimaging data, demonstrating that participant age strongly affects how machine learning models solve speech decoding tasks using MEG data. Lastly, in the course of this study we produce a new open-source implementation of one of these models to the benefit of the broader scientific community.

new Residual Random Neural Networks

Authors: M. Andrecut

Abstract: The single-layer feedforward neural network with random weights is a recurring motif in the neural networks literature. The advantage of these networks is their simplified training, which reduces to solving a ridge-regression problem. However, a general assumption is that these networks require a large number of hidden neurons relative to the dimensionality of the data samples, in order to achieve good classification accuracy. Contrary to this assumption, here we show that one can obtain good classification results even if the number of hidden neurons has the same order of magnitude as the dimensionality of the data samples, if this dimensionality is reasonably high. We also develop an efficient iterative residual training method for such random neural networks, which significantly improves their classification accuracy. Moreover, we also describe an encryption (obfuscation) method which can be used to protect both the data and the neural network model.
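The basic building block described here, a single hidden layer with fixed random weights whose output layer is fit by ridge regression, can be sketched in a few lines. This shows only the standard baseline, not the paper's iterative residual training or its encryption method; the function names, the `tanh` activation, the ridge strength, and the synthetic two-class data are assumptions for illustration.

```python
import numpy as np


def fit_random_network(X, Y, n_hidden, ridge=1e-2, seed=0):
    """Fix random hidden weights W, then solve ridge regression for the output weights B."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden)) / np.sqrt(X.shape[1])
    H = np.tanh(X @ W)                                   # random hidden features
    B = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ Y)
    return W, B


def predict(X, W, B):
    """Class prediction: index of the largest output unit."""
    return np.argmax(np.tanh(X @ W) @ B, axis=1)


# Synthetic two-class data in d = 20 dimensions (hypothetical, for illustration).
rng = np.random.default_rng(1)
d = 20
X = np.vstack([rng.standard_normal((100, d)) + 1.0,
               rng.standard_normal((100, d)) - 1.0])
labels = np.repeat([0, 1], 100)
Y = np.eye(2)[labels]                                    # one-hot targets

# The number of hidden neurons matches the data dimensionality.
W, B = fit_random_network(X, Y, n_hidden=d)
accuracy = np.mean(predict(X, W, B) == labels)
print(f"training accuracy: {accuracy:.2f}")
```

Only `B` is learned; `W` is never updated, which is why training reduces to a single linear solve.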

new Enhancing Battery Storage Energy Arbitrage with Deep Reinforcement Learning and Time-Series Forecasting

Authors: Manuel Sage, Joshua Campbell, Yaoyao Fiona Zhao

Abstract: Energy arbitrage is one of the most profitable sources of income for battery operators, generating revenues by buying and selling electricity at different prices. Forecasting these revenues is challenging due to the inherent uncertainty of electricity prices. Deep reinforcement learning (DRL) emerged in recent years as a promising tool, able to cope with uncertainty by training on large quantities of historical data. However, without access to future electricity prices, DRL agents can only react to the currently observed price and cannot learn to plan battery dispatch. Therefore, in this study, we combine DRL with time-series forecasting methods from deep learning to enhance the performance on energy arbitrage. We conduct a case study using price data from Alberta, Canada, which is characterized by irregular price spikes and is highly non-stationary. This data is challenging to forecast even when state-of-the-art deep learning models consisting of convolutional layers, recurrent layers, and attention modules are deployed. Our results show that energy arbitrage with DRL-enabled battery control still significantly benefits from these imperfect predictions, but only if predictors for several horizons are combined. When multiple predictions for the next 24-hour window are grouped, accumulated rewards increase by 60% for deep Q-networks (DQN) compared to the experiments without forecasts. We hypothesize that multiple predictors, despite their imperfections, convey useful information regarding the future development of electricity prices through a "majority vote" principle, enabling the DRL agent to learn more profitable control policies.

new Off-Policy Selection for Initiating Human-Centric Experimental Design

Authors: Ge Gao, Xi Yang, Qitong Gao, Song Ju, Miroslav Pajic, Min Chi

Abstract: In human-centric tasks such as healthcare and education, the heterogeneity among patients and students necessitates personalized treatments and instructional interventions. While reinforcement learning (RL) has been utilized in those tasks, off-policy selection (OPS) is pivotal to close the loop by evaluating and selecting policies offline without online interactions, yet current OPS methods often overlook the heterogeneity among participants. Our work is centered on resolving a pivotal challenge in human-centric systems (HCSs): how to select a policy to deploy when a new participant joins the cohort, without having access to any prior offline data collected from the participant? We introduce First-Glance Off-Policy Selection (FPS), a novel approach that systematically addresses participant heterogeneity through sub-group segmentation and OPS criteria tailored to each sub-group. By grouping individuals with similar traits, FPS facilitates personalized policy selection aligned with the unique characteristics of each participant or group of participants. FPS is evaluated on two important but challenging applications: intelligent tutoring systems and a healthcare application for sepsis treatment and intervention. FPS presents a significant advancement in enhancing the learning outcomes of students and in-hospital care outcomes.

new Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors

Authors: Wenqiang Chen, Jiaxuan Cheng, Leyao Wang, Wei Zhao, Wojciech Matusik

Abstract: Visual Question-Answering, a technology that generates textual responses from an image and natural language question, has progressed significantly. Notably, it can aid in tracking and inquiring about daily activities, which is crucial in healthcare monitoring, especially for elderly patients or those with memory disabilities. However, video poses privacy concerns and has a limited field of view. This paper presents Sensor2Text, a model proficient in tracking daily activities and engaging in conversations using wearable sensors. The approach outlined here tackles several challenges, including the low information density of wearable sensor data, the insufficiency of single wearable sensors for human activity recognition, and the model's limited capacity for Question-Answering and interactive conversations. To resolve these obstacles, transfer learning and student-teacher networks are utilized to leverage knowledge from visual-language models. Additionally, an encoder-decoder neural network model is devised to jointly process language and sensor data for conversational purposes. Furthermore, Large Language Models are utilized to enable interactive capabilities. The model showcases the ability to identify human activities and engage in Q&A dialogues using various wearable sensor modalities. It performs comparably to or better than existing visual-language models in both captioning and conversational tasks. To our knowledge, this represents the first model capable of conversing about wearable sensor data, offering an innovative approach to daily activity tracking that addresses privacy and field-of-view limitations associated with current vision-based solutions.

new Training the Untrainable: Introducing Inductive Bias via Representational Alignment

Authors: Vighnesh Subramaniam, David Mayo, Colin Conwell, Tomaso Poggio, Boris Katz, Brian Cheung, Andrei Barbu

Abstract: We demonstrate that architectures which traditionally are considered to be ill-suited for a task can be trained using inductive biases from another architecture. Networks are considered untrainable when they overfit, underfit, or converge to poor results even when tuning their hyperparameters. For example, plain fully connected networks overfit on object recognition while deep convolutional networks without residual connections underfit. The traditional answer is to change the architecture to impose some inductive bias, although what that bias is remains unknown. We introduce guidance, where a guide network guides a target network using a neural distance function. The target is optimized to perform well and to match its internal representations, layer-by-layer, to those of the guide; the guide is unchanged. If the guide is trained, this transfers over part of the architectural prior and knowledge of the guide to the target. If the guide is untrained, this transfers over only part of the architectural prior of the guide. In this manner, we can investigate what kinds of priors different architectures place on untrainable networks such as fully connected networks. We demonstrate that this method overcomes the immediate overfitting of fully connected networks on vision tasks, makes plain CNNs competitive to ResNets, closes much of the gap between plain vanilla RNNs and Transformers, and can even help Transformers learn tasks which RNNs can perform more easily. We also discover evidence that better initializations of fully connected networks likely exist to avoid overfitting. Our method provides a mathematical tool to investigate priors and architectures, and in the long term, may demystify the dark art of architecture creation, even perhaps turning architectures into a continuous optimizable parameter of the network.

new Revisiting PlayeRank

Authors: Louise Schmidt, Cristian Lillo, Javier Bustos

Abstract: In this article we revise the football performance score called PlayeRank, designed and evaluated by Pappalardo et al. in 2019. First, we analyze the weights extracted from the Linear Support Vector Machine (SVM) that solves the classification problem of "which set of events has a higher impact on the chances of winning a match". Here, we notice that the previously published results include the Goal-Scored event during the training phase, which produces inconsistencies. We fix these inconsistencies, and show new weights capable of solving the same problem. Following the intuition that the best team should always win a match, we define the team's quality as the average number of players involved in the game. We show that, using the original PlayeRank, in 94.13% of the matches either the superior team beats the inferior team or the teams end tied if the scores are similar. Finally, we present a way to use PlayeRank in an online fashion using modified free analysis tools. Calculating this modified version of PlayeRank, we performed an online analysis of a real football match every five minutes of play. Here, we evaluate the usefulness of that information with experts and managers, and conclude that the obtained data indeed provides useful information that was not previously available to the manager during the match.

new Annotation Efficiency: Identifying Hard Samples via Blocked Sparse Linear Bandits

Authors: Adit Jain, Soumyabrata Pal, Sunav Choudhary, Ramasuri Narayanam, Vikram Krishnamurthy

Abstract: This paper considers the problem of annotating datapoints using an expert with only a few annotation rounds in a label-scarce setting. We propose soliciting reliable feedback from the expert on the difficulty of annotating a datapoint, in addition to the ground-truth label. Existing literature in active learning or coreset selection turns out to be less relevant to our setting since it presumes the existence of a reliable trained model, which is absent in the label-scarce regime. However, the literature on coreset selection emphasizes the presence of difficult data points in the training set to perform supervised learning in downstream tasks (Mindermann et al., 2022). Therefore, for a given fixed annotation budget of $\mathsf{T}$ rounds, we model the sequential decision-making problem of which (difficult) datapoints to choose for annotation in a sparse linear bandits framework with the constraint that no arm can be pulled more than once (blocking constraint). With mild assumptions on the datapoints, our (computationally efficient) Explore-Then-Commit algorithm BSLB achieves a regret guarantee of $\widetilde{\mathsf{O}}(k^{\frac{1}{3}} \mathsf{T}^{\frac{2}{3}} +k^{-\frac{1}{2}} \beta_k + k^{-\frac{1}{12}} \beta_k^{\frac{1}{2}}\mathsf{T}^{\frac{5}{6}})$ where the unknown parameter vector has tail magnitude $\beta_k$ at sparsity level $k$. To this end, we show offline statistical guarantees for the Lasso estimator under a mild Restricted Eigenvalue (RE) condition that is also robust to sparsity. Finally, we propose a meta-algorithm C-BSLB that does not need knowledge of the optimal sparsity parameters at a no-regret cost. We demonstrate the efficacy of our BSLB algorithm for annotation in the label-scarce setting for an image classification task on the PASCAL-VOC dataset, where we use real-world annotation difficulty scores.

new Evaluating Neural Networks for Early Maritime Threat Detection

Authors: Dhanush Tella, Chandra Teja Tiriveedhi, Naphtali Rishe, Dan E. Tamir, Jonathan I. Tamir

Abstract: We consider the task of classifying trajectories of boat activities as a proxy for assessing maritime threats. Previous approaches have considered entropy-based metrics for clustering boat activity into three broad categories: random walk, following, and chasing. Here, we comprehensively assess the accuracy of neural network-based approaches as alternatives to entropy-based clustering. We train four neural network models and compare them to shallow learning using synthetic data. We also investigate the accuracy of models as time steps increase and with and without rotated data. To improve test-time robustness, we normalize trajectories and perform rotation-based data augmentation. Our results show that deep networks can achieve a test-set accuracy of up to 100% on a full trajectory, with graceful degradation as the number of time steps decreases, outperforming entropy-based clustering.
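The normalization and rotation-based augmentation mentioned above can be sketched for 2-D trajectories as follows. The abstract does not specify the exact normalization, so translating each trajectory to start at the origin is an assumption, as are the function names and the toy trajectory.

```python
import math


def normalize_trajectory(points):
    """Translate a trajectory so that it starts at the origin (assumed normalization)."""
    x0, y0 = points[0]
    return [(x - x0, y - y0) for x, y in points]


def rotate_trajectory(points, angle_rad):
    """Rotate every (x, y) point counter-clockwise about the origin."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def augment(points, n_rotations=4):
    """Return evenly spaced rotated copies of a normalized trajectory."""
    base = normalize_trajectory(points)
    return [rotate_trajectory(base, 2 * math.pi * k / n_rotations)
            for k in range(n_rotations)]


# Hypothetical boat trajectory heading east from (2, 3).
trajectory = [(2.0, 3.0), (3.0, 3.0), (4.0, 3.0)]
augmented = augment(trajectory)
print(len(augmented))
```

Training on such rotated copies exposes a classifier to the same motion pattern at several headings, which is one way to make it less sensitive to orientation at test time.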

new Mechanism learning: Reverse causal inference in the presence of multiple unknown confounding through front-door causal bootstrapping

Authors: Jianqiao Mao, Max A. Little

Abstract: A major limitation of machine learning (ML) prediction models is that they recover associational, rather than causal, predictive relationships between variables. In high-stakes automation applications of ML this is problematic, as the model often learns spurious, non-causal associations. This paper proposes mechanism learning, a simple method which uses front-door causal bootstrapping to deconfound observational data such that any appropriate ML model is forced to learn predictive relationships between effects and their causes (reverse causal inference), despite the potential presence of multiple unknown and unmeasured confounding. Effect variables can be very high dimensional, and the predictive relationship nonlinear, as is common in ML applications. This novel method is widely applicable; the only requirement is the existence of a mechanism variable mediating the cause (prediction target) and effect (feature data), which is independent of the (unmeasured) confounding variables. We test our method on fully synthetic, semi-synthetic and real-world datasets, demonstrating that it can discover reliable, unbiased, causal ML predictors, whereas the same ML predictor trained naively using classical supervised learning on the original observational data is heavily biased by spurious associations. We provide code online to implement the results in the paper.

new Deep Concept Identification for Generative Design

Authors: Ryo Tsumoto, Kentaro Yaji, Yutaka Nomaguchi, Kikuo Fujita

Abstract: A generative design based on topology optimization provides diverse alternatives as entities in a computational model with a high design degree. However, as the diversity of the generated alternatives increases, the cognitive burden on designers to select the most appropriate alternatives also increases. Whereas the concept identification approach, which finds various categories of entities, is an effective means to structure alternatives, evaluation of their similarities is challenging due to shape diversity. To address this challenge, this study proposes a concept identification framework for generative design using deep learning (DL) techniques. One of the key abilities of DL is the automatic learning of different representations of a specific task. Deep concept identification finds various categories that provide insights into the mapping relationships between geometric properties and structural performance through representation learning using DL. The proposed framework generates diverse alternatives using a generative design technique, clusters the alternatives into several categories using a DL technique, and arranges these categories for design practice using a classification model. This study demonstrates its fundamental capabilities by implementing variational deep embedding, a generative and clustering model based on the DL paradigm, and logistic regression as a classification model. A simplified design problem of a two-dimensional bridge structure is applied as a case study to validate the proposed framework. Although designers are required to determine the viewing aspect level by setting the number of concepts, this implementation presents the identified concepts and their relationships in the form of a decision tree based on a specified level.

new Understanding the Effect of GCN Convolutions in Regression Tasks

Authors: Juntong Chen, Johannes Schmidt-Hieber, Claire Donnat, Olga Klopp

Abstract: Graph Convolutional Networks (GCNs) have become a pivotal method in machine learning for modeling functions over graphs. Despite their widespread success across various applications, their statistical properties (e.g. consistency, convergence rates) remain ill-characterized. To begin addressing this knowledge gap, in this paper, we provide a formal analysis of the impact of convolution operators on regression tasks over homophilic networks. Focusing on estimators based solely on neighborhood aggregation, we examine how two common convolutions - the original GCN and GraphSage convolutions - affect the learning error as a function of the neighborhood topology and the number of convolutional layers. We explicitly characterize the bias-variance trade-off incurred by GCNs as a function of the neighborhood size and identify specific graph topologies where convolution operators are less effective. Our theoretical findings are corroborated by synthetic experiments and provide a first step toward a deeper quantitative understanding of convolutional effects in GCNs, with the aim of offering rigorous guidelines for practitioners.
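
As a rough illustration of what "estimators based solely on neighborhood aggregation" can look like, the sketch below applies one aggregation step to node responses on a small graph. The symmetric normalization follows the widely used GCN operator of Kipf and Welling; the mean-style aggregation is an assumed stand-in for a GraphSage-type convolution and may differ from the exact operators analyzed in the paper. Function names and the example graph are hypothetical.

```python
import math


def gcn_aggregate(adj, y):
    """One GCN-style step: sum_j y[j] / sqrt(d_i * d_j) over i and its neighbours.

    adj maps each node 0..n-1 to a list of neighbours (undirected, no self-loops).
    d_i is the degree after adding a self-loop, as in the usual GCN operator.
    """
    deg = {i: len(nbrs) + 1 for i, nbrs in adj.items()}
    return [
        sum(y[j] / math.sqrt(deg[i] * deg[j]) for j in [i] + list(adj[i]))
        for i in sorted(adj)
    ]


def mean_aggregate(adj, y):
    """Mean-style step (assumed GraphSage-like form): average over i and its neighbours."""
    return [
        sum(y[j] for j in [i] + list(adj[i])) / (len(adj[i]) + 1)
        for i in sorted(adj)
    ]
```

On a path graph such as `{0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}` with a constant signal, the mean-style step returns the constant unchanged, while the symmetric GCN normalization rescales it at nodes whose degree differs from their neighbours'. This is one simple way to see that the two operators can introduce different biases.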

new CGKN: A Deep Learning Framework for Modeling Complex Dynamical Systems and Efficient Data Assimilation

Authors: Chuanqi Chen, Nan Chen, Yinling Zhang, Jin-Long Wu

Abstract: Deep learning is widely used to predict complex dynamical systems in many scientific and engineering areas. However, the black-box nature of these deep learning models presents significant challenges for carrying out simultaneous data assimilation (DA), which is a crucial technique for state estimation, model identification, and reconstructing missing data. Integrating ensemble-based DA methods with nonlinear deep learning models is computationally expensive and may suffer from large sampling errors. To address these challenges, we introduce a deep learning framework designed to simultaneously provide accurate forecasts and efficient DA. It is named Conditional Gaussian Koopman Network (CGKN), which transforms general nonlinear systems into nonlinear neural differential equations with conditional Gaussian structures. CGKN aims to retain essential nonlinear components while applying systematic and minimal simplifications to facilitate the development of analytic formulae for nonlinear DA. This allows for seamless integration of DA performance into the deep learning training process, eliminating the need for empirical tuning as required in ensemble methods. CGKN compensates for structural simplifications by lifting the dimension of the system, which is motivated by Koopman theory. Nevertheless, CGKN exploits special nonlinear dynamics within the lifted space. This enables the model to capture extreme events and strong non-Gaussian features in joint and marginal distributions with appropriate uncertainty quantification. We demonstrate the effectiveness of CGKN for both prediction and DA on three strongly nonlinear and non-Gaussian turbulent systems: the projected stochastic Burgers--Sivashinsky equation, the Lorenz 96 system, and the El Niño-Southern Oscillation. The results justify the robustness and computational efficiency of CGKN.

new emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography

Authors: Viswanath Sivakumar, Jeffrey Seely, Alan Du, Sean R Bittner, Adam Berenzweig, Anuoluwapo Bolarinwa, Alexandre Gramfort, Michael I Mandel

Abstract: Surface electromyography (sEMG) non-invasively measures signals generated by muscle activity with sufficient sensitivity to detect individual spinal neurons and richness to identify dozens of gestures and their nuances. Wearable wrist-based sEMG sensors have the potential to offer low friction, subtle, information rich, always available human-computer inputs. To this end, we introduce emg2qwerty, a large-scale dataset of non-invasive electromyographic signals recorded at the wrists while touch typing on a QWERTY keyboard, together with ground-truth annotations and reproducible baselines. With 1,135 sessions spanning 108 users and 346 hours of recording, this is the largest such public dataset to date. These data demonstrate non-trivial, but well defined hierarchical relationships both in terms of the generative process, from neurons to muscles and muscle combinations, as well as in terms of domain shift across users and user sessions. Applying standard modeling techniques from the closely related field of Automatic Speech Recognition (ASR), we show strong baseline performance on predicting key-presses using sEMG signals alone. We believe the richness of this task and dataset will facilitate progress in several problems of interest to both the machine learning and neuroscientific communities. Dataset and code can be accessed at https://github.com/facebookresearch/emg2qwerty.

URLs: https://github.com/facebookresearch/emg2qwerty.

new Sample Efficient Bayesian Learning of Causal Graphs from Interventions

Authors: Zihan Zhou, Muhammad Qasim Elahi, Murat Kocaoglu

Abstract: Causal discovery is a fundamental problem with applications spanning various areas in science and engineering. It is well understood that solely using observational data, one can only orient the causal graph up to its Markov equivalence class, necessitating interventional data to learn the complete causal graph. Most works in the literature design causal discovery policies with perfect interventions, i.e., they have access to infinite interventional samples. This study considers a Bayesian approach for learning causal graphs with limited interventional samples, mirroring real-world scenarios where such samples are usually costly to obtain. By leveraging the recent result of Wienöbst et al. (2023) on uniform DAG sampling in polynomial time, we can efficiently enumerate all the cut configurations and their corresponding interventional distributions of a target set, and further track their posteriors. Given any number of interventional samples, our proposed algorithm randomly intervenes on a set of target vertices that cut all the edges in the graph and returns a causal graph according to the posterior of each target set. When the number of interventional samples is large enough, we show theoretically that our proposed algorithm will return the true causal graph with high probability. We compare our algorithm against various baseline methods on simulated datasets, demonstrating its superior accuracy measured by the structural Hamming distance between the learned DAG and the ground truth. Additionally, we present a case study showing how this algorithm could be modified to answer more general causal questions without learning the whole graph. As an example, we illustrate that our method can be used to estimate the causal effect of a variable that cannot be intervened.
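
The evaluation metric named above, the structural Hamming distance (SHD), can be sketched in a few lines. This is a minimal sketch under one common convention, assumed here, in which a missing edge, an extra edge, and a reversed edge each count as one error; the paper may use a different convention. The function name is hypothetical.

```python
def structural_hamming_distance(true_edges, learned_edges):
    """Count vertex pairs whose edge status differs between two DAGs.

    Each graph is an iterable of directed edges (parent, child) without
    self-loops. A missing, extra, or reversed edge counts as one error.
    """
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    # Work on unordered vertex pairs so that a flipped edge is counted once.
    pairs = {frozenset(edge) for edge in true_edges | learned_edges}
    distance = 0
    for pair in pairs:
        u, v = tuple(pair)
        candidates = ((u, v), (v, u))
        true_orientation = {e for e in candidates if e in true_edges}
        learned_orientation = {e for e in candidates if e in learned_edges}
        if true_orientation != learned_orientation:
            distance += 1
    return distance
```

For example, with true edges `{("A", "B"), ("B", "C"), ("A", "C")}` and learned edges `{("A", "B"), ("C", "B"), ("C", "D")}`, the function returns 3: one reversal (B-C), one missing edge (A-C), and one extra edge (C-D).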

new OGBench: Benchmarking Offline Goal-Conditioned RL

Authors: Seohong Park, Kevin Frans, Benjamin Eysenbach, Sergey Levine

Abstract: Offline goal-conditioned reinforcement learning (GCRL) is a major problem in reinforcement learning (RL) because it provides a simple, unsupervised, and domain-agnostic way to acquire diverse behaviors and representations from unlabeled data without rewards. Despite the importance of this setting, we lack a standard benchmark that can systematically evaluate the capabilities of offline GCRL algorithms. In this work, we propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL. OGBench consists of 8 types of environments, 85 datasets, and reference implementations of 6 representative offline GCRL algorithms. We have designed these challenging and realistic environments and datasets to directly probe different capabilities of algorithms, such as stitching, long-horizon reasoning, and the ability to handle high-dimensional inputs and stochasticity. While representative algorithms may rank similarly on prior benchmarks, our experiments reveal stark strengths and weaknesses in these different capabilities, providing a strong foundation for building new algorithms. Project page: https://seohong.me/projects/ogbench

URLs: https://seohong.me/projects/ogbench

new Self-Normalized Resets for Plasticity in Continual Learning

Authors: Vivek F. Farias, Adam D. Jozefiak

Abstract: Plasticity Loss is an increasingly important phenomenon that refers to the empirical observation that as a neural network is continually trained on a sequence of changing tasks, its ability to adapt to a new task diminishes over time. We introduce Self-Normalized Resets (SNR), a simple adaptive algorithm that mitigates plasticity loss by resetting a neuron's weights when evidence suggests its firing rate has effectively dropped to zero. Across a battery of continual learning problems and network architectures, we demonstrate that SNR consistently attains superior performance compared to its competitor algorithms. We also demonstrate that SNR is robust to its sole hyperparameter, its rejection percentile threshold, while competitor algorithms show significant sensitivity. SNR's threshold-based reset mechanism is motivated by a simple hypothesis test that we derive. Seen through the lens of this hypothesis test, competing reset proposals yield suboptimal error rates in correctly detecting inactive neurons, potentially explaining our experimental observations. We also conduct a theoretical investigation of the optimization landscape for the problem of learning a single ReLU. We show that even when initialized adversarially, an idealized version of SNR learns the target ReLU, while regularization-based approaches can fail to learn.

new Latent Neural Operator Pretraining for Solving Time-Dependent PDEs

Authors: Tian Wang, Chuang Wang

Abstract: Pretraining methods have recently gained increasing traction for solving PDEs with neural operators. They alleviate the data scarcity problem encountered by neural operator learning when solving a single PDE by training on large-scale datasets consisting of various PDEs and utilizing shared patterns among different PDEs to improve the solution precision. In this work, we propose the Latent Neural Operator Pretraining (LNOP) framework based on the Latent Neural Operator (LNO) backbone. We achieve universal transformation through pretraining on a hybrid time-dependent PDE dataset to extract representations of different physical systems and solve various time-dependent PDEs in the latent space through finetuning on a single PDE dataset. Our proposed LNOP framework reduces the solution error by 31.7% on four problems, and the reduction can be further improved to 57.1% after finetuning. On out-of-distribution datasets, our LNOP model achieves roughly 50% lower error and 3$\times$ data efficiency on average across different dataset sizes. These results show that our method is more competitive in terms of solution precision, transfer capability and data efficiency compared to non-pretrained neural operators.

new FedSSP: Federated Graph Learning with Spectral Knowledge and Personalized Preference

Authors: Zihan Tan, Guancheng Wan, Wenke Huang, Mang Ye

Abstract: Personalized Federated Graph Learning (pFGL) facilitates the decentralized training of Graph Neural Networks (GNNs) without compromising privacy while accommodating personalized requirements for non-IID participants. In cross-domain scenarios, structural heterogeneity poses significant challenges for pFGL. Nevertheless, previous pFGL methods incorrectly share non-generic knowledge globally and fail to tailor personalized solutions locally under domain structural shift. We innovatively reveal that the spectral nature of graphs can well reflect inherent domain structural shifts. Correspondingly, our method overcomes it by sharing generic spectral knowledge. Moreover, we indicate the biased message-passing schemes for graph structures and propose the personalized preference module. Combining both strategies, we propose our pFGL framework FedSSP, which Shares generic Spectral knowledge while satisfying graph Preferences. Furthermore, we perform extensive experiments on cross-dataset and cross-domain settings to demonstrate the superiority of our framework. The code is available at https://github.com/OakleyTan/FedSSP.

URLs: https://github.com/OakleyTan/FedSSP.

new Emergence of Globally Attracting Fixed Points in Deep Neural Networks With Nonlinear Activations

Authors: Amir Joudaki, Thomas Hofmann

Abstract: Understanding how neural networks transform input data across layers is fundamental to unraveling their learning and generalization capabilities. Although prior work has used insights from kernel methods to study neural networks, a global analysis of how the similarity between hidden representations evolves across layers remains underexplored. In this paper, we introduce a theoretical framework for the evolution of the kernel sequence, which measures the similarity between the hidden representations of two different inputs. Operating under the mean-field regime, we show that the kernel sequence evolves deterministically via a kernel map, which only depends on the activation function. By expanding the activation using Hermite polynomials and using their algebraic properties, we derive an explicit form for the kernel map and fully characterize its fixed points. Our analysis reveals that for nonlinear activations, the kernel sequence converges globally to a unique fixed point, which can correspond to orthogonal or similar representations depending on the activation and network architecture. We further extend our results to networks with residual connections and normalization layers, demonstrating similar convergence behaviors. This work provides new insights into the implicit biases of deep neural networks and how architectural choices influence the evolution of representations across layers.
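
To make the idea of a kernel sequence concrete, the sketch below iterates a kernel map for one specific activation. It uses the known closed form for ReLU (the normalized degree-1 arc-cosine kernel of Cho and Saul) instead of the Hermite expansion derived in the paper, and it assumes an infinitely wide plain feed-forward network with no residual connections or normalization layers. The function names are hypothetical.

```python
import math


def relu_kernel_map(rho):
    """Map the cosine similarity of two inputs to their similarity after one wide ReLU layer.

    Closed form of the normalized arc-cosine kernel of degree 1; rho lies in [-1, 1].
    """
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point drift outside [-1, 1]
    return (math.sqrt(1.0 - rho * rho) + (math.pi - math.acos(rho)) * rho) / math.pi


def kernel_sequence(rho0, depth):
    """Return [rho_0, rho_1, ..., rho_depth] obtained by iterating the kernel map."""
    sequence = [rho0]
    for _ in range(depth):
        sequence.append(relu_kernel_map(sequence[-1]))
    return sequence
```

Calling `kernel_sequence(0.0, 100)` yields a non-decreasing sequence that approaches the fixed point at 1, i.e. initially orthogonal inputs become increasingly similar with depth under this particular map. Other activations and architectures can have different fixed points, as the abstract notes.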

new GeoFUSE: A High-Efficiency Surrogate Model for Seawater Intrusion Prediction and Uncertainty Reduction

Authors: Su Jiang, Chuyang Liu, Dipankar Dwivedi

Abstract: Seawater intrusion into coastal aquifers poses a significant threat to groundwater resources, especially with rising sea levels due to climate change. Accurate modeling and uncertainty quantification of this process are crucial but are often hindered by the high computational costs of traditional numerical simulations. In this work, we develop GeoFUSE, a novel deep-learning-based surrogate framework that integrates the U-Net Fourier Neural Operator (U-FNO) with Principal Component Analysis (PCA) and Ensemble Smoother with Multiple Data Assimilation (ESMDA). GeoFUSE enables fast and efficient simulation of seawater intrusion while significantly reducing uncertainty in model predictions. We apply GeoFUSE to a 2D cross-section of the Beaver Creek tidal stream-floodplain system in Washington State. Using 1,500 geological realizations, we train the U-FNO surrogate model to approximate salinity distribution and accumulation. The U-FNO model successfully reduces the computational time from hours (using PFLOTRAN simulations) to seconds, achieving a speedup of approximately 360,000 times while maintaining high accuracy. By integrating measurement data from monitoring wells, the framework significantly reduces geological uncertainty and improves the predictive accuracy of the salinity distribution over a 20-year period. Our results demonstrate that GeoFUSE improves computational efficiency and provides a robust tool for real-time uncertainty quantification and decision making in groundwater management. Future work will extend GeoFUSE to 3D models and incorporate additional factors such as sea-level rise and extreme weather events, making it applicable to a broader range of coastal and subsurface flow systems.

new Analyzing Multi-Stage Loss Curve: Plateau and Descent Mechanisms in Neural Networks

Authors: Zheng-An Chen, Tao Luo, GuiHong Wang

Abstract: The multi-stage phenomenon in the training loss curves of neural networks has been widely observed, reflecting the non-linearity and complexity inherent in the training process. In this work, we investigate the training dynamics of neural networks (NNs), with particular emphasis on the small initialization regime and identify three distinct stages observed in the loss curve during training: initial plateau stage, initial descent stage, and secondary plateau stage. Through rigorous analysis, we reveal the underlying challenges causing slow training during the plateau stages. Building on existing work, we provide a more detailed proof for the initial plateau. This is followed by a comprehensive analysis of the dynamics in the descent stage. Furthermore, we explore the mechanisms that enable the network to overcome the prolonged secondary plateau stage, supported by both experimental evidence and heuristic reasoning. Finally, to better understand the relationship between global training trends and local parameter adjustments, we employ the Wasserstein distance to capture the microscopic evolution of weight amplitude distribution.
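
The Wasserstein distance mentioned at the end of the abstract has a particularly simple form for one-dimensional empirical distributions with the same number of samples: sort both samples and average the absolute differences. The sketch below implements that special case. Treating "weight amplitudes" as plain lists of non-negative numbers is an assumption made for illustration, and the function name is hypothetical.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size empirical distributions on the real line."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("both samples must be non-empty and of equal size")
    # In one dimension the optimal coupling matches the sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

A possible use is to compare amplitudes recorded at two training steps, for instance `wasserstein_1d(amplitudes_step_a, amplitudes_step_b)`, and to track how this value changes along the plateau and descent stages.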

new FedMABA: Towards Fair Federated Learning through Multi-Armed Bandits Allocation

Authors: Zhichao Wang, Lin Wang, Yongxin Guo, Ying-Jun Angela Zhang, Xiaoying Tang

Abstract: The increasing concern for data privacy has driven the rapid development of federated learning (FL), a privacy-preserving collaborative paradigm. However, the statistical heterogeneity among clients in FL results in inconsistent performance of the server model across various clients. The server model may show favoritism towards certain clients while performing poorly for others, heightening the challenge of fairness. In this paper, we reconsider the inconsistency in client performance distribution and introduce the concept of adversarial multi-armed bandit to optimize the proposed objective with explicit constraints on performance disparities. Practically, we propose a novel multi-armed bandit-based allocation FL algorithm (FedMABA) to mitigate performance unfairness among diverse clients with different data distributions. Extensive experiments, in different Non-I.I.D. scenarios, demonstrate the exceptional performance of FedMABA in enhancing fairness.

new GFlowNet Fine-tuning for Diverse Correct Solutions in Mathematical Reasoning Tasks

Authors: Ryoichi Takase, Masaya Tsunokake, Yuta Tsuchiya, Shota Inuzuka

Abstract: Mathematical reasoning problems are among the most challenging, as they typically require an understanding of fundamental laws to solve. The laws are universal, but the derivation of the final answer changes depending on how a problem is approached. When training large language models (LLMs), learning the capability of generating such multiple solutions is essential to accelerate their use in mathematical education. To this end, we train LLMs using a generative flow network (GFlowNet). Different from reward-maximizing reinforcement learning (RL), GFlowNet fine-tuning seeks to find diverse solutions by training the LLM whose distribution is proportional to a reward function. In numerical experiments, we evaluate GFlowNet fine-tuning and reward-maximizing RL in terms of accuracy and diversity. The results show that GFlowNet fine-tuning derives correct final answers from diverse intermediate reasoning steps, indicating the improvement of the capability of alternative solution generation.
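
The contrast between the two objectives can be illustrated on a toy set of candidate solutions. The sketch below does not train a GFlowNet or an LLM; it only computes the target that each objective aims for, given a hypothetical table of rewards: a distribution proportional to the reward versus a single reward-maximizing choice. All names and reward values are assumptions for illustration.

```python
def reward_proportional_distribution(rewards):
    """Target of a reward-proportional sampler: p(x) = R(x) / sum of all rewards."""
    total = sum(rewards.values())
    if total <= 0:
        raise ValueError("at least one reward must be positive")
    return {solution: reward / total for solution, reward in rewards.items()}


def reward_maximizing_choice(rewards):
    """Target of a reward-maximizing policy: a single highest-reward solution."""
    return max(rewards, key=rewards.get)
```

With rewards such as `{"algebraic": 2.0, "geometric": 1.0, "numeric": 1.0}`, the reward-proportional target keeps probability mass on all three correct derivations (0.5, 0.25, 0.25), whereas the reward-maximizing target collapses onto "algebraic" alone.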

new Causal Abstraction in Model Interpretability: A Compact Survey

Authors: Yihao Zhang

Abstract: The pursuit of interpretable artificial intelligence has led to significant advancements in the development of methods that aim to explain the decision-making processes of complex models, such as deep learning systems. Among these methods, causal abstraction stands out as a theoretical framework that provides a principled approach to understanding and explaining the causal mechanisms underlying model behavior. This survey paper delves into the realm of causal abstraction, examining its theoretical foundations, practical applications, and implications for the field of model interpretability.

new Prompt Diffusion Robustifies Any-Modality Prompt Learning

Authors: Yingjun Du, Gaowen Liu, Yuzhang Shang, Yuguang Yao, Ramana Kompella, Cees G. M. Snoek

Abstract: Foundation models enable prompt-based classifiers for zero-shot and few-shot learning. Nonetheless, the conventional method of employing fixed prompts suffers from distributional shifts that negatively impact generalizability to unseen samples. This paper introduces prompt diffusion, which uses a diffusion model to gradually refine the prompts to obtain a customized prompt for each sample. Specifically, we first optimize a collection of prompts to obtain over-fitted prompts per sample. Then, we propose a prompt diffusion model within the prompt space, enabling the training of a generative transition process from a random prompt to its overfitted prompt. As we cannot access the label of a test image during inference, our model gradually generates customized prompts solely from random prompts using our trained prompt diffusion. Our prompt diffusion is generic, flexible, and modality-agnostic, making it a simple plug-and-play module seamlessly embedded into existing prompt learning methods for textual, visual, or multi-modal prompt learning. Our diffusion model uses a fast ODE-based sampling strategy to optimize test sample prompts in just five steps, offering a good trade-off between performance improvement and computational efficiency. For all prompt learning methods tested, adding prompt diffusion yields more robust results for base-to-new generalization, cross-dataset generalization, and domain generalization in classification tasks tested over 15 diverse datasets.

new DeepMIDE: A Multivariate Spatio-Temporal Method for Ultra-Scale Offshore Wind Energy Forecasting

Authors: Feng Ye, Xinxi Zhang, Michael Stein, Ahmed Aziz Ezzat

Abstract: To unlock access to stronger winds, the offshore wind industry is advancing with significantly larger and taller wind turbines. This massive upscaling motivates a departure from univariate wind forecasting methods that traditionally focused on a single representative height. To fill this gap, we propose DeepMIDE--a statistical deep learning method which jointly models the offshore wind speeds across space, time, and height. DeepMIDE is formulated as a multi-output integro-difference equation model with a multivariate, nonstationary, and state-dependent kernel characterized by a set of advection vectors that encode the physics of wind field formation and propagation. Embedded within DeepMIDE, an advanced deep learning architecture learns these advection vectors from high dimensional streams of exogenous weather information, which, along with other parameters, are plugged back into the statistical model for probabilistic multi-height space-time forecasting. Tested on real-world data from future offshore wind energy sites in the Northeastern United States, the wind speed and power forecasts from DeepMIDE are shown to outperform those from prevalent time series, spatio-temporal, and deep learning methods.

new Infectious Disease Forecasting in India using LLM's and Deep Learning

Authors: Chaitya Shah, Kashish Gandhi, Javal Shah, Kreena Shah, Nilesh Patil, Kiran Bhowmick

Abstract: Many uncontrollable disease outbreaks of the past exposed several vulnerabilities in healthcare systems worldwide. While advancements in technology assisted in the rapid creation of vaccines, there needs to be a pressing focus on the prevention and prediction of such massive outbreaks. Early detection of and intervention in an outbreak can drastically reduce its impact on public health while also making the healthcare system more resilient. The complexity of disease transmission dynamics, the influence of various directly and indirectly related factors, and the limitations of traditional approaches are the main bottlenecks in taking preventive actions. Specifically, this paper implements deep learning algorithms and LLMs to predict the severity of infectious disease outbreaks. Utilizing the historic data of several diseases that have spread in India and the climatic data spanning the past decade, the insights from our research aim to assist in creating a robust predictive system for any outbreaks in the future.

new Alternatives of Unsupervised Representations of Variables on the Latent Space

Authors: Alex Glushkovsky

Abstract: The article addresses the application of unsupervised machine learning to represent variables on the 2D latent space by applying a variational autoencoder (beta-VAE). Representation of variables on low dimensional spaces allows for data visualization, disentanglement of variables based on underlying characteristics, finding of meaningful patterns and outliers, and supports interpretability. Five distinct methods have been introduced to represent variables on the latent space: (1) straightforward transposed, (2) univariate metadata of variables, such as variable statistics, empirical probability density and cumulative distribution functions, (3) adjacency matrices of different metrics, such as correlations, R2 values, Jaccard index, cosine similarity, and mutual information, (4) gradient mappings followed by spot cross product calculation, and (5) combined. Twenty-eight approaches of variable representations by beta-VAE have been considered. The pairwise spot cross product addresses relationships of gradients of two variables along latent space axes, such as orthogonal, confounded positive, confounded negative, and everything in between. The article addresses generalized representations of variables that cover both features and labels. Dealing with categorical variables, reinforced entanglement has been introduced to represent one-hot encoded categories. The article includes three examples: (1) synthetic data with known dependencies, (2) famous MNIST example of handwritten numbers, and (3) real-world multivariate time series of Canadian financial market interest rates. As a result, unsupervised representations of interest rates on the latent space correctly disentangled rates based on their type, such as bonds, T-bills, GICs, or conventional mortgages, positioned bonds and T-bills along a single curve, and ordered rates by their terms along that curve.

new Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning

Authors: Yuting Tang, Xin-Qiang Cai, Jing-Cheng Pang, Qiyu Wu, Yao-Xiang Ding, Masashi Sugiyama

Abstract: Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals. Unfortunately, designing high-quality instance-level rewards often demands significant effort. An emerging alternative, RL with delayed reward, focuses on learning from rewards presented periodically, which can be obtained from human evaluators assessing the agent's performance over sequences of behaviors. However, traditional methods in this domain assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards, both of which often do not align well with real-world scenarios. In this paper, we introduce the problem of RL from Composite Delayed Reward (RLCoDe), which generalizes traditional RL from delayed rewards by eliminating these strong assumptions. We suggest that the delayed reward may arise from a more complex structure reflecting the overall contribution of the sequence. To address this problem, we present a framework for modeling composite delayed rewards, using a weighted sum of non-Markovian components to capture the different contributions of individual steps. Building on this framework, we propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism to effectively model these contributions. We conduct experiments on challenging locomotion tasks where the agent receives delayed rewards computed from composite functions of observable step rewards. The experimental results indicate that CoDeTr consistently outperforms baseline methods across evaluated metrics. Additionally, we demonstrate that it effectively identifies the most significant time steps within the sequence and accurately predicts rewards that closely reflect the environment feedback.

new Copyright-Aware Incentive Scheme for Generative Art Models Using Hierarchical Reinforcement Learning

Authors: Zhuan Shi, Yifei Song, Xiaoli Tang, Lingjuan Lyu, Boi Faltings

Abstract: Generative art using Diffusion models has achieved remarkable performance in image generation and text-to-image tasks. However, the increasing demand for training data in generative art raises significant concerns about copyright infringement, as models can produce images highly similar to copyrighted works. Existing solutions attempt to mitigate this by perturbing Diffusion models to reduce the likelihood of generating such images, but this often compromises model performance. Another approach focuses on economically compensating data holders for their contributions, yet it fails to address copyright loss adequately. Our approach begins with the introduction of a novel copyright metric grounded in copyright law and court precedents on infringement. We then employ the TRAK method to estimate the contribution of data holders. To accommodate the continuous data collection process, we divide the training into multiple rounds. Finally, we design a hierarchical budget allocation method based on reinforcement learning to determine the budget for each round and the remuneration of the data holder based on the data holder's contribution and copyright loss in each round. Extensive experiments across three datasets show that our method outperforms all eight benchmarks, demonstrating its effectiveness in optimizing budget distribution in a copyright-aware manner. To the best of our knowledge, this is the first technical work that proposes incentivizing contributors and protecting their copyrights by compensating them.

new Chemical Language Model Linker: blending text and molecules with modular adapters

Authors: Yifan Deng, Spencer S. Ericksen, Anthony Gitter

Abstract: The development of large language models and multi-modal models has enabled the appealing idea of generating novel molecules from text descriptions. Generative modeling would shift the paradigm from relying on large-scale chemical screening to find molecules with desired properties to directly generating those molecules. However, multi-modal models combining text and molecules are often trained from scratch, without leveraging existing high-quality pretrained models. That approach consumes more computational resources and prohibits model scaling. In contrast, we propose a lightweight adapter-based strategy named Chemical Language Model Linker (ChemLML). ChemLML blends the two single domain models and obtains conditional molecular generation from text descriptions while still operating in the specialized embedding spaces of the molecular domain. ChemLML can tailor diverse pretrained text models for molecule generation by training relatively few adapter parameters. We find that the choice of molecular representation used within ChemLML, SMILES versus SELFIES, has a strong influence on conditional molecular generation performance. SMILES is often preferable despite not guaranteeing valid molecules. We raise issues in using the large PubChem dataset of molecules and their associated descriptions for evaluating molecule generation and provide a filtered version of the dataset as a generation test set. To demonstrate how ChemLML could be used in practice, we generate candidate protein inhibitors and use docking to assess their quality.

new Uncertainty-Penalized Direct Preference Optimization

Authors: Sam Houliston, Alizée Pace, Alexander Immer, Gunnar Rätsch

Abstract: Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to the issue of proxy reward overoptimization. Analysis of the DPO loss reveals a critical need for regularization for mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT2 Medium on the Anthropic-HH dataset using a model ensemble to obtain uncertainty estimates, and shows improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
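
The gradient-attenuation idea in the abstract above can be illustrated with a minimal sketch. The abstract does not spell out the penalization schemes, so the additive form below (shifting the DPO margin by `lam * uncertainty`) is a hypothetical choice made only because it shrinks the loss gradient as uncertainty grows; the function names, `lam`, and the way `uncertainty` would be obtained (for example, ensemble disagreement) are assumptions, not the authors' method.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_margin(logp_chosen: float, logp_rejected: float,
               ref_logp_chosen: float, ref_logp_rejected: float,
               beta: float = 0.1) -> float:
    """Implicit reward margin of the standard DPO loss for one preference pair."""
    return beta * ((logp_chosen - ref_logp_chosen)
                   - (logp_rejected - ref_logp_rejected))


def penalized_dpo_loss(margin: float, uncertainty: float, lam: float = 1.0) -> float:
    """Hypothetical penalized loss: -log sigmoid(margin + lam * uncertainty).

    With uncertainty = 0 this reduces to the usual DPO loss -log sigmoid(margin).
    """
    return -math.log(sigmoid(margin + lam * uncertainty))


def grad_magnitude(margin: float, uncertainty: float, lam: float = 1.0) -> float:
    """|d loss / d margin| = sigmoid(-(margin + lam * uncertainty)).

    This value decreases as uncertainty increases, so uncertain pairs
    contribute a smaller gradient under the assumed penalty form.
    """
    return sigmoid(-(margin + lam * uncertainty))
```

Under these assumptions, comparing `grad_magnitude(m, 0.0)` with `grad_magnitude(m, u)` for some `u > 0` shows how much a pair with uncertainty `u` is down-weighted relative to vanilla DPO.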

new Transferable Adversarial Attacks on SAM and Its Downstream Models

Authors: Song Xia, Wenhan Yang, Yi Yu, Xun Lin, Henghui Ding, Lingyu Duan, Xudong Jiang

Abstract: The utilization of large foundational models presents a dilemma: while fine-tuning downstream tasks from them holds promise for making use of the well-generalized knowledge in practical applications, their open accessibility also poses threats of adverse usage. This paper, for the first time, explores the feasibility of adversarially attacking various downstream models fine-tuned from the segment anything model (SAM), by solely utilizing the information from the open-sourced SAM. In contrast to prevailing transfer-based adversarial attacks, we demonstrate the existence of adversarial dangers even without accessing the downstream task and dataset to train a similar surrogate model. To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm to extract the intrinsic vulnerability inherent in the foundation model, which is then utilized as the prior knowledge to guide the generation of adversarial perturbations. Moreover, by formulating the gradient difference in the attacking process between the open-sourced SAM and its fine-tuned downstream models, we theoretically demonstrate that a deviation occurs in the adversarial update direction by directly maximizing the distance of encoded feature embeddings in the open-sourced SAM. Consequently, we propose a gradient robust loss that simulates the associated uncertainty with gradient-based noise augmentation to enhance the robustness of generated adversarial examples (AEs) towards this deviation, thus improving the transferability. Extensive experiments demonstrate the effectiveness of the proposed universal meta-initialized and gradient robust adversarial attack (UMI-GRAT) toward SAMs and their downstream models. Code is available at https://github.com/xiasong0501/GRAT.

URLs: https://github.com/xiasong0501/GRAT

new Generative AI in Health Economics and Outcomes Research: A Taxonomy of Key Definitions and Emerging Applications, an ISPOR Working Group Report

Authors: Rachael Fleurence, Xiaoyan Wang, Jiang Bian, Mitchell K. Higashi, Turgay Ayer, Hua Xu, Dalia Dawoud, Jagpreet Chhatwal

Abstract: Objective: This article offers a taxonomy of generative artificial intelligence (AI) for health economics and outcomes research (HEOR), explores its emerging applications, and outlines methods to enhance the accuracy and reliability of AI-generated outputs. Methods: The review defines foundational generative AI concepts and highlights current HEOR applications, including systematic literature reviews, health economic modeling, real-world evidence generation, and dossier development. Approaches such as prompt engineering (zero-shot, few-shot, chain-of-thought, persona pattern prompting), retrieval-augmented generation, model fine-tuning, and the use of domain-specific models are introduced to improve AI accuracy and reliability. Results: Generative AI shows significant potential in HEOR, enhancing efficiency, productivity, and offering novel solutions to complex challenges. Foundation models are promising in automating complex tasks, though challenges remain in scientific reliability, bias, interpretability, and workflow integration. The article discusses strategies to improve the accuracy of these AI tools. Conclusion: Generative AI could transform HEOR by increasing efficiency and accuracy across various applications. However, its full potential can only be realized by building HEOR expertise and addressing the limitations of current AI technologies. As AI evolves, ongoing research and innovation will shape its future role in the field.

new Revisiting Differential Verification: Equivalence Verification with Confidence

Authors: Samuel Teuber, Philipp Kern, Marvin Janzen, Bernhard Beckert

Abstract: When validated neural networks (NNs) are pruned (and retrained) before deployment, it is desirable to prove that the new NN behaves equivalently to the (original) reference NN. To this end, our paper revisits the idea of differential verification which performs reasoning on differences between NNs: On the one hand, our paper proposes a novel abstract domain for differential verification admitting more efficient reasoning about equivalence. On the other hand, we investigate empirically and theoretically which equivalence properties are (not) efficiently solved using differential reasoning. Based on the gained insights, and following a recent line of work on confidence-based verification, we propose a novel equivalence property that is amenable to differential verification while providing guarantees for large parts of the input space instead of small-scale guarantees constructed w.r.t. predetermined input points. We implement our approach in a new tool called VeryDiff and perform an extensive evaluation on numerous old and new benchmark families, including new pruned NNs for particle jet classification in the context of CERN's LHC where we observe median speedups >300x over the state-of-the-art verifier α,β-CROWN.

new SAFE setup for generative molecular design

Authors: Yassir El Mesbahi, Emmanuel Noutahi

Abstract: SMILES-based molecular generative models have been pivotal in drug design but face challenges in fragment-constrained tasks. To address this, the Sequential Attachment-based Fragment Embedding (SAFE) representation was recently introduced as an alternative that streamlines those tasks. In this study, we investigate the optimal setups for training SAFE generative models, focusing on dataset size, data augmentation through randomization, model architecture, and bond disconnection algorithms. We found that larger, more diverse datasets improve performance, with the LLaMA architecture using Rotary Positional Embedding proving most robust. SAFE-based models also consistently outperform SMILES-based approaches in scaffold decoration and linker design, particularly with BRICS decomposition yielding the best results. These insights highlight key factors that significantly impact the efficacy of SAFE-based generative models.

new Hoeffding adaptive trees for multi-label classification on data streams

Authors: Aurora Esteban, Alberto Cano, Amelia Zafra, Sebastián Ventura

Abstract: Data stream learning is a very relevant paradigm because of the increasing real-world scenarios generating data at high velocities and in unbounded sequences. Stream learning aims at developing models that can process instances as they arrive, so models constantly adapt to new concepts and the temporal evolution in the stream. In multi-label data stream environments where instances have the peculiarity of belonging simultaneously to more than one class, the problem becomes even more complex and poses unique challenges such as different concept drifts impacting different labels at simultaneous or distinct times, higher class imbalance, or new labels emerging in the stream. This paper proposes a novel approach to multi-label data stream classification called Multi-Label Hoeffding Adaptive Tree (MLHAT). MLHAT leverages the Hoeffding adaptive tree to address these challenges by considering possible relations and label co-occurrences in the partitioning process of the decision tree, dynamically adapting the learner in each leaf node of the tree, and implementing a concept drift detector that can quickly detect and replace tree branches that are no longer performing well. The proposed approach is compared with 18 other online multi-label classifiers on 41 datasets. The results, validated with statistical analysis, show that MLHAT outperforms other state-of-the-art approaches in 12 well-known multi-label metrics.

new Model Equality Testing: Which Model Is This API Serving?

Authors: Irena Gao, Percy Liang, Carlos Guestrin

Abstract: Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). In order to cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model, changing the output distribution -- often without notifying users. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem, where the user collects samples from the API and a reference distribution and conducts a statistical test to see if the two distributions are the same. We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
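
The two-sample test described above can be sketched as an MMD statistic combined with a permutation test. The abstract only says the kernel is "a simple string kernel", so the padded Hamming-style kernel, the biased MMD estimator, the permutation procedure, and all names and defaults below are assumptions chosen for illustration, not the authors' implementation.

```python
import random
from itertools import product


def hamming_kernel(a: str, b: str, max_len: int = 32) -> float:
    """Fraction of aligned character positions that match.

    Both strings are truncated or padded to a fixed length so the kernel
    is a normalized sum of per-position match indicators.
    """
    a = a[:max_len].ljust(max_len, "\0")
    b = b[:max_len].ljust(max_len, "\0")
    return sum(ca == cb for ca, cb in zip(a, b)) / max_len


def mmd2(xs, ys, kernel=hamming_kernel) -> float:
    """Biased (V-statistic) estimate of the squared Maximum Mean Discrepancy."""
    def mean_k(us, vs):
        return sum(kernel(u, v) for u, v in product(us, vs)) / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def permutation_test(xs, ys, n_perm: int = 200, seed: int = 0):
    """Return (observed MMD^2, permutation p-value).

    Under the null hypothesis that both samples come from the same
    distribution, reshuffling the pooled samples should give statistics
    as large as the observed one reasonably often.
    """
    rng = random.Random(seed)
    observed = mmd2(xs, ys)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

In use, `xs` would hold completions sampled from the API and `ys` completions sampled from the reference weights for the same prompt; a small p-value suggests the endpoint serves a different distribution.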

new Convergence Guarantees for the DeepWalk Embedding on Block Models

Authors: Christopher Harker, Aditya Bhaskara

Abstract: Graph embeddings have emerged as a powerful tool for understanding the structure of graphs. Unlike classical spectral methods, recent methods such as DeepWalk, Node2Vec, etc. are based on solving nonlinear optimization problems on the graph, using local information obtained by performing random walks. These techniques have empirically been shown to produce "better" embeddings than their classical counterparts. However, due to their reliance on solving a nonconvex optimization problem, obtaining theoretical guarantees on the properties of the solution has remained a challenge, even for simple classes of graphs. In this work, we show convergence properties for the DeepWalk algorithm on graphs obtained from the Stochastic Block Model (SBM). Despite being simplistic, the SBM has proved to be a classic model for analyzing the behavior of algorithms on large graphs. Our results mirror the existing ones for spectral embeddings on SBMs, showing that even in the case of one-dimensional embeddings, the output of the DeepWalk algorithm provably recovers the cluster structure with high probability.

new Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL

Authors: Andrew Wagenmaker, Kevin Huang, Liyiming Ke, Byron Boots, Kevin Jamieson, Abhishek Gupta

Abstract: In order to mitigate the sample complexity of real-world reinforcement learning, common practice is to first train a policy in a simulator where samples are cheap, and then deploy this policy in the real world, with the hope that it generalizes effectively. Such \emph{direct sim2real} transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the simulator. In this work, we show that in many regimes, while direct sim2real transfer may fail, we can utilize the simulator to learn a set of \emph{exploratory} policies which enable efficient exploration in the real world. In particular, in the setting of low-rank MDPs, we show that coupling these exploratory policies with simple, practical approaches -- least-squares regression oracles and naive randomized exploration -- yields a polynomial sample complexity in the real world, an exponential improvement over direct sim2real transfer, or learning without access to a simulator. To the best of our knowledge, this is the first evidence that simulation transfer yields a provable gain in reinforcement learning in settings where direct sim2real transfer fails. We validate our theoretical results on several realistic robotic simulators and a real-world robotic sim2real task, demonstrating that transferring exploratory policies can yield substantial gains in practice as well.

new Equivariant Blurring Diffusion for Hierarchical Molecular Conformer Generation

Authors: Jiwoong Park, Yang Shen

Abstract: How can diffusion models process 3D geometries in a coarse-to-fine manner, akin to our multiscale view of the world? In this paper, we address the question by focusing on a fundamental biochemical problem of generating 3D molecular conformers conditioned on molecular graphs in a multiscale manner. Our approach consists of two hierarchical stages: i) generation of coarse-grained fragment-level 3D structure from the molecular graph, and ii) generation of fine atomic details from the coarse-grained approximated structure while allowing the latter to be adjusted simultaneously. For the challenging second stage, which demands preserving coarse-grained information while ensuring SE(3) equivariance, we introduce a novel generative model termed Equivariant Blurring Diffusion (EBD), which defines a forward process that moves towards the fragment-level coarse-grained structure by blurring the fine atomic details of conformers, and a reverse process that performs the opposite operation using equivariant networks. We demonstrate the effectiveness of EBD by geometric and chemical comparison to state-of-the-art denoising diffusion models on a benchmark of drug-like molecules. Ablation studies draw insights on the design of EBD by thoroughly analyzing its architecture, which includes the design of the loss function and the data corruption process. Codes are released at https://github.com/Shen-Lab/EBD .

URLs: https://github.com/Shen-Lab/EBD

new Centaur: a foundation model of human cognition

Authors: Marcel Binz, Elif Akata, Matthias Bethge, Franziska Brändle, Fred Callaway, Julian Coda-Forno, Peter Dayan, Can Demircan, Maria K. Eckstein, Noémi Éltető, Thomas L. Griffiths, Susanne Haridi, Akshay K. Jagadish, Li Ji-An, Alexander Kipnis, Sreejan Kumar, Tobias Ludwig, Marvin Mathony, Marcelo Mattar, Alireza Modirshanechi, Surabhi S. Nath, Joshua C. Peterson, Milena Rmus, Evan M. Russek, Tankred Saanum, Natalia Scharfenberg, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi, Xin Sui, Mirko Thalmann, Fabian Theis, Vuong Truong, Vishaal Udandarao, Konstantinos Voudouris, Robert Wilson, Kristin Witte, Shuchen Wu, Dirk Wulff, Huadong Xiong, Eric Schulz

Abstract: Establishing a unified theory of cognition has been a major goal of psychology. While there have been previous attempts to instantiate such theories by building computational models, we currently do not have one model that captures the human mind in its entirety. Here we introduce Centaur, a computational model that can predict and simulate human behavior in any experiment expressible in natural language. We derived Centaur by finetuning a state-of-the-art language model on a novel, large-scale data set called Psych-101. Psych-101 reaches an unprecedented scale, covering trial-by-trial data from over 60,000 participants performing over 10,000,000 choices in 160 experiments. Centaur not only captures the behavior of held-out participants better than existing cognitive models, but also generalizes to new cover stories, structural task modifications, and entirely new domains. Furthermore, we find that the model's internal representations become more aligned with human neural activity after finetuning. Taken together, Centaur is the first real candidate for a unified model of human cognition. We anticipate that it will have a disruptive impact on the cognitive sciences, challenging the existing paradigm for developing computational models.

new Library Learning Doesn't: The Curious Case of the Single-Use "Library"

Authors: Ian Berlot-Attwell, Frank Rudzicz, Xujie Si

Abstract: Advances in Large Language Models (LLMs) have spurred a wave of LLM library learning systems for mathematical reasoning. These systems aim to learn a reusable library of tools, such as formal Isabelle lemmas or Python programs that are tailored to a family of tasks. Many of these systems are inspired by the human structuring of knowledge into reusable and extendable concepts, but do current methods actually learn reusable libraries of tools? We study two library learning systems for mathematics which both reported increased accuracy: LEGO-Prover and TroVE. We find that function reuse is extremely infrequent on miniF2F and MATH. Our followup ablation experiments suggest that, rather than reuse, self-correction and self-consistency are the primary drivers of the observed performance gains. Our code and data are available at https://github.com/ikb-a/curious-case

URLs: https://github.com/ikb-a/curious-case

new Proactive Fraud Defense: Machine Learning's Evolving Role in Protecting Against Online Fraud

Authors: Md Kamrul Hasan Chy

Abstract: As online fraud becomes more sophisticated and pervasive, traditional fraud detection methods are struggling to keep pace with the evolving tactics employed by fraudsters. This paper explores the transformative role of machine learning in addressing these challenges by offering more advanced, scalable, and adaptable solutions for fraud detection and prevention. By analyzing key models such as Random Forest, Neural Networks, and Gradient Boosting, this paper highlights the strengths of machine learning in processing vast datasets, identifying intricate fraud patterns, and providing real-time predictions that enable a proactive approach to fraud prevention. Unlike rule-based systems that react after fraud has occurred, machine learning models continuously learn from new data, adapting to emerging fraud schemes and reducing false positives, which ultimately minimizes financial losses. This research emphasizes the potential of machine learning to revolutionize fraud detection frameworks by making them more dynamic, efficient, and capable of handling the growing complexity of fraud across various industries. Future developments in machine learning, including deep learning and hybrid models, are expected to further enhance the predictive accuracy and applicability of these systems, ensuring that organizations remain resilient in the face of new and emerging fraud tactics.

new Classification under strategic adversary manipulation using pessimistic bilevel optimisation

Authors: David Benfield, Stefano Coniglio, Martin Kunc, Phan Tu Vuong, Alain Zemkoho

Abstract: Adversarial machine learning concerns situations in which learners face attacks from active adversaries. Such scenarios arise in applications such as spam email filtering, malware detection and fake-image generation, where security methods must be actively updated to keep up with the ever-improving generation of malicious data. We model these interactions between the learner and the adversary as a game and formulate the problem as a pessimistic bilevel optimisation problem with the learner taking the role of the leader. The adversary, modelled as a stochastic data generator, takes the role of the follower, generating data in response to the classifier. While existing models rely on the assumption that the adversary will choose the least costly solution, leading to a convex lower-level problem with a unique solution, we present a novel model and solution method which do not make such assumptions. We compare these to the existing approach and see significant improvements in performance, suggesting that relaxing these assumptions leads to a more realistic model.

new A Systematic Review of Machine Learning Approaches for Detecting Deceptive Activities on Social Media: Methods, Challenges, and Biases

Authors: Yunchong Liu, Xiaorui Shen, Yeyubei Zhang, Zhongyan Wang, Yexin Tian, Jianglai Dai, Yuchen Cao

Abstract: Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review identified key biases across the ML lifecycle: selection bias due to non-representative sampling, inadequate handling of class imbalance, insufficient linguistic preprocessing (e.g., negations), and inconsistent hyperparameter tuning. Although models such as Support Vector Machines (SVM), Random Forests, and Long Short-Term Memory (LSTM) networks showed strong potential, over-reliance on accuracy as an evaluation metric in imbalanced data settings was a common flaw. The review highlights the need for improved data preprocessing (e.g., resampling techniques), consistent hyperparameter tuning, and the use of appropriate metrics like precision, recall, F1 score, and AUROC. Addressing these limitations can lead to more reliable and generalizable ML/DL models for detecting deceptive content, ultimately contributing to the reduction of misinformation on social media.
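
The point about over-reliance on accuracy under class imbalance can be made concrete with a small worked example. The labels below are synthetic and purely illustrative (95 legitimate items, 5 deceptive ones, and a trivial classifier that always predicts "legitimate"); none of these numbers come from the reviewed studies.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Synthetic imbalanced data: 5% positives, classifier predicts all negatives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
scores = binary_metrics(y_true, y_pred)
```

Here `scores["accuracy"]` is 0.95 even though the classifier detects no deceptive item at all, while recall and F1 are both 0, which is why the review recommends reporting those metrics alongside accuracy.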

new DeCaf: A Causal Decoupling Framework for OOD Generalization on Node Classification

Authors: Xiaoxue Han, Huzefa Rangwala, Yue Ning

Abstract: Graph Neural Networks (GNNs) are susceptible to distribution shifts, creating vulnerability and security issues in critical domains. There is a pressing need to enhance the generalizability of GNNs on out-of-distribution (OOD) test data. Existing methods that target learning an invariant (feature, structure)-label mapping often depend on oversimplified assumptions about the data generation process, which do not adequately reflect the actual dynamics of distribution shifts in graphs. In this paper, we introduce a more realistic graph data generation model using Structural Causal Models (SCMs), allowing us to redefine distribution shifts by pinpointing their origins within the generation process. Building on this, we propose a causal decoupling framework, DeCaf, that independently learns unbiased feature-label and structure-label mappings. We provide a detailed theoretical framework that shows how our approach can effectively mitigate the impact of various distribution shifts. We evaluate DeCaf across both real-world and synthetic datasets that demonstrate different patterns of shifts, confirming its efficacy in enhancing the generalizability of GNNs.

new Predicting Mortality and Functional Status Scores of Traumatic Brain Injury Patients using Supervised Machine Learning

Authors: Lucas Steinmetz, Shivam Maheshwari, Garik Kazanjian, Abigail Loyson, Tyler Alexander, Venkat Margapuri, C. Nataraj

Abstract: Traumatic brain injury (TBI) presents a significant public health challenge, often resulting in mortality or lasting disability. Predicting outcomes such as mortality and Functional Status Scale (FSS) scores can enhance treatment strategies and inform clinical decision-making. This study applies supervised machine learning (ML) methods to predict mortality and FSS scores using a real-world dataset of 300 pediatric TBI patients from the University of Colorado School of Medicine. The dataset captures clinical features, including demographics, injury mechanisms, and hospitalization outcomes. Eighteen ML models were evaluated for mortality prediction, and thirteen models were assessed for FSS score prediction. Performance was measured using accuracy, ROC AUC, F1-score, and mean squared error. Logistic regression and Extra Trees models achieved high precision in mortality prediction, while linear regression demonstrated the best FSS score prediction. Feature selection reduced 103 clinical variables to the most relevant, enhancing model efficiency and interpretability. This research highlights the role of ML models in identifying high-risk patients and supporting personalized interventions, demonstrating the potential of data-driven analytics to improve TBI care and integrate into clinical workflows.

new Sequential Large Language Model-Based Hyper-Parameter Optimization

Authors: Kanan Mahammadli

Abstract: This study introduces SLLMBO, an innovative framework that leverages Large Language Models (LLMs) for hyperparameter optimization (HPO), incorporating dynamic search space adaptability, enhanced parameter landscape exploitation, and a hybrid, novel LLM-Tree-structured Parzen Estimator (LLM-TPE) sampler. By addressing limitations in recent fully LLM-based methods and traditional Bayesian Optimization (BO), SLLMBO achieves more robust optimization. This comprehensive benchmarking evaluates multiple LLMs, including GPT-3.5-turbo, GPT-4o, Claude-Sonnet-3.5, and Gemini-1.5-flash, extending prior work beyond GPT-3.5 and GPT-4 and establishing SLLMBO as the first framework to benchmark a diverse set of LLMs for HPO. By integrating LLMs' established strengths in parameter initialization with the exploitation abilities demonstrated in this study, alongside TPE's exploration capabilities, the LLM-TPE sampler achieves a balanced exploration-exploitation trade-off, reduces API costs, and mitigates premature early stoppings for more effective parameter searches. Across 14 tabular tasks in classification and regression, the LLM-TPE sampler outperformed fully LLM-based methods and achieved superior results over BO methods in 9 tasks. Testing early stopping in budget-constrained scenarios further demonstrated competitive performance, indicating that LLM-based methods generally benefit from extended iterations for optimal results. This work lays the foundation for future research exploring open-source LLMs, reproducibility of LLM results in HPO, and benchmarking SLLMBO on complex datasets, such as image classification, segmentation, and machine translation.

new Accelerating Direct Preference Optimization with Prefix Sharing

Authors: Franklin Wang, Sumanth Hegde

Abstract: Offline paired preference optimization algorithms have become a popular approach for fine-tuning on preference data, outperforming traditional supervised fine-tuning in various tasks. However, traditional implementations often involve redundant computations, especially for tasks with long shared prompts. We introduce prefix sharing for preference tuning, a novel technique that processes chosen and rejected responses as one sequence with a shared prefix. To prevent cross-response contamination, we use a custom block-sparse attention mask. Our method achieves $1.1$-$1.5\times$ improvement in training throughput on popular DPO datasets, without any effect on convergence. When combined with sequence packing, we observe consistent $1.3$-$1.6\times$ speedups, benefiting even datasets with smaller sequence lengths. While we focus on Direct Preference Optimization (DPO), our approach is applicable to other paired preference tuning methods. By enhancing computational efficiency, our work contributes to making preference-based fine-tuning more accessible for a wider range of applications and model sizes. We open-source our code at https://github.com/frankxwang/dpo-prefix-sharing.

URLs: https://github.com/frankxwang/dpo-prefix-sharing
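
The prefix-sharing idea in the entry above (one sequence holding the prompt, the chosen response, and the rejected response, with a mask that stops the two responses from seeing each other) can be sketched as a boolean attention mask. The token layout `[prompt | chosen | rejected]`, the dense list-of-lists representation, and the function name are assumptions for illustration; the authors' block-sparse implementation is in their repository and is not reproduced here.

```python
def prefix_sharing_mask(prompt_len: int, chosen_len: int, rejected_len: int):
    """Build mask[i][j] = True if token i may attend to token j.

    Assumed layout: [prompt tokens | chosen tokens | rejected tokens].
    The mask is causal, and rejected tokens are additionally blocked from
    attending to chosen tokens, so each response only sees the shared
    prompt and its own earlier tokens.
    """
    chosen_end = prompt_len + chosen_len
    total = chosen_end + rejected_len
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            causal = j <= i
            # A rejected-response query looking at a chosen-response key.
            crosses = i >= chosen_end and prompt_len <= j < chosen_end
            row.append(causal and not crosses)
        mask.append(row)
    return mask
```

For example, `prefix_sharing_mask(2, 2, 2)` gives a 6x6 mask in which the first rejected token (row 4) attends to the two prompt tokens and itself, but not to the chosen tokens in columns 2 and 3. A real implementation would likely also need to handle position indices for the rejected response, which this sketch leaves out.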

new ANOMIX: A Simple yet Effective Hard Negative Generation via Mixing for Graph Anomaly Detection

Authors: Hwan Kim, Junghoon Kim, Sungsu Lim

Abstract: Graph contrastive learning (GCL) generally requires a large number of samples. One effective way to reduce the number of samples is to use hard negatives (e.g., Mixup). Designing a mixing-based approach for graph anomaly detection (GAD) can be difficult due to imbalanced data or a limited number of anomalies. We propose ANOMIX, a framework that consists of a novel graph mixing approach, ANOMIX-M, and multi-level contrasts for GAD. ANOMIX-M can effectively mix abnormality and normality from the input graph to generate hard negatives, which are important for efficient GCL. ANOMIX is (a) a first mixing approach: the first attempt at graph mixing to generate hard negatives for the GAD task, with node- and subgraph-level contrasts to distinguish underlying anomalies; (b) accurate: achieving the highest AUC, up to 5.49% higher and 1.76% faster; (c) effective: reducing the number of samples by nearly 80% in GCL. Code is available at https://github.com/missinghwan/ANOMIX.

URLs: https://github.com/missinghwan/ANOMIX

new Q-Distribution guided Q-learning for offline reinforcement learning: Uncertainty penalized Q-value via consistency model

Authors: Jing Zhang, Linjiajie Fang, Kexin Shi, Wenjia Wang, Bing-Yi Jing

Abstract: "Distribution shift" is the main obstacle to the success of offline reinforcement learning. A learning policy may take actions beyond the behavior policy's knowledge, referred to as Out-of-Distribution (OOD) actions. The Q-values for these OOD actions can be easily overestimated. As a result, the learning policy is biased by using incorrect Q-value estimates. One common approach to avoid Q-value overestimation is to make a pessimistic adjustment. Our key idea is to penalize the Q-values of OOD actions associated with high uncertainty. In this work, we propose Q-Distribution Guided Q-Learning (QDQ), which applies a pessimistic adjustment to Q-values in OOD regions based on uncertainty estimation. This uncertainty measure relies on the conditional Q-value distribution, learned through a high-fidelity and efficient consistency model. Additionally, to prevent overly conservative estimates, we introduce an uncertainty-aware optimization objective for updating the Q-value function. The proposed QDQ demonstrates solid theoretical guarantees for the accuracy of Q-value distribution learning and uncertainty measurement, as well as the performance of the learning policy. QDQ consistently shows strong performance on the D4RL benchmark and achieves significant improvements across many tasks.

new ProtSCAPE: Mapping the landscape of protein conformations in molecular dynamics

Authors: Siddharth Viswanath, Dhananjay Bhaskar, David R. Johnson, Joao Felipe Rocha, Egbert Castro, Jackson D. Grady, Alex T. Grigas, Michael A. Perlmutter, Corey S. O'Hern, Smita Krishnaswamy

Abstract: Understanding the dynamic nature of protein structures is essential for comprehending their biological functions. While significant progress has been made in predicting static folded structures, modeling protein motions on microsecond to millisecond scales remains challenging. To address these challenges, we introduce a novel deep learning architecture, Protein Transformer with Scattering, Attention, and Positional Embedding (ProtSCAPE), which leverages the geometric scattering transform alongside transformer-based attention mechanisms to capture protein dynamics from molecular dynamics (MD) simulations. ProtSCAPE utilizes the multi-scale nature of the geometric scattering transform to extract features from protein structures conceptualized as graphs and integrates these features with dual attention structures that focus on residues and amino acid signals, generating latent representations of protein trajectories. Furthermore, ProtSCAPE incorporates a regression head to enforce temporally coherent latent representations.

new Domain Specific Data Distillation and Multi-modal Embedding Generation

Authors: Sharadind Peddiraju, Srini Rajagopal

Abstract: The challenge of creating domain-centric embeddings arises from the abundance of unstructured data and the scarcity of domain-specific structured data. Conventional embedding techniques often rely on either modality, limiting their applicability and efficacy. This paper introduces a novel modeling approach that leverages structured data to filter noise from unstructured data, resulting in embeddings with high precision and recall for domain-specific attribute prediction. The proposed model operates within a Hybrid Collaborative Filtering (HCF) framework, where generic entity representations are fine-tuned through relevant item prediction tasks. Our experiments, focusing on the cloud computing domain, demonstrate that HCF-based embeddings outperform AutoEncoder-based embeddings (using purely unstructured data), achieving a 28% lift in precision and an 11% lift in recall for domain-specific attribute prediction.

new Embedded Nonlocal Operator Regression (ENOR): Quantifying model error in learning nonlocal operators

Authors: Yiming Fan, Habib Najm, Yue Yu, Stewart Silling, Marta D'Elia

Abstract: Nonlocal, integral operators have become an efficient surrogate for bottom-up homogenization, due to their ability to represent long-range dependence and multiscale effects. However, the nonlocal homogenized model has an unavoidable discrepancy from the microscale model. Such errors accumulate and propagate in long-term simulations, making the resultant prediction unreliable. To make bottom-up homogenization robust and reliable, we propose a new framework, which we coin Embedded Nonlocal Operator Regression (ENOR), to learn a nonlocal homogenized surrogate model and its structural model error. This framework provides discrepancy-adaptive uncertainty quantification for homogenized material response predictions in long-term simulations. The method is built on Nonlocal Operator Regression (NOR), an optimization-based nonlocal kernel learning approach, together with an embedded model error term in the trainable kernel. Then, Bayesian inference is employed to infer the model error term parameters together with the kernel parameters. To make the problem computationally feasible, we use a multilevel delayed acceptance Markov chain Monte Carlo (MLDA-MCMC) method, enabling efficient Bayesian model calibration and model error estimation. We apply this technique to predict long-term wave propagation in a heterogeneous one-dimensional bar, and compare its performance with additive noise models. Owing to its ability to capture model error, the learned ENOR achieves improved estimation of posterior predictive uncertainty.

new Intuitionistic Fuzzy Universum Twin Support Vector Machine for Imbalanced Data

Authors: A. Quadir, M. Tanveer

Abstract: One of the major difficulties in machine learning methods is categorizing datasets that are imbalanced. This problem may lead to biased models, where the training process is dominated by the majority class, resulting in inadequate representation of the minority class. Universum twin support vector machine (UTSVM) produces a model biased towards the majority class; as a result, its performance on the minority class is often poor, as minority samples might be mistakenly treated as noise. Moreover, UTSVM is not proficient in handling datasets that contain outliers and noise. Inspired by the concept of incorporating prior information about the data and employing an intuitionistic fuzzy membership scheme, we propose intuitionistic fuzzy universum twin support vector machines for imbalanced data (IFUTSVM-ID). We use an intuitionistic fuzzy membership scheme to mitigate the impact of noise and outliers. Moreover, to tackle the problem of imbalanced class distribution, data oversampling and undersampling methods are utilized. Prior knowledge about the data is provided by universum data. This leads to better generalization performance. UTSVM is susceptible to overfitting due to the omission of the structural risk minimization (SRM) principle in its primal formulation. However, the proposed IFUTSVM-ID model incorporates the SRM principle through regularization terms, effectively addressing the issue of overfitting. We conduct a comprehensive evaluation of the proposed IFUTSVM-ID model on benchmark datasets from KEEL and compare it with existing baseline models. Furthermore, to assess the effectiveness of the proposed IFUTSVM-ID model in diagnosing Alzheimer's disease (AD), we apply it to the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. Experimental results showcase the superiority of the proposed IFUTSVM-ID model compared to the baseline models.

new Leveraging Auxiliary Task Relevance for Enhanced Industrial Fault Diagnosis through Curriculum Meta-learning

Authors: Jinze Wang, Tiehua Zhang, Boon Xian Chai, Adriano Di Pietro, Dimitrios Georgakopoulos, Jiong Jin

Abstract: The accurate diagnosis of machine breakdowns is crucial for maintaining operational safety in smart manufacturing. Despite the promise shown by deep learning in automating fault identification, the scarcity of labeled training data, particularly for equipment failure instances, poses a significant challenge. This limitation hampers the development of robust classification models. Existing methods like model-agnostic meta-learning (MAML) do not adequately address variable working conditions, affecting knowledge transfer. To address these challenges, a Related Task Aware Curriculum Meta-learning (RT-ACM) enhanced fault diagnosis framework is proposed in this paper, inspired by human cognitive learning processes. RT-ACM improves training by considering the relevance of auxiliary working conditions, adhering to the principle of "paying more attention to more relevant knowledge", and focusing on "easier first, harder later" curriculum sampling. This approach aids the meta-learner in achieving a superior convergence state. Extensive experiments on two real-world datasets demonstrate the superiority of the RT-ACM framework.

new Uncovering Capabilities of Model Pruning in Graph Contrastive Learning

Authors: Wu Junran, Chen Xueyuan, Li Shangzhe

Abstract: Graph contrastive learning has achieved great success in pre-training graph neural networks without ground-truth labels. Leading graph contrastive learning follows the classical scheme of contrastive learning, forcing the model to identify the essential information from augmented views. However, general augmented views are produced via random corruption or learning, which inevitably leads to semantics alteration. Although domain-knowledge-guided augmentations alleviate this issue, the generated views are domain specific and undermine generalization. In this work, motivated by the firm representation ability of sparse models obtained by pruning, we reformulate the problem of graph contrastive learning via contrasting different model versions rather than augmented views. We first theoretically reveal the superiority of model pruning in contrast to data augmentations. In practice, we take the original graph as input and dynamically generate a perturbed graph encoder to contrast with the original encoder by pruning its transformation weights. Furthermore, considering the integrity of node embedding in our method, we are able to develop a local contrastive loss to tackle the hard negative samples that disturb the model training. We extensively validate our method on various benchmarks regarding graph classification via unsupervised and transfer learning. Compared to the state-of-the-art (SOTA) works, better performance can always be obtained by the proposed method.
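
As a rough illustration of what "pruning transformation weights" to obtain a perturbed encoder could look like, the sketch below zeroes the smallest-magnitude entries of a weight matrix. Magnitude-based pruning, the function name `magnitude_prune`, and the `sparsity` parameter are assumptions made for illustration; the abstract does not state which pruning criterion the authors use.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries set to zero.

    `weights` is a list of rows (a dense matrix); `sparsity` is the fraction of
    entries to zero. Entries tied with the threshold are also zeroed, so slightly
    more than the requested fraction may be pruned.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)  # number of entries to zero
    if k == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Hypothetical 2x2 transformation matrix of one encoder layer.
original = [[0.1, -2.0], [0.5, -0.05]]
pruned = magnitude_prune(original, sparsity=0.5)
```

In a contrastive setup of the kind the abstract describes, the same input graph would be passed through the encoder with `original` weights and the encoder with `pruned` weights, and the two resulting embeddings would be contrasted.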

new Rethinking Reconstruction-based Graph-Level Anomaly Detection: Limitations and a Simple Remedy

Authors: Sunwoo Kim, Soo Yong Lee, Fanchen Bu, Shinhwan Kang, Kyungho Kim, Jaemin Yoo, Kijung Shin

Abstract: Graph autoencoders (Graph-AEs) learn representations of given graphs by aiming to accurately reconstruct them. A notable application of Graph-AEs is graph-level anomaly detection (GLAD), whose objective is to identify graphs with anomalous topological structures and/or node features compared to the majority of the graph population. Graph-AEs for GLAD regard a graph with a high mean reconstruction error (i.e., the mean of errors from all node pairs and/or nodes) as an anomaly. Namely, the methods rest on the assumption that they would better reconstruct graphs with similar characteristics to the majority. We, however, report non-trivial counter-examples, a phenomenon we call reconstruction flip, and highlight the limitations of the existing Graph-AE-based GLAD methods. Specifically, we empirically and theoretically investigate when this assumption holds and when it fails. Through our analyses, we further argue that, while the reconstruction errors for a given graph are effective features for GLAD, leveraging the multifaceted summaries of the reconstruction errors, beyond just the mean, can further strengthen the features. Thus, we propose a novel and simple GLAD method, named MUSE. The key innovation of MUSE involves taking multifaceted summaries of reconstruction errors as graph features for GLAD. This surprisingly simple method obtains SOTA performance in GLAD, performing best overall among 14 methods across 10 datasets.
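
To make the idea of summarizing reconstruction errors with more than the mean concrete, here is a minimal sketch that turns a list of per-node (or per-pair) errors into several summary statistics. The function name `summarize_errors` and the particular statistics chosen are assumptions for illustration, not the feature set used by MUSE.

```python
import statistics


def summarize_errors(errors):
    """Return several summary statistics of one graph's reconstruction errors."""
    if not errors:
        raise ValueError("errors must be non-empty")
    return {
        "mean": statistics.fmean(errors),
        "std": statistics.pstdev(errors),  # population standard deviation
        "min": min(errors),
        "max": max(errors),
        "median": statistics.median(errors),
    }


# Two hypothetical graphs with the same mean error but a different spread.
features_a = summarize_errors([0.5, 0.5, 0.5, 0.5])
features_b = summarize_errors([0.0, 0.0, 1.0, 1.0])
```

A detector that looks only at the mean cannot tell these two graphs apart, whereas the standard deviation, minimum, and maximum separate them.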

new FuseFL: One-Shot Federated Learning through the Lens of Causality with Progressive Model Fusion

Authors: Zhenheng Tang, Yonggang Zhang, Peijie Dong, Yiu-ming Cheung, Amelie Chi Zhou, Bo Han, Xiaowen Chu

Abstract: One-shot Federated Learning (OFL) significantly reduces communication costs in FL by aggregating trained models only once. However, the performance of advanced OFL methods is far behind that of normal FL. In this work, we provide a causal view to find that this performance drop of OFL methods comes from the isolation problem, which means that models trained locally in isolation in OFL may easily fit spurious correlations due to data heterogeneity. From the causal perspective, we observe that the spurious fitting can be alleviated by augmenting intermediate features from other clients. Built upon our observation, we propose a novel learning approach to endow OFL with superb performance and low communication and storage costs, termed FuseFL. Specifically, FuseFL decomposes neural networks into several blocks, and progressively trains and fuses each block in a bottom-up manner for feature augmentation, introducing no additional communication costs. Comprehensive experiments demonstrate that FuseFL outperforms existing OFL and ensemble FL by a significant margin. We conduct comprehensive experiments to show that FuseFL supports high scalability of clients, heterogeneous model training, and low memory costs. Our work is the first attempt to use causality to analyze and alleviate data heterogeneity of OFL.

new Multiple kernel concept factorization algorithm based on global fusion

Authors: Fei Li, Liang Du, Chaohong Ren

Abstract: The Non-negative Matrix Factorization (NMF) algorithm can only be used to find a low-rank approximation of the original non-negative data, while the Concept Factorization (CF) algorithm extends matrix factorization to a single non-linear kernel space, improving the learning ability and adaptability of matrix factorization. To address the difficulty of designing or selecting a proper kernel function for a specific dataset in an unsupervised setting, a new algorithm called Globalized Multiple Kernel CF (GMKCF) was proposed. Multiple candidate kernel functions were input at the same time and learned in the CF framework based on global linear fusion, obtaining a clustering result with high quality and stability and solving the kernel function selection problem that CF faced. The convergence of the proposed algorithm was verified by solving the model with alternating iteration. The experimental results on several real databases show that the proposed algorithm outperforms comparison algorithms in data clustering, such as Kernel K-Means (KKM), Spectral Clustering (SC), Kernel CF (KCF), Co-regularized multi-view spectral clustering (Coreg), and Robust Multiple KKM (RMKKM).

new Unsupervised Feature Selection Algorithm Based on Dual Manifold Re-ranking

Authors: Yunhui Liang, Jianwen Gan, Yan Chen, Peng Zhou, Liang Du

Abstract: High-dimensional data is commonly encountered in numerous data analysis tasks. Feature selection techniques aim to identify the most representative features from the original high-dimensional data. Due to the absence of class label information, it is significantly more challenging to select appropriate features in unsupervised learning scenarios compared to supervised ones. Traditional unsupervised feature selection methods typically score the features of samples based on certain criteria, treating samples indiscriminately. However, these approaches fail to fully capture the internal structure of the data. The importance of different samples should vary, and there is a dual relationship between the weight of samples and features that will influence each other. Therefore, an unsupervised feature selection algorithm based on dual manifold re-ranking (DMRR) is proposed in this paper. Different similarity matrices are constructed to depict the manifold structures among samples, between samples and features, and among features themselves. Then, manifold re-ranking is performed by combining the initial scores of samples and features. By comparing DMRR with three original unsupervised feature selection algorithms and two unsupervised feature selection post-processing algorithms, experimental results confirm that the importance information of different samples and the dual relationship between sample and feature are beneficial for achieving better feature selection.

new Hierarchical Multiple Kernel K-Means Algorithm Based on Sparse Connectivity

Authors: Lei Wang, Liang Du, Peng Zhou

Abstract: Multiple kernel learning (MKL) aims to find an optimal, consistent kernel function. In the hierarchical multiple kernel clustering (HMKC) algorithm, sample features are extracted layer by layer from a high-dimensional space to maximize the retention of effective information. However, information interaction between layers is often ignored. In this model, only corresponding nodes in adjacent layers exchange information; other nodes remain isolated, and if full connectivity is adopted, the diversity of the final consistency matrix is reduced. Therefore, this paper proposes a hierarchical multiple kernel K-Means (SCHMKKM) algorithm based on sparse connectivity, which controls the assignment matrix to achieve sparse connections through a sparsity rate, thereby locally fusing the features obtained by distilling information between layers. Finally, we conduct cluster analysis on multiple datasets and compare it with the fully connected hierarchical multiple kernel K-Means (FCHMKKM) algorithm in experiments. It is shown that more discriminative information fusion is beneficial for learning a better consistent partition matrix, and the fusion strategy based on sparse connection outperforms the full connection strategy.

new Evaluation of uncertainty estimations for Gaussian process regression based machine learning interatomic potentials

Authors: Matthias Holzenkamp, Dongyu Lyu, Ulrich Kleinekathöfer, Peter Zaspel

Abstract: Machine learning interatomic potentials (MLIPs) have seen significant advances as an efficient replacement for expensive quantum chemical calculations. Uncertainty estimations for MLIPs are crucial to quantify the additional model error they introduce and to leverage this information in active learning strategies. MLIPs that are based on Gaussian process regression provide a standard deviation as a possible uncertainty measure. An alternative approach is ensemble-based uncertainties. Although these uncertainty measures have been applied to active learning, it has rarely been studied how they correlate with the error, and it is not always clear whether active learning actually outperforms random sampling strategies. We consider GPR models with Coulomb and SOAP representations as inputs to predict potential energy surfaces and excitation energies of molecules. We evaluate how the GPR variance and ensemble-based uncertainties relate to the error and whether model performance improves by selecting the most uncertain samples from a fixed configuration space. For the ensemble-based uncertainty estimations, we find that they often do not provide any information about the error. For the GPR standard deviation, we find that often predictions with an increasing standard deviation also have an increasing systematic bias, which is not captured by the uncertainty. In these cases, selecting training samples with the highest uncertainty leads to a model with a worse test error compared to random sampling. We conclude that confidence intervals, which are derived from the predictive standard deviation, can be highly overconfident. Selecting samples with high GPR standard deviation leads to a model that overemphasizes the borders of the configuration space represented in the fixed dataset. This may result in worse performance in more densely sampled areas but better generalization for extrapolation tasks.

new ThunderKittens: Simple, Fast, and Adorable AI Kernels

Authors: Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, Christopher Ré

Abstract: The challenge of mapping AI architectures to GPU hardware is creating a critical bottleneck in AI progress. Despite substantial efforts, hand-written custom kernels fail to meet their theoretical performance thresholds, even on well-established operations like linear attention. The diverse hardware capabilities of GPUs might suggest that we need a wide variety of techniques to achieve high performance. However, our work explores whether a small number of key abstractions can drastically simplify the process. We present ThunderKittens (TK), a framework for writing performant AI kernels while remaining easy to use and maintain. Our abstractions map to the three levels of the GPU hierarchy: (1) at the warp-level, we provide 16x16 matrix tiles as basic data structures and PyTorch-like parallel compute operations over tiles, (2) at the thread-block level, we provide a template for overlapping asynchronous operations across parallel warps, and (3) at the grid-level, we provide support to help hide the block launch and tear-down, and memory costs. We show the value of TK by providing kernels that match or outperform prior kernels for a range of AI operations. We match CuBLAS and FlashAttention-3 on GEMM and attention inference performance and outperform the strongest baselines by $10-40\%$ on attention backwards, $8\times$ on state space models, and $14\times$ on linear attention.

new Prototypical Extreme Multi-label Classification with a Dynamic Margin Loss

Authors: Kunal Dahiya, Diego Ortego, David Jiménez

Abstract: Extreme Multi-label Classification (XMC) methods predict relevant labels for a given query in an extremely large label space. Recent works in XMC address this problem using deep encoders that project text descriptions to an embedding space suitable for recovering the closest labels. However, learning deep models can be computationally expensive in large output spaces, resulting in a trade-off between high-performing brute-force approaches and efficient solutions. In this paper, we propose PRIME, an XMC method that employs a novel prototypical contrastive learning technique to reconcile efficiency and performance, surpassing brute-force approaches. We frame XMC as a data-to-prototype prediction task where label prototypes aggregate information from related queries. More precisely, we use a shallow transformer encoder that we coin the Label Prototype Network, which enriches label representations by aggregating text-based embeddings, label centroids and learnable free vectors. We jointly train a deep encoder and the Label Prototype Network using an adaptive triplet loss objective that better adapts to the high granularity and ambiguity of extreme label spaces. PRIME achieves state-of-the-art results in several public benchmarks of different sizes and domains, while keeping the model efficient.

new Deep Learning-Driven Microstructure Characterization and Vickers Hardness Prediction of Mg-Gd Alloys

Authors: Lu Wang, Hongchan Chen, Bing Wang, Qian Li, Qun Luo, Yuexing Han

Abstract: In the field of materials science, exploring the relationship between composition, microstructure, and properties has long been a critical research focus. The mechanical performance of solid-solution Mg-Gd alloys is significantly influenced by Gd content, dendritic structures, and the presence of secondary phases. To better analyze and predict the impact of these factors, this study proposes a multimodal fusion learning framework based on image processing and deep learning techniques. This framework integrates both elemental composition and microstructural features to accurately predict the Vickers hardness of solid-solution Mg-Gd alloys. Initially, deep learning methods were employed to extract microstructural information from a variety of solid-solution Mg-Gd alloy images obtained from literature and experiments. This provided precise grain size and secondary phase microstructural features for performance prediction tasks. Subsequently, these quantitative analysis results were combined with Gd content information to construct a performance prediction dataset. Finally, a regression model based on the Transformer architecture was used to predict the Vickers hardness of Mg-Gd alloys. The experimental results indicate that the Transformer model performs best in terms of prediction accuracy, achieving an R^2 value of 0.9. Additionally, SHAP analysis identified critical values for four key features affecting the Vickers hardness of Mg-Gd alloys, providing valuable guidance for alloy design. These findings not only enhance the understanding of alloy performance but also offer theoretical support for future material design and optimization.

new Causal Modeling in Multi-Context Systems: Distinguishing Multiple Context-Specific Causal Graphs which Account for Observational Support

Authors: Martin Rabel, Wiebke Günther, Jakob Runge, Andreas Gerhardus

Abstract: Causal structure learning with data from multiple contexts carries both opportunities and challenges. Opportunities arise from considering shared and context-specific causal graphs, enabling the generalization and transfer of causal knowledge across contexts. However, a challenge that is currently understudied in the literature is the impact of differing observational support between contexts on the identifiability of causal graphs. Here we study in detail recently introduced [6] causal graph objects that capture both causal mechanisms and data support, allowing for the analysis of a larger class of context-specific changes, characterizing distribution shifts more precisely. We thereby extend results on the identifiability of context-specific causal structures and propose a framework to model context-specific independence (CSI) within structural causal models (SCMs) in a refined way that allows us to explore scenarios where these graph objects differ. We demonstrate how this framework can help explain phenomena like anomalies or extreme events, where causal mechanisms change or appear to change under different conditions. Our results contribute to the theoretical foundations for understanding causal relations in multi-context systems, with implications for generalization, transfer learning, and anomaly detection. Future work may extend this approach to more complex data types, such as time series.

new Integrating uncertainty quantification into randomized smoothing based robustness guarantees

Authors: Sina Däubener, Kira Maag, David Krueger, Asja Fischer

Abstract: Deep neural networks have proven to be extremely powerful; however, they are also vulnerable to adversarial attacks, which can cause hazardous incorrect predictions in safety-critical applications. Certified robustness via randomized smoothing gives a probabilistic guarantee that the smoothed classifier's predictions will not change within an $\ell_2$-ball around a given input. On the other hand, (uncertainty) score-based rejection is a technique often applied in practice to defend models against adversarial attacks. In this work, we fuse these two approaches by integrating a classifier that abstains from predicting when uncertainty is high into the certified robustness framework. This allows us to derive two novel robustness guarantees for uncertainty-aware classifiers, namely (i) the radius of an $\ell_2$-ball around the input in which the same label is predicted and uncertainty remains low and (ii) the $\ell_2$-radius of a ball in which the predictions will either not change or be uncertain. While the former provides robustness guarantees with respect to attacks aiming at increased uncertainty, the latter informs about the amount of input perturbation necessary to lead the uncertainty-aware model into a wrong prediction. Notably, on CIFAR10 the latter is up to 20.93% larger than for models not allowing for uncertainty-based rejection. We demonstrate that the novel framework allows for a systematic robustness evaluation of different network architectures and uncertainty measures and for identifying desired properties of uncertainty quantification techniques. Moreover, we show that leveraging uncertainty in a smoothed classifier helps out-of-distribution detection.
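
For context on the kind of guarantee randomized smoothing provides, the sketch below evaluates the standard certified $\ell_2$ radius for Gaussian smoothing, $R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)$, where $p_A$ lower-bounds the top-class probability and $p_B$ upper-bounds the runner-up probability. This is the baseline radius without abstention, not either of the two new guarantees derived in the paper; the function and parameter names are assumptions for illustration.

```python
from statistics import NormalDist


def certified_l2_radius(sigma, p_a_lower, p_b_upper):
    """Standard randomized-smoothing radius for Gaussian noise with std `sigma`.

    Returns 0.0 when the bounds do not separate the top class from the
    runner-up, i.e. when no radius can be certified.
    """
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    if not 0.0 < p_b_upper < p_a_lower < 1.0:
        return 0.0
    inv_cdf = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return 0.5 * sigma * (inv_cdf(p_a_lower) - inv_cdf(p_b_upper))


# Hypothetical bounds: top class at least 0.9, runner-up at most 0.1.
radius = certified_l2_radius(sigma=0.5, p_a_lower=0.9, p_b_upper=0.1)
uncertified = certified_l2_radius(sigma=0.5, p_a_lower=0.4, p_b_upper=0.6)
```

In practice the two probability bounds are estimated by Monte Carlo sampling of noisy copies of the input, which is why the resulting guarantee is probabilistic.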

new TEAFormers: TEnsor-Augmented Transformers for Multi-Dimensional Time Series Forecasting

Authors: Linghang Kong, Elynn Chen, Yuzhou Chen, Yuefeng Han

Abstract: Multi-dimensional time series data, such as matrix and tensor-variate time series, are increasingly prevalent in fields such as economics, finance, and climate science. Traditional Transformer models, though adept with sequential data, do not effectively preserve these multi-dimensional structures, as their internal operations in effect flatten multi-dimensional observations into vectors, thereby losing critical multi-dimensional relationships and patterns. To address this, we introduce the Tensor-Augmented Transformer (TEAFormer), a novel method that incorporates tensor expansion and compression within the Transformer framework to maintain and leverage the inherent multi-dimensional structures, thus reducing computational costs and improving prediction accuracy. The core feature of the TEAFormer, the Tensor-Augmentation (TEA) module, utilizes tensor expansion to enhance multi-view feature learning and tensor compression for efficient information aggregation and reduced computational load. The TEA module is not just a specific model architecture but a versatile component that is highly compatible with the attention mechanism and the encoder-decoder structure of Transformers, making it adaptable to existing Transformer architectures. Our comprehensive experiments, which integrate the TEA module into three popular time series Transformer models across three real-world benchmarks, show significant performance enhancements, highlighting the potential of TEAFormers for cutting-edge time series forecasting.

new Vector Quantization Prompting for Continual Learning

Authors: Li Jiao, Qiuxia Lai, Yu Li, Qiang Xu

Abstract: Continual learning requires overcoming catastrophic forgetting when training a single model on a sequence of tasks. Recent top-performing approaches are prompt-based methods that utilize a set of learnable parameters (i.e., prompts) to encode task knowledge, from which appropriate ones are selected to guide the fixed pre-trained model in generating features tailored to a certain task. However, existing methods rely on predicting prompt identities for prompt selection, where the identity prediction process cannot be optimized with task loss. This limitation leads to sub-optimal prompt selection and inadequate adaptation of pre-trained features for a specific task. Previous efforts have tried to address this by directly generating prompts from input queries instead of selecting from a set of candidates. However, these prompts are continuous and lack sufficient abstraction for task knowledge representation, making them less effective for continual learning. To address these challenges, we propose VQ-Prompt, a prompt-based continual learning method that incorporates Vector Quantization (VQ) into end-to-end training of a set of discrete prompts. In this way, VQ-Prompt can optimize the prompt selection process with task loss and meanwhile achieve effective abstraction of task knowledge for continual learning. Extensive experiments show that VQ-Prompt outperforms state-of-the-art continual learning methods across a variety of benchmarks under the challenging class-incremental setting. The code is available at https://github.com/jiaolifengmi/VQ-Prompt.

URLs: https://github.com/jiaolifengmi/VQ-Prompt

new Graph Neural Networks on Discriminative Graphs of Words

Authors: Yassine Abbahaddou, Johannes F. Lutzeyer, Michalis Vazirgiannis

Abstract: In light of the recent success of Graph Neural Networks (GNNs) and their ability to perform inference on complex data structures, many studies apply GNNs to the task of text classification. In most previous methods, a heterogeneous graph, containing both word and document nodes, is constructed using the entire corpus and a GNN is used to classify document nodes. In this work, we explore a new Discriminative Graph of Words Graph Neural Network (DGoW-GNN) approach encapsulating both a novel discriminative graph construction and model to classify text. In our graph construction, containing only word nodes and no document nodes, we split the training corpus into disconnected subgraphs according to their labels and weight edges by the pointwise mutual information of the represented words. Our graph construction, for which we provide theoretical motivation, allows us to reformulate the task of text classification as the task of walk classification. We also propose a new model for the graph-based classification of text, which combines a GNN and a sequence model. We evaluate our approach on seven benchmark datasets and find that it is outperformed by several state-of-the-art baseline models. We analyse reasons for this performance difference and hypothesise under which conditions it is likely to change.
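
The edge weighting mentioned above can be illustrated with a small sketch of pointwise mutual information, $\mathrm{PMI}(u, v) = \log \frac{p(u, v)}{p(u)\,p(v)}$. Estimating the probabilities from document-level co-occurrence, and the function name `pmi_edges`, are assumptions for illustration; the abstract does not specify how the authors estimate co-occurrence (a sliding window is a common alternative).

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(documents):
    """Map each co-occurring word pair to its PMI, estimated over documents."""
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        vocab = sorted(set(doc))  # count each word once per document
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))
    edges = {}
    for (u, v), n_uv in pair_counts.items():
        p_uv = n_uv / n_docs
        p_u = word_counts[u] / n_docs
        p_v = word_counts[v] / n_docs
        edges[(u, v)] = math.log(p_uv / (p_u * p_v))
    return edges


# Hypothetical tokenized documents sharing one class label.
docs = [["good", "movie"], ["good", "movie"], ["bad", "movie"], ["bad", "plot"]]
edges = pmi_edges(docs)
```

Following the construction described in the abstract, one such weighted word graph would be built per class label from the training documents of that class, yielding disconnected subgraphs.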

new Hamiltonian Score Matching and Generative Flows

Authors: Peter Holderrieth, Yilun Xu, Tommi Jaakkola

Abstract: Classical Hamiltonian mechanics has been widely used in machine learning in the form of Hamiltonian Monte Carlo for applications with predetermined force fields. In this work, we explore the potential of deliberately designing force fields for Hamiltonian ODEs, introducing Hamiltonian velocity predictors (HVPs) as a tool for score matching and generative models. We present two innovations constructed with HVPs: Hamiltonian Score Matching (HSM), which estimates score functions by augmenting data via Hamiltonian trajectories, and Hamiltonian Generative Flows (HGFs), a novel generative model that encompasses diffusion models and flow matching as HGFs with zero force fields. We showcase the extended design space of force fields by introducing Oscillation HGFs, a generative model inspired by harmonic oscillators. Our experiments validate our theoretical insights about HSM as a novel score matching metric and demonstrate that HGFs rival leading generative modeling techniques.

new Improving Decision Sparsity

Authors: Yiyang Sun, Tong Wang, Cynthia Rudin

Abstract: Sparsity is a central aspect of interpretability in machine learning. Typically, sparsity is measured in terms of the size of a model globally, such as the number of variables it uses. However, this notion of sparsity is not particularly relevant for decision-making; someone subjected to a decision does not care about variables that do not contribute to the decision. In this work, we dramatically expand a notion of decision sparsity called the Sparse Explanation Value (SEV) so that its explanations are more meaningful. SEV considers movement along a hypercube towards a reference point. By allowing flexibility in that reference and by considering how distances along the hypercube translate to distances in feature space, we can derive sparser and more meaningful explanations for various types of function classes. We present cluster-based SEV and its variant tree-based SEV, introduce a method that improves credibility of explanations, and propose algorithms that optimize decision sparsity in machine learning models.

new Efficient Diversity-based Experience Replay for Deep Reinforcement Learning

Authors: Kaiyan Zhao, Yiming Wang, Yuyang Chen, Xiaoguang Niu, Yan Li, Leong Hou U

Abstract: Deep Reinforcement Learning (DRL) has achieved remarkable success in solving complex decision-making problems by combining the representation capabilities of deep learning with the decision-making power of reinforcement learning. However, learning in sparse reward environments remains challenging due to insufficient feedback to guide the optimization of agents, especially in real-life environments with high-dimensional states. To tackle this issue, experience replay is commonly introduced to enhance learning efficiency through past experiences. Nonetheless, current methods of experience replay, whether based on uniform or prioritized sampling, frequently struggle with suboptimal learning efficiency and insufficient utilization of samples. This paper proposes a novel approach, diversity-based experience replay (DBER), which leverages the deterministic point process to prioritize diverse samples in state realizations. We conducted extensive experiments on robotic manipulation tasks in MuJoCo, Atari games, and realistic indoor environments in Habitat. The results show that our method not only significantly improves learning efficiency but also demonstrates superior performance in sparse reward environments with high-dimensional states, providing a simple yet effective solution for this field.

new A Cosmic-Scale Benchmark for Symmetry-Preserving Data Processing

Authors: Julia Balla, Siddharth Mishra-Sharma, Carolina Cuesta-Lazaro, Tommi Jaakkola, Tess Smidt

Abstract: Efficiently processing structured point cloud data while preserving multiscale information is a key challenge across domains, from graphics to atomistic modeling. Using a curated dataset of simulated galaxy positions and properties, represented as point clouds, we benchmark the ability of graph neural networks to simultaneously capture local clustering environments and long-range correlations. Given the homogeneous and isotropic nature of the Universe, the data exhibits a high degree of symmetry. We therefore focus on evaluating the performance of Euclidean symmetry-preserving ($E(3)$-equivariant) graph neural networks, showing that they can outperform non-equivariant counterparts and domain-specific information extraction techniques in downstream performance as well as simulation-efficiency. However, we find that current architectures fail to capture information from long-range correlations as effectively as domain-specific baselines, motivating future work on architectures better suited for extracting long-range information.

new Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

Authors: Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu

Abstract: Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that feature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at https://huggingface.co/fnlp/Llama-Scope, alongside our scalable training, interpretation, and visualization tools at https://github.com/OpenMOSS/Language-Model-SAEs. These contributions aim to advance the open-source Sparse Autoencoder ecosystem and support mechanistic interpretability research by reducing the need for redundant SAE training.

URLs: https://huggingface.co/fnlp/Llama-Scope, https://github.com/OpenMOSS/Language-Model-SAEs
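Editor's sketch for the Llama Scope entry above: a minimal NumPy forward pass of a generic Top-K sparse autoencoder, to illustrate what "Top-K SAE" refers to. It assumes the common formulation (keep the k largest encoder pre-activations per input, zero the rest, decode linearly). The function name, shapes, and the clamp to non-negative values are illustrative assumptions; this is not the modified variant evaluated by the authors, whose details the abstract does not give.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Generic Top-K SAE forward pass (illustrative, not the Llama Scope variant)."""
    pre = (x - b_dec) @ W_enc + b_enc                 # encoder pre-activations, (batch, n_features)
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]   # indices of the k largest entries per row
    rows = np.arange(pre.shape[0])[:, None]
    latents = np.zeros_like(pre)
    latents[rows, idx] = np.maximum(pre[rows, idx], 0.0)  # keep top-k, clamp negatives to zero
    recon = latents @ W_dec + b_dec                   # linear decoder
    return latents, recon

# Toy dimensions, chosen only for the example.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
x = rng.normal(size=(3, d_model))
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
latents, recon = topk_sae_forward(
    x, W_enc, np.zeros(n_features), W_dec, np.zeros(d_model), k
)
print((latents != 0).sum(axis=1))  # number of active features per input (at most k)
```

The sparsity level is fixed by `k` rather than tuned through an L1 penalty, which is the main design difference from a ReLU-plus-L1 autoencoder.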

new Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks?

Authors: Xuan He, Da Yin, Nanyun (Violet) Peng

Abstract: How can "weak teacher models", such as average human annotators or existing AI systems, effectively supervise LLMs to improve performance on hard reasoning tasks, especially those that challenge the teacher models and require expertise or daily practice from them? In this paper, we seek empirical answers to this question by investigating various data-driven strategies that offer supervision data at different quality levels on tasks of varying complexity. Two intuitive strategies emerge for teacher models to provide supervision during alignment training: 1) using lower-quality supervision from complete tasks that match the difficulty of the target reasoning tasks, and 2) leveraging higher-quality supervision from easier subtasks that are less challenging. Interestingly, we find that even when the outcome error rate for hard task supervision is high (e.g., 90%), training on such data can outperform perfectly correct supervision on easier subtasks on multiple hard math benchmarks. We further identify a more critical factor influencing training performance: step-wise error rates, which indicate the severity of errors in solutions. Specifically, training on hard task supervision with the same outcome error rates but disparate step-wise error rates can lead to a 30% accuracy gap on the MATH benchmark. Our results also reveal that supplementing hard task supervision with the corresponding subtask supervision can yield more notable performance improvements than simply combining rephrased hard full task supervision, suggesting new avenues for data augmentation. Data and code are released at https://github.com/hexuan21/Weak-to-Strong.

URLs: https://github.com/hexuan21/Weak-to-Strong

new Info-CELS: Informative Saliency Map Guided Counterfactual Explanation

Authors: Peiyu Li, Omar Bahri, Pouya Hosseinzadeh, Soukaïna Filali Boubrahimi, Shah Muhammad Hamdi

Abstract: As the demand for interpretable machine learning approaches continues to grow, there is an increasing necessity for human involvement in providing informative explanations for model decisions. This is necessary for building trust and transparency in AI-based systems, leading to the emergence of the Explainable Artificial Intelligence (XAI) field. Recently, a novel counterfactual explanation model, CELS, has been introduced. CELS learns a saliency map for an instance of interest and generates a counterfactual explanation guided by the learned saliency map. CELS represents the first attempt to exploit learned saliency maps not only to provide intuitive explanations for the decisions made by a time series classifier but also to explore post hoc counterfactual explanations. However, it sacrifices validity in order to ensure high proximity and sparsity. In this paper, we present an enhanced approach that builds upon CELS. While the original model achieved promising results in terms of sparsity and proximity, it faced limitations in validity. Our proposed method addresses this limitation by removing mask normalization to provide more informative and valid counterfactual explanations. Through extensive experimentation on datasets from various domains, we demonstrate that our approach outperforms the CELS model, achieving higher validity and producing more informative explanations.

new PaPaGei: Open Foundation Models for Optical Physiological Signals

Authors: Arvind Pillai, Dimitris Spathis, Fahim Kawsar, Mohammad Malekzadeh

Abstract: Photoplethysmography (PPG) is the most widely used non-invasive technique for monitoring biosignals and cardiovascular health, with applications in both clinical settings and consumer health through wearable devices. Current machine learning models trained on PPG signals are mostly task-specific and lack generalizability. Previous works often used single-device datasets, did not explore out-of-domain generalization, or did not release their models, hindering reproducibility and further research. We introduce PaPaGei, the first open foundation model for PPG signals. PaPaGei is pre-trained on more than 57,000 hours of 20 million unlabeled segments of PPG signals using publicly available datasets exclusively. We evaluate against popular time-series foundation models and other benchmarks on 20 tasks of 10 diverse datasets spanning cardiovascular health, sleep disorders, pregnancy monitoring, and wellbeing assessment. Our architecture incorporates novel representation learning approaches that leverage differences in PPG signal morphology across individuals, enabling it to capture richer representations than traditional contrastive learning methods. Across 20 tasks, PaPaGei improves classification and regression performance by an average of 6.3% and 2.9%, respectively, compared to other competitive time-series foundation models in at least 14 tasks. PaPaGei is more data- and parameter-efficient than other foundation models or methods, as it outperforms 70x larger models. Beyond accuracy, we also investigate robustness against different skin tones, establishing a benchmark for bias evaluations of future models. Notably, PaPaGei can be used out of the box as both a feature extractor and an encoder for other multimodal models, opening up new opportunities for multimodal health monitoring.

new Deep Reinforcement Learning Agents for Strategic Production Policies in Microeconomic Market Simulations

Authors: Eduardo C. Garrido-Merchán, Maria Coronado-Vaca, Álvaro López-López, Carlos Martinez de Ibarreta

Abstract: Traditional economic models often rely on fixed assumptions about market dynamics, limiting their ability to capture the complexities and stochastic nature of real-world scenarios. However, reality is more complex and includes noise, so the assumptions of traditional models are often not met in the market. In this paper, we explore the application of deep reinforcement learning (DRL) to obtain optimal production strategies in microeconomic market environments to overcome the limitations of traditional models. Concretely, we propose a DRL-based approach to obtain an effective policy in competitive markets with multiple producers, each optimizing their production decisions in response to fluctuating demand, supply, prices, subsidies, fixed costs, total production curve, elasticities and other effects contaminated by noise. Our framework enables agents to learn adaptive production policies across several simulations, and these policies consistently outperform static and random strategies. As the deep neural networks used by the agents are universal function approximators, DRL algorithms can represent in the network complex patterns of data, learnt by trial and error, that explain the market. Through extensive simulations, we demonstrate how DRL can capture the intricate interplay between production costs, market prices, and competitor behavior, providing insights into optimal decision-making in dynamic economic settings. The results show that agents trained with DRL can strategically adjust production levels to maximize long-term profitability, even in the face of volatile market conditions. We believe that the study bridges the gap between theoretical economic modeling and practical market simulation, illustrating the potential of DRL to revolutionize decision-making in market strategies.

new Toward Conditional Distribution Calibration in Survival Prediction

Authors: Shi-ang Qi, Yakun Yu, Russell Greiner

Abstract: Survival prediction often involves estimating the time-to-event distribution from censored datasets. Previous approaches have focused on enhancing discrimination and marginal calibration. In this paper, we highlight the significance of conditional calibration for real-world applications -- especially its role in individual decision-making. We propose a method based on conformal prediction that uses the model's predicted individual survival probability at that instance's observed time. This method effectively improves the model's marginal and conditional calibration, without compromising discrimination. We provide asymptotic theoretical guarantees for both marginal and conditional calibration and test it extensively across 15 diverse real-world datasets, demonstrating the method's practical effectiveness and versatility in various settings.

new Generator Matching: Generative modeling with arbitrary Markov processes

Authors: Peter Holderrieth, Marton Havasi, Jason Yim, Neta Shaul, Itai Gat, Tommi Jaakkola, Brian Karrer, Ricky T. Q. Chen, Yaron Lipman

Abstract: We introduce generator matching, a modality-agnostic framework for generative modeling using arbitrary Markov processes. Generators characterize the infinitesimal evolution of a Markov process, which we leverage for generative modeling in a similar vein to flow matching: we construct conditional generators which generate single data points, then learn to approximate the marginal generator which generates the full data distribution. We show that generator matching unifies various generative modeling methods, including diffusion models, flow matching and discrete diffusion models. Furthermore, it provides the foundation to expand the design space to new and unexplored Markov processes such as jump processes. Finally, generator matching enables the construction of superpositions of Markov generative processes and enables the construction of multimodal models in a rigorous manner. We empirically validate our method on protein and image structure generation, showing that superposition with a jump process improves image generation.

new Practical Bayesian Algorithm Execution via Posterior Sampling

Authors: Chu Xin Cheng, Raul Astudillo, Thomas Desautels, Yisong Yue

Abstract: We consider Bayesian algorithm execution (BAX), a framework for efficiently selecting evaluation points of an expensive function to infer a property of interest encoded as the output of a base algorithm. Since the base algorithm typically requires more evaluations than are feasible, it cannot be directly applied. Instead, BAX methods sequentially select evaluation points using a probabilistic numerical approach. Current BAX methods use expected information gain to guide this selection. However, this approach is computationally intensive. Observing that, in many tasks, the property of interest corresponds to a target set of points defined by the function, we introduce PS-BAX, a simple, effective, and scalable BAX method based on posterior sampling. PS-BAX is applicable to a wide range of problems, including many optimization variants and level set estimation. Experiments across diverse tasks demonstrate that PS-BAX performs competitively with existing baselines while being significantly faster, simpler to implement, and easily parallelizable, setting a strong baseline for future research. Additionally, we establish conditions under which PS-BAX is asymptotically convergent, offering new insights into posterior sampling as an algorithm design paradigm.

new LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization

Authors: Jui-Nan Yen, Si Si, Zhao Meng, Felix Yu, Sai Surya Duvvuri, Inderjit S. Dhillon, Cho-Jui Hsieh, Sanjiv Kumar

Abstract: Low-rank adaptation (LoRA) is a widely used parameter-efficient finetuning method for LLMs that reduces memory requirements. However, current LoRA optimizers lack transformation invariance, meaning the actual updates to the weights depend on how the two LoRA factors are scaled or rotated. This deficiency leads to inefficient learning and sub-optimal solutions in practice. This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization, which can achieve transformation invariance and remain computationally efficient. We provide theoretical analysis to demonstrate the benefit of our method and conduct experiments on various LLM tasks with different models including Gemma 2B, 7B, and mT5-XXL. The results demonstrate consistent improvements against existing optimizers. For example, replacing Adam with LoRA-RITE during LoRA fine-tuning of Gemma-2B yielded a 4.6% accuracy gain on Super-Natural Instructions and a 3.5% accuracy gain across four other LLM benchmarks (HellaSwag, ArcChallenge, GSM8K, OpenBookQA).
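Editor's sketch for the LoRA-RITE entry above: a small NumPy check of what a lack of transformation invariance means. Two factor pairs, (B, A) and (sB, A/s), represent the same low-rank update BA, yet one plain gradient-descent step on the factors moves the product differently. Plain SGD is used here purely as the simplest illustration (the abstract discusses optimizers such as Adam); the matrix `G` is a random stand-in for the loss gradient with respect to the product, and all names and sizes are assumptions.

```python
import numpy as np

def sgd_step_product(B, A, G, lr):
    """Take one SGD step on the factors and return the resulting product B @ A.

    G is the gradient of the loss with respect to the product Z = B @ A,
    so by the chain rule dL/dB = G @ A.T and dL/dA = B.T @ G.
    """
    B_new = B - lr * (G @ A.T)
    A_new = A - lr * (B.T @ G)
    return B_new @ A_new

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
G = rng.normal(size=(m, n))   # hypothetical gradient w.r.t. the product

s = 10.0
B_scaled, A_scaled = B * s, A / s   # same product, different factorization

same_start = np.allclose(B @ A, B_scaled @ A_scaled)
Z_plain = sgd_step_product(B, A, G, lr=0.01)
Z_scaled = sgd_step_product(B_scaled, A_scaled, G, lr=0.01)
same_after = np.allclose(Z_plain, Z_scaled)

print("same product before the step:", same_start)
print("same product after the step: ", same_after)
```

To first order the product changes by -lr (G AᵀA + B Bᵀ G), and rescaling the factors multiplies the two terms by 1/s² and s² respectively, which is why the two runs diverge.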

new TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation

Authors: Juntong Shi, Minkai Xu, Harper Hua, Hengrui Zhang, Stefano Ermon, Jure Leskovec

Abstract: Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. In this paper, we introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. TabDiff is parameterized by a transformer handling different input types, and the entire framework can be efficiently optimized in an end-to-end fashion. We further introduce a multi-modal stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Comprehensive experiments on seven datasets demonstrate that TabDiff achieves superior average performance over existing competitive baselines across all eight metrics, with up to $22.5\%$ improvement over the state-of-the-art model on pair-wise column correlation estimations. Code is available at https://github.com/MinkaiXu/TabDiff.

URLs: https://github.com/MinkaiXu/TabDiff

new Plastic Learning with Deep Fourier Features

Authors: Alex Lewandowski, Dale Schuurmans, Marlos C. Machado

Abstract: Deep neural networks can struggle to learn continually in the face of non-stationarity. This phenomenon is known as loss of plasticity. In this paper, we identify underlying principles that lead to plastic algorithms. In particular, we provide theoretical results showing that linear function approximation, as well as a special case of deep linear networks, do not suffer from loss of plasticity. We then propose deep Fourier features, which are the concatenation of a sine and cosine in every layer, and we show that this combination provides a dynamic balance between the trainability obtained through linearity and the effectiveness obtained through the nonlinearity of neural networks. Deep networks composed entirely of deep Fourier features are highly trainable and sustain their trainability over the course of learning. Our empirical results show that continual learning performance can be drastically improved by replacing ReLU activations with deep Fourier features. These results hold for different continual learning scenarios (e.g., label noise, class incremental learning, pixel permutations) on all major supervised learning datasets used for continual learning research, such as CIFAR10, CIFAR100, and tiny-ImageNet.
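Editor's sketch for the deep Fourier features entry above: a minimal NumPy layer that follows the abstract's description, "the concatenation of a sine and cosine in every layer". Weight shapes, initialization, and the function name are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def deep_fourier_layer(x, W, b):
    """Linear map followed by concatenated sine and cosine activations."""
    z = x @ W + b                                            # pre-activation, (batch, d_out)
    return np.concatenate([np.sin(z), np.cos(z)], axis=-1)  # output, (batch, 2 * d_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# The concatenation doubles the width, so the second layer takes 2 * 16 inputs.
W1, b1 = rng.normal(size=(8, 16)) / np.sqrt(8), np.zeros(16)
W2, b2 = rng.normal(size=(32, 16)) / np.sqrt(32), np.zeros(16)

h1 = deep_fourier_layer(x, W1, b1)
h2 = deep_fourier_layer(h1, W2, b2)
print(h1.shape, h2.shape)
```

Because sin² + cos² = 1 for every pre-activation, each sine/cosine pair has unit norm, so the layer's output norm is constant regardless of the input scale.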

new General Causal Imputation via Synthetic Interventions

Authors: Marco Jiralerspong, Thomas Jiralerspong, Vedant Shah, Dhanya Sridhar, Gauthier Gidel

Abstract: Given two sets of elements (such as cell types and drug compounds), researchers typically only have access to a limited subset of their interactions. The task of causal imputation involves using this subset to predict unobserved interactions. Squires et al. (2022) have proposed two estimators for this task based on the synthetic interventions (SI) estimator: SI-A (for actions) and SI-C (for contexts). We extend their work and introduce a novel causal imputation estimator, generalized synthetic interventions (GSI). We prove the identifiability of this estimator for data generated from a more complex latent factor model. On synthetic and real data we show empirically that it recovers or outperforms their estimators.

new Learning Variational Inequalities from Data: Fast Generalization Rates under Strong Monotonicity

Authors: Eric Zhao, Tatjana Chavdarova, Michael Jordan

Abstract: Variational inequalities (VIs) are a broad class of optimization problems encompassing machine learning problems ranging from standard convex minimization to more complex scenarios like min-max optimization and computing the equilibria of multi-player games. In convex optimization, strong convexity allows for fast statistical learning rates requiring only $\Theta(1/\epsilon)$ stochastic first-order oracle calls to find an $\epsilon$-optimal solution, rather than the standard $\Theta(1/\epsilon^2)$ calls. In this paper, we explain how one can similarly obtain fast $\Theta(1/\epsilon)$ rates for learning VIs that satisfy strong monotonicity, a generalization of strong convexity. Specifically, we demonstrate that standard stability-based generalization arguments for convex minimization extend directly to VIs when the domain admits a small covering, or when the operator is integrable and suboptimality is measured by potential functions, such as when finding equilibria in multi-player games.
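Editor's note for the variational inequalities entry above: the standard textbook definitions behind the terms used in the abstract, written with assumed notation (operator $F$, feasible set $\mathcal{Z}$, modulus $\mu$) that may differ from the paper's.

```latex
% Variational inequality: find z^* in Z such that
\langle F(z^\star),\, z - z^\star \rangle \;\ge\; 0
\qquad \text{for all } z \in \mathcal{Z}.

% Strong monotonicity of F with modulus \mu > 0:
\langle F(z) - F(z'),\, z - z' \rangle \;\ge\; \mu \,\lVert z - z' \rVert^{2}
\qquad \text{for all } z, z' \in \mathcal{Z}.
```

When $F = \nabla f$ for a differentiable function $f$, the second condition is exactly $\mu$-strong convexity of $f$, which is the sense in which strong monotonicity generalizes strong convexity.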

new NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Authors: Yongchang Hao, Yanshuai Cao, Lili Mou

Abstract: The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although applying techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.

new Video to Video Generative Adversarial Network for Few-shot Learning Based on Policy Gradient

Authors: Yintai Ma, Diego Klabjan, Jean Utke

Abstract: The development of sophisticated models for video-to-video synthesis has been facilitated by recent advances in deep reinforcement learning and generative adversarial networks (GANs). In this paper, we propose RL-V2V-GAN, a new deep neural network approach based on reinforcement learning for unsupervised conditional video-to-video synthesis. While preserving the unique style of the source video domain, our approach aims to learn a mapping from a source video domain to a target video domain. We train the model using policy gradient and employ ConvLSTM layers to capture the spatial and temporal information by designing a fine-grained GAN architecture and incorporating spatio-temporal adversarial goals. The adversarial losses aid in content translation while preserving style. Unlike traditional video-to-video synthesis methods requiring paired inputs, our proposed approach is more general because it does not require paired inputs. Thus, when dealing with limited videos in the target domain, i.e., few-shot learning, it is particularly effective. Our experiments show that RL-V2V-GAN can produce temporally coherent video results. These results highlight the potential of our approach for further advances in video-to-video synthesis.

new TurboHopp: Accelerated Molecule Scaffold Hopping with Consistency Models

Authors: Kiwoong Yoo, Owen Oertell, Junhyun Lee, Sanghoon Lee, Jaewoo Kang

Abstract: Navigating the vast chemical space of druggable compounds is a formidable challenge in drug discovery, where generative models are increasingly employed to identify viable candidates. Conditional 3D structure-based drug design (3D-SBDD) models, which take into account complex three-dimensional interactions and molecular geometries, are particularly promising. Scaffold hopping is an efficient strategy that facilitates the identification of similar active compounds by strategically modifying the core structure of molecules, effectively narrowing the wide chemical space and enhancing the discovery of drug-like products. However, the practical application of 3D-SBDD generative models is hampered by their slow processing speeds. To address this bottleneck, we introduce TurboHopp, an accelerated pocket-conditioned 3D scaffold hopping model that merges the strategic effectiveness of traditional scaffold hopping with rapid generation capabilities of consistency models. This synergy not only enhances efficiency but also significantly boosts generation speeds, achieving up to 30 times faster inference speed as well as superior generation quality compared to existing diffusion-based models, establishing TurboHopp as a powerful tool in drug discovery. Supported by faster inference speed, we further optimize our model, using Reinforcement Learning for Consistency Models (RLCM), to output desirable molecules. We demonstrate the broad applicability of TurboHopp across multiple drug discovery scenarios, underscoring its potential in diverse molecular settings.

new Segmenting Watermarked Texts From Language Models

Authors: Xingchi Li, Guanxun Li, Xianyang Zhang

Abstract: Watermarking is a technique that involves embedding nearly unnoticeable statistical signals within generated content to help trace its source. This work focuses on a scenario where an untrusted third-party user sends prompts to a trusted language model (LLM) provider, who then generates a text from their LLM with a watermark. This setup makes it possible for a detector to later identify the source of the text if the user publishes it. The user can modify the generated text by substitutions, insertions, or deletions. Our objective is to develop a statistical method to detect if a published text is LLM-generated from the perspective of a detector. We further propose a methodology to segment the published text into watermarked and non-watermarked sub-strings. The proposed approach is built upon randomization tests and change point detection techniques. We demonstrate that our method ensures Type I and Type II error control and can accurately identify watermarked sub-strings by finding the corresponding change point locations. To validate our technique, we apply it to texts generated by several language models with prompts extracted from Google's C4 dataset and obtain encouraging numerical results. We release all code publicly at https://github.com/doccstat/llm-watermark-cpd.

URLs: https://github.com/doccstat/llm-watermark-cpd

new Reprogramming Pretrained Target-Specific Diffusion Models for Dual-Target Drug Design

Authors: Xiangxin Zhou, Jiaqi Guan, Yijia Zhang, Xingang Peng, Liang Wang, Jianzhu Ma

Abstract: Dual-target therapeutic strategies have become a compelling approach and attracted significant attention due to various benefits, such as their potential in overcoming drug resistance in cancer therapy. Considering the tremendous success that deep generative models have achieved in structure-based drug design in recent years, we formulate dual-target drug design as a generative task and curate a novel dataset of potential target pairs based on synergistic drug combinations. We propose to design dual-target drugs with diffusion models that are trained on single-target protein-ligand complex pairs. Specifically, we align two pockets in 3D space with protein-ligand binding priors and build two complex graphs with shared ligand nodes for SE(3)-equivariant composed message passing, based on which we derive a composed drift in both 3D and categorical probability space in the generative process. Our algorithm can well transfer the knowledge gained in single-target pretraining to dual-target scenarios in a zero-shot manner. We also repurpose linker design methods as strong baselines for this task. Extensive experiments demonstrate the effectiveness of our method compared with various baselines.

new Contextual Representation Anchor Network to Alleviate Selection Bias in Few-Shot Drug Discovery

Authors: Ruifeng Li, Wei Liu, Xiangxin Zhou, Mingqian Li, Yuhua Zhou, Yuan Yao, Qiang Zhang, Hongyang Chen

Abstract: In the drug discovery process, the low success rate of drug candidate screening often leads to insufficient labeled data, causing the few-shot learning problem in molecular property prediction. Existing methods for few-shot molecular property prediction overlook the sample selection bias, which arises from non-random sample selection in chemical experiments. This bias in data representativeness leads to suboptimal performance. To overcome this challenge, we present a novel method named Contextual Representation Anchor Network (CRA), where an anchor refers to a cluster center of the representations of molecules and serves as a bridge to transfer enriched contextual knowledge into molecular representations and enhance their expressiveness. CRA introduces a dual-augmentation mechanism that includes context augmentation, which dynamically retrieves analogous unlabeled molecules and captures their task-specific contextual knowledge to enhance the anchors, and anchor augmentation, which leverages the anchors to augment the molecular representations. We evaluate our approach on the MoleculeNet and FS-Mol benchmarks, as well as in domain transfer experiments. The results demonstrate that CRA outperforms the state-of-the-art by 2.60% and 3.28% in AUC and $\Delta$AUC-PR metrics, respectively, and exhibits superior generalization capabilities.

new Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment

Authors: Tong Yang, Jincheng Mei, Hanjun Dai, Zixin Wen, Shicong Cen, Dale Schuurmans, Yuejie Chi, Bo Dai

Abstract: Recent advances in aligning large language models with human preferences have corroborated the growing importance of best-of-N distillation (BOND). However, the iterative BOND algorithm is prohibitively expensive in practice due to its sample and computation inefficiency. This paper addresses the problem by revealing a unified game-theoretic connection between iterative BOND and self-play alignment, which unifies seemingly disparate algorithmic paradigms. Based on the connection, we establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization that approximates iterative BOND in the parameter space. We provide a provable sample efficiency guarantee for one of the WIND variants with the square loss objective. The experimental results confirm that our algorithm not only accelerates the computation, but also achieves superior sample efficiency compared to existing methods.
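Editor's sketch for the WIND entry above: best-of-N sampling, the procedure that BOND distills, reduced to a few lines of Python. `sample_fn` and `reward_fn` are hypothetical stand-ins for a language model's sampler and a reward model. This shows only the sampling step, not the distillation or the WIND algorithms themselves.

```python
import random

def best_of_n(sample_fn, reward_fn, n):
    """Draw n candidate responses and return the one with the highest reward."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy stand-ins: "responses" are numbers and the reward is the number itself.
rng = random.Random(0)
toy_sampler = lambda: rng.gauss(0.0, 1.0)
toy_reward = lambda y: y

single = best_of_n(toy_sampler, toy_reward, n=1)
best16 = best_of_n(toy_sampler, toy_reward, n=16)
print(single, best16)
```

Each call costs N forward samples plus N reward evaluations, which is the sample and computation overhead that motivates distilling the best-of-N policy into a single model.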

new Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models

Authors: Yilun Jin, Zheng Li, Chenwei Zhang, Tianyu Cao, Yifan Gao, Pratik Jayarao, Mao Li, Xin Liu, Ritesh Sarkhel, Xianfeng Tang, Haodong Wang, Zhengyang Wang, Wenju Xu, Jingfeng Yang, Qingyu Yin, Xian Li, Priyanka Nigam, Yi Xu, Kai Chen, Qiang Yang, Meng Jiang, Bing Yin

Abstract: Online shopping is a complex multi-task, few-shot learning problem with a wide and evolving range of entities, relations, and tasks. However, existing models and benchmarks are commonly tailored to specific tasks, falling short of capturing the full complexity of online shopping. Large Language Models (LLMs), with their multi-task and few-shot learning abilities, have the potential to profoundly transform online shopping by alleviating task-specific engineering efforts and by providing users with interactive conversations. Despite the potential, LLMs face unique challenges in online shopping, such as domain-specific concepts, implicit knowledge, and heterogeneous user behaviors. Motivated by the potential and challenges, we propose Shopping MMLU, a diverse multi-task online shopping benchmark derived from real-world Amazon data. Shopping MMLU consists of 57 tasks covering 4 major shopping skills: concept understanding, knowledge reasoning, user behavior alignment, and multi-linguality, and can thus comprehensively evaluate the abilities of LLMs as general shop assistants. With Shopping MMLU, we benchmark over 20 existing LLMs and uncover valuable insights about practices and prospects of building versatile LLM-based shop assistants. Shopping MMLU can be publicly accessed at https://github.com/KL4805/ShoppingMMLU. In addition, with Shopping MMLU, we host a competition in KDD Cup 2024 with over 500 participating teams. The winning solutions and the associated workshop can be accessed at our website https://amazon-kddcup24.github.io/.

URLs: https://github.com/KL4805/ShoppingMMLU, https://amazon-kddcup24.github.io/

new Matryoshka: Learning to Drive Black-Box LLMs with LLMs

Authors: Changhao Li, Yuchen Zhuang, Rushi Qiang, Haotian Sun, Hanjun Dai, Chao Zhang, Bo Dai

Abstract: Despite the impressive generative abilities of black-box large language models (LLMs), their inherent opacity hinders further advancements in capabilities such as reasoning, planning, and personalization. Existing works aim to enhance LLM capabilities via domain-specific adaptation or in-context learning, which require additional training on accessible model parameters, an infeasible option for black-box LLMs. To address this challenge, we introduce Matryoshka, a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Specifically, we consider the black-box LLM as an environment, with Matryoshka serving as a policy to provide intermediate guidance through prompts for driving the black-box LLM. Matryoshka is trained to pivot the outputs of the black-box LLM aligning with preferences during iterative interaction, which enables controllable multi-turn generation and self-improvement in optimizing intermediate guidance. Empirical evaluations on three diverse tasks demonstrate that Matryoshka effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks, including reasoning, planning, and personalization. By leveraging this pioneering controller-generator framework to mitigate dependence on model parameters, Matryoshka provides a transparent and practical solution for improving black-box LLMs through controllable multi-turn generation using white-box LLMs.

new ODRL: A Benchmark for Off-Dynamics Reinforcement Learning

Authors: Jiafei Lyu, Kang Xu, Jiacheng Xu, Mengbei Yan, Jingwen Yang, Zongzhang Zhang, Chenjia Bai, Zongqing Lu, Xiu Li

Abstract: We consider off-dynamics reinforcement learning (RL) where one needs to transfer policies across different domains with dynamics mismatch. Despite the focus on developing dynamics-aware algorithms, this field is hindered due to the lack of a standard benchmark. To bridge this gap, we introduce ODRL, the first benchmark tailored for evaluating off-dynamics RL methods. ODRL contains four experimental settings where the source and target domains can be either online or offline, and provides diverse tasks and a broad spectrum of dynamics shifts, making it a reliable platform to comprehensively evaluate the agent's adaptation ability to the target domain. Furthermore, ODRL includes recent off-dynamics RL algorithms in a unified framework and introduces some extra baselines for different settings, all implemented in a single-file manner. To unpack the true adaptation capability of existing methods, we conduct extensive benchmarking experiments, which show that no method has universal advantages across varied dynamics shifts. We hope this benchmark can serve as a cornerstone for future research endeavors. Our code is publicly available at https://github.com/OffDynamicsRL/off-dynamics-rl.

URLs: https://github.com/OffDynamicsRL/off-dynamics-rl

new Task Confusion and Catastrophic Forgetting in Class-Incremental Learning: A Mathematical Framework for Discriminative and Generative Modelings

Authors: Milad Khademi Nori, Il-Min Kim

Abstract: In class-incremental learning (class-IL), models must classify all previously seen classes at test time without task-IDs, leading to task confusion. Despite being a key challenge, task confusion lacks a theoretical understanding. We present a novel mathematical framework for class-IL and prove the Infeasibility Theorem, showing optimal class-IL is impossible with discriminative modeling due to task confusion. However, we establish the Feasibility Theorem, demonstrating that generative modeling can achieve optimal class-IL by overcoming task confusion. We then assess popular class-IL strategies, including regularization, bias-correction, replay, and generative classifier, using our framework. Our analysis suggests that adopting generative modeling, either for generative replay or direct classification (generative classifier), is essential for optimal class-IL.

new Introducing Spectral Attention for Long-Range Dependency in Time Series Forecasting

Authors: Bong Gyun Kang, Dongjun Lee, HyunGi Kim, DoHyun Chung

Abstract: Sequence modeling faces challenges in capturing long-range dependencies across diverse tasks. Recent linear and transformer-based forecasters have shown superior performance in time series forecasting. However, they are constrained by their inherent inability to effectively address long-range dependencies in time series data, primarily due to using fixed-size inputs for prediction. Furthermore, they typically sacrifice essential temporal correlation among consecutive training samples by shuffling them into mini-batches. To overcome these limitations, we introduce a fast and effective Spectral Attention mechanism, which preserves temporal correlations among samples and facilitates the handling of long-range information while maintaining the base model structure. Spectral Attention preserves long-period trends through a low-pass filter and allows gradients to flow between samples. Spectral Attention can be seamlessly integrated into most sequence models, allowing models with fixed-sized look-back windows to capture long-range dependencies over thousands of steps. Through extensive experiments on 11 real-world time series datasets using 7 recent forecasting models, we consistently demonstrate the efficacy of our Spectral Attention mechanism, achieving state-of-the-art results.

new Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets

Authors: Jianmina Ma, Jingtian Ji, Yue Gao

Abstract: Constrained reinforcement learning has achieved promising progress in safety-critical fields where both rewards and constraints are considered. However, constrained reinforcement learning methods face challenges in striking the right balance between task performance and constraint satisfaction, and they are prone to getting stuck in over-conservative or constraint-violating local minima. In this paper, we propose Adversarial Constrained Policy Optimization (ACPO), which enables simultaneous optimization of reward and the adaptation of cost budgets during training. Our approach divides the original constrained problem into two adversarial stages that are solved alternately, and the policy update performance of our algorithm can be theoretically guaranteed. We validate our method through experiments conducted on Safety Gymnasium and quadruped locomotion tasks. Results demonstrate that our algorithm achieves better performance compared to commonly used baselines.

new Reduction-based Pseudo-label Generation for Instance-dependent Partial Label Learning

Authors: Congyu Qiao, Ning Xu, Yihao Hu, Xin Geng

Abstract: Instance-dependent Partial Label Learning (ID-PLL) aims to learn a multi-class predictive model given training instances annotated with candidate labels related to features, among which the correct labels are fixed but unknown. Previous works leverage the identification capability of the training model itself to iteratively refine supervision information. However, these methods overlook a critical aspect of ID-PLL: the training model is prone to overfitting on incorrect candidate labels, thereby providing poor supervision information and creating a bottleneck in training. In this paper, we propose to leverage reduction-based pseudo-labels to alleviate the influence of incorrect candidate labels and train our predictive model to overcome this bottleneck. Specifically, reduction-based pseudo-labels are generated by performing weighted aggregation on the outputs of a multi-branch auxiliary model, with each branch trained in a label subspace that excludes certain labels. This approach ensures that each branch explicitly avoids the disturbance of the excluded labels, allowing the pseudo-labels provided for instances troubled by these excluded labels to benefit from the unaffected branches. Theoretically, we demonstrate that reduction-based pseudo-labels exhibit greater consistency with the Bayes optimal classifier compared to pseudo-labels directly generated from the predictive model.

new zGAN: An Outlier-focused Generative Adversarial Network For Realistic Synthetic Data Generation

Authors: Azizjon Azimi, Bonu Boboeva, Ilyas Varshavskiy, Shuhrat Khalilbekov, Akhlitdin Nizamitdinov, Najima Noyoftova, Sergey Shulgin

Abstract: The phenomenon of "black swans" has posed a fundamental challenge to the performance of classical machine learning models. A perceived rise in the frequency of outlier conditions, especially in the post-pandemic environment, has necessitated exploration of synthetic data as a complement to real data in model training. This article provides a general overview and experimental investigation of the zGAN model architecture developed for the purpose of generating synthetic tabular data with outlier characteristics. The model is put to the test in binary classification environments and shows promising results not only on synthetic data generation, but also on uplift capabilities vis-à-vis model performance. A distinctive feature of zGAN is its enhanced correlation capability between features in the generated data, replicating correlations of features in real training data. Furthermore, the ability of zGAN to generate outliers based on the covariance of real data or on synthetically generated covariances is crucial. This approach to outlier generation enables modeling of complex economic events and augmentation of outliers for tasks such as training predictive models and detecting, processing or removing outliers. Experiments and comparative analyses as part of this study were conducted on both private (credit risk in financial services) and public datasets.

new Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation

Authors: Jaechang Kim, Jinmin Goh, Inseok Hwang, Jaewoong Cho, Jungseul Ok

Abstract: Deep learning-based expert models have reached superhuman performance in decision-making domains such as chess and Go. However, explaining or commenting on given decisions remains under-explored, although it is important for human education and model explainability. The outputs of expert models are accurate yet difficult for humans to interpret. On the other hand, large language models (LLMs) produce fluent commentary but are prone to hallucinations due to their limited decision-making capabilities. To bridge this gap between expert models and LLMs, we focus on chess commentary as a representative case of explaining complex decision-making processes through language and address both the generation and evaluation of commentary. We introduce Concept-guided Chess Commentary generation (CCC) for producing commentary and GPT-based Chess Commentary Evaluation (GCC-Eval) for assessing it. CCC integrates the decision-making strengths of expert models with the linguistic fluency of LLMs through prioritized, concept-based explanations. GCC-Eval leverages expert knowledge to evaluate chess commentary based on informativeness and linguistic quality. Experimental results, validated by both human judges and GCC-Eval, demonstrate that CCC generates commentary that is accurate, informative, and fluent.

new Temporal Streaming Batch Principal Component Analysis for Time Series Classification

Authors: Enshuo Yan, Huachuan Wang, Weihao Xia

Abstract: In multivariate time series classification, although current sequence analysis models have excellent classification capabilities, they show significant shortcomings when dealing with long sequence multivariate data, such as prolonged training times and decreased accuracy. This paper focuses on optimizing model performance for long-sequence multivariate data by mitigating the impact of extended time series and multiple variables on the model. We propose a principal component analysis (PCA)-based temporal streaming compression and dimensionality reduction algorithm for time series data (temporal streaming batch PCA, TSBPCA), which continuously updates the compact representation of the entire sequence through streaming PCA time estimation with time block updates, enhancing the data representation capability of a range of sequence analysis models. We evaluated this method using various models on five real datasets, and the experimental results show that our method performs well in terms of classification accuracy and time efficiency. Notably, our method demonstrates a trend of increasing effectiveness as sequence length grows; on the two longest sequence datasets, accuracy improved by about 7.2%, and execution time decreased by 49.5%.

new On Probabilistic Pullback Metrics on Latent Hyperbolic Manifolds

Authors: Luis Augenstein, Noémie Jaquier, Tamim Asfour, Leonel Rozo

Abstract: Gaussian Process Latent Variable Models (GPLVMs) have proven effective in capturing complex, high-dimensional data through lower-dimensional representations. Recent advances show that using Riemannian manifolds as latent spaces provides more flexibility to learn higher quality embeddings. This paper focuses on the hyperbolic manifold, a particularly suitable choice for modeling hierarchical relationships. While previous approaches relied on hyperbolic geodesics for interpolating the latent space, this often results in paths crossing low-data regions, leading to highly uncertain predictions. Instead, we propose augmenting the hyperbolic metric with a pullback metric to account for distortions introduced by the GPLVM's nonlinear mapping. Through various experiments, we demonstrate that geodesics on the pullback metric not only respect the geometry of the hyperbolic latent space but also align with the underlying data distribution, significantly reducing uncertainty in predictions.

new Strada-LLM: Graph LLM for traffic prediction

Authors: Seyed Mohamad Moghadas, Yangxintong Lyu, Bruno Cornelis, Alexandre Alahi, Adrian Munteanu

Abstract: Traffic prediction is a vital component of intelligent transportation systems. By reasoning about traffic patterns in both the spatial and temporal dimensions, accurate and interpretable predictions can be provided. A considerable challenge in traffic prediction lies in handling the diverse data distributions caused by vastly different traffic conditions occurring at different locations. LLMs have been a dominant solution due to their remarkable capacity to adapt to new datasets with very few labeled data samples, i.e., few-shot adaptability. However, existing forecasting techniques mainly focus on extracting local graph information and forming a text-like prompt, leaving LLM-based traffic prediction an open problem. This work presents a probabilistic LLM for traffic forecasting with three highlights. We propose a graph-aware LLM for traffic prediction that considers proximal traffic information. Specifically, by considering the traffic of neighboring nodes as covariates, our model outperforms the corresponding time-series LLM. Furthermore, we adopt a lightweight approach for efficient domain adaptation when facing new data distributions in a few-shot fashion. The comparative experiment demonstrates that the proposed method outperforms the state-of-the-art LLM-based methods and the traditional GNN-based supervised approaches. Furthermore, Strada-LLM can be easily adapted to different LLM backbones without a noticeable performance drop.

new CODES: Benchmarking Coupled ODE Surrogates

Authors: Robin Janssen, Immanuel Sulzer, Tobias Buck

Abstract: We introduce CODES, a benchmark for comprehensive evaluation of surrogate architectures for coupled ODE systems. Besides standard metrics like mean squared error (MSE) and inference time, CODES provides insights into surrogate behaviour across multiple dimensions like interpolation, extrapolation, sparse data, uncertainty quantification and gradient correlation. The benchmark emphasizes usability through features such as integrated parallel training, a web-based configuration generator, and pre-implemented baseline models and datasets. Extensive documentation ensures sustainability and provides the foundation for collaborative improvement. By offering a fair and multi-faceted comparison, CODES helps researchers select the most suitable surrogate for their specific dataset and application while deepening our understanding of surrogate learning behaviour.

new Generative Example-Based Explanations: Bridging the Gap between Generative Modeling and Explainability

Authors: Philipp Vaeth, Alexander M. Fruehwald, Benjamin Paassen, Magda Gregorova

Abstract: Recently, several methods have leveraged deep generative modeling to produce example-based explanations of decision algorithms for high-dimensional input data. Despite promising results, a disconnect exists between these methods and the classical explainability literature, which focuses on lower-dimensional data with semantically meaningful features. This conceptual and communication gap leads to misunderstandings and misalignments in goals and expectations. In this paper, we bridge this gap by proposing a novel probabilistic framework for local example-based explanations. Our framework integrates the critical characteristics of classical local explanation desiderata while being amenable to high-dimensional data and their modeling through deep generative models. Our aim is to facilitate communication, foster rigor and transparency, and improve the quality of peer discussion and research progress.

new Deep Recurrent Stochastic Configuration Networks for Modelling Nonlinear Dynamic Systems

Authors: Gang Dang, Dianhui Wang

Abstract: Deep learning techniques have shown promise in many domain applications. This paper proposes a novel deep reservoir computing framework, termed deep recurrent stochastic configuration network (DeepRSCN) for modelling nonlinear dynamic systems. DeepRSCNs are incrementally constructed, with all reservoir nodes directly linked to the final output. The random parameters are assigned in the light of a supervisory mechanism, ensuring the universal approximation property of the built model. The output weights are updated online using the projection algorithm to handle the unknown dynamics. Given a set of training samples, DeepRSCNs can quickly generate learning representations, which consist of random basis functions with cascaded input and readout weights. Experimental results over a time series prediction, a nonlinear system identification problem, and two industrial data predictive analyses demonstrate that the proposed DeepRSCN outperforms the single-layer network in terms of modelling efficiency, learning capability, and generalization performance.

new Constrained Optimal Fuel Consumption of HEV: Considering the Observational Perturbation

Authors: Shuchang Yan, Haoran Sun

Abstract: We assume accurate observation of battery state of charge (SOC) and precise speed curves when addressing the constrained optimal fuel consumption (COFC) problem via constrained reinforcement learning (CRL). However, in practice, SOC measurements are often distorted by noise or confidentiality protocols, and actual reference speeds may deviate from expectations. We aim to minimize fuel consumption while maintaining SOC balance under observational perturbations in SOC and speed. This work is the first to use seven training approaches to solve the COFC problem under five types of perturbations, including one based on a uniform distribution, one designed to maximize rewards, one aimed at maximizing costs, and one along with its improved version that seeks to decrease reward, on Toyota Hybrid Systems (THS) under the New European Driving Cycle (NEDC) condition. The result verifies that the six can successfully solve the COFC problem under observational perturbations, and we further compare the robustness and safety of these training approaches and analyze their impact on optimal fuel consumption.

new Neural Hamilton: Can A.I. Understand Hamiltonian Mechanics?

Authors: Tae-Geun Kim, Seong Chan Park

Abstract: We propose a novel framework based on neural networks that reformulates classical mechanics as an operator learning problem. A machine directly maps a potential function to its corresponding trajectory in phase space without solving the Hamilton equations. Most notably, while conventional methods tend to accumulate errors over time through iterative time integration, our approach prevents error propagation. Two newly developed neural network architectures, namely VaRONet and MambONet, are introduced to adapt the Variational LSTM sequence-to-sequence model and leverage the Mamba model for efficient temporal dynamics processing. We tested our approach with various 1D physics problems: harmonic oscillation, double-well potentials, Morse potential, and other potential models outside the training data. Compared to traditional numerical methods based on the fourth-order Runge-Kutta (RK4) algorithm, our model demonstrates improved computational efficiency and accuracy. Code is available at: https://github.com/Axect/Neural_Hamilton

URLs: https://github.com/Axect/Neural_Hamilton

new Simultaneous Unlearning of Multiple Protected User Attributes From Variational Autoencoder Recommenders Using Adversarial Training

Authors: Gustavo Escobedo, Christian Ganhör, Stefan Brandl, Mirjam Augstein, Markus Schedl

Abstract: In widely used neural network-based collaborative filtering models, users' history logs are encoded into latent embeddings that represent the users' preferences. In this setting, the models are capable of mapping users' protected attributes (e.g., gender or ethnicity) from these user embeddings even without explicit access to them, resulting in models that may treat specific demographic user groups unfairly and raise privacy issues. While prior work has approached the removal of a single protected attribute of a user at a time, multiple attributes might come into play in real-world scenarios. In the work at hand, we present AdvXMultVAE which aims to unlearn multiple protected attributes (exemplified by gender and age) simultaneously to improve fairness across demographic user groups. For this purpose, we couple a variational autoencoder (VAE) architecture with adversarial training (AdvMultVAE) to support simultaneous removal of the users' protected attributes with continuous and/or categorical values. Our experiments on two datasets, LFM-2b-100k and Ml-1m, from the music and movie domains, respectively, show that our approach can yield better results than its singular removal counterparts (based on AdvMultVAE) in effectively mitigating demographic biases whilst improving the anonymity of latent embeddings.

new Refining CART Models for Covariate Shift with Importance Weight

Authors: Mingyang Cai, Thomas Klausch, Mark A. van de Wiel

Abstract: Machine learning models often face challenges in medical applications due to covariate shifts, where discrepancies between training and target data distributions can decrease predictive accuracy. This paper introduces an adaptation of Classification and Regression Trees (CART) that incorporates importance weighting to address these distributional differences effectively. By assigning greater weight to training samples that closely represent the target distribution, our approach modifies the CART model to improve performance in the presence of covariate shift. We evaluate the effectiveness of this method through simulation studies and apply it to real-world medical data, showing significant improvements in predictive accuracy. The results indicate that this weighted CART approach can be valuable in medical and other fields where covariate shift poses challenges, enabling more reliable predictions across diverse data distributions.
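
The abstract does not spell out how the importance weights enter the tree, so the following is only a minimal sketch of one common way to do it: replacing class counts with summed sample weights in the Gini impurity used to score CART splits. The function name `weighted_gini` and the example weights are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_gini(labels, weights):
    """Gini impurity where each sample counts with its importance weight.

    Class proportions are computed from summed weights instead of raw
    counts, so samples that resemble the target distribution (large
    weight) influence the impurity more.
    """
    if len(labels) != len(weights) or not labels:
        raise ValueError("labels and weights must be non-empty and aligned")
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    total_weight = sum(totals.values())
    return 1.0 - sum((w / total_weight) ** 2 for w in totals.values())


# Hypothetical node with two classes.
labels = [0, 0, 1, 1]
uniform = weighted_gini(labels, [1.0, 1.0, 1.0, 1.0])   # ordinary Gini
shifted = weighted_gini(labels, [3.0, 3.0, 1.0, 1.0])   # class 0 up-weighted
```

With uniform weights this reduces to the ordinary Gini impurity; up-weighting one class makes the node look purer, which in turn changes which split a greedy tree builder would prefer.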

new A Review of Graph-Powered Data Quality Applications for IoT Monitoring Sensor Networks

Authors: Pau Ferrer-Cid, Jose M. Barcelo-Ordinas, Jorge Garcia-Vidal

Abstract: The development of Internet of Things (IoT) technologies has led to the widespread adoption of monitoring networks for a wide variety of applications, such as smart cities, environmental monitoring, and precision agriculture. A major research focus in recent years has been the development of graph-based techniques to improve the quality of data from sensor networks, a key aspect for the use of sensed data in decision-making processes, digital twins, and other applications. Emphasis has been placed on the development of machine learning and signal processing techniques over graphs, taking advantage of the benefits provided by the use of structured data through a graph topology. Many technologies such as the graph signal processing (GSP) or the successful graph neural networks (GNNs) have been used for data quality enhancement tasks. In this survey, we focus on graph-based models for data quality control in monitoring sensor networks. Furthermore, we delve into the technical details that are commonly leveraged for providing powerful graph-based solutions for data quality tasks in sensor networks, including missing value imputation, outlier detection, or virtual sensing. To conclude, we have identified future trends and challenges such as graph-based models for digital twins or model transferability and generalization.

new Physics-informed Partitioned Coupled Neural Operator for Complex Networks

Authors: Weidong Wu, Yong Zhang, Lili Hao, Yang Chen, Xiaoyan Sun, Dunwei Gong

Abstract: Physics-Informed Neural Operators provide efficient, high-fidelity simulations for systems governed by partial differential equations (PDEs). However, most existing studies focus only on multi-scale, multi-physics systems within a single spatial region, neglecting the case with multiple interconnected sub-regions, such as gas and thermal systems. To address this, this paper proposes a Physics-Informed Partitioned Coupled Neural Operator (PCNO) to enhance the simulation performance of such networks. Compared to the existing Fourier Neural Operator (FNO), this method designs a joint convolution operator within the Fourier layer, enabling global integration capturing all sub-regions. Additionally, grid alignment layers are introduced outside the Fourier layer to help the joint convolution operator accurately learn the coupling relationship between sub-regions in the frequency domain. Experiments on gas networks demonstrate that the proposed operator not only accurately simulates complex systems but also shows good generalization and low model complexity.

new Transferable Post-training via Inverse Value Learning

Authors: Xinyu Lu, Xueru Wen, Yaojie Lu, Bowen Yu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei Han, Yongbin Li

Abstract: As post-training processes utilize increasingly large datasets and base models continue to grow in size, the computational demands and implementation challenges of existing algorithms are escalating significantly. In this paper, we propose modeling the changes at the logits level during post-training using a separate neural network (i.e., the value network). After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference, enabling them to achieve similar capability enhancements. We systematically investigate the best practices for this paradigm in terms of pre-training weights and connection schemes. We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes within the same family, models undergoing continuous pre-training within the same family, and models with different vocabularies across families. In certain cases, it can achieve performance comparable to full-parameter fine-tuning. Furthermore, we explore methods to enhance the transferability of the value model and prevent overfitting to the base model used during training.

new Graph Based Traffic Analysis and Delay Prediction

Authors: Gabriele Borg, Charlie Abela

Abstract: This research is focused on traffic congestion in the small island of Malta, which is the most densely populated country in the EU with about 1,672 inhabitants per square kilometre (4,331 inhabitants/sq mi). Furthermore, Malta has rapid vehicle growth. Based on our research, the number of vehicles increased by around 11,000 in a little more than 6 months, which shows how important it is to have an accurate and comprehensive means of collecting data to tackle the issue of fluctuating traffic in Malta. In this paper, we first present the newly built comprehensive traffic dataset, called MalTra. This dataset includes realistic trips made by members of the public across the island over a period of 200 days. We then describe the methodology we adopted to generate synthetic data to complete our data set as much as possible. In our research, we consider both MalTra and the Q-Traffic dataset, which has been used in several other research studies. The statistical ARIMA model and two graph neural networks, the spatial temporal graph convolutional network (STGCN) and the diffusion convolutional recurrent network (DCRNN), were used to analyse and compare the results with existing research. From the evaluation, we found that the DCRNN model outperforms the STGCN, with the former resulting in an MAE of 3.98 (6.65 in the case of the latter) and an RMSE of 7.78 (against 12.73 of the latter).
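
For readers comparing the two error figures quoted above, here is a small sketch of how MAE and RMSE are conventionally computed. The sample values are hypothetical and have no connection to MalTra or Q-Traffic; the helper names `mae` and `rmse` are illustrative only.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average of |y - y_hat|."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average of (y - y_hat)^2."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq)


# Hypothetical observed and predicted values.
observed = [10.0, 12.0, 8.0, 15.0]
predicted = [11.0, 10.0, 8.0, 19.0]

mae_value = mae(observed, predicted)    # absolute errors 1, 2, 0, 4
rmse_value = rmse(observed, predicted)  # squared errors 1, 4, 0, 16
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few predictions are far off, which is one reason both metrics are usually reported together.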

new Beyond Autoregression: Fast LLMs via Self-Distillation Through Time

Authors: Justin Deschenaux, Caglar Gulcehre

Abstract: Autoregressive (AR) Large Language Models (LLMs) have demonstrated significant success across numerous tasks. However, the AR modeling paradigm presents certain limitations; for instance, contemporary autoregressive LLMs are trained to generate one token at a time, which can result in noticeable latency. Recent advances have indicated that search and repeated sampling can enhance performance in various applications, such as theorem proving, code generation, and alignment, by utilizing greater computational resources during inference. In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the number of inference steps by a factor of 32-64. Practically, our models, even without caching, can generate tokens at a rate that is up to 8 times faster than AR models employing KV caching, and we anticipate further improvements with the inclusion of caching. Moreover, we demonstrate the efficacy of our approach for diffusion language models with up to 860M parameters.

new Disentangled and Self-Explainable Node Representation Learning

Authors: Simone Piaggesi, André Panisson, Megha Khosla

Abstract: Node representations, or embeddings, are low-dimensional vectors that capture node properties, typically learned through unsupervised structural similarity objectives or supervised tasks. While recent efforts have focused on explaining graph model decisions, the interpretability of unsupervised node embeddings remains underexplored. To bridge this gap, we introduce DiSeNE (Disentangled and Self-Explainable Node Embedding), a framework that generates self-explainable embeddings in an unsupervised manner. Our method employs disentangled representation learning to produce dimension-wise interpretable embeddings, where each dimension is aligned with a distinct topological structure of the graph. We formalize novel desiderata for disentangled and interpretable embeddings, which drive our new objective functions, optimizing simultaneously for both interpretability and disentanglement. Additionally, we propose several new metrics to evaluate representation quality and human interpretability. Extensive experiments across multiple benchmark datasets demonstrate the effectiveness of our approach.

new Getting By Goal Misgeneralization With a Little Help From a Mentor

Authors: Tu Trinh, Mohamad H. Danesh, Nguyen X. Khanh, Benjamin Plaut

Abstract: While reinforcement learning (RL) agents often perform well during training, they can struggle with distribution shift in real-world deployments. One particularly severe risk of distribution shift is goal misgeneralization, where the agent learns a proxy goal that coincides with the true goal during training but not during deployment. In this paper, we explore whether allowing an agent to ask for help from a supervisor in unfamiliar situations can mitigate this issue. We focus on agents trained with PPO in the CoinRun environment, a setting known to exhibit goal misgeneralization. We evaluate multiple methods for determining when the agent should request help and find that asking for help consistently improves performance. However, we also find that methods based on the agent's internal state fail to proactively request help, instead waiting until mistakes have already occurred. Further investigation suggests that the agent's internal state does not represent the coin at all, highlighting the importance of learning nuanced representations, the risks of ignoring everything not immediately relevant to reward, and the necessity of developing ask-for-help strategies tailored to the agent's training algorithm.

new Computable Lipschitz Bounds for Deep Neural Networks

Authors: Moreno Pintore, Bruno Després

Abstract: Deriving sharp and computable upper bounds of the Lipschitz constant of deep neural networks is crucial to formally guarantee the robustness of neural-network based models. We analyse three existing upper bounds written for the $l^2$ norm. We highlight the importance of working with the $l^1$ and $l^\infty$ norms and we propose two novel bounds for both feed-forward fully-connected neural networks and convolutional neural networks. We treat the technical difficulties related to convolutional neural networks with two different methods, called explicit and implicit. Several numerical tests empirically confirm the theoretical results, help to quantify the relationship between the presented bounds and establish the better accuracy of the new bounds. Four numerical tests are studied: two in which the output is derived from an analytical closed form; another with random matrices; and the last with convolutional neural networks trained on the MNIST dataset. We observe that one of our bounds is optimal in the sense that it is exact for the first test with the simplest analytical form, and it is better than the other bounds for the other tests.
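
As a point of reference for what a "computable upper bound" looks like, the sketch below computes the classical product-of-norms bound in the $l^1$ and $l^\infty$ norms. This is the well-known naive bound, not either of the two novel bounds proposed in the paper, whose formulas the abstract does not give. The weight matrices `W1` and `W2` are hypothetical, stored as (outputs x inputs), and the activations are assumed to be 1-Lipschitz (for example ReLU).

```python
from typing import Callable, List

Matrix = List[List[float]]


def norm_linf(w: Matrix) -> float:
    """Induced l-infinity operator norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in w)


def norm_l1(w: Matrix) -> float:
    """Induced l1 operator norm: the largest absolute column sum."""
    return max(sum(abs(row[j]) for row in w) for j in range(len(w[0])))


def product_bound(weights: List[Matrix], norm: Callable[[Matrix], float]) -> float:
    """Naive Lipschitz upper bound: product of per-layer operator norms.

    Assumes a feed-forward network with 1-Lipschitz activations;
    biases do not change the Lipschitz constant and are ignored.
    """
    bound = 1.0
    for w in weights:
        bound *= norm(w)
    return bound


# Hypothetical two-layer network R^2 -> R^2 -> R^1 (illustrative values only).
W1 = [[1.0, -2.0], [0.5, 0.5]]
W2 = [[1.0, 1.0]]

bound_linf = product_bound([W1, W2], norm_linf)
bound_l1 = product_bound([W1, W2], norm_l1)
print("l-infinity bound:", bound_linf)
print("l1 bound:", bound_l1)
```

Both norms are cheap to evaluate exactly, unlike the $l^2$ operator norm, which requires a singular value computation; that is one practical reason to consider them.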

new EMOCPD: Efficient Attention-based Models for Computational Protein Design Using Amino Acid Microenvironment

Authors: Xiaoqi Ling, Cheng Cai, Zhaohong Deng, Lei Wang, Zhisheng Wei, Jing Wu

Abstract: Computational protein design (CPD) refers to the use of computational methods to design proteins. Traditional methods relying on energy functions and heuristic algorithms for sequence design are inefficient and do not meet the demands of the big data era in biomolecules, with their accuracy limited by the energy functions and search algorithms. Existing deep learning methods are constrained by the learning capabilities of the networks, failing to extract effective information from sparse protein structures, which limits the accuracy of protein design. To address these shortcomings, we developed Efficient attention-based Models for Computational Protein Design using amino acid microenvironment (EMOCPD). EMOCPD aims to predict the category of each amino acid in a protein by analyzing the three-dimensional atomic environment surrounding the amino acids, and to optimize the protein based on the predicted high-probability potential amino acid categories. EMOCPD employs a multi-head attention mechanism to focus on important features in the sparse protein microenvironment and utilizes an inverse residual structure to optimize the network architecture. The proposed EMOCPD achieves over 80% accuracy on the training set and 68.33% and 62.32% accuracy on two independent test sets, respectively, surpassing the best comparative methods by over 10%. In protein design, the thermal stability and protein expression of the predicted mutants from EMOCPD show significant improvements compared to the wild type, effectively validating EMOCPD's potential in designing superior proteins. Furthermore, the content of each of the 20 amino acids influences the predictions of EMOCPD positively, negatively, or minimally, and the amino acids are categorized accordingly as positive, negative, or neutral. Research findings indicate that EMOCPD is more suitable for designing proteins with lower contents of negative amino acids.

new Federated Time Series Generation on Feature and Temporally Misaligned Data

Authors: Chenrui Fan, Zhi Wen Soi, Aditya Shankar, Abele Mălan, Lydia Y. Chen

Abstract: Distributed time series data presents a challenge for federated learning, as clients often possess different feature sets and have misaligned time steps. Existing federated time series models are limited by the assumption of perfect temporal or feature alignment across clients. In this paper, we propose FedTDD, a novel federated time series diffusion model that jointly learns a synthesizer across clients. At the core of FedTDD is a novel data distillation and aggregation framework that reconciles the differences between clients by imputing the misaligned timesteps and features. In contrast to traditional federated learning, FedTDD learns the correlation across clients' time series through the exchange of local synthetic outputs instead of model parameters. A coordinator iteratively improves a global distiller network by leveraging shared knowledge from clients through the exchange of synthetic data. As the distiller becomes more refined over time, it subsequently enhances the quality of the clients' local feature estimates, allowing each client to then improve its local imputations for missing data using the latest, more accurate distiller. Experimental results on five datasets demonstrate FedTDD's effectiveness compared to centralized training, and the effectiveness of sharing synthetic outputs to transfer knowledge of local time series. Notably, FedTDD achieves 79.4% and 62.8% improvement over local training in Context-FID and Correlational scores.

new Skip2-LoRA: A Lightweight On-device DNN Fine-tuning Method for Low-cost Edge Devices

Authors: Hiroki Matsutani, Masaaki Kondo, Kazuki Sunaga, Radu Marculescu

Abstract: This paper proposes Skip2-LoRA as a lightweight fine-tuning method for deep neural networks to address the gap between pre-trained and deployed models. In our approach, trainable LoRA (low-rank adaptation) adapters are inserted between the last layer and every other layer to enhance the network expressive power while keeping the backward computation cost low. This architecture is well-suited to cache intermediate computation results of the forward pass and then can skip the forward computation of seen samples as training epochs progress. We implemented the combination of the proposed architecture and cache, denoted as Skip2-LoRA, and tested it on a $15 single board computer. Our results show that Skip2-LoRA reduces the fine-tuning time by 90.0% on average compared to the counterpart that has the same number of trainable parameters while preserving the accuracy, and that fine-tuning takes only a few seconds on the microcontroller board.

new Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models

Authors: Wenda Li, Huijie Zhang, Qing Qu

Abstract: The widespread use of AI-generated content from diffusion models has raised significant concerns regarding misinformation and copyright infringement. Watermarking is a crucial technique for identifying these AI-generated images and preventing their misuse. In this paper, we introduce Shallow Diffuse, a new watermarking technique that embeds robust and invisible watermarks into diffusion model outputs. Unlike existing approaches that integrate watermarking throughout the entire diffusion sampling process, Shallow Diffuse decouples these steps by leveraging the presence of a low-dimensional subspace in the image generation process. This method ensures that a substantial portion of the watermark lies in the null space of this subspace, effectively separating it from the image generation process. Our theoretical and empirical analyses show that this decoupling strategy greatly enhances the consistency of data generation and the detectability of the watermark. Extensive experiments further validate that our Shallow Diffuse outperforms existing watermarking methods in terms of robustness and consistency. The codes will be released at https://github.com/liwd190019/Shallow-Diffuse.

URLs: https://github.com/liwd190019/Shallow-Diffuse

new Tree-Wasserstein Distance for High Dimensional Data with a Latent Feature Hierarchy

Authors: Ya-Wei Eileen Lin, Ronald R. Coifman, Gal Mishne, Ronen Talmon

Abstract: Finding meaningful distances between high-dimensional data samples is an important scientific task. To this end, we propose a new tree-Wasserstein distance (TWD) for high-dimensional data with two key aspects. First, our TWD is specifically designed for data with a latent feature hierarchy, i.e., the features lie in a hierarchical space, in contrast to the usual focus on embedding samples in hyperbolic space. Second, while the conventional use of TWD is to speed up the computation of the Wasserstein distance, we use its inherent tree as a means to learn the latent feature hierarchy. The key idea of our method is to embed the features into a multi-scale hyperbolic space using diffusion geometry and then present a new tree decoding method by establishing analogies between the hyperbolic embedding and trees. We show that our TWD computed based on data observations provably recovers the TWD defined with the latent feature hierarchy and that its computation is efficient and scalable. We showcase the usefulness of the proposed TWD in applications to word-document and single-cell RNA-sequencing datasets, demonstrating its advantages over existing TWDs and methods based on pre-trained models.

new Dual-Agent Deep Reinforcement Learning for Dynamic Pricing and Replenishment

Authors: Yi Zheng, Zehao Li, Peng Jiang, Yijie Peng

Abstract: We study the dynamic pricing and replenishment problems under inconsistent decision frequencies. Different from the traditional demand assumption, the discreteness of demand and the parameter within the Poisson distribution as a function of price introduce complexity into analyzing the problem property. We demonstrate the concavity of the single-period profit function with respect to product price and inventory within their respective domains. The demand model is enhanced by integrating a decision tree-based machine learning approach, trained on comprehensive market data. Employing a two-timescale stochastic approximation scheme, we address the discrepancies in decision frequencies between pricing and replenishment, ensuring convergence to a local optimum. We further refine our methodology by incorporating deep reinforcement learning (DRL) techniques and propose a fast-slow dual-agent DRL algorithm. In this approach, two agents handle pricing and inventory and are updated on different scales. Numerical results from both single-product and multiple-product scenarios validate the effectiveness of our methods.

new FusedInf: Efficient Swapping of DNN Models for On-Demand Serverless Inference Services on the Edge

Authors: Sifat Ut Taki, Arthi Padmanabhan, Spyridon Mastorakis

Abstract: Edge AI computing boxes are a new class of computing devices that aim to revolutionize the AI industry. These compact and robust hardware units bring the power of AI processing directly to the source of data--on the edge of the network. On the other hand, on-demand serverless inference services are becoming more and more popular as they minimize the infrastructural cost associated with hosting and running DNN models for small to medium-sized businesses. However, these computing devices are still constrained in terms of resource availability. As such, the service providers need to load and unload models efficiently in order to meet the growing demand. In this paper, we introduce FusedInf to efficiently swap DNN models for on-demand serverless inference services on the edge. FusedInf combines multiple models into a single Directed Acyclic Graph (DAG) to efficiently load the models into the GPU memory and make execution faster. Our evaluation of popular DNN models showed that creating a single DAG can make the execution of the models up to 14\% faster while reducing the memory requirement by up to 17\%. The prototype implementation is available at https://github.com/SifatTaj/FusedInf.

URLs: https://github.com/SifatTaj/FusedInf

new Fast Calibrated Explanations: Efficient and Uncertainty-Aware Explanations for Machine Learning Models

Authors: Tuwe Löfström, Fatima Rabia Yapicioglu, Alessandra Stramiglio, Helena Löfström, Fabio Vitali

Abstract: This paper introduces Fast Calibrated Explanations, a method designed for generating rapid, uncertainty-aware explanations for machine learning models. By incorporating perturbation techniques from ConformaSight - a global explanation framework - into the core elements of Calibrated Explanations (CE), we achieve significant speedups. These core elements include local feature importance with calibrated predictions, both of which retain uncertainty quantification. While the new method sacrifices a small degree of detail, it excels in computational efficiency, making it ideal for high-stakes, real-time applications. Fast Calibrated Explanations are applicable to probabilistic explanations in classification and thresholded regression tasks, where they provide the likelihood of a target being above or below a user-defined threshold. This approach maintains the versatility of CE for both classification and probabilistic regression, making it suitable for a range of predictive tasks where uncertainty quantification is crucial.

new LLM-initialized Differentiable Causal Discovery

Authors: Shiv Kampani, David Hidary, Constantijn van der Poel, Martin Ganahl, Brenda Miao

Abstract: The discovery of causal relationships between random variables is an important yet challenging problem that has applications across many scientific domains. Differentiable causal discovery (DCD) methods are effective in uncovering causal relationships from observational data; however, these approaches often suffer from limited interpretability and face challenges in incorporating domain-specific prior knowledge. In contrast, Large Language Model (LLM)-based causal discovery approaches have recently been shown capable of providing useful priors for causal discovery but struggle with formal causal reasoning. In this paper, we propose LLM-DCD, which uses an LLM to initialize the optimization of the maximum likelihood objective function of DCD approaches, thereby incorporating strong priors into the discovery method. To achieve this initialization, we design our objective function to depend on an explicitly defined adjacency matrix of the causal graph as its only variational parameter. Directly optimizing the explicitly defined adjacency matrix provides a more interpretable approach to causal discovery. Additionally, we demonstrate higher accuracy of our approach on key benchmarking datasets compared to state-of-the-art alternatives, and provide empirical evidence that the quality of the initialization directly impacts the quality of the final output of our DCD approach. LLM-DCD opens up new opportunities for traditional causal discovery methods like DCD to benefit from future improvements in the causal reasoning capabilities of LLMs.

new Offline Reinforcement Learning With Combinatorial Action Spaces

Authors: Matthew Landers, Taylor W. Killian, Hugo Barnes, Thomas Hartvigsen, Afsaneh Doryab

Abstract: Reinforcement learning problems often involve large action spaces arising from the simultaneous execution of multiple sub-actions, resulting in combinatorial action spaces. Learning in combinatorial action spaces is difficult due to the exponential growth in action space size with the number of sub-actions and the dependencies among these sub-actions. In offline settings, this challenge is compounded by limited and suboptimal data. Current methods for offline learning in combinatorial spaces simplify the problem by assuming sub-action independence. We propose Branch Value Estimation (BVE), which effectively captures sub-action dependencies and scales to large combinatorial spaces by learning to evaluate only a small subset of actions at each timestep. Our experiments show that BVE outperforms state-of-the-art methods across a range of action space sizes.
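
To make the exponential growth mentioned above concrete, the short sketch below enumerates a joint action space built from independent sub-action choices. The sub-action sizes are hypothetical and chosen only for illustration; the code does not implement BVE itself.

```python
from itertools import product
from math import prod
from typing import List, Tuple


def joint_action_count(sub_action_sizes: List[int]) -> int:
    """Number of joint actions: the product of the sub-action set sizes."""
    return prod(sub_action_sizes)


def enumerate_joint_actions(sub_action_sizes: List[int]) -> List[Tuple[int, ...]]:
    """Explicitly list every joint action (only feasible for small spaces)."""
    return list(product(*(range(n) for n in sub_action_sizes)))


# Hypothetical setting: each sub-action has 3 choices.
sizes = [3, 3, 3, 3]
joint_size = joint_action_count(sizes)
joint_actions = enumerate_joint_actions(sizes)
print(joint_size, joint_actions[:3])

# Adding sub-actions multiplies the space: 3**n joint actions for n sub-actions.
growth = [joint_action_count([3] * n) for n in range(1, 6)]
print(growth)
```

Exhaustively scoring every joint action quickly becomes impractical, which is why evaluating only a small subset of actions per timestep, as the abstract describes, is attractive.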

new Trajectory Flow Matching with Applications to Clinical Time Series Modeling

Authors: Xi Zhang, Yuan Pu, Yuki Kawamura, Andrew Loza, Yoshua Bengio, Dennis L. Shung, Alexander Tong

Abstract: Modeling stochastic and irregularly sampled time series is a challenging problem found in a wide range of applications, especially in medicine. Neural stochastic differential equations (Neural SDEs) are an attractive modeling technique for this problem, which parameterize the drift and diffusion terms of an SDE with neural networks. However, current algorithms for training Neural SDEs require backpropagation through the SDE dynamics, greatly limiting their scalability and stability. To address this, we propose Trajectory Flow Matching (TFM), which trains a Neural SDE in a simulation-free manner, bypassing backpropagation through the dynamics. TFM leverages the flow matching technique from generative modeling to model time series. In this work we first establish necessary conditions for TFM to learn time series data. Next, we present a reparameterization trick which improves training stability. Finally, we adapt TFM to the clinical time series setting, demonstrating improved performance on three clinical time series datasets both in terms of absolute performance and uncertainty prediction.

new Resilience in Knowledge Graph Embeddings

Authors: Arnab Sharma, N'Dah Jean Kouagou, Axel-Cyrille Ngonga Ngomo

Abstract: In recent years, knowledge graphs have gained interest and witnessed widespread applications in various domains, such as information retrieval, question-answering, recommendation systems, amongst others. Large-scale knowledge graphs to this end have demonstrated their utility in effectively representing structured knowledge. To further facilitate the application of machine learning techniques, knowledge graph embedding (KGE) models have been developed. Such models can transform entities and relationships within knowledge graphs into vectors. However, these embedding models often face challenges related to noise, missing information, distribution shift, adversarial attacks, etc. This can lead to sub-optimal embeddings and incorrect inferences, thereby negatively impacting downstream applications. While the existing literature has focused so far on adversarial attacks on KGE models, the challenges related to the other critical aspects remain unexplored. In this paper, we, first of all, give a unified definition of resilience, encompassing several factors such as generalisation, performance consistency, distribution adaption, and robustness. After formalizing these concepts for machine learning in general, we define them in the context of knowledge graphs. To find the gap in the existing works on resilience in the context of knowledge graphs, we perform a systematic survey, taking into account all these aspects mentioned previously. Our survey results show that most of the existing works focus on a specific aspect of resilience, namely robustness. After categorizing such works based on their respective aspects of resilience, we discuss the challenges and future research directions.

new SeriesGAN: Time Series Generation via Adversarial and Autoregressive Learning

Authors: MohammadReza EskandariNasab, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi

Abstract: Current Generative Adversarial Network (GAN)-based approaches for time series generation face challenges such as suboptimal convergence, information loss in embedding spaces, and instability. To overcome these challenges, we introduce an advanced framework that integrates the advantages of an autoencoder-generated embedding space with the adversarial training dynamics of GANs. This method employs two discriminators: one to specifically guide the generator and another to refine both the autoencoder's and generator's output. Additionally, our framework incorporates a novel autoencoder-based loss function and supervision from a teacher-forcing supervisor network, which captures the stepwise conditional distributions of the data. The generator operates within the latent space, while the two discriminators work on latent and feature spaces separately, providing crucial feedback to both the generator and the autoencoder. By leveraging this dual-discriminator approach, we minimize information loss in the embedding space. Through joint training, our framework excels at generating high-fidelity time series data, consistently outperforming existing state-of-the-art benchmarks both qualitatively and quantitatively across a range of real and synthetic multivariate time series datasets.

new Reconstructing dynamics from sparse observations with no training on target system

Authors: Zheng-Meng Zhai, Jun-Yin Huang, Benjamin D. Stern, Ying-Cheng Lai

Abstract: In applications, an anticipated situation is where the system of interest has never been encountered before and sparse observations can be made only once. Can the dynamics be faithfully reconstructed from the limited observations without any training data? This problem defies any known traditional methods of nonlinear time-series analysis as well as existing machine-learning methods that typically require extensive data from the target system for training. We address this challenge by developing a hybrid transformer and reservoir-computing machine-learning scheme. The key idea is that, for a complex and nonlinear target system, the training of the transformer can be conducted not using any data from the target system, but with essentially unlimited synthetic data from known chaotic systems. The trained transformer is then tested with the sparse data from the target system. The output of the transformer is further fed into a reservoir computer for predicting the long-term dynamics or the attractor of the target system. The power of the proposed hybrid machine-learning framework is demonstrated using a large number of prototypical nonlinear dynamical systems, with high reconstruction accuracy even when the available data is only 20% of that required to faithfully represent the dynamical behavior of the underlying system. The framework provides a paradigm of reconstructing complex and nonlinear dynamics in the extreme situation where training data does not exist and the observations are random and sparse.

new LoRA vs Full Fine-tuning: An Illusion of Equivalence

Authors: Reece Shuttleworth, Jacob Andreas, Antonio Torralba, Pratyusha Sharma

Abstract: Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to match the performance of fully fine-tuned models on various tasks with an extreme reduction in the number of trainable parameters. Even in settings where both methods learn similarly accurate models, \emph{are their learned solutions really equivalent?} We study how different fine-tuning methods change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that full fine-tuning and LoRA yield weight matrices whose singular value decompositions exhibit very different structure; moreover, the fine-tuned models themselves show distinct generalization behaviors when tested outside the adaptation task's distribution. More specifically, we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call \emph{intruder dimensions}. Intruder dimensions do not appear during full fine-tuning. Second, we show that LoRA models with intruder dimensions, despite achieving similar performance to full fine-tuning on the target task, become worse models of the pre-training distribution and adapt less robustly to multiple tasks sequentially. Higher-rank, rank-stabilized LoRA models closely mirror full fine-tuning, even when performing on par with lower-rank LoRA models on the same tasks. These results suggest that models updated with LoRA and full fine-tuning access different parts of parameter space, even when they perform equally on the fine-tuned distribution. We conclude by examining why intruder dimensions appear in LoRA fine-tuned models, why they are undesirable, and how their effects can be minimized.

new $\texttt{skwdro}$: a library for Wasserstein distributionally robust machine learning

Authors: Florian Vincent, Waïss Azizian, Franck Iutzeler, Jérôme Malick

Abstract: We present skwdro, a Python library for training robust machine learning models. The library is based on distributionally robust optimization using optimal transport distances. For ease of use, it features both scikit-learn compatible estimators for popular objectives, as well as a wrapper for PyTorch modules, enabling researchers and practitioners to use it in a wide range of models with minimal code changes. Its implementation relies on an entropic smoothing of the original robust objective in order to ensure maximal model flexibility. The library is available at https://github.com/iutzeler/skwdro

URLs: https://github.com/iutzeler/skwdro

new Flaming-hot Initiation with Regular Execution Sampling for Large Language Models

Authors: Weizhe Chen, Zhicheng Zhang, Guanlin Liu, Renjie Zheng, Wenlei Shi, Chen Dun, Zheng Wu, Xing Jin, Lin Yan

Abstract: Since the release of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities across various domains. A key challenge in developing these general capabilities is efficiently sourcing diverse, high-quality data. This becomes especially critical in reasoning-related tasks with sandbox checkers, such as math or code, where the goal is to generate correct solutions to specific problems with higher probability. In this work, we introduce Flaming-hot Initiation with Regular Execution (FIRE) sampling, a simple yet highly effective method to efficiently find good responses. Our empirical findings show that FIRE sampling enhances inference-time generation quality and also benefits training in the alignment stage. Furthermore, we explore how FIRE sampling improves performance by promoting diversity and analyze the impact of employing FIRE at different positions within a response.

new Capacity-Aware Planning and Scheduling in Budget-Constrained Monotonic MDPs: A Meta-RL Approach

Authors: Manav Vora, Ilan Shomorony, Melkior Ornik

Abstract: Many real-world sequential repair problems can be effectively modeled using monotonic Markov Decision Processes (MDPs), where the system state stochastically decreases and can only be increased by performing a restorative action. This work addresses the problem of solving multi-component monotonic MDPs with both budget and capacity constraints. The budget constraint limits the total number of restorative actions and the capacity constraint limits the number of restorative actions that can be performed simultaneously. While prior methods dealt with budget constraints, including capacity constraints in prior methods leads to an exponential increase in computational complexity as the number of components in the MDP grows. We propose a two-step planning approach to address this challenge. First, we partition the components of the multi-component MDP into groups, where the number of groups is determined by the capacity constraint. We achieve this partitioning by solving a Linear Sum Assignment Problem (LSAP). Each group is then allocated a fraction of the total budget proportional to its size. This partitioning effectively decouples the large multi-component MDP into smaller subproblems, which are computationally feasible because the capacity constraint is simplified and the budget constraint can be addressed using existing methods. Subsequently, we use a meta-trained PPO agent to obtain an approximately optimal policy for each group. To validate our approach, we apply it to the problem of scheduling repairs for a large group of industrial robots, constrained by a limited number of repair technicians and a total repair budget. Our results demonstrate that the proposed method outperforms baseline approaches in terms of maximizing the average uptime of the robot swarm, particularly for large swarm sizes.
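
The budget-splitting step described above (each group receives a share of the total budget proportional to its size) can be sketched as follows. The abstract does not say how fractional shares are rounded, so the largest-remainder rounding used here is an assumption, and the function name `allocate_budget` and the example numbers are hypothetical.

```python
from typing import List


def allocate_budget(group_sizes: List[int], total_budget: int) -> List[int]:
    """Split an integer budget across groups in proportion to group size.

    Rounding rule (an assumption, not taken from the paper): give each group
    the floor of its exact share, then hand the leftover units to the groups
    with the largest remainders.
    """
    total_size = sum(group_sizes)
    # Integer arithmetic avoids floating-point rounding surprises.
    floors_and_rems = [divmod(total_budget * s, total_size) for s in group_sizes]
    shares = [q for q, _ in floors_and_rems]
    leftover = total_budget - sum(shares)
    by_remainder = sorted(
        range(len(group_sizes)),
        key=lambda i: floors_and_rems[i][1],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


# Hypothetical example: 10 components split into groups of 5, 3 and 2,
# with a total budget of 17 restorative actions.
shares = allocate_budget([5, 3, 2], 17)
print(shares)
```

Each group's sub-problem can then be solved on its own with its allotted budget, which is the decoupling the abstract relies on.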

new BLAST: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference

Authors: Changwoo Lee, Soo Min Kwon, Qing Qu, Hun-Seok Kim

Abstract: Large-scale foundation models have demonstrated exceptional performance in language and vision tasks. However, the numerous dense matrix-vector operations involved in these large networks pose significant computational challenges during inference. To address these challenges, we introduce the Block-Level Adaptive STructured (BLAST) matrix, designed to learn and leverage efficient structures prevalent in the weight matrices of linear layers within deep learning models. Compared to existing structured matrices, the BLAST matrix offers substantial flexibility, as it can represent various types of structures that are either learned from data or computed from pre-existing weight matrices. We demonstrate the efficiency of using the BLAST matrix for compressing both language and vision tasks, showing that (i) for medium-sized models such as ViT and GPT-2, training with BLAST weights boosts performance while reducing complexity by 70\% and 40\%, respectively; and (ii) for large foundation models such as Llama-7B and DiT-XL, the BLAST matrix achieves a 2x compression while exhibiting the lowest performance degradation among all tested structured matrices. Our code is available at \url{https://github.com/changwoolee/BLAST}.

URLs: https://github.com/changwoolee/BLAST

new Modular Duality in Deep Learning

Authors: Jeremy Bernstein, Laker Newhouse

Abstract: An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We conclude by deriving GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers -- the latter two methods are based on a new rectangular Newton-Schulz iteration that we propose. Our iteration was recently used to set new speed records for training NanoGPT. Overall, we hope that our theory of modular duality will yield a next generation of fast and scalable optimizers for general neural architectures.
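
For readers unfamiliar with Newton-Schulz iterations, the sketch below shows the classical cubic iteration X <- 1.5 X - 0.5 X X^T X, which drives the singular values of a full-rank matrix towards 1 using only matrix multiplications. It is the textbook version, not the new rectangular iteration proposed in the paper, whose coefficients the abstract does not state. The helper names and the example matrix `G` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def newton_schulz(g: Matrix, steps: int = 30) -> Matrix:
    """Classical cubic Newton-Schulz orthogonalization.

    Assumes g is nonzero and has full rank. Dividing by the Frobenius norm
    puts all singular values in (0, 1], inside the convergence region.
    """
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
            for row_x, row_c in zip(x, xxt_x)
        ]
    return x


# Hypothetical 2x3 "gradient" with full row rank.
G = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
Q = newton_schulz(G)
gram = matmul(Q, transpose(Q))  # should be close to the 2x2 identity
print(gram)
```

Because the loop uses only matrix products, iterations of this kind map well onto GPUs, which is consistent with the "GPU-friendly" framing in the abstract.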

new Online Weighted Paging with Unknown Weights

Authors: Orin Levy, Noam Touitou, Aviv Rosenberg

Abstract: Online paging is a fundamental problem in the field of online algorithms, in which one maintains a cache of $k$ slots as requests for fetching pages arrive online. In the weighted variant of this problem, each page has its own fetching cost; a substantial line of work on this problem culminated in an (optimal) $O(\log k)$-competitive randomized algorithm, due to Bansal, Buchbinder and Naor (FOCS'07). Existing work for weighted paging assumes that page weights are known in advance, which is not always the case in practice. For example, in multi-level caching architectures, the expected cost of fetching a memory block is a function of its probability of being in a mid-level cache rather than the main memory. This complex property cannot be predicted in advance; over time, however, one may glean information about page weights through sampling their fetching cost multiple times. We present the first algorithm for online weighted paging that does not know page weights in advance, but rather learns from weight samples. In terms of techniques, this requires providing (integral) samples to a fractional solver, requiring a delicate interface between this solver and the randomized rounding scheme; we believe that our work can inspire online algorithms for other problems that involve cost sampling.

cross Scaling up Masked Diffusion Models on Text

Authors: Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, Chongxuan Li

Abstract: Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effective \emph{unsupervised classifier-free guidance} that effectively exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, a 1.1B MDM shows competitive results, outperforming the larger 1.5B GPT-2 model on four out of eight zero-shot benchmarks. In text generation, MDMs provide a flexible trade-off compared to ARMs utilizing KV-cache: MDMs match the performance of ARMs while being 1.4 times faster, or achieve higher quality than ARMs at a higher computational cost. Moreover, MDMs address challenging tasks for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. Notably, a 1.1B MDM breaks the \emph{reverse curse} encountered by much larger ARMs with significantly more data and computation, such as Llama-2 (13B) and GPT-3 (175B). Our code is available at \url{https://github.com/ML-GSAI/SMDM}.

URLs: https://github.com/ML-GSAI/SMDM

cross Combining LLM Code Generation with Formal Specifications and Reactive Program Synthesis

Authors: William Murphy, Nikolaus Holzer, Feitong Qiao, Leyi Cui, Raven Rothkopf, Nathan Koenig, Mark Santolucito

Abstract: In the past few years, Large Language Models (LLMs) have exploded in usefulness and popularity for code generation tasks. However, LLMs still struggle with accuracy and are unsuitable for high-risk applications without additional oversight and verification. In particular, they perform poorly at generating code for highly complex systems, especially with unusual or out-of-sample logic. For such systems, verifying the code generated by the LLM may take longer than writing it by hand. We introduce a solution that divides the code generation into two parts; one to be handled by an LLM and one to be handled by formal methods-based program synthesis. We develop a benchmark to test our solution and show that our method allows the pipeline to solve problems previously intractable for LLM code generation.

cross The Effect of Acute Stress on the Interpretability and Generalization of Schizophrenia Predictive Machine Learning Models

Authors: Gideon Vos, Maryam Ebrahimpour, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi

Abstract: Introduction: Schizophrenia is a severe mental disorder, and early diagnosis is key to improving outcomes. Its complexity makes predicting onset and progression challenging. EEG has emerged as a valuable tool for studying schizophrenia, with machine learning increasingly applied for diagnosis. This paper assesses the accuracy of ML models for predicting schizophrenia and examines the impact of stress during EEG recording on model performance. We integrate acute stress prediction into the analysis, showing that overlapping conditions like stress during recording can negatively affect model accuracy. Methods: Four XGBoost models were built: one for stress prediction, two to classify schizophrenia (at rest and task), and a model to predict schizophrenia for both conditions. XAI techniques were applied to analyze results. Experiments tested the generalization of schizophrenia models using their datasets' healthy controls and independent health-screened controls. The stress model identified high-stress subjects, who were excluded from further analysis. A novel method was used to adjust EEG frequency band power to remove stress artifacts, improving predictive model performance. Results: Our results show that acute stress levels vary across EEG sessions, affecting model performance and accuracy. Generalization improved once these varying stress levels were considered and compensated for during model training. Our findings highlight the importance of thorough health screening and management of the patient's condition during the process. Stress induced during or by the EEG recording can adversely affect model generalization. This may require further preprocessing of data by treating stress as an additional physiological artifact. Our proposed approach to compensate for stress artifacts in EEG data used for training models showed a significant improvement in predictive performance.

cross Tourism destination events classifier based on artificial intelligence techniques

Authors: Miguel Camacho-Ruiz, Ram\'on Alberto Carrasco, Gema Fern\'andez-Avil\'es, Antonio LaTorre

Abstract: Identifying client needs to provide optimal services is crucial in tourist destination management. The events held in tourist destinations may help to meet those needs and thus contribute to tourist satisfaction. As with product management, the creation of hierarchical catalogs to classify those events can aid event management. The events that can be found on the internet are listed in dispersed, heterogeneous sources, which makes direct classification a difficult, time-consuming task. The main aim of this work is to create a novel process for automatically classifying an eclectic variety of tourist events using a hierarchical taxonomy, which can be applied to support tourist destination management. Leveraging data science methods such as CRISP-DM, supervised machine learning, and natural language processing techniques, the automatic classification process proposed here allows the creation of a normalized catalog across very different geographical regions. Therefore, we can build catalogs with consistent filters, allowing users to find events regardless of the event categories assigned at source, if any. This is very valuable for companies that offer this kind of information across multiple regions, such as airlines, travel agencies or hotel chains. Ultimately, this tool has the potential to revolutionize the way companies and end users interact with tourist events information.

cross Adaptive Real-Time Multi-Loss Function Optimization Using Dynamic Memory Fusion Framework: A Case Study on Breast Cancer Segmentation

Authors: Amin Golnari, Mostafa Diba

Abstract: Deep learning has proven to be a highly effective tool for a wide range of applications, particularly when leveraging multiple loss functions to optimize performance on several criteria simultaneously. However, the selection and weighting of loss functions in deep learning tasks can significantly influence model performance, and manual tuning of these functions is often inefficient and inflexible. To address this, we propose a novel framework called dynamic memory fusion for adaptive, real-time penalization of multiple loss functions. This framework leverages historical loss values to dynamically adjust the weighting of multiple loss functions throughout the training process. Additionally, this framework integrates an auxiliary loss function to enhance model performance in the early stages. To broaden the research scope, we introduce the class-balanced dice loss function, designed to address class imbalance by prioritizing underrepresented classes. Experiments on breast ultrasound datasets demonstrate that the framework improves segmentation performance across various metrics. These results demonstrate the effectiveness of our proposed framework in ensuring that the model dynamically adjusts its focus to prioritize the most relevant criteria, leading to improved performance in evolving environments. The source code for our proposed methodology is publicly available on GitHub.
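
The abstract does not define the class-balanced dice loss, so the following is only one plausible reading: a per-class soft Dice loss whose class weights are inversely proportional to how often each class appears in the ground truth. The function names, the dictionary layout, and the inverse-frequency weighting are all assumptions made for this sketch, not the authors' formulation.

```python
def dice_score(pred, target, eps=1e-6):
    """Soft Dice coefficient between two flat lists of values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def class_weighted_dice_loss(preds, targets):
    """Weighted mean of (1 - Dice) over classes.

    preds / targets: dict mapping class name -> flat list of per-pixel
    probabilities / 0-1 labels. Rarer classes get larger weights
    (hypothetical inverse-frequency scheme).
    """
    total = sum(sum(t) for t in targets.values())
    # max(..., 1) avoids division by zero for classes absent from the mask.
    inverse_freq = {c: total / max(sum(t), 1) for c, t in targets.items()}
    norm = sum(inverse_freq.values())
    return sum((inverse_freq[c] / norm) * (1.0 - dice_score(preds[c], targets[c]))
               for c in targets)
```

With this weighting, missing a class that covers one pixel out of four costs more than missing the class that covers the other three, which is the behavior a loss that prioritizes underrepresented classes would be expected to show.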

cross The Geometry of Concepts: Sparse Autoencoder Feature Structure

Authors: Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark

Abstract: Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: 1) The "atomic" small-scale structure contains "crystals" whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man-woman-king-queen). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently done with linear discriminant analysis. 2) The "brain" intermediate-scale structure has significant spatial modularity; for example, math and code features form a "lobe" akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. 3) The "galaxy" scale large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer.

cross Establishing Nationwide Power System Vulnerability Index across US Counties Using Interpretable Machine Learning

Authors: Junwei Ma, Bo Li, Olufemi A. Omitaomu, Ali Mostafavi

Abstract: Power outages have become increasingly frequent, intense, and prolonged in the US due to climate change, aging electrical grids, and rising energy demand. However, largely due to the absence of granular spatiotemporal outage data, we lack data-driven evidence and analytics-based metrics to quantify power system vulnerability. This limitation has hindered the ability to effectively evaluate and address vulnerability to power outages in US communities. Here, we collected ~179 million power outage records at 15-minute intervals across 3022 US contiguous counties (96.15% of the area) from 2014 to 2023. We developed a power system vulnerability assessment framework based on three dimensions (intensity, frequency, and duration) and applied interpretable machine learning models (XGBoost and SHAP) to compute Power System Vulnerability Index (PSVI) at the county level. Our analysis reveals a consistent increase in power system vulnerability over the past decade. We identified 318 counties across 45 states as hotspots for high power system vulnerability, particularly in the West Coast (California and Washington), the East Coast (Florida and the Northeast area), the Great Lakes megalopolis (Chicago-Detroit metropolitan areas), and the Gulf of Mexico (Texas). Heterogeneity analysis indicates that urban counties, counties with interconnected grids, and states with high solar generation exhibit significantly higher vulnerability. Our results highlight the significance of the proposed PSVI for evaluating the vulnerability of communities to power outages. The findings underscore the widespread and pervasive impact of power outages across the country and offer crucial insights to support infrastructure operators, policymakers, and emergency managers in formulating policies and programs aimed at enhancing the resilience of the US power infrastructure.

cross Physical Simulation for Multi-agent Multi-machine Tending

Authors: Abdalwhab Abdalwhab, Giovanni Beltrame, David St-Onge

Abstract: The manufacturing sector was recently affected by workforce shortages, a problem that automation and robotics can substantially mitigate. Simultaneously, reinforcement learning (RL) offers a promising solution where robots can learn through interaction with the environment. In this work, we leveraged a simple robotic system to apply RL with "real" data without having to deploy large, expensive robots in a manufacturing setting. A real-world tabletop arena was designed with robots that mimic the agents' behavior in the simulation. Despite the difference in dynamics and machine size, the robots were able to exhibit the same behavior as in the simulation. In addition, those experiments provided an initial understanding of the real deployment challenges.

cross Large Model for Small Data: Foundation Model for Cross-Modal RF Human Activity Recognition

Authors: Yuxuan Weng, Guoquan Wu, Tianyue Zheng, Yanbing Yang, Jun Luo

Abstract: Radio-Frequency (RF)-based Human Activity Recognition (HAR) has emerged as a promising solution for applications not amenable to techniques requiring computer vision. However, the scarcity of labeled RF data, due to its non-interpretable nature, poses a significant obstacle. Thanks to the recent breakthrough of foundation models (FMs), extracting deep semantic insights from unlabeled visual data has become viable, yet these vision-based FMs fall short when applied to small RF datasets. To bridge this gap, we introduce FM-Fi, an innovative cross-modal framework engineered to translate the knowledge of vision-based FMs for enhancing RF-based HAR systems. FM-Fi involves a novel cross-modal contrastive knowledge distillation mechanism, enabling an RF encoder to inherit the interpretative power of FMs for achieving zero-shot learning. It also employs the intrinsic capabilities of FM and RF to remove extraneous features for better alignment between the two modalities. The framework is further refined through metric-based few-shot learning techniques, aiming to boost the performance for predefined HAR tasks. Comprehensive evaluations indicate that FM-Fi rivals the effectiveness of vision-based methodologies, and the evaluation results provide empirical validation of FM-Fi's generalizability across various environments.

cross Learning Robust Representations for Communications over Interference-limited Channels

Authors: Shubham Paul, Sudharsan Senthil, Preethi Seshadri, Nambi Seshadri, R David Koilpillai

Abstract: In the context of cellular networks, users located at the periphery of cells are particularly vulnerable to substantial interference from neighbouring cells, which can be represented as a two-user interference channel. This study introduces two highly effective methodologies, namely TwinNet and SiameseNet, using autoencoders, tailored for the design of encoders and decoders for block transmission and detection in interference-limited environments. The findings unambiguously illustrate that the developed models are capable of leveraging the interference structure to outperform traditional methods reliant on complete orthogonality. While it is recognized that systems employing coordinated transmissions and independent detection can offer greater capacity, the specific gains of data-driven models have not been thoroughly quantified or elucidated. This paper conducts an analysis to demonstrate the quantifiable advantages of such models in particular scenarios. Additionally, a comprehensive examination of the characteristics of codewords generated by these models is provided to offer a more intuitive comprehension of how these models achieve superior performance.

cross Statistical Test for Auto Feature Engineering by Selective Inference

Authors: Tatsuya Matsukawa, Tomohiro Shiraishi, Shuichi Nishino, Teruyuki Katsuoka, Ichiro Takeuchi

Abstract: Auto Feature Engineering (AFE) plays a crucial role in developing practical machine learning pipelines by automating the transformation of raw data into meaningful features that enhance model performance. By generating features in a data-driven manner, AFE enables the discovery of important features that may not be apparent through human experience or intuition. On the other hand, since AFE generates features based on data, there is a risk that these features may be overly adapted to the data, making it essential to assess their reliability appropriately. Unfortunately, because most AFE problems are formulated as combinatorial search problems and solved by heuristic algorithms, it has been challenging to theoretically quantify the reliability of generated features. To address this issue, we propose a new statistical test for features generated by AFE algorithms, based on a framework called selective inference. As a proof of concept, we consider a simple class of tree search-based heuristic AFE algorithms, and study the problem of testing the generated features when they are used in a linear model. The proposed test can quantify the statistical significance of the generated features in the form of $p$-values, enabling theoretically guaranteed control of the risk of false findings.

cross Real-time Monitoring of Lower Limb Movement Resistance Based on Deep Learning

Authors: Buren Batu, Yuanmeng Liu, Tianyi Lyu

Abstract: Real-time lower limb movement resistance monitoring is critical for various applications in clinical and sports settings, such as rehabilitation and athletic training. Current methods often face limitations in accuracy, computational efficiency, and generalizability, which hinder their practical implementation. To address these challenges, we propose a novel Mobile Multi-Task Learning Network (MMTL-Net) that integrates MobileNetV3 for efficient feature extraction and employs multi-task learning to simultaneously predict resistance levels and recognize activities. The advantages of MMTL-Net include enhanced accuracy, reduced latency, and improved computational efficiency, making it highly suitable for real-time applications. Experimental results demonstrate that MMTL-Net significantly outperforms existing models on the UCI Human Activity Recognition and Wireless Sensor Data Mining Activity Prediction datasets, achieving a lower Force Error Rate (FER) of 6.8% and a higher Resistance Prediction Accuracy (RPA) of 91.2%. Additionally, the model shows a Real-time Responsiveness (RTR) of 12 milliseconds and a Throughput (TP) of 33 frames per second. These findings underscore the model's robustness and effectiveness in diverse real-world scenarios. The proposed framework not only advances the state-of-the-art in resistance monitoring but also paves the way for more efficient and accurate systems in clinical and sports applications. In real-world settings, the practical implications of MMTL-Net include its potential to enhance patient outcomes in rehabilitation and improve athletic performance through precise, real-time monitoring and feedback.

cross Copula-Linked Parallel ICA: A Method for Coupling Structural and Functional MRI brain Networks

Authors: Oktay Agcaoglu (for the Alzheimers Disease Neuroimaging Initiative), Rogers F. Silva (for the Alzheimers Disease Neuroimaging Initiative), Deniz Alacam (for the Alzheimers Disease Neuroimaging Initiative), Sergey Plis (for the Alzheimers Disease Neuroimaging Initiative), Tulay Adali (for the Alzheimers Disease Neuroimaging Initiative), Vince Calhoun (for the Alzheimers Disease Neuroimaging Initiative)

Abstract: Different brain imaging modalities offer unique insights into brain function and structure. Combining them enhances our understanding of neural mechanisms. Prior multimodal studies fusing functional MRI (fMRI) and structural MRI (sMRI) have shown the benefits of this approach. Since sMRI lacks temporal data, existing fusion methods often compress fMRI temporal information into summary measures, sacrificing rich temporal dynamics. Motivated by the observation that covarying networks are identified in both sMRI and resting-state fMRI, we developed a novel fusion method, named copula linked parallel ICA (CLiP-ICA), that combines deep learning frameworks, copulas and independent component analysis (ICA). This method estimates independent sources for each modality and links the spatial sources of fMRI and sMRI using a copula-based model for more flexible integration of temporal and spatial data. We tested CLiP-ICA using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our results showed that CLiP-ICA effectively captures both strongly and weakly linked sMRI and fMRI networks, including the cerebellum, sensorimotor, visual, cognitive control, and default mode networks. It revealed more meaningful components and fewer artifacts, addressing the long-standing issue of optimal model order in ICA. CLiP-ICA also detected complex functional connectivity patterns across stages of cognitive decline, with cognitively normal subjects generally showing higher connectivity in sensorimotor and visual networks compared to patients with Alzheimer's disease, along with patterns suggesting potential compensatory mechanisms.

cross Real-Time Stress Detection via Photoplethysmogram Signals: Implementation of a Combined Continuous Wavelet Transform and Convolutional Neural Network on Resource-Constrained Microcontrollers

Authors: Yasin Hasanpoor, Amin Rostami, Bahram Tarvirdizadeh, Khalil Alipour, Mohammad Ghamari

Abstract: This paper introduces a robust stress detection system utilizing a Convolutional Neural Network (CNN) designed for the analysis of Photoplethysmogram (PPG) signals. Employing the WESAD dataset, we applied Continuous Wavelet Transform (CWT) to extract informative features from wrist PPG signals, demonstrating enhanced stress detection and learning compared to conventional techniques. Notably, the CNN achieved an accuracy of 93.7% after five epochs, post-implementation on a resource-constrained microcontroller. The optimization process, including pruning and post-training quantization, was crucial to reduce the model size to 1.6 megabytes, overcoming the microcontroller's limited resources of 2 megabytes of Flash memory and 512 kilobytes of RAM. This optimized model not only addresses resource constraints but also outperforms traditional signal processing methods, positioning it as a promising solution for real-time stress monitoring on wearable devices.

cross A Human-Centered Approach for Improving Supervised Learning

Authors: Shubhi Bansal, Atharva Tendulkar, Nagendra Kumar

Abstract: Supervised Learning is a way of developing Artificial Intelligence systems in which a computer algorithm is trained on labeled data inputs. The effectiveness of a Supervised Learning algorithm is determined by its performance on a given dataset for a particular problem. In the case of Supervised Learning problems, Stacking Ensembles usually perform better than individual classifiers due to their generalization ability. Stacking Ensembles combine predictions from multiple Machine Learning algorithms to make final predictions. In spite of Stacking Ensembles' superior performance, their overhead, such as high cost, resources, time, and lack of explainability, creates challenges in real-life applications. This paper shows how we can strike a balance between performance, time, and resource constraints. Another goal of this research is to make Ensembles more explainable and intelligible using the Human-Centered approach. To achieve the aforementioned goals, we propose a Human-Centered Behavior-inspired algorithm that streamlines the Ensemble Learning process while also reducing time, cost, and resource overhead, resulting in the superior performance of Supervised Learning in real-world applications. To demonstrate the effectiveness of our method, we perform our experiments on nine real-world datasets. Experimental results reveal that the proposed method satisfies our goals and outperforms the existing methods.

cross EEGPT: Unleashing the Potential of EEG Generalist Foundation Model by Autoregressive Pre-training

Authors: Tongtian Yue, Shuning Xue, Xuange Gao, Yepeng Tang, Longteng Guo, Jie Jiang, Jing Liu

Abstract: Electroencephalogram (EEG) signals are pivotal in providing insights into spontaneous brain activity, highlighting their significant importance in neuroscience research. However, the exploration of versatile EEG models is constrained by diverse data formats, outdated pre-training paradigms, and limited transfer learning methods, leading only to specialist models trained on a single dataset. In this paper, we introduce EEGPT, the first generalist EEG foundation model designed to address these challenges. First, we propose an electrode-wise modeling strategy that treats each electrode as a fundamental unit, enabling the integration of diverse EEG datasets collected from up to 138 electrodes, amassing 37.5M pre-training samples. Second, we develop the first autoregressive EEG pre-trained model, moving away from traditional masked autoencoder approaches to a next signal prediction task that better captures the sequential and temporal dependencies of EEG data. We also explore scaling laws with models of up to 1.1B parameters, the largest in EEG research to date. Third, we introduce a multi-task transfer learning paradigm using a learnable electrode graph network shared across tasks, which for the first time confirms multi-task compatibility and synergy. As the first generalist EEG foundation model, EEGPT shows broad compatibility with various signal acquisition devices, subjects, and tasks. It supports up to 138 electrodes and any combination thereof as input. Furthermore, we simultaneously evaluate it on 5 distinct tasks across 12 benchmarks. EEGPT consistently outperforms existing specialist models across all downstream tasks, with its effectiveness further validated through extensive ablation studies. This work sets a new direction for generalist EEG modeling, offering improved scalability, transferability, and adaptability for a wide range of EEG applications. The code and models will be released.

cross Sampling from Bayesian Neural Network Posteriors with Symmetric Minibatch Splitting Langevin Dynamics

Authors: Daniel Paulin, Peter A. Whalley, Neil K. Chada, Benedict Leimkuhler

Abstract: We propose a scalable kinetic Langevin dynamics algorithm for sampling parameter spaces of big data and AI applications. Our scheme combines a symmetric forward/backward sweep over minibatches with a symmetric discretization of Langevin dynamics. For a particular Langevin splitting method (UBU), we show that the resulting Symmetric Minibatch Splitting-UBU (SMS-UBU) integrator has bias $O(h^2 d^{1/2})$ in dimension $d>0$ with stepsize $h>0$, despite only using one minibatch per iteration, thus providing excellent control of the sampling bias as a function of the stepsize. We apply the algorithm to explore local modes of the posterior distribution of Bayesian neural networks (BNNs) and evaluate the calibration performance of the posterior predictive probabilities for neural networks with convolutional neural network architectures for classification problems on three different datasets (Fashion-MNIST, Celeb-A and chest X-ray). Our results indicate that BNNs sampled with SMS-UBU can offer significantly better calibration performance compared to standard methods of training and stochastic weight averaging.

cross Feasibility Analysis of Federated Neural Networks for Explainable Detection of Atrial Fibrillation

Authors: Diogo Reis Santos, Andrea Protani, Lorenzo Giusti, Albert Sund Aillet, Pierpaolo Brutti, Luigi Serio

Abstract: Early detection of atrial fibrillation (AFib) is challenging due to its asymptomatic and paroxysmal nature. However, advances in deep learning algorithms and the vast collection of electrocardiogram (ECG) data from devices such as the Internet of Things (IoT) hold great potential for the development of an effective solution. This study assesses the feasibility of training a neural network on a Federated Learning (FL) platform to detect AFib using raw ECG data. The performance of an advanced neural network is evaluated in centralized, local, and federated settings. The effects of different aggregation methods on model performance are investigated, and various normalization strategies are explored to address issues related to neural network federation. The results demonstrate that federated learning can significantly improve the accuracy of detection over local training. The best performing federated model achieved an F1 score of 77\%, improving performance by 15\% compared to the average performance of individually trained clients. This study emphasizes the promise of FL in medical diagnostics, offering a privacy-preserving and interpretable solution for large-scale healthcare applications.
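
The abstract does not name the aggregation methods it compares. Purely as an illustration of what server-side aggregation means in this setting, the sketch below shows sample-size-weighted parameter averaging (the FedAvg rule), a common baseline; the function name and the flat-list representation of model parameters are assumptions for this example.

```python
def federated_average(client_params, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    client_params: list of equal-length lists of floats (one per client).
    client_sizes:  number of training samples held by each client.
    """
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]
```

For two clients with parameters [1.0, 2.0] and [3.0, 4.0] holding 1 and 3 samples, the aggregate is [2.5, 3.5], so the larger client pulls the global model toward its own weights.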

cross Data-Driven Uncertainty-Aware Forecasting of Sea Ice Conditions in the Gulf of Ob Based on Satellite Radar Imagery

Authors: Stefan Maria Ailuro, Anna Nedorubova, Timofey Grigoryev, Evgeny Burnaev, Vladimir Vanovskiy

Abstract: The increase in Arctic marine activity due to rapid warming and significant sea ice loss necessitates highly reliable, short-term sea ice forecasts to ensure maritime safety and operational efficiency. In this work, we present a novel data-driven approach for sea ice condition forecasting in the Gulf of Ob, leveraging sequences of radar images from Sentinel-1, weather observations, and GLORYS forecasts. Our approach integrates advanced video prediction models, originally developed for vision tasks, with domain-specific data preprocessing and augmentation techniques tailored to the unique challenges of Arctic sea ice dynamics. Central to our methodology is the use of uncertainty quantification to assess the reliability of predictions, ensuring robust decision-making in safety-critical applications. Furthermore, we propose a confidence-based model mixture mechanism that enhances forecast accuracy and model robustness, crucial for reliable operations in volatile Arctic environments. Our results demonstrate substantial improvements over baseline approaches, underscoring the importance of uncertainty quantification and specialized data handling for effective and safe operations and reliable forecasting.

cross Enhancing Apple's Defect Classification: Insights from Visible Spectrum and Narrow Spectral Band Imaging

Authors: Omar Coello, Mois\'es Coronel, Dar\'io Carpio, Boris Vintimilla, Luis Chuquimarca

Abstract: This study addresses the classification of defects in apples as a crucial measure to mitigate economic losses and optimize the food supply chain. An innovative approach is employed that integrates images from the visible spectrum and 660 nm spectral wavelength to enhance accuracy and efficiency in defect classification. The methodology is based on the use of Single-Input and Multi-Inputs convolutional neural networks (CNNs) to validate the proposed strategies. Steps include image acquisition and preprocessing, classification model training, and performance evaluation. Results demonstrate that defect classification using the 660 nm spectral wavelength reveals details not visible in the entire visible spectrum. Using the appropriate spectral range in the classification process yields slightly better results than using the entire visible spectrum: the MobileNetV1 model achieves an accuracy of 98.80\% on the validation dataset versus the 98.26\% achieved using the entire visible spectrum. Conclusions highlight the potential to enhance the method by capturing images with specific spectral ranges using filters, enabling more effective network training for the classification task. These improvements could further enhance the system's capability to identify and classify defects in apples.

cross How to Backdoor Consistency Models?

Authors: Chengen Wang, Murat Kantarcioglu

Abstract: Consistency models are a new class of models that generate images by directly mapping noise to data, allowing for one-step generation and significantly accelerating the sampling process. However, their robustness against adversarial attacks has not yet been thoroughly investigated. In this work, we conduct the first study on the vulnerability of consistency models to backdoor attacks. While previous research has explored backdoor attacks on diffusion models, these studies have primarily focused on conventional diffusion models, employing a customized backdoor training process and objective, whereas consistency models have distinct training processes and objectives. Our proposed framework demonstrates the vulnerability of consistency models to backdoor attacks. During image generation, poisoned consistency models produce images with a Fr\'echet Inception Distance (FID) comparable to that of a clean model when sampling from Gaussian noise. However, once the trigger is activated, they generate backdoor target images. We explore various trigger and target configurations to evaluate the vulnerability of consistency models, including the use of random noise as a trigger. This type of trigger is less conspicuous and aligns well with the sampling process of consistency models. Across all configurations, our framework successfully compromises the consistency models while maintaining high utility and specificity.

cross Leveraging Multi-Temporal Sentinel 1 and 2 Satellite Data for Leaf Area Index Estimation With Deep Learning

Authors: Clement Wang, Antoine Debouchage, Valentin Goldit\'e, Aur\'elien Wery, Jules Salzinger

Abstract: The Leaf Area Index (LAI) is a critical parameter to understand ecosystem health and vegetation dynamics. In this paper, we propose a novel method for pixel-wise LAI prediction by leveraging the complementary information from Sentinel 1 radar data and Sentinel 2 multi-spectral data at multiple timestamps. Our approach uses a deep neural network based on multiple U-nets tailored specifically to this task. To handle the complexity of the different input modalities, it is comprised of several modules that are pre-trained separately to represent all input data in a common latent space. Then, we fine-tune them end-to-end with a common decoder that also takes into account seasonality, which we find to play an important role. Our method achieved 0.06 RMSE and 0.93 R2 score on publicly available data. We make our contributions available at https://github.com/valentingol/LeafNothingBehind for future works to further improve on our current progress.

URLs: https://github.com/valentingol/LeafNothingBehind

cross Multi-modal Image and Radio Frequency Fusion for Optimizing Vehicle Positioning

Authors: Ouwen Huan, Tao Luo, Mingzhe Chen

Abstract: In this paper, a multi-modal vehicle positioning framework that jointly localizes vehicles with channel state information (CSI) and images is designed. In particular, we consider an outdoor scenario where each vehicle can communicate with only one BS, and hence, it can upload its estimated CSI to only its associated BS. Each BS is equipped with a set of cameras, such that it can collect a small number of labeled CSI, a large number of unlabeled CSI, and the images taken by cameras. To exploit the unlabeled CSI data and position labels obtained from images, we design a meta-learning-based hard expectation-maximization (EM) algorithm. Specifically, since we do not know the corresponding relationship between unlabeled CSI and the multiple vehicle locations in images, we formulate the calculation of the training objective as a minimum matching problem. To reduce the impact of label noise caused by incorrect matching between unlabeled CSI and vehicle locations obtained from images and achieve better convergence, we introduce a weighted loss function on the unlabeled datasets, and study the use of a meta-learning algorithm for computing the weighted loss. Subsequently, the model parameters are updated according to the weighted loss function of unlabeled CSI samples and their matched position labels obtained from images. Simulation results show that the proposed method can reduce the positioning error by up to 61% compared to a baseline that does not use images and uses only CSI fingerprint for vehicle positioning.

cross Xeno-learning: knowledge transfer across species in deep learning-based spectral image analysis

Authors: Jan Sellner, Alexander Studier-Fischer, Ahmad Bin Qasim, Silvia Seidlitz, Nicholas Schreck, Minu Tizabi, Manuel Wiesenfarth, Annette Kopp-Schneider, Samuel Kn\"odler, Caelan Max Haney, Gabriel Salg, Berkin \"Ozdemir, Maximilian Dietrich, Maurice Stephan Michel, Felix Nickel, Karl-Friedrich Kowalewski, Lena Maier-Hein

Abstract: Novel optical imaging techniques, such as hyperspectral imaging (HSI) combined with machine learning-based (ML) analysis, have the potential to revolutionize clinical surgical imaging. However, these novel modalities face a shortage of large-scale, representative clinical data for training ML algorithms, while preclinical animal data is abundantly available through standardized experiments and allows for controlled induction of pathological tissue states, which is not ethically possible in patients. To leverage this situation, we propose a novel concept called "xeno-learning", a cross-species knowledge transfer paradigm inspired by xeno-transplantation, where organs from a donor species are transplanted into a recipient species. Using a total of 11,268 HSI images from humans as well as porcine and rat models, we show that although spectral signatures of organs differ across species, shared pathophysiological mechanisms manifest as comparable relative spectral changes across species. Such changes learnt in one species can thus be transferred to a new species via a novel "physiology-based data augmentation" method, enabling the large-scale secondary use of preclinical animal data for humans. The resulting ethical, monetary, and performance benefits of the proposed knowledge transfer paradigm promise a high impact of the methodology on future developments in the field.

cross Telco-DPR: A Hybrid Dataset for Evaluating Retrieval Models of 3GPP Technical Specifications

Authors: Thaina Saraiva, Marco Sousa, Pedro Vieira, Ant\'onio Rodrigues

Abstract: This paper proposes a Question-Answering (QA) system for the telecom domain using 3rd Generation Partnership Project (3GPP) technical documents. Alongside, a hybrid dataset, Telco-DPR, which consists of a curated 3GPP corpus in a hybrid format, combining text and tables, is presented. Additionally, the dataset includes a set of synthetic question/answer pairs designed to evaluate the retrieval performance of QA systems on this type of data. The retrieval models, including the sparse model, Best Matching 25 (BM25), as well as dense models, such as Dense Passage Retriever (DPR) and Dense Hierarchical Retrieval (DHR), are evaluated and compared using top-K accuracy and Mean Reciprocal Rank (MRR). The results show that DHR, a retriever model utilising hierarchical passage selection through fine-tuning at both the document and passage levels, outperforms traditional methods in retrieving relevant technical information, achieving a Top-10 accuracy of 86.2%. Additionally, the Retriever-Augmented Generation (RAG) technique, used in the proposed QA system, is evaluated to demonstrate the benefits of using the hybrid dataset and the DHR. The proposed QA system, using the developed RAG model and the Generative Pretrained Transformer (GPT)-4, achieves a 14% improvement in answer accuracy, when compared to a previous benchmark on the same dataset.
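
The two retrieval metrics named in this abstract, top-K accuracy and Mean Reciprocal Rank, can be sketched as follows. This is a minimal illustration on hypothetical data; the passage IDs, the single-gold-passage assumption, and the function names are not taken from the paper.

```python
def top_k_accuracy(ranked_lists, gold, k):
    """Fraction of queries whose gold passage appears among the top k results."""
    hits = sum(1 for ranking, g in zip(ranked_lists, gold) if g in ranking[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold passage; a query with no hit contributes 0."""
    total = 0.0
    for ranking, g in zip(ranked_lists, gold):
        if g in ranking:
            total += 1.0 / (ranking.index(g) + 1)  # ranks are 1-based
    return total / len(gold)


# Hypothetical retriever output: one ranked list of passage IDs per query.
rankings = [
    ["p3", "p1", "p7"],  # gold passage p1 is at rank 2
    ["p2", "p9", "p4"],  # gold passage p2 is at rank 1
    ["p5", "p6", "p8"],  # gold passage p0 is not retrieved
]
gold = ["p1", "p2", "p0"]

top1 = top_k_accuracy(rankings, gold, k=1)
top2 = top_k_accuracy(rankings, gold, k=2)
mrr = mean_reciprocal_rank(rankings, gold)
```

Top-K accuracy only asks whether the gold passage was retrieved at all within the cutoff, whereas MRR also rewards placing it higher in the list.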

cross Data-Driven Cellular Network Selector for Vehicle Teleoperations

Authors: Barak Gahtan, Reuven Cohen, Alex M. Bronstein, Eli Shapira

Abstract: Remote control of robotic systems, also known as teleoperation, is crucial for the development of autonomous vehicle (AV) technology. It allows a remote operator to view live video from AVs and, in some cases, to make real-time decisions. The effectiveness of video-based teleoperation systems is heavily influenced by the quality of the cellular network and, in particular, its packet loss rate and latency. To optimize these parameters, an AV can be connected to multiple cellular networks and determine in real time over which cellular network each video packet will be transmitted. We present an algorithm, called Active Network Selector (ANS), which uses a time series machine learning approach for solving this problem. We compare ANS to a baseline non-learning algorithm, which is used today in commercial systems, and show that ANS performs much better, with respect to both packet loss and packet latency.

cross Substance Beats Style: Why Beginning Students Fail to Code with LLMs

Authors: Francesca Lucchetti, Zixuan Wu, Arjun Guha, Molly Q Feldman, Carolyn Jane Anderson

Abstract: Although LLMs are increasing the productivity of professional programmers, existing work shows that beginners struggle to prompt LLMs to solve text-to-code tasks. Why is this the case? This paper explores two competing hypotheses about the cause of student-LLM miscommunication: (1) students simply lack the technical vocabulary needed to write good prompts, and (2) students do not understand the extent of information that LLMs need to solve code generation tasks. We study (1) with a causal intervention experiment on technical vocabulary and (2) by analyzing graphs that abstract how students edit prompts and the different failures that they encounter. We find that substance beats style: a poor grasp of technical vocabulary is merely correlated with prompt failure; that the information content of prompts predicts success; that students get stuck making trivial edits; and more. Our findings have implications for the use of LLMs in programming education, and for efforts to make computing more accessible with LLMs.

cross DiffGAN: A Test Generation Approach for Differential Testing of Deep Neural Networks

Authors: Zohreh Aghababaeyan, Manel Abdellatif, Lionel Briand, Ramesh S

Abstract: Deep Neural Networks (DNNs) are increasingly deployed across applications. However, ensuring their reliability remains a challenge, and in many situations, alternative models with similar functionality and accuracy are available. Traditional accuracy-based evaluations often fail to capture behavioral differences between models, especially with limited test datasets, making it difficult to select or combine models effectively. Differential testing addresses this by generating test inputs that expose discrepancies in DNN model behavior. However, existing approaches face significant limitations: many rely on model internals or are constrained by available seed inputs. To address these challenges, we propose DiffGAN, a black-box test image generation approach for differential testing of DNN models. DiffGAN leverages a Generative Adversarial Network (GAN) and the Non-dominated Sorting Genetic Algorithm II to generate diverse and valid triggering inputs that reveal behavioral discrepancies between models. DiffGAN employs two custom fitness functions, focusing on diversity and divergence, to guide the exploration of the GAN input space and identify discrepancies between models' outputs. By strategically searching this space, DiffGAN generates inputs with specific features that trigger differences in model behavior. DiffGAN is black-box, making it applicable in more situations. We evaluate DiffGAN on eight DNN model pairs trained on widely used image datasets. Our results show DiffGAN significantly outperforms a SOTA baseline, generating four times more triggering inputs, with greater diversity and validity, within the same budget. Additionally, the generated inputs improve the accuracy of a machine learning-based model selection mechanism, which selects the best-performing model based on input characteristics and can serve as a smart output voting mechanism when using alternative models.

cross Feature Clipping for Uncertainty Calibration

Authors: Linwei Tao, Minjing Dong, Chang Xu

Abstract: Deep neural networks (DNNs) have achieved significant success across various tasks, but ensuring reliable uncertainty estimates, known as model calibration, is crucial for their safe and effective deployment. Modern DNNs often suffer from overconfidence, leading to miscalibration. We propose a novel post-hoc calibration method called feature clipping (FC) to address this issue. FC involves clipping feature values to a specified threshold, effectively increasing entropy in high calibration error samples while maintaining the information in low calibration error samples. This process reduces the overconfidence in predictions, improving the overall calibration of the model. Our extensive experiments on datasets such as CIFAR-10, CIFAR-100, and ImageNet, and models including CNNs and transformers, demonstrate that FC consistently enhances calibration performance. Additionally, we provide a theoretical analysis that validates the effectiveness of our method. As the first calibration technique based on feature modification, feature clipping offers a novel approach to improving model calibration, showing significant improvements over both post-hoc and train-time calibration methods and pioneering a new avenue for feature-based model calibration.
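
The idea of clipping features to reduce overconfidence can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' implementation: the upper-bound clipping rule, the threshold `c`, the weight matrix, and the feature values are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def clip_features(features, c):
    """Assumed clipping rule: cap every feature value at the threshold c."""
    return [min(f, c) for f in features]


def linear(weights, features):
    """Final classification layer: one logit per row of the weight matrix."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


# Hypothetical penultimate-layer features with one very large activation.
weights = [
    [1.0, 0.0, 0.0, 0.2],
    [0.0, 1.0, 0.0, 0.2],
    [0.0, 0.0, 1.0, 0.2],
]
features = [6.0, 0.5, 0.3, 1.0]

probs_before = softmax(linear(weights, features))
probs_after = softmax(linear(weights, clip_features(features, c=1.0)))

entropy_before = entropy(probs_before)
entropy_after = entropy(probs_after)
pred_before = probs_before.index(max(probs_before))
pred_after = probs_after.index(max(probs_after))
```

In this toy case the predicted class is unchanged while the entropy of the prediction rises, which is the post-hoc effect the abstract describes. In practice the threshold would presumably be tuned on held-out data.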

cross Radon Implicit Field Transform (RIFT): Learning Scenes from Radar Signals

Authors: Daqian Bao, Alex Saad-Falcon, Justin Romberg

Abstract: Data acquisition in array signal processing (ASP) is costly, as high angular and range resolutions require large antenna apertures and wide frequency bandwidths. Data requirements grow multiplicatively with viewpoints and frequencies, increasing collection burdens. Implicit Neural Representations (INRs)--neural network models of 3D scenes--offer compact, continuous representations with minimal data, interpolating to unseen viewpoints, potentially reducing sampling costs in ASP. We propose the Radon Implicit Field Transform (RIFT), combining a radar forward model (Generalized Radon Transform, GRT) with an INR-based scene representation learned from radar signals. This method extends to other ASP problems by replacing the GRT with appropriate algorithms. In experiments, we synthesize radar data using the GRT and train the INR model by minimizing radar signal reconstruction error. We render the scene using the trained INR and evaluate it against ground truth. We introduce new error metrics: phase-Root Mean Square Error (p-RMSE) and magnitude-Structural Similarity Index Measure (m-SSIM). Compared to traditional scene models, our RIFT model achieves up to 188% improvement in scene reconstruction with only 10% of the data. Using the same amount of data, RIFT achieves 3x better reconstruction and shows a 10% improvement when generalizing to unseen viewpoints.

cross The Useful Side of Motion: Using Head Motion Parameters to Correct for Respiratory Confounds in BOLD fMRI

Authors: Abdoljalil Addeh, G. Bruce Pike, M. Ethan MacDonald

Abstract: Acquiring accurate external respiratory data during functional Magnetic Resonance Imaging (fMRI) is challenging, prompting the exploration of machine learning methods to estimate respiratory variation (RV) from fMRI data. Respiration induces head motion, including real and pseudo motion, which likely provides useful information about respiratory events. Recommended notch filters mitigate respiratory-induced motion artifacts, suggesting that a bandpass filter at the respiratory frequency band isolates respiratory-induced head motion. This study seeks to enhance the accuracy of RV estimation from resting-state BOLD-fMRI data by integrating estimated head motion parameters. Specifically, we aim to determine the impact of incorporating raw versus bandpass-filtered head motion parameters on RV reconstruction accuracy using one-dimensional convolutional neural networks (1D-CNNs). This approach addresses the limitations of traditional filtering techniques and leverages the potential of head motion data to provide a more robust estimation of respiratory-induced variations.

cross Comparing Surface Landmine Object Detection Models on a New Drone Flyby Dataset

Authors: Navin Agrawal-Chung, Zohran Moin

Abstract: Landmine detection using traditional methods is slow, dangerous and prohibitively expensive. Using deep learning-based object detection algorithms on drone videos is promising but has multiple challenges due to the small, soda-can size of recently prevalent surface landmines. The literature currently lacks scientific evaluation of optimal ML models for this problem since most object detection research focuses on analysis of ground video surveillance images. In order to help train comprehensive models and drive research for surface landmine detection, we first create a custom dataset comprising drone images of POM-2 and POM-3 Russian surface landmines. Using this dataset, we train, test and compare 4 different computer vision foundation models: YOLOF, DETR, Sparse-RCNN and VFNet. Generally, all 4 detectors do well, with YOLOF outperforming other models with a mAP score of 0.89 while DETR, VFNet and Sparse-RCNN mAP scores are all around 0.82 for drone images taken from 10m AGL. YOLOF is also quicker to train, consuming 56min of training time on an Nvidia V100 compute cluster. Finally, this research contributes landmine image and video datasets and model Jupyter notebooks at https://github.com/UnVeilX/ to enable future research in surface landmine detection.

URLs: https://github.com/UnVeilX/

cross Training Compute-Optimal Vision Transformers for Brain Encoding

Authors: Sana Ahmadi, Francois Paugam, Tristan Glatard, Pierre Lune Bellec

Abstract: The optimal training of a vision transformer for brain encoding depends on three factors: model size, data size, and computational resources. This study investigates these three pillars, focusing on the effects of data scaling, model scaling, and high-performance computing on brain encoding results. Using VideoGPT to extract efficient spatiotemporal features from videos and training a Ridge model to predict brain activity based on these features, we conducted benchmark experiments with varying data sizes (10k, 100k, 1M, 6M) and different model configurations of GPT-2, including hidden layer dimensions, number of layers, and number of attention heads. We also evaluated the effects of training models with 32-bit vs 16-bit floating point representations. Our results demonstrate that increasing the hidden layer dimensions significantly improves brain encoding performance, as evidenced by higher Pearson correlation coefficients across all subjects. In contrast, the number of attention heads does not have a significant effect on the encoding results. Additionally, increasing the number of layers shows some improvement in brain encoding correlations, but the trend is not as consistent as that observed with hidden layer dimensions. The data scaling results show that larger training datasets lead to improved brain encoding performance, with the highest Pearson correlation coefficients observed for the largest dataset size (6M). These findings highlight that the effects of data scaling are more significant compared to model scaling in enhancing brain encoding performance. Furthermore, we explored the impact of floating-point precision by comparing 32-bit and 16-bit representations. Training with 16-bit precision yielded the same brain encoding accuracy as 32-bit, while reducing training time by 1.17 times, demonstrating its efficiency for high-performance computing tasks.

cross ControlAgent: Automating Control System Design via Novel Integration of LLM Agents and Domain Expertise

Authors: Xingang Guo, Darioush Keivan, Usman Syed, Lianhui Qin, Huan Zhang, Geir Dullerud, Peter Seiler, Bin Hu

Abstract: Control system design is a crucial aspect of modern engineering with far-reaching applications across diverse sectors including aerospace, automotive systems, power grids, and robotics. Despite advances made by Large Language Models (LLMs) in various domains, their application in control system design remains limited due to the complexity and specificity of control theory. To bridge this gap, we introduce ControlAgent, a new paradigm that automates control system design via novel integration of LLM agents and control-oriented domain expertise. ControlAgent encodes expert control knowledge and emulates human iterative design processes by gradually tuning controller parameters to meet user-specified requirements for stability, performance, and robustness. ControlAgent integrates multiple collaborative LLM agents, including a central agent responsible for task distribution and task-specific agents dedicated to detailed controller design for various types of systems and requirements. ControlAgent also employs a Python computation agent that performs complex calculations and controller evaluations based on standard design information provided by task-specific LLM agents. Combined with a history and feedback module, the task-specific LLM agents iteratively refine controller parameters based on real-time feedback from prior designs. Overall, ControlAgent mimics the design processes used by (human) practicing engineers, but removes all human effort and can be run in a fully automated way to give end-to-end solutions for control system design with user-specified requirements. To validate ControlAgent's effectiveness, we develop ControlEval, an evaluation dataset that comprises 500 control tasks with various specific design goals. The effectiveness of ControlAgent is demonstrated via extensive comparative evaluations between LLM-based and traditional human-involved toolbox-based baselines.

cross BUNDL: Bayesian Uncertainty-aware Deep Learning with Noisy training Labels for Seizure Detection in EEG

Authors: Deeksha M Shama, Archana Venkataraman

Abstract: Deep learning methods are at the forefront of automated epileptic seizure detection and onset zone localization using scalp-EEG. However, the performance of deep learning methods relies heavily on the quality of annotated training datasets. Scalp EEG is susceptible to high noise levels, which in turn leads to imprecise annotations of the seizure timing and characteristics. This label noise presents a significant challenge in model training and generalization. In this paper, we introduce a novel statistical framework that informs a deep learning model of label ambiguity, thereby enhancing the overall seizure detection performance. Our Bayesian UncertaiNty-aware Deep Learning, BUNDL, strategy offers a straightforward and model-agnostic method for training deep neural networks with noisy training labels that does not add any parameters to existing architectures. By integrating domain knowledge into the statistical framework, we derive a novel KL-divergence-based loss function that capitalizes on uncertainty to better learn seizure characteristics from scalp EEG. Additionally, we explore the impact of improved seizure detection on the task of automated onset zone localization. We validate BUNDL using a comprehensive simulated EEG dataset and two publicly available datasets, TUH and CHB-MIT. BUNDL consistently improves the performance of three base models on simulated data under seven types of label noise and three EEG signal-to-noise ratios. Similar improvements were observed in the real-world TUH and CHB-MIT datasets. Finally, we demonstrate that BUNDL improves the accuracy of seizure onset zone localization. BUNDL is specifically designed to address label ambiguities, enabling the training of reliable and trustworthy models for epilepsy evaluation.

cross UniMTS: Unified Pre-training for Motion Time Series

Authors: Xiyuan Zhang, Diyan Teng, Ranak Roy Chowdhury, Shuheng Li, Dezhi Hong, Rajesh K. Gupta, Jingbo Shang

Abstract: Motion time series collected from mobile and wearable devices such as smartphones and smartwatches offer significant insights into human behavioral patterns, with wide applications in healthcare, automation, IoT, and AR/XR due to their low-power, always-on nature. However, given security and privacy concerns, building large-scale motion time series datasets remains difficult, preventing the development of pre-trained models for human activity analysis. Typically, existing models are trained and tested on the same dataset, leading to poor generalizability across variations in device location, device mounting orientation and human activity type. In this paper, we introduce UniMTS, the first unified pre-training procedure for motion time series that generalizes across diverse device latent factors and activities. Specifically, we employ a contrastive learning framework that aligns motion time series with text descriptions enriched by large language models. This helps the model learn the semantics of time series to generalize across activities. Given the absence of large-scale motion time series data, we derive and synthesize time series from existing motion skeleton data with all-joint coverage. Spatio-temporal graph networks are utilized to capture the relationships across joints for generalization across different device locations. We further design rotation-invariant augmentation to make the model agnostic to changes in device mounting orientations. Our model shows exceptional generalizability across 18 motion time series classification benchmark datasets, outperforming the best baselines by 340% in the zero-shot setting, 16.3% in the few-shot setting, and 9.2% in the full-shot setting.

cross Automatic Classification of Sleep Stages from EEG Signals Using Riemannian Metrics and Transformer Networks

Authors: Mathieu Seraphim (GREYC), Alexis Lechervy (GREYC), Florian Yger (MILES, LAMSADE, LITIS, App - LITIS), Luc Brun (COMETE, UNICAEN), Olivier Etard (COMETE, UNICAEN)

Abstract: Purpose: In sleep medicine, assessing the evolution of a subject's sleep often involves the costly manual scoring of electroencephalographic (EEG) signals. In recent years, a number of Deep Learning approaches have been proposed to automate this process, mainly by extracting features from said signals. However, despite some promising developments in related problems, such as Brain-Computer Interfaces, analyses of the covariances between brain regions remain underutilized in sleep stage scoring. Methods: Expanding upon our previous work, we investigate the capabilities of SPDTransNet, a Transformer-derived network designed to classify sleep stages from EEG data through timeseries of covariance matrices. Furthermore, we present a novel way of integrating learned signal-wise features into said matrices without sacrificing their Symmetric Definite Positive (SPD) nature. Results: Through comparison with other State-of-the-Art models within a methodology optimized for class-wise performance, we achieve a level of performance at or beyond various State-of-the-Art models, both in single-dataset and - particularly - multi-dataset experiments. Conclusion: In this article, we prove the capabilities of our SPDTransNet model, particularly its adaptability to multi-dataset tasks, within the context of EEG sleep stage scoring - though it could easily be adapted to any classification task involving timeseries of covariance matrices.

cross Non-invasive Neural Decoding in Source Reconstructed Brain Space

Authors: Yonatan Gideoni, Ryan Charles Timms, Oiwi Parker Jones

Abstract: Non-invasive brainwave decoding is usually done using Magneto/Electroencephalography (MEG/EEG) sensor measurements as inputs. This makes combining datasets and building models with inductive biases difficult as most datasets use different scanners and the sensor arrays have a nonintuitive spatial structure. In contrast, fMRI scans are acquired directly in brain space, a voxel grid with a typical structured input representation. By using established techniques to reconstruct the sensors' sources' neural activity it is possible to decode from voxels for MEG data as well. We show that this enables spatial inductive biases, spatial data augmentations, better interpretability, zero-shot generalisation between datasets, and data harmonisation.

cross Contrastive random lead coding for channel-agnostic self-supervision of biosignals

Authors: Thea Br\"usch, Mikkel N. Schmidt, Tommy S. Alstr{\o}m

Abstract: Contrastive learning yields impressive results for self-supervision in computer vision. The approach relies on the creation of positive pairs, something which is often achieved through augmentations. However, for multivariate time series, effective augmentations can be difficult to design. Additionally, the number of input channels for biosignal datasets often varies from application to application, limiting the usefulness of large self-supervised models trained with specific channel configurations. Motivated by these challenges, we set out to investigate strategies for creation of positive pairs for channel-agnostic self-supervision of biosignals. We introduce contrastive random lead coding (CRLC), where random subsets of the input channels are used to create positive pairs, and compare with using augmentations and neighboring segments in time as positive pairs. We validate our approach by pre-training models on EEG and ECG data, and then fine-tuning them for downstream tasks. CRLC outperforms competing strategies in both scenarios in the channel-agnostic setting. For EEG, the approach additionally outperforms the state-of-the-art reference model. While the state-of-the-art reference model is superior in the ECG task, incorporating CRLC allows us to obtain comparable results. In conclusion, CRLC helps generalization across variable channel setups when training our channel-agnostic model.
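
Creating a positive pair from random channel subsets can be sketched as below. This is only an illustration of the general idea: drawing the two subsets as disjoint, the function name `random_lead_pair`, and the toy segment are assumptions, not details from the paper.

```python
import random


def random_lead_pair(segment, rng):
    """Split the channels of one segment into two random, non-empty subsets.

    segment: list of channels, each a list of samples from the same time window.
    Returns two views of the same recording that share no channel; a contrastive
    loss would then treat them as a positive pair.
    """
    n = len(segment)
    if n < 2:
        raise ValueError("need at least two channels to form a pair")
    indices = list(range(n))
    rng.shuffle(indices)
    cut = rng.randint(1, n - 1)  # both subsets keep at least one channel
    view_a = [segment[i] for i in sorted(indices[:cut])]
    view_b = [segment[i] for i in sorted(indices[cut:])]
    return view_a, view_b


# Hypothetical 4-channel segment with 5 samples per channel.
segment = [
    [0.1, 0.2, 0.3, 0.2, 0.1],
    [1.0, 0.9, 1.1, 1.0, 0.8],
    [-0.5, -0.4, -0.6, -0.5, -0.4],
    [2.0, 2.1, 1.9, 2.0, 2.2],
]
rng = random.Random(0)
view_a, view_b = random_lead_pair(segment, rng)
```

Because the subsets vary in size from draw to draw, an encoder trained on such views has to accept a variable number of channels, which fits the channel-agnostic setting the abstract targets.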

cross Artificial intelligence for partial differential equations in computational mechanics: A review

Authors: Yizheng Wang, Jinshuai Bai, Zhongya Lin, Qimin Wang, Cosmin Anitescu, Jia Sun, Mohammad Sadegh Eshaghi, Yuantong Gu, Xi-Qiao Feng, Xiaoying Zhuang, Timon Rabczuk, Yinghua Liu

Abstract: In recent years, artificial intelligence (AI) has become ubiquitous, empowering various fields. In particular, the integration of artificial intelligence and traditional science (AI for Science: Artificial intelligence for science) has attracted widespread attention. In AI for Science, using artificial intelligence algorithms to solve partial differential equations (AI for PDEs: Artificial intelligence for partial differential equations) has become a focal point in computational mechanics. The core of AI for PDEs is the fusion of data and partial differential equations (PDEs), which can solve almost any PDE. Therefore, this article provides a comprehensive review of the research on AI for PDEs, summarizing the existing algorithms and theories. The article discusses the applications of AI for PDEs in computational mechanics, including solid mechanics, fluid mechanics, and biomechanics. The existing AI for PDEs algorithms include those based on Physics-Informed Neural Networks (PINNs), Deep Energy Methods (DEM), Operator Learning, and Physics-Informed Neural Operator (PINO). AI for PDEs represents a new method of scientific simulation that provides approximate solutions to specific problems using large amounts of data and then fine-tunes them according to specific physical laws, avoiding the need to compute from scratch like traditional algorithms. Thus, AI for PDEs is the prototype for future foundation models in computational mechanics, capable of significantly accelerating traditional numerical algorithms.

cross A practical, fast method for solving sum-of-squares problems for very large polynomials

Authors: Daniel Keren, Margarita Osadchy, Roi Poranne

Abstract: Sum of squares (SOS) optimization is a powerful technique for solving problems where the positivity of a polynomial must be enforced. The common approach to solve an SOS problem is by relaxation to a Semidefinite Program (SDP). The main advantage of this transformation is that SDP is a convex problem for which efficient solvers are readily available. However, while considerable progress has been made in recent years, the standard approaches for solving SDPs are still known to scale poorly. Our goal is to devise an approach that can handle larger, more complex problems than is currently possible. The challenge indeed lies in how SDPs are commonly solved. State-Of-The-Art approaches rely on the interior point method, which requires the factorization of large matrices. We instead propose an approach inspired by polynomial neural networks, which exhibit excellent performance when optimized using techniques from the deep learning toolbox. In a somewhat counter-intuitive manner, we replace the convex SDP formulation with a non-convex, unconstrained, and \emph{over parameterized} formulation, and solve it using a first order optimization method. It turns out that this approach can handle very large problems, with polynomials having over four million coefficients, well beyond the range of current SDP-based approaches. Furthermore, we highlight theoretical and practical results supporting the experimental success of our approach in avoiding spurious local minima, which makes it amenable to simple and fast solutions based on gradient descent. In all the experiments, our approach always converged to a correct global minimum, on general (non-sparse) polynomials, with running time only slightly higher than linear in the number of polynomial coefficients, compared to higher than quadratic in the number of coefficients for SDP-based methods.

cross Enhancing Trust and Safety in Digital Payments: An LLM-Powered Approach

Authors: Devendra Dahiphale (Google, Inc), Naveen Madiraju (Google, Inc), Justin Lin (Google, Inc), Rutvik Karve (Google, Inc), Monu Agrawal (Google, Inc), Anant Modwal (Google, Inc), Ramanan Balakrishnan (Google, Inc), Shanay Shah (Google, Inc), Govind Kaushal (Google, Inc), Priya Mandawat (Google, Inc), Prakash Hariramani (Google, Inc), Arif Merchant (Google, Inc)

Abstract: Digital payment systems have revolutionized financial transactions, offering unparalleled convenience and accessibility to users worldwide. However, the increasing popularity of these platforms has also attracted malicious actors seeking to exploit their vulnerabilities for financial gain. To address this challenge, robust and adaptable scam detection mechanisms are crucial for maintaining the trust and safety of digital payment ecosystems. This paper presents a comprehensive approach to scam detection, focusing on the Unified Payments Interface (UPI) in India, with Google Pay (GPay) as a specific use case. The approach leverages Large Language Models (LLMs) to enhance scam classification accuracy and designs a digital assistant to aid human reviewers in identifying and mitigating fraudulent activities. The results demonstrate the potential of LLMs in augmenting existing machine learning models and improving the efficiency, accuracy, quality, and consistency of scam reviews, ultimately contributing to a safer and more secure digital payment landscape. Our evaluation of the Gemini Ultra model on curated transaction data showed a 93.33% accuracy in scam classification. Furthermore, the model demonstrated 89% accuracy in generating reasoning for these classifications. Promisingly, the model identified 32% new accurate reasons for suspected scams that human reviewers had not included in the review notes.

cross AEPL: Automated and Editable Prompt Learning for Brain Tumor Segmentation

Authors: Yongheng Sun, Mingxia Liu, Chunfeng Lian

Abstract: Brain tumor segmentation is crucial for accurate diagnosis and treatment planning, but the small size and irregular shape of tumors pose significant challenges. Existing methods often fail to effectively incorporate medical domain knowledge such as tumor grade, which correlates with tumor aggressiveness and morphology, providing critical insights for more accurate detection of tumor subregions during segmentation. We propose an Automated and Editable Prompt Learning (AEPL) framework that integrates tumor grade into the segmentation process by combining multi-task learning and prompt learning with automatic and editable prompt generation. Specifically, AEPL employs an encoder to extract image features for both tumor-grade prediction and segmentation mask generation. The predicted tumor grades serve as auto-generated prompts, guiding the decoder to produce precise segmentation masks. This eliminates the need for manual prompts while allowing clinicians to manually edit the auto-generated prompts to fine-tune the segmentation, enhancing both flexibility and precision. The proposed AEPL achieves state-of-the-art performance on the BraTS 2018 dataset, demonstrating its effectiveness and clinical potential. The source code can be accessed online.

cross Dynamic User Grouping based on Location and Heading in 5G NR Systems

Authors: Dino Pjanić, Korkut Emre Arslantürk, Xuesong Cai, Fredrik Tufvesson

Abstract: User grouping based on geographic location in fifth generation (5G) New Radio (NR) systems has several applications that can significantly improve network performance, user experience, and service delivery. We demonstrate how Sounding Reference Signals channel fingerprints can be used for dynamic user grouping in a 5G NR commercial deployment based on outdoor positions and heading direction employing machine learning methods such as neural networks combined with clustering methods.

cross Personalized Recommendation Systems using Multimodal, Autonomous, Multi Agent Systems

Authors: Param Thakkar, Anushka Yadav

Abstract: This paper describes a highly developed personalised recommendation system using multimodal, autonomous, multi-agent systems. The system focuses on the incorporation of futuristic AI tech and LLMs like Gemini-1.5-pro and LLaMA-70B to improve customer service experiences, especially within e-commerce. Our approach uses multi-agent, multimodal systems to provide the best possible recommendations to its users. The system is made up of three agents. The first agent recommends products appropriate for answering the given question, the second asks follow-up questions based on images that belong to these recommended products, and the third performs an autonomous search. The system also features real-time data fetching, recommendations based on user preferences, and adaptive learning. For complicated queries, the application processes requests with Symphony and uses the Groq API to answer quickly with low response times. It uses text and images together in a multimodal way to optimize product recommendation and customer interaction.

cross Breaking the Illusion: Real-world Challenges for Adversarial Patches in Object Detection

Authors: Jakob Shack, Katarina Petrovic, Olga Saukh

Abstract: Adversarial attacks pose a significant threat to the robustness and reliability of machine learning systems, particularly in computer vision applications. This study investigates the performance of adversarial patches for the YOLO object detection network in the physical world. Two attacks were tested: a global patch, designed to be placed anywhere within the scene, and a local patch, intended to partially overlap with a specific object targeted for removal from detection. Various factors such as patch size, position, rotation, brightness, and hue were analyzed to understand their impact on the effectiveness of the adversarial patches. The results reveal a notable dependency on these parameters, highlighting the challenges in maintaining attack efficacy in real-world conditions. Learning to align digitally applied transformation parameters with those measured in the real world still results in up to a 64% discrepancy in patch performance. These findings underscore the importance of understanding environmental influences on adversarial attacks, which can inform the development of more robust defenses for practical machine learning applications.

cross Predicting potato plant vigor from the seed tuber properties

Authors: Elisa Atza, Rob Klooster, Falko Hofstra, Frank van der Werff, Hans van Doorn, Neil Budko

Abstract: The vigor of potato plants, defined as the canopy area at the end of the exponential growth stage, depends on the origin and physiological state of the seed tuber. Experiments carried out with six potato varieties in three test fields over three years show that there is a 73%-90% correlation in the vigor of the plants from the same seedlot grown in different test fields. However, these correlations are not always observed on the level of individual varieties and vanish or become negative when the seed tubers and young plants experience environmental stress. A comprehensive study of the association between the vigor and the seed tuber biochemistry has revealed that, while 50%-70% of the variation in the plant vigor is explained by the tuber data, the vigor is dominated by the potato genotype. Analysis of individual predictors, such as the abundance of a particular metabolite, indicates that the vigor enhancing properties of the seed tubers differ between genotypes. Variety-specific models show that, for some varieties, up to 30% of the vigor variation within the variety is explained by and can be predicted from the tuber biochemistry, whereas, for other varieties, the association between the tuber composition and the vigor is much weaker.

cross Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies

Authors: Luping Wang, Sheng Chen, Linnan Jiang, Shu Pan, Runze Cai, Sen Yang, Fei Yang

Abstract: Large models, as predicted by scaling laws, have made groundbreaking progress in many fields, particularly in natural language generation tasks, where they have approached or even surpassed human levels. However, the unprecedented scale of their parameters brings significant computational and storage costs. These large models require substantial computational resources and GPU memory to operate. When adapting large models to specific downstream tasks, their massive parameter scale poses a significant challenge in fine-tuning on hardware platforms with limited computational power and GPU memory. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) offers a practical solution by efficiently adjusting the parameters of large pre-trained models to suit various downstream tasks. Specifically, PEFT adjusts the parameters of pre-trained large models to adapt to specific tasks or domains, minimizing the introduction of additional parameters and the computational resources required. This review mainly introduces the preliminary knowledge of PEFT, the core ideas and principles of various PEFT algorithms, the applications of PEFT, and potential future research directions. By reading this review, we believe that interested parties can quickly grasp the PEFT methodology, thereby accelerating its development and innovation.

cross Critical biblical studies via word frequency analysis: unveiling text authorship

Authors: Shira Faigenbaum-Golovin, Alon Kipnis, Axel Bühler, Eli Piasetzky, Thomas Römer, Israel Finkelstein

Abstract: The Bible, a product of an extensive and intricate process of oral-written transmission spanning centuries, obscures the contours of its earlier recensions. Debate rages over determining the existing layers and identifying the date of composition and historical background of the biblical texts. Traditional manual methodologies have grappled with authorship challenges through scrupulous textual criticism, employing linguistic, stylistic, inner-biblical, and historical criteria. Despite recent progress in computer-assisted analysis, many patterns still need to be uncovered in biblical texts. In this study, we address the question of authorship of biblical texts by applying statistical analysis to the frequency of words, using a method that is particularly sensitive to deviations in frequencies associated with a few words out of potentially many. We aim to differentiate between three distinct authors across numerous chapters spanning the first nine books of the Bible. In particular, we examine 50 chapters labeled according to biblical exegesis considerations into three corpora (D, DtrH, and P). Without prior assumptions about author identity, our approach leverages subtle differences in word frequencies to distinguish among the three corpora and identify author-dependent linguistic properties. Our analysis indicates that the first two authors (D and DtrH) are much more closely related to each other than to P, a fact that aligns with expert assessments. Additionally, we attain high accuracy in attributing authorship by evaluating the similarity of each chapter with the reference corpora. This study sheds new light on the authorship of biblical texts by providing interpretable, statistically significant evidence that there are different linguistic characteristics of biblical authors and that these differences can be identified.

cross Ensembling Finetuned Language Models for Text Classification

Authors: Sebastian Pineda Arango, Maciej Janowski, Lennart Purucker, Arber Zela, Frank Hutter, Josif Grabocka

Abstract: Finetuning is a common practice widespread across different communities to adapt pretrained models to particular tasks. Text classification is one of these tasks for which many pretrained models are available. On the other hand, ensembles of neural networks are typically used to boost performance and provide reliable uncertainty estimates. However, ensembling pretrained models for text classification is not a well-studied avenue. In this paper, we present a metadataset with predictions from five large finetuned models on six datasets, and report results of different ensembling strategies from these predictions. Our results shed light on how ensembling can improve the performance of finetuned text classifiers and incentivize future adoption of ensembles in such tasks.
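
As a rough illustration of ensembling from stored predictions, the sketch below averages class probabilities across models and takes the arg-max. The abstract does not name the strategies compared, so uniform averaging is an assumption here, and the function name `average_ensemble` and the toy numbers are hypothetical.

```python
import numpy as np

def average_ensemble(probs):
    """Uniformly average class probabilities over models.

    probs: array of shape (n_models, n_samples, n_classes) holding each
    finetuned model's stored predictions (hypothetical layout).
    Returns the predicted labels and the averaged probabilities.
    """
    probs = np.asarray(probs, dtype=float)
    mean_probs = probs.mean(axis=0)  # average over the model axis
    return mean_probs.argmax(axis=1), mean_probs

# Toy predictions from three models on two samples with two classes.
preds = np.array([
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.6, 0.4], [0.7, 0.3]],
    [[0.2, 0.8], [0.2, 0.8]],
])
labels, mean_probs = average_ensemble(preds)
print(labels)
```

Because the ensemble works only on saved predictions, different combination rules can be compared without rerunning any finetuned model.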

cross Collaborative Inference over Wireless Channels with Feature Differential Privacy

Authors: Mohamed Seif, Yuqi Nie, Andrea J. Goldsmith, H. Vincent Poor

Abstract: Collaborative inference among multiple wireless edge devices has the potential to significantly enhance Artificial Intelligence (AI) applications, particularly for sensing and computer vision. This approach typically involves a three-stage process: a) data acquisition through sensing, b) feature extraction, and c) feature encoding for transmission. However, transmitting the extracted features poses a significant privacy risk, as sensitive personal data can be exposed during the process. To address this challenge, we propose a novel privacy-preserving collaborative inference mechanism, wherein each edge device in the network secures the privacy of extracted features before transmitting them to a central server for inference. Our approach is designed to achieve two primary objectives: 1) reducing communication overhead and 2) ensuring strict privacy guarantees during feature transmission, while maintaining effective inference performance. Additionally, we introduce an over-the-air pooling scheme specifically designed for classification tasks, which provides formal guarantees on the privacy of transmitted features and establishes a lower bound on classification accuracy.

cross Method for noise-induced regularization in quantum neural networks

Authors: Wilfrid Somogyi, Ekaterina Pankovets, Viacheslav Kuzmin, Alexey Melnikov

Abstract: In the current quantum computing paradigm, significant focus is placed on the reduction or mitigation of quantum decoherence. When designing new quantum processing units, the general objective is to reduce the amount of noise qubits are subject to, and in algorithm design, a large effort is underway to provide scalable error correction or mitigation techniques. Yet some previous work has indicated that certain classes of quantum algorithms, such as quantum machine learning, may, in fact, be intrinsically robust to or even benefit from the presence of a small amount of noise. Here, we demonstrate that noise levels in quantum hardware can be effectively tuned to enhance the ability of quantum neural networks to generalize data, acting akin to regularisation in classical neural networks. As an example, we consider a medical regression task, where, by tuning the noise level in the circuit, we improved the mean squared error loss by 8%.

cross Language Agents Meet Causality -- Bridging LLMs and Causal World Models

Authors: John Gkountouras, Matthias Lindemann, Phillip Lippe, Efstratios Gavves, Ivan Titov

Abstract: Large Language Models (LLMs) have recently shown great promise in planning and reasoning applications. These tasks demand robust systems, which arguably require a causal understanding of the environment. While LLMs can acquire and reflect common sense causal knowledge from their pretraining data, this information is often incomplete, incorrect, or inapplicable to a specific environment. In contrast, causal representation learning (CRL) focuses on identifying the underlying causal structure within a given environment. We propose a framework that integrates CRLs with LLMs to enable causally-aware reasoning and planning. This framework learns a causal world model, with causal variables linked to natural language expressions. This mapping provides LLMs with a flexible interface to process and generate descriptions of actions and states in text form. Effectively, the causal world model acts as a simulator that the LLM can query and interact with. We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities. Our experiments demonstrate the effectiveness of the approach, with the causally-aware method outperforming LLM-based reasoners, especially for longer planning horizons.

cross Improving Multimodal Large Language Models Using Continual Learning

Authors: Shikhar Srivastava, Md Yousuf Harun, Robik Shrestha, Christopher Kanan

Abstract: Generative large language models (LLMs) exhibit impressive capabilities, which can be further augmented by integrating a pre-trained vision model into the original LLM to create a multimodal LLM (MLLM). However, this integration often significantly decreases performance on natural language understanding and generation tasks, compared to the original LLM. This study investigates this issue using the LLaVA MLLM, treating the integration as a continual learning problem. We evaluate five continual learning methods to mitigate forgetting and identify a technique that enhances visual understanding while minimizing linguistic performance loss. Our approach reduces linguistic performance degradation by up to 15% over the LLaVA recipe, while maintaining high multimodal accuracy. We also demonstrate the robustness of our method through continual learning on a sequence of vision-language tasks, effectively preserving linguistic skills while acquiring new multimodal capabilities.

cross Statistical Inference in Classification of High-dimensional Gaussian Mixture

Authors: Hanwen Huang, Peng Zeng

Abstract: We consider the classification problem of a high-dimensional mixture of two Gaussians with general covariance matrices. Using the replica method from statistical physics, we investigate the asymptotic behavior of a general class of regularized convex classifiers in the high-dimensional limit, where both the sample size $n$ and the dimension $p$ approach infinity while their ratio $\alpha=n/p$ remains fixed. Our focus is on the generalization error and variable selection properties of the estimators. Specifically, based on the distributional limit of the classifier, we construct a de-biased estimator to perform variable selection through an appropriate hypothesis testing procedure. Using $L_1$-regularized logistic regression as an example, we conducted extensive computational experiments to confirm that our analytical findings are consistent with numerical simulations in finite-sized systems. We also explore the influence of the covariance structure on the performance of the de-biased estimator.

cross On-Robot Reinforcement Learning with Goal-Contrastive Rewards

Authors: Ondrej Biza, Thomas Weng, Lingfeng Sun, Karl Schmeckpeper, Tarik Kelestemur, Yecheng Jason Ma, Robert Platt, Jan-Willem van de Meent, Lawson L. S. Wong

Abstract: Reinforcement Learning (RL) has the potential to enable robots to learn from their own actions in the real world. Unfortunately, RL can be prohibitively expensive, in terms of on-robot runtime, due to inefficient exploration when learning from a sparse reward signal. Designing dense reward functions is labour-intensive and requires domain expertise. In our work, we propose GCR (Goal-Contrastive Rewards), a dense reward function learning method that can be trained on passive video demonstrations. By using videos without actions, our method is easier to scale, as we can use arbitrary videos. GCR combines two loss functions, an implicit value loss function that models how the reward increases when traversing a successful trajectory, and a goal-contrastive loss that discriminates between successful and failed trajectories. We perform experiments in simulated manipulation environments across RoboMimic and MimicGen tasks, as well as in the real world using a Franka arm and a Spot quadruped. We find that GCR leads to more sample-efficient RL, enabling model-free RL to solve about twice as many tasks as our baseline reward learning methods. We also demonstrate positive cross-embodiment transfer from videos of people and of other robots performing a task. Appendix: https://tinyurl.com/gcr-appendix-2

URLs: https://tinyurl.com/gcr-appendix-2

cross Dimension reduction via score ratio matching

Authors: Ricardo Baptista, Michael Brennan, Youssef Marzouk

Abstract: Gradient-based dimension reduction decreases the cost of Bayesian inference and probabilistic modeling by identifying maximally informative (and informed) low-dimensional projections of the data and parameters, allowing high-dimensional problems to be reformulated as cheaper low-dimensional problems. A broad family of such techniques identify these projections and provide error bounds on the resulting posterior approximations, via eigendecompositions of certain diagnostic matrices. Yet these matrices require gradients or even Hessians of the log-likelihood, excluding the purely data-driven setting and many problems of simulation-based inference. We propose a framework, derived from score-matching, to extend gradient-based dimension reduction to problems where gradients are unavailable. Specifically, we formulate an objective function to directly learn the score ratio function needed to compute the diagnostic matrices, propose a tailored parameterization for the score ratio network, and introduce regularization methods that capitalize on the hypothesized low-dimensional structure. We also introduce a novel algorithm to iteratively identify the low-dimensional reduced basis vectors more accurately with limited data based on eigenvalue deflation methods. We show that our approach outperforms standard score-matching for problems with low-dimensional structure, and demonstrate its effectiveness for PDE-constrained Bayesian inverse problems and conditional generative modeling.

cross Unsupervised Machine Learning for Detecting and Locating Human-Made Objects in 3D Point Cloud

Authors: Hong Zhao, Huyunting Huang, Tonglin Zhang, Baijian Yang, Jin Wei-Kocsis, Songlin Fei

Abstract: A 3D point cloud is an unstructured, sparse, and irregular dataset, typically collected by airborne LiDAR systems over a geological region. Laser pulses emitted from these systems reflect off objects both on and above the ground, resulting in a dataset containing the longitude, latitude, and elevation of each point, as well as information about the corresponding laser pulse strengths. A widely studied research problem, addressed in many previous works, is ground filtering, which involves partitioning the points into ground and non-ground subsets. This research introduces a novel task: detecting and identifying human-made objects amidst natural tree structures. This task is performed on the subset of non-ground points derived from the ground filtering stage. Marked Point Fields (MPFs) are used as models well-suited to these tasks. The proposed methodology consists of three stages: ground filtering, local information extraction (LIE), and clustering. In the ground filtering stage, a statistical method called One-Sided Regression (OSR) is introduced, addressing the limitations of prior ground filtering methods on uneven terrains. In the LIE stage, unsupervised learning methods are lacking. To mitigate this, a kernel-based method for the Hessian matrix of the MPF is developed. In the clustering stage, the Gaussian Mixture Model (GMM) is applied to the results of the LIE stage to partition the non-ground points into trees and human-made objects. The underlying assumption is that LiDAR points from trees exhibit a three-dimensional distribution, while those from human-made objects follow a two-dimensional distribution. The Hessian matrix of the MPF effectively captures this distinction. Experimental results demonstrate that the proposed ground filtering method outperforms previous techniques, and the LIE method successfully distinguishes between points representing trees and human-made objects.

cross Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models

Authors: Zheng Zhao, Yftah Ziser, Shay B. Cohen

Abstract: Fine-tuning pre-trained large language models (LLMs) on a diverse array of tasks has become a common approach for building models that can solve various natural language processing (NLP) tasks. However, where and to what extent these models retain task-specific knowledge remains largely unexplored. This study investigates the task-specific information encoded in pre-trained LLMs and the effects of instruction tuning on their representations across a diverse set of over 60 NLP tasks. We use a set of matrix analysis tools to examine the differences between the way pre-trained and instruction-tuned LLMs store task-specific information. Our findings reveal that while some tasks are already encoded within the pre-trained LLMs, others greatly benefit from instruction tuning. Additionally, we pinpointed the layers in which the model transitions from high-level general representations to more task-oriented representations. This finding extends our understanding of the governing mechanisms of LLMs and facilitates future research in the fields of parameter-efficient transfer learning and multi-task learning.

cross GHIL-Glue: Hierarchical Control with Filtered Subgoal Images

Authors: Kyle B. Hatch, Ashwin Balakrishna, Oier Mees, Suraj Nair, Seohong Park, Blake Wulfe, Masha Itkina, Benjamin Eysenbach, Sergey Levine, Thomas Kollar, Benjamin Burchfiel

Abstract: Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photorealistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively "glue together" language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. We find in extensive experiments in both simulated and real environments that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, achieving a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.

cross Dynamic layer selection in decoder-only transformers

Authors: Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates

Abstract: The vast size of Large Language Models (LLMs) has prompted a search to optimize inference. One effective approach is dynamic inference, which adapts the architecture to the sample-at-hand to reduce the overall computational cost. We empirically examine two common dynamic inference methods for natural language generation (NLG): layer skipping and early exiting. We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping, as opposed to early exit. We demonstrate the difficulty of using hidden state information to adapt computation on a per-token basis for layer skipping. Finally, we show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains by constructing an oracle controller. Remarkably, we find that there exists an allocation which achieves equal performance to the full model using only 23.3% of its layers on average.

cross Architectural Flaw Detection in Civil Engineering Using GPT-4

Authors: Saket Kumar, Abul Ehtesham, Aditi Singh, Tala Talaei Khoei

Abstract: The application of artificial intelligence (AI) in civil engineering presents a transformative approach to enhancing design quality and safety. This paper investigates the potential of the advanced LLM GPT4 Turbo vision model in detecting architectural flaws during the design phase, with a specific focus on identifying missing doors and windows. The study evaluates the model's performance through metrics such as precision, recall, and F1 score, demonstrating AI's effectiveness in accurately detecting flaws compared to human-verified data. Additionally, the research explores AI's broader capabilities, including identifying load-bearing issues, material weaknesses, and ensuring compliance with building codes. The findings highlight how AI can significantly improve design accuracy, reduce costly revisions, and support sustainable practices, ultimately revolutionizing the civil engineering field by ensuring safer, more efficient, and aesthetically optimized structures.

cross ResAD: A Simple Framework for Class Generalizable Anomaly Detection

Authors: Xincheng Yao, Zixin Chen, Chao Gao, Guangtao Zhai, Chongyang Zhang

Abstract: This paper explores the problem of class-generalizable anomaly detection, where the objective is to train one unified AD model that can generalize to detect anomalies in diverse classes from different domains without any retraining or fine-tuning on the target data. Because normal feature representations vary significantly across classes, the widely studied one-for-one AD models are poorly class-generalizable (i.e., performance drops dramatically when used for new classes). In this work, we propose a simple but effective framework (called ResAD) that can be directly applied to detect anomalies in new classes. Our main insight is to learn the residual feature distribution rather than the initial feature distribution. In this way, we can significantly reduce feature variations. Even in new classes, the distribution of normal residual features would not remarkably shift from the learned distribution. Therefore, the learned model can be directly adapted to new classes. ResAD consists of three components: (1) a Feature Converter that converts initial features into residual features; (2) a simple and shallow Feature Constraintor that constrains normal residual features into a spatial hypersphere for further reducing feature variations and maintaining consistency in feature scales among different classes; (3) a Feature Distribution Estimator that estimates the normal residual feature distribution, so that anomalies can be recognized as out-of-distribution. Despite its simplicity, ResAD can achieve remarkable anomaly detection results when directly used in new classes. The code is available at https://github.com/xcyao00/ResAD.

URLs: https://github.com/xcyao00/ResAD

cross Super-resolved virtual staining of label-free tissue using diffusion models

Authors: Yijie Zhang, Luzhe Huang, Nir Pillar, Yuzhu Li, Hanlong Chen, Aydogan Ozcan

Abstract: Virtual staining of tissue offers a powerful tool for transforming label-free microscopy images of unstained tissue into equivalents of histochemically stained samples. This study presents a diffusion model-based super-resolution virtual staining approach utilizing a Brownian bridge process to enhance both the spatial resolution and fidelity of label-free virtual tissue staining, addressing the limitations of traditional deep learning-based methods. Our approach integrates novel sampling techniques into a diffusion model-based image inference process to significantly reduce the variance in the generated virtually stained images, resulting in more stable and accurate outputs. Blindly applied to lower-resolution auto-fluorescence images of label-free human lung tissue samples, the diffusion-based super-resolution virtual staining model consistently outperformed conventional approaches in resolution, structural similarity and perceptual accuracy, successfully achieving a super-resolution factor of 4-5x, increasing the output space-bandwidth product by 16-25-fold compared to the input label-free microscopy images. Diffusion-based super-resolved virtual tissue staining not only improves resolution and image quality but also enhances the reliability of virtual staining without traditional chemical staining, offering significant potential for clinical diagnostics.
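
The two ranges quoted above are consistent under a simple assumption: for a two-dimensional image with a fixed field of view, the space-bandwidth product (SBP) scales with the number of resolvable points, so a resolution gain of $k$ along each lateral axis multiplies the SBP by $k^2$. The symbol $k$ is introduced here for illustration only.

```latex
\[
\frac{\mathrm{SBP}_{\text{out}}}{\mathrm{SBP}_{\text{in}}} = k^{2},
\qquad
k = 4 \;\Rightarrow\; 4^{2} = 16,
\qquad
k = 5 \;\Rightarrow\; 5^{2} = 25 .
\]
```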

cross ISDNN: A Deep Neural Network for Channel Estimation in Massive MIMO systems

Authors: Do Hai Son, Vu Tung Lam, Tran Thi Thuy Quynh

Abstract: Massive Multiple-Input Multiple-Output (massive MIMO) technology stands as a cornerstone of 5G and beyond. Despite the remarkable advancements offered by massive MIMO technology, the extreme number of antennas introduces challenges during the channel estimation (CE) phase. In this paper, we propose a single-step Deep Neural Network (DNN) for CE, termed Iterative Sequential DNN (ISDNN), inspired by recent developments in data detection algorithms. ISDNN is a DNN based on the projected gradient descent algorithm for CE problems, with the iterations transformed into a DNN using the deep unfolding method. Furthermore, we introduce the structured channel ISDNN (S-ISDNN), extending ISDNN to incorporate side information such as directions of signals and antenna array configurations for enhanced CE. Simulation results highlight that ISDNN significantly outperforms another DNN-based CE (DetNet), in terms of training time (13%), running time (4.6%), and accuracy (0.43 dB). Furthermore, S-ISDNN is even faster than ISDNN in terms of training time, though its overall performance still requires further improvement.

cross On-Site Precise Screening of SARS-CoV-2 Systems Using a Channel-Wise Attention-Based PLS-1D-CNN Model with Limited Infrared Signatures

Authors: Wenwen Zhang, Zhouzhuo Tang, Yingmei Feng, Xia Yu, Qi Jie Wang, Zhiping Lin

Abstract: During the early stages of respiratory virus outbreaks, such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the efficient use of limited nasopharyngeal swabs for rapid and accurate screening is crucial for public health. In this study, we present a methodology that integrates attenuated total reflection-Fourier transform infrared spectroscopy (ATR-FTIR) with the adaptive iteratively reweighted penalized least squares (airPLS) preprocessing algorithm and a channel-wise attention-based partial least squares one-dimensional convolutional neural network (PLS-1D-CNN) model, enabling accurate screening of infected individuals within 10 minutes. Two cohorts of nasopharyngeal swab samples, comprising 126 and 112 samples from suspected SARS-CoV-2 Omicron variant cases, were collected at Beijing You'an Hospital for verification. Given that ATR-FTIR spectra are highly sensitive to variations in experimental conditions, which can affect their quality, we propose a biomolecular importance (BMI) evaluation method to assess signal quality across different conditions, validated by comparing BMI with PLS-GBM and PLS-RF results. For the ATR-FTIR signals in cohort 2, which exhibited a higher BMI, airPLS was utilized for signal preprocessing, followed by the application of the channel-wise attention-based PLS-1D-CNN model for screening. The experimental results demonstrate that our model outperforms recently reported methods in the field of respiratory virus spectrum detection, achieving a recognition screening accuracy of 96.48%, a sensitivity of 96.24%, a specificity of 97.14%, an F1-score of 96.12%, and an AUC of 0.99. It meets the World Health Organization (WHO) recommended criteria for an acceptable product: sensitivity of 95.00% or greater and specificity of 97.00% or greater for testing prior SARS-CoV-2 infection in moderate to high volume scenarios.

cross Near-Optimal Streaming Heavy-Tailed Statistical Estimation with Clipped SGD

Authors: Aniket Das, Dheeraj Nagaraj, Soumyabrata Pal, Arun Suggala, Prateek Varshney

Abstract: We consider the problem of high-dimensional heavy-tailed statistical estimation in the streaming setting, which is much harder than the traditional batch setting due to memory constraints. We cast this problem as stochastic convex optimization with heavy-tailed stochastic gradients, and prove that the widely used Clipped-SGD algorithm attains near-optimal sub-Gaussian statistical rates whenever the second moment of the stochastic gradient noise is finite. More precisely, with $T$ samples, we show that Clipped-SGD, for smooth and strongly convex objectives, achieves an error of $\sqrt{\frac{\mathsf{Tr}(\Sigma)+\sqrt{\mathsf{Tr}(\Sigma)\|\Sigma\|_2}\log(\frac{\log(T)}{\delta})}{T}}$ with probability $1-\delta$, where $\Sigma$ is the covariance of the clipped gradient. Note that the fluctuations (depending on $\frac{1}{\delta}$) are of lower order than the term $\mathsf{Tr}(\Sigma)$. This improves upon the current best rate of $\sqrt{\frac{\mathsf{Tr}(\Sigma)\log(\frac{1}{\delta})}{T}}$ for Clipped-SGD, known only for smooth and strongly convex objectives. Our results also extend to smooth convex and Lipschitz convex objectives. Key to our result is a novel iterative refinement strategy for martingale concentration, improving upon the PAC-Bayes approach of Catoni and Giulini.
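
To make the streaming setting concrete, here is a minimal sketch of clipped SGD applied to mean estimation, cast as minimizing $f(x) = \frac{1}{2}\mathbb{E}\|x - z\|^2$. The $1/t$ step size, the fixed clipping threshold, and the symmetric Pareto noise are illustrative assumptions, not the schedule or setup analysed in the paper, and all function names are hypothetical.

```python
import math
import random


def clip(vec, threshold):
    """Rescale vec so that its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= threshold:
        return list(vec)
    return [v * threshold / norm for v in vec]


def clipped_sgd_mean(samples, dim, threshold):
    """Streaming mean estimate via clipped SGD on f(x) = 0.5 * E||x - z||^2.

    Only the current iterate is stored, so memory is O(dim).
    The 1/t step size is a standard choice for a 1-strongly-convex
    objective (an assumption, not the paper's schedule).
    """
    x = [0.0] * dim
    for t, z in enumerate(samples, start=1):
        grad = [xi - zi for xi, zi in zip(x, z)]  # stochastic gradient at x
        g = clip(grad, threshold)                 # bound the update size
        x = [xi - gi / t for xi, gi in zip(x, g)]
    return x


def heavy_tailed_stream(mean, n, rng, alpha=2.5):
    """Yield n samples: mean plus symmetric Pareto noise (finite variance)."""
    for _ in range(n):
        yield [m + rng.choice((-1.0, 1.0)) * rng.paretovariate(alpha)
               for m in mean]
```

A call such as `clipped_sgd_mean(heavy_tailed_stream([3.0, -1.0], 20000, random.Random(0)), 2, 10.0)` consumes the stream one sample at a time and never holds more than the current iterate.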

cross CodePurify: Defend Backdoor Attacks on Neural Code Models via Entropy-based Purification

Authors: Fangwen Mu, Junjie Wang, Zhuohao Yu, Lin Shi, Song Wang, Mingyang Li, Qing Wang

Abstract: Neural code models have found widespread success in tasks pertaining to code intelligence, yet they are vulnerable to backdoor attacks, where an adversary can manipulate the victim model's behavior by inserting triggers into the source code. Recent studies indicate that advanced backdoor attacks can achieve nearly 100% attack success rates on many software engineering tasks. However, effective defense techniques against such attacks remain insufficiently explored. In this study, we propose CodePurify, a novel defense against backdoor attacks on code models through entropy-based purification. Entropy-based purification involves the process of precisely detecting and eliminating the possible triggers in the source code while preserving its semantic information. Within this process, CodePurify first develops a confidence-driven entropy-based measurement to determine whether a code snippet is poisoned and, if so, locates the triggers. Subsequently, it purifies the code by substituting the triggers with benign tokens using a masked language model. We extensively evaluate CodePurify against four advanced backdoor attacks across three representative tasks and two popular code models. The results show that CodePurify significantly outperforms four commonly used defense baselines, improving average defense performance by at least 40%, 40%, and 12% across the three tasks, respectively. These findings highlight the potential of CodePurify to serve as a robust defense against backdoor attacks on neural code models.

cross AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models

Authors: Yabin Zhang, Lei Zhang

Abstract: Recent research has shown that pre-trained vision-language models are effective at identifying out-of-distribution (OOD) samples by using negative labels as guidance. However, employing consistent negative labels across different OOD datasets often results in semantic misalignments, as these text labels may not accurately reflect the actual space of OOD images. To overcome this issue, we introduce \textit{adaptive negative proxies}, which are dynamically generated during testing by exploring actual OOD images, to align more closely with the underlying OOD label space and enhance the efficacy of negative proxy guidance. Specifically, our approach utilizes a feature memory bank to selectively cache discriminative features from test images, representing the targeted OOD distribution. This facilitates the creation of proxies that can better align with specific OOD datasets. While task-adaptive proxies average features to reflect the unique characteristics of each dataset, the sample-adaptive proxies weight features based on their similarity to individual test samples, exploring detailed sample-level nuances. The final score for identifying OOD samples integrates static negative labels with our proposed adaptive proxies, effectively combining textual and visual knowledge for enhanced performance. Our method is training-free and annotation-free, and it maintains fast testing speed. Extensive experiments across various benchmarks demonstrate the effectiveness of our approach, abbreviated as AdaNeg. Notably, on the large-scale ImageNet benchmark, our AdaNeg significantly outperforms existing methods, with a 2.45\% increase in AUROC and a 6.48\% reduction in FPR95. Codes are available at \url{https://github.com/YBZh/OpenOOD-VLM}.

URLs: https://github.com/YBZh/OpenOOD-VLM

cross The inexact power augmented Lagrangian method for constrained nonconvex optimization

Authors: Alexander Bodard, Konstantinos Oikonomidis, Emanuel Laude, Panagiotis Patrinos

Abstract: This work introduces an unconventional inexact augmented Lagrangian method, where the augmenting term is a Euclidean norm raised to a power between one and two. The proposed algorithm is applicable to a broad class of constrained nonconvex minimization problems that involve nonlinear equality constraints over a convex set under a mild regularity condition. First, we conduct a full complexity analysis of the method, leveraging an accelerated first-order algorithm for solving the Hölder-smooth subproblems. Next, we present an inexact proximal point method to tackle these subproblems, demonstrating that it achieves an improved convergence rate. Notably, this rate reduces to the best-known convergence rate for first-order methods when the augmenting term is a squared Euclidean norm. Our worst-case complexity results further show that using lower powers for the augmenting term leads to faster constraint satisfaction, albeit with a slower decrease in the dual residual. Numerical experiments support our theoretical findings, illustrating that this trade-off between constraint satisfaction and cost minimization is advantageous for certain practical problems.

cross Your Image is Secretly the Last Frame of a Pseudo Video

Authors: Wenlong Chen, Wenlin Chen, Lapo Rastrelli, Yingzhen Li

Abstract: Diffusion models, which can be viewed as a special case of hierarchical variational autoencoders (HVAEs), have shown profound success in generating photo-realistic images. In contrast, standard HVAEs often produce images of inferior quality compared to diffusion models. In this paper, we hypothesize that the success of diffusion models can be partly attributed to the additional self-supervision information for their intermediate latent states provided by corrupted images, which along with the original image form a pseudo video. Based on this hypothesis, we explore the possibility of improving other types of generative models with such pseudo videos. Specifically, we first extend a given image generative model to their video generative model counterpart, and then train the video generative model on pseudo videos constructed by applying data augmentation to the original images. Furthermore, we analyze the potential issues of first-order Markov data augmentation methods, which are typically used in diffusion models, and propose to use more expressive data augmentation to construct more useful information in pseudo videos. Our empirical results on the CIFAR10 and CelebA datasets demonstrate that improved image generation quality can be achieved with additional self-supervised information from pseudo videos.

cross Cyberbullying or just Sarcasm? Unmasking Coordinated Networks on Reddit

Authors: Pinky Pamecha, Chaitya Shah, Divyam Jain, Kashish Gandhi, Kiran Bhowmick, Meera Narvekar

Abstract: With the rapid growth of social media usage, a common trend has emerged where users often make sarcastic comments on posts. While sarcasm can sometimes be harmless, it can blur the line with cyberbullying, especially when used in negative or harmful contexts. This growing issue has been exacerbated by the anonymity and vast reach of the internet, making cyberbullying a significant concern on platforms like Reddit. Our research focuses on distinguishing cyberbullying from sarcasm, particularly where online language nuances make it difficult to discern harmful intent. This study proposes a framework using natural language processing (NLP) and machine learning to differentiate between the two, addressing the limitations of traditional sentiment analysis in detecting nuanced behaviors. By analyzing a custom dataset scraped from Reddit, we achieved a 95.15% accuracy in distinguishing harmful content from sarcasm. Our findings also reveal that teenagers and minority groups are particularly vulnerable to cyberbullying. Additionally, our research uncovers coordinated graphs of groups involved in cyberbullying, identifying common patterns in their behavior. This research contributes to improving detection capabilities for safer online communities.

cross LLMs Can Evolve Continually on Modality for X-Modal Reasoning

Authors: Jiazuo Yu, Haomiao Xiong, Lu Zhang, Haiwen Diao, Yunzhi Zhuge, Lanqing Hong, Dong Wang, Huchuan Lu, You He, Long Chen

Abstract: Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for $\mathbb{X}$-modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without executing joint-modal pretraining. In detail, a novel Adapter-in-Adapter (AnA) framework is introduced, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between two types of adapters to further enhance the multimodal interaction. To investigate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework on learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%. Our code is available at https://github.com/JiazuoYu/PathWeave

URLs: https://github.com/JiazuoYu/PathWeave

cross Looking Beyond The Top-1: Transformers Determine Top Tokens In Order

Authors: Daria Lioubashevski, Tomer Schlank, Gabriel Stanovsky, Ariel Goldstein

Abstract: Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed, which has been previously referred to as the "saturation event". We expand the concept of saturation events for top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these saturation events happen in order of the corresponding tokens' ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers. We propose an underlying mechanism of task transition for this sequential saturation, where task k corresponds to predicting the k-th most probable token, and the saturation events are in fact discrete transitions between the tasks. In support of this we show that it is possible to predict the current task from hidden layer embedding. Furthermore, using an intervention method we demonstrate that we can cause the model to switch from one task to the next. Finally, leveraging our findings, we introduce a novel token-level early-exit strategy, which surpasses existing methods in balancing performance and efficiency.
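
A small sketch may help pin down what a rank-k saturation event means operationally. It assumes that each layer's hidden state has already been decoded into a ranking of token ids (how that decoding is done is left open here), and it takes the saturation layer for rank k to be the first layer from which the rank-k token no longer changes. The function name and the toy data are made up for illustration.

```python
def saturation_layer(rankings_per_layer, k=1):
    """Return the first layer from which the rank-k token stays fixed.

    rankings_per_layer[l] is a list of token ids sorted by decreasing
    probability, as decoded from layer l. Layers are 0-indexed.
    """
    final_token = rankings_per_layer[-1][k - 1]
    layer = len(rankings_per_layer) - 1
    # Walk backwards while the rank-k token still matches the final one.
    while layer > 0 and rankings_per_layer[layer - 1][k - 1] == final_token:
        layer -= 1
    return layer


# Toy example: 6 layers, 3-token vocabulary (made-up rankings).
toy = [
    [2, 0, 1],
    [0, 2, 1],
    [0, 2, 1],
    [0, 1, 2],
    [0, 1, 2],
    [0, 1, 2],
]
print([saturation_layer(toy, k) for k in (1, 2)])
```

In the toy data the top-1 token settles at layer 1 and the second-ranked token at layer 3, which mirrors the ordering the abstract describes.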

cross Neural Fields in Robotics: A Survey

Authors: Muhammad Zubair Irshad, Mauro Comi, Yen-Chen Lin, Nick Heppert, Abhinav Valada, Rares Ambrus, Zsolt Kira, Jonathan Tremblay

Abstract: Neural Fields have emerged as a transformative approach for 3D scene representation in computer vision and robotics, enabling accurate inference of geometry, 3D semantics, and dynamics from posed 2D data. Leveraging differentiable rendering, Neural Fields encompass both continuous implicit and explicit neural representations enabling high-fidelity 3D reconstruction, integration of multi-modal sensor data, and generation of novel viewpoints. This survey explores their applications in robotics, emphasizing their potential to enhance perception, planning, and control. Their compactness, memory efficiency, and differentiability, along with seamless integration with foundation and generative models, make them ideal for real-time applications, improving robot adaptability and decision-making. This paper provides a thorough review of Neural Fields in robotics, categorizing applications across various domains and evaluating their strengths and limitations, based on over 200 papers. First, we present four key Neural Fields frameworks: Occupancy Networks, Signed Distance Fields, Neural Radiance Fields, and Gaussian Splatting. Second, we detail Neural Fields' applications in five major robotics domains: pose estimation, manipulation, navigation, physics, and autonomous driving, highlighting key works and discussing takeaways and open challenges. Finally, we outline the current limitations of Neural Fields in robotics and propose promising directions for future research. Project page: https://robonerf.github.io

URLs: https://robonerf.github.io

cross Recursive Function Definitions in Static Dataflow Graphs and their Implementation in TensorFlow

Authors: Kelly Kostopoulou, Angelos Charalambidis, Panos Rondogiannis

Abstract: Modern machine learning systems represent their computations as dataflow graphs. The increasingly complex neural network architectures call for more powerful yet efficient programming abstractions. In this paper we propose an efficient technique for supporting recursive function definitions in dataflow-based systems such as TensorFlow. The proposed approach transforms the given recursive definitions into a static dataflow graph that is enriched with two simple yet powerful dataflow operations. Since static graphs do not change during execution, they can be easily partitioned and executed efficiently in distributed and heterogeneous environments. The proposed technique makes heavy use of the idea of tagging, which has been one of the cornerstones of dataflow systems since their inception. We demonstrate that our technique is compatible with the idea of automatic differentiation, a notion that is crucial for dataflow systems that focus on deep learning applications. We describe the principles of an actual implementation of the technique in the TensorFlow framework, and present experimental results that demonstrate that the use of tagging is of paramount importance for developing efficient high-level abstractions for modern dataflow systems.

cross Improving Model Evaluation using SMART Filtering of Benchmark Datasets

Authors: Vipul Gupta, Candace Ross, David Pantoja, Rebecca J. Passonneau, Megan Ung, Adina Williams

Abstract: One of the most challenging problems facing NLP today is evaluation. Some of the most pressing issues pertain to benchmark saturation, data contamination, and diversity in the quality of test examples. To address these concerns, we propose Selection Methodology for Accurate, Reduced, and Targeted (SMART) filtering, a novel approach to select a high-quality subset of examples from existing benchmark datasets by systematically removing less informative and less challenging examples. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other based on distance in an embedding space. We demonstrate the effectiveness of SMART on three multiple choice QA datasets, where our methodology increases efficiency by reducing dataset size by 48\% on average, while increasing Pearson correlation with rankings from ChatBot Arena, a more open-ended human evaluation setting. Our method enables us to be more efficient, whether using SMART to make new benchmarks more challenging or to revitalize older datasets, while still preserving the relative model rankings.
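
The three filtering criteria can be sketched as a single pass over a benchmark. Everything below is an illustrative assumption rather than the paper's procedure: "easy" is taken to mean that every reference model answers correctly, contamination is a precomputed flag, and near-duplicates are dropped greedily by cosine distance. The dictionary keys, the `min_distance` threshold, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def smart_like_filter(examples, min_distance=0.1):
    """Keep examples that pass three illustrative filters.

    Each example is a dict with hypothetical keys:
      "model_correct": list of bools, one per reference model
      "contaminated":  bool flag from some contamination check
      "embedding":     list of floats
    """
    kept = []
    for ex in examples:
        if all(ex["model_correct"]):   # (i) easy: every model solves it
            continue
        if ex["contaminated"]:         # (ii) flagged as contaminated
            continue
        # (iii) too close to an example that was already kept
        if any(cosine_distance(ex["embedding"], k["embedding"]) < min_distance
               for k in kept):
            continue
        kept.append(ex)
    return kept
```

The greedy de-duplication keeps whichever of two similar examples appears first, so the result depends on input order; a clustering-based variant would avoid that at higher cost.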

cross Robust Model Evaluation over Large-scale Federated Networks

Authors: Amir Najafi, Samin Mahdizadeh Sani, Farzan Farnia

Abstract: In this paper, we address the challenge of certifying the performance of a machine learning model on an unseen target network, using measurements from an available source network. We focus on a scenario where heterogeneous datasets are distributed across a source network of clients, all connected to a central server. Specifically, consider a source network "A" composed of $K$ clients, each holding private data from unique and heterogeneous distributions, which are assumed to be independent samples from a broader meta-distribution $\mu$. Our goal is to provide certified guarantees for the model's performance on a different, unseen target network "B," governed by another meta-distribution $\mu'$, assuming the deviation between $\mu$ and $\mu'$ is bounded by either the Wasserstein distance or an $f$-divergence. We derive theoretical guarantees for the model's empirical average loss and provide uniform bounds on the risk CDF, where the latter correspond to novel and adversarially robust versions of the Glivenko-Cantelli theorem and the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. Our bounds are computable in polynomial time with a polynomial number of queries to the $K$ clients, preserving client privacy by querying only the model's (potentially adversarial) loss on private data. We also establish non-asymptotic generalization bounds that consistently converge to zero as both $K$ and the minimum client sample size grow. Extensive empirical evaluations validate the robustness and practicality of our bounds across real-world tasks.
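
For orientation, the classical (non-robust) DKW inequality states that for $n$ i.i.d. samples the empirical CDF deviates from the true CDF by at most $\sqrt{\ln(2/\delta)/(2n)}$ uniformly, with probability at least $1-\delta$. The sketch below computes that baseline band only; the paper's robust versions under Wasserstein or $f$-divergence shifts are not reproduced, and the function names are illustrative.

```python
import math


def dkw_epsilon(n, delta):
    """Half-width of the classical DKW band: sqrt(ln(2/delta) / (2n))."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def empirical_cdf(losses, t):
    """Fraction of observed losses that are <= t."""
    return sum(1 for x in losses if x <= t) / len(losses)


def cdf_band(losses, t, delta=0.05):
    """(lower, upper) bounds on the true CDF at t.

    Valid uniformly over t with probability at least 1 - delta,
    assuming the losses are i.i.d. draws.
    """
    eps = dkw_epsilon(len(losses), delta)
    f_hat = empirical_cdf(losses, t)
    return max(0.0, f_hat - eps), min(1.0, f_hat + eps)
```

With 1000 samples and $\delta = 0.05$, the half-width is about 0.043, which gives a sense of how many clients or samples are needed before such a band becomes informative.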

cross You Never Know: Quantization Induces Inconsistent Biases in Vision-Language Foundation Models

Authors: Eric Slyman, Anirudh Kanneganti, Sanghyun Hong, Stefan Lee

Abstract: We study the impact of a standard practice in compressing foundation vision-language models - quantization - on the models' ability to produce socially-fair outputs. In contrast to prior findings with unimodal models that compression consistently amplifies social biases, our extensive evaluation of four quantization settings across three datasets and three CLIP variants yields a surprising result: while individual models demonstrate bias, we find no consistent change in bias magnitude or direction across a population of compressed models due to quantization.

cross Learning Approximated Maximal Safe Sets via Hypernetworks for MPC-Based Local Motion Planning

Authors: Bojan Derajić, Mohamed-Khalil Bouzidi, Sebastian Bernhard, Wolfgang Hönig

Abstract: This paper presents a novel learning-based approach for online estimation of maximal safe sets for local motion planning tasks in mobile robotics. We leverage the idea of hypernetworks to achieve good generalization properties and real-time performance simultaneously. As the source of supervision, we employ the Hamilton-Jacobi (HJ) reachability analysis, allowing us to consider general nonlinear dynamics and arbitrary constraints. We integrate our model into a model predictive control (MPC) local planner as a safety constraint and compare the performance with relevant baselines in realistic 3D simulations for different environments and robot dynamics. The results show the advantages of our approach in terms of a significantly higher success rate: 2 to 18 percent over the best baseline, while achieving real-time performance.

cross On the Gaussian process limit of Bayesian Additive Regression Trees

Authors: Giacomo Petrillo

Abstract: Bayesian Additive Regression Trees (BART) is a nonparametric Bayesian regression technique of rising fame. It is a sum-of-decision-trees model, and is in some sense the Bayesian version of boosting. In the limit of infinite trees, it becomes equivalent to Gaussian process (GP) regression. This limit is known but has not yet led to any useful analysis or application. For the first time, I derive and compute the exact BART prior covariance function. With it I implement the infinite trees limit of BART as GP regression. Through empirical tests, I show that this limit is worse than standard BART in a fixed configuration, but also that tuning the hyperparameters in the natural GP way yields a competitive method, although a properly tuned BART is still superior. The advantage of using a GP surrogate of BART is the analytical likelihood, which simplifies model building and sidesteps the complex BART MCMC. More generally, this study opens new ways to understand and develop BART and GP regression. The implementation of BART as GP is available in the Python package https://github.com/Gattocrucco/lsqfitgp.

URLs: https://github.com/Gattocrucco/lsqfitgp

cross Fine-Tuning and Evaluating Open-Source Large Language Models for the Army Domain

Authors: Daniel C. Ruiz, John Sell

Abstract: In recent years, the widespread adoption of Large Language Models (LLMs) has sparked interest in their potential for application within the military domain. However, the current generation of LLMs demonstrates sub-optimal performance on Army use cases, due to the prevalence of domain-specific vocabulary and jargon. In order to fully leverage LLMs in-domain, many organizations have turned to fine-tuning to circumvent the prohibitive costs involved in training new LLMs from scratch. In light of this trend, we explore the viability of adapting open-source LLMs for usage in the Army domain in order to address their existing lack of domain-specificity. Our investigations have resulted in the creation of three distinct generations of TRACLM, a family of LLMs fine-tuned by The Research and Analysis Center (TRAC), Army Futures Command (AFC). Through continuous refinement of our training pipeline, each successive iteration of TRACLM displayed improved capabilities when applied to Army tasks and use cases. Furthermore, throughout our fine-tuning experiments, we recognized the need for an evaluation framework that objectively quantifies the Army domain-specific knowledge of LLMs. To address this, we developed MilBench, an extensible software framework that efficiently evaluates the Army knowledge of a given LLM using tasks derived from doctrine and assessments. We share preliminary results, models, methods, and recommendations on the creation of TRACLM and MilBench. Our work significantly informs the development of LLM technology across the DoD and augments senior leader decisions with respect to artificial intelligence integration.

cross Low-rank Bayesian matrix completion via geodesic Hamiltonian Monte Carlo on Stiefel manifolds

Authors: Tiangang Cui, Alex Gorodetsky

Abstract: We present a new sampling-based approach for enabling efficient computation of low-rank Bayesian matrix completion and quantifying the associated uncertainty. Firstly, we design a new prior model based on the singular-value-decomposition (SVD) parametrization of low-rank matrices. Our prior is analogous to the seminal nuclear-norm regularization used in the non-Bayesian setting and enforces orthogonality in the factor matrices by constraining them to Stiefel manifolds. Then, we design a geodesic Hamiltonian Monte Carlo (-within-Gibbs) algorithm for generating posterior samples of the SVD factor matrices. We demonstrate that our approach resolves the sampling difficulties encountered by standard Gibbs samplers for the common two-matrix factorization used in matrix completion. More importantly, the geodesic Hamiltonian sampler allows for sampling in cases with more general likelihoods than the typical Gaussian likelihood and Gaussian prior assumptions adopted in most of the existing Bayesian matrix completion literature. We demonstrate applications of our approach to fitting the categorical data of a mice protein dataset and to the MovieLens recommendation problem. Numerical examples demonstrate superior sampling performance, including better mixing and faster convergence to a stationary distribution. Moreover, they demonstrate improved accuracy on the two real-world benchmark problems we considered.

cross Logarithmically Quantized Distributed Optimization over Dynamic Multi-Agent Networks

Authors: Mohammadreza Doostmohammadian, Sérgio Pequito

Abstract: Distributed optimization finds many applications in machine learning, signal processing, and control systems. In these real-world applications, the constraints of communication networks, particularly limited bandwidth, necessitate implementing quantization techniques. In this paper, we propose distributed optimization dynamics over multi-agent networks subject to logarithmically quantized data transmission. Under this condition, data exchange benefits from representing smaller values with more bits and larger values with fewer bits. As compared to uniform quantization, this allows for higher precision in representing near-optimal values and more accuracy of the distributed optimization algorithm. The proposed optimization dynamics comprise a primary state variable converging to the optimizer and an auxiliary variable tracking the objective function's gradient. Our setting accommodates dynamic network topologies, resulting in a hybrid system requiring convergence analysis using matrix perturbation theory and eigenspectrum analysis.
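
The precision argument can be illustrated with one common form of logarithmic quantizer, which rounds $\ln|x|$ to the nearest multiple of a resolution $\delta$. This is a sketch under that assumption, not necessarily the quantizer used in the paper; `delta` and the function names are hypothetical. Because the levels are $\pm e^{k\delta}$, they are dense near zero and sparse for large magnitudes, and the relative error is bounded by $e^{\delta/2} - 1$.

```python
import math


def log_quantize(x, delta=0.1):
    """Logarithmic quantizer: round ln|x| to the nearest multiple of delta.

    Output levels are +/- exp(k * delta) for integer k. delta is a
    hypothetical resolution parameter.
    """
    if x == 0.0:
        return 0.0
    level = delta * round(math.log(abs(x)) / delta)
    return math.copysign(math.exp(level), x)


def relative_error_bound(delta):
    """Worst-case |q(x) - x| / |x| for the quantizer above."""
    return math.exp(delta / 2.0) - 1.0


for value in (0.003, 0.5, 7.3, 1234.5):
    q = log_quantize(value)
    print(f"x={value:<8} q(x)={q:.6g} abs.err={abs(q - value):.3g}")
```

The absolute error shrinks in proportion to $|x|$, which is why values close to zero are represented more precisely than under a uniform grid with the same number of levels.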

cross FoldMark: Protecting Protein Generative Models with Watermarking

Authors: Zaixi Zhang, Ruofan Jin, Kaidi Fu, Le Cong, Marinka Zitnik, Mengdi Wang

Abstract: Protein structure is key to understanding protein function and is essential for progress in bioengineering, drug discovery, and molecular biology. Recently, with the incorporation of generative AI, the power and accuracy of computational protein structure prediction/design have been improved significantly. However, ethical concerns such as copyright protection and harmful content generation (biosecurity) pose challenges to the wide implementation of protein generative models. Here, we investigate whether it is possible to embed watermarks into protein generative models and their outputs for copyright authentication and the tracking of generated structures. As a proof of concept, we propose a two-stage method, FoldMark, as a generalized watermarking strategy for protein generative models. FoldMark first pretrains a watermark encoder and decoder, which can slightly adjust protein structures to embed user-specific information and faithfully recover the information from the encoded structure. In the second step, protein generative models are fine-tuned with watermark Low-Rank Adaptation (LoRA) modules to preserve generation quality while learning to generate watermarked structures with high recovery rates. Extensive experiments are conducted on open-source protein structure prediction models (e.g., ESMFold and MultiFlow) and de novo structure design models (e.g., FrameDiff and FoldFlow) and we demonstrate that our method is effective across all these generative models. Meanwhile, our watermarking framework only exerts a negligible impact on the original protein structure quality and is robust under potential post-processing and adaptive attacks.

cross TrajAgent: An Agent Framework for Unified Trajectory Modelling

Authors: Yuwei Du, Jie Feng, Jie Zhao, Yong Li

Abstract: Trajectory modelling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modelling. However, due to the heterogeneity of data and the diversity of trajectory tasks, achieving unified trajectory modelling remains an important yet challenging task. In this paper, we propose TrajAgent, a large language model-based agentic framework, to unify various trajectory modelling tasks. In TrajAgent, we first develop UniEnv, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on UniEnv, we introduce TAgent, an agentic workflow designed for automatic trajectory modelling across various trajectory tasks. Specifically, we design AutOpt, a systematic optimization module within TAgent, to further improve the performance of the integrated model. With diverse trajectory tasks input in natural language, TrajAgent automatically generates competitive results via training and executing appropriate models. Extensive experiments on four tasks using four real-world datasets demonstrate the effectiveness of TrajAgent in unified trajectory modelling, achieving an average performance improvement of 15.43% over baseline methods.

cross When Less is More: Achieving Faster Convergence in Distributed Edge Machine Learning

Authors: Advik Raj Basani, Siddharth Chaitra Vivek, Advaith Krishna, Arnab K. Paul

Abstract: Distributed Machine Learning (DML) on resource-constrained edge devices holds immense potential for real-world applications. However, achieving fast convergence in DML in these heterogeneous environments remains a significant challenge. Traditional frameworks like Bulk Synchronous Parallel and Asynchronous Stochastic Parallel rely on frequent, small updates that incur substantial communication overhead and hinder convergence speed. Furthermore, these frameworks often employ static dataset sizes, neglecting the heterogeneity of edge devices and potentially leading to straggler nodes, i.e., edge devices that take significantly longer to process their assigned data chunk and thereby slow down the entire training process. To address these limitations, this paper proposes Hermes, a novel probabilistic framework for efficient DML on edge devices. This framework leverages a dynamic threshold based on recent test loss behavior to identify statistically significant improvements in the model's generalization capability, transmitting updates only when major improvements are detected and thereby significantly reducing communication overhead. Additionally, Hermes employs dynamic dataset allocation to optimize resource utilization and prevent performance degradation caused by straggler nodes. Our evaluations on a real-world heterogeneous resource-constrained environment demonstrate that Hermes achieves faster convergence compared to state-of-the-art methods, resulting in a remarkable $13.22$x reduction in training time and a $62.1\%$ decrease in communication overhead.
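
The idea of sending an update only when the test loss improves significantly relative to recent history can be sketched as follows. This is a hedged illustration, not Hermes itself: the z-score rule, the sliding window, and the names `UpdateGate`, `window`, and `z` are assumptions introduced here.

```python
from collections import deque
from statistics import mean, pstdev

class UpdateGate:
    """Hypothetical gate: send an update only if the new test loss is
    more than `z` standard deviations below the recent mean loss."""

    def __init__(self, window: int = 5, z: float = 1.0):
        self.history = deque(maxlen=window)  # most recent test losses
        self.z = z

    def should_send(self, test_loss: float) -> bool:
        if len(self.history) < 2:
            send = True  # too little history to estimate a threshold
        else:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            send = test_loss < mu - self.z * sigma  # dynamic threshold
        self.history.append(test_loss)
        return send

if __name__ == "__main__":
    gate = UpdateGate(window=5, z=1.0)
    for loss in (1.0, 0.98, 0.99, 0.97, 0.60):
        print(loss, gate.should_send(loss))
```

Because the threshold is recomputed from the window at every step, it tightens when the loss plateaus and loosens when the loss is noisy, which is the behavior a fixed threshold cannot provide.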

cross Symbotunes: unified hub for symbolic music generative models

Authors: Paweł Skierś, Maksymilian Łazarski, Michał Kopeć, Mateusz Modrzejewski

Abstract: Implementations of popular symbolic music generative models often differ significantly in terms of the libraries utilized and overall project structure. Therefore, directly comparing the methods or becoming acquainted with them may present challenges. To mitigate this issue we introduce Symbotunes, an open-source unified hub for symbolic music generative models. Symbotunes contains modern Python implementations of well-known methods for symbolic music generation, as well as a unified pipeline for generating and training.

cross CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming

Authors: Ali TehraniJamsaz, Arijit Bhattacharjee, Le Chen, Nesreen K. Ahmed, Amir Yazdanbakhsh, Ali Jannesari

Abstract: Recent advancements in Large Language Models (LLMs) have renewed interest in automatic programming language translation. Encoder-decoder transformer models, in particular, have shown promise in translating between different programming languages. However, translating between a language and its high-performance computing (HPC) extensions remains underexplored due to challenges such as complex parallel semantics. In this paper, we introduce CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions. CodeRosetta is evaluated on C++ to CUDA and Fortran to C++ translation tasks. It uses a customized learning framework with tailored pretraining and training objectives to effectively capture both code semantics and parallel structural nuances, enabling bidirectional translation. Our results show that CodeRosetta outperforms state-of-the-art baselines in C++ to CUDA translation by 2.9 BLEU and 1.72 CodeBLEU points while improving compilation accuracy by 6.05%. Compared to general closed-source LLMs, our method improves C++ to CUDA translation by 22.08 BLEU and 14.39 CodeBLEU, with 2.75% higher compilation accuracy. Finally, CodeRosetta exhibits proficiency in Fortran to parallel C++ translation, marking it, to our knowledge, as the first encoder-decoder model for this complex task, improving CodeBLEU by at least 4.63 points compared to closed-source and open-code LLMs.

cross Search Wide, Focus Deep: Automated Fetal Brain Extraction with Sparse Training Data

Authors: Javid Dadashkarimi, Valeria Pena Trujillo, Camilo Jaimes, Lilla Zöllei, Malte Hoffmann

Abstract: Automated fetal brain extraction from full-uterus MRI is a challenging task due to variable head sizes, orientations, complex anatomy, and prevalent artifacts. While deep-learning (DL) models trained on synthetic images have been successful in adult brain extraction, adapting these networks for fetal MRI is difficult due to the sparsity of labeled data, leading to increased false-positive predictions. To address this challenge, we propose a test-time strategy that reduces false positives in networks trained on sparse, synthetic labels. The approach uses a breadth-fine search (BFS) to identify a subvolume likely to contain the fetal brain, followed by a deep-focused sliding window (DFS) search to refine the extraction, pooling predictions to minimize false positives. We train models at different window sizes using synthetic images derived from a small number of fetal brain label maps, augmented with random geometric shapes. Each model is trained on diverse head positions and scales, including cases with partial or no brain tissue. Our framework matches state-of-the-art brain extraction methods on clinical HASTE scans of third-trimester fetuses and exceeds them by up to 5% in terms of Dice in the second trimester as well as on EPI scans across both trimesters. Our results demonstrate the utility of a sliding-window approach, combined with pooling predictions from several models trained on synthetic images, for improving brain-extraction accuracy by progressively refining regions of interest and minimizing the risk of missing brain mask slices or misidentifying other tissues as brain.

cross SIGMA: Single Interpolated Generative Model for Anomalies

Authors: Ranit Das, David Shih

Abstract: A key step in any resonant anomaly detection search is accurate modeling of the background distribution in each signal region. Data-driven methods like CATHODE accomplish this by training separate generative models on the complement of each signal region, and interpolating them into their corresponding signal regions. Having to re-train the generative model on essentially the entire dataset for each signal region is a major computational cost in a typical sliding window search with many signal regions. Here, we present SIGMA, a new, fully data-driven, computationally-efficient method for estimating background distributions. The idea is to train a single generative model on all of the data and interpolate its parameters in sideband regions in order to obtain a model for the background in the signal region. The SIGMA method significantly reduces the computational cost compared to previous approaches, while retaining a similar high quality of background modeling and sensitivity to anomalous signals.

cross Neural rendering enables dynamic tomography

Authors: Ivan Grega, William F. Whitney, Vikram S. Deshpande

Abstract: Interrupted X-ray computed tomography (X-CT) has been the common way to observe the deformation of materials during an experiment. While this approach is effective for quasi-static experiments, it has never been possible to reconstruct a full 3D tomography during a dynamic experiment which cannot be interrupted. In this work, we propose that neural rendering tools can be used to drive the paradigm shift to enable 3D reconstruction during dynamic events. First, we derive theoretical results to support the selection of projection angles. Via a combination of synthetic and experimental data, we demonstrate that neural radiance fields can reconstruct data modalities of interest more efficiently than conventional reconstruction methods. Finally, we develop a spatio-temporal model with a spline-based deformation field and demonstrate that such a model can reconstruct the spatio-temporal deformation of lattice samples in real-world experiments.

cross Unsupervised Panoptic Interpretation of Latent Spaces in GANs Using Space-Filling Vector Quantization

Authors: Mohammad Hassan Vali, Tom Bäckström

Abstract: Generative adversarial networks (GANs) learn a latent space whose samples can be mapped to real-world images. Such latent spaces are difficult to interpret. Some earlier supervised methods aim to create an interpretable latent space or discover interpretable directions, but they require exploiting data labels or annotated synthesized samples for training. In contrast, we propose using a modification of vector quantization called space-filling vector quantization (SFVQ), which quantizes the data on a piece-wise linear curve. SFVQ can capture the underlying morphological structure of the latent space and thus make it interpretable. We apply this technique to model the latent space of pretrained StyleGAN2 and BigGAN networks on various datasets. Our experiments show that the SFVQ curve yields a general interpretable model of the latent space that determines which part of the latent space corresponds to which specific generative factors. Furthermore, we demonstrate that each line of SFVQ's curve can potentially refer to an interpretable direction for applying intelligible image transformations. We also show that the points located on an SFVQ line can be used for controllable data augmentation.

cross A Framework for Real-Time Volcano-Seismic Event Recognition Based on Multi-Station Seismograms and Semantic Segmentation Models

Authors: Camilo Espinosa-Curilem, Millaray Curilem, Daniel Basualto

Abstract: In volcano monitoring, effective recognition of seismic events is essential for understanding volcanic activity and raising timely warning alerts. Traditional methods rely on manual analysis, which can be subjective and labor-intensive. Furthermore, current automatic approaches often tackle detection and classification separately, mostly rely on single-station information, and generally require tailored preprocessing and representations to perform predictions. These limitations often hinder their application to real-time monitoring and utilization across different volcano conditions. This study introduces a novel approach that utilizes Semantic Segmentation models to automate seismic event recognition by applying a straightforward transformation of multi-channel 1D signals into 2D representations, enabling their use as images. Our framework employs a data-driven, end-to-end design that integrates multi-station seismic data with minimal preprocessing, performing both detection and classification simultaneously for five seismic event classes. We evaluated four state-of-the-art segmentation models (UNet, UNet++, DeepLabV3+ and SwinUNet) on approximately 25,000 seismic events recorded at four different Chilean volcanoes: Nevados del Chillán Volcanic Complex, Laguna del Maule, Villarrica and Puyehue-Cordón Caulle. Among these models, the UNet architecture was identified as the most effective, achieving mean F1 and Intersection over Union (IoU) scores of up to 0.91 and 0.88, respectively, and demonstrating superior noise robustness and model flexibility to unseen volcano datasets.

cross Implementation and Application of an Intelligibility Protocol for Interaction with an LLM

Authors: Ashwin Srinivasan, Karan Bania, Shreyas V, Harshvardhan Mestha, Sidong Liu

Abstract: Our interest is in constructing interactive systems involving a human-expert interacting with a machine learning engine on data analysis tasks. This is of relevance when addressing complex problems arising in areas of science, the environment, medicine and so on, which are not immediately amenable to the usual methods of statistical or mathematical modelling. In such situations, it is possible that harnessing human expertise and creativity to modern machine-learning capabilities of identifying patterns by constructing new internal representations of the data may provide some insight to possible solutions. In this paper, we examine the implementation of an abstract protocol developed for interaction between agents, each capable of constructing predictions and explanations. The PXP protocol, described in [12], is motivated by the notion of "two-way intelligibility" and is specified using a pair of communicating finite-state machines. While the formalisation allows the authors to prove several properties about the protocol, no implementation was presented. Here, we address this shortcoming for the case in which one of the agents acts as a "generator" using a large language model (LLM) and the other is an agent that acts as a "tester" using either a human-expert, or a proxy for a human-expert (for example, a database compiled using human-expertise). We believe these use-cases will be a widely applicable form of interaction for problems of the kind mentioned above. We present an algorithmic description of a general-purpose implementation, and conduct preliminary experiments on its use in two different areas (radiology and drug-discovery). The experimental results provide early evidence in support of the protocol's capability of capturing one- and two-way intelligibility in human-LLM interaction in the manner proposed in [12].

cross Kernel Approximation of Fisher-Rao Gradient Flows

Authors: Jia-Jie Zhu, Alexander Mielke

Abstract: The purpose of this paper is to answer a few open questions in the interface of kernel methods and PDE gradient flows. Motivated by recent advances in machine learning, particularly in generative modeling and sampling, we present a rigorous investigation of Fisher-Rao and Wasserstein type gradient flows concerning their gradient structures, flow equations, and their kernel approximations. Specifically, we focus on the Fisher-Rao (also known as Hellinger) geometry and its various kernel-based approximations, developing a principled theoretical framework using tools from PDE gradient flows and optimal transport theory. We also provide a complete characterization of gradient flows in the maximum-mean discrepancy (MMD) space, with connections to existing learning and inference algorithms. Our analysis reveals precise theoretical insights linking Fisher-Rao flows, Stein flows, kernel discrepancies, and nonparametric regression. We then rigorously prove evolutionary $\Gamma$-convergence for kernel-approximated Fisher-Rao flows, providing theoretical guarantees beyond pointwise convergence. Finally, we analyze energy dissipation using the Helmholtz-Rayleigh principle, establishing important connections between classical theory in mechanics and modern machine learning practice. Our results provide a unified theoretical foundation for understanding and analyzing approximations of gradient flows in machine learning applications through a rigorous gradient flow and variational method perspective.

cross Near Optimal Pure Exploration in Logistic Bandits

Authors: Eduardo Ochoa Rivera, Ambuj Tewari

Abstract: Bandit algorithms have garnered significant attention due to their practical applications in real-world scenarios. However, beyond simple settings such as multi-arm or linear bandits, optimal algorithms remain scarce. Notably, no optimal solution exists for pure exploration problems in the context of generalized linear model (GLM) bandits. In this paper, we narrow this gap and develop the first track-and-stop algorithm for general pure exploration problems under the logistic bandit called logistic track-and-stop (Log-TS). Log-TS is an efficient algorithm that asymptotically matches an approximation for the instance-specific lower bound of the expected sample complexity up to a logarithmic factor.

cross Injectivity capacity of ReLU gates

Authors: Mihailo Stojnic

Abstract: We consider the injectivity property of ReLU network layers. Determining the ReLU injectivity capacity (ratio of the number of layer's inputs and outputs) is established as isomorphic to determining the capacity of the so-called $\ell_0$ spherical perceptron. Employing fully lifted random duality theory (fl RDT), a powerful program is developed and utilized to handle the $\ell_0$ spherical perceptron and, implicitly, the ReLU layers' injectivity. To put the entire fl RDT machinery into practical use, a sizeable set of numerical evaluations is conducted as well. The lifting mechanism is observed to converge remarkably fast, with relative corrections in the estimated quantities not exceeding $\sim 0.1\%$ already on the third level of lifting. Closed-form explicit analytical relations among key lifting parameters are uncovered as well. In addition to being of incredible importance in handling all the required numerical work, these relations also shed new light on beautiful parametric interconnections within the lifting structure. Finally, the obtained results are also shown to fairly closely match the replica predictions from [40].

cross A Statistical Analysis of Deep Federated Learning for Intrinsically Low-dimensional Data

Authors: Saptarshi Chakraborty, Peter L. Bartlett

Abstract: Federated Learning (FL) has emerged as a groundbreaking paradigm in collaborative machine learning, emphasizing decentralized model training to address data privacy concerns. While significant progress has been made in optimizing federated learning, the exploration of generalization error, particularly in heterogeneous settings, has been limited, focusing mainly on parametric cases. This paper investigates the generalization properties of deep federated regression within a two-stage sampling model. Our findings highlight that the intrinsic dimension, defined by the entropic dimension, is crucial for determining convergence rates when appropriate network sizes are used. Specifically, if the true relationship between response and explanatory variables is characterized by a $\beta$-Hölder function and there are $n$ independent and identically distributed (i.i.d.) samples from $m$ participating clients, the error rate for participating clients scales at most as $\tilde{O}\left((mn)^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))}\right)$, and for non-participating clients, it scales as $\tilde{O}\left(\Delta \cdot m^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))} + (mn)^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))}\right)$. Here, $\bar{d}_{2\beta}(\lambda)$ represents the $2\beta$-entropic dimension of $\lambda$, the marginal distribution of the explanatory variables, and $\Delta$ characterizes the dependence between the sampling stages. Our results explicitly account for the "closeness" of clients, demonstrating that the convergence rates of deep federated learners depend on intrinsic rather than nominal high-dimensionality.
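
To get a feel for how the stated rates depend on the intrinsic dimension, the following Python sketch evaluates the two expressions numerically. It drops the constants and logarithmic factors hidden in the $\tilde{O}$ notation, and the function names and sample values are assumptions for illustration only; `d` stands for $\bar{d}_{2\beta}(\lambda)$ and `delta` for $\Delta$.

```python
def participating_rate(m: int, n: int, beta: float, d: float) -> float:
    """(mn)^(-2*beta / (2*beta + d)), ignoring constants and log factors."""
    return (m * n) ** (-2 * beta / (2 * beta + d))

def nonparticipating_rate(m: int, n: int, beta: float, d: float, delta: float) -> float:
    """delta * m^(-2*beta/(2*beta+d)) + (mn)^(-2*beta/(2*beta+d)), same caveats."""
    exponent = -2 * beta / (2 * beta + d)
    return delta * m ** exponent + (m * n) ** exponent

if __name__ == "__main__":
    m, n, beta = 10, 100, 1.0
    for d in (2.0, 20.0):  # low vs. high intrinsic dimension
        print(d, participating_rate(m, n, beta, d),
              nonparticipating_rate(m, n, beta, d, delta=0.5))
```

With these illustrative values, the rate at `d = 2` is roughly 0.03 while at `d = 20` it is above 0.5, which shows why a low intrinsic dimension matters far more than the nominal dimension of the data.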

cross Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

Authors: Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster

Abstract: Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as a form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller "Recursive Transformers" that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our recursive models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge distillation baselines -- and can even recover most of the performance of the original "full-size" model (e.g., Gemma 2B with no shared parameters). Finally, we propose Continuous Depth-wise Batching, a promising new inference paradigm enabled by the Recursive Transformer when paired with early exiting. In a theoretical analysis, we show that this has the potential to lead to significant (2-3x) gains in inference throughput.
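
The combination of a shared block with depth-wise LoRA can be illustrated with a toy Python sketch. It is a simplification under stated assumptions: a single linear map followed by ReLU stands in for a full Transformer block, and the names `relaxed_recursive_forward`, `W_shared`, and `loras` are hypothetical, not taken from the paper.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def relaxed_recursive_forward(x, W_shared, loras):
    """Apply the same weight W_shared once per loop iteration, adding a
    depth-specific low-rank correction B @ A at each iteration."""
    h = x
    for B, A in loras:  # one (B, A) pair per loop iteration
        base = matvec(W_shared, h)        # tied (shared) parameters
        delta = matvec(B, matvec(A, h))   # low-rank relaxation, rank = len(A)
        h = [max(0.0, b + d) for b, d in zip(base, delta)]  # ReLU stand-in
    return h

def param_counts(d, r, loops):
    """Return (relaxed recursive, untied) parameter counts for d x d layers."""
    return d * d + loops * 2 * d * r, loops * d * d

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    lora = ([[1.0], [0.0]], [[0.0, 1.0]])  # rank-1 B (2x1) and A (1x2)
    print(relaxed_recursive_forward([1.0, 2.0], W, [lora]))
    print(param_counts(d=1024, r=8, loops=3))
```

The parameter count shows the intended trade-off in this toy setting: the low-rank terms add only `2 * d * r` parameters per loop iteration, so the relaxed model stays close to the size of a single shared layer.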

cross A Machine Learning-Driven Wireless System for Structural Health Monitoring

Authors: Marius Pop, Mihai Tudose, Daniel Visan, Mircea Bocioaga, Mihai Botan, Cesar Banu, Tiberiu Salaoru

Abstract: The paper presents a wireless system integrated with a machine learning (ML) model for structural health monitoring (SHM) of carbon fiber reinforced polymer (CFRP) structures, primarily targeting aerospace applications. The system collects data via carbon nanotube (CNT) piezoresistive sensors embedded within CFRP coupons, wirelessly transmitting these data to a central server for processing. A deep neural network (DNN) model predicts mechanical properties and can be extended to forecast structural failures, facilitating proactive maintenance and enhancing safety. The modular design supports scalability and can be embedded within digital twin frameworks, offering significant benefits to aircraft operators and manufacturers. The system utilizes an ML model with a mean absolute error (MAE) of 0.14 on test data for forecasting mechanical properties. Data transmission latency throughout the entire system is less than one second in a LAN setup, highlighting its potential for real-time monitoring applications in aerospace and other industries. However, while the system shows promise, challenges such as sensor reliability under extreme environmental conditions and the need for advanced ML models to handle diverse data streams have been identified as areas for future research.

cross MCI-GRU: Stock Prediction Model Based on Multi-Head Cross-Attention and Improved GRU

Authors: Peng Zhu, Yuante Li, Yifan Hu, Sheng Xiang, Qinyuan Liu, Dawei Cheng, Yuqi Liang

Abstract: As financial markets grow increasingly complex in the big data era, accurate stock prediction has become more critical. Traditional time series models, such as GRUs, have been widely used but often struggle to capture the intricate nonlinear dynamics of markets, particularly in the flexible selection and effective utilization of key historical information. Recently, methods like Graph Neural Networks and Reinforcement Learning have shown promise in stock prediction but require high data quality and quantity, and they tend to exhibit instability when dealing with data sparsity and noise. Moreover, the training and inference processes for these models are typically complex and computationally expensive, limiting their broad deployment in practical applications. Existing approaches also generally struggle to capture unobservable latent market states effectively, such as market sentiment and expectations, microstructural factors, and participant behavior patterns, leading to an inadequate understanding of market dynamics, which in turn degrades prediction accuracy. To address these challenges, this paper proposes a stock prediction model, MCI-GRU, based on a multi-head cross-attention mechanism and an improved GRU. First, we enhance the GRU model by replacing the reset gate with an attention mechanism, thereby increasing the model's flexibility in selecting and utilizing historical information. Second, we design a multi-head cross-attention mechanism for learning unobservable latent market state representations, which are further enriched through interactions with both temporal features and cross-sectional features. Finally, extensive experiments on four main stock markets show that the proposed method outperforms SOTA techniques across multiple metrics. Additionally, its successful application in real-world fund management operations confirms its effectiveness and practicality.

cross Multi-modal Data based Semi-Supervised Learning for Vehicle Positioning

Authors: Ouwen Huan, Yang Yang, Tao Luo, Mingzhe Chen

Abstract: In this paper, a multi-modal data based semi-supervised learning (SSL) framework that jointly uses channel state information (CSI) data and RGB images for vehicle positioning is designed. In particular, an outdoor positioning system where the vehicle locations are determined by a base station (BS) is considered. The BS, equipped with several cameras, can collect a large amount of unlabeled CSI data and a small amount of labeled CSI data of vehicles, as well as the images taken by the cameras. Although the collected images contain partial information about vehicles (i.e., azimuth angles of vehicles), the relationship between the unlabeled CSI data and its azimuth angle, and the distances between the BS and the vehicles captured by images, are both unknown. Therefore, the images cannot be directly used as the labels of unlabeled CSI data to train a positioning model. To exploit unlabeled CSI data and images, an SSL framework that consists of a pretraining stage and a downstream training stage is proposed. In the pretraining stage, the azimuth angles obtained from the images are considered as the labels of unlabeled CSI data to pretrain the positioning model. In the downstream training stage, a small labeled dataset in which the accurate vehicle positions are considered as labels is used to retrain the model. Simulation results show that the proposed method can reduce the positioning error by up to 30% compared to a baseline where the model is not pretrained.

cross Joint Channel Selection using FedDRL in V2X

Authors: Lorenzo Mancini, Safwan Labbi, Karim Abed Meraim, Fouzi Boukhalfa, Alain Durmus, Paul Mangold, Eric Moulines

Abstract: Vehicle-to-everything (V2X) communication technology is revolutionizing transportation by enabling interactions between vehicles, devices, and infrastructures. This connectivity enhances road safety, transportation efficiency, and driver assistance systems. V2X benefits from Machine Learning, enabling real-time data analysis, better decision-making, and improved traffic predictions, making transportation safer and more efficient. In this paper, we study the problem of joint channel selection, where vehicles with different technologies choose one or more Access Points (APs) to transmit messages in a network. In this problem, vehicles must learn a strategy for channel selection, based on observations that incorporate vehicles' information (position and speed), network and communication data (Signal-to-Interference-plus-Noise Ratio from past communications), and environmental data (road type). We propose an approach based on Federated Deep Reinforcement Learning (FedDRL), which enables each vehicle to benefit from other vehicles' experiences. Specifically, we apply the federated Proximal Policy Optimization (FedPPO) algorithm to this task. We show that this method improves communication reliability while minimizing transmission costs and channel switches. The efficiency of the proposed solution is assessed via realistic simulations, highlighting the potential of FedDRL to advance V2X technology.

cross Wireless-Friendly Window Position Optimization for RIS-Aided Outdoor-to-Indoor Networks based on Multi-Modal Large Language Model

Authors: Jinbo Hou, Kehai Qiu, Zitian Zhang, Yong Yu, Kezhi Wang, Stefano Capolongo, Jiliang Zhang, Zeyang Li, Jie Zhang

Abstract: This paper aims to simultaneously optimize indoor wireless and daylight performance by adjusting the positions of windows and the beam directions of window-deployed reconfigurable intelligent surfaces (RISs) for RIS-aided outdoor-to-indoor (O2I) networks utilizing large language models (LLM) as optimizers. Firstly, we illustrate the wireless and daylight system models of RIS-aided O2I networks and formulate a joint optimization problem to enhance both wireless traffic sum rate and daylight illumination performance. Then, we present a multi-modal LLM-based window optimization (LMWO) framework, accompanied by a prompt construction template to optimize the overall performance in a zero-shot fashion, functioning as both an architect and a wireless network planner. Finally, we analyze the optimization performance of the LMWO framework and the impact of the number of windows, room size, number of RIS units, and daylight factor. Numerical results demonstrate that our proposed LMWO framework can achieve outstanding optimization performance in terms of initial performance, convergence speed, final outcomes, and time complexity, compared with classic optimization methods. The building's wireless performance can be significantly enhanced while ensuring indoor daylight performance.

cross Super Resolution Based on Deep Operator Networks

Authors: Siyuan Yang

Abstract: We use Deep Operator Networks (DeepONets) to perform super-resolution reconstruction of the solutions of two types of partial differential equations and compare the model predictions with the results obtained using conventional interpolation methods to verify the advantages of DeepONets. We employ two pooling methods to downsample the original data and conduct super-resolution reconstruction under three different resolutions of input images. The results show that the DeepONet model can predict high-frequency oscillations and small-scale structures from low-resolution inputs very well. For the two-dimensional problem, we introduce convolutional layers to extract information from input images at a lower cost than pure MLPs. We adjust the size of the training set and observe the variation of prediction errors. In both one-dimensional and two-dimensional cases, the super-resolution reconstruction using the DeepONet model demonstrates much more accurate prediction results than cubic spline interpolation, highlighting the superiority of operator learning methods in handling such problems compared to traditional interpolation techniques.

cross Wearable-Based Real-time Freezing of Gait Detection in Parkinson's Disease Using Self-Supervised Learning

Authors: Shovito Barua Soumma, Kartik Mangipudi, Daniel Peterson, Shyamal Mehta, Hassan Ghasemzadeh

Abstract: LIFT-PD is an innovative self-supervised learning framework developed for real-time detection of Freezing of Gait (FoG) in Parkinson's Disease (PD) patients, using a single triaxial accelerometer. It minimizes the reliance on large labeled datasets by applying a Differential Hopping Windowing Technique (DHWT) to address imbalanced data during training. Additionally, an Opportunistic Inference Module is used to reduce energy consumption by activating the model only during active movement periods. Extensive testing on publicly available datasets showed that LIFT-PD improved precision by 7.25% and accuracy by 4.4% compared to supervised models, while using 40% fewer labeled samples and reducing inference time by 67%. These findings make LIFT-PD a highly practical and energy-efficient solution for continuous, in-home monitoring of PD patients.
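
The idea of activating the model only during active movement can be sketched as a simple gate in front of the classifier. This is a hedged illustration rather than the paper's Opportunistic Inference Module: the activity test (standard deviation of the acceleration magnitude within a window), the threshold value, and the function names are assumptions.

```python
import math
from statistics import pstdev

def is_active(window, threshold=0.05):
    """Assumed activity test: the acceleration magnitude varies by more than
    `threshold` (in the sensor's units) within the window."""
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in window]
    return pstdev(magnitudes) > threshold

def opportunistic_inference(windows, model, threshold=0.05):
    """Run `model` only on active windows; inactive windows yield None."""
    return [model(w) if is_active(w, threshold) else None for w in windows]

if __name__ == "__main__":
    still = [(0.0, 0.0, 1.0)] * 10                    # sensor at rest
    moving = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.5)] * 5   # fluctuating signal
    calls = []
    def dummy_model(window):
        """Placeholder for a FoG classifier; records that it was invoked."""
        calls.append(len(window))
        return "checked"
    print(opportunistic_inference([still, moving], dummy_model))
    print("model invocations:", len(calls))
```

Skipping inference on still windows is where the energy saving would come from in such a design, since the cheap activity test replaces a full model evaluation.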

cross Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation

Authors: Mufei Li, Siqi Miao, Pan Li

Abstract: Large Language Models (LLMs) demonstrate strong reasoning abilities but face limitations such as hallucinations and outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLM outputs in structured external knowledge from KGs. However, current KG-based RAG frameworks still struggle to optimize the trade-off between retrieval effectiveness and efficiency in identifying a suitable amount of relevant graph information for the LLM to digest. We introduce SubgraphRAG, extending the KG-based RAG framework that retrieves subgraphs and leverages LLMs for reasoning and answer prediction. Our approach innovatively integrates a lightweight multilayer perceptron with a parallel triple-scoring mechanism for efficient and flexible subgraph retrieval while encoding directional structural distances to enhance retrieval effectiveness. The size of retrieved subgraphs can be flexibly adjusted to match the query's need and the downstream LLM's capabilities. This design strikes a balance between model complexity and reasoning power, enabling scalable and generalizable retrieval processes. Notably, based on our retrieved subgraphs, smaller LLMs like Llama3.1-8B-Instruct deliver competitive results with explainable reasoning, while larger models like GPT-4o achieve state-of-the-art accuracy compared with previous baselines -- all without fine-tuning. Extensive evaluations on the WebQSP and CWQ benchmarks highlight SubgraphRAG's strengths in efficiency, accuracy, and reliability by reducing hallucinations and improving response grounding.

cross Mitigating Unauthorized Speech Synthesis for Voice Protection

Authors: Zhisheng Zhang, Qianyi Yang, Derui Wang, Pengyang Huang, Yuxin Cao, Kai Ye, Jie Hao

Abstract: In recent years, it has become possible to perfectly replicate a speaker's voice with just a few speech samples, while malicious voice exploitation (e.g., telecom fraud for illegal financial gain) has brought huge hazards to our daily lives. Therefore, it is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints. Most previous defense methods have focused on spoofing speaker verification systems in timbre similarity, but the synthesized deepfake speech is still of high quality. In response to the rising hazards, we devise an effective, transferable, and robust proactive protection technology named Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noises to original speech samples to prevent them from being effectively learned by text-to-speech (TTS) synthesis models, so that high-quality deepfake speech cannot be generated. We conduct extensive experiments on state-of-the-art (SOTA) TTS models utilizing objective and subjective metrics to comprehensively evaluate our proposed method. The experimental results demonstrate outstanding effectiveness and transferability across various models. Compared to the speech unclarity score of 21.94% from voice synthesizers trained on samples without protection, POP-protected samples significantly increase it to 127.31%. Moreover, our method shows robustness against noise reduction and data augmentation techniques, thereby greatly reducing potential hazards.

cross Plan$\times$RAG: Planning-guided Retrieval Augmented Generation

Authors: Prakhar Verma, Sukruta Prakash Midigeshi, Gaurav Sinha, Arno Solin, Nagarajan Natarajan, Amit Sharma

Abstract: We introduce Planning-guided Retrieval Augmented Generation (Plan$\times$RAG), a novel framework that augments the \emph{retrieve-then-reason} paradigm of existing RAG frameworks to \emph{plan-then-retrieve}. Plan$\times$RAG formulates a reasoning plan as a directed acyclic graph (DAG), decomposing queries into interrelated atomic sub-queries. Answer generation follows the DAG structure, allowing significant gains in efficiency through parallelized retrieval and generation. While state-of-the-art RAG solutions require extensive data generation and fine-tuning of language models (LMs), Plan$\times$RAG incorporates frozen LMs as plug-and-play experts to generate high-quality answers. Compared to existing RAG solutions, Plan$\times$RAG demonstrates significant improvements in reducing hallucinations and bolstering attribution due to its structured sub-query decomposition. Overall, Plan$\times$RAG offers a new perspective on integrating external knowledge in LMs while ensuring attribution by design, contributing towards more reliable LM-based systems.
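The plan-then-retrieve idea can be sketched as a topological traversal in which all sub-queries whose parents are answered run in parallel. This is a minimal sketch, not the authors' implementation: `run_plan`, `toy_answer`, and the example plan are hypothetical, and `toy_answer` stands in for the retrieval and frozen-LM call. It assumes Python 3.9+ for `graphlib`.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def run_plan(plan, answer_fn):
    """Answer every sub-query of a DAG plan, level by level.

    plan maps a sub-query id to (text, list of parent ids).
    answer_fn(text, parent_answers) returns the answer string.
    """
    sorter = TopologicalSorter({q: parents for q, (_, parents) in plan.items()})
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    answers = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = list(sorter.get_ready())  # all parents already answered
            futures = {
                q: pool.submit(answer_fn, plan[q][0],
                               {p: answers[p] for p in plan[q][1]})
                for q in ready
            }
            for q, future in futures.items():
                answers[q] = future.result()
                sorter.done(q)
    return answers

def toy_answer(text, parent_answers):
    """Placeholder for retrieval plus generation (assumption)."""
    context = ", ".join(parent_answers[p] for p in sorted(parent_answers))
    return f"{text} <- [{context}]"

plan = {
    "q1": ("Who directed film X?", []),
    "q2": ("When was film X released?", []),
    "q3": ("How old was the director at release?", ["q1", "q2"]),
}
answers = run_plan(plan, toy_answer)
```

Here `q1` and `q2` have no parents and are submitted together, while `q3` waits for both; each answer can be attributed to the sub-query that produced it.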

cross Likelihood approximations via Gaussian approximate inference

Authors: Thang D. Bui

Abstract: Non-Gaussian likelihoods are essential for modelling complex real-world observations but pose significant computational challenges in learning and inference. Even with Gaussian priors, non-Gaussian likelihoods often lead to analytically intractable posteriors, necessitating approximation methods. To this end, we propose efficient schemes to approximate the effects of non-Gaussian likelihoods by Gaussian densities based on variational inference and moment matching in transformed bases. These enable efficient inference strategies originally designed for models with a Gaussian likelihood to be deployed. Our empirical results demonstrate that the proposed matching strategies attain good approximation quality for binary and multiclass classification in large-scale point-estimate and distributional inferential settings. In challenging streaming problems, the proposed methods outperform all existing likelihood approximations and approximate inference methods in the exact models. As a by-product, we show that the proposed approximate log-likelihoods are a superior alternative to least-squares on raw labels for neural network classification.

cross Robust Estimation for Kernel Exponential Families with Smoothed Total Variation Distances

Authors: Takafumi Kanamori, Kodai Yokoyama, Takayuki Kawashima

Abstract: In statistical inference, we commonly assume that samples are independent and identically distributed from a probability distribution included in a pre-specified statistical model. However, such an assumption is often violated in practice. Even an unexpected extreme sample called an {\it outlier} can significantly impact classical estimators. Robust statistics studies how to construct reliable statistical methods that efficiently work even when the ideal assumption is violated. Recently, some works revealed that robust estimators such as Tukey's median are well approximated by the generative adversarial net (GAN), a popular learning method for complex generative models using neural networks. GAN is regarded as a learning method using integral probability metrics (IPM), which is a discrepancy measure for probability distributions. In most theoretical analyses of Tukey's median and its GAN-based approximation, however, the Gaussian or elliptical distribution is assumed as the statistical model. In this paper, we explore the application of GAN-like estimators to a general class of statistical models. As the statistical model, we consider the kernel exponential family that includes both finite and infinite-dimensional models. To construct a robust estimator, we propose the smoothed total variation (STV) distance as a class of IPMs. Then, we theoretically investigate the robustness properties of the STV-based estimators. Our analysis reveals that the STV-based estimator is robust against the distribution contamination for the kernel exponential family. Furthermore, we analyze the prediction accuracy of a Monte Carlo approximation method, which circumvents the computational difficulty of the normalization constant.

cross MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

Authors: Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, R\'obert Csord\'as

Abstract: Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level models like ByT5 attempt to address these concerns, they have not gained widespread adoption -- processing raw byte streams without tokenization results in significantly longer sequence lengths, making training and inference inefficient. This work introduces MrT5 (MergeT5), a more efficient variant of ByT5 that integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length. After processing through a fixed number of encoder layers, a learnt delete gate determines which tokens are to be removed and which are to be retained for subsequent layers. MrT5 effectively ``merges'' critical information from deleted tokens into a more compact sequence, leveraging contextual information from the remaining tokens. In continued pre-training experiments, we find that MrT5 can achieve significant gains in inference runtime with minimal effect on performance. When trained on English text, MrT5 demonstrates the capability to transfer its deletion feature zero-shot across several languages, with significant additional improvements following multilingual training. Furthermore, MrT5 shows comparable accuracy to ByT5 on downstream evaluations such as XNLI and character-level tasks while reducing sequence lengths by up to 80%. Our approach presents a solution to the practical limitations of existing byte-level models.

cross An Ensemble Approach to Music Source Separation: A Comparative Analysis of Conventional and Hierarchical Stem Separation

Authors: Saarth Vardhan, Pavani R Acharya, Samarth S Rao, Oorjitha Ratna Jasthi, S Natarajan

Abstract: Music source separation (MSS) is a task that involves isolating individual sound sources, or stems, from mixed audio signals. This paper presents an ensemble approach to MSS, combining several state-of-the-art architectures to achieve superior separation performance across traditional Vocal, Drum, and Bass (VDB) stems, as well as expanding into second-level hierarchical separation for sub-stems like kick, snare, lead vocals, and background vocals. Our method addresses the limitations of relying on a single model by utilising the complementary strengths of various models, leading to more balanced results across stems. For stem selection, we used the harmonic mean of Signal-to-Noise Ratio (SNR) and Signal-to-Distortion Ratio (SDR), ensuring that extreme values do not skew the results and that both metrics are weighted effectively. In addition to consistently high performance across the VDB stems, we also explored second-level hierarchical separation, revealing important insights into the complexities of MSS and how factors like genre and instrumentation can influence model performance. While the second-level separation results show room for improvement, the ability to isolate sub-stems marks a significant advancement. Our findings pave the way for further research in MSS, particularly in expanding model capabilities beyond VDB and improving niche stem separations such as guitar and piano.
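The stem-selection rule can be illustrated with a few lines of code. The sketch below picks, for each stem, the model with the highest harmonic mean of SNR and SDR. The function names, the score values, and the treatment of non-positive scores (mapped to 0) are assumptions made for illustration and are not taken from the paper.

```python
def harmonic_mean(snr_db, sdr_db):
    """Harmonic mean of two scores; non-positive inputs score 0 (assumption)."""
    if snr_db <= 0 or sdr_db <= 0:
        return 0.0
    return 2 * snr_db * sdr_db / (snr_db + sdr_db)

def select_models(scores):
    """scores: {stem: {model: (snr_db, sdr_db)}} -> {stem: best model}."""
    return {
        stem: max(per_model, key=lambda m: harmonic_mean(*per_model[m]))
        for stem, per_model in scores.items()
    }

# Invented numbers, for illustration only.
scores = {
    "vocals": {"model_a": (10.0, 2.0), "model_b": (6.0, 6.0)},
    "drums": {"model_a": (8.0, 7.0), "model_b": (5.0, 5.0)},
}
best = select_models(scores)
```

For the vocals stem both models have the same arithmetic mean (6.0), but the harmonic mean penalises the unbalanced pair (10, 2), so `model_b` is chosen.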

cross KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation

Authors: Rambod Azimi, Rishav Rishav, Marek Teichmann, Samira Ebrahimi Kahou

Abstract: Large language models (LLMs) have demonstrated remarkable performance across various downstream tasks. However, the high computational and memory requirements of LLMs are a major bottleneck. To address this, parameter-efficient fine-tuning (PEFT) methods such as low-rank adaptation (LoRA) have been proposed to reduce computational costs while ensuring minimal loss in performance. Additionally, knowledge distillation (KD) has been a popular choice for obtaining compact student models from teacher models. In this work, we present KD-LoRA, a novel fine-tuning method that combines LoRA with KD. Our results demonstrate that KD-LoRA achieves performance comparable to full fine-tuning (FFT) and LoRA while significantly reducing resource requirements. Specifically, KD-LoRA retains 98% of LoRA's performance on the GLUE benchmark, while being 40% more compact. Additionally, KD-LoRA reduces GPU memory usage by 30% compared to LoRA, while decreasing inference time by 30% compared to both FFT and LoRA. We evaluate KD-LoRA across three encoder-only models: BERT, RoBERTa, and DeBERTaV3. Code is available at https://github.com/rambodazimi/KD-LoRA.
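The two ingredients named in the abstract can be sketched in NumPy: a frozen weight matrix with a trainable low-rank update, and a distillation loss between teacher and student logits. The abstract does not give the exact loss or hyperparameters, so the temperature-scaled KL divergence, the `alpha / r` scaling, and all shapes and names below are assumptions based on the standard LoRA and KD formulations.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Frozen weight W plus the low-rank update (alpha / r) * B @ A."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL(teacher || student), averaged over the batch."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2
W = rng.normal(size=(d_out, d_in))  # frozen student weight
A = rng.normal(size=(r, d_in))      # trainable
B = np.zeros((d_out, r))            # trainable, zero-initialised
x = rng.normal(size=(4, d_in))

frozen_out = x @ W.T
adapted_out = lora_forward(x, W, A, B, alpha=4.0)  # equals frozen_out while B == 0
trainable = A.size + B.size  # 28 trainable parameters
full = W.size                # 48 parameters in the full matrix
```

In a training loop the student would update only `A` and `B`, with `kd_loss` (possibly combined with a task loss) computed against the teacher's logits.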

URLs: https://github.com/rambodazimi/KD-LoRA.

cross Scaling-based Data Augmentation for Generative Models and its Theoretical Extension

Authors: Yoshitaka Koike, Takumi Nakagawa, Hiroki Waida, Takafumi Kanamori

Abstract: This paper studies stable learning methods for generative models that enable high-quality data generation. Noise injection is commonly used to stabilize learning. However, selecting a suitable noise distribution is challenging. Diffusion-GAN, a recently developed method, addresses this by using the diffusion process with a timestep-dependent discriminator. We investigate Diffusion-GAN and reveal that data scaling is a key component for stable learning and high-quality data generation. Building on our findings, we propose a learning algorithm, Scale-GAN, that uses data scaling and variance-based regularization. Furthermore, we theoretically prove that data scaling controls the bias-variance trade-off of the estimation error bound. As a theoretical extension, we consider GAN with invertible data augmentations. Comparative evaluations on benchmark datasets demonstrate the effectiveness of our method in improving stability and accuracy.

cross Graph-based Uncertainty Metrics for Long-form Language Model Outputs

Authors: Mingjian Jiang, Yangjun Ruan, Prasanna Sattigeri, Salim Roukos, Tatsunori Hashimoto

Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities, but these systems are still known to hallucinate, and granular uncertainty estimation for long-form LLM generations remains challenging. In this work, we propose Graph Uncertainty -- which represents the relationship between LLM generations and claims within them as a bipartite graph and estimates the claim-level uncertainty with a family of graph centrality metrics. Under this view, existing uncertainty estimation methods based on the concept of self-consistency can be viewed as using degree centrality as an uncertainty measure, and we show that more sophisticated alternatives such as closeness centrality provide consistent gains at claim-level uncertainty estimation. Moreover, we present uncertainty-aware decoding techniques that leverage both the graph structure and uncertainty estimates to improve the factuality of LLM generations by preserving only the most reliable claims. Compared to existing methods, our graph-based uncertainty metrics lead to an average of 6.8% relative gains on AUPRC across various long-form generation settings, and our end-to-end system provides consistent 2-4% gains in factuality over existing decoding techniques while significantly improving the informativeness of generated responses.
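The self-consistency baseline described above reduces to degree centrality on the bipartite response-claim graph: a claim's confidence is the fraction of sampled responses that support it. The sketch below shows only that baseline and a simple threshold filter. The support sets are invented, deciding whether a response supports a claim is assumed to be done elsewhere, and closeness centrality would require a full graph traversal that is not shown.

```python
def degree_confidence(supports, num_responses):
    """supports: {claim: set of response indices that support it}.

    Returns the normalised degree of each claim node in the bipartite graph.
    """
    return {claim: len(resp) / num_responses for claim, resp in supports.items()}

def filter_claims(supports, num_responses, threshold):
    """Keep only claims whose degree-based confidence reaches the threshold."""
    conf = degree_confidence(supports, num_responses)
    return [claim for claim in supports if conf[claim] >= threshold]

# Invented example: four sampled responses, three extracted claims.
supports = {
    "born in 1950": {0, 1, 2, 3},
    "studied physics": {0, 2},
    "won a medal in 1975": {3},
}
conf = degree_confidence(supports, num_responses=4)
kept = filter_claims(supports, num_responses=4, threshold=0.5)
```

Uncertainty-aware decoding in this spirit would regenerate the final answer from `kept` only.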

cross SCULPT: Systematic Tuning of Long Prompts

Authors: Shanu Kumar, Akhila Yesantarao Venkata, Shubhanshu Khandelwal, Bishal Santra, Parag Agrawal, Manish Gupta

Abstract: As large language models become increasingly central to solving complex tasks, the challenge of optimizing long, unstructured prompts has become critical. Existing optimization techniques often struggle to effectively handle such prompts, leading to suboptimal performance. We introduce SCULPT (Systematic Tuning of Long Prompts), a novel framework that systematically refines long prompts by structuring them hierarchically and applying an iterative actor-critic mechanism. To enhance robustness and generalizability, SCULPT utilizes two complementary feedback mechanisms: Preliminary Assessment, which assesses the prompt's structure before execution, and Error Assessment, which diagnoses and addresses errors post-execution. By aggregating feedback from these mechanisms, SCULPT avoids overfitting and ensures consistent improvements in performance. Our experimental results demonstrate significant accuracy gains and enhanced robustness, particularly in handling erroneous and misaligned prompts. SCULPT consistently outperforms existing approaches, establishing itself as a scalable solution for optimizing long prompts across diverse and real-world tasks.

cross Deep Learning for Medical Text Processing: BERT Model Fine-Tuning and Comparative Study

Authors: Jiacheng Hu, Yiru Cang, Guiran Liu, Meiqi Wang, Weijie He, Runyuan Bao

Abstract: This paper proposes a medical literature summary generation method based on the BERT model to address the challenges brought by the current explosion of medical information. By fine-tuning and optimizing the BERT model, we develop an efficient summary generation system that can quickly extract key information from medical literature and generate coherent, accurate summaries. In the experiment, we compared various models, including Seq-Seq, Attention, Transformer, and BERT, and demonstrated that the improved BERT model offers significant advantages in the Rouge and Recall metrics. Furthermore, the results of this study highlight the potential of knowledge distillation techniques to further enhance model performance. The system has demonstrated strong versatility and efficiency in practical applications, offering a reliable tool for the rapid screening and analysis of medical literature.

cross Fidelity-Imposed Displacement Editing for the Learn2Reg 2024 SHG-BF Challenge

Authors: Jiacheng Wang, Xiang Chen, Renjiu Hu, Rongguang Wang, Min Liu, Yaonan Wang, Jiazheng Wang, Hao Li, Hang Zhang

Abstract: Co-examination of second-harmonic generation (SHG) and bright-field (BF) microscopy enables the differentiation of tissue components and collagen fibers, aiding the analysis of human breast and pancreatic cancer tissues. However, large discrepancies between SHG and BF images pose challenges for current learning-based registration models in aligning SHG to BF. In this paper, we propose a novel multi-modal registration framework that employs fidelity-imposed displacement editing to address these challenges. The framework integrates batch-wise contrastive learning, feature-based pre-alignment, and instance-level optimization. Experimental results from the Learn2Reg COMULISglobe SHG-BF Challenge validate the effectiveness of our method, securing the 1st place on the online leaderboard.

cross FreqMark: Invisible Image Watermarking via Frequency Based Optimization in Latent Space

Authors: Yiyang Guo, Ruizhe Li, Mude Hui, Hanzhong Guo, Chen Zhang, Chuangjian Cai, Le Wan, Shangfei Wang

Abstract: Invisible watermarking is essential for safeguarding digital content, enabling copyright protection and content authentication. However, existing watermarking methods fall short in robustness against regeneration attacks. In this paper, we propose a novel method called FreqMark that involves unconstrained optimization of the image latent frequency space obtained after VAE encoding. Specifically, FreqMark embeds the watermark by optimizing the latent frequency space of the images and then extracts the watermark through a pre-trained image encoder. This optimization allows a flexible trade-off between image quality with watermark robustness and effectively resists regeneration attacks. Experimental results demonstrate that FreqMark offers significant advantages in image quality and robustness, permits flexible selection of the encoding bit number, and achieves a bit accuracy exceeding 90% when encoding a 48-bit hidden message under various attack scenarios.

cross Asteroid Mining: ACT&Friends' Results for the GTOC 12 Problem

Authors: Dario Izzo, Marcus M\"artens, Laurent Beauregard, Max Bannach, Giacomo Acciarini, Emmanuel Blazquez, Alexander Hadjiivanov, Jai Grover, Gernot Hei{\ss}el, Yuri Shimane, Chit Hong Yam

Abstract: In 2023, the 12th edition of the Global Trajectory Optimisation Competition (GTOC) was organised around the problem referred to as "Sustainable Asteroid Mining". This paper reports the developments that led to the solution proposed by ESA's Advanced Concepts Team. Beyond the fact that the proposed approach failed to rank higher than fourth in the final competition leaderboard, several innovative fundamental methodologies were developed which have a broader application. In particular, new methods based on machine learning as well as on manipulating the fundamental laws of astrodynamics were developed that fill, with remarkable accuracy, the gap between full low-thrust trajectories and their representation as impulsive Lambert transfers. A novel technique was devised to formulate the challenge of optimal subset selection from a repository of pre-existing optimal mining trajectories as an integer linear programming problem. Finally, the fundamental problem of searching for single optimal mining trajectories (mining and collecting all resources), albeit ignoring the possibility of having intra-ship collaboration and thus sub-optimal in the case of the GTOC12 problem, was efficiently solved by means of a novel search based on a look-ahead score, thus making sure to select asteroids that had chances to be re-visited later on.

cross Active Causal Structure Learning with Latent Variables: Towards Learning to Detour in Autonomous Robots

Authors: Pablo de los Riscos, Fernando Corbacho

Abstract: Artificial General Intelligence (AGI) Agents and Robots must be able to cope with ever-changing environments and tasks. They must be able to actively construct new internal causal models of their interactions with the environment when new structural changes take place in the environment. Thus, we claim that active causal structure learning with latent variables (ACSLWL) is a necessary component to build AGI agents and robots. This paper describes how a complex planning and expectation-based detour behavior can be learned by ACSLWL when, unexpectedly, and for the first time, the simulated robot encounters a sort of transparent barrier in its pathway towards its target. ACSLWL consists of acting in the environment, discovering new causal relations, constructing new causal models, exploiting the causal models to maximize its expected utility, detecting possible latent variables when unexpected observations occur, and constructing new structures (internal causal models and optimal estimates of the associated parameters) to be able to cope efficiently with the newly encountered situations. That is, the agent must be able to construct new causal internal models that transform a previously unexpected and inefficient (sub-optimal) situation into a predictable situation with an optimal operating plan.

cross Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models

Authors: Weijian Luo, Colin Zhang, Debing Zhang, Zhengyang Geng

Abstract: In this paper, we introduce Diff-Instruct* (DI*), a data-free approach for building one-step text-to-image generative models that align with human preference while maintaining the ability to generate highly realistic images. We frame human preference alignment as online reinforcement learning using human feedback (RLHF), where the goal is to maximize the reward function while regularizing the generator distribution to remain close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the KL divergence for regularization, we introduce a novel score-based divergence regularization, which leads to significantly better performance. Although the direct calculation of this divergence remains intractable, we demonstrate that we can efficiently compute its \emph{gradient} by deriving an equivalent yet tractable loss function. Remarkably, with Stable Diffusion V1.5 as the reference diffusion model, DI* outperforms \emph{all} previously leading models by a large margin. When using the 0.6B PixelArt-$\alpha$ model as the reference diffusion, DI* achieves a new record Aesthetic Score of 6.30 and an Image Reward of 1.31 with only a single generation step, almost doubling the scores of the rest of the models with similar sizes. It also achieves an HPSv2 score of 28.70, establishing a new state-of-the-art benchmark. We also observe that DI* can improve the layout and enrich the colors of generated images.

cross FACTS: A Factored State-Space Framework For World Modelling

Authors: Li Nanbo, Firas Laakom, Yucheng Xu, Wenyi Wang, J\"urgen Schmidhuber

Abstract: World modelling is essential for understanding and predicting the dynamics of complex systems by learning both spatial and temporal dependencies. However, current frameworks, such as Transformers and selective state-space models like Mambas, exhibit limitations in efficiently encoding spatial and temporal structures, particularly in scenarios requiring long-term high-dimensional sequence modelling. To address these issues, we propose a novel recurrent framework, the \textbf{FACT}ored \textbf{S}tate-space (\textbf{FACTS}) model, for spatial-temporal world modelling. The FACTS framework constructs a graph-structured memory with a routing mechanism that learns permutable memory representations, ensuring invariance to input permutations while adapting through selective state-space propagation. Furthermore, FACTS supports parallel computation of high-dimensional sequences. We empirically evaluate FACTS across diverse tasks, including multivariate time series forecasting and object-centric world modelling, demonstrating that it consistently outperforms or matches specialised state-of-the-art models, despite its general-purpose world modelling design.

cross Neuro-symbolic Learning Yielding Logical Constraints

Authors: Zenan Li, Yunpeng Huang, Zhaoyu Li, Yuan Yao, Jingwei Xu, Taolue Chen, Xiaoxing Ma, Jian Lu

Abstract: Neuro-symbolic systems combine the abilities of neural perception and logical reasoning. However, end-to-end learning of neuro-symbolic systems is still an unsolved challenge. This paper proposes a natural framework that fuses neural network training, symbol grounding, and logical constraint synthesis into a coherent and efficient end-to-end learning process. The capability of this framework comes from the improved interactions between the neural and the symbolic parts of the system in both the training and inference stages. Technically, to bridge the gap between the continuous neural network and the discrete logical constraint, we introduce a difference-of-convex programming technique to relax the logical constraints while maintaining their precision. We also employ cardinality constraints as the language for logical constraint learning and incorporate a trust region method to avoid the degeneracy of logical constraint in learning. Both theoretical analyses and empirical evaluations substantiate the effectiveness of the proposed framework.

cross DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning

Authors: Xun Guo, Shan Zhang, Yongxin He, Ting Zhang, Wanquan Feng, Haibin Huang, Chongyang Ma

Abstract: Current techniques for detecting AI-generated text are largely confined to manual feature crafting and supervised binary classification paradigms. These methodologies typically lead to performance bottlenecks and unsatisfactory generalizability. Consequently, these methods are often inapplicable for out-of-distribution (OOD) data and newly emerged large language models (LLMs). In this paper, we revisit the task of AI-generated text detection. We argue that the key to accomplishing this task lies in distinguishing writing styles of different authors, rather than simply classifying the text into human-written or AI-generated text. To this end, we propose DeTeCtive, a multi-task auxiliary, multi-level contrastive learning framework. DeTeCtive is designed to facilitate the learning of distinct writing styles, combined with a dense information retrieval pipeline for AI-generated text detection. Our method is compatible with a range of text encoders. Extensive experiments demonstrate that our method enhances the ability of various text encoders in detecting AI-generated text across multiple benchmarks and achieves state-of-the-art results. Notably, in OOD zero-shot evaluation, our method outperforms existing approaches by a large margin. Moreover, we find our method boasts a Training-Free Incremental Adaptation (TFIA) capability towards OOD data, further enhancing its efficacy in OOD detection scenarios. We will open-source our code and models in hopes that our work will spark new thoughts in the field of AI-generated text detection, ensuring safe application of LLMs and enhancing compliance. Our code is available at https://github.com/heyongxin233/DeTeCtive.

URLs: https://github.com/heyongxin233/DeTeCtive.

cross BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Authors: Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang

Abstract: Despite their superb multimodal capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks, which are inference-time attacks that induce the model to output harmful responses with tricky prompts. It is thus essential to defend VLMs against potential jailbreaks for their trustworthy deployment in real-world applications. In this work, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs. To address these limitations, we propose a novel blue-team method BlueSuffix that defends the black-box target VLM against jailbreak attacks without compromising its performance. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator fine-tuned via reinforcement learning for enhancing cross-modal robustness. We empirically show on three VLMs (LLaVA, MiniGPT-4, and Gemini) and two safety benchmarks (MM-SafetyBench and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks.

cross Large Language Model-Guided Prediction Toward Quantum Materials Synthesis

Authors: Ryotaro Okabe, Zack West, Abhijatmedhi Chotrattanapituk, Mouyang Cheng, Denisse C\'ordova Carrizales, Weiwei Xie, Robert J. Cava, Mingda Li

Abstract: The synthesis of inorganic crystalline materials is essential for modern technology, especially in quantum materials development. However, designing efficient synthesis workflows remains a significant challenge due to the precise experimental conditions and extensive trial and error. Here, we present a framework using large language models (LLMs) to predict synthesis pathways for inorganic materials, including quantum materials. Our framework contains three models: LHS2RHS, predicting products from reactants; RHS2LHS, predicting reactants from products; and TGT2CEQ, generating full chemical equations for target compounds. Fine-tuned on a text-mined synthesis database, our model raises accuracy from under 40% with pretrained models to under 80% using conventional fine-tuning, and further to around 90% with our proposed generalized Tanimoto similarity, while remaining robust to additional synthesis steps. Our model further demonstrates comparable performance across materials with varying degrees of quantumness quantified using quantum weight, indicating that LLMs offer a powerful tool to predict balanced chemical equations for quantum materials discovery.
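For readers unfamiliar with Tanimoto similarity, the sketch below computes the common multiset form (sum of minimum counts over sum of maximum counts), which reduces to the Jaccard index for sets. The abstract does not define the paper's generalized variant, so this is the textbook definition rather than the authors' metric, and treating each compound formula as one token is an assumption.

```python
from collections import Counter

def tanimoto(predicted, reference):
    """Multiset Tanimoto similarity between two token lists, in [0, 1]."""
    cp, cr = Counter(predicted), Counter(reference)
    keys = set(cp) | set(cr)
    denom = sum(max(cp[k], cr[k]) for k in keys)
    if denom == 0:
        return 1.0  # two empty token lists are treated as identical (assumption)
    return sum(min(cp[k], cr[k]) for k in keys) / denom

# Hypothetical reactant lists for one side of an equation.
exact = tanimoto(["BaCO3", "TiO2"], ["TiO2", "BaCO3"])    # order is ignored
partial = tanimoto(["BaCO3", "ZrO2"], ["BaCO3", "TiO2"])  # one of three distinct tokens shared
```

Unlike exact string match, a score of this kind gives partial credit when a predicted equation shares some compounds with the reference.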

cross Reference-Free Formula Drift with Reinforcement Learning: From Driving Data to Tire Energy-Inspired, Real-World Policies

Authors: Franck Djeumou, Michael Thompson, Makoto Suminaka, John Subosits

Abstract: The skill to drift a car--i.e., operate in a state of controlled oversteer like professional drivers--could give future autonomous cars maximum flexibility when they need to retain control in adverse conditions or avoid collisions. We investigate real-time drifting strategies that put the car where needed while bypassing expensive trajectory optimization. To this end, we design a reinforcement learning agent that builds on the concept of tire energy absorption to autonomously drift through changing and complex waypoint configurations while safely staying within track bounds. We achieve zero-shot deployment on the car by training the agent in a simulation environment built on top of a neural stochastic differential equation vehicle model learned from pre-collected driving data. Experiments on a Toyota GR Supra and Lexus LC 500 show that the agent is capable of drifting smoothly through varying waypoint configurations with tracking error as low as 10 cm while stably pushing the vehicles to sideslip angles of up to 63°.

cross SepMamba: State-space models for speaker separation using Mamba

Authors: Thor Højhus Avenstrup, Boldizsár Elek, István László Mádi, András Bence Schin, Morten Mørup, Bjørn Sand Jensen, Kenny Falkær Olsen

Abstract: Deep learning-based single-channel speaker separation has improved significantly in recent years largely due to the introduction of the transformer-based attention mechanism. However, these improvements come at the expense of intense computational demands, precluding their use in many practical applications. As a computationally efficient alternative with similar modeling capabilities, Mamba was recently introduced. We propose SepMamba, a U-Net-based architecture composed primarily of bidirectional Mamba layers. We find that our approach outperforms similarly-sized prominent models - including transformer-based models - on the WSJ0 2-speaker dataset while enjoying a significant reduction in computational cost, memory usage, and forward pass time. We additionally report strong results for causal variants of SepMamba. Our approach provides a computationally favorable alternative to transformer-based architectures for deep speech separation.

cross Breccia and basalt classification of thin sections of Apollo rocks with deep learning

Authors: Freja Thoresen, Aidan Cowley, Romeo Haak, Jonas Lewe, Clara Moriceau, Piotr Knapczyk, Victoria S. Engelschiøn

Abstract: Human exploration of the Moon is expected to resume in the next decade, following the last such activities during the Apollo programme. One of the major objectives of returning to the Moon is to continue retrieving geological samples, with a focus on collecting high-quality specimens to maximize scientific return. Tools that assist astronauts in making informed decisions about sample collection activities can maximize the scientific value of future lunar missions. A lunar rock classifier is a tool that can potentially provide the necessary information for astronauts to analyze lunar rock samples, allowing them to augment in-situ value identification of samples. Towards demonstrating the value of such a tool, in this paper, we introduce a framework for classifying rock types in thin sections of lunar rocks. We leverage the vast collection of petrographic thin-section images from the Apollo missions, captured under plane-polarized light (PPL), cross-polarized light (XPL), and reflected light at varying magnifications. Advanced machine learning methods, including contrastive learning, are applied to analyze these images and extract meaningful features. The contrastive learning approach fine-tunes a pre-trained Inception-ResNet-v2 network with the SimCLR loss function. The fine-tuned Inception-ResNet-v2 network can then extract essential features effectively from the thin-section images of Apollo rocks. A simple binary classifier is trained using transfer learning from the fine-tuned Inception-ResNet-v2 to 98.44% (±1.47) accuracy in separating breccias from basalts.

cross BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration

Authors: James Sharpnack, Kevin Hao, Phoebe Mulcaire, Klinton Bicknell, Geoff LaFlair, Kevin Yancey, Alina A. von Davier

Abstract: In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration - learning item parameters in a test - is done using AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT), originally proposed in [Sharpnack et al., 2024]. AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (AutoGluon.tabular, [Erickson et al., 2020]) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring. For administration of our adaptive test, we propose the BanditCAT framework, a methodology motivated by casting the problem in the contextual bandit framework and utilizing item response theory (IRT). The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about ability. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We outline some reliability and exposure metrics for the 5 practice test experiments that utilized this framework.
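
The item-selection idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it assumes a two-parameter logistic (2PL) IRT model, a Gaussian approximation to the ability posterior, and omits the extra randomization step used for exposure control. The names `fisher_information`, `select_item`, and the toy item bank are hypothetical.

```python
import math
import random


def fisher_information(theta, a, b):
    """Fisher information of a 2PL item (discrimination a, difficulty b) at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_item(items, post_mean, post_sd, rng):
    """Thompson-style step: sample an ability from the (assumed Gaussian)
    posterior, then pick the item with the highest Fisher information there."""
    theta = rng.gauss(post_mean, post_sd)
    return max(items, key=lambda name: fisher_information(theta, *items[name]))


# Toy item bank: name -> (discrimination, difficulty). Values are made up.
items = {"easy": (1.0, -2.0), "medium": (1.5, 0.0), "hard": (1.0, 2.0)}
rng = random.Random(0)
choice = select_item(items, post_mean=0.0, post_sd=0.1, rng=rng)
```

Because the ability is sampled rather than fixed at the posterior mean, a wide posterior early in the test spreads selections across items of different difficulty, while a narrow posterior later concentrates on the most discriminative items near the estimated ability.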

cross CTINEXUS: Leveraging Optimized LLM In-Context Learning for Constructing Cybersecurity Knowledge Graphs Under Data Scarcity

Authors: Yutong Cheng, Osama Bajaber, Saimon Amanuel Tsegai, Dawn Song, Peng Gao

Abstract: Textual descriptions in cyber threat intelligence (CTI) reports, such as security articles and news, are rich sources of knowledge about cyber threats, crucial for organizations to stay informed about the rapidly evolving threat landscape. However, current CTI extraction methods lack flexibility and generalizability, often resulting in inaccurate and incomplete knowledge extraction. Syntax parsing relies on fixed rules and dictionaries, while model fine-tuning requires large annotated datasets, making both paradigms challenging to adapt to new threats and ontologies. To bridge the gap, we propose CTINexus, a novel framework leveraging optimized in-context learning (ICL) of large language models (LLMs) for data-efficient CTI knowledge extraction and high-quality cybersecurity knowledge graph (CSKG) construction. Unlike existing methods, CTINexus requires neither extensive data nor parameter tuning and can adapt to various ontologies with minimal annotated examples. This is achieved through (1) a carefully designed automatic prompt construction strategy with optimal demonstration retrieval for extracting a wide range of cybersecurity entities and relations; (2) a hierarchical entity alignment technique that canonicalizes the extracted knowledge and removes redundancy; (3) an ICL-enhanced long-distance relation prediction technique to further complete the CSKG with missing links. Our extensive evaluations using 150 real-world CTI reports collected from 10 platforms demonstrate that CTINexus significantly outperforms existing methods in constructing accurate and complete CSKGs, highlighting its potential to transform CTI analysis with an efficient and adaptable solution for the dynamic threat landscape.

cross Learning to Handle Complex Constraints for Vehicle Routing Problems

Authors: Jieyi Bi, Yining Ma, Jianan Zhou, Wen Song, Zhiguang Cao, Yaoxin Wu, Jie Zhang

Abstract: Vehicle Routing Problems (VRPs) can model many real-world scenarios and often involve complex constraints. While recent neural methods excel in constructing solutions based on feasibility masking, they struggle with handling complex constraints, especially when obtaining the masking itself is NP-hard. In this paper, we propose a novel Proactive Infeasibility Prevention (PIP) framework to advance the capabilities of neural methods towards more complex VRPs. Our PIP integrates the Lagrangian multiplier as a basis to enhance constraint awareness and introduces preventative infeasibility masking to proactively steer the solution construction process. Moreover, we present PIP-D, which employs an auxiliary decoder and two adaptive strategies to learn and predict these tailored masks, potentially enhancing performance while significantly reducing computational costs during training. To verify our PIP designs, we conduct extensive experiments on the highly challenging Traveling Salesman Problem with Time Window (TSPTW), and TSP with Draft Limit (TSPDL) variants under different constraint hardness levels. Notably, our PIP is generic and can boost many neural methods, yielding both a significant reduction in infeasible rate and a substantial improvement in solution quality.

cross Accelerated Bayesian parameter estimation and model selection for gravitational waves with normalizing flows

Authors: Alicja Polanska, Thibeau Wouters, Peter T. H. Pang, Kaze K. W. Wong, Jason D. McEwen

Abstract: We present an accelerated pipeline, based on high-performance computing techniques and normalizing flows, for joint Bayesian parameter estimation and model selection and demonstrate its efficiency in gravitational wave astrophysics. We integrate the Jim inference toolkit, a normalizing flow-enhanced Markov chain Monte Carlo (MCMC) sampler, with the learned harmonic mean estimator. Our Bayesian evidence estimates run on $1$ GPU are consistent with traditional nested sampling techniques run on $16$ CPU cores, while reducing the computation time by factors of $5\times$ and $15\times$ for $4$-dimensional and $11$-dimensional gravitational wave inference problems, respectively. Our code is available in well-tested and thoroughly documented open-source packages, ensuring accessibility and reproducibility for the wider research community.

cross Stronger Regret Bounds for Safe Online Reinforcement Learning in the Linear Quadratic Regulator

Authors: Benjamin Schiffer, Lucas Janson

Abstract: Many practical applications of online reinforcement learning require the satisfaction of safety constraints while learning about the unknown environment. In this work, we study Linear Quadratic Regulator (LQR) learning with unknown dynamics, but with the additional constraint that the position must stay within a safe region for the entire trajectory with high probability. Unlike in previous works, we allow for both bounded and unbounded noise distributions and study stronger baselines of nonlinear controllers that are better suited for constrained problems than linear controllers. Due to these complications, we focus on 1-dimensional state and action spaces; however, we also discuss how we expect the high-level takeaways to generalize to higher dimensions. Our primary contribution is the first $\tilde{O}_T(\sqrt{T})$-regret bound for constrained LQR learning, which we show relative to a specific baseline of nonlinear controllers. We then prove that, for any nonlinear baseline satisfying natural assumptions, $\tilde{O}_T(\sqrt{T})$-regret is possible when the noise distribution has sufficiently large support and $\tilde{O}_T(T^{2/3})$-regret is possible for any subgaussian noise distribution. An overarching theme of our results is that enforcing safety provides "free exploration" that compensates for the added cost of uncertainty in safety constrained control, resulting in the same regret rate as in the unconstrained problem.

cross LAMA: Stable Dual-Domain Deep Reconstruction For Sparse-View CT

Authors: Chi Ding, Qingchao Zhang, Ge Wang, Xiaojing Ye, Yunmei Chen

Abstract: Inverse problems arise in many applications, especially tomographic imaging. We develop a Learned Alternating Minimization Algorithm (LAMA) to solve such problems via two-block optimization by synergizing data-driven and classical techniques with proven convergence. LAMA is naturally induced by a variational model with learnable regularizers in both data and image domains, parameterized as composite functions of neural networks trained with domain-specific data. We allow these regularizers to be nonconvex and nonsmooth to extract features from data effectively. We minimize the overall objective function using Nesterov's smoothing technique and residual learning architecture. It is demonstrated that LAMA reduces network complexity, improves memory efficiency, and enhances reconstruction accuracy, stability, and interpretability. Extensive experiments show that LAMA significantly outperforms state-of-the-art methods on popular benchmark datasets for Computed Tomography.

cross Robustness and Generalization in Quantum Reinforcement Learning via Lipschitz Regularization

Authors: Nico Meyer, Julian Berberich, Christopher Mutschler, Daniel D. Scherer

Abstract: Quantum machine learning leverages quantum computing to enhance accuracy and reduce model complexity compared to classical approaches, promising significant advancements in various fields. Within this domain, quantum reinforcement learning has garnered attention, often realized using variational quantum circuits to approximate the policy function. This paper addresses the robustness and generalization of quantum reinforcement learning by combining principles from quantum computing and control theory. Leveraging recent results on robust quantum machine learning, we utilize Lipschitz bounds to propose a regularized version of a quantum policy gradient approach, named the RegQPG algorithm. We show that training with RegQPG improves the robustness and generalization of the resulting policies. Furthermore, we introduce an algorithmic variant that incorporates curriculum learning, which minimizes failures during training. Our findings are validated through numerical experiments, demonstrating the practical benefits of our approach.

cross A Unified Solution to Diverse Heterogeneities in One-shot Federated Learning

Authors: Jun Bai, Yiliao Song, Di Wu, Atul Sajjanhar, Yong Xiang, Wei Zhou, Xiaohui Tao, Yan Li

Abstract: One-shot federated learning (FL) limits the communication between the server and clients to a single round, which largely reduces the privacy leakage risk relative to traditional FL requiring multiple communication rounds. However, we find existing one-shot FL frameworks are vulnerable to distributional heterogeneity due to their insufficient focus on data heterogeneity while concentrating predominantly on model heterogeneity. Filling this gap, we propose a unified, data-free, one-shot federated learning framework (FedHydra) that can effectively address both model and data heterogeneity. Rather than applying existing value-only learning mechanisms, a structure-value learning mechanism is proposed in FedHydra. Specifically, a new stratified learning structure is proposed to cover data heterogeneity, and the value of each item during computation reflects model heterogeneity. By this design, the data and model heterogeneity issues are simultaneously monitored from different aspects during learning. Consequently, FedHydra can effectively mitigate both issues by minimizing their inherent conflicts. We compare FedHydra with three SOTA baselines on four benchmark datasets. Experimental results show that our method outperforms the previous one-shot FL methods in both homogeneous and heterogeneous settings.

cross Differentially Private Learned Indexes

Authors: Jianzhang Du, Tilak Mudgal, Rutvi Rahul Gadre, Yukui Luo, Chenghong Wang

Abstract: In this paper, we address the problem of efficiently answering predicate queries on encrypted databases, those secured by Trusted Execution Environments (TEEs), which enable untrusted providers to process encrypted user data without revealing its contents. A common strategy in modern databases to accelerate predicate queries is the use of indexes, which map attribute values (keys) to their corresponding positions in a sorted data array. This allows for fast lookup and retrieval of data subsets that satisfy specific predicates. Unfortunately, indexes cannot be directly applied to encrypted databases due to strong data dependent leakages. Recent approaches apply differential privacy (DP) to construct noisy indexes that enable faster access to encrypted data while maintaining provable privacy guarantees. However, these methods often suffer from large storage costs, with index sizes typically scaling linearly with the key space. To address this challenge, we propose leveraging learned indexes, a trending technique that repurposes machine learning models as indexing structures, to build more compact DP indexes.
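
To make the notion of a learned index concrete, here is a minimal non-private sketch: a least-squares line maps a key to its approximate position in the sorted array, and the recorded worst-case error bounds a small window that is then searched exactly. It assumes distinct, sorted numeric keys and leaves out the differential-privacy noise entirely, since the abstract does not describe that mechanism. The function names are hypothetical.

```python
import bisect


def fit_learned_index(keys):
    """Fit position ~ slope * key + intercept on sorted, distinct keys."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    intercept = mean_p - slope * mean_k
    # The worst-case prediction error bounds the window searched at lookup time.
    err = max(abs(slope * k + intercept - i) for i, k in enumerate(keys))
    return slope, intercept, int(err) + 1


def lookup(keys, model, key):
    """Return the position of key, or -1 if it is absent."""
    slope, intercept, err = model
    guess = int(round(slope * key + intercept))
    lo = min(len(keys), max(0, guess - err))
    hi = max(lo, min(len(keys), guess + err + 1))
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1


keys = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
model = fit_learned_index(keys)
positions = [lookup(keys, model, k) for k in keys]
```

The model itself is three numbers regardless of how large the key space is, which is the compactness argument behind replacing a per-key index structure with a learned one.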

cross On Homomorphic Encryption Based Strategies for Class Imbalance in Federated Learning

Authors: Arpit Guleria, J. Harshan, Ranjitha Prasad, B. N. Bharath

Abstract: Class imbalance in training datasets can lead to bias and poor generalization in machine learning models. While pre-processing of training datasets can efficiently address both these issues in centralized learning environments, it is challenging to detect and address these issues in a distributed learning environment such as federated learning. In this paper, we propose FLICKER, a privacy preserving framework to address issues related to global class imbalance in federated learning. At the heart of our contribution lies the popular CKKS homomorphic encryption scheme, which is used by the clients to privately share their data attributes, and subsequently balance their datasets before implementing the FL scheme. Extensive experimental results show that our proposed method significantly improves the FL accuracy numbers when used along with popular datasets and relevant baselines.

cross SoS Certifiability of Subgaussian Distributions and its Algorithmic Applications

Authors: Ilias Diakonikolas, Samuel B. Hopkins, Ankit Pensia, Stefan Tiegel

Abstract: We prove that there is a universal constant $C>0$ so that for every $d \in \mathbb N$, every centered subgaussian distribution $\mathcal D$ on $\mathbb R^d$, and every even $p \in \mathbb N$, the $d$-variate polynomial $(Cp)^{p/2} \cdot \|v\|_{2}^p - \mathbb E_{X \sim \mathcal D} \langle v,X\rangle^p$ is a sum of square polynomials. This establishes that every subgaussian distribution is \emph{SoS-certifiably subgaussian} -- a condition that yields efficient learning algorithms for a wide variety of high-dimensional statistical tasks. As a direct corollary, we obtain computationally efficient algorithms with near-optimal guarantees for the following tasks, when given samples from an arbitrary subgaussian distribution: robust mean estimation, list-decodable mean estimation, clustering mean-separated mixture models, robust covariance-aware mean estimation, robust covariance estimation, and robust linear regression. Our proof makes essential use of Talagrand's generic chaining/majorizing measures theorem.

cross BongLLaMA: LLaMA for Bangla Language

Authors: Abdullah Khan Zehady, Safi Al Mamun, Naymul Islam, Santu Karmaker

Abstract: Bangla (or "Bengali") is a language spoken by approximately 240 million native speakers and around 300 million people worldwide. Despite being the 5th largest spoken language in the world, Bangla is still a "low-resource" language, and existing pretrained language models often struggle to perform well on Bangla Language Processing (BLP) tasks. This work addresses this gap by introducing BongLLaMA (i.e., Bangla-LLaMA), an open-source large language model fine-tuned exclusively on large Bangla corpora and instruction-tuning datasets. We present our methodology, data augmentation techniques, fine-tuning details, and comprehensive benchmarking results showcasing the utility of BongLLaMA on BLP tasks. We believe BongLLaMA will serve as the new standard baseline for Bangla Language Models and, thus, facilitate future benchmarking studies focused on this widely-spoken yet "low-resource" language. All BongLLaMA models are available for public use at https://huggingface.co/BanglaLLM.

URLs: https://huggingface.co/BanglaLLM

cross On learning higher-order cumulants in diffusion models

Authors: Gert Aarts, Diaa E. Habibi, Lingxiao Wang, Kai Zhou

Abstract: To analyse how diffusion models learn correlations beyond Gaussian ones, we study the behaviour of higher-order cumulants, or connected n-point functions, under both the forward and backward process. We derive explicit expressions for the moment- and cumulant-generating functionals, in terms of the distribution of the initial data and properties of the forward process. It is shown analytically that during the forward process higher-order cumulants are conserved in models without a drift, such as the variance-expanding scheme, and that therefore the endpoint of the forward process maintains nontrivial correlations. We demonstrate that since these correlations are encoded in the score function, higher-order cumulants are learnt in the backward process, also when starting from a normal prior. We confirm our analytical results in an exactly solvable toy model with nonzero cumulants and in scalar lattice field theory.

cross HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation

Authors: Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, Wei Liu

Abstract: Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion: tokens farther away from the current position carry less relevant information. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information from arbitrary positions. Firstly, we present empirical analyses on various PEs, demonstrating that models inherently learn attention with only a local-decay pattern while forming a U-shape pattern globally, contradicting the principle of long-term decay. Furthermore, we conduct a detailed analysis of rotary position encoding (RoPE, a prevalent relative positional encoding in LLMs), and find that the U-shape attention is caused by some learned components, which are also the key factor limiting RoPE's expressiveness and extrapolation. Inspired by these insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE replaces the specific components in RoPE with position-independent ones, retaining only high-frequency signals, which also breaks the principle of long-term decay in theory. HoPE achieves two major advantages: (1) Without constraints imposed by long-term decay, contradictory factors that limit spontaneous attention optimization and model extrapolation performance are removed. (2) Components representing positions and semantics are optimized. These enhance the model's context awareness and extrapolation, as validated by extensive experiments.
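
For readers unfamiliar with the baseline being modified, the sketch below shows standard RoPE only, not HoPE, whose exact component replacement is not specified in the abstract. Each consecutive pair of dimensions is rotated by an angle proportional to the token position, with per-pair frequency base^(-2i/d); the default base of 10000 is the conventional choice and is an assumption here. The function names are hypothetical.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each pair (x[i], x[i+1]), i even, by the angle pos * base**(-i/d).

    Assumes len(x) is even.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def score(q, k, q_pos, k_pos):
    """Attention logit between a rotated query and a rotated key."""
    return sum(a * b for a, b in zip(rope(q, q_pos), rope(k, k_pos)))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
near = score(q, k, 3, 7)
shifted = score(q, k, 13, 17)  # same offset of 4, different absolute positions
```

The two scores agree because the rotation makes the logit depend only on the offset between positions. The low-frequency pairs (large i) are the ones that vary slowly with distance, which is the part of the spectrum the abstract's argument about long-term decay concerns.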

cross Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Authors: Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue

Abstract: Search engines enable the retrieval of unknown information with texts. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user's question regarding that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical due to heavy computational burdens. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages VLMs' visual understanding capabilities and web agents' real-time information access to perform open-world Retrieval-Augmented Generation via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments conducted on both open-set and closed-set QA benchmarks demonstrate that the Vision Search Assistant significantly outperforms the other models and can be widely applied to existing VLMs.

cross Zero-Shot Dense Retrieval with Embeddings from Relevance Feedback

Authors: Nour Jedidi, Yung-Sung Chuang, Leslie Shing, James Glass

Abstract: Building effective dense retrieval systems remains difficult when relevance supervision is not available. Recent work has looked to overcome this challenge by using a Large Language Model (LLM) to generate hypothetical documents that can be used to find the closest real document. However, this approach relies solely on the LLM to have domain-specific knowledge relevant to the query, which may not be practical. Furthermore, generating hypothetical documents can be inefficient as it requires the LLM to generate a large number of tokens for each query. To address these challenges, we introduce Real Document Embeddings from Relevance Feedback (ReDE-RF). Inspired by relevance feedback, ReDE-RF proposes to re-frame hypothetical document generation as a relevance estimation task, using an LLM to select which documents should be used for nearest neighbor search. Through this re-framing, the LLM no longer needs domain-specific knowledge but only needs to judge what is relevant. Additionally, relevance estimation only requires the LLM to output a single token, thereby improving search latency. Our experiments show that ReDE-RF consistently surpasses state-of-the-art zero-shot dense retrieval methods across a wide range of low-resource retrieval datasets while also making significant improvements in latency per-query.
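
The re-framing described above can be sketched as follows. This is an illustration under assumptions rather than the authors' method: the relevance judge is a stand-in callable for the single-token LLM judgment, the embeddings of documents judged relevant are combined by a plain average, and falling back to the original query embedding when nothing is judged relevant is an assumed choice. All names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def feedback_query_vector(query_vec, candidates, doc_vecs, is_relevant):
    """Average the embeddings of the candidates that the judge marks relevant."""
    chosen = [doc_vecs[d] for d in candidates if is_relevant(d)]
    if not chosen:
        return query_vec  # assumed fallback: keep the original query embedding
    dim = len(chosen[0])
    return [sum(v[i] for v in chosen) / len(chosen) for i in range(dim)]


def search(vec, doc_vecs, k):
    """Brute-force nearest-neighbor search by cosine similarity."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


doc_vecs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}
query_vec = [0.5, 0.5]
candidates = ["d1", "d3"]  # e.g. from a first-pass retriever


def judge(doc_id):
    """Stand-in for a yes/no LLM relevance judgment."""
    return doc_id == "d1"


vec = feedback_query_vector(query_vec, candidates, doc_vecs, judge)
top = search(vec, doc_vecs, k=2)
```

The search vector is built from real document embeddings, so the judge only has to decide relevance and never has to generate domain-specific text.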

cross LongReward: Improving Long-context Large Language Models with AI Feedback

Authors: Jiajie Zhang, Zhongni Hou, Xin Lv, Shulin Cao, Zhenyu Hou, Yilin Niu, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li

Abstract: Though significant advancements have been achieved in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often affects the long-context performance of SFT models and leads to inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models' capacities. However, how to obtain reliable rewards in long-context scenarios remains unexplored. To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses from four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. By combining LongReward and offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments indicate that LongReward not only significantly improves models' long-context performance but also enhances their ability to follow short instructions. We also find that long-context DPO with LongReward and conventional short-context DPO can be used together without hurting either one's performance.

cross One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

Authors: Zhendong Wang, Zhaoshuo Li, Ajay Mandlekar, Zhenjia Xu, Jiaojiao Fan, Yashraj Narang, Linxi Fan, Yuke Zhu, Yogesh Balaji, Mingyuan Zhou, Ming-Yu Liu, Yu Zeng

Abstract: Diffusion models, praised for their success in generative tasks, are increasingly being applied to robotics, demonstrating exceptional performance in behavior cloning. However, their slow generation process stemming from iterative denoising steps poses a challenge for real-time applications in resource-constrained robotics setups and dynamically changing environments. In this paper, we introduce the One-Step Diffusion Policy (OneDP), a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator, significantly accelerating response times for robotic control tasks. We ensure the distilled generator closely aligns with the original policy distribution by minimizing the Kullback-Leibler (KL) divergence along the diffusion chain, requiring only $2\%$-$10\%$ additional pre-training cost for convergence. We evaluated OneDP on 6 challenging simulation tasks as well as 4 self-designed real-world tasks using the Franka robot. The results demonstrate that OneDP not only achieves state-of-the-art success rates but also delivers an order-of-magnitude improvement in inference speed, boosting action prediction frequency from 1.5 Hz to 62 Hz, establishing its potential for dynamic and computationally constrained robotic applications. We share the project page at https://research.nvidia.com/labs/dir/onedp/.

URLs: https://research.nvidia.com/labs/dir/onedp/

cross Quantum computing and persistence in topological data analysis

Authors: Casper Gyurik, Alexander Schmidhuber, Robbie King, Vedran Dunjko, Ryu Hayakawa

Abstract: Topological data analysis (TDA) aims to extract noise-robust features from a data set by examining the number and persistence of holes in its topology. We show that a computational problem closely related to a core task in TDA -- determining whether a given hole persists across different length scales -- is $\mathsf{BQP}_1$-hard and contained in $\mathsf{BQP}$. This result implies an exponential quantum speedup for this problem under standard complexity-theoretic assumptions. Our approach relies on encoding the persistence of a hole in a variant of the guided sparse Hamiltonian problem, where the guiding state is constructed from a harmonic representative of the hole.

cross Adaptive Transfer Clustering: A Unified Framework

Authors: Yuqi Gu, Zhongyuan Lyu, Kaizheng Wang

Abstract: We propose a general transfer learning framework for clustering given a main dataset and an auxiliary one about the same subjects. The two datasets may reflect similar but different latent grouping structures of the subjects. We propose an adaptive transfer clustering (ATC) algorithm that automatically leverages the commonality in the presence of unknown discrepancy, by optimizing an estimated bias-variance decomposition. It applies to a broad class of statistical models including Gaussian mixture models, stochastic block models, and latent class models. A theoretical analysis proves the optimality of ATC under the Gaussian mixture model and explicitly quantifies the benefit of transfer. Extensive simulations and real data experiments confirm our method's effectiveness in various scenarios.

cross GPT-4o System Card

Authors: OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Conneau, Ali Kamali, Allan Jabri, Allison Moyer, Allison Tam, Amadou Crookes, Amin Tootoochian, Amin Tootoonchian, Ananya Kumar, Andrea Vallone, Andrej Karpathy, Andrew Braunstein, Andrew Cann, Andrew Codispoti, Andrew Galu, Andrew Kondrich, Andrew Tulloch, Andrey Mishchenko, Angela Baek, Angela Jiang, Antoine Pelisse, Antonia Woodford, Anuj Gosalia, Arka Dhar, Ashley Pantuliano, Avi Nayak, Avital Oliver, Barret Zoph, Behrooz Ghorbani, Ben Leimberger, Ben Rossen, Ben Sokolowsky, Ben Wang, Benjamin Zweig, Beth Hoover, Blake Samic, Bob McGrew, Bobby Spero, Bogo Giertler, Bowen Cheng, Brad Lightcap, Brandon Walkin, Brendan Quinn, Brian Guarraci, Brian Hsu, Bright Kellogg, Brydon Eastman, Camillo Lugaresi, Carroll Wainwright, Cary Bassin, Cary Hudson, Casey Chu, Chad Nelson, Chak Li, Chan Jun Shern, Channing Conger, Charlotte Barette, Chelsea Voss, Chen Ding, Cheng Lu, Chong Zhang, Chris Beaumont, Chris Hallacy, Chris Koch, Christian Gibson, Christina Kim, Christine Choi,
Christine McLeavey, Christopher Hesse, Claudia Fischer, Clemens Winter, Coley Czarnecki, Colin Jarvis, Colin Wei, Constantin Koumouzelis, Dane Sherburn, Daniel Kappler, Daniel Levin, Daniel Levy, David Carr, David Farhi, David Mely, David Robinson, David Sasaki, Denny Jin, Dev Valladares, Dimitris Tsipras, Doug Li, Duc Phong Nguyen, Duncan Findlay, Edede Oiwoh, Edmund Wong, Ehsan Asdar, Elizabeth Proehl, Elizabeth Yang, Eric Antonow, Eric Kramer, Eric Peterson, Eric Sigler, Eric Wallace, Eugene Brevdo, Evan Mays, Farzad Khorasani, Felipe Petroski Such, Filippo Raso, Francis Zhang, Fred von Lohmann, Freddie Sulit, Gabriel Goh, Gene Oden, Geoff Salmon, Giulio Starace, Greg Brockman, Hadi Salman, Haiming Bao, Haitang Hu, Hannah Wong, Haoyu Wang, Heather Schmidt, Heather Whitney, Heewoo Jun, Hendrik Kirchner, Henrique Ponde de Oliveira Pinto, Hongyu Ren, Huiwen Chang, Hyung Won Chung, Ian Kivlichan, Ian O'Connell, Ian Osband, Ian Silber, Ian Sohl, Ibrahim Okuyucu, Ikai Lan, Ilya Kostrikov, Ilya Sutskever, Ingmar Kanitscheider, Ishaan Gulrajani, Jacob Coxon, Jacob Menick, Jakub Pachocki, James Aung, James Betker, James Crooks, James Lennon, Jamie Kiros, Jan Leike, Jane Park, Jason Kwon, Jason Phang, Jason Teplitz, Jason Wei, Jason Wolfe, Jay Chen, Jeff Harris, Jenia Varavva, Jessica Gan Lee, Jessica Shieh, Ji Lin, Jiahui Yu 
(Tony), Jiayi Weng (Tony), Jie Tang (Tony), Jieqi Yu (Tony), Joanne Jang (Tony), Joaquin Quinonero Candela (Tony), Joe Beutler (Tony), Joe Landers (Tony), Joel Parish (Tony), Johannes Heidecke (Tony), John Schulman (Tony), Jonathan Lachman (Tony), Jonathan McKay (Tony), Jonathan Uesato (Tony), Jonathan Ward (Tony), Jong Wook Kim (Tony), Joost Huizinga (Tony), Jordan Sitkin (Tony), Jos Kraaijeveld (Tony), Josh Gross (Tony), Josh Kaplan (Tony), Josh Snyder (Tony), Joshua Achiam (Tony), Joy Jiao (Tony), Joyce Lee (Tony), Juntang Zhuang (Tony), Justyn Harriman (Tony), Kai Fricke (Tony), Kai Hayashi (Tony), Karan Singhal (Tony), Katy Shi (Tony), Kavin Karthik (Tony), Kayla Wood (Tony), Kendra Rimbach (Tony), Kenny Hsu (Tony), Kenny Nguyen (Tony), Keren Gu-Lemberg (Tony), Kevin Button (Tony), Kevin Liu (Tony), Kiel Howe (Tony), Krithika Muthukumar (Tony), Kyle Luther (Tony), Lama Ahmad (Tony), Larry Kai (Tony), Lauren Itow (Tony), Lauren Workman (Tony), Leher Pathak (Tony), Leo Chen (Tony), Li Jing (Tony), Lia Guy (Tony), Liam Fedus (Tony), Liang Zhou (Tony), Lien Mamitsuka (Tony), Lilian Weng (Tony), Lindsay McCallum (Tony), Lindsey Held (Tony), Long Ouyang (Tony), Louis Feuvrier (Tony), Lu Zhang (Tony), Lukas Kondraciuk (Tony), Lukasz Kaiser (Tony), Luke Hewitt (Tony), Luke Metz (Tony), Lyric Doshi (Tony), Mada Aflak (Tony), Maddie Simens (Tony), Madelaine Boyd (Tony), Madeleine Thompson (Tony), Marat Dukhan (Tony), Mark Chen (Tony), Mark Gray (Tony), Mark Hudnall (Tony), Marvin Zhang (Tony), Marwan Aljubeh (Tony), Mateusz Litwin (Tony), Matthew Zeng (Tony), Max Johnson (Tony), Maya Shetty (Tony), Mayank Gupta (Tony), Meghan Shah (Tony), Mehmet Yatbaz (Tony), Meng Jia Yang (Tony), Mengchao Zhong (Tony), Mia Glaese (Tony), Mianna Chen (Tony), Michael Janner (Tony), Michael Lampe (Tony), Michael Petrov (Tony), Michael Wu (Tony), Michele Wang (Tony), Michelle Fradin (Tony), Michelle Pokrass (Tony), Miguel Castro (Tony), Miguel Oom Temudo de Castro (Tony), Mikhail Pavlov 
(Tony), Miles Brundage (Tony), Miles Wang (Tony), Minal Khan (Tony), Mira Murati (Tony), Mo Bavarian (Tony), Molly Lin (Tony), Murat Yesildal (Tony), Nacho Soto (Tony), Natalia Gimelshein (Tony), Natalie Cone (Tony), Natalie Staudacher (Tony), Natalie Summers (Tony), Natan LaFontaine (Tony), Neil Chowdhury (Tony), Nick Ryder (Tony), Nick Stathas (Tony), Nick Turley (Tony), Nik Tezak (Tony), Niko Felix (Tony), Nithanth Kudige (Tony), Nitish Keskar (Tony), Noah Deutsch (Tony), Noel Bundick (Tony), Nora Puckett (Tony), Ofir Nachum (Tony), Ola Okelola (Tony), Oleg Boiko (Tony), Oleg Murk (Tony), Oliver Jaffe (Tony), Olivia Watkins (Tony), Olivier Godement (Tony), Owen Campbell-Moore (Tony), Patrick Chao (Tony), Paul McMillan (Tony), Pavel Belov (Tony), Peng Su (Tony), Peter Bak (Tony), Peter Bakkum (Tony), Peter Deng (Tony), Peter Dolan (Tony), Peter Hoeschele (Tony), Peter Welinder (Tony), Phil Tillet (Tony), Philip Pronin (Tony), Philippe Tillet (Tony), Prafulla Dhariwal (Tony), Qiming Yuan (Tony), Rachel Dias (Tony), Rachel Lim (Tony), Rahul Arora (Tony), Rajan Troll (Tony), Randall Lin (Tony), Rapha Gontijo Lopes (Tony), Raul Puri (Tony), Reah Miyara (Tony), Reimar Leike (Tony), Renaud Gaubert (Tony), Reza Zamani (Tony), Ricky Wang (Tony), Rob Donnelly (Tony), Rob Honsby (Tony), Rocky Smith (Tony), Rohan Sahai (Tony), Rohit Ramchandani (Tony), Romain Huet (Tony), Rory Carmichael (Tony), Rowan Zellers (Tony), Roy Chen (Tony), Ruby Chen (Tony), Ruslan Nigmatullin (Tony), Ryan Cheu (Tony), Saachi Jain (Tony), Sam Altman (Tony), Sam Schoenholz (Tony), Sam Toizer (Tony), Samuel Miserendino (Tony), Sandhini Agarwal (Tony), Sara Culver (Tony), Scott Ethersmith (Tony), Scott Gray (Tony), Sean Grove (Tony), Sean Metzger (Tony), Shamez Hermani (Tony), Shantanu Jain (Tony), Shengjia Zhao (Tony), Sherwin Wu (Tony), Shino Jomoto (Tony), Shirong Wu (Tony), Shuaiqi (Tony), Xia, Sonia Phene, Spencer Papay, Srinivas Narayanan, Steve Coffey, Steve Lee, Stewart Hall, Suchir Balaji, 
Tal Broda, Tal Stramer, Tao Xu, Tarun Gogineni, Taya Christianson, Ted Sanders, Tejal Patwardhan, Thomas Cunninghman, Thomas Degry, Thomas Dimson, Thomas Raoux, Thomas Shadwell, Tianhao Zheng, Todd Underwood, Todor Markov, Toki Sherbakov, Tom Rubin, Tom Stasi, Tomer Kaftan, Tristan Heywood, Troy Peterson, Tyce Walters, Tyna Eloundou, Valerie Qi, Veit Moeller, Vinnie Monaco, Vishal Kuo, Vlad Fomenko, Wayne Chang, Weiyi Zheng, Wenda Zhou, Wesam Manassra, Will Sheu, Wojciech Zaremba, Yash Patil, Yilei Qian, Yongjik Kim, Youlong Cheng, Yu Zhang, Yuchen He, Yuchen Zhang, Yujia Jin, Yunxing Dai, Yury Malkov

Abstract: GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

replace Bandits with Mean Bounds

Authors: Nihal Sharma, Soumya Basu, Karthikeyan Shanmugam, Sanjay Shakkottai

Abstract: We study a variant of the bandit problem where side information in the form of bounds on the mean of each arm is provided. We prove that these translate to tighter estimates of subgaussian factors and develop novel algorithms that exploit these estimates. In the linear setting, we present the Restricted-set OFUL (R-OFUL) algorithm that additionally uses the geometric properties of the problem to (potentially) restrict the set of arms being played and reduce exploration rates for suboptimal arms. In the stochastic case, we propose the non-optimistic Global Under-Explore (GLUE) algorithm which employs the inferred subgaussian estimates to adapt the rate of exploration for the arms. We analyze the regret of R-OFUL and GLUE, showing that our regret upper bounds are never worse than that of the standard OFUL and UCB algorithms respectively. Further, we also consider a practically motivated setting of learning from confounded logs where mean bounds appear naturally.
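To make the role of mean bounds concrete, the following is a minimal sketch of how such side information could be used in a plain UCB1 loop on Bernoulli arms. It is not the R-OFUL or GLUE algorithm from the paper: the function name `clipped_ucb`, the pruning rule, and the clipping of the index to the known upper bound are illustrative assumptions only.

```python
import math
import random


def clipped_ucb(means, lower, upper, horizon, seed=0):
    """UCB1 on Bernoulli arms with known bounds lower[i] <= mean[i] <= upper[i].

    Hypothetical illustration, not the paper's method.
    Returns the number of pulls per arm.
    """
    rng = random.Random(seed)
    k = len(means)
    # An arm whose upper bound is below the largest lower bound cannot be optimal.
    best_lower = max(lower)
    active = [i for i in range(k) if upper[i] >= best_lower]
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        untried = [i for i in active if counts[i] == 0]
        if untried:
            arm = untried[0]  # pull every surviving arm once first
        else:
            def index(i):
                bonus = math.sqrt(2.0 * math.log(t) / counts[i])
                # The optimistic index never needs to exceed the known upper bound.
                return min(upper[i], sums[i] / counts[i] + bonus)
            arm = max(active, key=index)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


pulls = clipped_ucb(
    means=[0.2, 0.5, 0.8],
    lower=[0.0, 0.4, 0.6],
    upper=[0.3, 0.7, 1.0],
    horizon=500,
)
print(pulls)
```

In this toy instance the first arm is never played, because its upper bound (0.3) lies below the lower bound of the third arm (0.6).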

replace Bandits with Stochastic Experts: Constant Regret, Empirical Experts and Episodes

Authors: Nihal Sharma, Rajat Sen, Soumya Basu, Karthikeyan Shanmugam, Sanjay Shakkottai

Abstract: We study a variant of the contextual bandit problem where an agent can intervene through a set of stochastic expert policies. Given a fixed context, each expert samples actions from a fixed conditional distribution. The agent seeks to remain competitive with the 'best' among the given set of experts. We propose the Divergence-based Upper Confidence Bound (D-UCB) algorithm that uses importance sampling to share information across experts and provide horizon-independent constant regret bounds that only scale linearly in the number of experts. We also provide the Empirical D-UCB (ED-UCB) algorithm that can function with only approximate knowledge of expert distributions. Further, we investigate the episodic setting where the agent interacts with an environment that changes over episodes. Each episode can have different context and reward distributions resulting in the best expert changing across episodes. We show that by bootstrapping from $\mathcal{O}\left(N\log\left(NT^2\sqrt{E}\right)\right)$ samples, ED-UCB guarantees a regret that scales as $\mathcal{O}\left(E(N+1) + \frac{N\sqrt{E}}{T^2}\right)$ for $N$ experts over $E$ episodes, each of length $T$. We finally empirically validate our findings through simulations.

replace Deep Reinforcement Learning for Demand Driven Services in Logistics and Transportation Systems: A Survey

Authors: Zefang Zong, Jingwei Wang, Tao Feng, Tong Xia, Depeng Jin, Yong Li

Abstract: Recent technological developments have brought a boom of new Demand-Driven Services (DDS) into urban life, including ridesharing, on-demand delivery, express systems and warehousing. In DDS, a service loop is an elemental structure, comprising its service worker, the service providers and the corresponding service targets. The service workers transport either people or parcels from the providers to the target locations. Various planning tasks within DDS can thus be classified into two individual stages: 1) Dispatching, which forms service loops from demand/supply distributions, and 2) Routing, which decides specific serving orders within the constructed loops. Generating high-quality strategies in both stages is important for developing DDS but faces several challenges. Meanwhile, deep reinforcement learning (DRL) has developed rapidly in recent years. It is a powerful tool for solving these problems, since DRL can learn a parametric model without relying on too many problem-based assumptions and can optimize long-term effects by learning sequential decisions. In this survey, we first define DDS, then highlight common applications and important decision/control problems within them. For each problem, we comprehensively introduce the existing DRL solutions. We also introduce open simulation environments for the development and evaluation of DDS applications. Finally, we analyze remaining challenges and discuss further research opportunities in DRL solutions for DDS.

replace Neural networks with trainable matrix activation functions

Authors: Zhengqi Liu, Shuhao Cao, Yuwen Li, Ludmil Zikatanov

Abstract: The training process of neural networks usually optimizes the weights and bias parameters of linear transformations, while nonlinear activation functions are pre-specified and fixed. This work develops a systematic approach to constructing matrix-valued activation functions whose entries are generalized from ReLU. The activation is based on matrix-vector multiplications using only scalar multiplications and comparisons. The proposed activation functions depend on parameters that are trained along with the weights and bias vectors. Neural networks based on this approach are simple and efficient and are shown to be robust in numerical experiments.

replace Graph Neural Networks with Feature and Structure Aware Random Walk

Authors: Wei Zhuo, Guang Tan

Abstract: Graph Neural Networks (GNNs) have received increasing attention for representation learning in various machine learning tasks. However, most existing GNNs that apply neighborhood aggregation perform poorly on graphs with heterophily, where adjacent nodes belong to different classes. In this paper, we show that in typical heterophilous graphs, the edges may be directed, and whether to treat the edges as is or simply make them undirected greatly affects the performance of the GNN models. Furthermore, due to the limitation of heterophily, it is highly beneficial for the nodes to aggregate messages from similar nodes beyond the local neighborhood. These observations motivate us to develop a model that adaptively learns the directionality of the graph and exploits the underlying long-distance correlations between nodes. We first generalize the graph Laplacian to digraphs based on the proposed Feature-Aware PageRank algorithm, which simultaneously considers the graph directionality and long-distance feature similarity between nodes. The digraph Laplacian then defines a graph propagation matrix that leads to a model called {\em DiglacianGCN}. Based on this, we further leverage the node proximity measured by commute times between nodes, in order to preserve the nodes' long-distance correlation on the topology level. Extensive experiments on ten datasets with different levels of homophily demonstrate the effectiveness of our method over existing solutions in the task of node classification.

replace Adversarial robustness of VAEs through the lens of local geometry

Authors: Asif Khan, Amos Storkey

Abstract: In an unsupervised attack on variational autoencoders (VAEs), an adversary finds a small perturbation in an input sample that significantly changes its latent space encoding, thereby compromising the reconstruction for a fixed decoder. A known reason for such vulnerability is the distortions in the latent space resulting from a mismatch between approximated latent posterior and a prior distribution. Consequently, a slight change in an input sample can move its encoding to a low/zero density region in the latent space resulting in an unconstrained generation. This paper demonstrates that an optimal way for an adversary to attack VAEs is to exploit a directional bias of a stochastic pullback metric tensor induced by the encoder and decoder networks. The pullback metric tensor of an encoder measures the change in infinitesimal latent volume from an input to a latent space. Thus, it can be viewed as a lens to analyse the effect of input perturbations leading to latent space distortions. We propose robustness evaluation scores using the eigenspectrum of a pullback metric tensor. Moreover, we empirically show that the scores correlate with the robustness parameter $\beta$ of the $\beta-$VAE. Since increasing $\beta$ also degrades reconstruction quality, we demonstrate a simple alternative using \textit{mixup} training to fill the empty regions in the latent space, thus improving robustness with improved reconstruction.

replace Vanilla Feedforward Neural Networks as a Discretization of Dynamical Systems

Authors: Yifei Duan, Li'ang Li, Guanghua Ji, Yongqiang Cai

Abstract: Deep learning has found significant applications in data science and natural science. Some studies have linked deep neural networks to dynamic systems, but the network structure is restricted to the residual network. It is known that residual networks can be regarded as a numerical discretization of dynamic systems. In this paper, we return to the classical network structure and prove that vanilla feedforward networks can also be a numerical discretization of dynamic systems, where the width of the network is equal to the dimension of the input and output. Our proof is based on the properties of the leaky-ReLU function and the numerical technique of the splitting method for solving differential equations. Our results could provide a new perspective for understanding the approximation properties of feedforward neural networks.
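The background fact cited in the abstract, that a residual network can be read as a numerical discretization of a dynamical system, can be sketched as follows: each residual block $x_{k+1} = x_k + h\,f(x_k)$ is one forward Euler step for $\dot{x} = f(x)$. The function and parameter names below are hypothetical, and the sketch does not reproduce the paper's construction for vanilla feedforward networks.

```python
import numpy as np


def euler_residual_flow(x0, velocity, step, num_layers):
    """Apply num_layers residual updates x <- x + step * velocity(x).

    Each update is one forward Euler step for the ODE dx/dt = velocity(x).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_layers):
        x = x + step * velocity(x)  # residual block = identity + scaled update
    return x


# Toy check: dx/dt = -x with x(0) = 1 has the exact solution x(1) = exp(-1).
approx = euler_residual_flow([1.0], lambda x: -x, step=1e-3, num_layers=1000)
print(approx[0], np.exp(-1.0))
```

With more layers and a proportionally smaller step, the output of the stacked residual updates approaches the exact flow of the ODE.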

replace Mitigating Data Absence in Federated Learning Using Privacy-Controllable Data Digests

Authors: Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

Abstract: The absence of training data and their distribution changes in federated learning (FL) can significantly undermine model performance, especially in cross-silo scenarios. To address this challenge, we introduce the Federated Learning with Data Digest (FedDig) framework. FedDig manages unexpected distribution changes using a novel privacy-controllable data digest representation. This framework allows FL users to adjust the protection levels of the digest by manipulating hyperparameters that control the mixing of multiple low-dimensional features and applying differential privacy perturbation to these mixed features. Evaluation of FedDig across four diverse public datasets shows that it consistently outperforms five baseline algorithms by substantial margins in various data absence scenarios. We also thoroughly explored FedDig's hyperparameters, demonstrating its adaptability. Notably, the FedDig plugin design is inherently extensible and compatible with existing FL algorithms.

replace Label Attention Network for Temporal Sets Prediction: You Were Looking at a Wrong Self-Attention

Authors: Elizaveta Kovtun, Galina Boeva, Andrey Shulga, Alexey Zaytsev

Abstract: Most user-related data can be represented as a sequence of events associated with a timestamp and a collection of categorical labels. For example, the purchased basket of goods and the time of buying fully characterize the event of the store visit. Anticipating the label set for a future event, known as the problem of temporal sets prediction, holds significant value, especially in such high-stakes industries as finance and e-commerce. A fundamental challenge of this task is the joint consideration of the temporal nature of events and label relations within sets. Existing models fail to capture complex time and label dependencies because they represent historical information ineffectively at the input stage. We aim to address this shortcoming by presenting a framework with a specific way to aggregate the observed information into time- and set structure-aware views prior to transferring it into the main architecture blocks. Our strong emphasis on input arrangement facilitates the subsequent efficient learning of label interactions. The proposed model is called Label-Attention NETwork, or LANET. We conducted experiments on four different datasets and made a comparison with four established models, including SOTA, in this area. The experimental results suggest that LANET provides significantly better quality than any other model, achieving an improvement of up to $65 \%$ in terms of the weighted F1 metric compared to the closest competitor. Moreover, we examine causal relationships between labels and provide a thorough study of the influence of LANET's components on performance. We provide an implementation of LANET to encourage its wider usage.

replace GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models

Authors: Zaitang Li, Pin-Yu Chen, Tsung-Yi Ho

Abstract: Current studies on adversarial robustness mainly focus on aggregating local robustness results from a set of data samples to evaluate and rank different models. However, the local statistics may not represent well the true global robustness of the underlying unknown data distribution. To address this challenge, this paper makes the first attempt to present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models. Formally, GREAT Score carries the physical meaning of a global statistic capturing a mean certified attack-proof perturbation level over all samples drawn from a generative model. For finite-sample evaluation, we also derive a probabilistic guarantee on the sample complexity and the difference between the sample mean and the true mean. GREAT Score has several advantages: (1) Robustness evaluations using GREAT Score are efficient and scalable to large models, since they avoid the need to run adversarial attacks. In particular, we show high correlation and significantly reduced computation cost of GREAT Score when compared to the attack-based model ranking on RobustBench (Croce et al., 2021). (2) The use of generative models facilitates the approximation of the unknown data distribution. In our ablation study with different generative adversarial networks (GANs), we observe consistency between global robustness evaluation and the quality of GANs. (3) GREAT Score can be used for remote auditing of privacy-sensitive black-box models, as demonstrated by our robustness evaluation on several online facial recognition services.

replace Uncertainty Voting Ensemble for Imbalanced Deep Regression

Authors: Yuchang Jiang, Vivien Sainte Fare Garnot, Konrad Schindler, Jan Dirk Wegner

Abstract: Data imbalance is ubiquitous when applying machine learning to real-world problems, particularly regression problems. If training data are imbalanced, the learning is dominated by the densely covered regions of the target distribution and the learned regressor tends to exhibit poor performance in sparsely covered regions. Beyond standard measures like oversampling or reweighting, there are two main approaches to handling learning from imbalanced data. For regression, recent work leverages the continuity of the distribution, while for classification, the trend has been to use ensemble methods, allowing some members to specialize in predictions for sparser regions. In our method, named UVOTE, we integrate recent advances in probabilistic deep learning with an ensemble approach for imbalanced regression. We replace traditional regression losses with negative log-likelihood, which also predicts sample-wise aleatoric uncertainty. Our experiments show that this loss function handles imbalance better. Additionally, we use the predicted aleatoric uncertainty values to fuse the predictions of different expert models in the ensemble, eliminating the need for a separate aggregation module. We compare our method with existing alternatives on multiple public benchmarks and show that UVOTE consistently outperforms the prior art, while at the same time producing better-calibrated uncertainty estimates. Our code is available at https://github.com/SherryJYC/UVOTE.

URLs: https://github.com/SherryJYC/UVOTE.
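The two ingredients named in the UVOTE abstract, a negative log-likelihood loss that also predicts a per-sample variance and uncertainty-based fusion of expert predictions, can be sketched as below. This is a hedged illustration: the Gaussian form of the likelihood, the function names, and the rule of keeping the lowest-variance expert per sample are assumptions, since the abstract does not spell out these details. The linked repository is the authoritative implementation.

```python
import numpy as np


def gaussian_nll(y, mu, log_var):
    """Negative log-likelihood of y under N(mu, exp(log_var)), dropping the constant term."""
    y, mu, log_var = (np.asarray(a, dtype=float) for a in (y, mu, log_var))
    return 0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var))


def fuse_by_min_uncertainty(mus, log_vars):
    """For each sample, keep the prediction of the expert with the smallest predicted variance.

    mus and log_vars have shape (num_experts, num_samples). Assumed fusion rule.
    """
    mus = np.asarray(mus, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    winner = np.argmin(log_vars, axis=0)  # most confident expert per sample
    return mus[winner, np.arange(mus.shape[1])]


# Two hypothetical experts, two samples: each expert is more confident on one sample.
fused = fuse_by_min_uncertainty(
    mus=[[1.0, 2.0], [3.0, 4.0]],
    log_vars=[[0.0, 1.0], [1.0, 0.0]],
)
print(fused)
```

Because the fusion only reads the predicted variances, no separate aggregation module has to be trained in this sketch.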

replace C-MCTS: Safe Planning with Monte Carlo Tree Search

Authors: Dinesh Parthasarathy, Georgios Kontes, Axel Plinge, Christopher Mutschler

Abstract: The Constrained Markov Decision Process (CMDP) formulation makes it possible to solve safety-critical decision making tasks that are subject to constraints. While CMDPs have been extensively studied in the Reinforcement Learning literature, little attention has been given to sampling-based planning algorithms such as MCTS for solving them. Previous approaches behave conservatively with respect to costs, as they avoid constraint violations by using Monte Carlo cost estimates that suffer from high variance. We propose Constrained MCTS (C-MCTS), which estimates cost using a safety critic that is trained with Temporal Difference learning in an offline phase prior to agent deployment. The critic limits exploration by pruning unsafe trajectories within MCTS during deployment. C-MCTS satisfies cost constraints but operates closer to the constraint boundary, achieving higher rewards than previous work. As a byproduct, the planner is more efficient w.r.t. planning steps. Most importantly, under model mismatch between the planner and the real world, C-MCTS is less susceptible to cost violations than previous work.

replace Dynamic Bayesian Networks for Predicting Cryptocurrency Price Directions: Uncovering Causal Relationships

Authors: Rasoul Amirzadeh, Dhananjay Thiruvady, Asef Nazari, Mong Shan Ee

Abstract: Cryptocurrencies have gained popularity across various sectors, especially in finance and investment. Despite their growing popularity, cryptocurrencies can be a high-risk investment due to their price volatility. The inherent volatility in cryptocurrency prices, coupled with the effects of external global economic factors, makes predicting their price movements challenging. To address this challenge, we propose a dynamic Bayesian network (DBN)-based approach to uncover potential causal relationships among various features including social media data, traditional financial market factors, and technical indicators. Six popular cryptocurrencies, Bitcoin, Binance Coin, Ethereum, Litecoin, Ripple, and Tether are studied in this work. The proposed model's performance is compared to five baseline models of auto-regressive integrated moving average, support vector regression, long short-term memory, random forests, and support vector machines. The results show that while DBN performance varies across cryptocurrencies, with some cryptocurrencies exhibiting higher predictive accuracy than others, the DBN significantly outperforms the baseline models.

replace DEDUCE: Multi-head attention decoupled contrastive learning to discover cancer subtypes based on multi-omics data

Authors: Liangrui Pan, Xiang Wang, Qingchun Liang, Jiandong Shang, Wenjuan Liu, Liwen Xu, Shaoliang Peng

Abstract: Background and Objective: Given the high heterogeneity and clinical diversity of cancer, substantial variations exist in multi-omics data and clinical features across different cancer subtypes. Methods: We propose a model, named DEDUCE, based on symmetric multi-head attention encoders (SMAE), for unsupervised contrastive learning to analyze multi-omics cancer data, with the aim of identifying and characterizing cancer subtypes. This model adopts an unsupervised SMAE that can deeply extract contextual features and long-range dependencies from multi-omics data, thereby mitigating the impact of noise. Importantly, DEDUCE introduces a subtype decoupled contrastive learning method based on a multi-head attention mechanism to simultaneously learn features from multi-omics data and perform clustering for identifying cancer subtypes. Subtypes are clustered by calculating the similarity between samples in both the feature space and sample space of multi-omics data. The fundamental concept involves decoupling various attributes of multi-omics data features and learning them as contrasting terms. A contrastive loss function is constructed to quantify the disparity between positive and negative examples, and the model minimizes this difference, thereby promoting the acquisition of enhanced feature representation. Results: The DEDUCE model undergoes extensive experiments on simulated multi-omics datasets, single-cell multi-omics datasets, and cancer multi-omics datasets, outperforming 10 deep learning models. The DEDUCE model outperforms state-of-the-art methods, and ablation experiments demonstrate the effectiveness of each module in the DEDUCE model. Finally, we applied the DEDUCE model to identify six cancer subtypes of AML.

replace Learning to Reach Goals via Diffusion

Authors: Vineet Jain, Siamak Ravanbakhsh

Abstract: We present a novel perspective on goal-conditioned reinforcement learning by framing it within the context of denoising diffusion models. Analogous to the diffusion process, where Gaussian noise is used to create random trajectories that walk away from the data manifold, we construct trajectories that move away from potential goal states. We then learn a goal-conditioned policy to reverse these deviations, analogous to the score function. This approach, which we call Merlin, can reach specified goals from arbitrary initial states without learning a separate value function. In contrast to recent works utilizing diffusion models in offline RL, Merlin stands out as the first method to perform diffusion in the state space, requiring only one ``denoising" iteration per environment step. We experimentally validate our approach in various offline goal-reaching tasks, demonstrating substantial performance enhancements compared to state-of-the-art methods while improving computational efficiency over other diffusion-based RL methods by an order of magnitude. Our results suggest that this perspective on diffusion for RL is a simple and scalable approach for sequential decision making.

replace Optimal Algorithms for Online Convex Optimization with Adversarial Constraints

Authors: Abhishek Sinha, Rahul Vaze

Abstract: A well-studied generalization of the standard online convex optimization (OCO) framework is constrained online convex optimization (COCO). In COCO, on every round, a convex cost function and a convex constraint function are revealed to the learner after it chooses the action for that round. The objective is to design an online learning policy that simultaneously achieves a small regret while ensuring a small cumulative constraint violation (CCV) against an adaptive adversary interacting over a horizon of length $T$. A long-standing open question in COCO is whether an online policy can simultaneously achieve $O(\sqrt{T})$ regret and $\tilde{O}(\sqrt{T})$ CCV without any restrictive assumptions. For the first time, we answer this in the affirmative and show that a simple first-order policy can simultaneously achieve these bounds. Furthermore, in the case of strongly convex cost and convex constraint functions, the regret guarantee can be improved to $O(\log T)$ while keeping the CCV bound the same as above. We establish these results by effectively combining adaptive OCO policies as a blackbox with Lyapunov optimization - a classic tool from control theory. Surprisingly, the analysis is short and elegant.

replace REBAR: Retrieval-Based Reconstruction for Time-series Contrastive Learning

Authors: Maxwell A. Xu, Alexander Moreno, Hui Wei, Benjamin M. Marlin, James M. Rehg

Abstract: The success of self-supervised contrastive learning hinges on identifying positive data pairs, such that when they are pushed together in embedding space, the space encodes useful information for subsequent downstream tasks. Constructing positive pairs is non-trivial as the pairing must be similar enough to reflect a shared semantic meaning, but different enough to capture within-class variation. Classical approaches in vision use augmentations to exploit well-established invariances to construct positive pairs, but invariances in the time-series domain are much less obvious. In our work, we propose a novel method of using a learned measure for identifying positive pairs. Our Retrieval-Based Reconstruction (REBAR) measure quantifies the similarity between two sequences as the reconstruction error that results from reconstructing one sequence with retrieved information from the other. Then, if the two sequences have high REBAR similarity, we label them as a positive pair. Through validation experiments, we show that the REBAR error is a predictor of mutual class membership. Once integrated into a contrastive learning framework, our REBAR method learns an embedding that achieves state-of-the-art performance on downstream tasks across various modalities.

replace Brain Networks and Intelligence: A Graph Neural Network Based Approach to Resting State fMRI Data

Authors: Bishal Thapaliya, Esra Akbas, Jiayu Chen, Raam Sapkota, Bhaskar Ray, Pranav Suresh, Vince Calhoun, Jingyu Liu

Abstract: Resting-state functional magnetic resonance imaging (rsfMRI) is a powerful tool for investigating the relationship between brain function and cognitive processes as it allows for the functional organization of the brain to be captured without relying on a specific task or stimuli. In this paper, we present a novel modeling architecture called BrainRGIN for predicting intelligence (fluid, crystallized, and total intelligence) using graph neural networks on rsfMRI-derived static functional network connectivity matrices. Extending from the existing graph convolution networks, our approach incorporates a clustering-based embedding and graph isomorphism network in the graph convolutional layer to reflect the nature of the brain sub-network organization and efficient network expression, in combination with TopK pooling and attention-based readout functions. We evaluated our proposed architecture on a large dataset, specifically the Adolescent Brain Cognitive Development Dataset, and demonstrated its effectiveness in predicting individual differences in intelligence. Our model achieved lower mean squared errors and higher correlation scores than existing relevant graph architectures and other traditional machine learning models for all of the intelligence prediction tasks. The middle frontal gyrus exhibited a significant contribution to both fluid and crystallized intelligence, suggesting its pivotal role in these cognitive processes. Total composite scores identified a diverse set of brain regions to be relevant, which underscores the complex nature of total intelligence.

replace HEALNet: Multimodal Fusion for Heterogeneous Biomedical Data

Authors: Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik

Abstract: Technological advances in medical data collection, such as high-throughput genomic sequencing and digital high-resolution histopathology, have contributed to the rising requirement for multimodal biomedical modelling, specifically for image, tabular and graph data. Most multimodal deep learning approaches use modality-specific architectures that are often trained separately and cannot capture the crucial cross-modal information that motivates the integration of different data sources. This paper presents the Hybrid Early-fusion Attention Learning Network (HEALNet): a flexible multimodal fusion architecture, which a) preserves modality-specific structural information, b) captures the cross-modal interactions and structural information in a shared latent space, c) can effectively handle missing modalities during training and inference, and d) enables intuitive model inspection by learning on the raw data input instead of opaque embeddings. We conduct multimodal survival analysis on Whole Slide Images and Multi-omic data on four cancer datasets from The Cancer Genome Atlas (TCGA). HEALNet achieves state-of-the-art performance compared to other end-to-end trained fusion models, substantially improving over unimodal and multimodal baselines whilst being robust in scenarios with missing modalities.

replace Multiscale Hodge Scattering Networks for Data Analysis

Authors: Naoki Saito, Stefan C. Schonsheck, Eugene Shvarts

Abstract: We propose new scattering networks for signals measured on simplicial complexes, which we call \emph{Multiscale Hodge Scattering Networks} (MHSNs). Our construction is based on multiscale basis dictionaries on simplicial complexes, i.e., the $\kappa$-GHWT and $\kappa$-HGLET, which we recently developed for simplices of dimension $\kappa \in \mathbb{N}$ in a given simplicial complex by generalizing the node-based Generalized Haar-Walsh Transform (GHWT) and Hierarchical Graph Laplacian Eigen Transform (HGLET). The $\kappa$-GHWT and the $\kappa$-HGLET both form redundant sets (i.e., dictionaries) of multiscale basis vectors and the corresponding expansion coefficients of a given signal. Our MHSNs use a layered structure analogous to a convolutional neural network (CNN) to cascade the moments of the modulus of the dictionary coefficients. The resulting features are invariant to reordering of the simplices (i.e., node permutation of the underlying graphs). Importantly, the use of multiscale basis dictionaries in our MHSNs admits a natural pooling operation that is akin to local pooling in CNNs, and which may be performed either locally or per-scale. These pooling operations are harder to define in both traditional scattering networks based on Morlet wavelets, and geometric scattering networks based on Diffusion Wavelets. As a result, we are able to extract a rich set of descriptive yet robust features that can be used along with very simple machine learning methods (i.e., logistic regression or support vector machines) to achieve high-accuracy classification systems with far fewer parameters to train than most modern graph neural networks. Finally, we demonstrate the usefulness of our MHSNs in three distinct types of problems: signal classification, domain (i.e., graph/simplex) classification, and molecular dynamics prediction.

replace Continuous Management of Machine Learning-Based Application Behavior

Authors: Marco Anisetti, Claudio A. Ardagna, Nicola Bena, Ernesto Damiani, Paolo G. Panero

Abstract: Modern applications are increasingly driven by Machine Learning (ML) models whose non-deterministic behavior is affecting the entire application life cycle from design to operation. The pervasive adoption of ML is urgently calling for approaches that guarantee a stable non-functional behavior of ML-based applications over time and across model changes. To this aim, non-functional properties of ML models, such as privacy, confidentiality, fairness, and explainability, must be monitored, verified, and maintained. Existing approaches mostly focus on i) implementing solutions for classifier selection according to the functional behavior of ML models, ii) finding new algorithmic solutions, such as continuous re-training. In this paper, we propose a multi-model approach that aims to guarantee a stable non-functional behavior of ML-based applications. An architectural and methodological approach is provided to compare multiple ML models showing similar non-functional properties and select the model supporting stable non-functional behavior over time according to (dynamic and unpredictable) contextual changes. Our approach goes beyond the state of the art by providing a solution that continuously guarantees a stable non-functional behavior of ML-based applications, is ML algorithm-agnostic, and is driven by non-functional properties assessed on the ML models themselves. It consists of a two-step process working during application operation, where model assessment verifies non-functional properties of ML models trained and selected at development time, and model substitution guarantees continuous and stable support of non-functional properties. We experimentally evaluate our solution in a real-world scenario focusing on non-functional property fairness.

replace GraphMETRO: Mitigating Complex Graph Distribution Shifts via Mixture of Aligned Experts

Authors: Shirley Wu, Kaidi Cao, Bruno Ribeiro, James Zou, Jure Leskovec

Abstract: Graph data are inherently complex and heterogeneous, leading to a high natural diversity of distributional shifts. However, it remains unclear how to build machine learning architectures that generalize to the complex distributional shifts naturally occurring in the real world. Here, we develop GraphMETRO, a Graph Neural Network architecture that models natural diversity and captures complex distributional shifts. GraphMETRO employs a Mixture-of-Experts (MoE) architecture with a gating model and multiple expert models, where each expert model targets a specific distributional shift to produce a referential representation w.r.t. a reference model, and the gating model identifies shift components. Additionally, we design a novel objective that aligns the representations from different expert models to ensure reliable optimization. GraphMETRO achieves state-of-the-art results on four datasets from the GOOD benchmark, which is comprised of complex and natural real-world distribution shifts, improving by 67% and 4.2% on the WebKB and Twitch datasets. Code and data are available at https://github.com/Wuyxin/GraphMETRO.

URLs: https://github.com/Wuyxin/GraphMETRO.

replace Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models

Authors: Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao

Abstract: The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however, bring forth substantial challenges in the high consumption of computational, memory, energy, and financial resources, especially in environments with limited resource capabilities. This survey aims to systematically address these challenges by reviewing a broad spectrum of techniques designed to enhance the resource efficiency of LLMs. We categorize methods based on their optimization focus (computational, memory, energy, financial, and network resources) and their applicability across various stages of an LLM's lifecycle, including architecture design, pretraining, finetuning, and system design. Additionally, the survey introduces a nuanced categorization of resource efficiency techniques by their specific resource types, which uncovers the intricate relationships and mappings between various resources and corresponding optimization techniques. A standardized set of evaluation metrics and datasets is also presented to facilitate consistent and fair comparisons across different models and techniques. By offering a comprehensive overview of the current state of the art and identifying open research avenues, this survey serves as a foundational reference for researchers and practitioners, aiding them in developing more sustainable and efficient LLMs in a rapidly evolving landscape.

replace Modular Learning of Deep Causal Generative Models for High-dimensional Causal Inference

Authors: Md Musfiqur Rahman, Murat Kocaoglu

Abstract: Sound and complete algorithms have been proposed to compute identifiable causal queries using the causal structure and data. However, most of these algorithms assume accurate estimation of the data distribution, which is impractical for high-dimensional variables such as images. On the other hand, modern deep generative architectures can be trained to sample from high-dimensional distributions. However, training these networks is typically very costly. Thus, it is desirable to leverage pre-trained models to answer causal queries using such high-dimensional data. To address this, we propose modular training of deep causal generative models that not only makes learning more efficient, but also allows us to utilize large, pre-trained conditional generative models. To the best of our knowledge, our algorithm, Modular-DCM, is the first algorithm that, given the causal structure, uses adversarial training to learn the network weights, and can make use of pre-trained models to provably sample from any identifiable causal query in the presence of latent confounders. With extensive experiments on the Colored-MNIST dataset, we demonstrate that our algorithm outperforms the baselines. We also show our algorithm's convergence on the COVIDx dataset and its utility with a causal invariant prediction problem on CelebA-HQ.

replace Denoising Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors

Authors: Top Piriyakulkij, Yingheng Wang, Volodymyr Kuleshov

Abstract: We propose denoising diffusion variational inference (DDVI), a black-box variational inference algorithm for latent variable models which relies on diffusion models as flexible approximate posteriors. Specifically, our method introduces an expressive class of diffusion-based variational posteriors that perform iterative refinement in latent space; we train these posteriors with a novel regularized evidence lower bound (ELBO) on the marginal likelihood inspired by the wake-sleep algorithm. Our method is easy to implement (it fits a regularized extension of the ELBO), is compatible with black-box variational inference, and outperforms alternative classes of approximate posteriors based on normalizing flows or adversarial networks. We find that DDVI improves inference and learning in deep latent variable models across common benchmarks as well as on a motivating task in biology -- inferring latent ancestry from human genomes -- where it outperforms strong baselines on the Thousand Genomes dataset.

replace Beyond Concept Bottleneck Models: How to Make Black Boxes Intervenable?

Authors: Sonia Laguna, Ri\v{c}ards Marcinkevi\v{c}s, Moritz Vandenhirtz, Julia E. Vogt

Abstract: Recently, interpretable machine learning has re-explored concept bottleneck models (CBM). An advantage of this model class is the user's ability to intervene on predicted concept values, affecting the downstream output. In this work, we introduce a method to perform such concept-based interventions on pretrained neural networks, which are not interpretable by design, only given a small validation set with concept labels. Furthermore, we formalise the notion of intervenability as a measure of the effectiveness of concept-based interventions and leverage this definition to fine-tune black boxes. Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks. We focus on backbone architectures of varying complexity, from simple, fully connected neural nets to Stable Diffusion. We demonstrate that the proposed fine-tuning improves intervention effectiveness and often yields better-calibrated predictions. To showcase the practical utility of our techniques, we apply them to deep chest X-ray classifiers and show that fine-tuned black boxes are more intervenable than CBMs. Lastly, we establish that our methods are still effective under vision-language-model-based concept annotations, alleviating the need for a human-annotated validation set.

replace KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Authors: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

Abstract: LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision. Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; and (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges. By applying our method to the LLaMA, Llama-2, Llama-3, and Mistral models, we achieve < 0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving LLaMA-7B with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system. We develop custom CUDA kernels for KVQuant, showing that we can achieve up to ~1.7x speedups, compared to baseline fp16 matrix-vector multiplications, for the LLaMA-7B model.
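
To make the per-channel idea in (i) concrete, the following is a minimal sketch comparing per-token and per-channel quantization of a toy Key matrix. It is not the KVQuant implementation: it uses plain uniform quantization rather than the non-uniform datatypes the abstract describes, and the helper names, the 3-bit setting, and the toy matrix (with one large-magnitude channel, an assumption about why the channel axis might fit the distribution better) are all hypothetical.

```python
def quantize_dequantize(values, bits=3):
    """Uniformly quantize one list of floats to 2**bits levels, then dequantize."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        return list(values)  # constant input: nothing to quantize
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def per_token(keys, bits=3):
    """One quantization range per row (token)."""
    return [quantize_dequantize(row, bits) for row in keys]


def per_channel(keys, bits=3):
    """One quantization range per column (channel)."""
    cols = [quantize_dequantize(list(col), bits) for col in zip(*keys)]
    return [list(row) for row in zip(*cols)]


def mean_abs_error(original, approx):
    """Mean absolute elementwise difference between two matrices."""
    diffs = [abs(a - b) for ra, rb in zip(original, approx) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy Key activations: rows are tokens, columns are channels.
# Channel 2 has a much larger magnitude than the others (made-up data).
keys = [
    [0.10, -0.20, 8.0, 0.05],
    [0.30, 0.10, 7.5, -0.15],
    [-0.25, 0.20, 8.4, 0.00],
    [0.15, -0.10, 7.9, 0.25],
]

err_token = mean_abs_error(keys, per_token(keys))
err_channel = mean_abs_error(keys, per_channel(keys))
print(f"per-token error:   {err_token:.4f}")
print(f"per-channel error: {err_channel:.4f}")
```

On this toy input the per-channel ranges are much tighter, because the large channel no longer stretches the range shared by the small-valued entries in each row.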

replace Reinforcement Learning from Bagged Reward

Authors: Yuting Tang, Xin-Qiang Cai, Yao-Xiang Ding, Qiyu Wu, Guoqing Liu, Masashi Sugiyama

Abstract: In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent, helping the agent maximize cumulative rewards to obtain the optimal policy. However, in many real-world scenarios, designing immediate reward signals is difficult; instead, agents receive a single reward that is contingent upon a partial sequence or a complete trajectory. In this work, we define this challenging problem as RL from Bagged Reward (RLBR), where sequences of data are treated as bags with non-Markovian bagged rewards, leading to the formulation of Bagged Reward Markov Decision Processes (BRMDPs). Theoretically, we demonstrate that RLBR can be addressed by solving a standard MDP with properly redistributed bagged rewards allocated to each instance within a bag. Empirically, we find that reward redistribution becomes more challenging as the bag length increases, due to reduced informational granularity. Existing reward redistribution methods are insufficient to address these challenges. Therefore, we propose a novel reward redistribution method equipped with a bidirectional attention mechanism, enabling the accurate interpretation of contextual nuances and temporal dependencies within each bag. We experimentally demonstrate that our proposed method consistently outperforms existing approaches.

replace L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Authors: Hyesung Jeon, Yulhwa Kim, Jae-joon Kim

Abstract: Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA), which reduce training costs, have gained significant popularity. This trend has spurred active research into quantization-aware PEFT techniques, aimed at maintaining model accuracy while minimizing memory overhead during both inference and training. Previous quantization-aware PEFT methods typically follow a two-step approach: first, applying post-training quantization (PTQ) to model weights, followed by PEFT on the quantized model. However, recovering from the quantization error introduced by PTQ through fine-tuning has proven challenging. Additionally, most PTQ-based PEFT methods result in a mixture of low-precision quantized weights and high-precision adapter weights, limiting the efficiency of full quantization during inference. While a previous method attempted to address these issues, it still suffers from limited adaptability due to the constrained LoRA parameter structure required to produce fully-quantized models. To overcome these challenges, we propose L4Q, a method that integrates Quantization-Aware Training (QAT) with LoRA to effectively reduce quantization error. By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead while producing fully-quantized weights, enabling effective adaptation to downstream tasks. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy compared to decoupled fine-tuning schemes, particularly in sub-4-bit quantization, positioning L4Q as an efficient QAT solution. Using the LLaMA model families and instructional datasets, we showcase L4Q's capabilities in language tasks and few-shot learning.

replace On Differentially Private Subspace Estimation in a Distribution-Free Setting

Authors: Eliad Tsfadia

Abstract: Private data analysis faces a significant challenge known as the curse of dimensionality, leading to increased costs. However, many datasets possess an inherent low-dimensional structure. For instance, during optimization via gradient descent, the gradients frequently reside near a low-dimensional subspace. If the low-dimensional structure could be privately identified using a small number of points, we could avoid paying for the high ambient dimension. On the negative side, Dwork, Talwar, Thakurta, and Zhang (STOC 2014) proved that privately estimating subspaces, in general, requires a number of points that depends polynomially on the dimension. However, their bounds do not rule out the possibility of reducing the number of points for "easy" instances. Yet, providing a measure that captures how "easy" a given dataset is for this task turns out to be challenging, and was not properly addressed in prior works. Inspired by the work of Singhal and Steinke (NeurIPS 2021), we provide the first measures that quantify "easiness" as a function of multiplicative singular-value gaps in the input dataset, and support them with new upper and lower bounds. In particular, our results determine the first types of gaps that are sufficient and necessary for estimating a subspace with a number of points that is independent of the dimension. Furthermore, we realize our upper bounds using a practical algorithm and demonstrate its advantage in high-dimensional regimes compared to prior approaches.

replace SimMLP: Training MLPs on Graphs without Supervision

Authors: Zehong Wang, Zheyuan Zhang, Chuxu Zhang, Yanfang Ye

Abstract: Graph Neural Networks (GNNs) have demonstrated their effectiveness in various graph learning tasks, yet their reliance on neighborhood aggregation during inference poses challenges for deployment in latency-sensitive applications, such as real-time financial fraud detection. To address this limitation, recent studies have proposed distilling knowledge from teacher GNNs into student Multi-Layer Perceptrons (MLPs) trained on node content, aiming to accelerate inference. However, these approaches often inadequately explore structural information when inferring unseen nodes. To this end, we introduce SimMLP, a Self-supervised framework for learning MLPs on graphs, designed to fully integrate rich structural information into MLPs. Notably, SimMLP is the first MLP-learning method that can achieve equivalence to GNNs in the optimal case. The key idea is to employ self-supervised learning to align the representations encoded by graph context-aware GNNs and neighborhood dependency-free MLPs, thereby fully integrating the structural information into MLPs. We provide a comprehensive theoretical analysis, demonstrating the equivalence between SimMLP and GNNs based on mutual information and inductive bias, highlighting SimMLP's advanced structural learning capabilities. Additionally, we conduct extensive experiments on 20 benchmark datasets, covering node classification, link prediction, and graph classification, to showcase SimMLP's superiority over state-of-the-art baselines, particularly in scenarios involving unseen nodes (e.g., inductive and cold-start node classification) where structural insights are crucial. Our codes are available at: https://github.com/Zehong-Wang/SimMLP.

URLs: https://github.com/Zehong-Wang/SimMLP.

replace Recurrent Reinforcement Learning with Memoroids

Authors: Steven Morad, Chris Lu, Ryan Kortvelesy, Stephan Liwicki, Jakob Foerster, Amanda Prorok

Abstract: Memory models such as Recurrent Neural Networks (RNNs) and Transformers address Partially Observable Markov Decision Processes (POMDPs) by mapping trajectories to latent Markov states. Neither model scales particularly well to long sequences, especially compared to an emerging class of memory models called Linear Recurrent Models. We discover that the recurrent update of these models resembles a monoid, leading us to reformulate existing models using a novel monoid-based framework that we call memoroids. We revisit the traditional approach to batching in recurrent reinforcement learning, highlighting theoretical and empirical deficiencies. We leverage memoroids to propose a batching method that improves sample efficiency, increases the return, and simplifies the implementation of recurrent loss functions in reinforcement learning.
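
The observation that a linear recurrent update resembles a monoid can be illustrated with a small sketch. The abstract does not give the memoroid definition, so what follows is only the standard monoid of affine maps for a scalar recurrence $h_t = a_t h_{t-1} + b_t$; the names `Affine`, `combine`, and `IDENTITY` are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Affine:
    """Monoid element (a, b) representing the map h -> a * h + b."""
    a: float
    b: float


# Identity element: the map h -> h.
IDENTITY = Affine(1.0, 0.0)


def combine(left: Affine, right: Affine) -> Affine:
    """Compose two updates: apply `left` first, then `right`."""
    return Affine(left.a * right.a, right.a * left.b + right.b)


def run_sequentially(h0: float, steps) -> float:
    """Reference implementation: apply each update one step at a time."""
    h = h0
    for step in steps:
        h = step.a * h + step.b
    return h


steps = [Affine(0.5, 1.0), Affine(0.9, 2.0), Affine(0.1, 3.0), Affine(0.7, 4.0)]

# Fold the whole sequence into a single update.
folded = reduce(combine, steps, IDENTITY)

# Associativity means the sequence can be split anywhere and recombined,
# which is what makes parallel (scan-style) evaluation possible.
regrouped = combine(
    reduce(combine, steps[:2], IDENTITY),
    reduce(combine, steps[2:], IDENTITY),
)

h0 = 2.0
print(run_sequentially(h0, steps), folded.a * h0 + folded.b)
```

Because `combine` is associative and has an identity, partial results over sub-sequences can be computed independently and merged, rather than strictly left to right.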

replace Data-Efficient Operator Learning via Unsupervised Pretraining and In-Context Learning

Authors: Wuyang Chen, Jialin Song, Pu Ren, Shashank Subramanian, Dmitriy Morozov, Michael W. Mahoney

Abstract: Recent years have witnessed the promise of coupling machine learning methods and physical domain-specific insights for solving scientific problems based on partial differential equations (PDEs). However, being data-intensive, these methods still require a large amount of PDE data. This reintroduces the need for expensive numerical PDE solutions, partially undermining the original goal of avoiding these expensive simulations. In this work, seeking data efficiency, we design unsupervised pretraining for PDE operator learning. To reduce the need for training data with heavy simulation costs, we mine unlabeled PDE data without simulated solutions, and we pretrain neural operators with physics-inspired reconstruction-based proxy tasks. To improve out-of-distribution performance, we further assist neural operators in flexibly leveraging a similarity-based method that learns in-context examples, without incurring extra training costs or designs. Extensive empirical evaluations on a diverse set of PDEs demonstrate that our method is highly data-efficient, more generalizable, and even outperforms conventional vision-pretrained models. We provide our code at https://github.com/delta-lab-ai/data_efficient_nopt.

URLs: https://github.com/delta-lab-ai/data_efficient_nopt.

replace Conformalized Selective Regression

Authors: Anna Sokol, Nuno Moniz, Nitesh Chawla

Abstract: Should prediction models always deliver a prediction? In the pursuit of maximum predictive performance, critical considerations of reliability and fairness are often overshadowed, particularly when it comes to the role of uncertainty. Selective regression, also known as the "reject option," allows models to abstain from predictions in cases of considerable uncertainty. Initially proposed seven decades ago, approaches to selective regression have mostly focused on distribution-based proxies for measuring uncertainty, particularly conditional variance. However, this focus neglects the significant influence of model-specific biases on a model's performance. In this paper, we propose a novel approach to selective regression by leveraging conformal prediction, which provides grounded confidence measures for individual predictions based on model-specific biases. In addition, we propose a standardized evaluation framework to allow proper comparison of selective regression approaches. Through extensive experiments, we show that our proposed approach, conformalized selective regression, has an advantage over multiple state-of-the-art baselines.

replace RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences

Authors: Jie Cheng, Gang Xiong, Xingyuan Dai, Qinghai Miao, Yisheng Lv, Fei-Yue Wang

Abstract: Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL methods excessively depend on high-quality feedback from domain experts, which results in a lack of robustness. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences. Our method utilizes a sample selection-based discriminator to dynamically filter out noise and ensure robust training. To counteract the cumulative error stemming from incorrect selection, we suggest a warm start for the reward model, which additionally bridges the performance gap during the transition from pre-training to online training in PbRL. Our experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the state-of-the-art PbRL method. Code is available at https://github.com/CJReinforce/RIME_ICML2024.

URLs: https://github.com/CJReinforce/RIME_ICML2024.

replace Log Neural Controlled Differential Equations: The Lie Brackets Make a Difference

Authors: Benjamin Walker, Andrew D. McLeod, Tiexin Qin, Yichuan Cheng, Haoliang Li, Terry Lyons

Abstract: The vector field of a controlled differential equation (CDE) describes the relationship between a control path and the evolution of a solution path. Neural CDEs (NCDEs) treat time series data as observations from a control path, parameterise a CDE's vector field using a neural network, and use the solution path as a continuously evolving hidden state. As their formulation makes them robust to irregular sampling rates, NCDEs are a powerful approach for modelling real-world data. Building on neural rough differential equations (NRDEs), we introduce Log-NCDEs, a novel, effective, and efficient method for training NCDEs. The core component of Log-NCDEs is the Log-ODE method, a tool from the study of rough paths for approximating a CDE's solution. Log-NCDEs are shown to outperform NCDEs, NRDEs, the linear recurrent unit, S5, and MAMBA on a range of multivariate time series datasets with up to $50{,}000$ observations.

replace Pruning neural network models for gene regulatory dynamics using data and domain knowledge

Authors: Intekhab Hossain, Jonas Fischer, Rebekka Burkholz, John Quackenbush

Abstract: The practical utility of machine learning models in the sciences often hinges on their interpretability. It is common to assess a model's merit for scientific discovery, and thus novel insights, by how well it aligns with already available domain knowledge--a dimension that is currently largely disregarded in the comparison of neural network models. While pruning can simplify deep neural network architectures and excels in identifying sparse models, as we show in the context of gene regulatory network inference, state-of-the-art techniques struggle with biologically meaningful structure learning. To address this issue, we propose DASH, a generalizable framework that guides network pruning by using domain-specific structural information in model fitting and leads to sparser, better interpretable models that are more robust to noise. Using both synthetic data with ground truth information, as well as real-world gene expression data, we show that DASH, using knowledge about gene interaction partners within the putative regulatory network, outperforms general pruning methods by a large margin and yields deeper insights into the biological systems being studied.

replace Taming Cross-Domain Representation Variance in Federated Prototype Learning with Heterogeneous Data Domains

Authors: Lei Wang, Jieming Bian, Letian Zhang, Chen Chen, Jie Xu

Abstract: Federated learning (FL) allows collaborative machine learning training without sharing private data. While most FL methods assume identical data domains across clients, real-world scenarios often involve heterogeneous data domains. Federated Prototype Learning (FedPL) addresses this issue, using mean feature vectors as prototypes to enhance model generalization. However, existing FedPL methods create the same number of prototypes for each client, leading to cross-domain performance gaps and disparities for clients with varied data distributions. To mitigate cross-domain feature representation variance, we introduce FedPLVM, which establishes variance-aware dual-level prototypes clustering and employs a novel $\alpha$-sparsity prototype loss. The dual-level prototypes clustering strategy creates local clustered prototypes based on private data features, then performs global prototypes clustering to reduce communication complexity and preserve local data privacy. The $\alpha$-sparsity prototype loss aligns samples from underrepresented domains, enhancing intra-class similarity and reducing inter-class similarity. Evaluations on Digit-5, Office-10, and DomainNet datasets demonstrate our method's superiority over existing approaches.
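
The baseline notion of a prototype mentioned above, a mean feature vector, can be sketched in a few lines. This shows only the per-class averaging step on a single client; it does not implement FedPLVM's dual-level clustering or the $\alpha$-sparsity loss, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def class_prototypes(features, labels):
    """Return {label: mean feature vector} for one client's local data."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        # Accumulate the elementwise sum of feature vectors per class.
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


# Toy local features (e.g. encoder outputs) with their class labels.
features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
labels = [0, 0, 1]

protos = class_prototypes(features, labels)
print(protos)
```

With a single mean per class, every client contributes the same number of prototypes regardless of how spread out its features are, which is the limitation the abstract attributes to existing FedPL methods.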

replace Rethinking Adversarial Inverse Reinforcement Learning: Policy Imitation, Transferable Reward Recovery and Algebraic Equilibrium Proof

Authors: Yangchun Zhang, Qiang Liu, Weiming Li, Yirui Zhou

Abstract: Adversarial inverse reinforcement learning (AIRL) stands as a cornerstone approach in imitation learning, yet it faces criticisms from prior studies. In this paper, we rethink AIRL and respond to these criticisms. Criticism 1 concerns Inadequate Policy Imitation. We show that substituting the built-in algorithm with soft actor-critic (SAC) during policy updating (which requires multiple iterations) significantly enhances the efficiency of policy imitation. Criticism 2 concerns Limited Performance in Transferable Reward Recovery Despite SAC Integration. While we find that SAC indeed yields a significant improvement in policy imitation, it introduces drawbacks to transferable reward recovery. We prove that the SAC algorithm by itself is not able to disentangle the reward function comprehensively during the AIRL training process, and propose a hybrid framework, PPO-AIRL + SAC, for a satisfactory transfer effect. Criticism 3 concerns Unsatisfactory Proof from the Perspective of Potential Equilibrium. We reanalyze it from an algebraic theory perspective.

replace Accelerating Transformer Pre-training with 2:4 Sparsity

Authors: Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, Jun Zhu

Abstract: Training large transformers is slow, but recent innovations in GPU architecture give us an advantage. NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. In light of this property, we comprehensively investigate the feasibility of accelerating feed-forward networks (FFNs) of transformers in pre-training. First, we define a ``flip rate'' to monitor the stability of a 2:4 training process. Utilizing this metric, we propose three techniques to preserve accuracy: to modify the sparse-refined straight-through estimator by applying the masked decay term on gradients, to determine a feasible decay factor in the warm-up stage, and to enhance the model's quality by a dense fine-tuning procedure near the end of pre-training. Besides, we devise two techniques to practically accelerate training: to calculate transposable 2:4 masks by convolution, and to accelerate gated activation functions by reducing GPU L2 cache misses. Experiments show that our 2:4 sparse training algorithm achieves similar convergence to dense training algorithms on several transformer pre-training tasks, while actual acceleration can be observed across different shapes of transformer blocks. Our toolkit is available at https://github.com/huyz2023/2by4-pretrain.

URLs: https://github.com/huyz2023/2by4-pretrain
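
To make the 2:4 pattern in the abstract above concrete: in each contiguous group of four weights, at most two are non-zero. The sketch below is only an illustration of that constraint. It assumes a simple magnitude-based selection (keep the two largest absolute values per group), which is not the paper's transposable-mask procedure; the function names `two_four_mask` and `apply_mask` are hypothetical.

```python
def two_four_mask(row):
    """Return a 0/1 mask keeping the 2 largest-magnitude entries in each group of 4.

    Assumption: magnitude-based selection, chosen here for illustration only.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    mask = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest |w| in this group (ties broken by position).
        keep = sorted(range(4), key=lambda i: -abs(group[i]))[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


def apply_mask(row, mask):
    """Zero out the entries that the mask drops."""
    return [w * m for w, m in zip(row, mask)]


weights = [0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.2, 1.5]
mask = two_four_mask(weights)
sparse = apply_mask(weights, mask)
print(mask)    # [0, 1, 1, 0, 1, 0, 0, 1]
```

Every group of four in `mask` sums to exactly two, which is the structure that sparse matrix-multiplication hardware relies on.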

replace CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models

Authors: Je-Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, Azalia Mirhoseini

Abstract: Large Language Models (LLMs) have dramatically advanced AI applications, yet their deployment remains challenging due to their immense inference costs. Recent studies ameliorate the computational costs of LLMs by increasing their activation sparsity but suffer from significant performance degradation on downstream tasks. In this work, we introduce a new framework for sparsifying the activations of base LLMs and reducing inference costs, dubbed Contextually Aware Thresholding for Sparsity (CATS). CATS is relatively simple, easy to implement, and highly effective. At the heart of our framework is a new non-linear activation function. We demonstrate that CATS can be applied to various base models, including Mistral-7B and Llama2-7B, and outperforms existing sparsification techniques in downstream task performance. More precisely, CATS-based models often achieve downstream task performance within 1-2% of their base models without any fine-tuning and even at activation sparsity levels of 50%. Furthermore, CATS-based models converge faster and display better task performance than competing techniques when fine-tuning is applied. Finally, we develop a custom GPU kernel for efficient implementation of CATS that translates the activation sparsity of CATS to real wall-clock time speedups. Our custom kernel implementation of CATS results in a ~15% improvement in wall-clock inference latency of token generation on both Llama-7B and Mistral-7B.

replace Semi-supervised Symmetric Non-negative Matrix Factorization with Low-Rank Tensor Representation

Authors: Yuheng Jia, Jia-Nan Li, Wenhui Wu, Ran Wang

Abstract: Semi-supervised symmetric non-negative matrix factorization (SNMF) utilizes the available supervisory information (usually in the form of pairwise constraints) to improve the clustering ability of SNMF. Previous methods introduce the pairwise constraints from a local perspective, i.e., they either directly refine the similarity matrix element-wise or restrain the distance of the decomposed vectors in pairs according to the pairwise constraints. These methods overlook the global perspective, i.e., in the ideal case, the pairwise constraint matrix and the ideal similarity matrix possess the same low-rank structure. To this end, we first propose a novel semi-supervised SNMF model by seeking a low-rank representation for the tensor synthesized from the pairwise constraint matrix and a similarity matrix obtained by the product of the embedding matrix and its transpose, which could strengthen those two matrices simultaneously from a global perspective. We then propose an enhanced SNMF model, making the embedding matrix tailored to the above tensor low-rank representation. We finally refine the similarity matrix by the strengthened pairwise constraints. We repeat the above steps to continuously boost the similarity matrix and pairwise constraint matrix, leading to a high-quality embedding matrix. Extensive experiments substantiate the superiority of our method. The code is available at https://github.com/JinaLeejnl/TSNMF.

URLs: https://github.com/JinaLeejnl/TSNMF

replace Active Preference Learning for Ordering Items In- and Out-of-sample

Authors: Herman Bergstr\"om, Emil Carlsson, Devdatt Dubhashi, Fredrik D. Johansson

Abstract: Learning an ordering of items based on pairwise comparisons is useful when items are difficult to rate consistently on an absolute scale, for example, when annotators have to make subjective assessments. When exhaustive comparison is infeasible, actively sampling item pairs can reduce the number of annotations necessary for learning an accurate ordering. However, many algorithms ignore shared structure between items, limiting their sample efficiency and precluding generalization to new items. It is also common to disregard how noise in comparisons varies between item pairs, despite it being informative of item similarity. In this work, we study active preference learning for ordering items with contextual attributes, both in- and out-of-sample. We give an upper bound on the expected ordering error of a logistic preference model as a function of which items have been compared. Next, we propose an active learning strategy that samples items to minimize this bound by accounting for aleatoric and epistemic uncertainty in comparisons. We evaluate the resulting algorithm, and a variant aimed at reducing model misspecification, in multiple realistic ordering tasks with comparisons made by human annotators. Our results demonstrate superior sample efficiency and generalization compared to non-contextual ranking approaches and active preference learning baselines.

replace Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics

Authors: Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael Jordan, Jiantao Jiao, Yuandong Tian, Stuart Russell

Abstract: Auto-regressive large language models (LLMs) show impressive capacities to solve many complex reasoning tasks while struggling with some simple logical reasoning tasks such as inverse search: when trained on '$A \to B$' (e.g., 'Tom is the parent of John'), an LLM fails to directly conclude '$B \gets A$' (e.g., 'John is the child of Tom') during inference even if the two sentences are semantically identical, which is known as the 'reversal curse'. In this paper, we theoretically analyze the reversal curse via the training dynamics of (stochastic) gradient descent for two auto-regressive models: (1) a bilinear model that can be viewed as a simplification of a one-layer transformer; (2) one-layer transformers under certain assumptions. Our analysis reveals that for both models, the reversal curse is a consequence of the 'asymmetry' of the (effective) model weights, i.e., an increase of the weights from a token $A$ to a token $B$ during training does not necessarily cause an increase of the weights from $B$ to $A$, which is caused by the training dynamics under certain choices of loss function and the optimization space of model parameters. Moreover, our analysis can be naturally applied to other logical reasoning tasks such as chain-of-thought (COT), which provides a new perspective different from previous work that focuses on expressivity. Finally, we conduct experiments to validate our theory on multi-layer transformers under different settings. Our code is available at https://github.com/marlo-z/reversal_curse_analysis/.

URLs: https://github.com/marlo-z/reversal_curse_analysis/

replace Is Mamba Compatible with Trajectory Optimization in Offline Reinforcement Learning?

Authors: Yang Dai, Oubo Ma, Longfei Zhang, Xingxing Liang, Shengchao Hu, Mengzhu Wang, Shouling Ji, Jincai Huang, Li Shen

Abstract: Transformer-based trajectory optimization methods have demonstrated exceptional performance in offline Reinforcement Learning (offline RL). Yet, they pose challenges due to their substantial parameter size and limited scalability, which is particularly critical in sequential decision-making scenarios where resources are constrained, such as in robots and drones with limited computational power. Mamba, a promising new linear-time sequence model, offers performance on par with transformers while requiring substantially fewer parameters on long sequences. As it remains unclear whether Mamba is compatible with trajectory optimization, this work aims to conduct comprehensive experiments to explore the potential of Decision Mamba (dubbed DeMa) in offline RL from the aspect of data structures and essential components with the following insights: (1) Long sequences impose a significant computational burden without contributing to performance improvements since DeMa's focus on sequences diminishes approximately exponentially. Consequently, we introduce a Transformer-like DeMa as opposed to an RNN-like DeMa. (2) For the components of DeMa, we identify the hidden attention mechanism as a critical factor in its success, which can also work well with other residual structures and does not require position embedding. Extensive evaluations demonstrate that our specially designed DeMa is compatible with trajectory optimization and surpasses previous methods, outperforming Decision Transformer (DT) while using 30\% fewer parameters in Atari, and exceeding DT with only a quarter of the parameters in MuJoCo.

replace Maximum Entropy Reinforcement Learning via Energy-Based Normalizing Flow

Authors: Chen-Hao Chao, Chien Feng, Wei-Fang Sun, Cheng-Kuang Lee, Simon See, Chun-Yi Lee

Abstract: Existing Maximum-Entropy (MaxEnt) Reinforcement Learning (RL) methods for continuous action spaces are typically formulated based on actor-critic frameworks and optimized through alternating steps of policy evaluation and policy improvement. In the policy evaluation steps, the critic is updated to capture the soft Q-function. In the policy improvement steps, the actor is adjusted in accordance with the updated soft Q-function. In this paper, we introduce a new MaxEnt RL framework modeled using Energy-Based Normalizing Flows (EBFlow). This framework integrates the policy evaluation steps and the policy improvement steps, resulting in a single objective training process. Our method enables the calculation of the soft value function used in the policy evaluation target without Monte Carlo approximation. Moreover, this design supports the modeling of multi-modal action distributions while facilitating efficient action sampling. To evaluate the performance of our method, we conducted experiments on the MuJoCo benchmark suite and a number of high-dimensional robotic tasks simulated by Omniverse Isaac Gym. The evaluation results demonstrate that our method achieves superior performance compared to widely-adopted representative baselines.

replace Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling

Authors: Shuaipeng Li, Penghao Zhao, Hailin Zhang, Xingwu Sun, Hao Wu, Dian Jiao, Weiyan Wang, Chengjun Liu, Zheng Fang, Jinbao Xue, Yangyu Tao, Bin Cui, Di Wang

Abstract: In current deep learning tasks, Adam-style optimizers such as Adam, Adagrad, RMSProp, Adafactor, and Lion have been widely used as alternatives to SGD-style optimizers. These optimizers typically update model parameters using the sign of gradients, resulting in more stable convergence curves. The learning rate and the batch size are the most critical hyperparameters for optimizers, which require careful tuning to enable effective convergence. Previous research has shown that the optimal learning rate increases linearly or follows similar rules with batch size for SGD-style optimizers. However, this conclusion is not applicable to Adam-style optimizers. In this paper, we elucidate the connection between optimal learning rates and batch sizes for Adam-style optimizers through both theoretical analysis and extensive experiments. First, we establish the scaling law between batch sizes and optimal learning rates in the sign of gradient case, in which we prove that the optimal learning rate first rises and then falls as the batch size increases. Moreover, the peak value of the surge will gradually move toward larger batch sizes as training progresses. Second, we conduct experiments on various CV and NLP tasks and verify the correctness of the scaling law.

replace IncomeSCM: From tabular data set to time-series simulator and causal estimation benchmark

Authors: Fredrik D. Johansson

Abstract: Evaluating observational estimators of causal effects demands information that is rarely available: unconfounded interventions and outcomes from the population of interest, created either by randomization or adjustment. As a result, it is customary to fall back on simulators when creating benchmark tasks. Simulators offer great control but are often too simplistic to make challenging tasks, either because they are hand-designed and lack the nuances of real-world data, or because they are fit to observational data without structural constraints. In this work, we propose a general, repeatable strategy for turning observational data into sequential structural causal models and challenging estimation tasks by following two simple principles: 1) fitting real-world data where possible, and 2) creating complexity by composing simple, hand-designed mechanisms. We implement these ideas in a highly configurable software package and apply it to the well-known Adult income data set to construct the IncomeSCM simulator. From this, we devise multiple estimation tasks and sample data sets to compare established estimators of causal effects. The tasks present a suitable challenge, with effect estimates varying greatly in quality between methods, despite similar performance in the modeling of factual outcomes, highlighting the need for dedicated causal estimators and model selection criteria.

replace Looks Too Good To Be True: An Information-Theoretic Analysis of Hallucinations in Generative Restoration Models

Authors: Regev Cohen, Idan Kligvasser, Ehud Rivlin, Daniel Freedman

Abstract: The pursuit of high perceptual quality in image restoration has driven the development of revolutionary generative models, capable of producing results often visually indistinguishable from real data. However, as their perceptual quality continues to improve, these models also exhibit a growing tendency to generate hallucinations - realistic-looking details that do not exist in the ground truth images. Hallucinations in these models create uncertainty about their reliability, raising major concerns about their practical application. This paper investigates this phenomenon through the lens of information theory, revealing a fundamental tradeoff between uncertainty and perception. We rigorously analyze the relationship between these two factors, proving that the global minimal uncertainty in generative models grows in tandem with perception. In particular, we define the inherent uncertainty of the restoration problem and show that attaining perfect perceptual quality entails at least twice this uncertainty. Additionally, we establish a relation between distortion, uncertainty and perception, through which we prove that the aforementioned uncertainty-perception tradeoff induces the well-known perception-distortion tradeoff. We demonstrate our theoretical findings through experiments with super-resolution and inpainting algorithms. This work uncovers fundamental limitations of generative models in achieving both high perceptual quality and reliable predictions for image restoration. Thus, we aim to raise awareness among practitioners about this inherent tradeoff, empowering them to make informed decisions and potentially prioritize safety over perceptual performance.

replace On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability

Authors: Chenyu Zheng, Wei Huang, Rongzhen Wang, Guoqiang Wu, Jun Zhu, Chongxuan Li

Abstract: Autoregressively trained transformers have brought a profound revolution to the world, especially with their in-context learning (ICL) ability to address downstream tasks. Recently, several studies have suggested that transformers learn a mesa-optimizer during autoregressive (AR) pretraining to implement ICL. Namely, the forward pass of the trained transformer is equivalent to optimizing an inner objective function in-context. However, whether the practical non-convex training dynamics will converge to the ideal mesa-optimizer is still unclear. Towards filling this gap, we investigate the non-convex dynamics of a one-layer linear causal self-attention model autoregressively trained by gradient flow, where the sequences are generated by an AR process $x_{t+1} = W x_t$. First, under a certain condition of data distribution, we prove that an autoregressively trained transformer learns $W$ by implementing one step of gradient descent to minimize an ordinary least squares (OLS) problem in-context. It then applies the learned $\widehat{W}$ for next-token prediction, thereby verifying the mesa-optimization hypothesis. Next, under the same data conditions, we explore the capability limitations of the obtained mesa-optimizer. We show that a stronger assumption related to the moments of data is a necessary and sufficient condition for the learned mesa-optimizer to recover the distribution. Besides, we conduct exploratory analyses beyond the first data condition and prove that generally, the trained transformer will not perform vanilla gradient descent for the OLS problem. Finally, our simulation results verify the theoretical results.
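
As a rough illustration of what "one step of gradient descent on an in-context OLS problem" can look like, the sketch below takes a sequence generated by $x_{t+1} = W x_t$, starts from a zero estimate, and applies a single gradient step on the loss $\frac{1}{2}\sum_t \|x_{t+1} - \widehat{W} x_t\|^2$. The zero initialization, the step size `eta`, and the helper names are assumptions made for this example and are not taken from the paper.

```python
def matvec(W, x):
    """Matrix-vector product for nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[a * b for b in v] for a in u]


def one_step_gd_estimate(seq, eta):
    """One gradient step from W_hat = 0 on 0.5 * sum_t ||x_{t+1} - W_hat x_t||^2.

    At W_hat = 0 the negative gradient is sum_t x_{t+1} x_t^T, so the
    result is eta times that sum (hypothetical step size eta).
    """
    d = len(seq[0])
    W_hat = [[0.0] * d for _ in range(d)]
    for x_t, x_next in zip(seq, seq[1:]):
        update = outer(x_next, x_t)
        for i in range(d):
            for j in range(d):
                W_hat[i][j] += eta * update[i][j]
    return W_hat


# Toy AR process: W is a 90-degree rotation, so the sequence cycles with period 4.
W = [[0.0, -1.0], [1.0, 0.0]]
seq = [[1.0, 0.0]]
for _ in range(4):
    seq.append(matvec(W, seq[-1]))

W_hat = one_step_gd_estimate(seq, eta=0.5)
prediction = matvec(W_hat, seq[-1])  # next-token prediction W_hat x_T
```

In this toy case the single step recovers $W$ exactly, but only because the chosen sequence makes $\sum_t x_t x_t^\top$ a multiple of the identity and `eta` is set to match; for general data one gradient step is only an approximation of the OLS solution.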

replace Recurrent Complex-Weighted Autoencoders for Unsupervised Object Discovery

Authors: Anand Gopalakrishnan, Aleksandar Stani\'c, J\"urgen Schmidhuber, Michael Curtis Mozer

Abstract: Current state-of-the-art synchrony-based models encode object bindings with complex-valued activations and compute with real-valued weights in feedforward architectures. We argue for the computational advantages of a recurrent architecture with complex-valued weights. We propose a fully convolutional autoencoder, SynCx, that performs iterative constraint satisfaction: at each iteration, a hidden layer bottleneck encodes statistically regular configurations of features in particular phase relationships; over iterations, local constraints propagate and the model converges to a globally consistent configuration of phase assignments. Binding is achieved simply by the matrix-vector product operation between complex-valued weights and activations, without the need for additional mechanisms that have been incorporated into current synchrony-based models. SynCx outperforms or is strongly competitive with current models for unsupervised object discovery. SynCx also avoids certain systematic grouping errors of current models, such as the inability to separate similarly colored objects without additional supervision.

replace 4-bit Shampoo for Memory-Efficient Network Training

Authors: Sike Wang, Pan Zhou, Jia Li, Hua Huang

Abstract: Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models trained by second-order optimizers. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage. However, current approaches only pertain to first-order optimizers. In this paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones. We show that quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is remarkably better than quantizing the preconditioner itself both theoretically and experimentally. By rectifying the orthogonality of the quantized eigenvector matrix, we enhance the approximation of the preconditioner's eigenvector matrix, which also benefits the computation of its inverse 4-th root. Besides, we find that linear square quantization slightly outperforms dynamic tree quantization when quantizing second-order optimizer states. Evaluation on various networks for image classification and natural language modeling demonstrates that our 4-bit Shampoo achieves comparable performance to its 32-bit counterpart while being more memory-efficient.

replace IM-Context: In-Context Learning for Imbalanced Regression Tasks

Authors: Ismail Nejjar, Faez Ahmed, Olga Fink

Abstract: Regression models often fail to generalize effectively in regions characterized by highly imbalanced label distributions. Previous methods for deep imbalanced regression rely on gradient-based weight updates, which tend to overfit in underrepresented regions. This paper proposes a paradigm shift towards in-context learning as an effective alternative to conventional in-weight learning methods, particularly for addressing imbalanced regression. In-context learning refers to the ability of a model to condition itself, given a prompt sequence composed of in-context samples (input-label pairs) alongside a new query input to generate predictions, without requiring any parameter updates. In this paper, we study the impact of the prompt sequence on the model performance from both theoretical and empirical perspectives. We emphasize the importance of localized context in reducing bias within regions of high imbalance. Empirical evaluations across a variety of real-world datasets demonstrate that in-context learning substantially outperforms existing in-weight learning methods in scenarios with high levels of imbalance.

replace A Canonicalization Perspective on Invariant and Equivariant Learning

Authors: George Ma, Yifei Wang, Derek Lim, Stefanie Jegelka, Yisen Wang

Abstract: In many applications, we desire neural networks to exhibit invariance or equivariance to certain groups due to symmetries inherent in the data. Recently, frame-averaging methods have emerged as a unified framework for attaining symmetries efficiently by averaging over input-dependent subsets of the group, i.e., frames. What we currently lack is a principled understanding of the design of frames. In this work, we introduce a canonicalization perspective that provides an essential and complete view of the design of frames. Canonicalization is a classic approach for attaining invariance by mapping inputs to their canonical forms. We show that there exists an inherent connection between frames and canonical forms. Leveraging this connection, we can efficiently compare the complexity of frames as well as determine the optimality of certain frames. Guided by this principle, we design novel frames for eigenvectors that are strictly superior to existing methods -- some are even optimal -- both theoretically and empirically. The reduction to the canonicalization perspective further uncovers equivalences between previous methods. These observations suggest that canonicalization provides a fundamental understanding of existing frame-averaging methods and unifies existing equivariant and invariant learning methods. Code is available at https://github.com/GeorgeMLP/canonicalization.

URLs: https://github.com/GeorgeMLP/canonicalization

replace Can Graph Learning Improve Planning in LLM-based Agents?

Authors: Xixi Wu, Yifei Shen, Caihua Shan, Kaitao Song, Siwei Wang, Bohang Zhang, Jiarui Feng, Hong Cheng, Wei Chen, Yun Xiong, Dongsheng Li

Abstract: Task planning in language agents is emerging as an important research topic alongside the development of large language models (LLMs). It aims to break down complex user requests in natural language into solvable sub-tasks, thereby fulfilling the original requests. In this context, the sub-tasks can be naturally viewed as a graph, where the nodes represent the sub-tasks, and the edges denote the dependencies among them. Consequently, task planning is a decision-making problem that involves selecting a connected path or subgraph within the corresponding graph and invoking it. In this paper, we explore graph learning-based methods for task planning, a direction that is orthogonal to the prevalent focus on prompt design. Our interest in graph learning stems from a theoretical discovery: the biases of attention and auto-regressive loss impede LLMs' ability to effectively navigate decision-making on graphs, which is adeptly addressed by graph neural networks (GNNs). This theoretical insight led us to integrate GNNs with LLMs to enhance overall performance. Extensive experiments demonstrate that GNN-based methods surpass existing solutions even without training, and minimal training can further enhance their performance. The performance gain increases with a larger task graph size.

replace Preference Alignment with Flow Matching

Authors: Minu Kim, Yongsik Lee, Sehyeok Kang, Jihwan Oh, Song Chong, Se-Young Yun

Abstract: We present Preference Flow Matching (PFM), a new framework for preference-based reinforcement learning (PbRL) that streamlines the integration of preferences into an arbitrary class of pre-trained models. Existing PbRL methods require fine-tuning pre-trained models, which presents challenges such as scalability, inefficiency, and the need for model modifications, especially with black-box APIs like GPT-4. In contrast, PFM utilizes flow matching techniques to directly learn from preference data, thereby reducing the dependency on extensive fine-tuning of pre-trained models. By leveraging flow-based models, PFM transforms less preferred data into preferred outcomes, and effectively aligns model outputs with human preferences without relying on explicit or implicit reward function estimation, thus avoiding common issues like overfitting in reward models. We provide theoretical insights that support our method's alignment with standard PbRL objectives. Experimental results indicate the practical effectiveness of our method, offering a new direction in aligning a pre-trained model to preference. Our code is available at https://github.com/jadehaus/preference-flow-matching.

URLs: https://github.com/jadehaus/preference-flow-matching

replace Segment, Shuffle, and Stitch: A Simple Layer for Improving Time-Series Representations

Authors: Shivam Grover, Amin Jalali, Ali Etemad

Abstract: Existing approaches for learning representations of time-series keep the temporal arrangement of the time-steps intact with the presumption that the original order is optimal for learning. However, non-adjacent sections of real-world time-series may have strong dependencies. Accordingly, we raise the question: Is there an alternative arrangement for time-series which could enable more effective representation learning? To address this, we propose a simple plug-and-play neural network layer called Segment, Shuffle, and Stitch (S3) designed to improve representation learning by time-series models. S3 works by creating non-overlapping segments from the original sequence and shuffling them in a learned manner that is optimal for the task at hand. It then re-attaches the shuffled segments and performs a learned weighted sum with the original input to capture both the newly shuffled sequence and the original sequence. S3 is modular and can be stacked to achieve different levels of granularity, and can be added to many forms of neural architectures including CNNs or Transformers with negligible computation overhead. Through extensive experiments on several datasets and state-of-the-art baselines, we show that incorporating S3 results in significant improvements for the tasks of time-series classification, forecasting, and anomaly detection, improving performance on certain datasets by up to 68\%. We also show that S3 makes the learning more stable with a smoother training loss curve and loss landscape compared to the original baseline. The code is available at https://github.com/shivam-grover/S3-TimeSeries.

URLs: https://github.com/shivam-grover/S3-TimeSeries
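
The three steps named in the S3 abstract above (segment, shuffle, stitch, followed by a weighted sum with the input) can be sketched on a plain list. This is a minimal illustration under stated assumptions: the segment order `order` and the mixing weight `mix` are fixed by hand here, whereas the abstract describes both as learned, and the function name is hypothetical rather than taken from the linked repository.

```python
def segment_shuffle_stitch(series, num_segments, order, mix):
    """Split a series into equal segments, reorder them, and blend with the input.

    order: a permutation of range(num_segments) (hand-picked here, learned in S3).
    mix:   weight on the shuffled series; (1 - mix) goes to the original.
    """
    if len(series) % num_segments != 0:
        raise ValueError("series length must be divisible by num_segments")
    if sorted(order) != list(range(num_segments)):
        raise ValueError("order must be a permutation of the segment indices")
    seg_len = len(series) // num_segments
    # Segment: non-overlapping, equal-length chunks.
    segments = [series[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]
    # Shuffle and stitch: concatenate the segments in the new order.
    shuffled = [value for idx in order for value in segments[idx]]
    # Weighted sum of the shuffled and original sequences.
    return [mix * s + (1.0 - mix) * x for s, x in zip(shuffled, series)]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
blended = segment_shuffle_stitch(series, num_segments=3, order=[2, 0, 1], mix=0.5)
print(blended)  # [3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
```

With `mix=1.0` the output is the purely shuffled sequence, and with `mix=0.0` it is the unchanged input, which shows how the weighted sum interpolates between the two.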

replace Iteration Head: A Mechanistic Study of Chain-of-Thought

Authors: Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Alice Yang, Francois Charton, Julia Kempe

Abstract: Chain-of-Thought (CoT) reasoning is known to improve Large Language Models both empirically and in terms of theoretical approximation power. However, our understanding of the inner workings of CoT capabilities and the conditions under which they emerge remains limited. This paper helps fill this gap by demonstrating how CoT reasoning emerges in transformers in a controlled and interpretable setting. In particular, we observe the appearance of a specialized attention mechanism dedicated to iterative reasoning, which we call "iteration heads". We track both the emergence and the precise working of these iteration heads down to the attention level, and measure the transferability of the CoT skills to which they give rise between tasks.

replace Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models

Authors: Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch

Abstract: Diffusion models (DMs) produce very detailed and high-quality images. Their power results from extensive training on large amounts of data, usually scraped from the internet without proper attribution or consent from content creators. Unfortunately, this practice raises privacy and intellectual property concerns, as DMs can memorize and later reproduce their potentially sensitive or copyrighted training images at inference time. Prior efforts prevent this issue by either changing the input to the diffusion process, thereby preventing the DM from generating memorized samples during inference, or removing the memorized data from training altogether. While those are viable solutions when the DM is developed and deployed in a secure and constantly monitored environment, they hold the risk of adversaries circumventing the safeguards and are not effective when the DM itself is publicly released. To solve the problem, we introduce NeMo, the first method to localize memorization of individual data samples down to the level of neurons in DMs' cross-attention layers. Through our experiments, we make the intriguing finding that in many cases, single neurons are responsible for memorizing particular training samples. By deactivating these memorization neurons, we can avoid the replication of training data at inference time, increase the diversity in the generated outputs, and mitigate the leakage of private and copyrighted data. In this way, our NeMo contributes to a more responsible deployment of DMs.

replace Amalgam: A Framework for Obfuscated Neural Network Training on the Cloud

Authors: Sifat Ut Taki, Spyridon Mastorakis

Abstract: Training a proprietary Neural Network (NN) model with a proprietary dataset on the cloud comes at the risk of exposing the model architecture and the dataset to the cloud service provider. To tackle this problem, in this paper, we present an NN obfuscation framework, called Amalgam, to train NN models in a privacy-preserving manner in existing cloud-based environments. Amalgam achieves that by augmenting NN models and the datasets to be used for training with well-calibrated noise to "hide" both the original model architectures and training datasets from the cloud. After training, Amalgam extracts the original models from the augmented models and returns them to users. Our evaluation results with different computer vision and natural language processing models and datasets demonstrate that Amalgam: (i) introduces modest overheads into the training process without impacting its correctness, and (ii) does not affect the model's accuracy. The prototype implementation is available at: https://github.com/SifatTaj/amalgam

URLs: https://github.com/SifatTaj/amalgam

replace Probabilistic Weather Forecasting with Hierarchical Graph Neural Networks

Authors: Joel Oskarsson, Tomas Landelius, Marc Peter Deisenroth, Fredrik Lindsten

Abstract: In recent years, machine learning has established itself as a powerful tool for high-resolution weather forecasting. While most current machine learning models focus on deterministic forecasts, accurately capturing the uncertainty in the chaotic weather system calls for probabilistic modeling. We propose a probabilistic weather forecasting model called Graph-EFM, combining a flexible latent-variable formulation with the successful graph-based forecasting framework. The use of a hierarchical graph construction allows for efficient sampling of spatially coherent forecasts. Requiring only a single forward pass per time step, Graph-EFM allows for fast generation of arbitrarily large ensembles. We experiment with the model on both global and limited area forecasting. Ensemble forecasts from Graph-EFM achieve equivalent or lower errors than comparable deterministic models, with the added benefit of accurately capturing forecast uncertainty.

replace CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning

Authors: Yibo Yang, Xiaojie Li, Zhongzhu Zhou, Shuaiwen Leon Song, Jianlong Wu, Liqiang Nie, Bernard Ghanem

Abstract: Current parameter-efficient fine-tuning (PEFT) methods build adapters widely agnostic of the context of downstream task to learn, or the context of important knowledge to maintain. As a result, there is often a performance gap compared to full-parameter fine-tuning, and meanwhile the fine-tuned model suffers from catastrophic forgetting of the pre-trained world knowledge. In this paper, we propose CorDA, a Context-oriented Decomposition Adaptation method that builds learnable task-aware adapters from weight decomposition oriented by the context of downstream task or the world knowledge to maintain. Concretely, we collect a few data samples, and perform singular value decomposition for each linear layer of a pre-trained LLM multiplied by the covariance matrix of the input activation using these samples. The inverse of the covariance matrix is multiplied with the decomposed components to reconstruct the original weights. By doing so, the context of the representative samples is captured through deciding the factorizing orientation. Our method enables two options, the knowledge-preserved adaptation and the instruction-previewed adaptation. For the former, we use question-answering samples to obtain the covariance matrices, and use the decomposed components with the smallest $r$ singular values to initialize a learnable adapter, with the others frozen such that the world knowledge is better preserved. For the latter, we use the instruction data from the fine-tuning task, such as math or coding, to orientate the decomposition and train the largest $r$ components that most correspond to the task to learn. We conduct extensive experiments on Math, Code, and Instruction Following tasks.

replace LoCoCo: Dropping In Convolutions for Long Context Compression

Authors: Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen

Abstract: This paper tackles the memory hurdle of processing long context sequences in Large Language Models (LLMs), by presenting a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo). LoCoCo employs only a fixed-size Key-Value (KV) cache, and can enhance efficiency in both inference and fine-tuning stages. Diverging from prior methods that selectively drop KV pairs based on heuristics, LoCoCo leverages a data-driven adaptive fusion technique, blending previous KV pairs with incoming tokens to minimize the loss of contextual information and ensure accurate attention modeling. This token integration is achieved through injecting one-dimensional convolutional kernels that dynamically calculate mixing weights for each KV cache slot. Designed for broad compatibility with existing LLM frameworks, LoCoCo allows for straightforward "drop-in" integration without needing architectural modifications, while incurring minimal tuning overhead. Experiments demonstrate that LoCoCo maintains consistently outstanding performance across various context lengths and can achieve a high context compression rate during both inference and fine-tuning phases. During inference, we successfully compressed up to 3482 tokens into a 128-size KV cache, while retaining comparable performance to the full sequence - an accuracy improvement of up to 0.2791 compared to baselines at the same cache size. During post-training tuning, we also effectively extended the context length from 4K to 32K using a KV cache of fixed size 512, achieving performance similar to fine-tuning with entire sequences.

replace Learning Continually by Spectral Regularization

Authors: Alex Lewandowski, Michał Bortkiewicz, Saurabh Kumar, András György, Dale Schuurmans, Mateusz Ostaszewski, Marlos C. Machado

Abstract: Loss of plasticity is a phenomenon where neural networks can become more difficult to train over the course of learning. Continual learning algorithms seek to mitigate this effect by sustaining good performance while maintaining network trainability. We develop a new technique for improving continual learning inspired by the observation that the singular values of the neural network parameters at initialization are an important factor for trainability during early phases of learning. From this perspective, we derive a new spectral regularizer for continual learning that better sustains these beneficial initialization properties throughout training. In particular, the regularizer keeps the maximum singular value of each layer close to one. Spectral regularization directly ensures that gradient diversity is maintained throughout training, which promotes continual trainability, while minimally interfering with performance in a single task. We present an experimental analysis that shows how the proposed spectral regularizer can sustain trainability and performance across a range of model architectures in continual supervised and reinforcement learning settings. Spectral regularization is less sensitive to hyperparameters while demonstrating better training in individual tasks, sustaining trainability as new tasks arrive, and achieving better generalization performance.
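
A minimal sketch of the kind of penalty the abstract describes, one that pulls the maximum singular value of each layer toward one. The squared-deviation form, the `strength` parameter, and the power-iteration estimator are assumptions made for illustration; the abstract does not specify the regularizer's exact form.

```python
import math

def spectral_norm(weight, iters=200):
    """Estimate the largest singular value of a matrix (list of rows)
    by power iteration on W^T W. Assumes the uniform start vector is not
    orthogonal to the top right-singular vector."""
    rows, cols = len(weight), len(weight[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        # u = W v
        u = [sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = W^T u
        w = [sum(weight[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    u = [sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))

def spectral_penalty(layers, strength=1.0):
    """Hypothetical penalty: sum over layers of (sigma_max - 1)^2."""
    return strength * sum((spectral_norm(w) - 1.0) ** 2 for w in layers)
```

In training, a term like this would be added to the task loss, so that layers whose largest singular value drifts away from one are penalized.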

replace Distributional MIPLIB: a Multi-Domain Library for Advancing ML-Guided MILP Methods

Authors: Weimin Huang, Taoan Huang, Aaron M Ferber, Bistra Dilkina

Abstract: Mixed Integer Linear Programming (MILP) is a fundamental tool for modeling combinatorial optimization problems. Recently, a growing body of research has used machine learning to accelerate MILP solving. Despite the increasing popularity of this approach, there is a lack of a common repository that provides distributions of similar MILP instances across different domains, at different hardness levels, with standardized test sets. In this paper, we introduce Distributional MIPLIB, a multi-domain library of problem distributions for advancing ML-guided MILP methods. We curate MILP distributions from existing work in this area as well as real-world problems that have not been used, and classify them into different hardness levels. It will facilitate research in this area by enabling comprehensive evaluation on diverse and realistic domains. We empirically illustrate the benefits of using Distributional MIPLIB as a research vehicle in two ways. We evaluate the performance of ML-guided variable branching on previously unused distributions to identify potential areas for improvement. Moreover, we propose to learn branching policies from a mix of distributions, demonstrating that mixed distributions achieve better performance compared to homogeneous distributions when there is limited data and generalize well to larger instances. The dataset is publicly available at https://sites.google.com/usc.edu/distributional-miplib/home.

URLs: https://sites.google.com/usc.edu/distributional-miplib/home

replace Classic GNNs are Strong Baselines: Reassessing GNNs for Node Classification

Authors: Yuankai Luo, Lei Shi, Xiao-Ming Wu

Abstract: Graph Transformers (GTs) have recently emerged as popular alternatives to traditional message-passing Graph Neural Networks (GNNs), due to their theoretically superior expressiveness and impressive performance reported on standard node classification benchmarks, often significantly outperforming GNNs. In this paper, we conduct a thorough empirical analysis to reevaluate the performance of three classic GNN models (GCN, GAT, and GraphSAGE) against GTs. Our findings suggest that the previously reported superiority of GTs may have been overstated due to suboptimal hyperparameter configurations in GNNs. Remarkably, with slight hyperparameter tuning, these classic GNN models achieve state-of-the-art performance, matching or even exceeding that of recent GTs across 17 out of the 18 diverse datasets examined. Additionally, we conduct detailed ablation studies to investigate the influence of various GNN configurations, such as normalization, dropout, residual connections, and network depth, on node classification performance. Our study aims to promote a higher standard of empirical rigor in the field of graph machine learning, encouraging more accurate comparisons and evaluations of model capabilities.

replace Enhancing Domain Adaptation through Prompt Gradient Alignment

Authors: Hoang Phan, Lam Tran, Quyen Tran, Trung Le

Abstract: Prior Unsupervised Domain Adaptation (UDA) methods often aim to train a domain-invariant feature extractor, which may hinder the model from learning sufficiently discriminative features. To tackle this, a line of works based on prompt learning leverages the power of large-scale pre-trained vision-language models to learn both domain-invariant and specific features through a set of domain-agnostic and domain-specific learnable prompts. Those studies typically enforce invariant constraints on representation, output, or prompt space to learn such prompts. Differently, we cast UDA as a multiple-objective optimization problem in which each objective is represented by a domain loss. Under this new framework, we propose aligning per-objective gradients to foster consensus between them. Additionally, to prevent potential overfitting when fine-tuning this deep learning architecture, we penalize the norm of these gradients. To achieve these goals, we devise a practical gradient update procedure that can work under both single-source and multi-source UDA. Empirically, our method consistently surpasses other prompt-based baselines by a large margin on different UDA benchmarks.

replace QTIP: Quantization with Trellises and Incoherence Processing

Authors: Albert Tseng, Qingyao Sun, David Hou, Christopher De Sa

Abstract: Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.

replace Evaluating the design space of diffusion-based generative models

Authors: Yuqing Wang, Ye He, Molei Tao

Abstract: Most existing theoretical investigations of the accuracy of diffusion models, albeit significant, assume the score function has been approximated to a certain accuracy, and then use this a priori bound to control the error of generation. This article instead provides a first quantitative understanding of the whole generation process, i.e., both training and sampling. More precisely, it conducts a non-asymptotic convergence analysis of denoising score matching under gradient descent. In addition, a refined sampling error analysis for variance exploding models is also provided. The combination of these two results yields a full error analysis, which elucidates (again, but this time theoretically) how to design the training and sampling processes for effective generation. For instance, our theory implies a preference toward noise distribution and loss weighting in training that qualitatively agree with the ones used in [Karras et al., 2022]. It also provides perspectives on the choices of time and variance schedules in sampling: when the score is well trained, the design in [Song et al., 2021] is more preferable, but when it is less trained, the design in [Karras et al., 2022] becomes more preferable.

replace CollaFuse: Collaborative Diffusion Models

Authors: Simeon Allmendinger, Domenique Zipperling, Lukas Struppek, Niklas Kühl

Abstract: In the landscape of generative artificial intelligence, diffusion-based models have emerged as a promising method for generating synthetic images. However, the application of diffusion models poses numerous challenges, particularly concerning data availability, computational requirements, and privacy. Traditional approaches to address these shortcomings, like federated learning, often impose significant computational burdens on individual clients, especially those with constrained resources. In response to these challenges, we introduce a novel approach for distributed collaborative diffusion models inspired by split learning. Our approach facilitates collaborative training of diffusion models while alleviating client computational burdens during image synthesis. This reduced computational burden is achieved by retaining data and computationally inexpensive processes locally at each client while outsourcing the computationally expensive processes to shared, more efficient server resources. Through experiments on the common CelebA dataset, our approach demonstrates enhanced privacy by reducing the necessity for sharing raw data. These capabilities hold significant potential across various application areas, including the design of edge computing solutions. Thus, our work advances distributed machine learning by contributing to the evolution of collaborative diffusion models.

replace Zero-Inflated Tweedie Boosted Trees with CatBoost for Insurance Loss Analytics

Authors: Banghee So, Emiliano A. Valdez

Abstract: In this paper, we explore advanced modifications to the Tweedie regression model in order to address its limitations in modeling aggregate claims for various types of insurance such as automobile, health, and liability. Traditional Tweedie models, while effective in capturing the probability and magnitude of claims, usually fall short in accurately representing the large incidence of zero claims. Our recommended approach involves a refined modeling of the zero-claim process, together with the integration of boosting methods in order to help leverage an iterative process to enhance predictive accuracy. Despite the inherent slowdown in learning algorithms due to this iteration, several efficient implementation techniques that also help precise tuning of parameters like XGBoost, LightGBM, and CatBoost have emerged. Nonetheless, we chose to utilize CatBoost, an efficient boosting approach that effectively handles categorical and other special types of data. The core contribution of our paper is the assembly of separate modeling for zero claims and the application of tree-based boosting ensemble methods within a CatBoost framework, assuming that the inflated probability of zero is a function of the mean parameter. The efficacy of our enhanced Tweedie model is demonstrated through the application of an insurance telematics dataset, which presents the additional complexity of compositional feature variables. Our modeling results reveal a marked improvement in model performance, showcasing its potential to deliver more accurate predictions suitable for insurance claim analytics.

replace Are Language Models Actually Useful for Time Series Forecasting?

Authors: Mingtian Tan, Mike A. Merrill, Vinayak Gupta, Tim Althoff, Thomas Hartvigsen

Abstract: Large language models (LLMs) are being applied to time series forecasting. But are language models actually useful for time series? In a series of ablation studies on three recent and popular LLM-based time series forecasting methods, we find that removing the LLM component or replacing it with a basic attention layer does not degrade forecasting performance -- in most cases, the results even improve! We also find that despite their significant computational cost, pretrained LLMs do no better than models trained from scratch, do not represent the sequential dependencies in time series, and do not assist in few-shot settings. Additionally, we explore time series encoders and find that patching and attention structures perform similarly to LLM-based forecasters.

replace Towards Efficient and Scalable Training of Differentially Private Deep Learning

Authors: Sebastian Rodriguez Beltran, Marlon Tobaben, Joonas Jälkö, Niki Loppi, Antti Honkela

Abstract: Differentially private stochastic gradient descent (DP-SGD) is the standard algorithm for training machine learning models under differential privacy (DP). The most common DP-SGD privacy accountants rely on Poisson subsampling for ensuring the theoretical DP guarantees. Implementing computationally efficient DP-SGD with Poisson subsampling is not trivial, which leads to many implementations ignoring this requirement. We conduct a comprehensive empirical study to quantify the computational cost of training deep learning models under DP given the requirement of Poisson subsampling, by re-implementing efficient methods using Poisson subsampling and benchmarking them. We find that using the naive implementation of DP-SGD with Opacus in PyTorch has between 2.6 and 8 times lower throughput of processed training examples per second than SGD. However, efficient gradient clipping implementations with e.g. Ghost Clipping can roughly halve this cost. We propose alternative computationally efficient ways of implementing DP-SGD with JAX that use Poisson subsampling and achieve only around 1.2 times lower throughput than SGD based on PyTorch. We highlight important implementation considerations with JAX. Finally, we study the scaling behaviour using up to 80 GPUs and find that DP-SGD scales better than SGD. We share our re-implementations using Poisson subsampling at https://github.com/DPBayes/Towards-Efficient-Scalable-Training-DP-DL.

URLs: https://github.com/DPBayes/Towards-Efficient-Scalable-Training-DP-DL
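
A minimal sketch of Poisson subsampling as it is usually defined: each training example joins the batch independently with probability `sampling_rate`, so the batch size varies from step to step. The function name and parameters are hypothetical and are not taken from the authors' repository.

```python
import random

def poisson_subsample(num_examples, sampling_rate, rng):
    """Return the indices of one Poisson-subsampled batch.

    Each index is included independently with probability sampling_rate,
    so the batch size is random (binomially distributed), not fixed.
    """
    return [i for i in range(num_examples) if rng.random() < sampling_rate]

# Usage: batch sizes fluctuate around num_examples * sampling_rate.
rng = random.Random(0)
batch_sizes = [len(poisson_subsample(1000, 0.1, rng)) for _ in range(5)]
```

The variable batch size is one reason efficient implementations are non-trivial: frameworks that compile for fixed shapes need padding or masking to handle it.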

replace Decoding-Time Language Model Alignment with Multiple Objectives

Authors: Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, Simon S. Du

Abstract: Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives. Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions of all base models, for any given weightings over different objectives. We exploit a common form among a family of $f$-divergence regularized alignment approaches (such as PPO, DPO, and their variants) to identify a closed-form solution by Legendre transform, and derive an efficient decoding strategy. Theoretically, we show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method. Empirical results demonstrate the effectiveness of the algorithm. For example, compared to a parameter-merging baseline, MOD achieves 12.8% overall reward improvement when equally optimizing towards $3$ objectives. Moreover, we experiment with MOD on combining three fully-finetuned LLMs of different model sizes, each aimed at different objectives such as safety, coding, and general user preference. Unlike traditional methods that require careful curation of a mixture of datasets to achieve comprehensive improvement, we can quickly experiment with preference weightings using MOD to find the best combination of models. Our best combination reduces toxicity on Toxigen to nearly 0% and achieves 7.9--33.3% improvement across three other metrics (i.e., Codex@1, GSM-COT, BBH-COT).
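
A minimal sketch of decoding from a weighted combination of base-model predictions. It assumes the combination is taken over next-token log-probabilities and then renormalized; that choice is an assumption for illustration, since the exact combination rule in the paper depends on the $f$-divergence used. The function name and inputs are hypothetical.

```python
import math

def combine_next_token(logprobs_per_model, weights):
    """Mix per-model next-token log-probabilities with the given weights
    and renormalize with a numerically stable softmax.

    logprobs_per_model: one list of log-probabilities per base model,
                        all over the same vocabulary.
    weights:            one preference weight per base model.
    """
    vocab_size = len(logprobs_per_model[0])
    mixed = [
        sum(w * lp[t] for w, lp in zip(weights, logprobs_per_model))
        for t in range(vocab_size)
    ]
    top = max(mixed)
    exps = [math.exp(x - top) for x in mixed]
    total = sum(exps)
    return [e / total for e in exps]
```

Changing `weights` at decoding time shifts the output distribution between the base models without retraining any of them.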

replace Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Authors: Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon

Abstract: Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.

replace Commute Graph Neural Networks

Authors: Wei Zhuo, Guang Tan

Abstract: Graph Neural Networks (GNNs) have shown remarkable success in learning from graph-structured data. However, their application to directed graphs (digraphs) presents unique challenges, primarily due to the inherent asymmetry in node relationships. Traditional GNNs are adept at capturing unidirectional relations but fall short in encoding the mutual path dependencies between nodes, such as asymmetrical shortest paths typically found in digraphs. Recognizing this gap, we introduce Commute Graph Neural Networks (CGNN), an approach that seamlessly integrates node-wise commute time into the message passing scheme. The cornerstone of CGNN is an efficient method for computing commute time using a newly formulated digraph Laplacian. Commute time is then integrated into the neighborhood aggregation process, with neighbor contributions weighted according to their respective commute time to the central node in each layer. It enables CGNN to directly capture the mutual, asymmetric relationships in digraphs. Extensive experiments confirm the superior performance of CGNN.
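
For reference, the commute time between two nodes is the expected number of random-walk steps to go from one to the other and back. The sketch below computes it on a small digraph by fixed-point iteration on the hitting-time equations. This is the textbook definition, not the Laplacian-based method the abstract refers to, and it assumes a strongly connected graph given as a row-stochastic transition matrix.

```python
def hitting_times(transition, target, iters=1000):
    """Expected number of steps to first reach `target` from every node.

    Solves h[target] = 0 and h[u] = 1 + sum_w P[u][w] * h[w] for u != target
    by repeated substitution. Assumes `target` is reachable from all nodes.
    """
    n = len(transition)
    h = [0.0] * n
    for _ in range(iters):
        h = [
            0.0 if u == target
            else 1.0 + sum(transition[u][w] * h[w] for w in range(n))
            for u in range(n)
        ]
    return h

def commute_time(transition, u, v, iters=1000):
    """Commute time = hitting time u -> v plus hitting time v -> u."""
    return hitting_times(transition, v, iters)[u] + hitting_times(transition, u, iters)[v]

# Directed 3-cycle 0 -> 1 -> 2 -> 0: one step from 0 to 1, two steps back.
cycle = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

On the directed cycle the two hitting times differ (1 and 2 steps) while their sum is symmetric, which illustrates why commute time captures mutual path dependencies that a single edge direction does not.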

replace CONGO: Compressive Online Gradient Optimization

Authors: Jeremy Carleton, Prathik Vijaykumar, Divyanshu Saxena, Dheeraj Narasimha, Srinivas Shakkottai, Aditya Akella

Abstract: We address the challenge of zeroth-order online convex optimization where the objective function's gradient exhibits sparsity, indicating that only a small number of dimensions possess non-zero gradients. Our aim is to leverage this sparsity to obtain useful estimates of the objective function's gradient even when the only information available is a limited number of function samples. Our motivation stems from the optimization of large-scale queueing networks that process time-sensitive jobs. Here, a job must be processed by potentially many queues in sequence to produce an output, and the service time at any queue is a function of the resources allocated to that queue. Since resources are costly, the end-to-end latency for jobs must be balanced with the overall cost of the resources used. While the number of queues is substantial, the latency function primarily reacts to resource changes in only a few, rendering the gradient sparse. We tackle this problem by introducing the Compressive Online Gradient Optimization framework which allows compressive sensing methods previously applied to stochastic optimization to achieve regret bounds with an optimal dependence on the time horizon without the full problem dimension appearing in the bound. For specific algorithms, we reduce the samples required per gradient estimate to scale with the gradient's sparsity factor rather than its full dimensionality. Numerical simulations and real-world microservices benchmarks demonstrate CONGO's superiority over gradient descent approaches that do not account for sparsity.

replace End-To-End Causal Effect Estimation from Unstructured Natural Language Data

Authors: Nikita Dhawan, Leonardo Cotta, Karen Ullrich, Rahul G. Krishnan, Chris J. Maddison

Abstract: Knowing the effect of an intervention is critical for human decision-making, but current approaches for causal effect estimation rely on manual data collection and structuring, regardless of the causal assumptions. This increases both the cost and time-to-completion for studies. We show how large, diverse observational text data can be mined with large language models (LLMs) to produce inexpensive causal effect estimates under appropriate causal assumptions. We introduce NATURAL, a novel family of causal effect estimators built with LLMs that operate over datasets of unstructured text. Our estimators use LLM conditional distributions (over variables of interest, given the text data) to assist in the computation of classical estimators of causal effect. We overcome a number of technical challenges to realize this idea, such as automating data curation and using LLMs to impute missing information. We prepare six (two synthetic and four real) observational datasets, paired with corresponding ground truth in the form of randomized trials, which we used to systematically evaluate each step of our pipeline. NATURAL estimators demonstrate remarkable performance, yielding causal effect estimates that fall within 3 percentage points of their ground truth counterparts, including on real-world Phase 3/4 clinical trials. Our results suggest that unstructured text data is a rich source of causal effect information, and NATURAL is a first step towards an automated pipeline to tap this resource.

replace RIO-CPD: A Riemannian Geometric Method for Correlation-aware Online Change Point Detection

Authors: Chengyuan Deng, Zhengzhang Chen, Xujiang Zhao, Haoyu Wang, Junxiang Wang, Haifeng Chen, Jie Gao

Abstract: Change point detection aims to identify abrupt shifts occurring at multiple points within a data sequence. This task becomes particularly challenging in the online setting, where different types of changes can occur, including shifts in both the marginal and joint distributions of the data. In this paper, we address these challenges by tracking the Riemannian geometry of correlation matrices, allowing Riemannian metrics to compute the geodesic distance as an accurate measure of correlation dynamics. We introduce Rio-CPD, a non-parametric, correlation-aware online change point detection framework that integrates the Riemannian geometry of the manifold of symmetric positive definite matrices with the cumulative sum (CUSUM) statistic for detecting change points. Rio-CPD employs a novel CUSUM design by computing the geodesic distance between current observations and the Fréchet mean of prior observations. With appropriate choices of Riemannian metrics, Rio-CPD offers a simple yet effective and computationally efficient algorithm. Experimental results on both synthetic and real-world datasets demonstrate that Rio-CPD outperforms existing methods on detection accuracy, average detection delay and efficiency.
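
A minimal sketch of the standard one-sided CUSUM rule applied to a stream of distances. Here `distances` stands in for the geodesic distance between each new observation and the mean of prior observations, which is assumed to be computed elsewhere. The `drift` and `threshold` parameters are hypothetical, and this is the generic CUSUM recursion rather than the specific design proposed in the paper.

```python
def cusum_detect(distances, drift, threshold):
    """Return the index of the first detected change point, or None.

    The statistic accumulates how much each distance exceeds `drift`,
    resets to zero when it would go negative, and raises an alarm once
    it crosses `threshold`.
    """
    stat = 0.0
    for t, d in enumerate(distances):
        stat = max(0.0, stat + d - drift)
        if stat > threshold:
            return t
    return None

# Usage: small distances before the change, larger ones after it.
stream = [0.1] * 10 + [1.0] * 5
alarm = cusum_detect(stream, drift=0.3, threshold=1.5)
```

The gap between the true change (index 10 in this stream) and the alarm index is the detection delay, one of the metrics named in the abstract.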

replace The Group Robustness is in the Details: Revisiting Finetuning under Spurious Correlations

Authors: Tyler LaBonte, John C. Hill, Xinchen Zhang, Vidya Muthukumar, Abhishek Kumar

Abstract: Modern machine learning models are prone to over-reliance on spurious correlations, which can often lead to poor performance on minority groups. In this paper, we identify surprising and nuanced behavior of finetuned models on worst-group accuracy via comprehensive experiments on four well-established benchmarks across vision and language tasks. We first show that the commonly used class-balancing techniques of mini-batch upsampling and loss upweighting can induce a decrease in worst-group accuracy (WGA) with training epochs, leading to performance no better than without class-balancing. While in some scenarios, removing data to create a class-balanced subset is more effective, we show this depends on group structure and propose a mixture method which can outperform both techniques. Next, we show that scaling pretrained models is generally beneficial for worst-group accuracy, but only in conjunction with appropriate class-balancing. Finally, we identify spectral imbalance in finetuning features as a potential source of group disparities -- minority group covariance matrices incur a larger spectral norm than majority groups once conditioned on the classes. Our results show more nuanced interactions of modern finetuned models with group robustness than was previously known. Our code is available at https://github.com/tmlabonte/revisiting-finetuning.

URLs: https://github.com/tmlabonte/revisiting-finetuning

replace Range Membership Inference Attacks

Authors: Jiashu Tao, Reza Shokri

Abstract: Machine learning models can leak private information about their training data, but the standard methods to measure this risk, based on membership inference attacks (MIAs), have a major limitation. They only check if a given data point exactly matches a training point, neglecting the potential of similar or partially overlapping data revealing the same private information. To address this issue, we introduce the class of range membership inference attacks (RaMIAs), testing if the model was trained on any data in a specified range (defined based on the semantics of privacy). We formulate the RaMIAs game and design a principled statistical test for its complex hypotheses. We show that RaMIAs can capture privacy loss more accurately and comprehensively than MIAs on various types of data, such as tabular, image, and language. RaMIA paves the way for a more comprehensive and meaningful privacy auditing of machine learning algorithms.

replace A Multivocal Literature Review on Privacy and Fairness in Federated Learning

Authors: Beatrice Balbierer, Lukas Heinlein, Domenique Zipperling, Niklas Kühl

Abstract: Federated Learning presents a way to revolutionize AI applications by eliminating the necessity for data sharing. Yet, research has shown that information can still be extracted during training, making additional privacy-preserving measures such as differential privacy imperative. To implement real-world federated learning applications, fairness, ranging from a fair distribution of performance to non-discriminative behaviour, must be considered. Particularly in high-risk applications (e.g. healthcare), avoiding the repetition of past discriminatory errors is paramount. As recent research has demonstrated an inherent tension between privacy and fairness, we conduct a multivocal literature review to examine the current methods to integrate privacy and fairness in federated learning. Our analyses illustrate that the relationship between privacy and fairness has been neglected, posing a critical risk for real-world applications. We highlight the need to explore the relationship between privacy, fairness, and performance, advocating for the creation of integrated federated learning frameworks.

replace Clustering and Alignment: Understanding the Training Dynamics in Modular Addition

Authors: Tiberiu Musat

Abstract: Recent studies have revealed that neural networks learn interpretable algorithms for many simple problems. However, little is known about how these algorithms emerge during training. In this article, I study the training dynamics of a small neural network with 2-dimensional embeddings on the problem of modular addition. I observe that embedding vectors tend to organize into two types of structures: grids and circles. I study these structures and explain their emergence as a result of two simple tendencies exhibited by pairs of embeddings: clustering and alignment. I propose explicit formulae for these tendencies as interaction forces between different pairs of embeddings. To show that my formulae can fully account for the emergence of these structures, I construct an equivalent particle simulation where I show that identical structures emerge. I discuss the role of weight decay in my setup and reveal a new mechanism that links regularization and training dynamics. To support my findings, I also release an interactive demo available at https://modular-addition.vercel.app/.

URLs: https://modular-addition.vercel.app/

replace Optimization Hyper-parameter Laws for Large Language Models

Authors: Xingyu Xie, Shuicheng Yan, Kim-Chuan Toh, Tianwen Wei

Abstract: Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that effectively captures the relationship between hyper-parameters and training outcomes, enabling the pre-selection of potential optimal schedules. Grounded in stochastic differential equations, Opt-Laws introduce novel mathematical interpretability and offer a robust theoretical foundation for some popular LR schedules. Our extensive validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss and identify optimal LR schedule candidates in pre-training, continual training, and fine-tuning scenarios. This approach significantly reduces computational costs while enhancing overall model performance.

replace Denoising: A Powerful Building-Block for Imaging, Inverse Problems, and Machine Learning

Authors: Peyman Milanfar, Mauricio Delbracio

Abstract: Denoising, the process of reducing random fluctuations in a signal to emphasize essential patterns, has been a fundamental problem of interest since the dawn of modern scientific inquiry. Recent denoising techniques, particularly in imaging, have achieved remarkable success, nearing theoretical limits by some measures. Yet, despite tens of thousands of research papers, the wide-ranging applications of denoising beyond noise removal have not been fully recognized. This is partly due to the vast and diverse literature, making a clear overview challenging. This paper aims to address this gap. We present a clarifying perspective on denoisers, their structure, and desired properties. We emphasize the increasing importance of denoising and showcase its evolution into an essential building block for complex tasks in imaging, inverse problems, and machine learning. Despite its long history, the community continues to uncover unexpected and groundbreaking uses for denoising, further solidifying its place as a cornerstone of scientific and engineering practice.

replace Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control

Authors: Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, Ricky T. Q. Chen

Abstract: Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there have not been many theoretically-sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specific memoryless noise schedule must be enforced during fine-tuning, in order to account for the dependency between the noise variable and the generated samples. We also propose a new algorithm named Adjoint Matching which outperforms existing SOC algorithms, by casting SOC problems as a regression problem. We find that our approach significantly improves over existing methods for reward fine-tuning, achieving better consistency, realism, and generalization to unseen human preference reward models, while retaining sample diversity.

replace S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Authors: Yuezhou Hu, Jun Zhu, Jianfei Chen

Abstract: Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can run matrix multiplications twice as fast as a dense equivalent by implementing 2:4 sparsity. However, previous STE-based 2:4 pre-training methods (e.g. STE with hard-thresholding, SR-STE) suffer from optimization difficulties because of their discontinuous pruning function. In this study, we comprehensively analyse the bottleneck of traditional N:M sparse training and recognize three drawbacks of discontinuity: incorrect descending direction, inability to predict the amount of descent, and sparse mask oscillation. In light of this analysis, we propose S-STE, a simple yet powerful 2:4 training method that contains two parts: continuously projecting weights to be 2:4 sparse, and rescaling sparse weights with a per-tensor fixed scaling factor. Besides, we adopt minimum-variance unbiased estimation for the activation gradient and FP8 quantization for the whole process. Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even with full parameter models.
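
To make the 2:4 pattern concrete, the sketch below shows plain magnitude-based hard-thresholding, i.e. the discontinuous baseline the abstract criticises, not S-STE itself. In every group of four weights it keeps the two with the largest magnitude and zeroes the rest. The function name `prune_2_4` and the example weights are hypothetical.

```python
def prune_2_4(weights):
    """Zero the two smallest-magnitude entries in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Positions of the two largest magnitudes within this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out


row = [0.9, -0.1, 0.4, -0.7, 0.2, 0.3, -0.25, 0.05]
sparse = prune_2_4(row)
print(sparse)

# A small change in one weight (0.2 -> 0.26) flips which entries survive.
flipped = prune_2_4([0.26, 0.3, -0.25, 0.05])
print(flipped)
```

The second call illustrates the discontinuity: a weight crossing its neighbour's magnitude swaps the mask abruptly, which is one way to picture the mask oscillation mentioned in the abstract.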

replace Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

Authors: Qian Shao, Jiangrui Kang, Qiyuan Chen, Zepeng Li, Hongxia Xu, Yiwen Cao, Jiajuan Liang, Jian Wu

Abstract: Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve performance. However, we observe that how to select samples for labelling also significantly impacts performance, particularly under extremely low-budget settings. The sample selection task in SSL has been under-explored for a long time. To fill in this gap, we propose a Representative and Diverse Sample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm to minimise a novel criterion $\alpha$-Maximum Mean Discrepancy ($\alpha$-MMD), RDSS samples a representative and diverse subset for annotation from the unlabeled data. We demonstrate that minimizing $\alpha$-MMD enhances the generalization ability of low-budget learning. Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms the state-of-the-art sample selection approaches used in Active Learning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained annotation budgets.

replace Mastering Chess with a Transformer Model

Authors: Daniel Monroe, Philip A. Chalmers

Abstract: Transformer models have demonstrated impressive capabilities when trained at scale, excelling at difficult cognitive tasks requiring complex reasoning and rational decision-making. In this paper, we explore the application of transformers to chess, focusing on the critical role of the position representation within the attention mechanism. We show that transformers endowed with a sufficiently expressive position representation can match existing chess-playing models at a fraction of the computational cost. Our architecture, which we call the Chessformer, significantly outperforms AlphaZero in both playing strength and puzzle solving ability with 8x less computation and matches prior grandmaster-level transformer-based agents in those metrics with 30x less computation. Our models also display an understanding of chess dissimilar and orthogonal to that of top traditional engines, detecting high-level positional features like trapped pieces and fortresses that those engines struggle with. This work demonstrates that domain-specific enhancements can in large part replace the need for model scale, while also highlighting that deep learning can make strides even in areas dominated by search-based methods.

replace MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification

Authors: Siddhant Bikram Shah, Shuvam Shiwakoti, Maheep Chaudhary, Haohan Wang

Abstract: The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understanding of multiple aspects of expression conveyed by them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, this study expands this focus to encompass multiple aspects of linguistics: hate, targets of hate, stance, and humor. We introduce a novel dataset PrideMM comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM by using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP.

URLs: https://github.com/SiddhantBikram/MemeCLIP

replace TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models

Authors: Andrei Margeloiu, Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik

Abstract: Data collection is often difficult in critical fields such as medicine, physics, and chemistry. As a result, classification methods usually perform poorly with these small datasets, leading to weak predictive performance. Increasing the training set with additional synthetic data, similar to data augmentation in images, is commonly believed to improve downstream classification performance. However, current tabular generative methods that learn either the joint distribution $ p(\mathbf{x}, y) $ or the class-conditional distribution $ p(\mathbf{x} \mid y) $ often overfit on small datasets, resulting in poor-quality synthetic data, usually worsening classification performance compared to using real data alone. To solve these challenges, we introduce TabEBM, a novel class-conditional generative method using Energy-Based Models (EBMs). Unlike existing methods that use a shared model to approximate all class-conditional densities, our key innovation is to create distinct EBM generative models for each class, each modelling its class-specific data distribution individually. This approach creates robust energy landscapes, even in ambiguous class distributions. Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods. When used for data augmentation, our synthetic data consistently improves the classification performance across diverse datasets of various sizes, especially small ones. Code is available at https://github.com/andreimargeloiu/TabEBM.

URLs: https://github.com/andreimargeloiu/TabEBM

replace On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy

Authors: Saber Malekmohammadi, Golnoosh Farnadi

Abstract: A significant approach in natural language processing involves large-scale pre-training of models on general domain data followed by their adaptation to specific tasks or domains. As models grow in size, fully fine-tuning all of their parameters becomes increasingly impractical. To address this, some methods for low-rank task adaptation of language models have been proposed, e.g., LoRA and FLoRA. These methods keep the pre-trained model weights fixed and incorporate trainable low-rank decomposition matrices, called adapters, into some layers of the transformer architecture. This approach significantly reduces the number of trainable parameters required for downstream tasks compared to fully fine-tuning all parameters. In this work, we look at low-rank adaptation through the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA and FLoRA is equivalent to injecting some random noise into the batch gradients w.r.t. the adapter parameters, and we quantify the variance of the injected noise. By establishing a Berry-Esseen type bound on the total variation distance between the distribution of the injected noise and a Gaussian distribution with the same variance, we show that the dynamics of low-rank adaptation is close to that of differentially private fine-tuning of the adapters. Finally, using the Johnson-Lindenstrauss lemma, we show that when augmented with gradient scaling, low-rank adaptation is very close to performing the DPSGD algorithm with a fixed noise scale to fine-tune the adapters. These theoretical findings suggest that unlike other existing fine-tuning algorithms, low-rank adaptation provides privacy w.r.t. the fine-tuning data implicitly.
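
For readers unfamiliar with the adapter structure described above, the following minimal sketch shows a LoRA-style forward pass, y = W x + B (A x), with the pre-trained matrix W held fixed and only the low-rank factors A and B treated as trainable. It uses plain Python lists to stay self-contained. The helper names, the `scale` argument, and the toy matrices are assumptions for illustration, and the sketch does not attempt to reproduce the paper's privacy analysis.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W is frozen, A and B are the adapter."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def adapter_params(d_out, d_in, rank):
    """Number of trainable entries in A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


W = [[1, 0], [0, 1]]   # frozen 2x2 weight (toy values)
A = [[1, 1]]           # rank-1 down-projection, shape 1x2
B = [[2], [0]]         # rank-1 up-projection, shape 2x1
y = lora_forward(W, A, B, [3, 4])
print(y)

full = 4096 * 4096
low_rank = adapter_params(4096, 4096, rank=8)
print(full, low_rank)
```

With these hypothetical sizes, a rank-8 adapter on a 4096 x 4096 layer trains 65,536 entries instead of 16,777,216.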

replace Task-recency bias strikes back: Adapting covariances in Exemplar-Free Class Incremental Learning

Authors: Grzegorz Rypeść, Sebastian Cygert, Tomasz Trzciński, Bartłomiej Twardowski

Abstract: Exemplar-Free Class Incremental Learning (EFCIL) tackles the problem of training a model on a sequence of tasks without access to past data. Existing state-of-the-art methods represent classes as Gaussian distributions in the feature extractor's latent space, enabling Bayes classification or training the classifier by replaying pseudo features. However, we identify two critical issues that compromise their efficacy when the feature extractor is updated on incremental tasks. First, they do not consider that classes' covariance matrices change and must be adapted after each task. Second, they are susceptible to a task-recency bias caused by dimensionality collapse occurring during training. In this work, we propose AdaGauss -- a novel method that adapts covariance matrices from task to task and mitigates the task-recency bias owing to the additional anti-collapse loss function. AdaGauss yields state-of-the-art results on popular EFCIL benchmarks and datasets when training from scratch or starting from a pre-trained backbone. The code is available at: https://github.com/grypesc/AdaGauss.

URLs: https://github.com/grypesc/AdaGauss

replace A Taxonomy of Loss Functions for Stochastic Optimal Control

Authors: Carles Domingo-Enrich

Abstract: Stochastic optimal control (SOC) aims to direct the behavior of noisy systems and has widespread applications in science, engineering, and artificial intelligence. In particular, reward fine-tuning of diffusion and flow matching models and sampling from unnormalized methods can be recast as SOC problems. A recent work has introduced Adjoint Matching (Domingo-Enrich et al., 2024), a loss function for SOC problems that vastly outperforms existing loss functions in the reward fine-tuning setup. The goal of this work is to clarify the connections between all the existing (and some new) SOC loss functions. Namely, we show that SOC loss functions can be grouped into classes that share the same gradient in expectation, which means that their optimization landscape is the same; they only differ in their gradient variance. We perform simple SOC experiments to understand the strengths and weaknesses of different loss functions.

replace "Show Me What's Wrong!": Combining Charts and Text to Guide Data Analysis

Authors: Beatriz Feliciano, Rita Costa, Jean Alves, Javier Liébana, Diogo Duarte, Pedro Bizarro

Abstract: Analyzing and finding anomalies in multi-dimensional datasets is a cumbersome but vital task across different domains. In the context of financial fraud detection, analysts must quickly identify suspicious activity among transactional data. This is an iterative process made of complex exploratory tasks such as recognizing patterns, grouping, and comparing. To mitigate the information overload inherent to these steps, we present a tool combining automated information highlights, Large Language Model generated textual insights, and visual analytics, facilitating exploration at different levels of detail. We perform a segmentation of the data per analysis area and visually represent each one, making use of automated visual cues to signal which require more attention. Upon user selection of an area, our system provides textual and graphical summaries. The text, acting as a link between the high-level and detailed views of the chosen segment, allows for a quick understanding of relevant details. A thorough exploration of the data comprising the selection can be done through graphical representations. The feedback gathered in a study performed with seven domain experts suggests our tool effectively supports and guides exploratory analysis, easing the identification of suspicious information.

replace How much can we forget about Data Contamination?

Authors: Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, Ulrike von Luxburg

Abstract: The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we use experimental evidence and theoretical estimates to challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). We find that if model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. We then derive a simple theory of example forgetting via cumulative weight decay. It allows us to bound the number of gradient steps required to forget past data for any training run where we know the hyperparameters of AdamW. This indicates that many LLMs, including Llama 3, have forgotten the data seen at the beginning of training. Experimentally, we demonstrate that forgetting occurs faster than what is predicted by our bounds. Taken together, our results suggest that moderate amounts of contamination can be forgotten at the end of realistically scaled training runs.
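
The abstract does not state its bound, so the following is only a rough sketch of the underlying intuition, under the assumption that decoupled weight decay (as in AdamW) multiplies the weights by (1 - lr * weight_decay) at every step. Under that assumption, the contribution of an old update shrinks by the product of those factors over all later steps. The function names and hyperparameter values are hypothetical and are not the paper's.

```python
import math


def cumulative_decay(lrs, weight_decay):
    """Product of per-step shrink factors (1 - lr_t * weight_decay)."""
    factor = 1.0
    for lr in lrs:
        factor *= 1.0 - lr * weight_decay
    return factor


def steps_to_shrink(lr, weight_decay, target):
    """Smallest n with (1 - lr * weight_decay) ** n <= target, constant lr."""
    per_step = 1.0 - lr * weight_decay
    return math.ceil(math.log(target) / math.log(per_step))


# Hypothetical hyperparameters: constant learning rate, no schedule.
lr, wd = 1e-3, 0.1
n = steps_to_shrink(lr, wd, target=0.5)
print(n, cumulative_decay([lr] * n, wd))
```

A real training run uses a learning-rate schedule, in which case `cumulative_decay` would be fed the actual per-step learning rates instead of a constant.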

replace Resource-aware Mixed-precision Quantization for Enhancing Deployability of Transformers for Time-series Forecasting on Embedded FPGAs

Authors: Tianheng Ling, Chao Qian, Gregor Schiele

Abstract: This study addresses the deployment challenges of integer-only quantized Transformers on resource-constrained embedded FPGAs (Xilinx Spartan-7 XC7S15). We enhanced the flexibility of our VHDL template by introducing a selectable resource type for storing intermediate results across model layers, thereby breaking the deployment bottleneck by utilizing BRAM efficiently. Moreover, we developed a resource-aware mixed-precision quantization approach that enables researchers to explore hardware-level quantization strategies without requiring extensive expertise in Neural Architecture Search. This method provides accurate resource utilization estimates with a precision discrepancy as low as 3%, compared to actual deployment metrics. Compared to previous work, our approach has successfully facilitated the deployment of model configurations utilizing mixed-precision quantization, thus overcoming the limitations inherent in five previously non-deployable configurations with uniform quantization bitwidths. Consequently, this research enhances the applicability of Transformers in embedded systems, facilitating a broader range of Transformer-powered applications on edge devices.

replace FutureFill: Fast Generation from Convolutional Sequence Models

Authors: Naman Agarwal, Xinyi Chen, Evan Dogariu, Vlad Feinberg, Daniel Suo, Peter Bartlett, Elad Hazan

Abstract: We address the challenge of efficient auto-regressive generation in sequence prediction models by introducing FutureFill - a method for fast generation that applies to any sequence prediction algorithm based on convolutional operators. Our approach reduces the generation time requirement from quadratic to quasilinear relative to the context length. Additionally, FutureFill requires a prefill cache sized only by the number of tokens generated, which is smaller than the cache requirements for standard convolutional and attention-based models. We validate our theoretical findings with experimental evidence demonstrating correctness and efficiency gains in a synthetic generation task.

replace Benchmarking Data Heterogeneity Evaluation Approaches for Personalized Federated Learning

Authors: Zhilong Li, Xiaohu Wu, Xiaoli Tang, Tiantian He, Yew-Soon Ong, Mengmeng Chen, Qiqi Liu, Qicheng Lao, Han Yu

Abstract: There is growing research interest in measuring the statistical heterogeneity of clients' local datasets. Such measurements are used to estimate the suitability for collaborative training of personalized federated learning (PFL) models. Currently, these research endeavors are taking place in silos and there is a lack of a unified benchmark to provide a fair and convenient comparison among various approaches in common settings. We aim to bridge this important gap in this paper. The proposed benchmarking framework currently includes six representative approaches. Extensive experiments have been conducted to compare these approaches under five standard non-IID FL settings, providing much needed insights into which approaches are advantageous under which settings. The proposed framework offers useful guidance on the suitability of various data divergence measures in FL systems. It is beneficial for keeping related research activities on the right track in terms of: (1) designing PFL schemes, (2) selecting appropriate data heterogeneity evaluation approaches for specific FL application scenarios, and (3) addressing fairness issues in collaborative model training. The code is available at https://github.com/Xiaoni-61/DH-Benchmark.

URLs: https://github.com/Xiaoni-61/DH-Benchmark

replace Masked Generative Priors Improve World Models Sequence Modelling Capabilities

Authors: Cristian Meo, Mircea Lica, Zarif Ikram, Akihiro Nakano, Vedant Shah, Aniket Rajiv Didolkar, Dianbo Liu, Anirudh Goyal, Justin Dauwels

Abstract: Deep Reinforcement Learning (RL) has become the leading approach for creating artificial agents in complex environments. Model-based approaches, which are RL methods with world models that predict environment dynamics, are among the most promising directions for improving data efficiency, forming a critical step toward bridging the gap between research and real-world deployment. In particular, world models enhance sample efficiency by learning in imagination, which involves training a generative sequence model of the environment in a self-supervised manner. Recently, Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling and generating token sequences. Building on the Efficient Stochastic Transformer-based World Models (STORM) architecture, we replace the traditional MLP prior with a Masked Generative Prior (e.g., MaskGIT Prior) and introduce GIT-STORM. We evaluate our model on two downstream tasks: reinforcement learning and video prediction. GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark. Moreover, we apply Transformer-based World Models to continuous action environments for the first time, addressing a significant gap in prior research. To achieve this, we employ a state mixer function that integrates latent state representations with actions, enabling our model to handle continuous control tasks. We validate this approach through qualitative and quantitative analyses on the DeepMind Control Suite, showcasing the effectiveness of Transformer-based World Models in this new domain. Our results highlight the versatility and efficacy of the MaskGIT dynamics prior, paving the way for more accurate world models and effective RL policies.

replace Hierarchical Universal Value Function Approximators

Authors: Rushiv Arora

Abstract: There have been key advancements in building universal approximators for multi-goal collections of reinforcement learning value functions -- key elements in estimating long-term returns of states in a parameterized manner. We extend this to hierarchical reinforcement learning, using the options framework, by introducing hierarchical universal value function approximators (H-UVFAs). This allows us to leverage the added benefits of scaling, planning, and generalization expected in temporal abstraction settings. We develop supervised and reinforcement learning methods for learning embeddings of the states, goals, options, and actions in the two hierarchical value functions: $Q(s, g, o; \theta)$ and $Q(s, g, o, a; \theta)$. Finally, we demonstrate generalization of the H-UVFAs and show they outperform corresponding UVFAs.

replace From Theory to Practice: Implementing and Evaluating e-Fold Cross-Validation

Authors: Christopher Mahlich, Tobias Vente, Joeran Beel

Abstract: This paper introduces e-fold cross-validation, an energy-efficient alternative to k-fold cross-validation. It dynamically adjusts the number of folds based on a stopping criterion. The criterion checks after each fold whether the standard deviation of the evaluated folds has consistently decreased or remained stable. Once met, the process stops early. We tested e-fold cross-validation on 15 datasets and 10 machine-learning algorithms. On average, it required 4 fewer folds than 10-fold cross-validation, reducing evaluation time, computational resources, and energy use by about 40%. Performance differences between e-fold and 10-fold cross-validation were less than 2% for larger datasets. More complex models showed even smaller discrepancies. In 96% of iterations, the results were within the confidence interval, confirming statistical significance. E-fold cross-validation offers a reliable and efficient alternative to k-fold, reducing computational costs while maintaining comparable accuracy.
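
The early-stopping idea can be sketched as follows. The abstract does not give the exact criterion, so this version assumes a simple rule: stop once the running standard deviation of the fold scores has not increased (within a tolerance) for a set number of consecutive folds. The names `e_fold_stop_index`, `patience`, `tol`, and `min_folds` are hypothetical, and the scores are made up. In practice each fold would be trained and evaluated lazily rather than read from a precomputed list.

```python
import statistics


def e_fold_stop_index(fold_scores, patience=2, tol=1e-3, min_folds=3):
    """Return how many folds are evaluated before the stopping rule fires."""
    stds = []
    stable = 0
    for k in range(2, len(fold_scores) + 1):
        # Sample standard deviation of the first k fold scores.
        stds.append(statistics.stdev(fold_scores[:k]))
        if len(stds) >= 2:
            if stds[-1] <= stds[-2] + tol:
                stable += 1   # decreased or stayed roughly the same
            else:
                stable = 0    # spread grew, so start counting again
        if k >= min_folds and stable >= patience:
            return k
    return len(fold_scores)


steady = [0.80, 0.82, 0.81, 0.81, 0.81, 0.90, 0.70, 0.81, 0.81, 0.81]
noisy = [0.5, 0.6, 0.4, 0.8, 0.2]
print(e_fold_stop_index(steady), e_fold_stop_index(noisy))
```

With these toy scores the steady run stops after four of ten folds, while the noisy run never satisfies the rule and uses all five.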

replace ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models

Authors: Nandan Kumar Jha, Brandon Reagen

Abstract: LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for the GELU in transformer-based models, our empirical findings demonstrate an opposite trend -- ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement. We discover a key issue with GELU, where early layers experience entropic overload, leading to the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are ill-suited for LayerNorm-free architectures, whereas ReLU's geometrical properties -- specialization in input space and intra-class selectivity -- lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges.

replace Replay-and-Forget-Free Graph Class-Incremental Learning: A Task Profiling and Prompting Approach

Authors: Chaoxi Niu, Guansong Pang, Ling Chen, Bing Liu

Abstract: Class-incremental learning (CIL) aims to continually learn a sequence of tasks, with each task consisting of a set of unique classes. Graph CIL (GCIL) follows the same setting but needs to deal with graph tasks (e.g., node classification in a graph). The key characteristic of CIL lies in the absence of task identifiers (IDs) during inference, which causes a significant challenge in separating classes from different tasks (i.e., inter-task class separation). Being able to accurately predict the task IDs can help address this issue, but it is a challenging problem. In this paper, we show theoretically that accurate task ID prediction on graph data can be achieved by a Laplacian smoothing-based graph task profiling approach, in which each graph task is modeled by a task prototype based on Laplacian smoothing over the graph. It guarantees that the task prototypes of the same graph task are nearly the same with a large smoothing step, while those of different tasks are distinct due to differences in graph structure and node attributes. Further, to avoid the catastrophic forgetting of the knowledge learned in previous graph tasks, we propose a novel graph prompting approach for GCIL which learns a small discriminative graph prompt for each task, essentially resulting in a separate classification model for each task. The prompt learning requires the training of a single graph neural network (GNN) only once on the first task, and no data replay is required thereafter, thereby obtaining a GCIL model that is both replay-free and forget-free. Extensive experiments on four GCIL benchmarks show that i) our task prototype-based method can achieve 100% task ID prediction accuracy on all four datasets, ii) our GCIL model significantly outperforms state-of-the-art competing methods by at least 18% in average CIL accuracy, and iii) our model is fully free of forgetting on the four datasets.

replace Learning to Optimize for Mixed-Integer Non-linear Programming

Authors: Bo Tang, Elias B. Khalil, Ján Drgoňa

Abstract: Mixed-integer non-linear programs (MINLPs) arise in various domains, such as energy systems and transportation, but are notoriously difficult to solve. Recent advances in machine learning have led to remarkable successes in optimization tasks, an area broadly known as learning to optimize. This approach includes using predictive models to generate solutions for optimization problems with continuous decision variables, thereby avoiding the need for computationally expensive optimization algorithms. However, applying learning to MINLPs remains challenging primarily due to the presence of integer decision variables, which complicate gradient-based learning. To address this limitation, we propose two differentiable correction layers that generate integer outputs while preserving gradient information. Combined with a soft penalty for constraint violation, our framework can tackle both the integrality and non-linear constraints in a MINLP. Experiments on three problem classes with convex/non-convex objective/constraints and integer/mixed-integer variables show that the proposed learning-based approach consistently produces high-quality solutions for parametric MINLPs extremely quickly. As problem size increases, traditional exact solvers and heuristic methods struggle to find feasible solutions, whereas our approach continues to deliver reliable results. Our work extends the scope of learning-to-optimize to MINLP, paving the way for integrating integer constraints into deep learning models. Our code is available at https://github.com/pnnl/L2O-pMINLP.

URLs: https://github.com/pnnl/L2O-pMINLP

replace Utilizing Large Language Models in an Iterative Paradigm with Domain Feedback for Zero-shot Molecule Optimization

Authors: Khiem Le, Nitesh V. Chawla

Abstract: Molecule optimization is a critical task in drug discovery to optimize desired properties of a given molecule through chemical modification. Although Large Language Models (LLMs) hold the potential to efficiently simulate this task by using natural language to direct the optimization, utilizing them straightforwardly shows limited performance. In this work, we facilitate utilizing LLMs in an iterative paradigm by proposing a simple yet highly effective domain feedback provider, namely $\text{Re}^2$DF. In detail, $\text{Re}^2$DF harnesses an external toolkit, RDKit, to handle molecule hallucination, i.e., the case where the modified molecule is chemically invalid. Otherwise, its desired properties are computed and compared to those of the original molecule, establishing reliable domain feedback with correct direction and distance towards the objective, followed by a retrieved example, to explicitly guide the LLM to refine the modified molecule. We conduct experiments across both single- and multi-property objectives with 2 thresholds, where $\text{Re}^2$DF shows significant improvements. Particularly, for 20 single-property objectives, $\text{Re}^2$DF enhances Hit ratio by 16.95% and 20.76% under loose and strict thresholds, respectively. For 32 multi-property objectives, $\text{Re}^2$DF enhances Hit ratio by 6.04% and 5.25%.

replace Exogenous Matching: Learning Good Proposals for Tractable Counterfactual Estimation

Authors: Yikang Chen, Dehui Du, Lili Tian

Abstract: We propose an importance sampling method for tractable and efficient estimation of counterfactual expressions in general settings, named Exogenous Matching. By minimizing a common upper bound of counterfactual estimators, we transform the variance minimization problem into a conditional distribution learning problem, enabling its integration with existing conditional distribution modeling approaches. We validate the theoretical results through experiments under various types and settings of Structural Causal Models (SCMs) and demonstrate that the method outperforms other existing importance sampling methods on counterfactual estimation tasks. We also explore the impact of injecting structural prior knowledge (counterfactual Markov boundaries) on the results. Finally, we apply this method to identifiable proxy SCMs and demonstrate the unbiasedness of the estimates, empirically illustrating the applicability of the method to practical scenarios.

replace CaTs and DAGs: Integrating Directed Acyclic Graphs with Transformers and Fully-Connected Neural Networks for Causally Constrained Predictions

Authors: Matthew J. Vowels, Mathieu Rochat, Sina Akbari

Abstract: Artificial Neural Networks (ANNs), including fully-connected networks and transformers, are highly flexible and powerful function approximators, widely applied in fields like computer vision and natural language processing. However, their inability to inherently respect causal structures can limit their robustness, making them vulnerable to covariate shift and difficult to interpret/explain. This poses significant challenges for their reliability in real-world applications. In this paper, we introduce Causal Fully-Connected Neural Networks (CFCNs) and Causal Transformers (CaTs), two general model families designed to operate under predefined causal constraints, as specified by a Directed Acyclic Graph (DAG). These models retain the powerful function approximation abilities of traditional neural networks while adhering to the underlying structural constraints, improving robustness, reliability, and interpretability at inference time. This approach opens new avenues for deploying neural networks in more demanding, real-world scenarios where robustness and explainability are critical.

replace A Systematic Survey on Large Language Models for Algorithm Design

Authors: Fei Liu, Yiming Yao, Ping Guo, Zhiyuan Yang, Zhe Zhao, Xi Lin, Xialiang Tong, Mingxuan Yuan, Zhichao Lu, Zhenkun Wang, Qingfu Zhang

Abstract: Algorithm Design (AD) is crucial for effective problem-solving across various domains. The advent of Large Language Models (LLMs) has notably enhanced the automation and innovation within this field, offering new perspectives and promising solutions. Over the past three years, the integration of LLMs into AD (LLM4AD) has progressed significantly, finding applications in diverse areas such as optimization, machine learning, mathematical reasoning, and scientific discovery. Given the rapid development and broadening scope of this field, a systematic review is both timely and essential. This paper provides a systematic review of the works on LLM4AD. First, we present an overview and summary of existing studies. Then, we present a systematic taxonomy and a review of existing works along four dimensions, including the role of LLMs, search techniques, prompt strategies, and applications, with a discussion of the potential and achievements of using LLMs. Finally, we explore current challenges and propose several open questions and promising directions for future research.

replace Gradient Rewiring for Editable Graph Neural Network Training

Authors: Zhimeng Jiang, Zirui Liu, Xiaotian Han, Qizhang Feng, Hongye Jin, Qiaoyu Tan, Kaixiong Zhou, Na Zou, Xia Hu

Abstract: Deep neural networks are ubiquitously adopted in many applications, such as computer vision, natural language processing, and graph analytics. However, well-trained neural networks can make prediction errors after deployment as the world changes. Model editing involves updating the base model to correct prediction errors with less accessible training data and computational resources. Despite recent advances in model editors in computer vision and natural language processing, editable training in graph neural networks (GNNs) is rarely explored. The challenge with editable GNN training lies in the inherent information aggregation across neighbors, which can lead model editors to affect the predictions of other nodes unintentionally. In this paper, we first observe a significant inconsistency between the gradients of the cross-entropy loss for the target node and for the training nodes, which indicates that directly fine-tuning the base model using the loss on the target node deteriorates the performance on training nodes. Motivated by the gradient inconsistency observation, we propose a simple yet effective Gradient Rewiring method for Editable graph neural network training, named GRE. Specifically, we first store the anchor gradient of the loss on training nodes to preserve the locality. Subsequently, we rewire the gradient of the loss on the target node to preserve performance on the training nodes using the anchor gradient. Experiments demonstrate the effectiveness of GRE on various model architectures and graph datasets in terms of multiple editing situations. The source code is available at https://github.com/zhimengj0326/Gradient_rewiring_editing

URLs: https://github.com/zhimengj0326/Gradient_rewiring_editing

replace Survival Multiarmed Bandits with Bootstrapping Methods

Authors: Peter Veroutis, Frédéric Godin

Abstract: The Multiarmed Bandits (MAB) problem has been extensively studied and has seen many practical applications in a variety of fields. The Survival Multiarmed Bandits (S-MAB) open problem is an extension which constrains an agent to a budget that is directly related to observed rewards. As budget depletion leads to ruin, an agent's objective is to both maximize expected cumulative rewards and minimize the probability of ruin. This paper presents a framework that addresses such a dual goal using an objective function balanced by a ruin aversion component. Action values are estimated through a novel approach which consists of bootstrapping samples from previously observed rewards. In numerical experiments, the policies we present outperform benchmarks from the literature.
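
Illustrative sketch (not the authors' algorithm): one way to picture bootstrapped action values with a ruin-aversion term is to resample an arm's observed rewards with replacement, simulate the budget forward, and penalize the simulated ruin frequency. The scoring rule, the function name `bootstrap_arm_score`, and every parameter value below are assumptions made for illustration.

```python
import random

def bootstrap_arm_score(observed_rewards, budget, horizon, ruin_aversion,
                        n_boot=500, seed=0):
    """Score an arm from bootstrap resamples of its observed rewards.

    Returns (score, ruin_fraction), where score is the mean simulated
    cumulative reward minus ruin_aversion times the simulated ruin fraction.
    This scoring rule is an illustrative assumption.
    """
    rng = random.Random(seed)
    total_reward, ruins = 0.0, 0
    for _ in range(n_boot):
        remaining, cumulative = budget, 0.0
        for _ in range(horizon):
            r = rng.choice(observed_rewards)  # bootstrap draw with replacement
            remaining += r
            cumulative += r
            if remaining <= 0:  # budget depleted: ruin ends this trajectory
                ruins += 1
                break
        total_reward += cumulative
    ruin_fraction = ruins / n_boot
    score = total_reward / n_boot - ruin_aversion * ruin_fraction
    return score, ruin_fraction

# A low-reward safe arm versus a higher-mean arm that can wipe out the budget.
safe_score, safe_ruin = bootstrap_arm_score(
    [0.1, 0.1, 0.1], budget=2.0, horizon=5, ruin_aversion=100.0)
risky_score, risky_ruin = bootstrap_arm_score(
    [-3.0, 4.0], budget=2.0, horizon=5, ruin_aversion=100.0)
best = "safe" if safe_score > risky_score else "risky"
```

With a large `ruin_aversion`, the arm with the higher average reward loses to the safe arm because many of its resampled trajectories deplete the budget.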

replace An Effective Theory of Bias Amplification

Authors: Arjun Subramonian, Sam Bell, Levent Sagun, Elvis Dohmatob

Abstract: Machine learning models may capture and amplify biases present in data, leading to disparate test performance across social groups. To better understand, evaluate, and mitigate these possible biases, a deeper theoretical understanding of how model design choices and data distribution properties could contribute to bias is needed. In this work, we contribute a precise analytical theory in the context of ridge regression, both with and without random projections, where the former models neural networks in a simplified regime. Our theory offers a unified and rigorous explanation of machine learning bias, providing insights into phenomena such as bias amplification and minority-group bias in various feature and parameter regimes. For example, we demonstrate that there may be an optimal regularization penalty or training time to avoid bias amplification, and there can be fundamental differences in test error between groups that do not vanish with increased parameterization. Importantly, our theoretical predictions align with several empirical observations reported in the literature. We extensively empirically validate our theory on diverse synthetic and semi-synthetic datasets.

replace Beyond position: how rotary embeddings shape representations and memory in autoregressive transformers

Authors: Valeria Ruscio, Fabrizio Silvestri

Abstract: Rotary Positional Embeddings (RoPE) enhance positional encoding in Transformer models, yet their full impact on model dynamics remains underexplored. This paper studies how RoPE introduces position-dependent rotations, causing phase shifts in token embeddings that influence higher-frequency components within the model's internal representations. Through spectral analysis, we demonstrate that RoPE's rotation matrices induce oscillatory behaviors in embeddings, affecting information retention across layers and shaping temporal modeling capabilities. We show that activation functions in feed-forward networks interact with RoPE-modulated embeddings to generate harmonics, leading to constructive or destructive interference based on phase alignment. Our findings reveal that phase alignment amplifies activations and sharpens attention, while misalignment weakens activations and disrupts focus on positional patterns. This study underscores the importance of frequency components as intrinsic elements of model behavior, offering new insights beyond traditional analyses.
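
Illustrative sketch (not taken from the paper): the position-dependent rotations described above can be seen in a few lines using the commonly used RoPE form, where consecutive component pairs are rotated by an angle proportional to the token position. The helper name `rope` and the base of 10000 follow the usual convention and are assumptions here. The check at the end shows that the query-key dot product depends only on the relative offset between positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE acts on pairs of components"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

Each pair is rotated, so vector norms are unchanged; only the phase of each pair shifts with position.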

replace Fast Inference for Augmented Large Language Models

Authors: Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher

Abstract: Augmented Large Language Models (LLMs) enhance the capabilities of standalone LLMs by integrating external data sources through API calls. In interactive LLM applications, efficient scheduling is crucial for maintaining low request completion times, directly impacting user engagement. However, these augmentations introduce scheduling challenges due to the need to manage limited memory for cached information (KV caches). As a result, traditional size-based scheduling algorithms, such as Shortest Job First (SJF), become less effective at minimizing completion times. Existing work focuses only on handling requests during API calls by preserving, discarding, or swapping memory without considering how to schedule requests with API calls. In this paper, we propose LAMPS, a novel LLM inference framework for augmented LLMs. LAMPS minimizes request completion time through a unified scheduling approach that considers the total length of requests and their handling strategies during API calls. Recognizing that LLM inference is memory-bound, our approach ranks requests based on their consumption of memory over time, which depends on both the output sizes and how a request is managed during its API calls. To implement our scheduling, LAMPS predicts the strategy that minimizes memory waste of a request during its API calls, aligning with but improving upon existing approaches. We also propose starvation prevention techniques and optimizations to mitigate the overhead of our scheduling. We implement LAMPS on top of vLLM and evaluate its performance against baseline LLM inference systems, demonstrating improvements in end-to-end latency by 27%-85% and reductions in TTFT by 4%-96% compared to the existing augmented-LLM system, with even greater gains over vLLM.

replace What If the Input is Expanded in OOD Detection?

Authors: Boxuan Zhang, Jianing Zhu, Zengmao Wang, Tongliang Liu, Bo Du, Bo Han

Abstract: Out-of-distribution (OOD) detection aims to identify OOD inputs from unknown classes, which is important for the reliable deployment of machine learning models in the open world. Various scoring functions have been proposed to distinguish them from in-distribution (ID) data. However, existing methods generally focus on excavating the discriminative information from a single input, which implicitly limits its representation dimension. In this work, we introduce a novel perspective, i.e., employing different common corruptions on the input space, to expand it. We reveal an interesting phenomenon termed confidence mutation, where the confidence of OOD data can decrease significantly under the corruptions, while the ID data shows a higher confidence expectation considering the resistance of semantic features. Based on that, we formalize a new scoring method, namely, Confidence aVerage (CoVer), which can capture the dynamic differences by simply averaging the scores obtained from different corrupted inputs and the original ones, making the OOD and ID distributions more separable in detection tasks. Extensive experiments and analyses have been conducted to understand and verify the effectiveness of CoVer. The code is publicly available at: https://github.com/tmlr-group/CoVer.

URLs: https://github.com/tmlr-group/CoVer
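
Illustrative sketch (not the released CoVer code): the averaging step described in the abstract can be pictured as taking a confidence score on the original input and on each corrupted copy, then using the mean as the detection score. The toy linear `toy_model`, the single contrast-style corruption, the maximum softmax probability as the base score, and the sample inputs are all assumptions chosen to keep the example self-contained.

```python
import math

def max_softmax(logits):
    """Maximum softmax probability, a common confidence score."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

def averaged_confidence(model, x, corruptions):
    """Mean confidence over the original input and its corrupted views."""
    views = [x] + [corrupt(x) for corrupt in corruptions]
    return sum(max_softmax(model(v)) for v in views) / len(views)

def toy_model(x):
    # Hypothetical two-class "network": logits scale with the input features.
    return [3.0 * x[0], 3.0 * x[1]]

def lower_contrast(x):
    # Hypothetical corruption: shrink the feature magnitudes.
    return [0.5 * v for v in x]

id_like = [4.0, 0.0]    # strong class evidence that survives the corruption
ood_like = [0.6, 0.0]   # weak evidence whose confidence drops when corrupted

id_score = averaged_confidence(toy_model, id_like, [lower_contrast])
ood_score = averaged_confidence(toy_model, ood_like, [lower_contrast])
```

A threshold on the averaged score would then separate the two inputs; in practice the threshold would be chosen on held-out ID data.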

replace ArterialNet: Reconstructing Arterial Blood Pressure Waveform with Wearable Pulsatile Signals, a Cohort-Aware Approach

Authors: Sicong Huang, Roozbeh Jafari, Bobak J. Mortazavi

Abstract: Continuous arterial blood pressure (ABP) monitoring is invasive but essential for hemodynamic monitoring. Recent techniques have reconstructed ABP non-invasively using pulsatile signals but produced inaccurate systolic and diastolic blood pressure (SBP and DBP) values and were sensitive to individual variability. ArterialNet integrates generalized pulsatile-to-ABP signal translation and personalized feature extraction using hybrid loss functions and regularization. We validated ArterialNet using the MIMIC-III dataset and achieved a root mean square error (RMSE) of 5.41 mmHg, with at least a 58% lower standard deviation. ArterialNet reconstructed ABP with an RMSE of 7.99 mmHg in remote health scenarios. ArterialNet achieved superior performance in ABP reconstruction and SBP and DBP estimations, with significantly reduced subject variance, demonstrating its potential in remote health settings. We also ablated ArterialNet architecture to investigate the contributions of each component and evaluated its translational impact and robustness by conducting a series of ablations on data quality and availability.

replace Multi-Agent Reinforcement Learning with Selective State-Space Models

Authors: Jemma Daniel, Ruan de Kock, Louay Ben Nessir, Sasha Abramowitz, Omayma Mahjoub, Wiem Khlifi, Claude Formanek, Arnu Pretorius

Abstract: The Transformer model has demonstrated success across a wide range of domains, including in Multi-Agent Reinforcement Learning (MARL) where the Multi-Agent Transformer (MAT) has emerged as a leading algorithm in the field. However, a significant drawback of Transformer models is their quadratic computational complexity relative to input size, making them computationally expensive when scaling to larger inputs. This limitation restricts MAT's scalability in environments with many agents. Recently, State-Space Models (SSMs) have gained attention due to their computational efficiency, but their application in MARL remains unexplored. In this work, we investigate the use of Mamba, a recent SSM, in MARL and assess whether it can match the performance of MAT while providing significant improvements in efficiency. We introduce a modified version of MAT that incorporates standard and bi-directional Mamba blocks, as well as a novel "cross-attention" Mamba block. Extensive testing shows that our Multi-Agent Mamba (MAM) matches the performance of MAT across multiple standard multi-agent environments, while offering superior scalability to larger agent scenarios. This is significant for the MARL community, because it indicates that SSMs could replace Transformers without compromising performance, whilst also supporting more effective scaling to higher numbers of agents. Our project page is available at https://sites.google.com/view/multi-agent-mamba .

URLs: https://sites.google.com/view/multi-agent-mamba

replace Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression

Authors: Yixiu Mao, Qi Wang, Chen Chen, Yun Qu, Xiangyang Ji

Abstract: In offline reinforcement learning (RL), addressing the out-of-distribution (OOD) action issue has been a focus, but we argue that there exists an OOD state issue that also impairs performance yet has been underexplored. Such an issue describes the scenario when the agent encounters states out of the offline dataset during the test phase, leading to uncontrolled behavior and performance degradation. To this end, we propose SCAS, a simple yet effective approach that unifies OOD state correction and OOD action suppression in offline RL. Technically, SCAS achieves value-aware OOD state correction, capable of correcting the agent from OOD states to high-value in-distribution states. Theoretical and empirical results show that SCAS also exhibits the effect of suppressing OOD actions. On standard offline RL benchmarks, SCAS achieves excellent performance without additional hyperparameter tuning. Moreover, benefiting from its OOD state correction feature, SCAS demonstrates enhanced robustness against environmental perturbations.

replace LOCAL: Learning with Orientation Matrix to Infer Causal Structure from Time Series Data

Authors: Yue Cheng, Jiajun Zhang, Weiwei Xing, Xiaoyu Guo, Xiaohui Gao

Abstract: Discovering the underlying Directed Acyclic Graph (DAG) from time series observational data is highly challenging due to the dynamic nature and complex nonlinear interactions between variables. Existing methods often struggle with inefficiency and the handling of high-dimensional data. To address these research gaps, we propose LOCAL, a highly efficient, easy-to-implement, and constraint-free method for recovering dynamic causal structures. LOCAL is the first attempt to formulate a quasi-maximum likelihood-based score function for learning the dynamic DAG equivalent to the ground truth. On this basis, we propose two adaptive modules for enhancing the algebraic characterization of acyclicity with new capabilities: Asymptotic Causal Mask Learning (ACML) and Dynamic Graph Parameter Learning (DGPL). ACML generates causal masks using learnable priority vectors and the Gumbel-Sigmoid function, ensuring the creation of DAGs while optimizing computational efficiency. DGPL transforms causal learning into decomposed matrix products, capturing the dynamic causal structure of high-dimensional data and enhancing interpretability. Extensive experiments on synthetic and real-world datasets demonstrate that LOCAL significantly outperforms existing methods, and highlight LOCAL's potential as a robust and efficient method for dynamic causal discovery. Our code will be available soon.

replace FLiP: Privacy-Preserving Federated Learning based on the Principle of Least Privilege

Authors: ShiMao Xu, Xiaopeng Ke, Xing Su, Shucheng Li, Hao Wu, Sheng Zhong, Fengyuan Xu

Abstract: Federated Learning (FL) allows users to share knowledge instead of raw data to train a model with high accuracy. Unfortunately, during the training, users lose control over the knowledge shared, which causes serious data privacy issues. We hold that users are willing, and need, to share only the knowledge essential to the training task in order to obtain an FL model with high accuracy. However, existing efforts cannot help users minimize the shared knowledge according to the user's intention in the FL training procedure. This work proposes FLiP, which aims to bring the principle of least privilege (PoLP) to FL training. The key design of FLiP is applying elaborate information reduction on the training data through a local-global dataset distillation design. We measure the privacy performance through attribute inference and membership inference attacks. Extensive experiments show that FLiP strikes a good balance between model accuracy and privacy protection.

replace Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites

Authors: Nicolás Nieto, Simon B. Eickhoff, Christian Jung, Martin Reuter, Kersten Diers, Malte Kelm, Artur Lichtenberg, Federico Raimondo, Kaustubh R. Patil

Abstract: Machine learning (ML) models benefit from large datasets. Collecting data in biomedical domains is costly and challenging; hence, combining datasets has become a common practice. However, datasets obtained under different conditions could present undesired site-specific variability. Data harmonization methods aim to remove site-specific variance while retaining biologically relevant information. This study evaluates the effectiveness of popularly used ComBat-based methods for harmonizing data in scenarios where the class balance is not equal across sites. We find that these methods struggle with data leakage issues. To overcome this problem, we propose a novel approach, PrettYharmonize, designed to harmonize data by pretending the target labels. We validate our approach using controlled datasets designed to benchmark the utility of harmonization. Finally, using real-world MRI and clinical data, we compare leakage-prone methods with PrettYharmonize and show that it achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.

replace Conformal Prediction for Multimodal Regression

Authors: Alexis Bose, Jonathan Ethier, Paul Guinand

Abstract: This paper introduces multimodal conformal regression. Traditionally confined to scenarios with solely numerical input features, conformal prediction is now extended to multimodal contexts through our methodology, which harnesses internal features from complex neural network architectures processing images and unstructured text. Our findings highlight the potential for internal neural network features, extracted from convergence points where multimodal information is combined, to be used by conformal prediction to construct prediction intervals (PIs). This capability paves new paths for deploying conformal prediction in domains abundant with multimodal data, enabling a broader range of problems to benefit from guaranteed distribution-free uncertainty quantification.
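
Illustrative sketch (not the authors' pipeline): the prediction intervals mentioned above can be built with standard split conformal regression, which calibrates a single interval half-width from absolute residuals on held-out data. The stand-in predictor `predict`, the calibration pairs, and the miscoverage level are hypothetical; in the paper's setting the predictor would operate on internal multimodal network features.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of absolute residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[rank - 1]

def predict(x):
    # Hypothetical stand-in for a trained regressor.
    return 2.0 * x

# Held-out calibration pairs (x, y), made up for the example.
calibration = [(1.0, 2.3), (2.0, 3.6), (3.0, 6.4), (4.0, 7.9), (5.0, 10.5),
               (6.0, 11.8), (7.0, 14.1), (8.0, 16.6), (9.0, 17.5)]
residuals = [abs(y - predict(x)) for x, y in calibration]
q_hat = conformal_quantile(residuals, alpha=0.25)

# Symmetric interval around the point prediction for a new input.
x_new = 10.0
interval = (predict(x_new) - q_hat, predict(x_new) + q_hat)
```

Under exchangeability of calibration and test points, an interval built this way covers the true value with probability at least 1 - alpha, without assumptions on the data distribution.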

replace-cross A Model for Intelligible Interaction Between Agents That Predict and Explain

Authors: A. Baskar, Ashwin Srinivasan, Michael Bain, Enrico Coiera

Abstract: Machine Learning (ML) has emerged as a powerful form of data modelling with widespread applicability beyond its roots in the design of autonomous agents. However, relatively little attention has been paid to the interaction between people and ML systems. In this paper we view interaction between humans and ML systems within the broader context of communication between agents capable of prediction and explanation. We formalise the interaction model by taking agents to be automata with some special characteristics and define a protocol for communication between such agents. We define One- and Two-Way Intelligibility as properties that emerge at run-time by execution of the protocol. The formalisation allows us to identify conditions under which run-time sequences are bounded, and identify conditions under which the protocol can correctly implement an axiomatic specification of intelligible interaction between a human and an ML system. We also demonstrate using the formal model to: (a) identify instances of One- and Two-Way Intelligibility in literature reports on humans interacting with ML systems providing logic-based explanations, as is done in Inductive Logic Programming (ILP); and (b) map interactions between humans and machines in an elaborate natural-language based dialogue-model to One- or Two-Way Intelligible interactions in the formal model.

replace-cross A first-order augmented Lagrangian method for constrained minimax optimization

Authors: Zhaosong Lu, Sanyou Mei

Abstract: In this paper we study a class of constrained minimax problems. In particular, we propose a first-order augmented Lagrangian method for solving them, whose subproblems turn out to be much simpler structured minimax problems that are suitably solved by a first-order method developed in this paper. Under some suitable assumptions, an operation complexity of $O(\varepsilon^{-4}\log\varepsilon^{-1})$, measured by its fundamental operations, is established for the first-order augmented Lagrangian method for finding an $\varepsilon$-KKT solution of the constrained minimax problems.

replace-cross Anchored Learning for On-the-Fly Adaptation -- Extended Technical Report

Authors: Bassel El Mabsout, Shahin Roozkhosh, Siddharth Mysore, Kate Saenko, Renato Mancuso

Abstract: This study presents "anchor critics", a novel strategy for enhancing the robustness of reinforcement learning (RL) agents in crossing the sim-to-real gap. While RL agents can be successfully trained in simulation, they often encounter difficulties such as unpredictability, inefficient power consumption, and operational failures when deployed in real-world scenarios. We identify that naive fine-tuning approaches lead to catastrophic forgetting, where policies maintain high rewards on frequently encountered states but lose performance on rarer, yet critical scenarios. Our method maximizes multiple Q-values across domains, ensuring high performance in both simulation and reality. Evaluations demonstrate that our approach enables behavior retention in sim-to-sim gymnasium tasks and in sim-to-real scenarios with racing quadrotors, achieving a near-50% reduction in power consumption while maintaining controllable, stable flight. We also contribute SwannFlight, an open-source firmware for testing adaptation techniques on real robots.

replace-cross Theoretical guarantees for neural control variates in MCMC

Authors: Denis Belomestny, Artur Goldman, Alexey Naumov, Sergey Samsonov

Abstract: In this paper, we propose a variance reduction approach for Markov chains based on additive control variates and the minimization of an appropriate estimate for the asymptotic variance. We focus on the particular case when control variates are represented as deep neural networks. We derive the optimal convergence rate of the asymptotic variance under various ergodicity assumptions on the underlying Markov chain. The proposed approach relies upon recent results on the stochastic errors of variance reduction algorithms and function approximation theory.

replace-cross Decoupled Kullback-Leibler Divergence Loss

Authors: Jiequan Cui, Zhuotao Tian, Zhisheng Zhong, Xiaojuan Qi, Bei Yu, Hanwang Zhang

Abstract: In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of 1) a weighted Mean Square Error (wMSE) loss and 2) a Cross-Entropy loss incorporating soft labels. Thanks to the decomposed formulation of DKL loss, we have identified two areas for improvement. Firstly, we address the limitation of KL/DKL in scenarios like knowledge distillation by breaking its asymmetric optimization property. This modification ensures that the wMSE component is always effective during training, providing extra constructive cues. Secondly, we introduce class-wise global information into KL/DKL to mitigate bias from individual samples. With these two enhancements, we derive the Improved Kullback-Leibler (IKL) Divergence loss and evaluate its effectiveness by conducting experiments on CIFAR-10/100 and ImageNet datasets, focusing on adversarial training and knowledge distillation tasks. The proposed approach achieves new state-of-the-art adversarial robustness on the public RobustBench leaderboard and competitive performance on knowledge distillation, demonstrating the substantial practical merits. Our code is available at https://github.com/jiequancui/DKL.

URLs: https://github.com/jiequancui/DKL

replace-cross Improving Neural Additive Models with Bayesian Principles

Authors: Kouroche Bouchiat, Alexander Immer, Hugo Yèche, Gunnar Rätsch, Vincent Fortuin

Abstract: Neural additive models (NAMs) enhance the transparency of deep neural networks by handling input features in separate additive sub-networks. However, they lack inherent mechanisms that provide calibrated uncertainties and enable selection of relevant features and interactions. Approaching NAMs from a Bayesian perspective, we augment them in three primary ways, namely by a) providing credible intervals for the individual additive sub-networks; b) estimating the marginal likelihood to perform an implicit selection of features via an empirical Bayes procedure; and c) facilitating the ranking of feature pairs as candidates for second-order interaction in fine-tuned models. In particular, we develop Laplace-approximated NAMs (LA-NAMs), which show improved empirical performance on tabular datasets and challenging real-world medical tasks.

replace-cross Nonconvex Stochastic Bregman Proximal Gradient Method for Nonconvex Composite Problems

Authors: Kuangyu Ding, Jingyang Li, Kim-Chuan Toh

Abstract: Stochastic gradient methods for minimizing nonconvex composite objective functions typically rely on the Lipschitz smoothness of the differentiable part, but this assumption fails in many important problem classes, leading to instability of the algorithms in both theory and practice. To address this, we propose a family of stochastic Bregman proximal gradient (SBPG) methods that only require smooth adaptivity. SBPG replaces the quadratic approximation in SGD with a Bregman proximity measure, offering a better approximation model that handles non-Lipschitz gradients in nonconvex objectives. We establish the convergence properties of vanilla SBPG and show it achieves optimal sample complexity in the nonconvex setting. Experimental results on quadratic inverse problems demonstrate SBPG's robustness in terms of stepsize selection and sensitivity to the initial point. Furthermore, we introduce a momentum-based variant, MSBPG, which enhances convergence by relaxing the mini-batch size requirement while preserving the optimal oracle complexity. We apply a polynomial kernel function based MSBPG to the loss function with polynomial growth. Experimental results on benchmark datasets confirm the effectiveness and robustness of MSBPG. Given its negligible additional computational cost compared to SGD in large-scale optimization, MSBPG shows promise as a universal optimizer for future applications.
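
Illustrative sketch (not the authors' SBPG or MSBPG code): a single deterministic Bregman gradient step replaces the Euclidean update of gradient descent with the mirror-style condition grad_h(x_new) = grad_h(x) - step * grad. The quartic kernel h(x) = 0.5*||x||^2 + 0.25*||x||^4, the toy objective, and the step size are assumptions for illustration; a stochastic variant would substitute a mini-batch gradient estimate for `grad`.

```python
def grad_h(x):
    """Gradient of the kernel h(x) = 0.5*||x||^2 + 0.25*||x||^4."""
    sq = sum(v * v for v in x)
    return [(1.0 + sq) * v for v in x]

def bregman_step(x, grad, step):
    """Return x_new solving grad_h(x_new) = grad_h(x) - step * grad."""
    p = [a - step * g for a, g in zip(grad_h(x), grad)]
    p_sq = sum(v * v for v in p)
    # x_new = t * p, where t in (0, 1] is the unique root of
    # p_sq * t^3 + t - 1 = 0; the left side is increasing in t, so bisect.
    lo, hi = 0.0, 1.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if p_sq * mid ** 3 + mid - 1.0 > 0.0:
            hi = mid
        else:
            lo = mid
    t = 0.5 * (lo + hi)
    return [t * v for v in p]

def quartic_grad(x):
    """Gradient of f(x) = 0.25 * (||x||^2 - 1)^2, which is not globally Lipschitz."""
    sq = sum(v * v for v in x)
    return [(sq - 1.0) * v for v in x]

# Minimize the toy quartic objective starting far from its minimizers.
x = [3.0]
for _ in range(200):
    x = bregman_step(x, quartic_grad(x), step=0.5)

# One step from x = 3 with gradient 24, kept for inspection.
one_step = bregman_step([3.0], [24.0], 0.5)
```

Because the kernel grows like the fourth power of the norm, the update automatically shortens steps where the gradient of the quartic objective is large, which is the role the Bregman proximity measure plays in place of the quadratic term of SGD.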

replace-cross Virtual imaging trials improved the transparency and reliability of AI systems in COVID-19 imaging

Authors: Fakrul Islam Tushar, Lavsen Dahal, Saman Sotoudeh-Paima, Ehsan Abadi, W. Paul Segars, Ehsan Samei, Joseph Y. Lo

Abstract: The credibility of Artificial Intelligence (AI) models in medical imaging, particularly during the COVID-19 pandemic, has been challenged by reproducibility issues and obscured clinical insights. To address these concerns, we propose a Virtual Imaging Trials (VIT) framework, utilizing both clinical and simulated datasets to evaluate AI systems. This study focuses on using convolutional neural networks (CNNs) for COVID-19 diagnosis using computed tomography (CT) and chest radiography (CXR). We developed and tested multiple AI models, 3D ResNet-like and 2D EfficientNetv2 architectures, across diverse datasets. Our evaluation metrics included the area under the curve (AUC). Statistical analyses, such as the DeLong method for AUC confidence intervals, were employed to assess performance differences. Our findings demonstrate that VIT provides a robust platform for objective assessment, revealing significant influences of dataset characteristics, patient factors, and imaging physics on AI efficacy. Notably, models trained on the most diverse datasets showed the highest external testing performance, with AUC values ranging from 0.73 to 0.76 for CT and 0.70 to 0.73 for CXR. Internal testing yielded higher AUC values (0.77 to 0.85 for CT and 0.77 to 1.0 for CXR), highlighting a substantial drop in performance during external validation, which underscores the importance of diverse and comprehensive training and testing data. This approach enhances model transparency and reliability, offering nuanced insights into the factors driving AI performance and bridging the gap between experimental and clinical settings. The study underscores the potential of VIT to improve the reproducibility and clinical relevance of AI systems in medical imaging.
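
For readers unfamiliar with the AUC metric reported in this abstract, the following minimal sketch computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. It is a generic illustration and not the authors' evaluation code; the DeLong confidence intervals are not shown, and the function name is hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts the reported gap between internal and external testing in context.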

replace-cross Accelerating Nash Equilibrium Convergence in Monte Carlo Settings Through Counterfactual Value Based Fictitious Play

Authors: Ju Qi, Falin Hei, Ting Feng, Dengbing Yi, Zhemei Fang, Yunfeng Luo

Abstract: Counterfactual Regret Minimization (CFR) and its variants are widely recognized as effective algorithms for solving extensive-form imperfect information games. Recently, many improvements have been focused on enhancing the convergence speed of the CFR algorithm. However, most of these variants are not applicable under Monte Carlo (MC) conditions, making them unsuitable for training in large-scale games. We introduce a new MC-based algorithm for solving extensive-form imperfect information games, called MCCFVFP (Monte Carlo Counterfactual Value-Based Fictitious Play). MCCFVFP combines CFR's counterfactual value calculations with fictitious play's best response strategy, leveraging the strengths of fictitious play to gain significant advantages in games with a high proportion of dominated strategies. Experimental results show that MCCFVFP achieved convergence speeds approximately 20\%$\sim$50\% faster than the most advanced MCCFR variants in games like poker and other test games.

replace-cross A General Framework for Verification and Control of Dynamical Models via Certificate Synthesis

Authors: Alec Edwards, Andrea Peruffo, Alessandro Abate

Abstract: An emerging branch of control theory specialises in certificate learning, concerning the specification of a desired (possibly complex) system behaviour for an autonomous or control model, which is then analytically verified by means of a function-based proof. However, the synthesis of controllers abiding by these complex requirements is in general a non-trivial task and may elude the most expert control engineers. This results in a need for automatic techniques that are able to design controllers and to analyse a wide range of elaborate specifications. In this paper, we provide a general framework to encode system specifications and define corresponding certificates, and we present an automated approach to formally synthesise controllers and certificates. Our approach contributes to the broad field of safe learning for control, exploiting the flexibility of neural networks to provide candidate control and certificate functions, whilst using SMT-solvers to offer a formal guarantee of correctness. We test our framework by developing a prototype software tool, and assess its efficacy at verification via control and certificate synthesis over a large and varied suite of benchmarks.

replace-cross Multi-Swap $k$-Means++

Authors: Lorenzo Beretta, Vincent Cohen-Addad, Silvio Lattanzi, Nikos Parotsidis

Abstract: The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps obtained through the $k$-means++ sampling distribution to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods, thereby allowing multiple centers to be swapped at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly, we show that our approach yields substantial practical improvements: we observe significant quality improvements over the approach of Lattanzi and Sohler (ICML 2019) on several datasets.
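
The $k$-means++ sampling distribution that this abstract builds on can be sketched as follows. This is only the classic seeding step, in which each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far; the local search and multi-swap steps of the paper are not shown. Function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, rng):
    """Pick k initial centers by D^2 sampling (k-means++ seeding)."""
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to a uniform choice.
            centers.append(rng.choice(points))
        else:
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        d2 = [min(d, sq_dist(p, centers[-1])) for d, p in zip(d2, points)]
    return centers


def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

A local search step in the style described above would draw a candidate from the same distribution and swap it in for an existing center whenever `kmeans_cost` decreases.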

replace-cross On Linear Convergence of PI Consensus Algorithm under the Restricted Secant Inequality

Authors: Kushal Chakrabarti, Mayank Baranwal

Abstract: This paper considers solving distributed optimization problems in peer-to-peer multi-agent networks. The network is synchronous and connected. By using the proportional-integral (PI) control strategy, various algorithms with fixed stepsize have been developed. Two notable among them are the PI algorithm and the PI consensus algorithm. Although the PI algorithm has provable linear or exponential convergence without the standard requirement of (strong) convexity, a similar guarantee for the PI consensus algorithm is unavailable. In this paper, using Lyapunov theory, we guarantee exponential convergence of the PI consensus algorithm for global cost functions that satisfy the restricted secant inequality, with rate-matching discretization, without requiring convexity. To accelerate the PI consensus algorithm, we incorporate local pre-conditioning in the form of constant positive definite matrices and numerically validate its efficiency compared to the prominent distributed convex optimization algorithms. Unlike classical pre-conditioning, where only the gradients are multiplied by a pre-conditioner, the proposed pre-conditioning modifies both the gradients and the consensus terms, thereby controlling the effect of the communication graph on the algorithm.

replace-cross PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners

Authors: Yijia Xiao, Yiqiao Jin, Yushi Bai, Yue Wu, Xianjun Yang, Xiao Luo, Wenchao Yu, Xujiang Zhao, Yanchi Liu, Quanquan Gu, Haifeng Chen, Wei Wang, Wei Cheng

Abstract: The proliferation of Large Language Models (LLMs) has driven considerable interest in fine-tuning them with domain-specific data to create specialized language models. Nevertheless, such domain-specific fine-tuning data often contains contextually sensitive personally identifiable information (PII). Direct fine-tuning of LLMs on this data without privacy protection poses a risk of data leakage of sensitive PII during inference time. To address this challenge, we introduce Contextual Privacy Protection Language Models (PrivacyMind), a novel paradigm for fine-tuning LLMs that effectively injects domain-specific knowledge while safeguarding inference-time data privacy. Our work offers a theoretical analysis for model design and benchmarks various techniques such as corpus curation, penalty-based unlikelihood in training loss, instruction-based tuning, etc. Extensive experiments across diverse datasets and scenarios demonstrate the effectiveness of our approaches. In particular, instruction tuning with both positive and negative examples stands out as a promising method, effectively protecting private data while enhancing the model's knowledge. Our work underscores the potential for Large Language Models as robust contextual privacy protection learners. The complete code and data for the work can be found at https://github.com/Yijia-Xiao/PrivacyMind.

URLs: https://github.com/Yijia-Xiao/PrivacyMind.

replace-cross Aligning Text-to-Image Diffusion Models with Reward Backpropagation

Authors: Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki

Abstract: Text-to-image diffusion models have recently emerged at the forefront of image generation, powered by very large-scale unsupervised or weakly supervised text-to-image training datasets. Due to their unsupervised training, controlling their behavior in downstream tasks, such as maximizing human-perceived image quality, image-text alignment, or ethical image generation, is difficult. Recent works finetune diffusion models to downstream reward functions using vanilla reinforcement learning, notorious for the high variance of the gradient estimators. In this paper, we propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient through the denoising process. While naive implementation of such backpropagation would require prohibitive memory resources for storing the partial derivatives of modern text-to-image models, AlignProp finetunes low-rank adapter weight modules and uses gradient checkpointing, to render its memory usage viable. We test AlignProp in finetuning diffusion models to various objectives, such as image-text semantic alignment, aesthetics, compressibility and controllability of the number of objects present, as well as their combinations. We show AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler, making it a straightforward choice for optimizing diffusion models for differentiable reward functions of interest. Code and Visualization results are available at https://align-prop.github.io/.

URLs: https://align-prop.github.io/.

replace-cross Nebula: Self-Attention for Dynamic Malware Analysis

Authors: Dmitrijs Trizna, Luca Demetrio, Battista Biggio, Fabio Roli

Abstract: Dynamic analysis enables detecting Windows malware by executing programs in a controlled environment and logging their actions. Previous work has proposed training machine learning models, i.e., convolutional and long short-term memory networks, on homogeneous input features like runtime APIs to either detect or classify malware, neglecting other relevant information coming from heterogeneous data like network and file operations. To overcome these issues, we introduce Nebula, a versatile, self-attention Transformer-based neural architecture that generalizes across different behavioral representations and formats, combining diverse information from dynamic log reports. Nebula is composed of several components needed to tokenize, filter, normalize and encode data to feed the transformer architecture. We first perform a comprehensive ablation study to evaluate their impact on the performance of the whole system, highlighting which components can be used as-is, and which must be enriched with specific domain knowledge. We then perform extensive experiments on both malware detection and classification tasks, using three datasets acquired from different dynamic analysis platforms, showing that, on average, Nebula outperforms state-of-the-art models at low false positive rates, with a peak of 12% improvement. Moreover, we showcase how self-supervised learning pre-training matches the performance of fully-supervised models with only 20% of training data, and we inspect the output of Nebula through explainable AI techniques, pinpointing how attention is focusing on specific tokens correlated to malicious activities of malware families. To foster reproducibility, we open-source our findings and models at https://github.com/dtrizna/nebula.

URLs: https://github.com/dtrizna/nebula.

replace-cross Double Debiased Covariate Shift Adaptation Robust to Density-Ratio Estimation

Authors: Masahiro Kato, Kota Matsui, Ryo Inokuchi

Abstract: Consider a scenario where we have access to train data with both covariates and outcomes while test data only contains covariates. In this scenario, our primary aim is to predict the missing outcomes of the test data. With this objective in mind, we train parametric regression models under a covariate shift, where covariate distributions are different between the train and test data. For this problem, existing studies have proposed covariate shift adaptation via importance weighting using the density ratio. This approach averages the train data losses, each weighted by an estimated ratio of the covariate densities between the train and test data, to approximate the test-data risk. Although it allows us to obtain a test-data risk minimizer, its performance heavily relies on the accuracy of the density ratio estimation. Moreover, even if the density ratio can be consistently estimated, the estimation errors of the density ratio also yield bias in the estimators of the regression model's parameters of interest. To mitigate these challenges, we introduce a doubly robust estimator for covariate shift adaptation via importance weighting, which incorporates an additional estimator for the regression function. Leveraging double machine learning techniques, our estimator reduces the bias arising from the density ratio estimation errors. We demonstrate the asymptotic distribution of the regression parameter estimator. Notably, our estimator remains consistent if either the density ratio estimator or the regression function is consistent, showcasing its robustness against potential errors in density ratio estimation. Finally, we confirm the soundness of our proposed method via simulation studies.
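
The importance-weighting baseline described in this abstract can be sketched in a few lines. The code averages per-example training losses, each weighted by a density ratio r(x) = p_test(x) / p_train(x), to approximate the test-data risk. The density ratios are assumed to be given; estimating them, and the doubly robust correction the authors propose, are not shown. The function name is hypothetical.

```python
def importance_weighted_risk(train_losses, density_ratios):
    """Approximate the test risk as the mean of r(x_i) * loss_i over training examples."""
    if len(train_losses) != len(density_ratios):
        raise ValueError("one density ratio per training loss is required")
    if not train_losses:
        raise ValueError("at least one training example is required")
    weighted = [loss * r for loss, r in zip(train_losses, density_ratios)]
    return sum(weighted) / len(weighted)
```

When every ratio equals 1 the estimate reduces to the ordinary training risk. Errors in the ratios propagate directly into the estimate, which is the sensitivity the abstract sets out to reduce.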

replace-cross Audio-Visual Instance Segmentation

Authors: Ruohao Guo, Xianghua Ying, Yaru Chen, Dantong Niu, Guangyao Li, Liao Qu, Yanyu Qi, Bowei Xing, Wenzhen Yue, Ji Shi, Qixun Wang, Peiliang Zhang, Buwen Liang

Abstract: In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment and track individual sounding object instances in audible videos. To facilitate this research, we introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos. Additionally, we propose a strong baseline model for this task. Our model first localizes the sound source within each frame and condenses object-specific contexts into concise tokens. It then builds long-range audio-visual dependencies between these tokens using window-based attention and tracks sounding objects across entire video sequences. Extensive experiments reveal that our method performs best on AVISeg, surpassing the existing methods from related tasks. We further conduct the evaluation on several multi-modal large models; however, they exhibit subpar performance on instance-level sound source localization and temporal perception. We expect that AVIS will inspire the community towards a more comprehensive multi-modal understanding.

replace-cross Immunohistochemistry guided segmentation of benign epithelial cells, in situ lesions, and invasive epithelial cells in breast cancer slides

Authors: Maren H{\o}ib{\o}, Andr\'e Pedersen, Vibeke Grotnes Dale, Sissel Marie Berget, Borgny Ytterhus, Cecilia Lindskog, Elisabeth Wik, Lars A. Akslen, Ingerid Reinertsen, Erik Smistad, Marit Valla

Abstract: Digital pathology enables automatic analysis of histopathological sections using artificial intelligence (AI). Automatic evaluation could improve diagnostic efficiency and help find associations between morphological features and clinical outcome. For development of such prediction models, identifying invasive epithelial cells, and separating these from benign epithelial cells and in situ lesions would be the first step. In this study, we aimed to develop an AI model for segmentation of epithelial cells in sections from breast cancer. We generated epithelial ground truth masks by restaining hematoxylin and eosin (HE) sections with cytokeratin (CK) AE1/AE3, and by pathologists' annotations. HE/CK image pairs were used to train a convolutional neural network, and data augmentation was used to make the model more robust. Tissue microarrays (TMAs) from 839 patients, and whole slide images from two patients were used for training and evaluation of the models. The sections were derived from four cohorts of breast cancer patients. TMAs from 21 patients from a fifth cohort were used as a second test set. In quantitative evaluation, mean Dice scores of 0.70, 0.79, and 0.75 were achieved for invasive epithelial cells, benign epithelial cells, and in situ lesions, respectively. In qualitative scoring (0-5) by pathologists, results were best for all epithelium and invasive epithelium, with scores of 4.7 and 4.4. Scores for benign epithelium and in situ lesions were 3.7 and 2.0. The proposed model segmented epithelial cells in HE stained breast cancer slides well, but further work is needed for accurate division between the classes. Immunohistochemistry, together with pathologists' annotations, enabled the creation of accurate ground truths. The model is made freely available in FastPathology and the code is available at https://github.com/AICAN-Research/breast-epithelium-segmentation

URLs: https://github.com/AICAN-Research/breast-epithelium-segmentation
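
The Dice score used in the quantitative evaluation above can be sketched for flat binary masks as follows. This is a generic illustration rather than the authors' evaluation code; returning 1.0 when both masks are empty is an assumed convention, and the function name is hypothetical.

```python
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total
```

For a multi-class segmentation such as the one described, the score would be computed once per class on that class's binary mask and then averaged over images.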

replace-cross The HR-Calculus: Enabling Information Processing with Quaternion Algebra

Authors: Danilo P. Mandic, Sayed Pouria Talebi, Clive Cheong Took, Yili Xia, Dongpo Xu, Min Xiang, Pauline Bourigault

Abstract: From their inception, quaternions and their division algebra have proven to be advantageous in modelling rotation/orientation in three-dimensional spaces and have seen use from the initial formulation of electromagnetic field theory through to forming the basis of quantum field theory. Despite their impressive versatility in modelling real-world phenomena, adaptive information processing techniques specifically designed for quaternion-valued signals have only recently come to the attention of the machine learning, signal processing, and control communities. The most important development in this direction is the introduction of the HR-calculus, which provides the required mathematical foundation for deriving adaptive information processing techniques directly in the quaternion domain. In this article, the foundations of the HR-calculus are revised and the required tools for deriving adaptive learning techniques suitable for dealing with quaternion-valued signals, such as the gradient operator, chain and product derivative rules, and Taylor series expansion are presented. This serves to establish the most important applications of adaptive information processing in the quaternion domain for both single-node and multi-node formulations. The article is supported by Supplementary Material, which will be referred to as SM.

replace-cross FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings

Authors: Jean Ogier du Terrail, Quentin Klopfenstein, Honghao Li, Imke Mayer, Nicolas Loiseau, Mohammad Hallal, Michael Debouver, Thibault Camalon, Thibault Fouqueray, Jorge Arellano Castro, Zahia Yanes, Laetitia Dahan, Julien Ta\"ieb, Pierre Laurent-Puig, Jean-Baptiste Bachet, Shulin Zhao, Remy Nicolle, J\'erome Cros, Daniel Gonzalez, Robert Carreras-Torres, Adelaida Garcia Velasco, Kawther Abdilleh, Sudheer Doss, F\'elix Balazard, Mathieu Andreux

Abstract: External control arms (ECA) can inform the early clinical development of experimental drugs and provide efficacy evidence for regulatory approval. However, the main challenge in implementing ECA lies in accessing real-world or historical clinical trials data. Indeed, regulations protecting patients' rights by strictly controlling data processing make pooling data from multiple sources in a central server often difficult. To address these limitations, we develop a new method, 'FedECA' that leverages federated learning (FL) to enable inverse probability of treatment weighting (IPTW) for time-to-event outcomes on separate cohorts without needing to pool data. To showcase the potential of FedECA, we apply it in different settings of increasing complexity culminating with a real-world use-case in which FedECA provides evidence for a differential effect between two drugs that would have otherwise gone unnoticed. By sharing our code, we hope FedECA will foster the creation of federated research networks and thus accelerate drug development.
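
The inverse probability of treatment weighting (IPTW) named in this abstract assigns each patient a weight based on the estimated probability of receiving the treatment. The sketch below shows only the standard pooled weight formula and assumes the propensity scores are already estimated; how FedECA computes these quantities across separate cohorts without pooling data is not shown. The function name is hypothetical.

```python
def iptw_weights(treated, propensity):
    """Return 1/e for treated patients and 1/(1 - e) for controls, e being the propensity score."""
    if len(treated) != len(propensity):
        raise ValueError("one propensity score per patient is required")
    weights = []
    for is_treated, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / e if is_treated else 1.0 / (1.0 - e))
    return weights
```

These weights would then enter a weighted time-to-event analysis, so that the treated and external control groups resemble each other on the measured covariates.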

replace-cross Volume-Preserving Transformers for Learning Time Series Data with Structure

Authors: Benedikt Brantner, Guillaume de Romemont, Michael Kraus, Zeyuan Li

Abstract: Two of the many trends in neural network research of the past few years have been (i) the learning of dynamical systems, especially with recurrent neural networks such as long short-term memory networks (LSTMs) and (ii) the introduction of transformer neural networks for natural language processing (NLP) tasks. While some work has been performed on the intersection of these two trends, those efforts were largely limited to using the vanilla transformer directly without adjusting its architecture for the setting of a physical system. In this work we develop a transformer-inspired neural network and use it to learn a dynamical system. We (for the first time) change the activation function of the attention layer to imbue the transformer with structure-preserving properties to improve long-term stability. This is shown to be of great advantage when applying the neural network to learning the trajectory of a rigid body.

replace-cross A Proximal Gradient Method With Probabilistic Multi-Gossip Communications for Decentralized Composite Optimization

Authors: Luyao Guo, Luqing Wang, Xinli Shi, Jinde Cao

Abstract: Decentralized optimization methods with local updates have recently gained attention for their provable ability to accelerate communication. In these methods, nodes perform several iterations of local computations between the communication rounds. Nevertheless, this capability is effective only when the loss function is smooth and the network is sufficiently well-connected. In this paper, we propose a communication-efficient method MG-Skip with probabilistic local updates and multi-gossip communications for decentralized composite (smooth + nonsmooth) optimization, whose stepsize is independent of the number of local updates and the network topology. Without any additional condition for network connectivity, MG-Skip allows for the multi-gossip communications to be skipped in most iterations in the strongly convex setting, while its iteration complexity is $\mathcal{O}\left(\kappa \log \frac{1}{\epsilon}\right)$ and communication complexity is only $\mathcal{O}\left(\sqrt{\frac{\kappa}{(1-\rho)}} \log \frac{1}{\epsilon}\right)$, where $\kappa$ is the condition number of the loss function, $\rho$ reflects the connectivity of the network topology, and $\epsilon$ is the target accuracy. The theoretical results demonstrate that MG-Skip achieves the optimal communication complexity and confirm the benefits of local updates in the nonsmooth setup.

replace-cross Self-supervised learning for skin cancer diagnosis with limited training data

Authors: Hamish Haggerty, Rohitash Chandra

Abstract: Early cancer detection is crucial for prognosis, but many cancer types lack large labelled datasets required for developing deep learning models. This paper investigates self-supervised learning (SSL) as an alternative to the standard supervised pre-training on ImageNet data for scenarios with limited training data using the ResNet-50 deep learning model. We first demonstrate that SSL pre-training on ImageNet (via the Barlow Twins SSL algorithm) outperforms supervised pre-training (SL) using a skin lesion dataset with limited training samples. We then consider further SSL pre-training (of the two ImageNet pre-trained models) on task-specific datasets, where our implementation is motivated by supervised transfer learning. The SSL significantly enhances initially SL pre-trained models, closing the performance gap with initially SSL pre-trained ones. Surprisingly, further pre-training on just the limited fine-tuning data achieves this performance equivalence. We implement a linear probe training strategy in the ResNet-50 model, and our experiments reveal that improvement stems from enhanced feature extraction. We find that minimal further SSL pre-training on task-specific data can be as effective as large-scale SSL pre-training on ImageNet for medical image classification tasks with limited labelled data. We validate these results on an oral cancer histopathology dataset, suggesting broader applicability across medical imaging domains facing labelled data scarcity.

replace-cross A Hierarchical Framework with Spatio-Temporal Consistency Learning for Emergence Detection in Complex Adaptive Systems

Authors: Siyuan Chen, Xin Du, Jiahai Wang

Abstract: Emergence, a global property of complex adaptive systems (CASs) constituted by interactive agents, is prevalent in real-world dynamic systems, e.g., network-level traffic congestion. Detecting its formation and evaporation helps to monitor the state of a system, allowing a warning signal to be issued for harmful emergent phenomena. Since a CAS has no centralized controller, detecting emergence based on each agent's local observation is desirable but challenging. Existing works are unable to capture emergence-related spatial patterns, and fail to model the nonlinear relationships among agents. This paper proposes a hierarchical framework with spatio-temporal consistency learning to solve these two problems by learning the system representation and agent representations, respectively. Spatio-temporal encoders composed of spatial and temporal transformers are designed to capture agents' nonlinear relationships and the system's complex evolution. Agents' and the system's representations are learned to preserve the spatio-temporal consistency by minimizing the spatial and temporal dissimilarities in a self-supervised manner in the latent space. Our method achieves more accurate detection than traditional methods and deep learning methods on three datasets with well-known yet hard-to-detect emergent behaviors. Notably, our hierarchical framework is generic in incorporating other deep learning methods for agent-level and system-level detection.

replace-cross On $f$-Divergence Principled Domain Adaptation: An Improved Framework

Authors: Ziqiao Wang, Yongyi Mao

Abstract: Unsupervised domain adaptation (UDA) plays a crucial role in addressing distribution shifts in machine learning. In this work, we improve the theoretical foundations of UDA proposed in Acuna et al. (2021) by refining their $f$-divergence-based discrepancy and additionally introducing a new measure, $f$-domain discrepancy ($f$-DD). By removing the absolute value function and incorporating a scaling parameter, $f$-DD obtains novel target error and sample complexity bounds, allowing us to recover previous KL-based results and bridging the gap between algorithms and theory presented in Acuna et al. (2021). Using a localization technique, we also develop a fast-rate generalization bound. Empirical results demonstrate the superior performance of $f$-DD-based learning algorithms over previous works in popular UDA benchmarks.

replace-cross Resource-Aware Hierarchical Federated Learning in Wireless Video Caching Networks

Authors: Md Ferdous Pervej, Andreas F. Molisch

Abstract: Backhaul traffic congestion caused by the video traffic of a few popular files can be alleviated by storing the to-be-requested content at various levels in wireless video caching networks. Typically, content service providers (CSPs) own the content, and the users request their preferred content from the CSPs using their (wireless) internet service providers (ISPs). As these parties do not reveal their private information and business secrets, traditional techniques may not be readily used to predict the dynamic changes in users' future demands. Motivated by this, we propose a novel resource-aware hierarchical federated learning (RawHFL) solution for predicting users' future content requests. A practical data acquisition technique is used that allows the user to update its local training dataset based on its requested content. Besides, since networking and other computational resources are limited, considering that only a subset of the users participate in the model training, we derive the convergence bound of the proposed algorithm. Based on this bound, we minimize a weighted utility function for jointly configuring the controllable parameters to train the RawHFL energy efficiently under practical resource constraints. Our extensive simulation results validate the proposed algorithm's superiority, in terms of test accuracy and energy cost, over existing baselines.

replace-cross Adversarial Robustness Through Artifact Design

Authors: Tsufit Shua, Liron David, Mahmood Sharif

Abstract: Adversarial examples arose as a challenge for machine learning. To hinder them, most defenses alter how models are trained (e.g., adversarial training) or inference is made (e.g., randomized smoothing). Still, while these approaches markedly improve models' adversarial robustness, models remain highly susceptible to adversarial examples. Identifying that, in certain domains such as traffic-sign recognition, objects are implemented per standards specifying how artifacts (e.g., signs) should be designed, we propose a novel approach for improving adversarial robustness. Specifically, we offer a method to redefine standards, making minor changes to existing ones, to defend against adversarial examples. We formulate the problem of artifact design as a robust optimization problem, and propose gradient-based and greedy search methods to solve it. We evaluated our approach in the domain of traffic-sign recognition, allowing it to alter traffic-sign pictograms (i.e., symbols within the signs) and their colors. We found that, combined with adversarial training, our approach led to up to 25.18\% higher robust accuracy compared to state-of-the-art methods against two adversary types, while further increasing accuracy on benign inputs. Notably, a user study we conducted showed that traffic signs produced by our approach are also easily recognizable by human subjects.

replace-cross Calibrating Long-form Generations from Large Language Models

Authors: Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra

Abstract: To enhance Large Language Models' (LLMs) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. This approach does not apply to long-form generation, where an answer can be partially correct. Addressing this gap, we introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores. Within this framework, we develop three metrics to precisely evaluate LLM calibration and further propose two confidence elicitation methods based on self-consistency and self-evaluation. Our experiments, which include long-form QA and summarization tasks, demonstrate that larger models don't necessarily guarantee better calibration, that calibration performance is found to be metric-dependent, and that self-consistency methods excel in factoid datasets. We also find that calibration can be enhanced through techniques such as fine-tuning, integrating relevant source documents, scaling the temperature, and combining self-consistency with self-evaluation. Lastly, we showcase a practical application of our system: selecting and cascading open-source models and ChatGPT to optimize correctness given a limited API budget. This research not only challenges existing notions of LLM calibration but also offers practical methodologies for improving trustworthiness in long-form generation.

replace-cross PANORAMIA: Privacy Auditing of Machine Learning Models without Retraining

Authors: Mishaal Kazmi, Hadrien Lautraite, Alireza Akbari, Qiaoyue Tang, Mauricio Soroco, Tao Wang, S\'ebastien Gambs, Mathias L\'ecuyer

Abstract: We present PANORAMIA, a privacy leakage measurement framework for machine learning models that relies on membership inference attacks using generated data as non-members. By relying on generated non-member data, PANORAMIA eliminates the common dependency of privacy measurement tools on in-distribution non-member data. As a result, PANORAMIA does not modify the model, training data, or training process, and only requires access to a subset of the training data. We evaluate PANORAMIA on ML models for image and tabular data classification, as well as on large-scale language models.

replace-cross LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons

Authors: Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach

Abstract: Data scarcity in low-resource languages can be addressed with word-to-word translations from labeled task data in high-resource languages using bilingual lexicons. However, bilingual lexicons often have limited lexical overlap with task data, which results in poor translation coverage and lexicon utilization. We propose lexicon-conditioned data generation (LexC-Gen), a method that generates low-resource-language classification task data at scale. Specifically, LexC-Gen first uses high-resource-language words from bilingual lexicons to generate lexicon-compatible task data, and then translates them into low-resource languages with bilingual lexicons via word translation. Across 17 extremely low-resource languages, LexC-Gen generated data is competitive with expert-translated gold data, and yields on average 5.6 and 8.9 points improvement over existing lexicon-based word translation methods on sentiment analysis and topic classification tasks respectively. Through an ablation study, we show that conditioning on bilingual lexicons is the key component of LexC-Gen. LexC-Gen serves as a potential solution to close the performance gap between open-source multilingual models, such as BLOOMZ and Aya-101, and state-of-the-art commercial models like GPT-4o on low-resource-language tasks.
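
The coverage problem described in this abstract can be seen in a minimal sketch of word-to-word translation with a bilingual lexicon. The function name and the toy lexicon entries (placeholder tokens such as `lr_movie`) are hypothetical and serve only to show how words missing from the lexicon stay untranslated; the sketch does not reproduce the LexC-Gen pipeline.

```python
def word_translate(sentence, lexicon):
    """Translate word by word; return the translation and lexicon coverage.

    Words absent from the lexicon are kept as-is, which is what lowers
    coverage when task data and lexicon share few words.
    """
    tokens = sentence.split()
    translated = [lexicon.get(token.lower(), token) for token in tokens]
    covered = sum(1 for token in tokens if token.lower() in lexicon)
    coverage = covered / len(tokens) if tokens else 0.0
    return " ".join(translated), coverage


# Hypothetical two-entry lexicon with placeholder target-language tokens.
toy_lexicon = {"movie": "lr_movie", "good": "lr_good"}
text, coverage = word_translate("the movie was good", toy_lexicon)
```

Here only half of the words are translated, so a classifier trained on such output sees mostly source-language text; generating task data that is already compatible with the lexicon is the remedy the abstract proposes.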

replace-cross Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation

Authors: Jiawei Wang, Renhe Jiang, Chuang Yang, Zengqing Wu, Makoto Onizuka, Ryosuke Shibasaki, Noboru Koshizuka, Chuan Xiao

Abstract: This paper introduces a novel approach using Large Language Models (LLMs) integrated into an agent framework for flexible and effective personal mobility generation. LLMs overcome the limitations of previous models by effectively processing semantic data and offering versatility in modeling various tasks. Our approach addresses three research questions: aligning LLMs with real-world urban mobility data, developing reliable activity generation strategies, and exploring LLM applications in urban mobility. The key technical contribution is a novel LLM agent framework that accounts for individual activity patterns and motivations, including a self-consistency approach to align LLMs with real-world activity data and a retrieval-augmented strategy for interpretable activity generation. We evaluate our LLM agent framework and compare it with state-of-the-art personal mobility generation approaches, demonstrating the effectiveness of our approach and its potential applications in urban mobility. Overall, this study marks the pioneering work of designing an LLM agent framework for activity generation based on real-world human activity data, offering a promising tool for urban mobility analysis.

replace-cross Watermarking Makes Language Models Radioactive

Authors: Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, Teddy Furon

Abstract: We investigate the radioactivity of text generated by large language models (LLMs), i.e. whether it is possible to detect that such synthetic input was used to train a subsequent LLM. Current methods like membership inference or active IP protection either work only in settings where the suspected text is known or do not provide reliable statistical guarantees. We discover that, on the contrary, it is possible to reliably determine if a language model was trained on synthetic data if that data is output by a watermarked LLM. Our new methods, specialized for radioactivity, detect with provable confidence weak residuals of the watermark signal in the fine-tuned LLM. We link the radioactivity contamination level to the following properties: the watermark robustness, its proportion in the training set, and the fine-tuning process. For instance, if the suspect model is open-weight, we demonstrate that training on watermarked instructions can be detected with high confidence ($p$-value $< 10^{-5}$) even when as little as $5\%$ of training text is watermarked.
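
A $p$-value of the kind quoted above can be obtained from a simple hypothesis test. The sketch below assumes a generic green-list watermark, in which each scored token is "green" with probability `gamma` when no watermark is present; the scheme, the function name, and the parameter names are assumptions for illustration and not the detection test designed in the paper.

```python
from math import comb


def binomial_tail_p_value(green_count, total, gamma=0.5):
    """P(Binomial(total, gamma) >= green_count) under the no-watermark null."""
    if not 0 <= green_count <= total:
        raise ValueError("green_count must lie between 0 and total")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return sum(
        comb(total, k) * gamma**k * (1.0 - gamma) ** (total - k)
        for k in range(green_count, total + 1)
    )
```

A small $p$-value means that the observed number of green tokens would be very unlikely for a model that never saw watermarked text, which is the sense in which a weak residual signal can still be detected with a stated confidence.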

replace-cross PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures

Authors: Christina Giannoula, Peiming Yang, Ivan Fernandez Vega, Jiacheng Yang, Sankeerth Durvasula, Yu Xin Li, Mohammad Sadrosadati, Juan Gomez Luna, Onur Mutlu, Gennady Pekhimenko

Abstract: Graph Neural Networks (GNNs) are emerging ML models for analyzing graph-structured data. GNN execution involves both compute-intensive and memory-intensive kernels; the latter dominate the total time, being significantly bottlenecked by data movement between memory and processors. Processing-In-Memory (PIM) systems can alleviate this data movement bottleneck by placing simple processors near or inside memory arrays. In this work, we introduce PyGim, an efficient ML library that accelerates GNNs on real PIM systems. We propose intelligent parallelization techniques for memory-intensive kernels of GNNs tailored for real PIM systems, and develop a handy Python API for them. We provide hybrid GNN execution, in which the compute-intensive and memory-intensive kernels are executed in processor-centric and memory-centric computing systems, respectively. We extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using emerging GNN models, and demonstrate that it outperforms its state-of-the-art CPU counterpart on Intel Xeon by on average 3.04x, and achieves higher resource utilization than CPU and GPU systems. Our work provides useful recommendations for software, system and hardware designers. PyGim is publicly available at https://github.com/CMU-SAFARI/PyGim.

URLs: https://github.com/CMU-SAFARI/PyGim

replace-cross Leveraging Self-Supervised Learning for Scene Classification in Child Sexual Abuse Imagery

Authors: Pedro H. V. Valois, Jo\~ao Macedo, Leo S. F. Ribeiro, Jefersson A. dos Santos, Sandra Avila

Abstract: Crime in the 21st century is split into a virtual and real world. However, the former has become a global menace to people's well-being and security in the latter. The challenges it presents must be faced with unified global cooperation, and we must rely more than ever on automated yet trustworthy tools to combat the ever-growing nature of online offenses. Over 10 million child sexual abuse reports are submitted to the US National Center for Missing \& Exploited Children every year, and over 80% originate from online sources. Therefore, investigation centers cannot manually process and correctly investigate all imagery. In light of that, reliable automated tools that can securely and efficiently deal with this data are paramount. In this sense, the scene classification task looks for contextual cues in the environment and can group and classify child sexual abuse data without requiring training on sensitive material. The scarcity and limitations of working with child sexual abuse images lead to self-supervised learning, a machine-learning methodology that leverages unlabeled data to produce powerful representations that can be more easily transferred to downstream tasks. This work shows that self-supervised deep learning models pre-trained on scene-centric data can reach 71.6% balanced accuracy on our indoor scene classification task and, on average, 2.2 percentage points better performance than a fully supervised version. We cooperate with Brazilian Federal Police experts to evaluate our indoor classification model on actual child abuse material. The results demonstrate a notable discrepancy between the features observed in widely used scene datasets and those depicted in sensitive materials.

replace-cross Effectiveness Assessment of Recent Large Vision-Language Models

Authors: Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan

Abstract: The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the model's effectiveness in both specialized and general tasks warrants further investigation. This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of these novel models. To gauge their effectiveness in specialized tasks, we employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial. These six tasks include salient/camouflaged/transparent object detection, as well as polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks. Moreover, we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deep into this inadequacy and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope that this study can provide useful insights for the future development of LVLMs, helping researchers improve LVLMs for both general and specialized applications.

replace-cross Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization

Authors: Xiangxin Zhou, Dongyu Xue, Ruizhe Chen, Zaixiang Zheng, Liang Wang, Quanquan Gu

Abstract: Antibody design, a crucial task with significant implications across various disciplines such as therapeutics and biology, presents considerable challenges due to its intricate nature. In this paper, we tackle antigen-specific antibody sequence-structure co-design as an optimization problem towards specific preferences, considering both rationality and functionality. Leveraging a pre-trained conditional diffusion model that jointly models sequences and structures of antibodies with equivariant neural networks, we propose direct energy-based preference optimization to guide the generation of antibodies with both rational structures and considerable binding affinities to given antigens. Our method involves fine-tuning the pre-trained diffusion model using a residue-level decomposed energy preference. Additionally, we employ gradient surgery to address conflicts between various types of energy, such as attraction and repulsion. Experiments on the RAbD benchmark show that our approach effectively optimizes the energy of generated antibodies and achieves state-of-the-art performance in designing high-quality antibodies with low total energy and high binding affinity simultaneously.
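
The idea of gradient surgery can be sketched with plain vectors. The example below follows the commonly used projection rule (often called PCGrad): when two gradients point in conflicting directions, the component of one along the other is removed. The function names and the toy gradients are assumptions for illustration; the abstract does not specify which variant the authors use.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def project_conflict(gradient, other):
    """Remove from `gradient` its component along `other` if they conflict.

    Two gradients conflict when their dot product is negative; otherwise
    the gradient is returned unchanged (as a copy).
    """
    overlap = dot(gradient, other)
    if overlap >= 0:
        return list(gradient)
    scale = overlap / dot(other, other)
    return [g - scale * o for g, o in zip(gradient, other)]


# Toy "attraction" and "repulsion" gradients that partly oppose each other.
attraction = [1.0, 0.0]
repulsion = [-1.0, 1.0]
adjusted = project_conflict(attraction, repulsion)
```

After projection the adjusted gradient is orthogonal to the conflicting one, so a step along it no longer increases the other objective to first order.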

replace-cross On Large Language Models' Hallucination with Regard to Known Facts

Authors: Che Jiang, Biqing Qi, Xiangyu Hong, Dayuan Fu, Yang Cheng, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou

Abstract: Large language models are successful in answering factoid questions but are also prone to hallucination. We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics, an area not previously covered in studies on hallucinations. We are able to conduct this analysis via two key ideas. First, we identify the factual questions that query the same triplet knowledge but result in different answers. The difference between the model behaviors on the correct and incorrect outputs hence suggests the patterns when hallucinations happen. Second, to measure the pattern, we utilize mappings from the residual streams to vocabulary space. We reveal the different dynamics of the output token probabilities along the depths of layers between the correct and hallucinated cases. In hallucinated cases, the output token's information rarely demonstrates abrupt increases and consistent superiority in the later stages of the model. Leveraging the dynamic curve as a feature, we build a classifier capable of accurately detecting hallucinatory predictions with an 88\% success rate. Our study sheds light on understanding the reasons for LLMs' hallucinations on their known facts, and more importantly, on accurately predicting when they are hallucinating.

replace-cross CodeCloak: A Method for Evaluating and Mitigating Code Leakage by LLM Code Assistants

Authors: Amit Finkman Noah, Avishag Shapira, Eden Bar Kochva, Inbar Maimon, Dudu Mimran, Yuval Elovici, Asaf Shabtai

Abstract: LLM-based code assistants are becoming increasingly popular among developers. These tools help developers improve their coding efficiency and reduce errors by providing real-time suggestions based on the developer's codebase. While beneficial, the use of these tools can inadvertently expose the developer's proprietary code to the code assistant service provider during the development process. In this work, we propose a method to mitigate the risk of code leakage when using LLM-based code assistants. CodeCloak is a novel deep reinforcement learning agent that manipulates the prompts before sending them to the code assistant service. CodeCloak aims to achieve the following two contradictory goals: (i) minimizing code leakage, while (ii) preserving relevant and useful suggestions for the developer. Our evaluation, employing the LLM-based code assistant models StarCoder and Code Llama, demonstrates CodeCloak's effectiveness on a diverse set of code repositories of varying sizes, as well as its transferability across different models. We also designed a method for reconstructing the developer's original codebase from code segments sent to the code assistant service (i.e., prompts) during the development process, to thoroughly analyze code leakage risks and evaluate the effectiveness of CodeCloak under practical development scenarios.

replace-cross Generalization capabilities and robustness of hybrid machine learning models grounded in flow physics compared to purely deep learning models

Authors: Rodrigo Abad\'ia-Heredia, Adri\'an Corrochano, Manuel Lopez-Martin, Soledad Le Clainche

Abstract: This study investigates the generalization capabilities and robustness of purely deep learning (DL) models and hybrid models based on physical principles in fluid dynamics applications, specifically focusing on iteratively forecasting the temporal evolution of flow dynamics. Three autoregressive models were compared: a convolutional autoencoder combined with a convolutional LSTM (ConvLSTM), a variational autoencoder (VAE) combined with a ConvLSTM, and a hybrid model that combines proper orthogonal decomposition (POD) with an LSTM (POD-DL). These models were tested on two high-dimensional, nonlinear datasets representing the velocity field of flow past a circular cylinder in both laminar and turbulent regimes. The study used latent dimension methods, enabling a bijective reduction of high-dimensional dynamics into a lower-order space to facilitate future predictions. While the VAE and ConvLSTM models accurately predicted laminar flow, the hybrid POD-DL model outperformed the others across both laminar and turbulent flow regimes. This success is attributed to the model's ability to incorporate modal decomposition, which reduces the dimensionality of the data through a non-parametric method and simplifies the forecasting component. By leveraging POD, the model not only gained insight into the underlying physics, improving prediction accuracy with less training data, but also reduced the number of trainable parameters, as POD is non-parametric. The findings emphasize the potential of hybrid models, particularly those integrating modal decomposition and deep learning, in predicting complex flow dynamics.

replace-cross Customizing Text-to-Image Models with a Single Image Pair

Authors: Maxwell Jones, Sheng-Yu Wang, Nupur Kumari, David Bau, Jun-Yan Zhu

Abstract: Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair.

replace-cross Artificial Intelligence for the Internal Democracy of Political Parties

Authors: Claudio Novelli, Giuliano Formisano, Prathm Juneja, Giulia Sandri, Luciano Floridi

Abstract: The article argues that AI can enhance the measurement and implementation of democratic processes within political parties, known as Intra-Party Democracy (IPD). It identifies the limitations of traditional methods for measuring IPD, which often rely on formal parameters, self-reported data, and tools like surveys. Such limitations lead to the collection of partial data, rare updates, and significant demands on resources. To address these issues, the article suggests that specific data management and Machine Learning (ML) techniques, such as natural language processing and sentiment analysis, can improve the measurement (ML about) and practice (ML for) of IPD. The article concludes by considering some of the principal risks of ML for IPD, including concerns over data privacy, the potential for manipulation, and the dangers of overreliance on technology.

replace-cross Parallel Backpropagation for Shared-Feature Visualization

Authors: Alexander Lappe, Anna Bogn\'ar, Ghazaleh Ghamkhari Nejad, Albert Mukovskiy, Lucas Martini, Martin A. Giese, Rufin Vogels

Abstract: High-level visual brain regions contain subareas in which neurons appear to respond more strongly to examples of a particular semantic category, like faces or bodies, rather than objects. However, recent work has shown that while this finding holds on average, some out-of-category stimuli also activate neurons in these regions. This may be due to visual features common among the preferred class also being present in other images. Here, we propose a deep-learning-based approach for visualizing these features. For each neuron, we identify relevant visual features driving its selectivity by modelling responses to images based on latent activations of a deep neural network. Given an out-of-category image which strongly activates the neuron, our method first identifies a reference image from the preferred category yielding a similar feature activation pattern. We then backpropagate latent activations of both images to the pixel level, while enhancing the identified shared dimensions and attenuating non-shared features. The procedure highlights image regions containing shared features driving responses of the model neuron. We apply the algorithm to novel recordings from body-selective regions in macaque IT cortex in order to understand why some images of objects excite these neurons. Visualizations reveal object parts which resemble parts of a macaque body, shedding light on neural preference of these objects.

replace-cross Images that Sound: Composing Images and Sounds on a Single Canvas

Authors: Ziyang Chen, Daniel Geng, Andrew Owens

Abstract: Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these visual spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual appearance of a desired image prompt. Please see our project page for video results: https://ificl.github.io/images-that-sound/

URLs: https://ificl.github.io/images-that-sound/

replace-cross Deep linear networks for regression are implicitly regularized towards flat minima

Authors: Pierre Marion, L\'ena\"ic Chizat

Abstract: The largest eigenvalue of the Hessian, or sharpness, of neural networks is a key quantity to understand their optimization dynamics. In this paper, we study the sharpness of deep linear networks for univariate regression. Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one. Indeed, we show a lower bound on the sharpness of minimizers, which grows linearly with depth. We then study the properties of the minimizer found by gradient flow, which is the limit of gradient descent with vanishing learning rate. We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound. The constant depends on the condition number of the data covariance matrix, but not on width or depth. This result is proven both for a small-scale initialization and a residual initialization. Results of independent interest are shown in both cases. For small-scale initialization, we show that the learned weight matrices are approximately rank-one and that their singular vectors align. For residual initialization, convergence of the gradient flow for a Gaussian initialization of the residual network is proven. Numerical experiments illustrate our results and connect them to gradient descent with non-vanishing learning rate.
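
The two claims about minimizers (sharpness can be arbitrarily large, but not arbitrarily small) can be checked by hand in a toy setting. The sketch below assumes a width-one deep linear network with scalar weights $w_1, \ldots, w_L$, a single unit-scale data point, and the loss $\frac{1}{2}(\prod_i w_i - a)^2$; these simplifications and the function names are assumptions for illustration, and the constants do not correspond to the bound proven in the paper. At a minimizer the Hessian is the rank-one matrix $\nabla p \, \nabla p^\top$ with $p = \prod_i w_i$, so its largest eigenvalue is $\sum_i (p / w_i)^2$, which the AM-GM inequality bounds below by $L \, |a|^{2 - 2/L}$.

```python
import math


def sharpness_at_minimizer(weights):
    """Largest Hessian eigenvalue of 0.5 * (prod(w) - a)**2 at a minimizer.

    Assumes prod(weights) == a, so the Hessian reduces to the outer product
    of the gradient of the product, whose top eigenvalue is its squared norm.
    """
    product = math.prod(weights)
    return sum((product / w) ** 2 for w in weights)


def am_gm_lower_bound(target, depth):
    """Lower bound depth * |target|**(2 - 2/depth) from the AM-GM inequality."""
    return depth * abs(target) ** (2.0 - 2.0 / depth)


balanced = sharpness_at_minimizer([1.0, 1.0, 1.0, 1.0])    # product is 1
unbalanced = sharpness_at_minimizer([2.0, 0.5, 1.0, 1.0])  # product is 1
bound = am_gm_lower_bound(1.0, 4)
```

In this toy case the balanced minimizer attains the bound, which grows linearly with depth for a unit target, while rescaling pairs of weights keeps the product fixed and makes the sharpness as large as desired.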

replace-cross FACT or Fiction: Can Truthful Mechanisms Eliminate Federated Free Riding?

Authors: Marco Bornstein, Amrit Singh Bedi, Abdirisak Mohamed, Furong Huang

Abstract: Standard federated learning (FL) approaches are vulnerable to the free-rider dilemma: participating agents can contribute little to nothing yet receive a well-trained aggregated model. While prior mechanisms attempt to solve the free-rider dilemma, none have addressed the issue of truthfulness. In practice, adversarial agents can provide false information to the server in order to cheat their way out of contributing to federated training. In an effort to make free-riding-averse federated mechanisms truthful, and consequently less prone to breaking down in practice, we propose FACT. FACT is the first federated mechanism that: (1) eliminates federated free riding by using a penalty system, (2) ensures agents provide truthful information by creating a competitive environment, and (3) encourages agent participation by offering better performance than training alone. Empirically, FACT avoids free-riding when agents are untruthful, and reduces agent loss by over 4x.

replace-cross Matrix Denoising with Doubly Heteroscedastic Noise: Fundamental Limits and Optimal Spectral Methods

Authors: Yihan Zhang, Marco Mondelli

Abstract: We study the matrix denoising problem of estimating the singular vectors of a rank-$1$ signal corrupted by noise with both column and row correlations. Existing works are either unable to pinpoint the exact asymptotic estimation error or, when they do so, the resulting approaches (e.g., based on whitening or singular value shrinkage) remain vastly suboptimal. On top of this, most of the literature has focused on the special case of estimating the left singular vector of the signal when the noise only possesses row correlation (one-sided heteroscedasticity). In contrast, our work establishes the information-theoretic and algorithmic limits of matrix denoising with doubly heteroscedastic noise. We characterize the exact asymptotic minimum mean square error, and design a novel spectral estimator with rigorous optimality guarantees: under a technical condition, it attains positive correlation with the signals whenever information-theoretically possible and, for one-sided heteroscedasticity, it also achieves the Bayes-optimal error. Numerical experiments demonstrate the significant advantage of our theoretically principled method over the state of the art. The proofs draw connections with statistical physics and approximate message passing, departing drastically from standard random matrix theory techniques.

replace-cross RadarOcc: Robust 3D Occupancy Prediction with 4D Imaging Radar

Authors: Fangqiang Ding, Xiangyu Wen, Yunzhou Zhu, Yiming Li, Chris Xiaoxuan Lu

Abstract: The 3D occupancy-based perception pipeline has significantly advanced autonomous driving by capturing detailed scene descriptions and demonstrating strong generalizability across various object categories and shapes. Current methods predominantly rely on LiDAR or camera inputs for 3D occupancy prediction. These methods are susceptible to adverse weather conditions, limiting the all-weather deployment of self-driving cars. To improve perception robustness, we leverage the recent advances in automotive radars and introduce a novel approach that utilizes 4D imaging radar sensors for 3D occupancy prediction. Our method, RadarOcc, circumvents the limitations of sparse radar point clouds by directly processing the 4D radar tensor, thus preserving essential scene details. RadarOcc innovatively addresses the challenges associated with the voluminous and noisy 4D radar data by employing Doppler bins descriptors, sidelobe-aware spatial sparsification, and range-wise self-attention mechanisms. To minimize the interpolation errors associated with direct coordinate transformations, we also devise a spherical-based feature encoding followed by spherical-to-Cartesian feature aggregation. We benchmark various baseline methods based on distinct modalities on the public K-Radar dataset. The results demonstrate RadarOcc's state-of-the-art performance in radar-based 3D occupancy prediction and promising results even when compared with LiDAR- or camera-based methods. Additionally, we present qualitative evidence of the superior performance of 4D radar in adverse weather conditions and explore the impact of key pipeline components through ablation studies.

replace-cross Markovian Flow Matching: Accelerating MCMC with Continuous Normalizing Flows

Authors: Alberto Cabezas, Louis Sharrock, Christopher Nemeth

Abstract: Continuous normalizing flows (CNFs) learn the probability path between a reference distribution and a target distribution by modeling the vector field generating said path using neural networks. Recently, Lipman et al. (2022) introduced a simple and inexpensive method for training CNFs in generative modeling, termed flow matching (FM). In this paper, we repurpose this method for probabilistic inference by incorporating Markovian sampling methods in evaluating the FM objective, and using the learned CNF to improve Monte Carlo sampling. Specifically, we propose an adaptive Markov chain Monte Carlo (MCMC) algorithm, which combines a local Markov transition kernel with a non-local, flow-informed transition kernel, defined using a CNF. This CNF is adapted on-the-fly using samples from the Markov chain, which are used to specify the probability path for the FM objective. Our method also includes an adaptive tempering mechanism that allows the discovery of multiple modes in the target distribution. Under mild assumptions, we establish convergence of our method to a local optimum of the FM objective. We then benchmark our approach on several synthetic and real-world examples, achieving similar performance to other state-of-the-art methods, but often at a significantly lower computational cost.
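
The combination of a local and a non-local transition kernel can be sketched as a Metropolis-Hastings sampler that picks one of two proposals at each step. In this sketch a fixed wide Gaussian stands in for the flow-informed proposal, and the bimodal target, step sizes, and function names are all assumptions for illustration; the adaptive CNF training and the tempering mechanism described in the abstract are not reproduced.

```python
import math
import random


def log_target(x):
    """Unnormalized log density with modes near -3 and +3 (log-sum-exp form)."""
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_normal_pdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def sample_chain(n_steps, p_global=0.2, step=0.5, global_std=4.0, seed=0):
    """Mix a random-walk kernel with an independence kernel.

    The independence proposal N(0, global_std**2) is a stand-in for a
    learned flow; each kernel uses its own Metropolis-Hastings ratio.
    """
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        if rng.random() < p_global:
            # Non-local move: the proposal does not depend on the current state.
            y = rng.gauss(0.0, global_std)
            log_ratio = (
                log_target(y) - log_target(x)
                + log_normal_pdf(x, 0.0, global_std)
                - log_normal_pdf(y, 0.0, global_std)
            )
        else:
            # Local move: symmetric random walk, so proposal terms cancel.
            y = x + rng.gauss(0.0, step)
            log_ratio = log_target(y) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
        samples.append(x)
    return samples


samples = sample_chain(5000)
```

The local kernel explores each mode efficiently, while the non-local kernel lets the chain jump between modes that a small random-walk step would rarely connect.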

replace-cross Representation noising can prevent harmful fine-tuning on LLMs

Authors: Domenic Rosati, Jan Wehner, Kai Williams, {\L}ukasz Bartoszcze, David Atanasov, Robie Gonzales, Subhabrata Majumdar, Carsten Maple, Hassan Sajjad, Frank Rudzicz

Abstract: Releasing open-source large language models (LLMs) presents a dual-use risk since bad actors can easily fine-tune these models for harmful purposes. Even without the open release of weights, weight stealing and fine-tuning APIs make closed models vulnerable to harmful fine-tuning attacks (HFAs). While safety measures like preventing jailbreaks and improving safety guardrails are important, such measures can easily be reversed through fine-tuning. In this work, we propose Representation Noising (RepNoise), a defence mechanism that is effective even when attackers have access to the weights. RepNoise works by removing information about harmful representations such that it is difficult to recover them during fine-tuning. Importantly, our defence is also able to generalize across different subsets of harm that have not been seen during the defence process as long as they are drawn from the same distribution as the attack set. Our method does not degrade the general capability of LLMs and retains the ability to train the model on harmless tasks. We provide empirical evidence that the effectiveness of our defence lies in its "depth": the degree to which information about harmful representations is removed across all layers of the LLM.

replace-cross RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance

Authors: Zhicheng Sun, Zhenhao Yang, Yang Jin, Haozhe Chi, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Yang Song, Kun Gai, Yadong Mu

Abstract: Customizing diffusion models to generate identity-preserving images from user-provided reference images is an intriguing new problem. The prevalent approaches typically require training on extensive domain-specific images to achieve identity preservation, which lacks flexibility across different use cases. To address this issue, we exploit classifier guidance, a training-free technique that steers diffusion models using an existing classifier, for personalized image generation. Our study shows that based on a recent rectified flow framework, the major limitation of vanilla classifier guidance in requiring a special classifier can be resolved with a simple fixed-point solution, allowing flexible personalization with off-the-shelf image discriminators. Moreover, its solving procedure proves to be stable when anchored to a reference flow trajectory, with a convergence guarantee. The derived method is implemented on rectified flow with different off-the-shelf image discriminators, delivering advantageous personalization results for human faces, live subjects, and certain objects. Code is available at https://github.com/feifeiobama/RectifID.

URLs: https://github.com/feifeiobama/RectifID

replace-cross Efficient Certificates of Anti-Concentration Beyond Gaussians

Authors: Ainesh Bakshi, Pravesh Kothari, Goutham Rajendran, Madhur Tulsiani, Aravindan Vijayaraghavan

Abstract: A set of high-dimensional points $X=\{x_1, x_2,\ldots, x_n\} \subset \mathbb{R}^d$ in isotropic position is said to be $\delta$-anti-concentrated if for every direction $v$, the fraction of points in $X$ satisfying $|\langle x_i,v \rangle |\leq \delta$ is at most $O(\delta)$. Motivated by applications to list-decodable learning and clustering, recent works have considered the problem of constructing efficient certificates of anti-concentration in the average case, when the set of points $X$ corresponds to samples from a Gaussian distribution. Their certificates played a crucial role in several subsequent works in algorithmic robust statistics on list-decodable learning and settling the robust learnability of arbitrary Gaussian mixtures, yet remain limited to rotationally invariant distributions. This work presents a new (and arguably the most natural) formulation for anti-concentration. Using this formulation, we give quasi-polynomial time verifiable sum-of-squares certificates of anti-concentration that hold for a wide class of non-Gaussian distributions including anti-concentrated bounded product distributions and uniform distributions over $L_p$ balls (and their affine transformations). Consequently, our method upgrades and extends results in algorithmic robust statistics, e.g., list-decodable learning and clustering, to such distributions. Our approach constructs a canonical integer program for anti-concentration and analyzes a sum-of-squares relaxation of it, independent of the intended application. We rely on duality and analyze a pseudo-expectation on large subsets of the input points that take a small value in some direction. Our analysis uses the method of polynomial reweightings to reduce the problem to analyzing only analytically dense or sparse directions.
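
The definition of anti-concentration in the abstract above can be probed numerically for individual directions. The following is a minimal Python sketch, not the paper's sum-of-squares certificate: the names `slab_fraction` and `worst_sampled_fraction` are hypothetical, the input points are assumed to be in isotropic position already, and sampling random directions only yields a heuristic lower bound on the worst-case fraction, whereas a certificate has to hold for every direction.

```python
import math
import random


def slab_fraction(points, direction, delta):
    """Fraction of points x with |<x, v>| <= delta, where v is `direction` scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in direction))
    unit = [c / norm for c in direction]
    inside = sum(
        1
        for x in points
        if abs(sum(a * b for a, b in zip(x, unit))) <= delta
    )
    return inside / len(points)


def worst_sampled_fraction(points, delta, n_directions=200, seed=0):
    """Largest slab fraction seen over randomly sampled directions.

    This is only a lower bound on the true worst case (an assumption of this
    sketch); it cannot certify anti-concentration.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    worst = 0.0
    for _ in range(n_directions):
        # A Gaussian vector, once normalised, is a uniformly random direction.
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        worst = max(worst, slab_fraction(points, v, delta))
    return worst


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(2000)]
    print(worst_sampled_fraction(sample, delta=0.1))
```

For Gaussian samples the printed fraction should stay proportional to `delta`, which is the behaviour the definition asks for in every direction.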

replace-cross Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model

Authors: Mingyang Yi, Aoxue Li, Yi Xin, Zhenguo Li

Abstract: Recently, the strong latent Diffusion Probabilistic Model (DPM) has been applied to high-quality Text-to-Image (T2I) generation (e.g., Stable Diffusion), by injecting the encoded target text prompt into the gradually denoised diffusion image generator. Despite the success of DPM in practice, the mechanism behind it remains to be explored. To fill this gap, we begin by examining the intermediate states during the gradual denoising generation process in DPM. The empirical observations indicate that the shape of the image is reconstructed after the first few denoising steps, and then the image is filled with details (e.g., texture). This phenomenon arises because the low-frequency signal (shape relevant) of the noisy image is not corrupted until the final stage in the forward process (initial stage of generation) of adding noise in DPM. Inspired by these observations, we proceed to explore the influence of each token in the text prompt during the two stages. After a series of experiments on T2I generation conditioned on a set of text prompts, we conclude that in the earlier generation stage, the image is mostly decided by the special token [EOS] in the text prompt, and the information in the text prompt is already conveyed in this stage. After that, the diffusion model completes the details of the generated images using information from the images themselves. Finally, we propose to apply this observation to accelerate T2I generation by properly removing text guidance, which accelerates sampling by up to 25%+.

replace-cross KG-FIT: Knowledge Graph Fine-Tuning Upon Open-World Knowledge

Authors: Pengcheng Jiang, Lang Cao, Cao Xiao, Parminder Bhatia, Jimeng Sun, Jiawei Han

Abstract: Knowledge Graph Embedding (KGE) techniques are crucial in learning compact representations of entities and relations within a knowledge graph, facilitating efficient reasoning and knowledge discovery. While existing methods typically focus either on training KGE models solely based on graph structure or fine-tuning pre-trained language models with classification data in KG, KG-FIT leverages LLM-guided refinement to construct a semantically coherent hierarchical structure of entity clusters. By incorporating this hierarchical knowledge along with textual information during the fine-tuning process, KG-FIT effectively captures both global semantics from the LLM and local semantics from the KG. Extensive experiments on the benchmark datasets FB15K-237, YAGO3-10, and PrimeKG demonstrate the superiority of KG-FIT over state-of-the-art pre-trained language model-based methods, achieving improvements of 14.4%, 13.5%, and 11.9% in the Hits@10 metric for the link prediction task, respectively. Furthermore, KG-FIT yields substantial performance gains of 12.6%, 6.7%, and 17.7% compared to the structure-based base models upon which it is built. These results highlight the effectiveness of KG-FIT in incorporating open-world knowledge from LLMs to significantly enhance the expressiveness and informativeness of KG embeddings.

replace-cross MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities

Authors: Hao Dong, Yue Zhao, Eleni Chatzi, Olga Fink

Abstract: Detecting out-of-distribution (OOD) samples is important for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. Existing research has mainly focused on unimodal scenarios on image data. However, real-world applications are inherently multimodal, which makes it essential to leverage information from multiple modalities to enhance the efficacy of OOD detection. To establish a foundation for more realistic Multimodal OOD Detection, we introduce the first-of-its-kind benchmark, MultiOOD, characterized by diverse dataset sizes and varying modality combinations. We first evaluate existing unimodal OOD detection algorithms on MultiOOD, observing that the mere inclusion of additional modalities yields substantial improvements. This underscores the importance of utilizing multiple modalities for OOD detection. Based on the observation of Modality Prediction Discrepancy between in-distribution (ID) and OOD data, and its strong correlation with OOD performance, we propose the Agree-to-Disagree (A2D) algorithm to encourage such discrepancy during training. Moreover, we introduce a novel outlier synthesis method, NP-Mix, which explores broader feature spaces by leveraging the information from nearest neighbor classes and complements A2D to strengthen OOD detection performance. Extensive experiments on MultiOOD demonstrate that training with A2D and NP-Mix improves existing OOD detection algorithms by a large margin. Our source code and MultiOOD benchmark are available at https://github.com/donghao51/MultiOOD.

URLs: https://github.com/donghao51/MultiOOD

replace-cross Task-Agnostic Machine-Learning-Assisted Inference

Authors: Jiacheng Miao, Qiongshi Lu

Abstract: Machine learning (ML) is playing an increasingly important role in scientific research. In conjunction with classical statistical approaches, ML-assisted analytical strategies have shown great promise in accelerating research findings. This has also opened a whole field of methodological research focusing on integrative approaches that leverage both ML and statistics to tackle data science challenges. One type of study that has quickly gained popularity employs ML to predict unobserved outcomes in massive samples, and then uses predicted outcomes in downstream statistical inference. However, existing methods designed to ensure the validity of this type of post-prediction inference are limited to very basic tasks such as linear regression analysis. This is because any extension of these approaches to new, more sophisticated statistical tasks requires task-specific algebraic derivations and software implementations, which ignores the massive library of existing software tools already developed for the same scientific problem given observed data. This severely constrains the scope of application for post-prediction inference. To address this challenge, we introduce a novel statistical framework named PSPS for task-agnostic ML-assisted inference. It provides a post-prediction inference solution that can be easily plugged into almost any established data analysis routine. It delivers valid and efficient inference that is robust to an arbitrary choice of ML model, allowing nearly all existing statistical frameworks to be incorporated into the analysis of ML-predicted data. Through extensive experiments, we showcase our method's validity, versatility, and superiority compared to existing approaches. Our software is available at https://github.com/qlu-lab/psps.

URLs: https://github.com/qlu-lab/psps

replace-cross Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

Authors: Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, Kaisheng Ma

Abstract: In this work, we introduce Unique3D, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large 2D diffusion models, but they usually suffer from long per-case optimization times and inconsistency issues. Recent works address the problem and generate better 3D results either by finetuning a multi-view diffusion model or training a fast feed-forward model. However, they still lack intricate textures and complex geometries due to inconsistency and limited generated resolution. To simultaneously achieve high fidelity, consistency, and efficiency in single image-to-3D, we propose a novel framework Unique3D that includes a multi-view diffusion model with a corresponding normal diffusion model to generate multi-view images with their normal maps, a multi-level upscale process to progressively improve the resolution of generated orthographic multi-views, as well as an instant and consistent mesh reconstruction algorithm called ISOMER, which fully integrates the color and geometric priors into mesh results. Extensive experiments demonstrate that our Unique3D significantly outperforms other image-to-3D baselines in terms of geometric and textural details.

replace-cross What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

Authors: Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi

Abstract: Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. With an aim to investigate the reasons behind this finding, we conduct controlled experiments to study various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code and data are available at: https://github.com/CVMI-Lab/clip-beyond-tail.

URLs: https://github.com/CVMI-Lab/clip-beyond-tail

replace-cross Embedding-Aligned Language Models

Authors: Guy Tennenholtz, Yinlam Chow, Chih-Wei Hsu, Lior Shani, Ethan Liang, Craig Boutilier

Abstract: We propose a novel approach for training large language models (LLMs) to adhere to objectives defined within a latent embedding space. Our method leverages reinforcement learning (RL), treating a pre-trained LLM as an environment. Our embedding-aligned guided language (EAGLE) agent is trained to iteratively steer the LLM's generation towards optimal regions of the latent embedding space, w.r.t. some predefined criterion. We demonstrate the effectiveness of the EAGLE agent using the MovieLens 25M and Amazon Review datasets to surface content gaps that satisfy latent user demand. We also demonstrate the benefit of using an optimal design of a state-dependent action set to improve EAGLE's efficiency. Our work paves the way for controlled and grounded text generation using LLMs, ensuring consistency with domain-specific knowledge and data representations.

replace-cross Non-geodesically-convex optimization in the Wasserstein space

Authors: Hoang Phuc Hau Luu, Hanlin Yu, Bernardo Williams, Petrus Mikkola, Marcelo Hartmann, Kai Puolamäki, Arto Klami

Abstract: We study a class of optimization problems in the Wasserstein space (the space of probability measures) where the objective function is nonconvex along generalized geodesics. Specifically, the objective exhibits some difference-of-convex structure along these geodesics. The setting also encompasses sampling problems where the logarithm of the target distribution is difference-of-convex. We derive multiple convergence insights for a novel semi Forward-Backward Euler scheme under several nonconvex (and possibly nonsmooth) regimes. Notably, the semi Forward-Backward Euler is just a slight modification of the Forward-Backward Euler whose convergence is -- to our knowledge -- still unknown in our very general non-geodesically-convex setting.

replace-cross Schrödinger Bridge with Quadratic State Cost is Exactly Solvable

Authors: Alexis M. H. Teter, Wenqing Wang, Abhishek Halder

Abstract: The Schrödinger bridge is a diffusion process that steers a given distribution to another in a prescribed time while minimizing the effort to do so. It can be seen as the stochastic dynamical version of the optimal mass transport, and has growing applications in generative diffusion models and stochastic optimal control. We say a Schrödinger bridge is "exactly solvable" if the associated uncontrolled Markov kernel is available in closed form, since then the bridge can be numerically computed using dynamic Sinkhorn recursion for arbitrary endpoint distributions with finite second moments. In this work, we propose a regularized variant of the Schrödinger bridge with a quadratic state cost-to-go that incentivizes the optimal sample paths to stay close to a nominal level. Unlike the conventional Schrödinger bridge, the regularization induces a state-dependent rate of killing and creation of probability mass, and its solution requires determining the Markov kernel of a reaction-diffusion partial differential equation. We derive this Markov kernel in closed form, showing that the regularized Schrödinger bridge is exactly solvable, even for non-Gaussian endpoints. This advances the state of the art because a closed-form Markov kernel for the regularized Schrödinger bridge is available in the existing literature only for Gaussian endpoints. Our solution recovers the heat kernel in the vanishing regularization (i.e., diffusion without reaction) limit, thereby recovering the solution of the conventional Schrödinger bridge as a special case. We deduce properties of the new kernel and explain its connections with certain exactly solvable models in quantum mechanics.

replace-cross HYDRA: Model Factorization Framework for Black-Box LLM Personalization

Authors: Yuchen Zhuang, Haotian Sun, Yue Yu, Rushi Qiang, Qifan Wang, Chao Zhang, Bo Dai

Abstract: Personalization has emerged as a critical research area in modern intelligent systems, focusing on mining users' behavioral history and adapting to their preferences for delivering tailored experiences. Despite the remarkable few-shot capabilities exhibited by black-box large language models (LLMs), the inherent opacity of their model parameters presents significant challenges in aligning the generated output with individual expectations. Existing solutions have primarily focused on prompt design to incorporate user-specific profiles and behaviors; however, such approaches often struggle to generalize effectively due to their inability to capture shared knowledge among all users. To address these challenges, we propose HYDRA, a model factorization framework that captures both user-specific behavior patterns from historical data and shared general knowledge among all users to deliver personalized generation. In order to capture user-specific behavior patterns, we first train a reranker to prioritize the most useful information from top-retrieved relevant historical records. By combining the prioritized history with the corresponding query, we train an adapter to align the output with individual user-specific preferences, eliminating the reliance on access to inherent model parameters of black-box LLMs. Both the reranker and the adapter can be decomposed into a base model with multiple user-specific heads, resembling a hydra. The base model maintains shared knowledge across users, while the multiple personal heads capture user-specific preferences. Experimental results demonstrate that HYDRA outperforms existing state-of-the-art prompt-based methods by an average relative improvement of 9.01% across five diverse personalization tasks in the LaMP benchmark. Our implementation is available at https://github.com/night-chen/HYDRA.

URLs: https://github.com/night-chen/HYDRA

replace-cross Style Mixture of Experts for Expressive Text-To-Speech Synthesis

Authors: Ahad Jawaid, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman

Abstract: Recent advances in style transfer text-to-speech (TTS) have improved the expressiveness of synthesized speech. However, encoding stylistic information (e.g., timbre, emotion, and prosody) from diverse and unseen reference speech remains a challenge. This paper introduces StyleMoE, an approach that addresses the issue of learning averaged style representations in the style encoder by creating style experts that learn from subsets of data. The proposed method replaces the style encoder in a TTS framework with a Mixture of Experts (MoE) layer. The style experts specialize by learning from subsets of reference speech routed to them by the gating network, enabling them to handle different aspects of the style space. As a result, StyleMoE improves the style coverage of the style encoder for style transfer TTS. Our experiments, both objective and subjective, demonstrate improved style transfer for diverse and unseen reference speech. The proposed method enhances the performance of existing state-of-the-art style transfer TTS models and represents the first study of style MoE in TTS.
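
The routing idea described in the abstract above, where a gating network sends each reference input to a subset of experts, can be illustrated with a generic sparse Mixture of Experts forward pass. This is a minimal Python sketch under stated assumptions, not the StyleMoE architecture: the linear gate, the softmax plus top-k selection, and the names `softmax` and `moe_forward` are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=1):
    """Route input x to the top_k experts chosen by a linear gate.

    gate_weights: one weight row per expert (assumed linear gating network).
    experts: list of callables mapping a feature list to a feature list.
    Returns the gate-weighted mix of the chosen experts' outputs and the
    indices of the experts that were used.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalise over the selected experts
    output = None
    for i in chosen:
        weight = probs[i] / norm
        y = experts[i](x)
        if output is None:
            output = [weight * v for v in y]
        else:
            output = [o + weight * v for o, v in zip(output, y)]
    return output, chosen


if __name__ == "__main__":
    experts = [lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v]]
    gate = [[2.0, 0.0], [0.0, 2.0]]
    print(moe_forward([1.0, 0.0], gate, experts, top_k=1))
```

With `top_k=1` each input is handled by a single expert, so during training each expert would only receive gradients from the inputs routed to it, which is one way experts come to specialize on subsets of the data.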

replace-cross DALD: Improving Logits-based Detector without Logits from Black-box LLMs

Authors: Cong Zeng, Shengkun Tang, Xianjun Yang, Yuanzhou Chen, Yiyou Sun, Zhiqiang Xu, Yao Li, Haifeng Chen, Wei Cheng, Dongkuan Xu

Abstract: The advent of Large Language Models (LLMs) has revolutionized text generation, producing outputs that closely mimic human writing. This blurring of lines between machine- and human-written text presents new challenges in distinguishing one from the other, a task further complicated by the frequent updates and closed nature of leading proprietary LLMs. Traditional logits-based detection methods leverage surrogate models for identifying LLM-generated content when the exact logits are unavailable from black-box LLMs. However, these methods grapple with the misalignment between the distributions of the surrogate and the often undisclosed target models, leading to performance degradation, particularly with the introduction of new, closed-source models. Furthermore, while current methodologies are generally effective when the source model is identified, they falter in scenarios where the model version remains unknown, or the test set comprises outputs from various source models. To address these limitations, we present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection even without logits from source LLMs. DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations with minimal training investment. By leveraging corpus samples from publicly accessible outputs of advanced models such as ChatGPT, GPT-4 and Claude-3, DALD fine-tunes surrogate models to synchronize with unknown source model distributions effectively.

replace-cross MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training

Authors: Bo Chen, Zhilei Bei, Xingyi Cheng, Pan Li, Jie Tang, Le Song

Abstract: Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families. The accuracy of protein structure predictions is often compromised for protein sequences that lack sufficient homologous information to construct high-quality MSA. Although various methods have been proposed to generate virtual MSA under these conditions, they fall short in comprehensively capturing the intricate coevolutionary patterns within MSA or require guidance from external oracle models. Here we introduce MSAGPT, a novel approach to prompt protein structure predictions via MSA generative pre-training in the low-MSA regime. MSAGPT employs a simple yet effective 2D evolutionary positional encoding scheme to model complex evolutionary patterns. Building on this, its flexible 1D MSA decoding framework facilitates zero- or few-shot learning. Moreover, we demonstrate that leveraging the feedback from AlphaFold2 can further enhance the model capacity via Rejective Fine-tuning (RFT) and Reinforcement Learning from AF2 Feedback (RLAF). Extensive experiments confirm the efficacy of MSAGPT in generating faithful virtual MSA to enhance the structure prediction accuracy. The transfer learning capabilities also highlight its great potential for facilitating other protein tasks.

replace-cross GemNet: Menu-Based, Strategy-Proof Multi-Bidder Auctions Through Deep Learning

Authors: Yanchen Jiang, David C. Parkes, Tonghan Wang

Abstract: Automated mechanism design (AMD) uses computational methods for mechanism design. Differentiable economics is a form of AMD that uses deep learning to learn mechanism designs and has enabled strong progress in AMD in recent years. Nevertheless, a major open problem has been to learn multi-bidder, general, and fully strategy-proof (SP) auctions. We introduce GEneral Menu-based NETwork (GemNet), which significantly extends the menu-based approach of the single-bidder RochetNet (Dütting et al., 2024) to the multi-bidder setting. The challenge in achieving SP is to learn bidder-independent menus that are feasible, so that the optimal menu choices for each bidder do not over-allocate items when taken together (we call this menu compatibility). GemNet penalizes the failure of menu compatibility during training, and transforms learned menus after training through price changes, by considering a set of discretized bidder values and reasoning about Lipschitz smoothness to guarantee menu compatibility on the entire value space. This approach is general, leaving trained menus that already satisfy menu compatibility undisturbed and reducing to RochetNet for a single bidder. Mixed-integer linear programs are used for menu transforms, and through a number of optimizations enabled by deep learning, including adaptive grids and methods to skip menu elements, we scale to large auction design problems. GemNet learns auctions with better revenue than affine maximization methods, achieves exact SP whereas previous general multi-bidder methods are approximately SP, and offers greatly enhanced interpretability.

replace-cross ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models

Authors: Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang

Abstract: LLM development involves pre-training a foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts can pose significant memory challenges, as loading all experts onto devices is impractical, and frequent switching between experts in response to user requests can incur substantial I/O costs. Previous approaches decompose the expert weights as the pre-trained weights plus delta weights, followed by quantizing the delta weights using output channel-wise step sizes to reduce the model size. However, these methods overlook the fact that certain input channels of delta weights can cause significant quantization errors at extremely low bitwidths. Additionally, existing methods assume that the appropriate model for a user request is known in advance, which is not the case in practice. To this end, we introduce ME-Switch, a memory-efficient expert switching framework tailored for serving multiple LLMs. To condense the number of bits required for describing the delta weights, we propose a salient-aware delta compression method that identifies salient input channels based on reconstruction error and applies mixed-precision quantization, reducing non-salient channels to low bits while keeping salient ones intact, cutting storage demand without compromising performance. Moreover, we develop a model-level routing method that efficiently directs user queries to the most suitable expert by performing domain classification. Extensive experiments show the promising memory efficiency and routing performance of ME-Switch. For example, when serving three models from the Mistral-7B family, ME-Switch reduces the model size by $1.74\times$ and maintains nearly lossless performance on instruction, mathematical reasoning, and code generation tasks. Notably, our method can efficiently serve 16 Mistral-7B models on a single NVIDIA A100 GPU.
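
The delta-weight decomposition and salient-channel selection described in the abstract above can be sketched on a toy weight matrix. This is an illustrative Python sketch under stated assumptions, not the ME-Switch implementation: the symmetric uniform quantizer, the squared-error saliency score, and the names `quantize` and `compress_delta` are hypothetical. Rows are treated as output channels (each with its own step size) and columns as input channels.

```python
def quantize(values, bits):
    """Symmetric uniform quantisation of one output channel (assumed scheme).

    The step size is derived from the largest magnitude in `values`, so each
    row passed in gets its own step size.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    step = peak / levels
    return [round(v / step) * step for v in values]


def compress_delta(base, expert, bits=2, n_salient=1):
    """Quantise delta = expert - base, keeping the worst input channels intact.

    Returns the mixed-precision delta and the set of salient column indices.
    """
    rows, cols = len(base), len(base[0])
    delta = [[expert[i][j] - base[i][j] for j in range(cols)] for i in range(rows)]
    quantized = [quantize(row, bits) for row in delta]
    # Squared reconstruction error accumulated per input channel (column).
    errors = [
        sum((delta[i][j] - quantized[i][j]) ** 2 for i in range(rows))
        for j in range(cols)
    ]
    ranked = sorted(range(cols), key=lambda j: errors[j], reverse=True)
    salient = set(ranked[:n_salient])
    mixed = [
        [delta[i][j] if j in salient else quantized[i][j] for j in range(cols)]
        for i in range(rows)
    ]
    return mixed, salient


if __name__ == "__main__":
    base = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    expert = [[1.0, 0.4, 0.0], [-1.0, 0.0, 0.3]]
    print(compress_delta(base, expert, bits=2, n_salient=1))
```

At serving time the expert weights would be rebuilt as the shared base plus this compressed delta, so only the small delta needs to be swapped when a different expert is requested.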

replace-cross CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making

Authors: Zibin Dong, Yifu Yuan, Jianye Hao, Fei Ni, Yi Ma, Pengyi Li, Yan Zheng

Abstract: Leveraging the powerful generative capability of diffusion models (DMs) to build decision-making agents has achieved extensive success. However, there is still a demand for an easy-to-use and modularized open-source library that offers customized and efficient development for DM-based decision-making algorithms. In this work, we introduce CleanDiffuser, the first DM library specifically designed for decision-making algorithms. By revisiting the roles of DMs in the decision-making domain, we identify a set of essential sub-modules that constitute the core of CleanDiffuser, allowing for the implementation of various DM algorithms with simple and flexible building blocks. To demonstrate the reliability and flexibility of CleanDiffuser, we conduct comprehensive evaluations of various DM algorithms implemented with CleanDiffuser across an extensive range of tasks. The analytical experiments provide a wealth of valuable design choices and insights, reveal opportunities and challenges, and lay a solid groundwork for future research. CleanDiffuser will provide long-term support to the decision-making community, enhancing reproducibility and fostering the development of more robust solutions. The code and documentation of CleanDiffuser are open-sourced at https://github.com/CleanDiffuserTeam/CleanDiffuser.

URLs: https://github.com/CleanDiffuserTeam/CleanDiffuser

replace-cross Bayesian Bandit Algorithms with Approximate Inference in Stochastic Linear Bandits

Authors: Ziyi Huang, Henry Lam, Haofeng Zhang

Abstract: Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. Despite their superior practical performance, their theoretical justification has been less investigated in the literature, especially for contextual bandit problems. To fill this gap, we propose a theoretical framework to analyze the impact of approximate inference in stochastic linear bandits and conduct regret analysis on two Bayesian bandit algorithms, Linear Thompson Sampling (LinTS) and the extension of Bayesian Upper Confidence Bound, namely Linear Bayesian Upper Confidence Bound (LinBUCB). We demonstrate that when applied in the presence of approximate inference, LinTS and LinBUCB can preserve their original rates of regret upper bound, at the cost of larger constant terms. These results hold for general Bayesian inference approaches, assuming the inference error measured by two different $\alpha$-divergences is bounded. Additionally, by introducing a new definition of well-behaved distributions, we show that LinBUCB improves on the regret rate of LinTS from $\tilde{O}(d^{3/2}\sqrt{T})$ to $\tilde{O}(d\sqrt{T})$, matching the minimax optimal rate. To our knowledge, this work provides the first regret bounds in the setting of stochastic linear bandits with bounded approximate inference errors.

replace-cross Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking

Authors: Yuwei Zhang, Tong Xia, Jing Han, Yu Wu, Georgios Rizos, Yang Liu, Mohammed Mosuily, Jagmohan Chauhan, Cecilia Mascolo

Abstract: Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide range of healthcare applications, yet is currently under-explored. The main problem for those applications arises from the difficulty in collecting large labeled task-specific data for model development. Generalizable respiratory acoustic foundation models pretrained with unlabeled data would offer appealing advantages and possibly unlock this impasse. However, given the safety-critical nature of healthcare applications, it is pivotal to also ensure openness and replicability for any proposed foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory Acoustic foundation model pretraining and benchmarking system, as the first approach answering this need. We curate large-scale respiratory audio datasets (~136K samples, over 400 hours), pretrain three pioneering foundation models, and build a benchmark consisting of 19 downstream respiratory health tasks for evaluation. Our pretrained models demonstrate superior performance (against existing acoustic models pretrained with general audio on 16 out of 19 tasks) and generalizability (to unseen datasets and new respiratory audio modalities). This highlights the great promise of respiratory acoustic foundation models and encourages more studies using OPERA as an open resource to accelerate research on respiratory audio for health. The system is accessible from https://github.com/evelyn0414/OPERA.

URLs: https://github.com/evelyn0414/OPERA

replace-cross GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Authors: Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, Zhihao Jia

Abstract: Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages, which concurrently perform DNN training for different micro-batches in a pipeline fashion. However, existing pipeline-parallel approaches only consider sequential pipeline stages and thus ignore the topology of a DNN, resulting in missed model-parallel opportunities. This paper presents graph pipeline parallelism (GPP), a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph. GPP generalizes existing sequential pipeline parallelism and preserves the inherent topology of a DNN to enable concurrent execution of computationally-independent operators, resulting in reduced memory requirement and improved GPU performance. In addition, we develop GraphPipe, a distributed system that exploits GPP strategies to enable performant and scalable DNN training. GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies. Evaluation on a variety of DNNs shows that GraphPipe outperforms existing pipeline-parallel systems such as PipeDream and Piper by up to 1.6X. GraphPipe also reduces the search time by 9-21X compared to PipeDream and Piper.

replace-cross Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels

Authors: Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, Mihai Surdeanu

Abstract: We present a simple meta quantization approach that quantizes different layers of a large language model (LLM) at different bit levels, and is independent of the underlying quantization technique. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (higher is better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (smaller is better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25-50% of layers are moved to lower quantization using our proposed ordering but only until 5-10% if moved using no specific ordering; (b) Adding layer importance to inherently dynamic quantization techniques can further improve their performance, showing that our approach is complementary to other dynamic quantization methods; (c) Quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and (d) Layer-wise quantization to lower bits works better in the case of larger LLMs with more layers compared to smaller LLMs with fewer layers. Our code is publicly available at https://github.com/RazvanDu/LayerwiseQuant/.
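The first importance strategy can be sketched as follows. The abstract does not say how "different" is measured, so cosine distance is an assumption here, as are the function names, the two bit levels, and the `fraction_low` parameter. This is an illustration of the ordering idea, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layer_importance(layer_inputs, layer_outputs):
    """Score each layer by how much it changes its input (assumed: cosine distance)."""
    return [1.0 - cosine_similarity(x, y) for x, y in zip(layer_inputs, layer_outputs)]


def assign_bits(scores, high_bits=4, low_bits=2, fraction_low=0.5):
    """Give the least important fraction of layers the lower bit width."""
    n_low = int(len(scores) * fraction_low)
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # least important first
    bits = [high_bits] * len(scores)
    for i in order[:n_low]:
        bits[i] = low_bits
    return bits


# Toy per-layer embeddings (hypothetical values): layers 0 and 3 barely change their input.
inputs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
scores = layer_importance(inputs, outputs)
bits = assign_bits(scores)
print(bits)
```

Because the scoring step is separate from the bit assignment, the same ordering could in principle be paired with any underlying quantizer, which matches the abstract's claim of independence from the quantization technique.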

URLs: https://github.com/RazvanDu/LayerwiseQuant/.

replace-cross Aligning Target-Aware Molecule Diffusion Models with Exact Energy Optimization

Authors: Siyi Gu, Minkai Xu, Alexander Powers, Weili Nie, Tomas Geffner, Karsten Kreis, Jure Leskovec, Arash Vahdat, Stefano Ermon

Abstract: Generating ligand molecules for specific protein targets, known as structure-based drug design, is a fundamental problem in therapeutics development and biological discovery. Recently, target-aware generative models, especially diffusion models, have shown great promise in modeling protein-ligand interactions and generating candidate drugs. However, existing models primarily focus on learning the chemical distribution of all drug candidates, which lacks effective steerability on the chemical quality of model generations. In this paper, we propose a novel and general alignment framework to align pretrained target diffusion models with preferred functional properties, named AliDiff. AliDiff shifts the target-conditioned chemical distribution towards regions with higher binding affinity and structural rationality, specified by user-defined reward functions, via the preference optimization approach. To avoid the overfitting problem in common preference optimization objectives, we further develop an improved Exact Energy Preference Optimization method to yield an exact and efficient alignment of the diffusion models, and provide the closed-form expression for the converged distribution. Empirical studies on the CrossDocked2020 benchmark show that AliDiff can generate molecules with state-of-the-art binding energies with up to -7.07 Avg. Vina Score, while maintaining strong molecular properties. Code is available at https://github.com/MinkaiXu/AliDiff.

URLs: https://github.com/MinkaiXu/AliDiff.

replace-cross Securing Multi-turn Conversational Language Models From Distributed Backdoor Triggers

Authors: Terry Tong, Jiashu Xu, Qin Liu, Muhao Chen

Abstract: Large language models (LLMs) have acquired the ability to handle longer context lengths and understand nuances in text, expanding their dialogue capabilities beyond a single utterance. A popular user-facing application of LLMs is the multi-turn chat setting. Though longer chat memory and better understanding may seemingly benefit users, our paper exposes a vulnerability that leverages the multi-turn feature and strong learning ability of LLMs to harm the end-user: the backdoor. We demonstrate that LLMs can capture the combinational backdoor representation. Only upon presentation of triggers together does the backdoor activate. We also verify empirically that this representation is invariant to the position of the trigger utterance. Subsequently, inserting a single extra token into two utterances of 5% of the data can cause over 99% Attack Success Rate (ASR). Our results with 3 triggers demonstrate that this framework is generalizable, compatible with any trigger in an adversary's toolbox in a plug-and-play manner. Defending the backdoor can be challenging in the chat setting because of the large input and output space. Our analysis indicates that the distributed backdoor exacerbates the current challenges by polynomially increasing the dimension of the attacked input space. Canonical textual defenses like ONION and BKI leverage auxiliary model forward passes over individual tokens, scaling exponentially with the input sequence length and struggling to maintain computational feasibility. To this end, we propose a decoding time defense - decayed contrastive decoding - that scales linearly with assistant response sequence length and reduces the backdoor to as low as 0.35%.

replace-cross SE(3)-bi-equivariant Transformers for Point Cloud Assembly

Authors: Ziming Wang, Rebecka Jörnsten

Abstract: Given a pair of point clouds, the goal of assembly is to recover a rigid transformation that aligns one point cloud to the other. This task is challenging because the point clouds may be non-overlapped, and they may have arbitrary initial positions. To address these difficulties, we propose a method, called SE(3)-bi-equivariant transformer (BITR), based on the SE(3)-bi-equivariance prior of the task: it guarantees that when the inputs are rigidly perturbed, the output will transform accordingly. Due to its equivariance property, BITR can not only handle non-overlapped PCs, but also guarantee robustness against initial positions. Specifically, BITR first extracts features of the inputs using a novel $SE(3) \times SE(3)$-transformer, and then projects the learned feature to group SE(3) as the output. Moreover, we theoretically show that swap and scale equivariances can be incorporated into BITR, thus it further guarantees stable performance under scaling and swapping the inputs. We experimentally show the effectiveness of BITR in practical tasks.

replace-cross XEdgeAI: A Human-centered Industrial Inspection Framework with Data-centric Explainable Edge AI Approach

Authors: Truong Thanh Hung Nguyen, Phuc Truong Loc Nguyen, Hung Cao

Abstract: Recent advancements in deep learning have significantly improved visual quality inspection and predictive maintenance within industrial settings. However, deploying these technologies on low-resource edge devices poses substantial challenges due to their high computational demands and the inherent complexity of Explainable AI (XAI) methods. This paper addresses these challenges by introducing a novel XAI-integrated Visual Quality Inspection framework that optimizes the deployment of semantic segmentation models on low-resource edge devices. Our framework incorporates XAI and the Large Vision Language Model to deliver human-centered interpretability through visual and textual explanations to end-users. This is crucial for end-user trust and model interpretability. We outline a comprehensive methodology consisting of six fundamental modules: base model fine-tuning, XAI-based explanation generation, evaluation of XAI approaches, XAI-guided data augmentation, development of an edge-compatible model, and the generation of understandable visual and textual explanations. Through XAI-guided data augmentation, the enhanced model incorporating domain expert knowledge with visual and textual explanations is successfully deployed on mobile devices to support end-users in real-world scenarios. Experimental results showcase the effectiveness of the proposed framework, with the mobile model achieving competitive accuracy while significantly reducing model size. This approach paves the way for the broader adoption of reliable and interpretable AI tools in critical industrial applications, where decisions must be both rapid and justifiable. Our code for this work can be found at https://github.com/Analytics-Everywhere-Lab/vqixai.

URLs: https://github.com/Analytics-Everywhere-Lab/vqixai.

replace-cross Decomposed Direct Preference Optimization for Structure-Based Drug Design

Authors: Xiwei Cheng, Xiangxin Zhou, Yuwei Yang, Yu Bao, Quanquan Gu

Abstract: Diffusion models have achieved promising results for Structure-Based Drug Design (SBDD). Nevertheless, high-quality protein subpocket and ligand data are relatively scarce, which hinders the models' generation capabilities. Recently, Direct Preference Optimization (DPO) has emerged as a pivotal tool for aligning generative models with human preferences. In this paper, we propose DecompDPO, a structure-based optimization method that aligns diffusion models with pharmaceutical needs using multi-granularity preference pairs. DecompDPO introduces decomposition into the optimization objectives and obtains preference pairs at the molecule or decomposed substructure level based on each objective's decomposability. Additionally, DecompDPO introduces a physics-informed energy term to ensure reasonable molecular conformations in the optimization results. Notably, DecompDPO can be effectively used for two main purposes: (1) fine-tuning pretrained diffusion models for molecule generation across various protein families, and (2) molecular optimization given a specific protein subpocket after generation. Extensive experiments on the CrossDocked2020 benchmark show that DecompDPO significantly improves model performance, achieving up to 95.2% Med. High Affinity and a 36.2% success rate for molecule generation, and 100% Med. High Affinity and a 52.1% success rate for molecular optimization.

replace-cross CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning

Authors: Emanuele Frascaroli, Aniello Panariello, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara

Abstract: With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, fine-tuning large pre-trained models has recently become a prevalent strategy in Continual Learning. This has led to the development of numerous prompting strategies to adapt transformer-based models without incurring catastrophic forgetting. However, these strategies often compromise the original zero-shot capabilities of the pre-trained CLIP model and struggle to adapt to domains that significantly deviate from the pre-training data. In this work, we propose Continual Generative training for Incremental prompt-Learning, a simple and novel approach to mitigate forgetting while adapting CLIP. Briefly, we employ Variational Autoencoders (VAEs) to learn class-conditioned distributions within the embedding space of the visual encoder. We then exploit these distributions to sample new synthetic visual embeddings and train the corresponding class-specific textual prompts during subsequent tasks. Through extensive experiments on different domains, we show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities, evaluated using a novel metric tailored for CL scenarios. Notably, further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at https://github.com/aimagelab/mammoth.

URLs: https://github.com/aimagelab/mammoth.

replace-cross Course-Correction: Safety Alignment Using Synthetic Preferences

Authors: Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu

Abstract: The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform the task of course-correction, i.e., the model can steer away from generating harmful content autonomously. To start with, we introduce the C$^2$-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C$^2$-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on 2 LLMs, Llama2-Chat 7B and Qwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs' safety, particularly in resisting jailbreak attacks.

replace-cross On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures

Authors: Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter

Abstract: In this work we evaluate the utility of synthetic data for training automatic speech recognition (ASR). We use the ASR training data to train a text-to-speech (TTS) system similar to FastSpeech-2. With this TTS we reproduce the original training data, training ASR systems solely on synthetic data. For ASR, we use three different architectures, attention-based encoder-decoder, hybrid deep neural network hidden Markov model and a Gaussian mixture hidden Markov model, showing the different sensitivity of the models to synthetic data generation. In order to extend previous work, we present a number of ablation studies on the effectiveness of synthetic vs. real training data for ASR. In particular we focus on how the gap between training on synthetic and real data changes by varying the speaker embedding or by scaling the model size. For the latter we show that the TTS models generalize well, even when training scores indicate overfitting.

replace-cross SAM 2: Segment Anything in Images and Videos

Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer

Abstract: We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, dataset, as well as code for model training and our demo.

replace-cross NeuralBeta: Estimating Beta Using Deep Learning

Authors: Yuxin Liu, Jimin Lin, Achintya Gopal

Abstract: Traditional approaches to estimating beta in finance often involve rigid assumptions and fail to adequately capture beta dynamics, limiting their effectiveness in use cases like hedging. To address these limitations, we have developed a novel method using neural networks called NeuralBeta, which is capable of handling both univariate and multivariate scenarios and tracking the dynamic behavior of beta. To address the issue of interpretability, we introduce a new output layer inspired by regularized weighted linear regression, which provides transparency into the model's decision-making process. We conducted extensive experiments on both synthetic and market data, demonstrating NeuralBeta's superior performance compared to benchmark methods across various scenarios, especially instances where beta is highly time-varying, e.g., during regime shifts in the market. This model not only represents an advancement in the field of beta estimation, but also shows potential for applications in other financial contexts that assume linear relationships.
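For readers unfamiliar with the regression that the output layer is said to be inspired by, here is a minimal univariate sketch of regularized weighted least squares without an intercept. It is not NeuralBeta: in this illustration the observation weights are a fixed, hypothetical exponential decay, and the names `weighted_ridge_beta` and `lam` are invented for the example.

```python
def weighted_ridge_beta(x, y, w, lam=0.0):
    """Closed-form beta minimizing sum_i w_i * (y_i - beta * x_i)**2 + lam * beta**2."""
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x)) + lam
    return num / den


# Hypothetical market and asset returns with a true beta of 1.5.
market = [0.01, -0.02, 0.015, 0.03]
asset = [1.5 * r for r in market]

# Hypothetical weights that emphasize recent observations.
n = len(market)
weights = [0.5 ** (n - 1 - i) for i in range(n)]

beta_plain = weighted_ridge_beta(market, asset, weights)
beta_shrunk = weighted_ridge_beta(market, asset, weights, lam=1e-3)
print(beta_plain, beta_shrunk)
```

Since the estimate is a weighted combination of the observations, inspecting the weights shows which data points drive it, which is one plausible reading of the transparency the abstract attributes to this kind of output layer.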

replace-cross A Comparative Analysis of Wealth Index Predictions in Africa between three Multi-Source Inference Models

Authors: Márton Karsai, János Kertész, Lisette Espín-Noboa

Abstract: Poverty map inference has become a critical focus of research, utilizing both traditional and modern techniques, ranging from regression models to convolutional neural networks applied to tabular data, satellite imagery, and networks. While much attention has been given to validating models during the training phase, the final predictions have received less scrutiny. In this study, we analyze the International Wealth Index (IWI) predicted by Lee and Braithwaite (2022) and Espín-Noboa et al. (2023), alongside the Relative Wealth Index (RWI) inferred by Chi et al. (2022), across six Sub-Saharan African countries. Our analysis reveals trends and discrepancies in wealth predictions between these models. In particular, we find significant and unexpected discrepancies between the predictions of Lee and Braithwaite and Espín-Noboa et al., even after accounting for differences in training data. In contrast, the shapes of the wealth distributions predicted by Espín-Noboa et al. and Chi et al. are more closely aligned, suggesting similar levels of skewness. These findings raise concerns about the validity of certain models and emphasize the importance of rigorous audits for wealth prediction algorithms used in policy-making. Continuous validation and refinement are essential to ensure the reliability of these models, particularly when they inform poverty alleviation strategies.

replace-cross Few-Shot Transfer Learning for Individualized Braking Intent Detection on Neuromorphic Hardware

Authors: Nathan Lutes, Venkata Sriram Siddhardh Nadendla, K. Krishnamurthy

Abstract: Objective: This work explores use of a few-shot transfer learning method to train and implement a convolutional spiking neural network (CSNN) on a BrainChip Akida AKD1000 neuromorphic system-on-chip for developing individual-level, instead of traditionally used group-level, models using electroencephalographic data. Main Results: Efficacy of the above methodology to develop individual-specific braking intention predictive models by rapidly adapting the group-level model in as few as three training epochs while achieving at least 90% accuracy, true positive rate and true negative rate is presented. Further, results show the energy-efficiency of the neuromorphic hardware through a power reduction of over 97% with only a 1.3x increase in latency when using the Akida AKD1000 processor for network inference compared to an Intel Xeon central processing unit. Similar results were obtained in a subsequent ablation study using a subset of five out of 19 channels.

replace-cross Efficacy of Large Language Models in Systematic Reviews

Authors: Aaditya Shah, Shridhar Mehendale, Siddha Kanthi

Abstract: This study investigates the effectiveness of Large Language Models (LLMs) in interpreting existing literature through a systematic review of the relationship between Environmental, Social, and Governance (ESG) factors and financial performance. The primary objective is to assess how LLMs can replicate a systematic review on a corpus of ESG-focused papers. We compiled and hand-coded a database of 88 relevant papers published from March 2020 to May 2024. Additionally, we used a set of 238 papers from a previous systematic review of ESG literature from January 2015 to February 2020. We evaluated two current state-of-the-art LLMs, Meta AI's Llama 3 8B and OpenAI's GPT-4o, on the accuracy of their interpretations relative to human-made classifications on both sets of papers. We then compared these results to a "Custom GPT" and a fine-tuned GPT-4o Mini model using the corpus of 238 papers as training data. The fine-tuned GPT-4o Mini model outperformed the base LLMs by 28.3% on average in overall accuracy on prompt 1. At the same time, the "Custom GPT" showed a 3.0% and 15.7% improvement on average in overall accuracy on prompts 2 and 3, respectively. Our findings reveal promising results for investors and agencies to leverage LLMs to summarize complex evidence related to ESG investing, thereby enabling quicker decision-making and a more efficient market.

replace-cross Memorization in In-Context Learning

Authors: Shahriar Golchin, Mihai Surdeanu, Steven Bethard, Eduardo Blanco, Ellen Riloff

Abstract: In-context learning (ICL) has proven to be an effective strategy for improving the performance of large language models (LLMs) with no additional training. However, the exact mechanism behind this performance improvement remains unclear. This study is the first to show how ICL surfaces memorized training data and to explore the correlation between this memorization and performance on downstream tasks across various ICL regimes: zero-shot, few-shot, and many-shot. Our most notable findings include: (1) ICL significantly surfaces memorization compared to zero-shot learning in most cases; (2) demonstrations, without their labels, are the most effective element in surfacing memorization; (3) ICL improves performance when the surfaced memorization in few-shot regimes reaches a high level (about 40%); and (4) there is a very strong correlation between performance and memorization in ICL when it outperforms zero-shot learning. Overall, our study uncovers memorization as a new factor impacting ICL, raising an important question: to what extent do LLMs truly generalize from demonstrations in ICL, and how much of their success is due to memorization?

replace-cross Graph Attention Inference of Network Topology in Multi-Agent Systems

Authors: Akshay Kolli, Reza Azadeh, Kshitij Jerath

Abstract: Accurately identifying the underlying graph structures of multi-agent systems remains a difficult challenge. Our work introduces a novel machine learning-based solution that leverages the attention mechanism to predict future states of multi-agent systems by learning node representations. The graph structure is then inferred from the strength of the attention values. This approach is applied to both linear consensus dynamics and the non-linear dynamics of Kuramoto oscillators, resulting in implicit learning of the graph by learning good agent representations. Our results demonstrate that the presented data-driven graph attention machine learning model can identify the network topology in multi-agent systems, even when the underlying dynamic model is not known, as evidenced by the F1 scores achieved in the link prediction.

replace-cross AI Olympics challenge with Evolutionary Soft Actor Critic

Authors: Marco Calì, Alberto Sinigaglia, Niccolò Turcato, Ruggero Carli, Gian Antonio Susto

Abstract: In the following report, we describe the solution we propose for the AI Olympics competition held at IROS 2024. Our solution is based on a Model-free Deep Reinforcement Learning approach combined with an evolutionary strategy. We will briefly describe the algorithms that have been used and then provide details of the approach.

replace-cross PatternPaint: Generating Layout Patterns Using Generative AI and Inpainting Techniques

Authors: Guanglei Zhou, Bhargav Korrapati, Gaurav Rajavendra Reddy, Jiang Hu, Yiran Chen, Dipto G. Thakurta

Abstract: Generation of diverse VLSI layout patterns is crucial for various downstream tasks in design for manufacturing (DFM) studies. However, the lengthy design cycles often hinder the creation of a comprehensive layout pattern library, and new detrimental patterns may be discovered late in the product development process. Existing training-based ML pattern generation approaches struggle to produce legal layout patterns in the early stages of technology node development due to the limited availability of training samples. To address this challenge, we propose PatternPaint, a training-free framework capable of generating legal patterns with limited DRC Clean training samples. PatternPaint simplifies complex layout pattern generation into a series of inpainting processes with a template-based denoising scheme. Our framework enables even a general pre-trained image foundation model (stable-diffusion), to generate valuable pattern variations, thereby enhancing the library. Notably, PatternPaint can operate with any input size. Furthermore, we explore fine-tuning a pre-trained model with VLSI layout images, resulting in a 2x generation efficiency compared to the base model. Our results show that the proposed model can generate legal patterns in complex 2D metal interconnect design rule settings and achieves a high diversity score. The designed system, with its flexible settings, supports pattern generation with localized changes and design rule violation correction. Validated on a sub-3nm technology node (Intel 18A), PatternPaint is the first framework to generate a complex 2D layout pattern library using only 20 design rule clean layout patterns as input.

replace-cross Conformal Prediction in Dynamic Biological Systems

Authors: Alberto Portela, Julio R. Banga, Marcos Matabuena

Abstract: Uncertainty quantification (UQ) is the process of systematically determining and characterizing the degree of confidence in computational model predictions. In the context of systems biology, especially with dynamic models, UQ is crucial because it addresses the challenges posed by nonlinearity and parameter sensitivity, allowing us to properly understand and extrapolate the behavior of complex biological systems. Here, we focus on dynamic models represented by deterministic nonlinear ordinary differential equations. Many current UQ approaches in this field rely on Bayesian statistical methods. While powerful, these methods often require strong prior specifications and make parametric assumptions that may not always hold in biological systems. Additionally, these methods face challenges in domains where sample sizes are limited, and statistical inference becomes constrained, with computational speed being a bottleneck in large models of biological systems. As an alternative, we propose the use of conformal inference methods, introducing two novel algorithms that, in some instances, offer non-asymptotic guarantees, enhancing robustness and scalability across various applications. We demonstrate the efficacy of our proposed algorithms through several scenarios, highlighting their advantages over traditional Bayesian approaches. The proposed methods show promising results for diverse biological data structures and scenarios, offering a general framework to quantify uncertainty for dynamic models of biological systems. The software for the methodology and the reproduction of the results is available at https://zenodo.org/doi/10.5281/zenodo.13644870.
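As background on where non-asymptotic guarantees in conformal inference come from, the following is a minimal sketch of standard split conformal prediction. It is not either of the two algorithms proposed in the paper. It assumes exchangeable calibration residuals (for instance, absolute differences between held-out observations and a fitted model trajectory), and the function name and data values are hypothetical.

```python
import math


def split_conformal_halfwidth(calib_residuals, alpha):
    """Half-width q such that [pred - q, pred + q] has coverage >= 1 - alpha
    under exchangeability (standard split conformal, absolute-residual score)."""
    n = len(calib_residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the finite-sample quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calib_residuals)[k - 1]


# Hypothetical absolute residuals |observation - model prediction| on calibration data.
residuals = [0.1, 0.4, 0.2, 0.3, 0.5, 0.25, 0.15]
q = split_conformal_halfwidth(residuals, alpha=0.25)

prediction = 2.0  # hypothetical model output at a new time point
interval = (prediction - q, prediction + q)
print(q, interval)
```

The (n + 1) term in the rank is what makes the coverage statement hold for finite samples rather than only asymptotically, and it is also why very small calibration sets yield an infinite interval.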

URLs: https://zenodo.org/doi/10.5281/zenodo.13644870.

replace-cross Persistent pseudopod splitting is an effective chemotaxis strategy in shallow gradients

Authors: Albert Alonso, Julius B. Kirkegaard, Robert G. Endres

Abstract: Single-cell organisms and various cell types use a range of motility modes when following a chemical gradient, but it is unclear which mode is best suited for different gradients. Here, we model directional decision-making in chemotactic amoeboid cells as a stimulus-dependent actin recruitment contest. Pseudopods extending from the cell body compete for a finite actin pool to push the cell in their direction until one pseudopod wins and determines the direction of movement. Our minimal model provides a quantitative understanding of the strategies cells use to reach the physical limit of accurate chemotaxis, aligning with data without explicit gradient sensing or cellular memory for persistence. To generalize our model, we employ reinforcement learning optimization to study the effect of pseudopod suppression, a simple but effective cellular algorithm by which cells can suppress possible directions of movement. Different pseudopod-based chemotaxis strategies emerge naturally depending on the environment and its dynamics. For instance, in static gradients, cells can react faster at the cost of pseudopod accuracy, which is particularly useful in noisy, shallow gradients where it paradoxically increases chemotactic accuracy. In contrast, in dynamic gradients, cells form de novo pseudopods. Overall, our work demonstrates mechanical intelligence for high chemotaxis performance with minimal cellular regulation.

replace-cross Detection Made Easy: Potentials of Large Language Models for Solidity Vulnerabilities

Authors: Md Tauseef Alam, Raju Halder, Abyayananda Maiti

Abstract: The large-scale deployment of Solidity smart contracts on the Ethereum mainnet has increasingly attracted financially-motivated attackers in recent years. A few now-infamous attacks in Ethereum's history include the DAO attack in 2016 (50 million dollars lost), the Parity Wallet hack in 2017 (146 million dollars locked), Beautychain's token BEC in 2018 (900 million dollars market value fell to 0), and the NFT gaming blockchain breach in 2022 ($600 million in Ether stolen). This paper presents a comprehensive investigation of the use of large language models (LLMs) and their capabilities in detecting OWASP Top Ten vulnerabilities in Solidity. We introduce a novel, class-balanced, structured, and labeled dataset named VulSmart, which we use to benchmark and compare the performance of open-source LLMs such as CodeLlama, Llama2, CodeT5 and Falcon, alongside closed-source models like GPT-3.5 Turbo and GPT-4o Mini. Our proposed SmartVD framework is rigorously tested against these models through extensive automated and manual evaluations, utilizing BLEU and ROUGE metrics to assess the effectiveness of vulnerability detection in smart contracts. We also explore three distinct prompting strategies (zero-shot, few-shot, and chain-of-thought) to evaluate the multi-class classification and generative capabilities of the SmartVD framework. Our findings reveal that SmartVD outperforms its open-source counterparts and even exceeds the performance of closed-source base models like GPT-3.5 and GPT-4o Mini. After fine-tuning, the closed-source models, GPT-3.5 Turbo and GPT-4o Mini, achieved remarkable performance with 99% accuracy in detecting vulnerabilities, 94% in identifying their types, and 98% in determining severity. Notably, SmartVD performs best with the 'chain-of-thought' prompting technique, whereas the fine-tuned closed-source models excel with the 'zero-shot' prompting approach.

replace-cross LatentQGAN: A Hybrid QGAN with Classical Convolutional Autoencoder

Authors: Alexis Vieloszynski, Soumaya Cherkaoui, Jean-Frédéric Laprade, Oliver Nahman-Lévesque, Abdallah Aaraba, Shengrui Wang

Abstract: Quantum machine learning consists in taking advantage of quantum computations to generate classical data. A potential application of quantum machine learning is to harness the power of quantum computers for generating classical data, a process essential to a multitude of applications such as enriching training datasets, anomaly detection, and risk management in finance. Given the success of Generative Adversarial Networks in classical image generation, the development of its quantum versions has been actively conducted. However, existing implementations on quantum computers often face significant challenges, such as scalability and training convergence issues. To address these issues, we propose LatentQGAN, a novel quantum model that uses a hybrid quantum-classical GAN coupled with an autoencoder. Although it was initially designed for image generation, the LatentQGAN approach holds potential for broader application across various practical data generation tasks. Experimental outcomes on both classical simulators and noisy intermediate scale quantum computers have demonstrated significant performance enhancements over existing quantum methods, alongside a significant reduction in quantum resources overhead.

replace-cross NeuroPath: A Neural Pathway Transformer for Joining the Dots of Human Connectomes

Authors: Ziquan Wei, Tingting Dan, Jiaqi Ding, Guorong Wu

Abstract: Although modern imaging technologies allow us to study connectivity between two distinct brain regions in vivo, an in-depth understanding of how anatomical structure supports brain function and how spontaneous functional fluctuations give rise to remarkable cognition is still elusive. Meanwhile, tremendous efforts have been made in the realm of machine learning to establish the nonlinear mapping between neuroimaging data and phenotypic traits. However, the absence of neuroscience insight in the current approaches poses significant challenges in understanding cognitive behavior from transient neural activities. To address this challenge, we put the spotlight on the coupling mechanism of structural connectivity (SC) and functional connectivity (FC) by formulating this network neuroscience question as an expressive graph representation learning problem for high-order topology. Specifically, we introduce the concept of topological detour to characterize how a ubiquitous instance of FC (direct link) is supported by neural pathways (detour) physically wired by SC, which forms a cyclic loop in which brain structure and function interact. In the cliché of machine learning, the multi-hop detour pathway underlying SC-FC coupling allows us to devise a novel multi-head self-attention mechanism within Transformer to capture multi-modal feature representation from paired graphs of SC and FC. Taken together, we propose a biologically inspired deep model, coined NeuroPath, to find putative connectomic feature representations from the unprecedented amount of neuroimages, which can be plugged into various downstream applications such as task recognition and disease diagnosis. We have evaluated NeuroPath on large-scale public datasets including HCP and UK Biobank under supervised and zero-shot learning, where the state-of-the-art performance of NeuroPath indicates great potential in network neuroscience.

replace-cross FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark

Authors: Heegyu Kim, Taeyang Jeon, Seunghwan Choi, Seungtaek Choi, Hyunsouk Cho

Abstract: Text-to-SQL systems have become crucial for translating natural language into SQL queries in various industries, enabling non-technical users to perform complex data operations. The need for accurate evaluation methods has increased as these systems have grown more sophisticated. However, the Execution Accuracy (EX), the most prevalent evaluation metric, still shows many false positives and negatives. Thus, this paper introduces FLEX (False-Less EXecution), a novel approach to evaluating text-to-SQL systems using large language models (LLMs) to emulate human expert-level evaluation of SQL queries. Our metric improves agreement with human experts (from 62 to 87.04 in Cohen's kappa) with comprehensive context and sophisticated criteria. Our extensive experiments yield several key insights: (1) Models' performance increases by over 2.6 points on average, substantially affecting rankings on Spider and BIRD benchmarks; (2) The underestimation of models in EX primarily stems from annotation quality issues; and (3) Model performance on particularly challenging questions tends to be overestimated. This work contributes to a more accurate and nuanced evaluation of text-to-SQL systems, potentially reshaping our understanding of state-of-the-art performance in this field.
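
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of the textbook definition of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), for two raters (for instance, a human expert and an automatic metric) judging the same items. The function name `cohens_kappa` and the toy labels are illustrative assumptions; the abstract does not describe how the authors computed their figures.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Textbook Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (hypothetical): 1 = "query judged correct", 0 = "judged incorrect".
human = [1, 1, 1, 0]
metric = [1, 1, 0, 0]
print(cohens_kappa(human, metric))
```

Kappa discounts the agreement two raters would reach by chance, which is why it is a stricter measure than raw percentage agreement.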

replace-cross Mixture of Multicenter Experts in Multimodal Generative AI for Advanced Radiotherapy Target Delineation

Authors: Yujin Oh, Sangjoon Park, Xiang Li, Wang Yi, Jonathan Paly, Jason Efstathiou, Annie Chan, Jun Won Kim, Hwa Kyung Byun, Ik Jae Lee, Jaeho Cho, Chan Woo Wee, Peng Shu, Peilong Wang, Nathan Yu, Jason Holmes, Jong Chul Ye, Quanzheng Li, Wei Liu, Woong Sub Koom, Jin Sung Kim, Kyungsang Kim

Abstract: Clinical experts employ diverse philosophies and strategies in patient care, influenced by regional patient populations. However, existing medical artificial intelligence (AI) models are often trained on data distributions that disproportionately reflect highly prevalent patterns, reinforcing biases and overlooking the diverse expertise of clinicians. To overcome this limitation, we introduce the Mixture of Multicenter Experts (MoME) approach. This method strategically integrates specialized expertise from diverse clinical strategies, enhancing the AI model's ability to generalize and adapt across multiple medical centers. The MoME-based multimodal target volume delineation model, trained with few-shot samples including images and clinical notes from each medical center, outperformed baseline methods in prostate cancer radiotherapy target delineation. The advantages of MoME were most pronounced when data characteristics varied across centers or when data availability was limited, demonstrating its potential for broader clinical applications. Therefore, the MoME framework enables the deployment of AI-based target volume delineation models in resource-constrained medical facilities by adapting to specific preferences of each medical center only using a few sample data, without the need for data sharing between institutions. Expanding the number of multicenter experts within the MoME framework will significantly enhance the generalizability, while also improving the usability and adaptability of clinical AI applications in the field of precision radiation oncology.

replace-cross Geometric Collaborative Filtering with Convergence

Authors: Hisham Husain, Julien Monteil

Abstract: Latent variable collaborative filtering methods have been a standard approach to modelling user-click interactions due to their simplicity and effectiveness. However, there is limited work on analyzing the mathematical properties of these methods, in particular on preventing overfitting towards the identity, and such methods typically utilize loss functions that overlook the geometry between items. In this work, we introduce a notion of generalization gap in collaborative filtering and analyze this with respect to latent collaborative filtering models. We present a geometric upper bound that gives rise to loss functions, and a way to meaningfully utilize the geometry of item-metadata to improve recommendations. We show how these losses can be minimized and give the recipe for a new latent collaborative filtering algorithm, which we refer to as GeoCF, due to the geometric nature of our results. We then show experimentally that our proposed GeoCF algorithm can outperform all other existing methods on the Movielens20M and Netflix datasets, as well as two large-scale internal datasets. In summary, our work proposes a theoretically sound method which paves a way to better understand generalization of collaborative filtering at large.

replace-cross AIME: AI System Optimization via Multiple LLM Evaluators

Authors: Bhrij Patel, Souradip Chakraborty, Wesley A. Suttle, Mengdi Wang, Amrit Singh Bedi, Dinesh Manocha

Abstract: Text-based AI system optimization typically involves a feedback loop scheme where a single LLM generates an evaluation in natural language of the current output to improve the next iteration's output. However, in this work, we empirically demonstrate that for a practical and complex task (code generation) with multiple criteria to evaluate, utilizing only one LLM evaluator tends to let errors in generated code go undetected, thus leading to incorrect evaluations and ultimately suboptimal test case performance. Motivated by this failure case, we assume there exists an optimal evaluation policy that samples an evaluation between response and ground truth. We then theoretically prove that a linear combination of multiple evaluators can approximate this optimal policy. From this insight, we propose AI system optimization via Multiple LLM Evaluators (AIME). AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation. We provide an extensive empirical study showing AIME outperforming baseline methods in code generation tasks, with up to $62\%$ higher error detection rate and up to $16\%$ higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets. We also show that the selection of the number of evaluators and which criteria to utilize is non-trivial, as it can impact success rate by up to $12\%$.
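
The protocol described above (independent evaluators, one per criterion, combined by concatenation) can be sketched roughly as follows. This is not the authors' implementation: the two evaluators here are plain Python stand-ins for LLM calls, and the names `combine_evaluations`, `check_syntax`, `check_return`, and `EVALUATORS` are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical evaluator type: takes candidate code, returns a text critique.
# In the setting the abstract describes, each evaluator would be an LLM call.
Evaluator = Callable[[str], str]


def check_syntax(code: str) -> str:
    """Stand-in evaluator for a 'syntax' criterion."""
    try:
        compile(code, "<candidate>", "exec")
        return "no syntax errors found"
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"


def check_return(code: str) -> str:
    """Stand-in evaluator for a 'returns a value' criterion."""
    if "return" in code:
        return "contains a return statement"
    return "no return statement found"


EVALUATORS: Dict[str, Evaluator] = {
    "syntax": check_syntax,
    "returns a value": check_return,
}


def combine_evaluations(code: str, evaluators: Dict[str, Evaluator]) -> str:
    """Run every evaluator independently and concatenate the critiques."""
    parts = [f"[{criterion}] {evaluate(code)}"
             for criterion, evaluate in evaluators.items()]
    return "\n".join(parts)


print(combine_evaluations("def f(x):\n    return x\n", EVALUATORS))
```

Keeping each criterion in its own evaluator means a failure on one criterion cannot be masked by a favorable overall impression, which is the failure mode the abstract attributes to a single evaluator.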

replace-cross CoBa: Convergence Balancer for Multitask Finetuning of Large Language Models

Authors: Zi Gong, Hang Yu, Cong Liao, Bingchang Liu, Chaoyu Chen, Jianguo Li

Abstract: Multi-task learning (MTL) benefits the fine-tuning of large language models (LLMs) by providing a single model with improved performance and generalization ability across tasks, presenting a resource-efficient alternative to developing separate models for each task. Yet, existing MTL strategies for LLMs often fall short by either being computationally intensive or failing to ensure simultaneous task convergence. This paper presents CoBa, a new MTL approach designed to effectively manage task convergence balance with minimal computational overhead. Utilizing Relative Convergence Scores (RCS), Absolute Convergence Scores (ACS), and a Divergence Factor (DF), CoBa dynamically adjusts task weights during the training process, ensuring that the validation losses of all tasks progress towards convergence at an even pace while mitigating the issue of individual task divergence. The results of our experiments involving three disparate datasets underscore that this approach not only fosters equilibrium in task convergence but also enhances the LLMs' performance by up to 13% relative to the second-best baselines. Code is open-sourced at https://github.com/codefuse-ai/MFTCoder.

URLs: https://github.com/codefuse-ai/MFTCoder

replace-cross MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses

Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, Dongzhan Zhou

Abstract: Scientific discovery contributes largely to human society's prosperity, and recent progress shows that LLMs could potentially catalyze this process. However, it is still unclear whether LLMs can discover novel and valid hypotheses in chemistry. In this work, we investigate this central research question: Can LLMs automatically discover novel and valid chemistry research hypotheses given only a chemistry research background (consisting of a research question and/or a background survey), without limitation on the domain of the research question? After extensive discussions with chemistry experts, we propose an assumption that a majority of chemistry hypotheses can result from a research background and several inspirations. With this key insight, we break the central question into three smaller fundamental questions. In brief, they are: (1) given a background question, whether LLMs can retrieve good inspirations; (2) with background and inspirations, whether LLMs can arrive at a hypothesis; and (3) whether LLMs can identify good hypotheses to rank them higher. To investigate these questions, we construct a benchmark consisting of 51 chemistry papers published in Nature, Science, or a similar level in 2024 (all papers have been available online only since 2024). Every paper is divided by chemistry PhD students into three components: background, inspirations, and hypothesis. The goal is to rediscover the hypothesis, given only the background and a large randomly selected chemistry literature corpus containing the ground-truth inspiration papers, with LLMs trained with data up to 2023. We also develop an LLM-based multi-agent framework that leverages the assumption, consisting of three stages reflecting the three smaller questions. The proposed method can rediscover many hypotheses with very high similarity to the ground-truth ones, covering the main innovations.

replace-cross Learning to Walk from Three Minutes of Real-World Data with Semi-structured Dynamics Models

Authors: Jacob Levy, Tyler Westenbroek, David Fridovich-Keil

Abstract: Traditionally, model-based reinforcement learning (MBRL) methods exploit neural networks as flexible function approximators to represent $\textit{a priori}$ unknown environment dynamics. However, training data are typically scarce in practice, and these black-box models often fail to generalize. Modeling architectures that leverage known physics can substantially reduce the complexity of system identification, but break down in the face of complex phenomena such as contact. We introduce a novel framework for learning semi-structured dynamics models for contact-rich systems which seamlessly integrates structured first-principles modeling techniques with black-box auto-regressive models. Specifically, we develop an ensemble of probabilistic models to estimate external forces, conditioned on historical observations and actions, and integrate these predictions using known Lagrangian dynamics. With this semi-structured approach, we can make accurate long-horizon predictions with substantially less data than prior methods. We leverage this capability and propose Semi-Structured Reinforcement Learning ($\texttt{SSRL}$), a simple model-based learning framework which pushes the sample complexity boundary for real-world learning. We validate our approach on a real-world Unitree Go1 quadruped robot, learning dynamic gaits -- from scratch -- on both hard and soft surfaces with just a few minutes of real-world data. Video and code are available at: https://sites.google.com/utexas.edu/ssrl

URLs: https://sites.google.com/utexas.edu/ssrl

replace-cross Continual Learning with Neuromorphic Computing: Theories, Methods, and Applications

Authors: Mishal Fatima Minhas, Rachmad Vidya Wicaksana Putra, Falah Awwad, Osman Hasan, Muhammad Shafique

Abstract: To adapt to real-world dynamics, intelligent systems need to assimilate new knowledge without catastrophic forgetting, where learning new tasks leads to a degradation in performance on old tasks. To address this, the continual learning concept has been proposed to enable autonomous systems to acquire new knowledge and dynamically adapt to changing environments. Specifically, energy-efficient continual learning is needed to ensure the functionality of autonomous systems under tight compute and memory resource budgets (i.e., so-called autonomous embedded systems). Neuromorphic computing, with brain-inspired Spiking Neural Networks (SNNs), offers inherent advantages for enabling low-power/energy continual learning in autonomous embedded systems. In this paper, we comprehensively discuss the foundations and methods for enabling continual learning in neural networks, then analyze the state-of-the-art works considering SNNs. Afterward, comparative analyses of existing methods are conducted while considering crucial design factors, such as network complexity, memory, latency, and power/energy efficiency. We also explore the practical applications that can benefit from SNN-based continual learning and open challenges in real-world scenarios. In this manner, our survey provides valuable insights into the recent advancements of SNN-based continual learning for real-world application use-cases.

replace-cross Multi-Task Dynamic Pricing in Credit Market with Contextual Information

Authors: Adel Javanmard, Jingwei Ji, Renyuan Xu

Abstract: We study the dynamic pricing problem faced by a broker that buys and sells a large number of financial securities in the credit market, such as corporate bonds, government bonds, loans, and other credit-related securities. One challenge in pricing these securities is their infrequent trading, which leads to insufficient data for individual pricing. However, many of these securities share structural features that can be utilized. Building on this, we propose a multi-task dynamic pricing framework that leverages these shared structures across securities, enhancing pricing accuracy through learning. In our framework, a security is fully characterized by a $d$-dimensional contextual/feature vector. The customer will buy (sell) the security from (to) the broker if the broker quotes a price lower (higher) than that of the competitors. We assume a linear contextual model for the competitor's pricing, with parameters that are unknown a priori. The parameters for pricing different securities may or may not be similar to each other. The firm's objective is to minimize the expected regret, namely, the expected revenue loss against a clairvoyant policy which has the knowledge of the parameters of the competitor's pricing model. We show that the regret of our policy is better than both the policy that treats each security individually and the policy that treats all securities as the same. Moreover, the regret is bounded by $\tilde{O} ( \delta_{\max} \sqrt{T M d} + M d ) $, where $M$ is the number of securities and $\delta_{\max}$ characterizes the overall dissimilarity across securities in the basket.

replace-cross TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling

Authors: Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang

Abstract: Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning, but it is challenging to balance computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, HH-RLHF, UltraFeedback, GSM8K, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves the highest win rate of 65% on TutorEval and around 60% win rates across the other datasets, outperforming standard BoN with the same computational cost and showcasing its scalability and alignment efficacy.
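
As background for the baseline mentioned above, here is a minimal sketch of plain Best-of-N sampling: draw N responses, score each with a reward function, keep the highest-scoring one. It does not implement TreeBoN's tree search or DPO-based token-level rewards. The `generate` and `reward` callables and the toy stand-ins below are hypothetical placeholders for a language model and a reward model.

```python
import random
from typing import Callable, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int) -> Tuple[str, float]:
    """Plain Best-of-N: sample n responses and return the best-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Cost grows linearly with n: one generation and one scoring per sample.
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, candidate) for candidate in candidates]
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


# Toy stand-ins (assumptions, not a real model): a random canned response
# and a reward that simply prefers longer answers.
_rng = random.Random(0)
_canned = ["4", "The answer is 4.", "2 + 2 = 4, so the answer is 4."]


def toy_generate(prompt: str) -> str:
    return _rng.choice(_canned)


def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))


best, score = best_of_n("What is 2 + 2?", toy_generate, toy_reward, n=8)
print(best, score)
```

Every one of the N candidates is generated in full before any is discarded, which is the computational cost the abstract says tree-structured branching and pruning aims to reduce.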

replace-cross MagicPIG: LSH Sampling for Efficient LLM Generation

Authors: Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen

Abstract: Large language models (LLMs) with long context windows have gained significant attention. However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various dynamic sparse or TopK-based attention approximation methods have been proposed to leverage the common insight that attention is sparse. In this paper, we first show that TopK attention itself suffers from quality degradation in certain downstream tasks because attention is not always as sparse as expected. Rather than selecting the keys and values with the highest attention scores, sampling with theoretical guarantees can provide a better estimation for attention output. To make the sampling-based approximation practical in LLM generation, we propose MagicPIG, a heterogeneous system based on Locality Sensitive Hashing (LSH). MagicPIG significantly reduces the workload of attention computation while preserving high accuracy for diverse tasks. MagicPIG stores the LSH hash tables and runs the attention computation on the CPU, which allows it to serve longer contexts and larger batch sizes with high approximation accuracy. MagicPIG can improve decoding throughput by $1.9\sim3.9\times$ across various GPU hardware and achieve 110ms decoding latency on a single RTX 4090 for the Llama-3.1-8B-Instruct model with a context of 96k tokens. The code is available at https://github.com/Infini-AI-Lab/MagicPIG.

URLs: https://github.com/Infini-AI-Lab/MagicPIG
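
To illustrate what a locality-sensitive hash is, the sketch below uses random-hyperplane hashing (SimHash), one common LSH family for cosine similarity. The abstract does not say which LSH family or sampling scheme MagicPIG uses, so this is a generic illustration under that assumption, and all function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def make_hyperplanes(dim: int, num_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw num_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vector: Sequence[float],
            hyperplanes: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the hyperplane the vector lies on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append(1 if dot >= 0.0 else 0)
    return tuple(bits)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, num_bits=16, seed=1)
query = [1.0, 2.0, -0.5, 0.3]
near_key = [1.1, 1.9, -0.4, 0.3]   # small angle to the query
far_key = [-1.0, -2.0, 0.5, -0.3]  # opposite direction

print(hamming(simhash(query, planes), simhash(near_key, planes)))
print(hamming(simhash(query, planes), simhash(far_key, planes)))
```

For this family, two vectors at angle theta agree on any single bit with probability 1 - theta/pi, so keys whose signatures collide with the query's are likely to have high cosine similarity to it. That property is what lets a hash-table lookup stand in for scoring every key.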

replace-cross Advancing Gasoline Consumption Forecasting: A Novel Hybrid Model Integrating Transformers, LSTM, and CNN

Authors: Mahmoud Ranjbar, Mohammad Rahimzadeh

Abstract: Iran, endowed with abundant hydrocarbon resources, plays a crucial role in the global energy landscape. Gasoline, as a critical fuel, significantly supports the nation's transportation sector. Accurate forecasting of gasoline consumption is essential for strategic resource management and environmental planning. This research introduces a novel approach to predicting monthly gasoline consumption using a hybrid Transformer-LSTM-CNN model, which integrates the strengths of Transformer networks, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNN). This advanced architecture offers a superior alternative to conventional methods such as artificial neural networks and regression models by capturing both short- and long-term dependencies in time series data. By leveraging the self-attention mechanism of Transformers, the temporal memory of LSTMs, and the local pattern detection of CNNs, our hybrid model delivers improved prediction accuracy. Implemented using Python, the model provides precise future gasoline consumption forecasts and evaluates the environmental impact through the analysis of greenhouse gas emissions. This study examines gasoline consumption trends from 2007 to 2021, which rose from 64.5 million liters per day in 2007 to 99.80 million liters per day in 2021. Our proposed model forecasts consumption levels up to 2031, offering a valuable tool for policymakers and energy analysts. The results highlight the superiority of this hybrid model in improving the accuracy of gasoline consumption forecasts, reinforcing the need for advanced machine learning techniques to optimize resource management and mitigate environmental risks in the energy sector.

replace-cross DNAHLM -- DNA sequence and Human Language mixed large language Model

Authors: Wang Liang

Abstract: There are already many DNA large language models, but most of them still follow traditional uses, such as extracting sequence features for classification tasks. More innovative applications of large language models, such as prompt engineering, RAG, and zero-shot or few-shot prediction, remain challenging for DNA-based models. The key issue lies in the fact that DNA models and human natural language models are entirely separate; however, techniques like prompt engineering require the use of natural language, thereby significantly limiting the application of DNA large language models. This paper introduces a pre-trained model trained on the GPT-2 network, combining DNA sequences and English text, and uses a unified BPE tokenization method. We then convert classification and other downstream tasks into Alpaca format instruction data, and perform instruction fine-tuning on this pre-trained model to create a fine-tuned model capable of handling multiple tasks. The model has demonstrated its effectiveness in DNA-related zero-shot prediction and multitask applications. This research provides a highly promising direction for building a unified DNA sequence task framework.

replace-cross Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Authors: Nils Blank, Moritz Reuss, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Wenzel, Oier Mees, Rudolf Lioutikov

Abstract: A central challenge towards developing robots that can relate human language to their perception and actions is the scarcity of natural language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, hindering their scalability. To this end, we introduce NILS: Natural language Instruction Labeling for Scalability. NILS automatically labels uncurated, long-horizon robot data at scale in a zero-shot manner without any human intervention. NILS combines pretrained vision-language foundation models in order to detect objects in a scene, detect object-centric changes, segment tasks from large datasets of unlabelled interaction data and ultimately label behavior datasets. Evaluations on BridgeV2, Fractal, and a kitchen play dataset show that NILS can autonomously annotate diverse robot demonstrations of unlabeled and unstructured datasets while alleviating several shortcomings of crowdsourced human annotations, such as low data quality and diversity. We use NILS to label over 115k trajectories obtained from over 430 hours of robot data. We open-source our auto-labeling code and generated annotations on our website: http://robottasklabeling.github.io.

URLs: http://robottasklabeling.github.io

replace-cross Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

Authors: Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao

Abstract: Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where algorithmically crafted adversarial suffixes appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety.

replace-cross Free-Rider and Conflict Aware Collaboration Formation for Cross-Silo Federated Learning

Authors: Mengmeng Chen, Xiaohu Wu, Xiaoli Tang, Tiantian He, Yew-Soon Ong, Qiqi Liu, Qicheng Lao, Han Yu

Abstract: Federated learning (FL) is a machine learning paradigm that allows multiple FL participants (FL-PTs) to collaborate on training models without sharing private data. Due to data heterogeneity, negative transfer may occur in the FL training process. This necessitates FL-PT selection based on their data complementarity. In cross-silo FL, organizations that engage in business activities are key sources of FL-PTs. The resulting FL ecosystem has two features: (i) self-interest, and (ii) competition among FL-PTs. This requires the desirable FL-PT selection strategy to simultaneously mitigate the problems of free riders and conflicts of interest among competitors. To this end, we propose an optimal FL collaboration formation strategy -- FedEgoists -- which ensures that: (1) an FL-PT can benefit from FL if and only if it benefits the FL ecosystem, and (2) an FL-PT will not contribute to its competitors or their supporters. It provides an efficient clustering solution to group FL-PTs into coalitions, ensuring that within each coalition, FL-PTs share the same interest. We theoretically prove that the FL-PT coalitions formed are optimal since no coalitions can collaborate to improve the utility of any of their members. Extensive experiments on widely adopted benchmark datasets demonstrate the effectiveness of FedEgoists compared to nine state-of-the-art baseline methods, and its ability to establish efficient collaborative networks in cross-silo FL with FL-PTs that engage in business activities.