Authors: Christian Bender, Nguyen Tran Thuan
Abstract: We present a random measure approach for modeling exploration, i.e., the execution of measure-valued controls, in continuous-time reinforcement learning (RL) with controlled diffusion and jumps. First, we consider the case when sampling the randomized control in continuous time takes place on a discrete-time grid and reformulate the resulting stochastic differential equation (SDE) as an equation driven by suitable random measures. The construction of these random measures makes use of the Brownian motion and the Poisson random measure (which are the sources of noise in the original model dynamics) as well as additional random variables, which are sampled on the grid to execute the control. Then, we prove a limit theorem for these random measures as the mesh size of the sampling grid goes to zero, which leads to the grid-sampling limit SDE that is jointly driven by white noise random measures and a Poisson random measure. We also argue that the grid-sampling limit SDE can substitute the exploratory SDE and the sample SDE of the recent continuous-time RL literature, i.e., it can be applied for the theoretical analysis of exploratory control problems and for the derivation of learning algorithms.
Authors: Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse
Abstract: As large language models (LLMs) evolve to handle increasingly longer contexts, serving inference requests for context lengths in the range of millions of tokens presents unique challenges. While existing techniques are effective for training, they fail to address the unique challenges of inference, such as the differing prefill and decode phases and their associated latency constraints, like Time to First Token (TTFT) and Time Between Tokens (TBT). Furthermore, no long-context inference solution today allows batching requests to increase hardware utilization. In this paper, we propose three key innovations for efficient interactive long-context LLM inference, without resorting to any approximation: adaptive chunking to reduce prefill overheads in mixed batching, Sequence Pipeline Parallelism (SPP) to lower TTFT, and KV Cache Parallelism (KVP) to minimize TBT. These contributions are combined into a 3D parallelism strategy that enables Mnemosyne to scale interactive inference to context lengths of at least 10 million tokens with high throughput enabled by batching. To our knowledge, Mnemosyne is the first system to efficiently support 10-million-token long-context inference while satisfying production-grade SLOs on TBT (30ms) on contexts up to and including 10 million tokens.
Authors: Hannes Stark, Umesh Padia, Julia Balla, Cameron Diao, George Church
Abstract: Generating protein sequences conditioned on protein structures is an impactful technique for protein engineering. Engineered proteins are commonly synthesized by translating them into DNA and expressing that DNA in an organism such as yeast. One difficulty in this process is that expression rates can be low due to suboptimal codon sequences for expressing a protein in a host organism. We propose CodonMPNN, which generates a codon sequence conditioned on a protein backbone structure and an organism label. If naturally occurring DNA sequences are close to codon optimality, CodonMPNN could learn to generate codon sequences with higher expression yields than heuristic codon choices for generated amino acid sequences. Experiments show that CodonMPNN retains the performance of previous inverse folding approaches and recovers wild-type codons more frequently than baselines. Furthermore, CodonMPNN has a higher likelihood of generating high-fitness codon sequences than low-fitness codon sequences for the same protein sequence. Code is available at https://github.com/HannesStark/CodonMPNN.
Authors: Théo Bourdais, Houman Owhadi
Abstract: Whether deterministic or stochastic, models can be viewed as functions designed to approximate a specific quantity of interest. We propose a data-driven framework that aggregates predictions from diverse models into a single, more accurate output. This aggregation approach exploits each model's strengths to enhance overall accuracy. It is non-intrusive (treating models as black-box functions), model-agnostic, requires minimal assumptions, and can combine outputs from a wide range of models, including those from machine learning and numerical solvers. We argue that the aggregation process should be point-wise linear and propose two methods to find an optimal aggregate: Minimal Error Aggregation (MEA), which minimizes the aggregate's prediction error, and Minimal Variance Aggregation (MVA), which minimizes its variance. While MEA is inherently more accurate when correlations between models and the target quantity are perfectly known, Minimal Empirical Variance Aggregation (MEVA), an empirical version of MVA, consistently outperforms Minimal Empirical Error Aggregation (MEEA), the empirical counterpart of MEA, when these correlations must be estimated from data. The key difference is that MEVA constructs an aggregate by estimating model errors, while MEEA treats the models as features for direct interpolation of the quantity of interest. This makes MEEA more susceptible to overfitting and poor generalization, where the aggregate may underperform individual models during testing. We demonstrate the versatility and effectiveness of our framework in various applications, such as data science and partial differential equations, showing how it successfully integrates traditional solvers with machine learning models to improve both robustness and accuracy.
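As a concrete illustration of the point-wise linear aggregation idea, here is a minimal sketch of minimal-variance weighting. It assumes, purely for illustration, unbiased models and a known error covariance matrix `Sigma`; the paper's MEVA instead estimates model errors from data, and all names here are hypothetical.

```python
# Minimal-variance linear aggregation at a single evaluation point.
# Assumption (illustrative): models are unbiased with error covariance Sigma.
import numpy as np

def minimal_variance_weights(Sigma: np.ndarray) -> np.ndarray:
    """Weights w minimizing w^T Sigma w subject to sum(w) = 1."""
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)   # unnormalized solution Sigma^{-1} 1
    return w / w.sum()                 # normalize so the weights sum to 1

preds = np.array([1.1, 0.9, 1.3])      # three models' predictions at one point
Sigma = np.diag([0.04, 0.01, 0.09])    # illustrative error covariance
aggregate = minimal_variance_weights(Sigma) @ preds
print(aggregate)                       # dominated by the lowest-variance model
```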
Authors: Jacobo Ruiz, Manas Gupta
Abstract: The field of artificial intelligence faces significant challenges in achieving both biological plausibility and computational efficiency, particularly in visual learning tasks. Current artificial neural networks, such as convolutional neural networks, rely on techniques like backpropagation and weight sharing, which do not align with the brain's natural information processing methods. To address these issues, we propose the Memory Network, a model inspired by biological principles that avoids backpropagation and convolutions, and operates in a single pass. This approach enables rapid and efficient learning, mimicking the brain's ability to adapt quickly with minimal exposure to data. Our experiments demonstrate that the Memory Network achieves efficient and biologically plausible learning, showing strong performance on simpler datasets like MNIST. However, further refinement is needed for the model to handle more complex datasets such as CIFAR10, highlighting the need to develop new algorithms and techniques that closely align with biological processes while maintaining computational efficiency.
Authors: Max Koster, Jude Kukla
Abstract: In recent years, continual learning, a prediction setting in which the problem environment may evolve over time, has become an increasingly popular research field because the framework is geared toward complex, non-stationary objectives. Learning such objectives requires plasticity, or the ability of a neural network to adapt its predictions to a different task. Recent findings indicate that plasticity loss on new tasks is highly related to loss landscape sharpness in non-stationary RL frameworks. We explore the use of sharpness regularization techniques, which seek out smooth minima and have been touted for their generalization capabilities in vanilla prediction settings, in an effort to combat plasticity loss. Our findings indicate that such techniques have no significant effect on reducing plasticity loss.
Authors: Aranyak Acharyya, Michael W. Trosset, Carey E. Priebe, Hayden S. Helm
Abstract: Generative models, such as large language models and text-to-image diffusion models, produce relevant information when presented with a query. Different models may produce different information when presented with the same query. As the landscape of generative models evolves, it is important to develop techniques to study and analyze differences in model behaviour. In this paper we present novel theoretical results for embedding-based representations of generative models in the context of a set of queries. We establish sufficient conditions for the consistent estimation of the model embeddings in situations where the query set and the number of models grow.
Authors: Anantaa Kotal, Anupam Joshi
Abstract: Integrating privacy measures, such as differential privacy techniques, into synthetic data generation provides a provable privacy guarantee for the synthetic data. However, challenges arise for Generative Deep Learning models when tasked with generating realistic data, especially in critical domains such as Cybersecurity and Healthcare. Generative Models optimized for continuous data struggle to model discrete and non-Gaussian features that have domain constraints. Challenges increase when the training datasets are limited and not diverse. In such cases, generative models create synthetic data that repeats sensitive features, which is a privacy risk. Moreover, generative models face difficulties comprehending attribute constraints in specialized domains. This leads to the generation of unrealistic data that impacts downstream accuracy. To address these issues, this paper proposes a novel model, KIPPS, that infuses Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep Learning models for enhanced Privacy Preserving Synthetic data generation. The novel framework augments the training of generative models with supplementary context about attribute values and enforces domain constraints during training. This added guidance enhances the model's capacity to generate realistic and domain-compliant synthetic data. The proposed model is evaluated on real-world datasets, specifically in the domains of Cybersecurity and Healthcare, where domain constraints and rules add to the complexity of the data. Our experiments evaluate the privacy resilience and downstream accuracy of the model against benchmark methods, demonstrating its effectiveness in addressing the balance between privacy preservation and data accuracy in complex domains.
Authors: Lê-Nguyên Hoang
Abstract: This paper advances the understanding of how the size of a machine learning model affects its vulnerability to poisoning, despite state-of-the-art defenses. Given isotropic random honest feature vectors and the geometric median (or clipped mean) as the robust gradient aggregator rule, we essentially prove that, perhaps surprisingly, linear and logistic regressions with $D \geq 169 H^2/P^2$ parameters are subject to arbitrary model manipulation by poisoners, where $H$ and $P$ are the numbers of honestly labeled and poisoned data points used for training. Our experiments further expose a fundamental tradeoff between augmenting model expressivity and increasing the poisoners' attack surface, on both synthetic data and on MNIST & FashionMNIST data for linear classifiers with random features. We also discuss potential implications for source-based learning and neural nets.
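For scale, a worked instance of the threshold above, with illustrative numbers not taken from the paper:

$$
H = 10^4,\qquad P = 10^3 \quad\Longrightarrow\quad D \geq 169\,\frac{H^2}{P^2} = 169\cdot\frac{10^8}{10^6} = 16{,}900,
$$

so a linear or logistic regression with roughly 17,000 or more parameters would be subject to arbitrary manipulation even though the poisoners contribute under 10% of the training points.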
Authors: Ruiquan Huang, Yingbin Liang, Jing Yang
Abstract: Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data, especially in next-token prediction (NTP) tasks. However, the theoretical understanding of their performance in NTP is limited, with existing studies focusing mainly on asymptotic performance. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer. We first characterize the essential structural properties of training datasets for NTP using a mathematical framework based on partial orders. Then, we design a two-stage training algorithm, in which a pre-processing stage trains the feed-forward layer and a main stage trains the attention layer, both exhibiting fast convergence. Specifically, both layers converge sub-linearly to the direction of their corresponding max-margin solutions. We also show that the cross-entropy loss enjoys a linear convergence rate. Furthermore, we show that the trained transformer exhibits non-trivial prediction ability under dataset shift, which sheds light on the remarkable generalization performance of transformers. Our analysis technique involves the development of novel properties on the attention gradient and further in-depth analysis of how these properties contribute to the convergence of the training process. Our experiments further validate our theoretical findings.
Authors: Filippo Lazzati, Alberto Maria Metelli
Abstract: Our goal is to extract useful knowledge from demonstrations of behavior in sequential decision-making problems. Although it is well-known that humans commonly engage in risk-sensitive behaviors in the presence of stochasticity, most Inverse Reinforcement Learning (IRL) models assume a risk-neutral agent. Beyond introducing model misspecification, these models do not directly capture the risk attitude of the observed agent, which can be crucial in many applications. In this paper, we propose a novel model of behavior in Markov Decision Processes (MDPs) that explicitly represents the agent's risk attitude through a utility function. We then define the Utility Learning (UL) problem as the task of inferring the observed agent's risk attitude, encoded via a utility function, from demonstrations in MDPs, and we analyze the partial identifiability of the agent's utility. Furthermore, we devise two provably efficient algorithms for UL in a finite-data regime, and we analyze their sample complexity. We conclude with proof-of-concept experiments that empirically validate both our model and our algorithms.
Authors: Yegor Klochkov, Yang Liu
Abstract: Influence functions are a popular tool for attributing a model's output to training data. The traditional approach relies on the calculation of inverse Hessian-vector products (iHVP), but the classical solver "Linear time Stochastic Second-order Algorithm" (LiSSA, Agarwal et al. (2017)) is often deemed impractical for large models due to expensive computation and hyperparameter tuning. We show that the three hyperparameters -- the scaling factor, the batch size, and the number of steps -- can be chosen depending on the spectral properties of the Hessian, particularly its trace and largest eigenvalue. By evaluating with random sketching (Swartworth and Woodruff, 2023), we find that the batch size has to be sufficiently large for LiSSA to converge; however, for all of the models we consider, the requirement is mild. We confirm our findings empirically by comparing to Proximal Bregman Retraining Functions (PBRF, Bae et al. (2022)). Finally, we discuss what role the inverse Hessian plays in calculating the influence.
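For readers unfamiliar with LiSSA, the following is a minimal sketch of its recursion for approximating an inverse-Hessian-vector product $H^{-1}v$. The stochastic HVP oracle `hvp` (one mini-batch Hessian-vector product per call), the hyperparameter values, and the PyTorch framing are illustrative assumptions, not the paper's code.

```python
# LiSSA-style fixed-point iteration: r_t = v + (I - H/scale) r_{t-1},
# whose fixed point is r* = scale * H^{-1} v.
import torch

def lissa_ihvp(hvp, v: torch.Tensor, scale: float = 10.0, num_steps: int = 100):
    """Approximate H^{-1} v; `hvp(r)` returns a stochastic estimate of H r."""
    r = v.clone()
    for _ in range(num_steps):
        r = v + r - hvp(r) / scale  # scale must exceed the largest eigenvalue of H
    return r / scale                # undo the scaling to recover H^{-1} v
```

The abstract's point is that `scale`, the batch size inside `hvp`, and `num_steps` need not be tuned blindly: they can be chosen from the Hessian's trace and largest eigenvalue.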
Authors: Alif Bin Abdul Qayyum, Xihaier Luo, Nathan M. Urban, Xiaoning Qian, Byung-Jun Yoon
Abstract: The world is moving towards clean and renewable energy sources, such as wind energy, in an attempt to reduce greenhouse gas emissions that contribute to global warming. To enhance the analysis and storage of wind data, we introduce a deep learning framework designed to simultaneously enable effective dimensionality reduction and continuous representation of multi-altitude wind data from discrete observations. The framework consists of three key components: dimensionality reduction, cross-modal prediction, and super-resolution. We aim to: (1) improve data resolution across diverse climatic conditions to recover high-resolution details; (2) reduce data dimensionality for more efficient storage of large climate datasets; and (3) enable cross-prediction between wind data measured at different heights. Comprehensive testing confirms that our approach surpasses existing methods in both super-resolution quality and compression efficiency.
Authors: Ruining Yang, Lili Su
Abstract: Modern vehicles are equipped with multiple information-collection devices such as sensors and cameras, continuously generating a large volume of raw data. Accurately predicting the trajectories of neighboring vehicles is a vital component in understanding the complex driving environment. Yet, training trajectory prediction models is challenging in two ways. Processing the large-scale data is computation-intensive. Moreover, easy-medium driving scenarios often overwhelmingly dominate the dataset, leaving challenging driving scenarios such as dense traffic under-represented. For example, in the Argoverse motion prediction dataset, there are very few instances with $\ge 50$ agents, while scenarios with $10 \thicksim 20$ agents are far more common. In this paper, to mitigate data redundancy in the over-represented driving scenarios and to reduce the bias rooted in the data scarcity of complex ones, we propose a novel data-efficient training method based on coreset selection. This method strategically selects a small but representative subset of data while balancing the proportions of different scenario difficulties. To the best of our knowledge, we are the first to introduce a method capable of effectively condensing a large-scale trajectory dataset, while achieving a state-of-the-art compression ratio. Notably, even when using only 50% of the Argoverse dataset, the model can be trained with little to no decline in performance. Moreover, the selected coreset maintains excellent generalization ability.
Authors: Zhixiang Shen, Shuo Wang, Zhao Kang
Abstract: Unsupervised Multiplex Graph Learning (UMGL) aims to learn node representations on various edge types without manual labeling. However, existing research overlooks a key factor: the reliability of the graph structure. Real-world data often exhibit a complex nature and contain abundant task-irrelevant noise, severely compromising UMGL's performance. Moreover, existing methods primarily rely on contrastive learning to maximize mutual information across different graphs, limiting them to redundant multiplex-graph scenarios and failing to capture view-unique task-relevant information. In this paper, we focus on a more realistic and challenging task: unsupervisedly learning a fused graph from multiple graphs that preserves sufficient task-relevant information while removing task-irrelevant noise. Specifically, our proposed Information-aware Unsupervised Multiplex Graph Fusion framework (InfoMGF) uses graph structure refinement to eliminate irrelevant noise and simultaneously maximizes view-shared and view-unique task-relevant information, thereby tackling the frontier of non-redundant multiplex graph learning. Theoretical analyses further guarantee the effectiveness of InfoMGF. Comprehensive experiments against various baselines on different downstream tasks demonstrate its superior performance and robustness. Surprisingly, our unsupervised method even beats sophisticated supervised approaches. The source code and datasets are available at https://github.com/zxlearningdeep/InfoMGF.
Authors: Zhengxin Joseph Ye, Bjoern Schuller
Abstract: Earnings release is a key economic event in the financial markets and crucial for predicting stock movements. Earnings data gives a glimpse into how a company is doing financially and can hint at where its stock might go next. However, the irregularity of its release cycle makes it a challenge to incorporate this data in a medium-frequency algorithmic trading model, and its usefulness fades fast after release, making it tough for models to stay accurate over time. Addressing this challenge, we introduce the Contrastive Earnings Transformer (CET) model, a self-supervised learning approach rooted in Contrastive Predictive Coding (CPC), aiming to optimise the utilisation of earnings data. To ascertain its effectiveness, we conduct a comparative study of CET against benchmark models across diverse sectors. Our research delves deep into the intricacies of stock data, evaluating how various models, and notably CET, handle the rapidly changing relevance of earnings data over time and across different sectors. The research outcomes shed light on CET's distinct advantage in extrapolating the inherent value of earnings data over time. Its foundation on CPC allows for a nuanced understanding, facilitating consistent stock predictions even as the earnings data ages. This finding about CET presents a fresh approach to better use earnings data in algorithmic trading for predicting stock price trends.
Authors: Qining Zhang, Lei Ying
Abstract: Reward inference (learning a reward model from human preferences) is a critical intermediate step in Reinforcement Learning from Human Feedback (RLHF) for fine-tuning Large Language Models (LLMs) such as ChatGPT. In practice, reward inference faces several fundamental challenges, including double problem misspecification, reward model evaluation without ground truth, distribution shift, and overfitting in joint reward model and policy training. An alternative approach that avoids these pitfalls is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which provides a much simpler pipeline and has shown empirical success in LLMs. However, DPO utilizes the closed-form expression between the optimal policy and the reward function, which only works under the bandit setting or deterministic MDPs. This paper develops two RLHF algorithms without reward inference, which work for general RL problems beyond bandits and deterministic MDPs, and general preference models beyond the Bradley-Terry model. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. For both algorithms, we establish rates of convergence in terms of the number of policy gradient iterations, as well as the number of trajectory samples and human preference queries per iteration. Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
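To make the second key idea concrete, here is a minimal, hypothetical sketch of a two-point zeroth-order policy-gradient estimator of the kind the abstract describes. The callback `value_diff`, which stands in for the local value-function difference estimated from human preference queries, and all hyperparameters are illustrative assumptions.

```python
# One-direction zeroth-order gradient estimate:
# g = (d / mu) * [V(theta + mu*u) - V(theta)] * u, with u uniform on the sphere.
import numpy as np

def zeroth_order_grad(value_diff, theta: np.ndarray, mu: float = 0.1) -> np.ndarray:
    d = theta.size
    u = np.random.randn(d)
    u /= np.linalg.norm(u)                    # uniform direction on the unit sphere
    diff = value_diff(theta + mu * u, theta)  # estimated from preference feedback
    return (d / mu) * diff * u
```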
Authors: Athar Mahmoudi-Nejad, Matthew Guzdial, Pierre Boulanger
Abstract: The need to generate a spider to provoke a desired anxiety response arises in the context of personalized virtual reality exposure therapy (VRET), a treatment approach for arachnophobia. This treatment involves patients observing virtual spiders in order to become desensitized and decrease their phobia, which requires that the spiders elicit specific anxiety responses. However, VRET approaches tend to require therapists to hand-select the appropriate spider for each patient, which is a time-consuming process and takes significant technical knowledge and patient insight. While automated methods exist, they tend to employ rules-based approaches with minimal ability to adapt to specific users. To address these challenges, we present a framework for VRET utilizing procedural content generation (PCG) and reinforcement learning (RL), which automatically adapts a spider to elicit a desired anxiety response. We demonstrate the superior performance of this system compared to a more common rules-based VRET method.
Authors: Athar Mahmoudi-Nejad, Pierre Boulanger, Matthew Guzdial
Abstract: Personalized virtual reality exposure therapy is a therapeutic practice that can adapt to an individual patient, leading to better health outcomes. Measuring a patient's mental state to adjust the therapy is a critical but difficult task. Most published studies use subjective methods to estimate a patient's mental state, which can be inaccurate. This article proposes a virtual reality exposure therapy (VRET) platform capable of assessing a patient's mental state using non-intrusive and widely available physiological signals such as photoplethysmography (PPG). In a case study, we evaluate how PPG signals can be used to distinguish between two binary states: peaceful and stressful. Sixteen healthy subjects were exposed to two VR environments (relaxed and stressful). Using leave-one-subject-out (LOSO) cross-validation, our best classification model predicted the two states with 70.6% accuracy, outperforming many more complex approaches.
Authors: Jiong Zhu, Gaotang Li, Yao-An Yang, Jing Zhu, Xuehao Cui, Danai Koutra
Abstract: Heterophily, or the tendency of connected nodes in networks to have different class labels or dissimilar features, has been identified as challenging for many Graph Neural Network (GNN) models. While the challenges of applying GNNs for node classification when class labels display strong heterophily are well understood, it is unclear how heterophily affects GNN performance in other important graph learning tasks where class labels are not available. In this work, we focus on the link prediction task and systematically analyze the impact of heterophily in node features on GNN performance. Theoretically, we first introduce formal definitions of homophilic and heterophilic link prediction tasks, and present a theoretical framework that highlights the different optimizations needed for the respective tasks. We then analyze how different link prediction encoders and decoders adapt to varying levels of feature homophily and introduce designs for improved performance. Our empirical analysis on a variety of synthetic and real-world datasets confirms our theoretical insights and highlights the importance of adopting learnable decoders and GNN encoders with ego- and neighbor-embedding separation in message passing for link prediction tasks beyond homophily.
Authors: Wen Ge, Guanyi Mou, Emmanuel O. Agu, Kyumin Lee
Abstract: Context-aware Human Activity Recognition (CHAR) is challenging due to the need to recognize the user's current activity from signals that vary significantly with contextual factors such as phone placements and the varied styles with which different users perform the same activity. In this paper, we argue that context-aware activity visit patterns in realistic in-the-wild data can unequivocally be considered as a general graph representation learning task. We posit that exploiting underlying graphical patterns in CHAR data can improve CHAR task performance and representation learning. Building on the intuition that certain activities are frequently performed with the phone placed in certain positions, we focus on the context-aware human activity problem of recognizing the
Authors: Sagnik Anupam, Maddy Bowers, Omar Costilla-Reyes, Armando Solar-Lezama
Abstract: We present MathDSL, a Domain-Specific Language (DSL) for mathematical equation solving, which, when deployed in program synthesis models, outperforms state-of-the-art reinforcement-learning-based methods. We also introduce a quantitative metric for measuring the conciseness of a mathematical solution and demonstrate the improvement in the quality of generated solutions compared to other methods. Our system demonstrates that a program synthesis system (DreamCoder) using MathDSL can generate programs that solve linear equations with greater accuracy and conciseness than using reinforcement learning systems. Additionally, we demonstrate that if we use the action spaces of previous reinforcement learning systems as DSLs, MathDSL outperforms the action-space-DSLs. We use DreamCoder to store equation-solving strategies as learned abstractions in its program library and demonstrate that by using MathDSL, these can be converted into human-interpretable solution strategies that could have applications in mathematical education.
Authors: Jie Hu, Yi-Ting Ma, Do Young Eun
Abstract: Distributed learning is essential to train machine learning algorithms across heterogeneous agents while maintaining data privacy. We conduct an asymptotic analysis of Unified Distributed SGD (UD-SGD), exploring a variety of communication patterns, including decentralized SGD and local SGD within Federated Learning (FL), as well as the increasing communication interval in the FL setting. In this study, we assess how different sampling strategies, such as i.i.d. sampling, shuffling, and Markovian sampling, affect the convergence speed of UD-SGD by considering the impact of agent dynamics on the limiting covariance matrix as described in the Central Limit Theorem (CLT). Our findings not only support existing theories on linear speedup and asymptotic network independence, but also theoretically and empirically show how efficient sampling strategies employed by individual agents contribute to overall convergence in UD-SGD. Simulations reveal that a few agents using highly efficient sampling can achieve or surpass the performance of the majority employing moderately improved strategies, providing new insights beyond traditional analyses focusing on the worst-performing agent.
Authors: Yusuke Matsui, Tatsuya Yokota
Abstract: We propose a new operator defined between two tensors, the broadcast product. The broadcast product calculates the Hadamard product after duplicating elements to align the shapes of the two tensors. Complex tensor operations in libraries like \texttt{numpy} can be succinctly represented as mathematical expressions using the broadcast product. Finally, we propose a novel tensor decomposition using the broadcast product, highlighting its potential applications in dimensionality reduction.
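Since the broadcast product is defined as a Hadamard product taken after duplicating elements to align shapes, a minimal numpy sketch looks as follows; note that numpy's broadcasting rules perform the duplication implicitly, which is the abstract's point about succinct representation. The example shapes are illustrative.

```python
import numpy as np

def broadcast_product(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hadamard product after (implicitly) tiling size-1 axes to align shapes."""
    return a * b  # numpy broadcasting duplicates elements along size-1 axes

A = np.random.rand(3, 1, 5)   # tensor of shape (3, 1, 5)
B = np.random.rand(1, 4, 5)   # tensor of shape (1, 4, 5)
C = broadcast_product(A, B)
print(C.shape)                # (3, 4, 5): both tensors were tiled before multiplying
```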
Authors: Xuefeng Du, Chaowei Xiao, Yixuan Li
Abstract: The surge in applications of large language models (LLMs) has prompted concerns about the generation of misleading or fabricated information, known as hallucinations. Therefore, detecting hallucinations has become critical to maintaining trust in LLM-generated content. A primary challenge in learning a truthfulness classifier is the lack of a large amount of labeled truthful and hallucinated data. To address the challenge, we introduce HaloScope, a novel learning framework that leverages the unlabeled LLM generations in the wild for hallucination detection. Such unlabeled data arises freely upon deploying LLMs in the open world, and consists of both truthful and hallucinated information. To harness the unlabeled data, we present an automated membership estimation score for distinguishing between truthful and untruthful generations within unlabeled mixture data, thereby enabling the training of a binary truthfulness classifier on top. Importantly, our framework does not require extra data collection and human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that HaloScope can achieve superior hallucination detection performance, outperforming the competitive rivals by a significant margin. Code is available at https://github.com/deeplearningwisc/haloscope.
Authors: Xiufang Shi, Wei Zhang, Mincheng Wu, Guangyi Liu, Zhenyu Wen, Shibo He, Tejal Shah, Rajiv Ranjan
Abstract: In federated learning, the heterogeneity of client data has a great impact on the performance of model training. Many heterogeneity issues in this process arise from non-independent and identically distributed (Non-IID) data. This study focuses on the issue of label distribution skew. To address it, we propose a hybrid federated learning framework called HFLDD, which integrates dataset distillation to generate approximately independent and identically distributed (IID) data, thereby improving the performance of model training. Particularly, we partition the clients into heterogeneous clusters, where the data labels among different clients within a cluster are unbalanced while the data labels among different clusters are balanced. The cluster headers collect distilled data from the corresponding cluster members and conduct model training in collaboration with the server. This training process resembles traditional federated learning on IID data and hence effectively alleviates the impact of Non-IID data on model training. Furthermore, we compare our proposed method with typical baseline methods on public datasets. Experimental results demonstrate that when the data labels are severely imbalanced, the proposed HFLDD outperforms the baseline methods in terms of both test accuracy and communication cost.
Authors: Saber Malekmohammadi, Golnoosh Farnadi
Abstract: A significant approach in natural language processing involves large-scale pre-training on general domain data followed by adaptation to specific tasks or domains. As models grow in size, full fine-tuning of all parameters becomes increasingly impractical. To address this, some methods for low-rank task adaptation of language models have been proposed, e.g. LoRA and FLoRA. These methods keep the pre-trained model weights fixed and incorporate trainable low-rank decomposition matrices into some layers of the transformer architecture, called adapters. This approach significantly reduces the number of trainable parameters required for downstream tasks compared to full fine-tuning. In this work, we look at low-rank adaptation from the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA and FLoRA is equivalent to injecting some random noise into the batch gradients w.r.t. the adapter parameters coming from their full fine-tuning, and we quantify the variance of the injected noise. By establishing a Berry-Esseen type bound on the total variation distance between the noise distribution and a Gaussian distribution with the same variance, we show that the dynamics of LoRA and FLoRA are very close to differentially private full fine-tuning of the adapters, which suggests that low-rank adaptation implicitly provides privacy w.r.t. the fine-tuning data. Finally, using the Johnson-Lindenstrauss lemma, we show that when augmented with gradient clipping, low-rank adaptation is almost equivalent to differentially private full fine-tuning of the adapters with a fixed noise scale.
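To fix notation, here is a minimal sketch of a LoRA-style adapter as described above: the pre-trained weight is frozen and only the low-rank factors are trained. Shapes and initialization follow common LoRA practice and are illustrative, not taken from the paper's analysis.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable low-rank update B A."""
    def __init__(self, d_in: int, d_out: int, r: int = 8):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)           # pre-trained weight stays fixed
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W(x) + x @ self.A.T @ self.B.T    # W x + B A x
```

Only `A` and `B` receive gradients, which is the setting the paper analyzes when comparing the batch gradients of low-rank adaptation against full fine-tuning of the adapters.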
Authors: Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
Abstract: In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first one is timestep adjustment, which provides different timestep information to each base model. It is designed to align how samples are generated along with timesteps across modalities. The second one is a new design of the additional modules, termed Cross-Modal Conditioning as Positional Encoding (CMC-PE). In CMC-PE, cross-modal information is embedded as if it represents temporal position information, and the embeddings are fed into the model like positional encoding. Compared with the popular cross-attention mechanism, CMC-PE provides a better inductive bias for temporal alignment in the generated data. Experimental results validate the effectiveness of the two newly introduced mechanisms and also demonstrate that our method outperforms existing methods.
Authors: Kunyu Peng, Di Wen, Kailun Yang, Ao Luo, Yufan Chen, Jia Fu, M. Saquib Sarfraz, Alina Roitberg, Rainer Stiefelhagen
Abstract: In Open-Set Domain Generalization (OSDG), the model is exposed to both new variations of data appearance (domains) and open-set conditions, where both known and novel categories are present at test time. The challenges of this task arise from the dual need to generalize across diverse domains and accurately quantify category novelty, which is critical for applications in dynamic environments. Recently, meta-learning techniques have demonstrated superior results in OSDG, effectively orchestrating the meta-train and -test tasks by employing varied random categories and predefined domain partition strategies. These approaches prioritize a well-designed training schedule over traditional methods that focus primarily on data augmentation and the enhancement of discriminative feature learning. The prevailing meta-learning models in OSDG typically utilize a predefined sequential domain scheduler to structure data partitions. However, a crucial aspect that remains inadequately explored is the influence brought by strategies of domain schedulers during training. In this paper, we observe that an adaptive domain scheduler benefits more in OSDG compared with prefixed sequential and random domain schedulers. We propose the Evidential Bi-Level Hardest Domain Scheduler (EBiL-HaDS) to achieve an adaptive domain scheduler. This method strategically sequences domains by assessing their reliability using a follower network that is trained with confidence scores learned in an evidential manner, regularized by max rebiasing discrepancy, and optimized in a bi-level manner. The results show that our method substantially improves OSDG performance and achieves more discriminative embeddings for both the seen and unseen categories. The source code will be available at https://github.com/KPeng9510/EBiL-HaDS.
Authors: Kasper Green Larsen, Omar Montasser, Nikita Zhivotovskiy
Abstract: Multi-distribution or collaborative learning involves learning a single predictor that works well across multiple data distributions, using samples from each during training. Recent research on multi-distribution learning, focusing on binary loss and finite VC dimension classes, has shown near-optimal sample complexity that is achieved with oracle efficient algorithms. That is, these algorithms are computationally efficient given an efficient ERM for the class. Unlike in classical PAC learning, where the optimal sample complexity is achieved with deterministic predictors, current multi-distribution learning algorithms output randomized predictors. This raises the question: can these algorithms be derandomized to produce a deterministic predictor for multiple distributions? Through a reduction to discrepancy minimization, we show that derandomizing multi-distribution learning is computationally hard, even when ERM is computationally efficient. On the positive side, we identify a structural condition enabling an efficient black-box reduction, converting existing randomized multi-distribution predictors into deterministic ones.
Authors: Naoya Hasegawa, Issei Sato
Abstract: Real-world data distributions are often highly skewed. This has spurred a growing body of research on long-tailed recognition to address this imbalance in training classification models. Among the methods studied, multiplicative logit adjustment (MLA) stands out as a simple and effective method. However, it lacks theoretical guarantees, which raises concerns about the optimality of its adjustment method. We provide a theoretical justification for the effectiveness of MLA with the following two-step theory. First, we develop a theory that adjusts optimal decision boundaries by estimating feature spread on the basis of neural collapse. Then, we demonstrate that MLA approximates this optimal method. Additionally, through experiments on long-tailed datasets, we illustrate the practical usefulness of MLA under more realistic conditions. We also offer experimental insights to guide the tuning of MLA's hyperparameters.
Authors: Mollaev Dzhambulat, Alexander Kostin, Postnova Maria, Ivan Karpukhin, Ivan A Kireev, Gleb Gusev, Andrey Savchenko
Abstract: Financial organizations collect a huge amount of data about clients that typically has a temporal (sequential) structure and is collected from various sources (modalities). Due to privacy issues, there are no large-scale open-source multimodal datasets of event sequences, which significantly limits the research in this area. In this paper, we present the industrial-scale publicly available multimodal banking dataset, MBD, that contains more than 1.5M corporate clients with several modalities: 950M bank transactions, 1B geo position events, 5M embeddings of dialogues with technical support and monthly aggregated purchases of four of the bank's products. All entries are properly anonymized from real proprietary bank data. Using this dataset, we introduce a novel benchmark with two business tasks: campaigning (purchase prediction in the next month) and matching of clients. We provide numerical results that demonstrate the superiority of our multi-modal baselines over single-modal techniques for each task. As a result, the proposed dataset can open new perspectives and facilitate the future development of practically important large-scale multimodal algorithms for event sequences. HuggingFace Link: https://huggingface.co/datasets/ai-lab/MBD Github Link: https://github.com/Dzhambo/MBD
URLs: https://huggingface.co/datasets/ai-lab/MBD, https://github.com/Dzhambo/MBD
Authors: Max Y. Ma, Gen-Hua Shi
Abstract: Based on the numerical manifold method principle, we developed a mathematical framework for a neural network manifold, Deep Manifold, and discovered that neural networks: 1) are numerical computations combining forward and inverse; 2) have near-infinite degrees of freedom; 3) have exponential learning capacity with depth; 4) have self-progressing boundary conditions; 5) have a hidden training bottleneck. We also define two concepts, neural network learning space and deep manifold space, and introduce two further concepts, neural network intrinsic pathway and fixed point. We raise three fundamental questions: 1) what defines training completion; 2) where is the deep learning convergence point (the neural network fixed point); 3) how important is the token timestamp in training data, given that negative time is critical in inverse problems.
Authors: Yilin Wang, Yifei Yu, Kong Sun, Peixuan Lei, Yuxuan Zhang, Enrico Zio, Aiguo Xia, Yuanxiang Li
Abstract: In industry, the reliability of rotating machinery is critical for production efficiency and safety. Current methods of Prognostics and Health Management (PHM) often rely on task-specific models, which face significant challenges in handling diverse datasets with varying signal characteristics, fault modes and operating conditions. Inspired by advancements in generative pretrained models, we propose RmGPT, a unified model for diagnosis and prognosis tasks. RmGPT introduces a novel token-based framework, incorporating Signal Tokens, Prompt Tokens, Time-Frequency Task Tokens and Fault Tokens to handle heterogeneous data within a unified model architecture. We leverage self-supervised learning for robust feature extraction and introduce a next signal token prediction pretraining strategy, alongside efficient prompt learning for task-specific adaptation. Extensive experiments demonstrate that RmGPT significantly outperforms state-of-the-art algorithms, achieving near-perfect accuracy in diagnosis tasks and exceptionally low errors in prognosis tasks. Notably, RmGPT excels in few-shot learning scenarios, achieving 92% accuracy in 16-class one-shot experiments, highlighting its adaptability and robustness. This work establishes RmGPT as a powerful PHM foundation model for rotating machinery, advancing the scalability and generalizability of PHM solutions.
Authors: Jiawei Du, Xin Zhang, Juncheng Hu, Wenxin Huang, Joey Tianyi Zhou
Abstract: The sharp increase in data-related expenses has motivated research into condensing datasets while retaining the most informative features. Dataset distillation has thus recently come to the fore. This paradigm generates synthetic datasets that are representative enough to replace the original dataset in training a neural network. To avoid redundancy in these synthetic datasets, it is crucial that each element contains unique features and remains diverse from others during the synthesis stage. In this paper, we provide a thorough theoretical and empirical analysis of diversity within synthesized datasets. We argue that enhancing diversity can improve the parallelizable yet isolated synthesizing approach. Specifically, we introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process, thereby maximizing the representativeness and diversity of each synthetic instance. Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset. Extensive experiments across multiple datasets, including CIFAR, Tiny-ImageNet, and ImageNet-1K, demonstrate the superior performance of our method, highlighting its effectiveness in producing diverse and representative synthetic datasets with minimal computational expense.
Authors: Yusong Wang, Chaoran Cheng, Shaoning Li, Yuxuan Ren, Bin Shao, Ge Liu, Pheng-Ann Heng, Nanning Zheng
Abstract: Geometric graph neural networks (GNNs) have emerged as powerful tools for modeling molecular geometry. However, they encounter limitations in effectively capturing long-range interactions in large molecular systems. To address this challenge, we introduce Neural P$^3$M, a versatile enhancer of geometric GNNs to expand the scope of their capabilities by incorporating mesh points alongside atoms and reimagining traditional mathematical operations in a trainable manner. Neural P$^3$M exhibits flexibility across a wide range of molecular systems and demonstrates remarkable accuracy in predicting energies and forces, outperforming existing methods on benchmarks such as the MD22 dataset. It also achieves an average improvement of 22% on the OE62 dataset while integrating with various architectures.
Authors: Keitaro Sakamoto, Issei Sato
Abstract: Modern over-parameterized neural networks can be trained to fit the training data perfectly while still maintaining a high generalization performance. This "benign overfitting" phenomenon has been studied in a surge of recent theoretical work; however, most of these studies have been limited to linear models or two-layer neural networks. In this work, we analyze benign overfitting in the token selection mechanism of the attention architecture, which characterizes the success of transformer models. We first show the existence of a benign overfitting solution and explain its mechanism in the attention architecture. Next, we discuss whether the model converges to such a solution, raising the difficulties specific to the attention architecture. We then present benign overfitting cases and not-benign overfitting cases by conditioning different scenarios based on the behavior of attention probabilities during training. To the best of our knowledge, this is the first study to characterize benign overfitting for the attention mechanism.
Authors: Pavel Procházka, Marek Dědič, Lukáš Bajer
Abstract: The last decade has seen the emergence of numerous methods for learning on graphs, particularly Graph Neural Networks (GNNs). These methods, however, are often not directly applicable to more complex structures like bipartite graphs (equivalent to hypergraphs), which represent interactions among two entity types (e.g. a user liking a movie). This paper proposes Convolutional Signal Propagation (CSP), a simple, scalable, non-parametric method that natively operates on bipartite graphs (hypergraphs) and can be implemented with just a few lines of code. After defining CSP, we demonstrate its relationship with well-established methods like label propagation, Naive Bayes, and Hypergraph Convolutional Networks. We evaluate CSP against several reference methods on real-world datasets from multiple domains, focusing on retrieval and classification tasks. Our results show that CSP offers competitive performance while maintaining low computational complexity, making it an ideal first choice as a baseline for hypergraph node classification and retrieval. Moreover, despite operating on hypergraphs, CSP achieves good results in tasks typically not associated with hypergraphs, such as natural language processing.
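The abstract notes that CSP "can be implemented with just a few lines of code." Below is a hedged sketch of one plausible reading, degree-normalized propagation of a node signal through the bipartite incidence structure in the spirit of label propagation; the exact normalization used in the paper may differ.

```python
import numpy as np
import scipy.sparse as sp

def csp_step(B: sp.csr_matrix, y: np.ndarray) -> np.ndarray:
    """One propagation step over an incidence matrix B (nodes x hyperedges)."""
    d_e = np.asarray(B.sum(axis=0)).ravel()        # hyperedge degrees
    d_v = np.asarray(B.sum(axis=1)).ravel()        # node degrees
    edge_signal = (B.T @ y) / np.maximum(d_e, 1)   # average signal over member nodes
    return (B @ edge_signal) / np.maximum(d_v, 1)  # average back onto the nodes
```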
Authors: Eike Cramer
Abstract: Real-world chemical processes often exhibit stochastic dynamics with non-trivial correlations and state-dependent fluctuations. However, most process models simply add stationary noise terms to a deterministic prediction, which can lead to inaccurate predictions. This work proposes using conditional normalizing flows as discrete-time models (DTMs) to learn the stochastic dynamics of chemical processes. Normalizing flows learn an explicit expression of the system states' probability density function (PDF) given prior states and control inputs. The resulting model naturally allows for formulating stochastic and probabilistic setpoint-tracking objectives and chance constraints. In applications to a continuous reactor and a reactor cascade, the normalizing flow yields stable simulations over long time horizons and high-quality results in stochastic and probabilistic model predictive control (MPC) formulations for open-loop control. Furthermore, a chance-constrained optimization finds reliable startup controls for the reactor cascade with stochastic reactions. In conclusion, the conditional normalizing flow presents an excellent choice for modeling nonlinear stochastic dynamics.
Authors: Tokio Kajitsuka, Issei Sato
Abstract: Recent research in the field of machine learning has increasingly focused on the memorization capacity of Transformers, but how efficient they are is not yet well understood. We demonstrate that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$, which is proved to be optimal up to logarithmic factors. This indicates that Transformers can efficiently perform memorization with little influence from the input length $n$ owing to the benefit of parameter sharing. We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary at least for Transformers with hardmax. These results suggest that while self-attention mechanisms can efficiently identify input sequences, the feed-forward network becomes a bottleneck when associating a label to each token.
Authors: Chaithra Umesh, Kristian Schultz, Manjunath Mahendra, Saparshi Bej, Olaf Wolkenhauer
Abstract: Dependencies among attributes are a common aspect of tabular data. However, whether existing tabular data generation algorithms preserve these dependencies while generating synthetic data is yet to be explored. In this article, in addition to the existing notion of functional dependencies, we introduce the notion of logical dependencies among attributes. Moreover, we provide a measure to quantify logical dependencies among attributes in tabular data. Utilizing this measure, we compare several state-of-the-art synthetic data generation algorithms and test their capability to preserve logical and functional dependencies on several publicly available datasets. We demonstrate that currently available synthetic tabular data generation algorithms do not fully preserve functional dependencies when they generate synthetic datasets. In addition, we also show that some tabular synthetic data generation models can preserve inter-attribute logical dependencies. Our review and comparison of the state-of-the-art reveal research needs and opportunities to develop task-specific synthetic tabular data generation models.
Authors: Eeshaan Jain, Indradyumna Roy, Saswat Meher, Soumen Chakrabarti, Abir De
Abstract: Graph Edit Distance (GED) measures the (dis-)similarity between two given graphs, in terms of the minimum-cost edit sequence that transforms one graph to the other. However, the exact computation of GED is NP-Hard, which has recently motivated the design of neural methods for GED estimation. Moreover, existing neural methods do not explicitly account for edit operations with different costs. In response, we propose GRAPHEDX, a neural GED estimator that can work with general costs specified for the four edit operations, viz., edge deletion, edge addition, node deletion and node addition. We first present GED as a quadratic assignment problem (QAP) that incorporates these four costs. Then, we represent each graph as a set of node and edge embeddings and use them to design a family of neural set divergence surrogates. We replace the QAP terms corresponding to each operation with their surrogates. Computing such neural set divergences requires aligning nodes and edges of the two graphs. We learn these alignments using a Gumbel-Sinkhorn permutation generator, additionally ensuring that the node and edge alignments are consistent with each other. Moreover, these alignments are cognizant of both the presence and absence of edges between node-pairs. Experiments on several datasets, under a variety of edit cost settings, show that GRAPHEDX consistently outperforms state-of-the-art methods and heuristics in terms of prediction error.
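As background for the alignment step, here is a minimal sketch of the Gumbel-Sinkhorn operator mentioned above: perturb a score matrix with Gumbel noise, then alternate row and column normalization in log space to obtain an approximately doubly stochastic (soft permutation) matrix. The temperature and iteration count are illustrative choices.

```python
import torch

def gumbel_sinkhorn(scores: torch.Tensor, tau: float = 1.0, n_iters: int = 20):
    """Soft permutation from an (n x n) score matrix via Sinkhorn normalization."""
    u = torch.rand_like(scores).clamp_min(1e-9)
    gumbel = -torch.log(-torch.log(u))  # Gumbel(0, 1) noise
    log_p = (scores + gumbel) / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # row normalize
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)  # column normalize
    return log_p.exp()  # rows and columns each sum to ~1
```

Lowering `tau` pushes the output toward a hard permutation while keeping the operator differentiable.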
Authors: Mateo Espinosa Zarlenga, Swami Sankaranarayanan, Jerone T. A. Andrews, Zohreh Shams, Mateja Jamnik, Alice Xiang
Abstract: Deep neural networks trained via empirical risk minimisation often exhibit significant performance disparities across groups, particularly when group and task labels are spuriously correlated (e.g., "grassy background" and "cows"). Existing bias mitigation methods that aim to address this issue often either rely on group labels for training or validation, or require an extensive hyperparameter search. Such data and computational requirements hinder the practical deployment of these methods, especially when datasets are too large to be group-annotated, computational resources are limited, and models are trained through already complex pipelines. In this paper, we propose Targeted Augmentations for Bias Mitigation (TAB), a simple hyperparameter-free framework that leverages the entire training history of a helper model to identify spurious samples, and generate a group-balanced training set from which a robust model can be trained. We show that TAB improves worst-group performance without any group information or model selection, outperforming existing methods while maintaining overall accuracy.
Authors: Yuxin Jia, Youfang Lin, Jing Yu, Shuo Wang, Tianhao Liu, Huaiyu Wan
Abstract: Due to the recurrent structure of RNNs, the long information propagation path limits their ability to capture long-term dependencies and leads to gradient explosion/vanishing issues and inefficient sequential execution. Motivated by this, we propose a novel paradigm called Parallel Gated Network (PGN) as the new successor to RNN. PGN directly captures information from previous time steps through the designed Historical Information Extraction (HIE) layer and leverages gated mechanisms to select and fuse it with the current time step information. This reduces the information propagation path to $\mathcal{O}(1)$, effectively addressing the limitations of RNN. To enhance PGN's performance in long-range time series forecasting tasks, we propose a novel temporal modeling framework called Temporal PGN (TPGN). TPGN incorporates two branches to comprehensively capture the semantic information of time series. One branch utilizes PGN to capture long-term periodic patterns while preserving their local characteristics. The other branch employs patches to capture short-term information and aggregate the global representation of the series. TPGN achieves a theoretical complexity of $\mathcal{O}(\sqrt{L})$, ensuring efficiency in its operations. Experimental results on five benchmark datasets demonstrate the state-of-the-art (SOTA) performance and high efficiency of TPGN, further confirming the effectiveness of PGN as the new successor to RNN in long-range time series forecasting. The code is available in this repository: \url{https://github.com/Water2sea/TPGN}.
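The following is a deliberately speculative toy sketch of the paradigm as the abstract describes it: historical information is extracted from all previous steps in parallel (here a causal running mean stands in for the HIE layer, an assumption on our part) and fused with the current step through a learned gate, so the path from any past step to the present is $\mathcal{O}(1)$.

```python
import torch
import torch.nn as nn

class ToyPGNCell(nn.Module):
    """Toy stand-in for PGN: parallel history extraction + gated fusion."""
    def __init__(self, d: int):
        super().__init__()
        self.hie = nn.Linear(d, d)       # stand-in for the HIE layer
        self.gate = nn.Linear(2 * d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, T, d)
        steps = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        hist = self.hie(torch.cumsum(x, dim=1) / steps)   # causal running mean
        g = torch.sigmoid(self.gate(torch.cat([x, hist], dim=-1)))
        return g * hist + (1 - g) * x    # gated selection and fusion
```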
Authors: Luiz Felipe Vecchietti, Minji Lee, Begench Hangeldiyev, Hyunkyu Jung, Hahnbeom Park, Tae-Kyun Kim, Meeyoung Cha, Ho Min Kim
Abstract: Recent advancements in machine learning (ML) are transforming the field of structural biology. For example, AlphaFold, a groundbreaking neural network for protein structure prediction, has been widely adopted by researchers. The availability of easy-to-use interfaces and interpretable outcomes from the neural network architecture, such as the confidence scores used to color the predicted structures, have made AlphaFold accessible even to non-ML experts. In this paper, we present various methods for representing protein 3D structures from low- to high-resolution, and show how interpretable ML methods can support tasks such as predicting protein structures, protein function, and protein-protein interactions. This survey also emphasizes the significance of interpreting and visualizing ML-based inference for structure-based protein representations that enhance interpretability and knowledge discovery. Developing such interpretable approaches promises to further accelerate fields including drug development and protein design.
Authors: Diego Cajaraville-Aboy, Ana Fern\'andez-Vilas, Rebeca P. D\'iaz-Redondo, Manuel Fern\'andez-Veiga
Abstract: Federated Learning (FL) emerges as a distributed machine learning approach that addresses privacy concerns by training AI models locally on devices. Decentralized Federated Learning (DFL) extends the FL paradigm by eliminating the central server, thereby enhancing scalability and robustness through the avoidance of a single point of failure. However, DFL faces significant challenges in optimizing security, as most Byzantine-robust algorithms proposed in the literature are designed for centralized scenarios. In this paper, we present a novel Byzantine-robust aggregation algorithm to enhance the security of Decentralized Federated Learning environments, coined WFAgg. This proposal simultaneously handles the adverse conditions of dynamic decentralized topologies and strengthens their robustness by employing multiple filters to identify and mitigate Byzantine attacks. Experimental results demonstrate the effectiveness of the proposed algorithm in maintaining model accuracy and convergence in the presence of various Byzantine attack scenarios, outperforming state-of-the-art centralized Byzantine-robust aggregation schemes (such as Multi-Krum or Clustering). These algorithms are evaluated on an IID image classification problem in both centralized and decentralized scenarios.
Authors: Harsh Yadav, Maximilian Schaefer, Kun Zhao, Tobias Meisen
Abstract: Motion prediction is an important aspect for Autonomous Driving (AD) and Advanced Driver Assistance Systems (ADAS). Current state-of-the-art motion prediction methods rely on High Definition (HD) maps for capturing the surrounding context of the ego vehicle. Such systems lack scalability in real-world deployment as HD maps are expensive to produce and update in real time. To overcome this issue, we propose Context Aware Scene Prediction Transformer (CASPFormer), which can perform multi-modal motion prediction from rasterized Bird-Eye-View (BEV) images. Our system can be integrated with any upstream perception module that is capable of generating BEV images. Moreover, CASPFormer directly decodes vectorized trajectories without any postprocessing. Trajectories are decoded recurrently using deformable attention, as it is computationally efficient and provides the network with the ability to focus its attention on the important spatial locations of the BEV images. In addition, we also address the issue of mode collapse for generating multiple scene-consistent trajectories by incorporating learnable mode queries. We evaluate our model on the nuScenes dataset and show that it reaches state-of-the-art performance across multiple metrics.
Authors: Indu Solomon, Aye Phyu Phyu Aung, Uttam Kumar, Senthilnath Jayavelu
Abstract: Continual learning (CL) adapts deep learning models to scenarios with continually updated datasets. However, existing CL models suffer from the catastrophic forgetting issue, where new knowledge replaces past learning. In this paper, we propose Continual Learning with Task Specialists (CLTS) to address the issues of catastrophic forgetting and limited labelled data in real-world datasets by performing class incremental learning of the incoming stream of data. The model consists of Task Specialists (TS) and a Task Predictor (TP) with a pre-trained Stable Diffusion (SD) module. Here, we introduce a new specialist to handle each new task sequence, and each TS has three blocks: i) a variational autoencoder (VAE) to learn the task distribution in a low-dimensional latent space, ii) a K-Means block to perform data clustering, and iii) a Bootstrapping Language-Image Pre-training (BLIP) model to generate a small batch of captions from the input data. These captions are fed as input to the pre-trained SD model for the generation of task samples. The proposed model does not store any task samples for replay; instead, it uses generated samples from SD to train the TP module. A comparison study with four SOTA models conducted on three real-world datasets shows that the proposed model outperforms all the selected baselines.
Authors: Yakir Yehuda, Kira Radinsky
Abstract: In the realm of artificial intelligence, the generation of realistic training data for supervised learning tasks presents a significant challenge. This is particularly true in the synthesis of electrocardiograms (ECGs), where the objective is to develop a synthetic 12-lead ECG model. The primary complexity of this task stems from accurately modeling the intricate biological and physiological interactions among different ECG leads. Although mathematical process simulators have shed light on these dynamics, effectively incorporating this understanding into generative models is not straightforward. In this work, we introduce an innovative method that employs ordinary differential equations (ODEs) to enhance the fidelity of generating 12-lead ECG data. This approach integrates a system of ODEs that represent cardiac dynamics directly into the generative model's optimization process, allowing for the production of biologically plausible ECG training data that authentically reflects real-world variability and inter-lead dependencies. We conducted an empirical analysis of thousands of ECGs and found that incorporating cardiac simulation insights into the data generation process significantly improves the accuracy of heart abnormality classifiers trained on this synthetic 12-lead ECG data.
Authors: Hui-Po Wang, Mario Fritz
Abstract: Despite the widespread use of statistical prior models in various fields, such models for neural network gradients have long been overlooked. The inherent challenge stems from their high-dimensional structures and complex interdependencies, which complicate effective modeling. In this work, we demonstrate the potential of large language models (LLMs) to act as gradient priors in a zero-shot setting. We examine the property by considering lossless gradient compression -- a critical application in distributed learning -- that depends heavily on precise probability modeling. To achieve this, we introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Our technique converts plain gradients into text-like formats, enhancing token efficiency by up to 38 times compared to their plain representations. We ensure that this data conversion maintains a close alignment with the structure of plain gradients and the symbols commonly recognized by LLMs. Our experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods, improving compression rates by 10\% up to 17.2\% across various datasets and architectures. Additionally, our approach shows promising compatibility with lossy compression techniques such as quantization and sparsification. These findings highlight the significant potential of LLMs as a model for effectively handling gradients. We will release the source code upon publication.
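As a rough illustration of the data-conversion step, the sketch below serializes a gradient tensor into space-separated hexadecimal tokens; the grouping and separator are our assumptions, and the paper's actual format and the downstream LLM-driven arithmetic coder are not shown.

```python
import numpy as np

def serialize_gradients(grad, sep=" "):
    """Hypothetical text-like serialization in the spirit of LM-GC: cast the
    gradient's raw bytes to hexadecimal tokens that an LLM can assign
    probabilities to for arithmetic coding."""
    raw = np.asarray(grad, dtype=np.float32).tobytes()
    return sep.join(f"{b:02x}" for b in raw)

# Two float32 gradients become eight hex tokens, one per byte.
print(serialize_gradients([0.0123, -0.5]))
```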
Authors: R. Gallon, F. Schiemenz, A. Krstova, A. Menicucci, E. Gill
Abstract: In the framework of Failure Detection, Isolation and Recovery (FDIR) on spacecraft, new AI-based approaches are emerging in the state of the art to overcome the limitations commonly imposed by traditional threshold checking. The present research aims at characterizing two different approaches to the problem of stuck values detection in multivariate time series coming from spacecraft attitude sensors. The analysis reveals the performance differences in the two approaches, while commenting on their interpretability and generalization to different scenarios.
Authors: Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang
Abstract: Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs. At its core, we introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization, effectively reducing data redundancy. Building on this, we implement an arbitrary precision matrix multiplication scheme that decomposes and recovers matrices at the bit level, enabling flexible precision while maximizing GPU Tensor Core utilization. Furthermore, we develop an efficient matrix preprocessing method that optimizes data layout for subsequent computations. Finally, we design a data recovery-oriented memory management system that strategically utilizes fast shared memory, significantly enhancing kernel execution speed and minimizing memory access latency. Experimental results demonstrate our approach's effectiveness, with up to 13\times speedup in matrix multiplication compared to NVIDIA's CUTLASS. When integrated into LLMs, we achieve up to 6.7\times inference acceleration. These improvements significantly enhance LLM inference efficiency, enabling broader and more responsive applications of LLMs.
Authors: Joseph Massingham, Ole Nielsen, Tore Butlin
Abstract: Predicting the response of nonlinear dynamical systems subject to random, broadband excitation is important across a range of scientific disciplines, such as structural dynamics and neuroscience. Building data-driven models requires experimental measurements of the system input and output, but it can be difficult to determine whether inaccuracies in the model stem from modelling errors or noise. This paper presents a novel method to identify the causal component of the input-output data from measurements of a system in the presence of output noise, as a function of frequency, without needing a high fidelity model. An output prediction, calculated using an available model, is optimally combined with noisy measurements of the output to predict the input to the system. The parameters of the algorithm balance the two output signals and are utilised to calculate a nonlinear coherence metric as a measure of causality. This method is applicable to a broad class of nonlinear dynamical systems. There are currently no solutions to this problem in the absence of a complete benchmark model.
Authors: Chao Min, Yijia Wang, Bo Zhang, Xin Ma, Junyi Cui
Abstract: Accurate power load forecasting is crucial for improving energy efficiency and ensuring power supply quality. The power load forecasting problem involves not only dynamic factors, such as historical load variations, but also static factors, such as climate conditions that remain constant over specific periods. From the model-agnostic perspective, this paper proposes a parallel structure network to extract important information from both dynamic and static data. Firstly, based on complexity learning theory, it is demonstrated that models integrated through parallel structures exhibit superior generalization abilities compared to individual base learners. Additionally, the higher the independence between base learners, the stronger the generalization ability of the parallel structure model. This suggests that the structure of machine learning models inherently contains significant information. Building on this theoretical foundation, a parallel convolutional neural network (CNN)-gated recurrent unit (GRU) attention model (PCGA) is employed to address the power load forecasting issue, aiming to effectively integrate the influences of dynamic and static features. The CNN module is responsible for capturing spatial characteristics from static data, while the GRU module captures long-term dependencies in dynamic time series data. The attention layer is designed to focus on key information from the spatial-temporal features extracted by the parallel CNN-GRU. To substantiate the advantages of the parallel structure model in extracting and integrating multi-source information, a series of experiments are conducted.
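A hedged sketch of such a parallel structure is shown below: a CNN branch for static inputs and a GRU branch for dynamic series, fused through attention. All dimensions, the fusion-by-concatenation choice, and the pooling are our assumptions rather than the paper's exact PCGA configuration.

```python
import torch
import torch.nn as nn

class PCGASketch(nn.Module):
    """Illustrative parallel CNN-GRU-attention forecaster (hypothetical sizes)."""
    def __init__(self, static_dim, dyn_dim, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(static_dim, hidden, 3, padding=1), nn.ReLU())
        self.gru = nn.GRU(dyn_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, static_x, dyn_x):
        # static_x: (B, static_dim, S) climate-like features; dyn_x: (B, T, dyn_dim) load history
        s = self.cnn(static_x).transpose(1, 2)    # (B, S, hidden) spatial characteristics
        d, _ = self.gru(dyn_x)                    # (B, T, hidden) long-term dependencies
        z = torch.cat([s, d], dim=1)              # fuse the two parallel branches
        z, _ = self.attn(z, z, z)                 # attend to key spatial-temporal features
        return self.head(z.mean(dim=1))           # next-step load forecast
```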
Authors: Konstantinos Skianis, Giannis Nikolentzos, Michalis Vazirgiannis
Abstract: Large language models (LLMs) have recently achieved remarkable success in various reasoning tasks in the field of natural language processing. This success of LLMs has also motivated their use in graph-related tasks. Among others, recent work has explored whether LLMs can solve graph problems such as counting the number of connected components of a graph or computing the shortest path distance between two nodes. Although LLMs possess preliminary graph reasoning abilities, they might still struggle to solve some seemingly simple problems. In this paper, we investigate whether prompting via pseudo-code instructions can improve the performance of LLMs in solving graph problems. Our experiments demonstrate that using pseudo-code instructions generally improves the performance of all considered LLMs. The graphs, pseudo-code prompts, and evaluation code are publicly available.
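For flavor, here is one way such a pseudo-code prompt could be assembled; the template and pseudo-code below are our own toy example, not the authors' released prompts.

```python
def graph_prompt(edges, question):
    """Build a hypothetical pseudo-code-style prompt for a graph question."""
    pseudo = (
        "function CONNECTED_COMPONENTS(G):\n"
        "    mark all nodes unvisited; count = 0\n"
        "    for each node v in G: if v is unvisited: BFS from v; count += 1\n"
        "    return count\n"
    )
    edge_list = ", ".join(f"({u},{v})" for u, v in edges)
    return (f"Graph edges: {edge_list}\n"
            f"Follow this pseudo-code step by step:\n{pseudo}"
            f"Question: {question}")

print(graph_prompt([(0, 1), (2, 3)], "How many connected components are there?"))
```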
Authors: Biplov Paneru, Bishwash Paneru, DP Sharma Mainali
Abstract: The Remaining Useful Life (RUL) of a battery is an important parameter for knowing the battery's remaining life and its need for recharge. The goal of this research project is to develop machine learning-based models for the battery RUL dataset. Different ML models are developed to classify the RUL of the vehicle battery, and the IoT (Internet of Things) concept is simulated for automating the charging system and managing any arising faults. The graphs plotted depict the relationship between various vehicle parameters using the Blynk IoT platform. Results show that the CatBoost, Multi-Layer Perceptron (MLP), Gated Recurrent Unit (GRU), and hybrid models developed can classify RUL into three classes with over 99% accuracy. The data is fed through a tkinter GUI for simulating artificial intelligence (AI)-based charging, and with a pyserial backend, data can be entered into the Esp-32 microcontroller to enable charging and discharging based on the model's predictions. Also, with an IoT system, the charging can be disconnected, monitored, and analyzed for automation. The results show that an accuracy of 99% can be obtained with the MLP and CatBoost models, with similar accuracy on the GRU model; finally, relay-based triggering can be driven by the model's predictions to automate the charging and energy-saving mechanism. By showcasing an exemplary Blynk platform-based monitoring and automation setup, we further present innovative ways of monitoring parameters and automating the system.
Authors: Mathieu Bazinet, Valentina Zantedeschi, Pascal Germain
Abstract: The sample compression theory provides generalization guarantees for predictors that can be fully defined using a subset of the training dataset and a (short) message string, generally defined as a binary sequence. Previous works provided generalization bounds for the zero-one loss, which is restrictive, notably when applied to deep learning approaches. In this paper, we present a general framework for deriving new sample compression bounds that hold for real-valued losses. We empirically demonstrate the tightness of the bounds and their versatility by evaluating them on different types of models, e.g., neural networks and decision forests, trained with the Pick-To-Learn (P2L) meta-algorithm, which transforms the training method of any machine-learning predictor to yield sample-compressed predictors. In contrast to existing P2L bounds, ours are valid in the non-consistent case.
Authors: Boris Sedlak, Victor Casamayor Pujol, Andrea Morichetta, Praveen Kumar Donta, Schahram Dustdar
Abstract: The current IoT scenario is witnessing a constant increase in the volume of data, which is generated as continuous streams, calling for novel architectural and logical solutions for processing it. Moving data handling towards the edge of the computing spectrum guarantees better distribution of load and, in principle, lower latency and better privacy. However, managing such a structure is complex, especially when requirements, also referred to as Service Level Objectives (SLOs), specified by applications' owners and infrastructure managers need to be ensured. Despite the large number of proposals for Machine Learning (ML) based management solutions, researchers and practitioners still struggle to guarantee long-term prediction and control, and accurate troubleshooting. Therefore, we present a novel ML paradigm based on Active Inference (AIF) -- a concept from neuroscience that describes how the brain constantly predicts and evaluates sensory information to decrease long-term surprise. We implement and evaluate it in a heterogeneous real stream processing use case, where an AIF-based agent continuously optimizes the fulfillment of three SLOs for three autonomous driving services running on multiple devices. The agent uses causal knowledge to gradually develop an understanding of how its actions are related to requirements fulfillment, and which configurations to favor. Through this approach, our agent requires up to thirty iterations to converge to the optimal solution, showing the capability of offering accurate results in a short amount of time. Furthermore, thanks to AIF and its causal structures, our method guarantees full transparency on the decision making, making the interpretation of the results and the troubleshooting effortless.
Authors: Yannis Karmim, Marc Lafon, Rapha\"el Fournier S'niehotta, Nicolas Thome
Abstract: Fully connected Graph Transformers (GT) have rapidly become prominent in the static graph community as an alternative to Message-Passing models, which suffer from a lack of expressivity, oversquashing, and under-reaching. However, in a dynamic context, by interconnecting all nodes at multiple snapshots with self-attention, GT lose both structural and temporal information. In this work, we introduce Supra-LAplacian encoding for spatio-temporal TransformErs (SLATE), a new spatio-temporal encoding to leverage the GT architecture while keeping spatio-temporal information. Specifically, we transform Discrete Time Dynamic Graphs into multi-layer graphs and take advantage of the spectral properties of their associated supra-Laplacian matrix. Our second contribution explicitly models nodes' pairwise relationships with a cross-attention mechanism, providing an accurate edge representation for dynamic link prediction. SLATE outperforms numerous state-of-the-art methods based on Message-Passing Graph Neural Networks combined with recurrent models (e.g., LSTM), and Dynamic Graph Transformers, on 9 datasets. Code and instructions to reproduce our results will be open-sourced.
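For intuition, a common way to build a supra-Laplacian from a discrete-time dynamic graph is sketched below: stack the snapshot adjacency matrices block-diagonally and couple consecutive copies of each node with weight omega. This is a standard multilayer construction and the coupling scheme is our assumption; SLATE's exact variant may differ.

```python
import numpy as np
from scipy.linalg import block_diag

def supra_laplacian(snapshots, omega=1.0):
    """Supra-Laplacian of a sequence of n-node snapshot adjacency matrices:
    intra-layer edges on the block diagonal, inter-layer (temporal) edges of
    weight omega between consecutive copies of each node."""
    T, n = len(snapshots), snapshots[0].shape[0]
    A = block_diag(*snapshots).astype(float)
    for t in range(T - 1):
        i, j = t * n, (t + 1) * n
        A[i:i + n, j:j + n] += omega * np.eye(n)
        A[j:j + n, i:i + n] += omega * np.eye(n)
    return np.diag(A.sum(axis=1)) - A             # L = D - A

snaps = [np.array([[0, 1], [1, 0]]), np.array([[0, 0], [0, 0]])]
spectrum = np.linalg.eigvalsh(supra_laplacian(snaps))  # spectral features for the encoding
```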
Authors: Andres Felipe Lerma-Pineda, Philipp Petersen, Simon Frieder, Thomas Lukasiewicz
Abstract: We study the problem of approximating and estimating classification functions that have their decision boundary in the $RBV^2$ space. Functions of $RBV^2$ type arise naturally as solutions of regularized neural network learning problems and neural networks can approximate these functions without the curse of dimensionality. We modify existing results to show that every $RBV^2$ function can be approximated by a neural network with bounded weights. Thereafter, we prove the existence of a neural network with bounded weights approximating a classification function, and we leverage these bounds to quantify the estimation rates. Finally, we present a numerical study that analyzes the effect of different regularity conditions on the decision boundaries.
Authors: Jialin Li, Marta Zagorowska, Giulia De Pasquale, Alisa Rupenyan, John Lygeros
Abstract: Ensuring safety is a key aspect in sequential decision making problems, such as robotics or process control. The complexity of the underlying systems often makes finding the optimal decision challenging, especially when the safety-critical system is time-varying. To overcome the problem of optimizing an unknown time-varying reward subject to unknown time-varying safety constraints, we propose TVSafeOpt, a new algorithm built on Bayesian optimization with a spatio-temporal kernel. The algorithm is capable of safely tracking a time-varying safe region without the need for explicit change detection. Optimality guarantees are also provided for the algorithm when the optimization problem becomes stationary. We show that TVSafeOpt compares favorably against SafeOpt on synthetic data, both regarding safety and optimality. Evaluation on a realistic case study with gas compressors confirms that TVSafeOpt ensures safety when solving time-varying optimization problems with unknown reward and safety functions.
Authors: Yuan Mi, Hao Sun
Abstract: Data-driven simulation of physical systems has recently kindled significant attention, where many neural models have been developed. In particular, mesh-based graph neural networks (GNNs) have demonstrated significant potential in predicting spatiotemporal dynamics across arbitrary geometric domains. However, the existing node-edge message passing mechanism in GNNs limits the model's representation learning ability. In this paper, we propose a cell-embedded GNN model (aka CeGNN) to learn spatiotemporal dynamics with lifted performance. Specifically, we introduce a learnable cell attribution to the node-edge message passing process, which better captures the spatial dependency of regional features. Such a strategy essentially upgrades the local aggregation scheme from the first order (e.g., from edge to node) to a higher order (e.g., from volume to edge and then to node), which takes advantage of volumetric information in message passing. Meanwhile, a novel feature-enhanced block is designed to further improve the performance of CeGNN and alleviate the over-smoothing problem, via treating the latent features as basis functions. The extensive experiments on various PDE systems and one real-world dataset demonstrate that CeGNN achieves superior performance compared with other baseline models, particularly reducing the prediction error by up to one order of magnitude on several PDE systems.
Authors: Jakub {\L}ucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tram\`er, Javier Rando
Abstract: Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and making them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
Authors: Jiayu Yao, Weiwei Pan, Finale Doshi-Velez, Barbara E Engelhardt
Abstract: In this work, we study an inverse reinforcement learning (IRL) problem where the experts are planning under a shared reward function but with different, unknown planning horizons. Without the knowledge of discount factors, the reward function has a larger feasible solution set, which makes it harder for existing IRL approaches to identify a reward function. To overcome this challenge, we develop algorithms that can learn a global multi-agent reward function with agent-specific discount factors that reconstruct the expert policies. We characterize the feasible solution space of the reward function and discount factors for both algorithms and demonstrate the generalizability of the learned reward function across multiple domains.
Authors: Francesco Mori, Stefano Sarao Mannelli, Francesca Mignacco
Abstract: Artificial neural networks often struggle with catastrophic forgetting when learning multiple tasks sequentially, as training on new tasks degrades the performance on previously learned ones. Recent theoretical work has addressed this issue by analysing learning curves in synthetic frameworks under predefined training protocols. However, these protocols relied on heuristics and lacked a solid theoretical foundation assessing their optimality. In this paper, we fill this gap combining exact equations for training dynamics, derived using statistical physics techniques, with optimal control methods. We apply this approach to teacher-student models for continual learning and multi-task problems, obtaining a theory for task-selection protocols maximising performance while minimising forgetting. Our theoretical analysis offers non-trivial yet interpretable strategies for mitigating catastrophic forgetting, shedding light on how optimal learning protocols can modulate established effects, such as the influence of task similarity on forgetting. Finally, we validate our theoretical findings on real-world data.
Authors: Theo Larcher, Lukas Picek, Benjamin Deneu, Titouan Lorieul, Maximilien Servajean, Alexis Joly
Abstract: This paper describes a deep-SDM framework, MALPOLON. Written in Python and built upon the PyTorch library, this framework aims to facilitate the training, inference, and sharing of deep species distribution models (deep-SDMs) for users with only general Python language skills (e.g., modeling ecologists) who are interested in testing deep learning approaches to build new SDMs. More advanced users can also benefit from the framework's modularity to run more specific experiments by overriding existing classes, while taking advantage of press-button examples to train neural networks on multiple classification tasks using custom or provided raw and pre-processed datasets. The framework is open-sourced on GitHub and PyPi along with extensive documentation and examples of use in various scenarios. MALPOLON offers straightforward installation, YAML-based configuration, parallel computing, multi-GPU utilization, baseline and foundational models for benchmarking, and extensive tutorials/documentation, aiming to enhance accessibility and performance scalability for ecologists and researchers.
Authors: Parth Ganeriwala, Siddhartha Bhattacharyya, Raja Muthalagu
Abstract: Transfer learning has become one of the standard methods to overcome the isolated learning paradigm by utilizing knowledge acquired for one task to solve related ones. However, research is needed to identify the initial steps required before applying transfer learning to an application so that the transfer can be further verified and explained. In this research, we have performed cross-dataset analysis and network architecture repair for the lane detection application in autonomous vehicles. Lane detection is an important aspect of the driving assistance systems of autonomous vehicles. In most circumstances, modern deep-learning-based lane recognition systems are successful, but they struggle with lanes with complex topologies. The proposed architecture, ERFCondLaneNet, is an enhancement of the CondLaneNet lane-detection framework that addresses the difficulty of detecting lane lines with complex topologies such as dense, curved, and fork lines. The proposed technique was tested on two common lane detection benchmarks, CULane and CurveLanes, and with two different backbones, ResNet and ERFNet. ERFCondLaneNet exhibited performance similar to ResnetCondLaneNet while using 33% fewer features, resulting in a 46% reduction in model size.
Authors: Qing Li, Jinxing Hua, Qiuxia Sun
Abstract: In this paper, we propose a Q-learning based decision-making framework to improve the safety and efficiency of Autonomous Vehicles when they encounter other, maliciously behaving vehicles while passing through unsignalized intersections. In conventional designs, reward signals are set as regular rewards over feedback factors such as safety and efficiency. In this paper, safety gains are modulated by variable weighting parameters to ensure that safety can be emphasized more in emergency situations. The proposed framework introduces first-order theory of mind inferences on top of conventional rewards, using first-order beliefs as additional reward signals. The decision framework enables Autonomous Vehicles to make informed decisions when encountering vehicles with potentially malicious behaviors at unsignalized intersections, thereby improving the overall safety and efficiency of Autonomous Vehicle transportation systems. To verify the performance of the decision framework, this paper uses Prescan/Simulink co-simulation, and the results show that the performance of the decision framework can meet the set requirements.
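As a toy sketch of the reward shaping described above, the snippet below folds a variable safety weight and a first-order belief term into a standard Q-learning update; the state/action encoding, weight schedule, and belief penalty are illustrative assumptions, not the paper's exact design.

```python
from collections import defaultdict

Q = defaultdict(float)          # tabular Q-values, for the sketch only
alpha, gamma = 0.1, 0.95

def reward(efficiency, safety, belief_other_is_malicious, emergency):
    # Variable weighting: emphasize safety more in emergency situations.
    w_safety = 5.0 if emergency else 1.0
    # The first-order theory-of-mind belief enters as an additional reward signal.
    return efficiency + w_safety * safety - 2.0 * belief_other_is_malicious

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

actions = ["yield", "proceed"]
q_update(("near_intersection", "ego_fast"), "proceed",
         reward(efficiency=1.0, safety=-0.5,
                belief_other_is_malicious=0.8, emergency=True),
         ("in_intersection", "ego_fast"), actions)
```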
Authors: Andrew Starnes, Clayton Webster
Abstract: In this effort, we propose using Mamba for handling tabular data in personalized recommendation systems. We present \textit{FT-Mamba} (Feature Tokenizer\,$+$\,Mamba), a novel hybrid model that replaces the Transformer layers within the FT-Transformer architecture with Mamba layers. The \textit{Mamba model} offers an efficient alternative to Transformers, reducing computational complexity from quadratic to linear by enhancing the capabilities of State Space Models (SSMs). FT-Mamba is designed to improve the scalability and efficiency of recommendation systems while maintaining performance. We evaluate FT-Mamba in comparison to a traditional Transformer-based model within a Two-Tower architecture on three datasets: Spotify music recommendation, H\&M fashion recommendation, and vaccine messaging recommendation. Each model is trained on 160,000 user-action pairs, and performance is measured using precision (P), recall (R), Mean Reciprocal Rank (MRR), and Hit Ratio (HR) at several truncation values. Our results demonstrate that FT-Mamba outperforms the Transformer-based model in terms of computational efficiency while maintaining or exceeding performance across key recommendation metrics. By leveraging Mamba layers, FT-Mamba provides a scalable and effective solution for large-scale personalized recommendation systems, showcasing the potential of the Mamba architecture to enhance both efficiency and accuracy.
Authors: Shashidhar Reddy Javaji, Zining Zhu
Abstract: Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel framework that evaluates this capability. The framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person facing the statement for the first time. We score the qualities of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct levels of difficulty, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen's kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model's knowledge acquisition potential. The proposed framework quantifies a critical model capability that was commonly overlooked and opens up research opportunities for developing more knowledgeable AI systems.
Authors: Diego Marcos, Robert van de Vlasakker, Ioannis N. Athanasiadis, Pierre Bonnet, Herv\'e Goeau, Alexis Joly, W. Daniel Kissling, C\'esar Leblanc, Andr\'e S. J. van Proosdij, Konstantinos P. Panousis
Abstract: Plant morphological traits, their observable characteristics, are fundamental to understanding the role played by each species within their ecosystem. However, compiling trait information for even a moderate number of species is a demanding task that may take experts years to accomplish. At the same time, massive amounts of information about species descriptions are available online in the form of text, although the lack of structure makes this source of data impossible to use at scale. To overcome this, we propose to leverage recent advances in large language models (LLMs) and devise a mechanism for gathering and processing information on plant traits in the form of unstructured textual descriptions, without manual curation. We evaluate our approach by automatically replicating three manually created species-trait matrices. Our method managed to find values for over half of all species-trait pairs, with an F1-score of over 75%. Our results suggest that large-scale creation of structured trait databases from unstructured online text is currently feasible thanks to the information extraction capabilities of LLMs, being limited only by the availability of textual descriptions covering all the traits of interest.
Authors: V. Lanzetta
Abstract: The literature highlights that financial time series data pose significant challenges for accurate stock price prediction, because these data are characterized by noise and susceptibility to news. Traditional statistical methodologies make assumptions, such as linearity and normality, which are not suitable for the non-linear nature of financial time series; machine learning methodologies, on the other hand, are able to capture the non-linear relationships in the data. To date, neural networks are considered the main machine learning tool for financial price prediction. Transfer learning, as a method aimed at transferring knowledge from source tasks to target tasks, can be a very useful methodological tool for improving financial prediction capability. Current reviews on this body of knowledge are mainly focused on neural network architectures for financial prediction, with very little emphasis on the transfer learning methodology; thus, this paper goes deeper into the topic by developing a systematic review of the application of transfer learning to financial market prediction and of the challenges and potential future directions of transfer learning methodologies for stock market prediction.
Authors: Duong Thuy Anh Nguyen, Su Wang, Duong Tung Nguyen, Angelia Nedich, H. Vincent Poor
Abstract: We investigate the problem of agent-to-agent interaction in decentralized (federated) learning over time-varying directed graphs, and, in doing so, propose a consensus-based algorithm called DSGTm-TV. The proposed algorithm incorporates gradient tracking and heavy-ball momentum to distributively optimize a global objective function, while preserving local data privacy. Under DSGTm-TV, agents will update local model parameters and gradient estimates using information exchange with neighboring agents enabled through row- and column-stochastic mixing matrices, which we show guarantee both consensus and optimality. Our analysis establishes that DSGTm-TV exhibits linear convergence to the exact global optimum when exact gradient information is available, and converges in expectation to a neighborhood of the global optimum when employing stochastic gradients. Moreover, in contrast to existing methods, DSGTm-TV preserves convergence for networks with uncoordinated stepsizes and momentum parameters, for which we provide explicit bounds. These results enable agents to operate in a fully decentralized manner, independently optimizing their local hyper-parameters. We demonstrate the efficacy of our approach via comparisons with state-of-the-art baselines on real-world image classification and natural language processing tasks.
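For intuition, methods in this family combine consensus over row-stochastic mixing ($R^k$), gradient tracking over column-stochastic mixing ($C^k$), and heavy-ball momentum; a generic sketch of such a recursion, with uncoordinated stepsizes $\alpha_i$ and momentum parameters $\beta_i$, reads
\[
x_i^{k+1} = \sum_{j} R_{ij}^k x_j^k - \alpha_i y_i^k + \beta_i \big(x_i^k - x_i^{k-1}\big),
\qquad
y_i^{k+1} = \sum_{j} C_{ij}^k y_j^k + \nabla f_i\big(x_i^{k+1}\big) - \nabla f_i\big(x_i^k\big),
\]
with $y_i^0 = \nabla f_i(x_i^0)$, where $x_i^k$ is agent $i$'s model estimate and $y_i^k$ tracks the network-average gradient. This is a sketch of the general template only, not the paper's exact DSGTm-TV recursion or notation.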
Authors: Guanyi Mou, Kyumin Lee
Abstract: With the widespread use of online social networks, hate speech is spreading faster and causing more damage than ever before. Existing hate speech detection methods have limitations in several aspects, such as handling data insufficiency, estimating model uncertainty, improving robustness against malicious attacks, and handling unintended bias (i.e., fairness). There is an urgent need for accurate, robust, and fair hate speech classification in online social networks. To bridge the gap, we design a data-augmented, fairness-addressed, and uncertainty-estimated novel framework. As part of the framework, we propose Bidirectional Quaternion-Quasi-LSTM layers to balance effectiveness and efficiency. To build a generalized model, we combine five datasets collected from three platforms. Experiment results show that our model outperforms eight state-of-the-art methods under both the no-attack scenario and various attack scenarios, indicating the effectiveness and robustness of our model. We share our code along with the combined dataset for better future research.
Authors: Haleh Hayati, Carlos Murguia, Nathan van de Wouw
Abstract: Federated learning (FL) has emerged as a method to preserve privacy in collaborative distributed learning. In FL, clients train AI models directly on their devices rather than sharing data with a centralized server, which can pose privacy risks. However, it has been shown that despite FL's partial protection of local data privacy, information about clients' data can still be inferred from shared model updates during training. In recent years, several privacy-preserving approaches have been developed to mitigate this privacy leakage in FL, though they often provide privacy at the cost of model performance or system efficiency. Balancing these trade-offs presents a significant challenge in implementing FL schemes. In this manuscript, we introduce a privacy-preserving FL framework that combines differential privacy and system immersion tools from control theory. The core idea is to treat the optimization algorithms used in standard FL schemes (e.g., gradient-based algorithms) as a dynamical system that we seek to immerse into a higher-dimensional system (referred to as the target optimization algorithm). The target algorithm's dynamics are designed such that, first, the model parameters of the original algorithm are immersed in its parameters; second, it operates on distorted parameters; and third, it converges to an encoded version of the true model parameters from the original algorithm. These encoded parameters can then be decoded at the server to retrieve the original model parameters. We demonstrate that the proposed privacy-preserving scheme can be tailored to offer any desired level of differential privacy for both local and global model parameters, while maintaining the same accuracy and convergence rate as standard FL algorithms.
Authors: Tommie Kerssies, Daan de Geus, Gijs Dubbelman
Abstract: In this report, we present our solution for Track 1 of the 2024 BRAVO Challenge, where a model is trained on Cityscapes and its robustness is evaluated on several out-of-distribution datasets. Our solution leverages the powerful representations learned by vision foundation models, by attaching a simple segmentation decoder to DINOv2 and fine-tuning the entire model. This approach outperforms more complex existing approaches, and achieves 1st place in the challenge. Our code is publicly available at https://github.com/tue-mps/benchmark-vfm-ss.
Authors: Shunyuan Mao, Ruobing Dong, Kwang Moo Yi, Lu Lu, Sifan Wang, Paris Perdikaris
Abstract: We introduce Disk2Planet, a machine learning-based tool to infer key parameters in disk-planet systems from observed protoplanetary disk structures. Disk2Planet takes as input the disk structures in the form of two-dimensional density and velocity maps, and outputs disk and planet properties, that is, the Shakura--Sunyaev viscosity, the disk aspect ratio, the planet--star mass ratio, and the planet's radius and azimuth. We integrate the Covariance Matrix Adaptation Evolution Strategy (CMA--ES), an evolutionary algorithm tailored for complex optimization problems, and the Protoplanetary Disk Operator Network (PPDONet), a neural network designed to predict solutions of disk--planet interactions. Our tool is fully automated and can retrieve parameters in one system in three minutes on an Nvidia A100 graphics processing unit. We empirically demonstrate that our tool achieves percent-level or higher accuracy, and is able to handle missing data and unknown levels of noise.
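To show the outer optimization loop concretely, here is a sketch using the `cma` Python package; the five-dimensional parameter vector mirrors the quantities listed above, but the objective below is a made-up placeholder for the actual PPDONet-based comparison against the observed density and velocity maps.

```python
import cma
import numpy as np

def loss(params):
    """Hypothetical stand-in objective: distance between predicted and
    observed disk structures (the real pipeline queries PPDONet here)."""
    target = np.array([0.3, 0.5, 0.2, 0.7, 0.4])   # made-up 'true' parameters
    return float(np.sum((np.asarray(params) - target) ** 2))

# Five normalized parameters: viscosity, aspect ratio, mass ratio, radius, azimuth.
es = cma.CMAEvolutionStrategy(5 * [0.5], 0.3)
while not es.stop():
    candidates = es.ask()                      # sample new parameter vectors
    es.tell(candidates, [loss(c) for c in candidates])
print(es.result.xbest)                         # best-found parameters
```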
Authors: Debargha Ganguly, Srinivasan Iyengar, Vipin Chaudhary, Shivkumar Kalyanaraman
Abstract: Large Language Models (LLMs) have revolutionized natural language processing, yet they struggle with inconsistent reasoning, particularly in novel domains and complex logical sequences. This research introduces Proof of Thought, a framework that enhances the reliability and transparency of LLM outputs. Our approach bridges LLM-generated ideas with formal logic verification, employing a custom interpreter to convert LLM outputs into First Order Logic constructs for theorem prover scrutiny. Central to our method is an intermediary JSON-based Domain-Specific Language, which by design balances precise logical structures with intuitive human concepts. This hybrid representation enables both rigorous validation and accessible human comprehension of LLM reasoning processes. Key contributions include a robust type system with sort management for enhanced logical integrity, explicit representation of rules for clear distinction between factual and inferential knowledge, and a flexible architecture that allows for easy extension to various domain-specific applications. We demonstrate Proof of Thought's effectiveness through benchmarking on StrategyQA and a novel multimodal reasoning task, showing improved performance in open-ended scenarios. By providing verifiable and interpretable results, our technique addresses critical needs for AI system accountability and sets a foundation for human-in-the-loop oversight in high-stakes domains.
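As a toy illustration of the pipeline (a JSON-based DSL compiled into a theorem-prover query), here is a sketch using Z3's Python bindings; the DSL schema and rules shown are our invention and far simpler than the paper's typed language.

```python
from z3 import Bool, Implies, Not, Solver, sat

# Toy JSON-style program (hypothetical schema, propositional only).
program = {
    "facts": ["rains"],
    "rules": [{"if": "rains", "then": "ground_wet"}],
    "query": "ground_wet",
}

syms = {}
def atom(name):
    # One Boolean symbol per proposition name.
    return syms.setdefault(name, Bool(name))

s = Solver()
for fact in program["facts"]:
    s.add(atom(fact))
for rule in program["rules"]:
    s.add(Implies(atom(rule["if"]), atom(rule["then"])))

# The query is entailed iff the program plus NOT(query) is unsatisfiable.
s.add(Not(atom(program["query"])))
print("query entailed:", s.check() != sat)
```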
Authors: Pandiyaraju V, Shravan Venkatraman, Abeshek A, Aravintakshan S A, Pavan Kumar S, Madhan S
Abstract: Tumors in the brain result from abnormal cell growth within the brain tissue, arising from various types of brain cells. When left undiagnosed, they lead to severe neurological deficits such as cognitive impairment, motor dysfunction, and sensory loss. As the tumor grows, it causes an increase in intracranial pressure, potentially leading to life-threatening complications such as brain herniation. Therefore, early detection and treatment are necessary to manage the complications caused by such tumors to slow down their growth. Numerous works involving deep learning (DL) and artificial intelligence (AI) are being carried out to assist physicians in early diagnosis by utilizing the scans obtained through Magnetic Resonance Imaging (MRI). Our research proposes DL frameworks for localizing, segmenting, and classifying the grade of these gliomas from MRI images to solve this critical issue. In our localization framework, we enhance the LinkNet framework with a VGG19-inspired encoder architecture for improved multimodal tumor feature extraction, along with spatial and graph attention mechanisms to refine feature focus and inter-feature relationships. Following this, we integrated the SeResNet101 CNN model as the encoder backbone into the LinkNet framework for tumor segmentation, which achieved an IoU Score of 96%. To classify the segmented tumors, we combined the SeResNet152 feature extractor with an Adaptive Boosting classifier, which yielded an accuracy of 98.53%. Our proposed models demonstrated promising results, with the potential to advance medical AI by enabling early diagnosis and providing more accurate treatment options for patients.
Authors: Xun Xian, Ganghua Wang, Xuan Bi, Jayanth Srinivasa, Ashish Kundu, Charles Fleming, Mingyi Hong, Jie Ding
Abstract: Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains such as healthcare, finance, and legal contexts. Given a query, RAG retrieves relevant documents from a corpus and integrates them into the LLMs' generation process. In this study, we investigate the adversarial robustness of RAG, focusing specifically on examining the retrieval system. First, across 225 different setup combinations of corpus, retriever, query, and targeted information, we show that retrieval systems are vulnerable to universal poisoning attacks in medical Q\&A. In such attacks, adversaries generate poisoned documents containing a broad spectrum of targeted information, such as personally identifiable information. When these poisoned documents are inserted into a corpus, they can be accurately retrieved by any user, as long as attacker-specified queries are used. To understand this vulnerability, we discovered that the deviation from the query's embedding to that of the poisoned document tends to follow a pattern in which the high similarity between the poisoned document and the query is retained, thereby enabling precise retrieval. Based on these findings, we develop a new detection-based defense to ensure the safe use of RAG. Through extensive experiments spanning various Q\&A domains, we observed that our proposed method consistently achieves excellent detection rates in nearly all cases.
Authors: Yacouba Kaloga, Shakeel A. Sheikh, Ina Kodrasi
Abstract: Recently proposed automatic pathological speech detection approaches rely on spectrogram input representations or wav2vec2 embeddings. These representations may contain pathology irrelevant uncorrelated information, such as changing phonetic content or variations in speaking style across time, which can adversely affect classification performance. To address this issue, we propose to use Multiview Canonical Correlation Analysis (MCCA) on these input representations prior to automatic pathological speech detection. Our results demonstrate that unlike other dimensionality reduction techniques, the use of MCCA leads to a considerable improvement in pathological speech detection performance by eliminating uncorrelated information present in the input representations. Employing MCCA with traditional classifiers yields a comparable or higher performance than using sophisticated architectures, while preserving the representation structure and providing interpretability.
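As a concrete illustration of the dimensionality-reduction step, here is the two-view special case using scikit-learn's CCA; the paper uses the multiview generalization (MCCA) across more than two input representations, and the synthetic data and component count below are our assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 2))          # pathology-relevant shared signal
view1 = shared @ rng.normal(size=(2, 40)) + 0.5 * rng.normal(size=(200, 40))
view2 = shared @ rng.normal(size=(2, 60)) + 0.5 * rng.normal(size=(200, 60))

# Keep only directions that correlate across representations, discarding
# view-specific (uncorrelated) variation such as changing phonetic content.
cca = CCA(n_components=2)
z1, z2 = cca.fit_transform(view1, view2)    # low-dimensional inputs for a classifier
```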
Authors: Tongfei (Felicia) Guo, Taposh Banerjee, Rui Liu, Lili Su
Abstract: Trajectory prediction describes the motions of surrounding moving obstacles for an autonomous vehicle; it plays a crucial role in enabling timely decision-making, such as collision avoidance and trajectory replanning. Accurate trajectory planning is the key to reliable vehicle deployments in open-world environments, where unstructured obstacles bring in uncertainties that are impossible to fully capture by training data. For traditional machine learning tasks, such uncertainties are often addressed reasonably well via methods such as continual learning. On the one hand, naively applying those methods to trajectory prediction can result in continuous data collection and frequent model updates, which can be resource-intensive. On the other hand, the predicted trajectories can be far away from the true trajectories, leading to unsafe decision-making. In this paper, we aim to establish real-time awareness of out-of-distribution in trajectory prediction for autonomous vehicles. We focus on the challenging and practically relevant setting where the out-of-distribution is deceptive, that is, the one not easily detectable by human intuition. Drawing on the well-established techniques of sequential analysis, we build real-time awareness of out-of-distribution by monitoring prediction errors using the quickest change point detection (QCD). Our solutions are lightweight and can handle the occurrence of out-of-distribution at any time during trajectory prediction inference. Experimental results on multiple real-world datasets using a benchmark trajectory prediction model demonstrate the effectiveness of our methods.
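To make the monitoring idea concrete, below is a minimal CUSUM detector over a stream of prediction errors, assuming a Gaussian mean-shift model with unit variance; the paper's QCD formulations and thresholds may differ.

```python
def cusum(errors, mu0, mu1, threshold):
    """Declare out-of-distribution when the CUSUM statistic over prediction
    errors crosses the threshold; returns (alarm index or None, statistic path)."""
    s, path = 0.0, []
    for e in errors:
        # Log-likelihood ratio of post-change mean mu1 vs pre-change mean mu0.
        llr = (mu1 - mu0) * (e - (mu0 + mu1) / 2.0)
        s = max(0.0, s + llr)
        path.append(s)
        if s > threshold:
            return len(path) - 1, path
    return None, path

alarm, _ = cusum([0.1, 0.2, 0.1, 1.5, 1.8, 2.0], mu0=0.1, mu1=1.5, threshold=3.0)
print(alarm)   # index at which the error stream is flagged as out-of-distribution
```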
Authors: Cui Zhang, Wenjun Zhang, Qiong Wu, Pingyi Fan, Nan Cheng, Wen Chen, Khaled B. Letaief
Abstract: The Internet of Vehicles (IoV) network can address the issue of limited computing resources and data processing capabilities of individual vehicles, but it also brings the risk of privacy leakage to vehicle users. Applying blockchain technology can establish secure data links within the IoV, solving the problems of insufficient computing resources for each vehicle and the security of data transmission over the network. However, with the development of the IoV, the amount of data interaction between multiple vehicles and between vehicles and base stations, roadside units, etc., is continuously increasing. There is a need to further reduce the interaction volume, and intelligent data compression is key to solving this problem. The Variational Information Bottleneck (VIB) technique facilitates the training of encoding and decoding models, substantially diminishing the volume of data that needs to be transmitted. This paper introduces an innovative approach that integrates blockchain with VIB, referred to as BVIB, designed to lighten computational workloads and reinforce the security of the network. We first construct a new network framework by separating the encoding and decoding networks to address the computational burden issue, and then propose a new algorithm to enhance the security of IoV networks. We also discuss the impact of the data extraction rate on system latency to determine the most suitable data extraction rate. An experimental framework combining Python and C++ has been established to substantiate the efficacy of our BVIB approach. Comprehensive simulation studies indicate that BVIB consistently excels in comparison to alternative foundational methodologies.
Authors: Hanwen Huang
Abstract: Conditional generative models represent a significant advancement in the field of machine learning, allowing for the controlled synthesis of data by incorporating additional information into the generation process. In this work we introduce a novel Schr\"odinger bridge based deep generative method for learning conditional distributions. We start from a unit-time diffusion process governed by a stochastic differential equation (SDE) that transforms a fixed point at time $0$ into a desired target conditional distribution at time $1$. For effective implementation, we discretize the SDE with the Euler-Maruyama method, where we estimate the drift term nonparametrically using a deep neural network. We apply our method to both low-dimensional and high-dimensional conditional generation problems. The numerical studies demonstrate that though our method does not directly provide conditional density estimation, the samples generated by this method exhibit higher quality compared to those obtained by several existing methods. Moreover, the generated samples can be effectively utilized to estimate the conditional density and related statistical quantities, such as the conditional mean and conditional standard deviation.
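A minimal sketch of the Euler-Maruyama discretization described above, assuming unit diffusion and a generic torch module `drift_net(x, t)` standing in for the nonparametric neural drift estimator (the conditioning and training procedure are not shown):

```python
import torch

def euler_maruyama(drift_net, x0, n_steps=100):
    """Simulate X_{t+dt} = X_t + b(X_t, t) dt + sqrt(dt) * N(0, I) on [0, 1],
    starting from the fixed point x0 of shape (batch, dim)."""
    dt = 1.0 / n_steps
    x = x0.clone()
    for k in range(n_steps):
        t = torch.full((x.shape[0], 1), k * dt)
        x = x + drift_net(x, t) * dt + dt ** 0.5 * torch.randn_like(x)
    return x   # approximate draws from the target conditional distribution
```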
Authors: Rita Rocio Guzman-Lopez, Luis Huamanchumo, Kevin Fernandez, Oscar Cutipa-Luque, Yhon Tiahuallpa, Helder Rojas
Abstract: This study introduces a novel approach that integrates agricultural census data with remotely sensed time series to develop precise predictive models for paddy rice yield across various regions of Peru. By utilizing sparse regression and Elastic-Net regularization techniques, the study identifies causal relationships between key remotely sensed variables, such as NDVI, precipitation, and temperature, and agricultural yield. To further enhance prediction accuracy, the first- and second-order dynamic transformations (velocity and acceleration) of these variables are applied, capturing non-linear patterns and delayed effects on yield. The findings highlight the improved predictive performance when combining regularization techniques with climatic and geospatial variables, enabling more precise forecasts of yield variability. The results confirm the existence of causal relationships in the Granger sense, emphasizing the value of this methodology for strategic agricultural management. This contributes to more efficient and sustainable production in paddy rice cultivation.
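A hedged sketch of the feature construction and regularized fit, using `np.gradient` as one reasonable discretization of the velocity and acceleration transformations and random placeholder data in place of the study's census and remote-sensing series:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

def dynamic_features(X):
    """Augment each remotely sensed series with its first- and second-order
    dynamics (velocity and acceleration along the time axis)."""
    vel = np.gradient(X, axis=0)
    acc = np.gradient(vel, axis=0)
    return np.hstack([X, vel, acc])

# X: (n_periods, 3) columns for NDVI, precipitation, temperature; y: paddy yields.
X, y = np.random.rand(120, 3), np.random.rand(120)
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(dynamic_features(X), y)
selected = np.flatnonzero(model.coef_)   # the sparse fit keeps the key predictors
```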
Authors: Jean-Loup Tastet, Inar Timiryasov
Abstract: We present BabyLlama-2, a 345 million parameter model distillation-pretrained from two teachers on a 10 million word corpus for the BabyLM competition. On BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on both 10 and 100 million word datasets with the same data mix, as well as its teacher models. Through an extensive hyperparameter sweep, we demonstrate that the advantages of distillation cannot be attributed to suboptimal hyperparameter selection of the teachers. Our findings underscore the need for further investigation into distillation techniques, particularly in data-limited settings.
Authors: Ling Liang, Cameron Austin, Haizhao Yang
Abstract: Learning to Optimize (L2O) approaches, including algorithm unrolling, plug-and-play methods, and hyperparameter learning, have garnered significant attention and have been successfully applied to the Alternating Direction Method of Multipliers (ADMM) and its variants. However, the natural extension of L2O to multi-block ADMM-type methods remains largely unexplored. Such an extension is critical, as multi-block methods leverage the separable structure of optimization problems, offering substantial reductions in per-iteration complexity. Given that classical multi-block ADMM does not guarantee convergence, the Majorized Proximal Augmented Lagrangian Method (MPALM), which shares a similar form with multi-block ADMM and ensures convergence, is more suitable in this setting. Despite its theoretical advantages, MPALM's performance is highly sensitive to the choice of penalty parameters. To address this limitation, we propose a novel L2O approach that adaptively selects this hyperparameter using supervised learning. We demonstrate the versatility and effectiveness of our method by applying it to the Lasso problem and the optimal transport problem. Our numerical results show that the proposed framework outperforms popular alternatives. Given its applicability to generic linearly constrained composite optimization problems, this work opens the door to a wide range of potential real-world applications.
Authors: Jun Xiang, Jun Chen
Abstract: Predicting flight trajectories is a research area that holds significant merit. In this paper, we propose a data-driven learning framework that leverages the predictive and feature extraction capabilities of mixture models and seq2seq-based neural networks while addressing prevalent challenges caused by error propagation and dimensionality reduction. After training with this framework, the learned model can significantly improve long-step prediction accuracy given the past trajectories and the context information. The accuracy and effectiveness of the approach are evaluated by comparing the predicted trajectories with the ground truth. The results indicate that the proposed method has outperformed the state-of-the-art prediction methods on a terminal airspace flight trajectory dataset. The trajectories generated by the proposed method have a higher temporal resolution (1 timestep per second vs. 0.1 timesteps per second) and are closer to the ground truth.
Authors: Solmaz Seyed Monir, Irene Lau, Shubing Yang, Dongfang Zhao
Abstract: Traditional retrieval methods have been essential for assessing document similarity but struggle with capturing semantic nuances. Despite advancements in latent semantic analysis (LSA) and deep learning, achieving comprehensive semantic understanding and accurate retrieval remains challenging due to high dimensionality and semantic gaps. The above challenges call for new techniques to effectively reduce the dimensions and close the semantic gaps. To this end, we propose VectorSearch, which leverages advanced algorithms, embeddings, and indexing techniques for refined retrieval. By utilizing innovative multi-vector search operations and encoding searches with advanced language models, our approach significantly improves retrieval accuracy. Experiments on real-world datasets show that VectorSearch outperforms baseline metrics, demonstrating its efficacy for large-scale retrieval tasks.
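For flavor, a minimal dense-retrieval index with FAISS is sketched below; the embedding dimension, normalization choice, and flat inner-product index are illustrative assumptions, not VectorSearch's actual multi-vector pipeline.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                                       # hypothetical embedding dimension
docs = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(docs)                      # cosine similarity via inner product
index = faiss.IndexFlatIP(d)
index.add(docs)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)          # top-5 most similar documents
```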
Authors: Konstantinos Skianis, John Pavlopoulos, A. Seza Do\u{g}ru\"oz
Abstract: Large Language Models (LLMs) are increasingly integrated into various medical fields, including mental health support systems. However, there is a gap in research regarding the effectiveness of LLMs in non-English mental health support applications. To address this problem, we present a novel multilingual adaptation of widely-used mental health datasets, translated from English into six languages (Greek, Turkish, French, Portuguese, German, and Finnish). This dataset enables a comprehensive evaluation of LLM performance in detecting mental health conditions and assessing their severity across multiple languages. By experimenting with GPT and Llama, we observe considerable variability in performance across languages, despite being evaluated on the same translated dataset. This inconsistency underscores the complexities inherent in multilingual mental health support, where language-specific nuances and mental health data coverage can affect the accuracy of the models. Through comprehensive error analysis, we emphasize the risks of relying exclusively on large language models (LLMs) in medical settings (e.g., their potential to contribute to misdiagnoses). Moreover, our proposed approach offers significant cost savings for multilingual tasks, presenting a major advantage for broad-scale implementation.
Authors: Pengsheng Liu, Linan Zheng, Jiale Chen, Guangfa Zhang, Yang Xu, Jinyun Fang
Abstract: The historical interaction sequences of users play a crucial role in training recommender systems that can accurately predict user preferences. However, due to the arbitrariness of user behavior, the presence of noise in these sequences poses a challenge to predicting their next actions in recommender systems. To address this issue, our motivation is based on the observation that training noisy sequences and clean sequences (sequences without noise) with equal weights can degrade the performance of the model. We propose a novel self-supervised Auxiliary Task Joint Training (ATJT) method aimed at more accurately reweighting noisy sequences in recommender systems. Specifically, we strategically select subsets from users' original sequences and perform random replacements to generate artificially replaced noisy sequences. Subsequently, we perform joint training on these artificially replaced noisy sequences and the original sequences. Through effective reweighting, we incorporate the training results of the noise recognition model into the recommender model. We evaluate our method on three datasets using a consistent base model. Experimental results demonstrate the effectiveness of introducing a self-supervised auxiliary task to enhance the base model's performance.
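The artificial noisy-sequence construction can be sketched in a few lines: pick a random subset of positions in a user's history and replace them with random catalog items. The replacement fraction and selection strategy below are illustrative placeholders; the paper's selection strategy may be more involved.

    import random

    def make_noisy_sequence(seq, item_pool, frac=0.2, rng=None):
        """Create an artificially noised copy of a user's interaction
        sequence by replacing a random subset of items with random items
        from the catalog (sketch of the ATJT data construction)."""
        rng = rng or random.Random(0)
        seq = list(seq)
        n_replace = max(1, int(frac * len(seq)))
        for i in rng.sample(range(len(seq)), n_replace):
            seq[i] = rng.choice(item_pool)
        return seq

    history = [3, 17, 42, 8, 99]
    print(make_noisy_sequence(history, item_pool=list(range(1000))))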
Authors: Anirudh Tunga, Jordan Heim, Michael Mueterthies, Thomas Gruenwald, Jonathan Nistor
Abstract: Accurately capturing the three-dimensional power distribution within a reactor core is vital for ensuring the safe and economical operation of the reactor, compliance with Technical Specifications, and fuel cycle planning (safety, control, and performance evaluation). Offline (that is, during cycle planning and core design), a three-dimensional neutronics simulator is used to estimate the reactor's power, moderator, void, and flow distributions, from which margin to thermal limits and fuel exposures can be approximated. Online, this is accomplished with a system of local power range monitors (LPRMs) designed to capture enough neutron flux information to infer the full nodal power distribution. Certain problems with this process, ranging from measurement and calibration to the power adaptation process, pose challenges to operators and limit the ability to design reload cores economically (e.g., engineering in insufficient margin or more margin than required). Artificial intelligence (AI) and machine learning (ML) are being used to solve these problems, reducing maintenance costs, improving the accuracy of online local power measurements, and decreasing the bias between offline and online power distributions, thereby leading to a greater ability to design safe and economical reload cores. We present ML models trained from two deep neural network (DNN) architectures, SurrogateNet and LPRMNet, that demonstrate a testing error of 1 percent and 3 percent, respectively. Applications of these models can include virtual sensing capability for bypassed or malfunctioning LPRMs, on-demand virtual calibration of detectors between successive calibrations, highly accurate nuclear end-of-life determinations for LPRMs, and reduced bias between measured and predicted power distributions within the core.
Authors: Spiridon Kasapis, Irina N. Kitiashvili, Alexander G. Kosovichev, John T. Stefan
Abstract: We developed Long Short-Term Memory (LSTM) models to predict the formation of active regions (ARs) on the solar surface. Using the Doppler shift velocity, the continuum intensity, and the magnetic field observations from the Solar Dynamics Observatory (SDO) Helioseismic and Magnetic Imager (HMI), we have created time-series datasets of acoustic power and magnetic flux, which are used to train LSTM models to predict continuum intensity 12 hours in advance. These novel machine learning (ML) models are able to capture variations of the acoustic power density associated with upcoming magnetic flux emergence and continuum intensity decrease. Testing of the models' performance was done on data for 5 ARs unseen by the models during training. Model 8, the best-performing model, was able to successfully predict emergence for all testing active regions in an experimental setting and for three of them in an operational setting. The model predicted the emergence of AR11726, AR13165, and AR13179 respectively 10, 29, and 5 hours in advance, and variations of this model achieved average RMSE values of 0.11 for both active and quiet areas on the solar disc. This work sets the foundations for ML-aided prediction of solar ARs.
Authors: Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency. Our research introduces a novel approach to the long context bottleneck that accelerates LLM inference and reduces GPU memory consumption. Our research demonstrates that LLMs can identify relevant tokens in the early layers before generating answers to a query. Leveraging this insight, we propose an algorithm that uses the early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing. Our method, GemFilter, demonstrates substantial improvements in both speed and memory efficiency compared to existing techniques, such as standard attention and SnapKV/H2O. Notably, it achieves a 2.4$\times$ speedup and 30\% reduction in GPU memory usage compared to SOTA methods. Evaluation on the Needle in a Haystack task shows that GemFilter significantly outperforms standard attention and SnapKV, and demonstrates comparable performance on the LongBench challenge. GemFilter is simple, training-free, and broadly applicable across different LLMs. Crucially, it provides interpretability by allowing humans to inspect the selected input sequence. These findings not only offer practical benefits for LLM deployment, but also enhance our understanding of LLM internal mechanisms, paving the way for further optimizations in LLM design and inference. Our code is available at \url{https://github.com/SalesforceAIResearch/GemFilter}.
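A minimal sketch of the token-selection step follows, assuming that the attention the final query position pays to the context in an early layer serves as the relevance score; the released implementation may score tokens differently, and the helper name is hypothetical.

    import torch

    def select_tokens_by_early_layer(attn: torch.Tensor, k: int) -> torch.Tensor:
        """Given attention weights from an early layer
        (shape: [heads, seq_len, seq_len]), score each input token by the
        attention it receives from the final query position, averaged over
        heads, and keep the indices of the top-k tokens in original order."""
        scores = attn[:, -1, :].mean(dim=0)          # [seq_len]
        topk = torch.topk(scores, k).indices
        return torch.sort(topk).values               # preserve original order

    # Toy usage: random "attention" over a 16-token context, keep 4 tokens.
    attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
    kept = select_tokens_by_early_layer(attn, k=4)
    print(kept)  # indices of the compressed context fed to the full forward pass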
Authors: Harsha Vardhan Simhadri, Martin Aum\"uller, Amir Ingber, Matthijs Douze, George Williams, Magdalen Dobson Manohar, Dmitry Baranchuk, Edo Liberty, Frank Liu, Ben Landrum, Mazin Karjikar, Laxman Dhulipala, Meng Chen, Yue Chen, Rui Ma, Kai Zhang, Yuzheng Cai, Jiayang Shi, Yizhuo Chen, Weiguo Zheng, Zihao Wan, Jie Yin, Ben Huang
Abstract: The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search ~\cite{DBLP:conf/nips/SimhadriWADBBCH21}, this competition addressed filtered search, out-of-distribution data, sparse and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency over industry-standard baselines, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search.
Authors: Julia S. Schmid (Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Alberta, Canada), Sean Simmons (Anglers Atlas, Goldstream Publishing, Prince George, British Columbia, Canada), Mark A. Lewis (Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Alberta, Canada, Department of Mathematics and Statistics, University of Victoria, Victoria, British Columbia, Canada, Department of Biology, University of Victoria, Victoria, British Columbia, Canada, Department of Biological Sciences, University of Alberta, Edmonton, Alberta, Canada), Mark S. Poesch (Department of Biological Sciences, University of Alberta, Edmonton, Alberta, Canada), Pouria Ramazi (Department of Mathematics and Statistics, Brock University, St. Catharines, Ontario, Canada)
Abstract: Understanding and predicting recreational fishing activity is important for sustainable fisheries management. However, traditional methods of measuring fishing pressure, such as surveys, can be costly and limited in both time and spatial extent. Predictive models that relate fishing activity to environmental or economic factors typically rely on historical data, which often restricts their spatial applicability due to data scarcity. In this study, high-resolution angler-generated data from an online platform and easily accessible auxiliary data were tested to predict daily boat presence and aerial counts of boats at almost 200 lakes over five years in Ontario, Canada. Lake-information website visits alone enabled predicting daily angler boat presence with 78% accuracy. While incorporating additional environmental, socio-ecological, weather and angler-generated features into machine learning models did not markedly improve the prediction of boat presence, these features contributed substantially to the prediction of boat counts. Models achieved an R2 of up to 0.77 at known lakes included in the model training, but they performed poorly for unknown lakes (R2 = 0.21). The results demonstrate the value of integrating angler-generated data from online platforms into predictive models and highlight the potential of machine learning models to enhance fisheries management.
Authors: Chih-Wei Hsu, Martin Mladenov, Ofer Meshi, James Pine, Hubert Pham, Shane Li, Xujian Liang, Anton Polishko, Li Yang, Ben Scheetz, Craig Boutilier
Abstract: Evaluation of policies in recommender systems typically involves A/B testing using live experiments on real users to assess a new policy's impact on relevant metrics. This ``gold standard'' comes at a high cost, however, in terms of cycle time, user cost, and potential user retention. In developing policies for ``onboarding'' new users, these costs can be especially problematic, since onboarding occurs only once. In this work, we describe a simulation methodology used to augment (and reduce) the use of live experiments. We illustrate its deployment for the evaluation of ``preference elicitation'' algorithms used to onboard new users of the YouTube Music platform. By developing counterfactually robust user behavior models, and a simulation service that couples such models with production infrastructure, we are able to test new algorithms in a way that reliably predicts their performance on key metrics when deployed live. We describe our domain, our simulation models and platform, results of experiments and deployment, and suggest future steps needed to further realistic simulation as a powerful complement to live experiments.
Authors: Chirag Vashist, Shichong Peng, Ke Li
Abstract: An emerging area of research aims to learn deep generative models with limited training data. Prior generative models like GANs and diffusion models require a lot of data to perform well, and their performance degrades when they are trained on only a small amount of data. A recent technique called Implicit Maximum Likelihood Estimation (IMLE) has been adapted to the few-shot setting, achieving state-of-the-art performance. However, current IMLE-based approaches encounter challenges due to inadequate correspondence between the latent codes selected for training and those drawn during inference. This results in suboptimal test-time performance. We theoretically show a way to address this issue and propose RS-IMLE, a novel approach that changes the prior distribution used for training. This leads to substantially higher quality image generation compared to existing GAN and IMLE-based methods, as validated by comprehensive experiments conducted on nine few-shot image datasets.
Authors: Ming Xiang, Stratis Ioannidis, Edmund Yeh, Carlee Joe-Wong, Lili Su
Abstract: Addressing intermittent client availability is critical for the real-world deployment of federated learning algorithms. Most prior work either overlooks the potential non-stationarity in the dynamics of client unavailability or requires substantial memory/computation overhead. We study federated learning in the presence of heterogeneous and non-stationary client availability, which may occur when the deployment environments are uncertain or the clients are mobile. The impacts of the heterogeneity and non-stationarity in client unavailability can be significant, as we illustrate using FedAvg, the most widely adopted federated learning algorithm. We propose FedAPM, which includes novel algorithmic structures that (i) compensate for missed computations due to unavailability with only $O(1)$ additional memory and computation with respect to standard FedAvg, and (ii) evenly diffuse local updates within the federated learning system through implicit gossiping, despite being agnostic to non-stationary dynamics. We show that FedAPM converges to a stationary point of even non-convex objectives while achieving the desired linear speedup property. We corroborate our analysis with numerical experiments over diversified client unavailability dynamics on real-world data sets.
Authors: Ryuichi Yamamoto, Yuma Shirahata, Masaya Kawamura, Kentaro Tachibana
Abstract: We propose a novel description-based controllable text-to-speech (TTS) method with cross-lingual control capability. To address the lack of audio-description paired data in the target language, we combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model. These two models share disentangled timbre and style representations based on self-supervised learning (SSL), allowing for disentangled voice control, such as controlling speaking styles while retaining the original timbre. Furthermore, because the SSL-based timbre and style representations are language-agnostic, combining the TTS and description control models while sharing the same embedding space effectively enables cross-lingual control of voice characteristics. Experiments on English and Japanese TTS demonstrate that our method achieves high naturalness and controllability for both languages, even though no Japanese audio-description pairs are used.
Authors: Yuqing Zhou, Ruixiang Tang, Ziyu Yao, Ziwei Zhu
Abstract: Language models (LMs), despite their advances, often depend on spurious correlations, undermining their accuracy and generalizability. Going beyond the oversimplified shortcuts studied previously, this study addresses the overlooked impact of subtler, more complex shortcuts that compromise model reliability. We introduce a comprehensive benchmark that categorizes shortcuts into occurrence, style, and concept, aiming to explore the nuanced ways in which these shortcuts influence the performance of LMs. Through extensive experiments across traditional LMs, large language models, and state-of-the-art robust models, our research systematically investigates models' resilience and susceptibilities to sophisticated shortcuts. Our benchmark and code can be found at: https://github.com/yuqing-zhou/shortcut-learning-in-text-classification.
Authors: Yifan Jiang, Kriti Aggarwal, Tanmay Laud, Kashif Munir, Jay Pujara, Subhabrata Mukherjee
Abstract: The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner. To bridge this gap, we first propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, which reaches an 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model's performance across standard benchmarks. Full implementation and dataset are publicly accessible at https://github.com/kriti-hippo/red_queen.
Authors: Ruijiang Gao, Mingzhang Yin, James McInerney, Nathan Kallus
Abstract: Conformal Prediction methods have finite-sample distribution-free marginal coverage guarantees. However, they generally do not offer conditional coverage guarantees, which can be important for high-stakes decisions. In this paper, we propose a novel algorithm to train a regression function to improve the conditional coverage after applying the split conformal prediction procedure. We establish an upper bound for the miscoverage gap between the conditional coverage and the nominal coverage rate and propose an end-to-end algorithm to control this upper bound. We demonstrate the efficacy of our method empirically on synthetic and real-world datasets.
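For reference, the split conformal procedure that the proposed method post-processes is short enough to state in full; it guarantees marginal (not conditional) coverage, which is precisely the gap the paper targets. A minimal NumPy sketch:

    import numpy as np

    def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
        """Standard split conformal prediction: the (1 - alpha) quantile of
        absolute calibration residuals (with finite-sample correction)
        yields intervals with marginal coverage at least 1 - alpha."""
        n = len(residuals_cal)
        q_level = np.ceil((n + 1) * (1 - alpha)) / n
        q = np.quantile(np.abs(residuals_cal), min(q_level, 1.0))
        return y_pred_test - q, y_pred_test + q

    # Toy usage with a constant-mean "model".
    rng = np.random.default_rng(0)
    y_cal = rng.normal(size=500); y_hat_cal = np.zeros(500)
    lo, hi = split_conformal_interval(y_cal - y_hat_cal, y_pred_test=0.0)
    print(lo, hi)  # roughly [-1.64, 1.64] for standard normal noise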
Authors: Guanyi Mou, Yichuan Li, Kyumin Lee
Abstract: Data augmentation has shown its effectiveness in alleviating data scarcity and improving models' generalization ability. However, the quality of augmented data can vary, especially compared with the raw/original data. To boost deep learning models' performance given augmented data/samples in text classification tasks, we propose a novel framework that leverages both meta learning and contrastive learning techniques for reweighting the augmented samples and refining their feature representations based on their quality. As part of the framework, we propose novel weight-dependent enqueue and dequeue algorithms to utilize augmented samples' weight/quality information effectively. Through experiments, we show that our framework can reasonably cooperate with existing deep learning models (e.g., RoBERTa-base and Text-CNN) and augmentation techniques (e.g., Wordnet and Easydata) for specific supervised learning tasks. Experiment results show that our framework achieves an average of 1.6%, and up to 4.3%, absolute improvement on Text-CNN encoders and an average of 1.4%, and up to 4.4%, absolute improvement on RoBERTa-base encoders on seven GLUE benchmark datasets compared with the best baseline. We present an in-depth analysis of our framework design, revealing the non-trivial contributions of our network components. Our code is publicly available for better reproducibility.
Authors: Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang
Abstract: Large Language Models (LLMs) are distinguished by their massive parameter counts, which typically result in significant redundancy. This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or ``N:M'') Sparsity in LLMs, aimed at reducing computational overhead during inference. Instead of developing a new importance criterion, MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling. This approach facilitates end-to-end training on large-scale datasets and offers two notable advantages: 1) High-quality Masks - our method effectively scales to large datasets and learns accurate masks; 2) Transferability - the probabilistic modeling of mask distribution enables the transfer learning of sparsity across domains or tasks. We assessed MaskLLM using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3, with sizes ranging from 843M to 15B parameters, and our empirical results show substantial improvements over state-of-the-art methods. For instance, leading approaches achieve a perplexity (PPL) of 10 or greater on Wikitext compared to the dense model's 5.12 PPL, but MaskLLM achieves a significantly lower 6.72 PPL solely by learning the masks with frozen weights. Furthermore, MaskLLM's learnable nature allows customized masks for lossless application of 2:4 sparsity to downstream tasks or domains. Code is available at \url{https://github.com/NVlabs/MaskLLM}.
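The core mechanism, differentiable sampling of one N:M pattern per weight group, can be sketched as follows for 2:4 sparsity. The pattern enumeration and the straight-through Gumbel-Softmax estimator are standard; the exact parameterization used in MaskLLM may differ.

    import torch
    import torch.nn.functional as F
    from itertools import combinations

    # All 6 candidate 2:4 patterns (two nonzeros per group of four weights).
    patterns = torch.zeros(6, 4)
    for i, idx in enumerate(combinations(range(4), 2)):
        patterns[i, list(idx)] = 1.0

    def sample_24_mask(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        """Differentiably sample one 2:4 pattern per weight group via
        Gumbel-Softmax; hard=True yields a binary mask while gradients
        still flow to the learnable pattern logits (straight-through)."""
        sel = F.gumbel_softmax(logits, tau=tau, hard=True)   # [groups, 6]
        return sel @ patterns                                 # [groups, 4]

    logits = torch.zeros(3, 6, requires_grad=True)  # 3 groups of 4 weights
    mask = sample_24_mask(logits)
    print(mask)            # each row has exactly two ones
    mask.sum().backward()  # gradients reach the learnable logits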
Authors: Diego Martinez-Taboada, Aaditya Ramdas
Abstract: We present a sequential version of the kernelized Stein discrepancy, which allows for conducting goodness-of-fit tests for unnormalized densities that are continuously monitored and adaptively stopped. That is, the sample size need not be fixed prior to data collection; the practitioner can choose whether to stop the test or continue to gather evidence at any time while controlling the false discovery rate. In stark contrast to related literature, we do not impose uniform boundedness on the Stein kernel. Instead, we exploit the potential boundedness of the Stein kernel at arbitrary point evaluations to define test martingales, that give way to the subsequent novel sequential tests. We prove the validity of the test, as well as an asymptotic lower bound for the logarithmic growth of the wealth process under the alternative. We further illustrate the empirical performance of the test with a variety of distributions, including restricted Boltzmann machines.
Authors: Xun Zhu, Ying Hu, Fanbin Mo, Miao Li, Ji Wu
Abstract: Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks. However, building a unified MLLM for multi-task learning in the medical field remains a thorny challenge. To mitigate the tug-of-war problem of multi-modal multi-task optimization, recent advances primarily focus on improving the LLM components, while neglecting the connector that bridges the gap between modalities. In this paper, we introduce Uni-Med, a novel medical generalist foundation model which consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM. Benefiting from the proposed CMoE that leverages a well-designed router with a mixture of projection experts at the connector, Uni-Med achieves an efficient solution to the tug-of-war problem and can perform six different medical tasks including question answering, visual question answering, report generation, referring expression comprehension, referring expression generation and image classification. To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector. Extensive ablation experiments validate the effectiveness of introducing CMoE under any configuration, with up to an average of 8% performance gains. We further provide interpretation analysis of the tug-of-war problem from the perspective of gradient optimization and parameter statistics. Compared to previous state-of-the-art medical MLLMs, Uni-Med achieves competitive or superior evaluation metrics on diverse tasks. Code, data and model will soon be available on GitHub.
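A connector mixture-of-experts can be sketched as a router that softly mixes several projection experts mapping visual features into the LLM embedding space. The sizes and router design below are hypothetical placeholders; Uni-Med's well-designed router is more elaborate.

    import torch
    from torch import nn

    class ConnectorMoE(nn.Module):
        """Sketch of a connector mixture-of-experts: a router softly mixes
        several projection experts that map visual features into the LLM
        embedding space."""
        def __init__(self, d_vis=1024, d_llm=4096, n_experts=4):
            super().__init__()
            self.router = nn.Linear(d_vis, n_experts)
            self.experts = nn.ModuleList(
                nn.Linear(d_vis, d_llm) for _ in range(n_experts))

        def forward(self, v):                       # v: [batch, tokens, d_vis]
            gates = self.router(v).softmax(dim=-1)  # [B, T, E]
            outs = torch.stack([e(v) for e in self.experts], dim=-1)
            return (outs * gates.unsqueeze(2)).sum(-1)  # [B, T, d_llm]

    x = torch.randn(2, 16, 1024)
    print(ConnectorMoE()(x).shape)  # torch.Size([2, 16, 4096])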
Authors: Ziquan Wei, Tingting Dan, Jiaqi Ding, Paul J Laurienti, Guorong Wu
Abstract: Although modern imaging technologies allow us to study connectivity between two distinct brain regions in vivo, an in-depth understanding of how anatomical structure supports brain function and how spontaneous functional fluctuations give rise to remarkable cognition is still elusive. Meanwhile, tremendous efforts have been made in the realm of machine learning to establish the nonlinear mapping between neuroimaging data and phenotypic traits. However, the absence of neuroscience insight in the current approaches poses significant challenges in understanding cognitive behavior from transient neural activities. To address this challenge, we put the spotlight on the coupling mechanism of structural connectivity (SC) and functional connectivity (FC) by formulating this network neuroscience question as an expressive graph representation learning problem for high-order topology. Specifically, we introduce the concept of a topological detour to characterize how a ubiquitous instance of FC (direct link) is supported by neural pathways (detours) physically wired by SC, which forms a cyclic loop interacting with brain structure and function. In machine learning terms, the multi-hop detour pathway underlying SC-FC coupling allows us to devise a novel multi-head self-attention mechanism within a Transformer to capture multi-modal feature representations from paired graphs of SC and FC. Taken together, we propose a biologically inspired deep model, coined NeuroPath, to find putative connectomic feature representations from an unprecedented amount of neuroimages, which can be plugged into various downstream applications such as task recognition and disease diagnosis. We have evaluated NeuroPath on large-scale public datasets including HCP and UK Biobank under supervised and zero-shot learning, where the state-of-the-art performance of NeuroPath indicates great potential in network neuroscience.
Authors: Gary A. McCully, John D. Hastings, Shengjie Xu, Adam Fortier
Abstract: Ransomware and other forms of malware cause significant financial and operational damage to organizations by exploiting long-standing and often difficult-to-detect software vulnerabilities. To detect vulnerabilities such as buffer overflows in compiled code, this research investigates the application of unidirectional transformer-based embeddings, specifically GPT-2. Using a dataset of LLVM functions, we trained a GPT-2 model to generate embeddings, which were subsequently used to build LSTM neural networks to differentiate between vulnerable and non-vulnerable code. Our study reveals that embeddings from the GPT-2 model significantly outperform those from the bidirectional models BERT and RoBERTa, achieving an accuracy of 92.5% and an F1-score of 89.7%. LSTM neural networks were developed with both frozen and unfrozen embedding model layers, and the highest performance was achieved when the embedding layers were unfrozen. Further, in exploring the impact of different optimizers within this domain, the research finds that the SGD optimizer demonstrates superior performance over Adam. Overall, these findings reveal important insights into the potential of unidirectional transformer-based approaches in enhancing cybersecurity defenses.
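The pipeline the abstract describes (shown here in its frozen-embedding variant) is straightforward to assemble with the Hugging Face transformers library; the hidden sizes, toy input, and untrained classifier head are illustrative, not the authors' trained setup.

    import torch
    from torch import nn
    from transformers import GPT2Tokenizer, GPT2Model

    # Embed a code snippet with (frozen) GPT-2, then classify with an LSTM.
    tok = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2Model.from_pretrained("gpt2").eval()

    lstm = nn.LSTM(input_size=768, hidden_size=128, batch_first=True)
    head = nn.Linear(128, 2)  # vulnerable vs. non-vulnerable

    code = "define i32 @f(i32 %x) { ret i32 %x }"  # toy LLVM function
    ids = tok(code, return_tensors="pt")
    with torch.no_grad():                        # "frozen" embedding layers
        emb = gpt2(**ids).last_hidden_state      # [1, seq_len, 768]
    _, (h, _) = lstm(emb)
    logits = head(h[-1])                         # [1, 2]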
Authors: Danial Sharifrazi, Nouman Javed, Javad Hassannataj Joloudari, Roohallah Alizadehsani, Prasad N. Paradkar, Ru-San Tan, U. Rajendra Acharya, Asim Bhatti
Abstract: The analysis of human brain neuron activity is of great significance today. Neuronal behavior is assessed by analyzing signal data such as electroencephalography (EEG), which can offer scientists valuable information about diseases and human-computer interaction. One of the difficulties researchers confront while evaluating these signals is the existence of large volumes of spike data. Spikes are prominent components of signal data that can arise from vital biomarkers or from physical issues such as electrode movements. Hence, distinguishing between types of spikes is important, and this is where spike classification comes in. Previously, researchers classified spikes manually; manual classification was not precise enough, as it involved extensive analysis. Consequently, Artificial Intelligence (AI) was introduced into neuroscience to assist clinicians in classifying spikes correctly. This review discusses the importance and use of AI in spike classification, focusing on the recognition of neural activity noise. The task is divided into three main components: preprocessing, classification, and evaluation. Existing methods are introduced and their importance is assessed. The review also highlights the need for more efficient algorithms. The primary goal is to provide a perspective on spike classification for future research and a comprehensive understanding of the methodologies and issues involved. The review organizes material in the spike classification field for future studies. In this work, numerous studies were extracted from different databases, and the PRISMA-related research guidelines were then used to select papers. Research studies based on spike classification using machine learning and deep learning approaches with effective preprocessing were selected.
Authors: Konstantinos Pantazis, Michael Trosset, William N. Frost, Carey E. Priebe, Vince Lyzinski
Abstract: Theoretical and empirical evidence suggests that joint graph embedding algorithms induce correlation across the networks in the embedding space. In the Omnibus joint graph embedding framework, previous results explicitly delineated the dual effects of the algorithm-induced and model-inherent correlations on the correlation across the embedded networks. Accounting for and mitigating the algorithm-induced correlation is key to subsequent inference, as sub-optimal Omnibus matrix constructions have been demonstrated to lead to loss in inference fidelity. This work presents the first efforts to automate the Omnibus construction in order to address two key questions in this joint embedding framework: the correlation-to-OMNI problem and the flat correlation problem. In the flat correlation problem, we seek to understand the minimum algorithm-induced flat correlation (i.e., the same across all graph pairs) produced by a generalized Omnibus embedding. Working in a subspace of the fully general Omnibus matrices, we prove both a lower bound for this flat correlation and that the classical Omnibus construction induces the maximal flat correlation. In the correlation-to-OMNI problem, we present an algorithm -- named corr2Omni -- that, from a given matrix of estimated pairwise graph correlations, estimates the matrix of generalized Omnibus weights that induces optimal correlation in the embedding space. Moreover, in both simulated and real data settings, we demonstrate the increased effectiveness of our corr2Omni algorithm versus the classical Omnibus construction.
Authors: Cheolhun Jang
Abstract: Preference optimization methods typically begin training with a well-trained SFT model as a reference model. In RLHF and DPO, a regularization term is used during the preference optimization process to prevent the policy model from deviating too far from the reference model's distribution, thereby avoiding the generation of anomalous responses. When the reference model is already well-aligned with the given data or only requires slight adjustments, this approach can produce a well-aligned model. However, if the reference model is not aligned with the given data and requires significant deviation from its current state, a regularization term may actually hinder the model alignment. In this study, we propose \textbf{Modulated Intervention Preference Optimization (MIPO)} to address this issue. MIPO modulates the degree of intervention from the reference model based on how well the given data is aligned with it. If the data is well-aligned, the intervention is increased to prevent the policy model from diverging significantly from the reference model. Conversely, if the alignment is poor, the intervention is reduced to facilitate more extensive training. We compare the performance of MIPO and DPO using Mistral-7B and Llama3-8B on Alpaca Eval 2.0 and MT-Bench. The experimental results demonstrate that MIPO consistently outperforms DPO across various evaluation scenarios.
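One plausible reading of "modulated intervention", offered purely as an illustration since the abstract does not give the formula, is a DPO-style loss whose per-example regularization strength grows with how well the reference model already ranks the preference pair. The modulation rule below is hypothetical.

    import torch
    import torch.nn.functional as F

    def mipo_like_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        """DPO-style loss with per-example modulated strength: pairs the
        reference model already ranks correctly get a larger beta (stronger
        intervention), poorly aligned pairs a smaller one. Illustrative
        modulation rule only; not the paper's exact formulation."""
        ref_margin = ref_logp_w - ref_logp_l          # reference alignment
        beta_i = beta * torch.sigmoid(ref_margin)     # in (0, beta)
        logits = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
        return -F.logsigmoid(beta_i * logits).mean()

    # Toy usage with random per-sequence log-probabilities.
    lp = [torch.randn(8) for _ in range(4)]
    print(mipo_like_loss(*lp))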
Authors: Dimpal Janu, Sandeep Mandia, Kuldeep Singh, Sandeep Kumar
Abstract: In this paper, we develop a novel mobility-aware transformer-driven tiered structure (MASSFormer) based cooperative spectrum sensing method that effectively models the spatio-temporal dynamics of user movements. Unlike existing methods, our method considers a dynamic scenario involving mobile primary users (PUs) and secondary users (SUs) and addresses the complexities introduced by user mobility. The transformer architecture utilizes an attention mechanism, enabling the proposed method to adeptly model the temporal dynamics of user mobility by effectively capturing long-range dependencies within the input data. The proposed method first computes tokens from the sequence of covariance matrices (CMs) for each SU and processes them in parallel using the SU transformer network to learn the spatio-temporal features at the SU level. Subsequently, the collaborative transformer network learns the group-level PU state from all SU-level feature representations. The attention-based sequence pooling method followed by the transformer encoder adjusts the contributions of all tokens. Predicting the PU states at both the SU level and the group level serves to further improve detection performance. We conducted extensive simulations and compared the detection performance of different SS methods. The proposed method is tested under imperfect reporting channel scenarios to show its robustness. The efficacy of our method is validated by simulation results demonstrating its higher performance compared with existing methods in terms of detection probability, sensing error, and classification accuracy.
Authors: Deniz G\"und\"uz, Mich\`ele A. Wigger, Tze-Yang Tung, Ping Zhang, Yong Xiao
Abstract: Semantic- and task-oriented communication has emerged as a promising approach to reducing the latency and bandwidth requirements of next-generation mobile networks by transmitting only the most relevant information needed to complete a specific task at the receiver. This is particularly advantageous for machine-oriented communication of high data rate content, such as images and videos, where the goal is rapid and accurate inference, rather than perfect signal reconstruction. While semantic- and task-oriented compression can be implemented in conventional communication systems, joint source-channel coding (JSCC) offers an alternative end-to-end approach by optimizing compression and channel coding together, or even directly mapping the source signal to the modulated waveform. Although all digital communication systems today rely on separation, thanks to its modularity, JSCC is known to achieve higher performance in finite blocklength scenarios, and to avoid the cliff and levelling-off effects in time-varying channel scenarios. This article provides an overview of the information theoretic foundations of JSCC, surveys practical JSCC designs over the decades, and discusses the reasons for their limited adoption in practical systems. We then examine the recent resurgence of JSCC, driven by the integration of deep learning techniques, particularly through DeepJSCC, highlighting its many surprising advantages in various scenarios. Finally, we discuss why it may be time to reconsider today's strictly separate architectures, and reintroduce JSCC to enable high-fidelity, low-latency communications in critical applications such as autonomous driving, drone surveillance, or wearable systems.
Authors: Christina Zhang, Simran Motwani, Matthew Yu, Ji Hou, Felix Juefei-Xu, Sam Tsai, Peter Vajda, Zijian He, Jialiang Wang
Abstract: Latent diffusion models (LDMs) have made significant advancements in the field of image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. However, despite these advantages, challenges with LDMs still remain. For example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one reason for these flaws is that all pre- and post-training of LDMs is done in latent space, which typically has $8 \times 8$ lower spatial resolution than the output images. To address this issue, we propose adding pixel-space supervision in the post-training process to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective significantly improves both supervised quality fine-tuning and preference-based post-training on state-of-the-art DiT and U-Net diffusion models in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.
Authors: Peiyong Wang, Casey R. Myers, Lloyd C. L. Hollenberg, Udaya Parampalli
Abstract: Artificial Intelligence (AI), with its multiplier effect and wide applications in multiple areas, could potentially be an important application of quantum computing. Since modern AI systems are often built on neural networks, the design of quantum neural networks becomes a key challenge in integrating quantum computing into AI. To provide a more fine-grained characterisation of the impact of quantum components on the performance of neural networks, we propose a framework where classical neural network layers are gradually replaced by quantum layers that have the same type of input and output while keeping the flow of information between layers unchanged. This differs from most current research in quantum neural networks, which favours an end-to-end quantum model. We start with a simple three-layer classical neural network without any normalisation layers or activation functions, and gradually change the classical layers to the corresponding quantum versions. We conduct numerical experiments on image classification datasets such as MNIST, FashionMNIST and CIFAR-10 to demonstrate the change in performance brought by the systematic introduction of quantum components. Through this framework, our research sheds new light on the design of future quantum neural network models, for which it could be more favourable to search for methods and frameworks that harness the advantages of both the classical and quantum worlds.
Authors: Zeyue Zhang, Xiaoling Lu, Feng Zhou
Abstract: The Bayesian two-step change point detection method is popular for the Hawkes process due to its simplicity and intuitiveness. However, the non-conjugacy between the point process likelihood and the prior requires most existing Bayesian two-step change point detection methods to rely on non-conjugate inference methods. These methods lack analytical expressions, leading to low computational efficiency and impeding timely change point detection. To address this issue, this work employs data augmentation to propose a conjugate Bayesian two-step change point detection method for the Hawkes process, which proves to be more accurate and efficient. Extensive experiments on both synthetic and real data demonstrate the superior effectiveness and efficiency of our method compared to baseline methods. Additionally, we conduct ablation studies to explore the robustness of our method concerning various hyperparameters. Our code is publicly available at https://github.com/Aurora2050/CoBay-CPD.
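For readers unfamiliar with the model, the conditional intensity of a (univariate) Hawkes process, whose likelihood makes conjugate inference difficult, is

\[
\lambda(t) = \mu + \sum_{t_i < t} \phi(t - t_i), \qquad \phi(s) = \alpha \beta e^{-\beta s}, \ s > 0,
\]

where $\mu > 0$ is the baseline rate and the excitation kernel $\phi$ (shown here in its common exponential form) lets past events raise the rate of future ones; it is this self-exciting sum over history that breaks conjugacy with standard priors.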
Authors: Amir Samadi, Konstantinos Koufos, Kurt Debattista, Mehrdad Dianati
Abstract: In this paper, we address the limitations of traditional teacher-student models, imitation learning, and behaviour cloning in the context of Autonomous/Automated Driving Systems (ADS), where these methods often struggle with incomplete coverage of real-world scenarios. To enhance the robustness of such models, we introduce the use of Counterfactual Explanations (CFEs) as a novel data augmentation technique for end-to-end ADS. CFEs, by generating training samples near decision boundaries through minimal input modifications, lead to a more comprehensive representation of expert driver strategies, particularly in safety-critical scenarios. This approach can therefore help improve the model's ability to handle rare and challenging driving events, such as anticipating pedestrians darting out, ultimately leading to safer and more trustworthy decision-making for ADS. Our experiments in the CARLA simulator demonstrate that CF-Driver outperforms the current state-of-the-art method, achieving a higher driving score and lower infraction rates. Specifically, CF-Driver attains a driving score of 84.2, surpassing the previous best model by 15.02 percentage points. These results highlight the effectiveness of incorporating CFEs in training end-to-end ADS. To foster further research, the CF-Driver code is made publicly available.
Authors: Nicola Pia, Martin Strauss, Markus Multrus, Bernd Edler
Abstract: This paper introduces FlowMAC, a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC jointly learns a mel spectrogram encoder, quantizer and decoder. At inference time the decoder integrates a continuous normalizing flow via an ODE solver to generate a high-quality mel spectrogram. This is the first time that a CFM-based approach is applied to general audio coding, enabling scalable, simple and memory-efficient training. Our subjective evaluations show that FlowMAC at 3 kbps achieves similar quality as state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate. Moreover, FlowMAC offers a tunable inference pipeline, which permits trading off complexity and quality. This enables real-time coding on CPU, while maintaining high perceptual quality.
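For context, a common conditional flow matching objective (linear-path form) trains a vector field $v_\theta$ to match the straight-line velocity between noise $x_0$ and data $x_1$:

\[
x_t = (1 - t)\,x_0 + t\,x_1, \qquad \mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1} \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|_2^2 .
\]

FlowMAC conditions generation on quantized mel-spectrogram features; the specific probability path and conditioning used in the paper may differ from this generic form.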
Authors: Mark Kozdoba, Binyamin Perets, Shie Mannor
Abstract: There is a well known intrinsic trade-off between the fairness of a representation and the performance of classifiers derived from the representation. Due to the complexity of optimisation algorithms in most modern representation learning approaches, for a given method it may be non-trivial to decide whether the obtained fairness-performance curve of the method is optimal, i.e., whether it is close to the true Pareto front for these quantities for the underlying data distribution. In this paper we propose a new method to compute the optimal Pareto front, which does not require the training of complex representation models. We show that optimal fair representations possess several useful structural properties, and that these properties enable a reduction of the computation of the Pareto front to a compact discrete problem. We then also show that these compact approximating problems can be efficiently solved via off-the-shelf concave-convex programming methods. Since our approach is independent of the specific model of representations, it may be used as a benchmark against which representation learning algorithms may be compared. We experimentally evaluate the approach on a number of real-world benchmark datasets.
Authors: Shin'ya Yamaguchi, Kosuke Nishida
Abstract: Recent concept-based interpretable models have succeeded in providing meaningful explanations via pre-defined concept sets. However, the dependency on pre-defined concepts restricts applicability because of the limited number of concepts available for explanations. This paper proposes a novel interpretable deep neural network called explanation bottleneck models (XBMs). XBMs generate a text explanation from the input without pre-defined concepts and then make the final task prediction based on the generated explanation by leveraging pre-trained vision-language encoder-decoder models. To achieve both target task performance and explanation quality, we train XBMs through the target task loss with a regularization term penalizing the explanation decoder via distillation from the frozen pre-trained decoder. Our experiments, including a comparison to state-of-the-art concept bottleneck models, confirm that XBMs provide accurate and fluent natural language explanations without pre-defined concept sets. Code will be available at https://github.com/yshinya6/xbm/.
Authors: Yasaman Haghbin, Hadi Moradi, Reshad Hosseini
Abstract: One of the growing trends in machine learning is the use of data generation techniques, since the performance of machine learning models depends on the quantity of training data. However, in many medical applications, collecting large datasets is challenging due to resource constraints, which leads to overfitting and poor generalization. This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL), designed to enhance classification performance on small medical datasets through synthetic data generation. The AGCL framework involves feature extraction, K-means clustering, cluster evaluation based on a class separation metric, and the generation of synthetic data points from clusters with distinct class representations. This method was applied to Parkinson's disease screening, utilizing facial expression data, and evaluated across multiple machine learning classifiers. Experimental results demonstrate that AGCL significantly improves classification accuracy compared to the baseline, GN, and kNNMTD. AGCL achieved the highest overall test accuracy of 83.33% and cross-validation accuracy of 90.90% in majority voting over different emotions, confirming its effectiveness in augmenting small datasets.
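The pipeline described above can be sketched compactly: cluster the feature space, keep clusters dominated by one class, and sample synthetic points from a Gaussian fitted to each such cluster. The purity threshold, cluster count, and Gaussian sampling rule below are illustrative assumptions; the paper's class-separation metric may differ.

    import numpy as np
    from sklearn.cluster import KMeans

    def agcl_like_augment(X, y, n_clusters=8, purity=0.9, n_new=5, rng=None):
        """Cluster the feature space, keep class-pure clusters, and sample
        synthetic labeled points from a per-cluster Gaussian (sketch)."""
        rng = rng or np.random.default_rng(0)
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
        X_new, y_new = [], []
        for c in range(n_clusters):
            Xc, yc = X[labels == c], y[labels == c]
            if len(Xc) < 2:
                continue
            cls, counts = np.unique(yc, return_counts=True)
            if counts.max() / len(yc) >= purity:      # class-pure cluster
                mu, sd = Xc.mean(axis=0), Xc.std(axis=0) + 1e-8
                X_new.append(rng.normal(mu, sd, size=(n_new, X.shape[1])))
                y_new.append(np.full(n_new, cls[counts.argmax()]))
        if not X_new:                                  # no cluster qualified
            return np.empty((0, X.shape[1])), np.empty(0)
        return np.vstack(X_new), np.concatenate(y_new)

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
    y = np.array([0] * 50 + [1] * 50)
    Xs, ys = agcl_like_augment(X, y)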
Authors: Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang
Abstract: In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
Authors: Giandomenico Cornacchia, Giulio Zizzo, Kieran Fraser, Muhammad Zaid Hamed, Ambrish Rawat, Mark Purcell
Abstract: The proliferation of Large Language Models (LLMs) in diverse applications underscores the pressing need for robust security measures to thwart potential jailbreak attacks. These attacks exploit vulnerabilities within LLMs, endangering data integrity and user privacy. Guardrails serve as crucial protective mechanisms against such threats, but existing models often fall short in terms of both detection accuracy and computational efficiency. This paper advocates for the significance of jailbreak attack prevention on LLMs, and emphasises the role of input guardrails in safeguarding these models. We introduce MoJE (Mixture of Jailbreak Expert), a novel guardrail architecture designed to surpass current limitations in existing state-of-the-art guardrails. By employing simple linguistic statistical techniques, MoJE excels in detecting jailbreak attacks while maintaining minimal computational overhead during model inference. Through rigorous experimentation, MoJE demonstrates superior performance, detecting 90% of attacks without compromising benign prompts, thereby enhancing LLM security against jailbreak attacks.
Authors: Koki Okajima, Tomoyuki Obuchi
Abstract: Transfer learning techniques aim to leverage information from multiple related datasets to enhance prediction quality against a target dataset. Such methods have been adopted in the context of high-dimensional sparse regression, and some Lasso-based algorithms have been invented: Trans-Lasso and Pretraining Lasso are such examples. These algorithms require the statistician to select hyperparameters that control the extent and type of information transfer from related datasets. However, selection strategies for these hyperparameters, as well as the impact of these choices on the algorithm's performance, have been largely unexplored. To address this, we conduct a thorough, precise study of the algorithm in a high-dimensional setting via an asymptotic analysis using the replica method. Our approach reveals a surprisingly simple behavior of the algorithm: Ignoring one of the two types of information transferred to the fine-tuning stage has little effect on generalization performance, implying that efforts for hyperparameter selection can be significantly reduced. Our theoretical findings are also empirically supported by real-world applications on the IMDb dataset.
Authors: Nithish Kannen, Yao Ma, Gerrit J. J. van den Burg, Jean Baptiste Faddoul
Abstract: News recommendation is a challenging task that involves personalization based on the interaction history and preferences of each user. Recent works have leveraged the power of pretrained language models (PLMs) to directly rank news items by using inference approaches that predominantly fall into three categories: pointwise, pairwise, and listwise learning-to-rank. While pointwise methods offer linear inference complexity, they fail to capture crucial comparative information between items that is more effective for ranking tasks. Conversely, pairwise and listwise approaches excel at incorporating these comparisons but suffer from practical limitations: pairwise approaches are either computationally expensive or lack theoretical guarantees, and listwise methods often perform poorly in practice. In this paper, we propose a novel framework for PLM-based news recommendation that integrates both pointwise relevance prediction and pairwise comparisons in a scalable manner. We present a rigorous theoretical analysis of our framework, establishing conditions under which our approach guarantees improved performance. Extensive experiments show that our approach outperforms the state-of-the-art methods on the MIND and Adressa news recommendation datasets.
Authors: Tiago de Souza Farias, Lucas Friedrich, Jonas Maziero
Abstract: Quantum computing with qudits, an extension of qubits to multiple levels, is a research field less mature than qubit-based quantum computing. However, qudits can offer some advantages over qubits, by representing information with fewer separated components. In this article, we present QuForge, a Python-based library designed to simulate quantum circuits with qudits. This library provides the necessary quantum gates for implementing quantum algorithms, tailored to any chosen qudit dimension. Built on top of differentiable frameworks, QuForge supports execution on accelerating devices such as GPUs and TPUs, significantly speeding up simulations. It also supports sparse operations, leading to a reduction in memory consumption compared to other libraries. Additionally, by constructing quantum circuits as differentiable graphs, QuForge facilitates the implementation of quantum machine learning algorithms, enhancing the capabilities and flexibility of quantum computing research.
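Although we do not reproduce QuForge's own API here, the basic objects any qudit simulator must provide are the generalized Pauli gates; a NumPy construction of the standard shift and clock gates follows.

    import numpy as np

    def shift_gate(d: int) -> np.ndarray:
        """Generalized Pauli-X (shift) gate for a d-level qudit:
        X|j> = |(j + 1) mod d>. For d = 2 this reduces to the qubit X."""
        X = np.zeros((d, d))
        for j in range(d):
            X[(j + 1) % d, j] = 1.0
        return X

    def clock_gate(d: int) -> np.ndarray:
        """Generalized Pauli-Z (clock) gate: Z|j> = w^j |j>,
        with w = exp(2*pi*i/d)."""
        w = np.exp(2j * np.pi / d)
        return np.diag(w ** np.arange(d))

    X3 = shift_gate(3)
    ket0 = np.array([1.0, 0.0, 0.0])
    print(X3 @ ket0)  # -> |1> for a qutrit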
Authors: Anna Volodkevich, Danil Gusak, Anton Klenitskiy, Alexey Vasilev
Abstract: The goal of modern sequential recommender systems is often formulated in terms of next-item prediction. In this paper, we explore the applicability of generative transformer-based models for the Top-K sequential recommendation task, where the goal is to predict items a user is likely to interact with in the "near future". We explore commonly used autoregressive generation strategies, including greedy decoding, beam search, and temperature sampling, to evaluate their performance for the Top-K sequential recommendation task. In addition, we propose novel Reciprocal Rank Aggregation (RRA) and Relevance Aggregation (RA) generation strategies based on multi-sequence generation with temperature sampling and subsequent aggregation. Experiments on diverse datasets give valuable insights regarding commonly used strategies' applicability and show that the suggested approaches improve performance on longer time horizons compared to the widely used Top-K prediction approach and single-sequence autoregressive generation strategies.
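Reciprocal Rank Aggregation admits a compact sketch under one natural reading: every sampled sequence votes for its items with weight 1/rank, and the top-K items by accumulated score form the final recommendation. The exact scoring in the paper may differ.

    from collections import defaultdict

    def reciprocal_rank_aggregation(sequences, k):
        """Aggregate several sampled recommendation sequences: each item
        earns 1 / rank from every sequence it appears in, and the top-k
        items by total score are recommended (sketch of RRA)."""
        scores = defaultdict(float)
        for seq in sequences:
            for rank, item in enumerate(seq, start=1):
                scores[item] += 1.0 / rank
        return sorted(scores, key=scores.get, reverse=True)[:k]

    samples = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
    print(reciprocal_rank_aggregation(samples, k=2))  # ['b', 'a']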
Authors: Nilanjan Sinhababu, Andrew Parry, Debasis Ganguly, Debasis Samanta, Pabitra Mitra
Abstract: A supervised ranking model, despite its advantage of being effective, usually involves complex processing - typically multiple stages of task-specific pre-training and fine-tuning. This has motivated researchers to explore simpler pipelines leveraging large language models (LLMs) that are capable of working in a zero-shot manner. However, since zero-shot inference does not make use of a training set of pairs of queries and their relevant documents, its performance is mostly worse than that of supervised models, which are trained on such example pairs. Motivated by the existing findings that training examples generally improve zero-shot performance, in our work, we explore if this also applies to ranking models. More specifically, given a query and a pair of documents, the preference prediction task is improved by augmenting examples of preferences for similar queries from a training set. Our proposed pairwise few-shot ranker demonstrates consistent improvements over the zero-shot baseline on both in-domain (TREC DL) and out-domain (BEIR subset) retrieval benchmarks. Our method also achieves a close performance to that of a supervised model without requiring any complex training pipeline.
Authors: Evangelia Christodoulou, Annika Reinke, Rola Houhou, Piotr Kalinowski, Selen Erkan, Carole H. Sudre, Ninon Burgos, Sofi\`ene Boutaj, Sophie Loizillon, Ma\"elys Solal, Nicola Rieke, Veronika Cheplygina, Michela Antonelli, Leon D. Mayer, Minu D. Tizabi, M. Jorge Cardoso, Amber Simpson, Paul F. J\"ager, Annette Kopp-Schneider, Ga\"el Varoquaux, Olivier Colliot, Lena Maier-Hein
Abstract: Medical imaging is spearheading the AI transformation of healthcare. Performance reporting is key to determine which methods should be translated into clinical practice. Frequently, broad conclusions are simply derived from mean performance values. In this paper, we argue that this common practice is often a misleading simplification as it ignores performance variability. Our contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n = 221) published in 2023, we first observe that more than 50\% of papers do not assess performance variability at all. Moreover, only one (0.5\%) paper reported confidence intervals (CIs) for model performance. (2) To address the reporting bottleneck, we show that the unreported standard deviation (SD) in segmentation papers can be approximated by a second-order polynomial function of the mean Dice similarity coefficient (DSC). Based on external validation data from 56 previous MICCAI challenges, we demonstrate that this approximation can accurately reconstruct the CI of a method using information provided in publications. (3) Finally, we reconstructed 95\% CIs around the mean DSC of MICCAI 2023 segmentation papers. The median CI width was 0.03, which is three times larger than the median performance gap between the first- and second-ranked methods. For more than 60\% of papers, the mean performance of the second-ranked method was within the CI of the first-ranked method. We conclude that current publications typically do not provide sufficient evidence to support which models could potentially be translated into clinical practice.
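As an illustration of step (3), a sketch of the reconstruction under a normal approximation; the polynomial coefficients A0, A1, A2 below are placeholders, not the values fitted on the 56 MICCAI challenges:

```python
import math

# Placeholder coefficients of the second-order polynomial SD(DSC) fit;
# the actual values come from the paper's external validation fit.
A0, A1, A2 = 0.25, -0.20, -0.03

def reconstruct_ci(mean_dsc: float, n_cases: int, z: float = 1.96):
    """Approximate a 95% CI around a reported mean Dice score.

    The unreported SD is approximated as a quadratic function of the mean
    DSC, then a standard normal-approximation CI is formed from it.
    """
    sd = A0 + A1 * mean_dsc + A2 * mean_dsc ** 2
    half_width = z * sd / math.sqrt(n_cases)
    return mean_dsc - half_width, mean_dsc + half_width

print(reconstruct_ci(mean_dsc=0.85, n_cases=100))
```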
Authors: Mehdi Neshat, Michael Phipps, Chris A. Browne, Nicole T. Vargas, Seyedali Mirjalili
Abstract: Predicting hospital length of stay (LoS) stands as a critical factor in shaping public health strategies. This data serves as a cornerstone for governments to discern trends, patterns, and avenues for enhancing healthcare delivery. In this study, we introduce a robust hybrid deep learning model that combines multi-layer Convolutional Neural Networks (CNNs), Gated Recurrent Units (GRU), and Dense neural networks, and that outperforms 11 conventional and state-of-the-art Machine Learning (ML) and Deep Learning (DL) methodologies in accurately forecasting inpatient hospital stay duration. Our investigation delves into the implementation of this hybrid model, scrutinising variables like geographic indicators tied to caregiving institutions, demographic markers encompassing patient ethnicity, race, and age, as well as medical attributes such as the CCS diagnosis code, APR DRG code, illness severity metrics, and hospital stay duration. Statistical evaluations reveal that the proposed model (CNN-GRU-DNN) achieves the highest LoS accuracy, averaging 89% across a 10-fold cross-validation test and surpassing LSTM, BiLSTM, GRU, and Convolutional Neural Networks (CNNs) by 19%, 18.2%, 18.6%, and 7%, respectively. Accurate LoS predictions not only empower hospitals to optimise resource allocation and curb expenses associated with prolonged stays but also pave the way for novel strategies in hospital stay management. This avenue holds promise for catalysing advancements in healthcare research and innovation, inspiring a new era of precision-driven healthcare practices.
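A minimal PyTorch sketch of such a CNN-GRU-Dense stack, assuming illustrative layer sizes rather than the paper's tuned configuration, and treating LoS prediction as classification over stay-duration bins, as the reported accuracy metric suggests:

```python
import torch
import torch.nn as nn

class CNNGRUDNN(nn.Module):
    """Sketch of the hybrid pipeline: convolutional feature extraction over
    the patient feature vector, recurrent aggregation with a GRU, and a
    dense classification head."""

    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=64, hidden_size=64, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, n_classes),
        )

    def forward(self, x):                   # x: (batch, n_features)
        h = self.conv(x.unsqueeze(1))       # (batch, 64, n_features)
        h, _ = self.gru(h.transpose(1, 2))  # (batch, n_features, 64)
        return self.head(h[:, -1])          # logits over LoS bins

logits = CNNGRUDNN(n_features=24, n_classes=10)(torch.randn(8, 24))
print(logits.shape)  # torch.Size([8, 10])
```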
Authors: Fabrizio Maturo, Annamaria Porreca
Abstract: The positioning of this research falls within the scalar-on-function classification literature, a field of significant interest across various domains, particularly in statistics, mathematics, and computer science. This study introduces an advanced methodology for supervised classification by integrating Functional Data Analysis (FDA) with tree-based ensemble techniques for classifying high-dimensional time series. The proposed framework, Enriched Functional Tree-Based Classifiers (EFTCs), leverages derivative and geometric features, benefiting from the diversity inherent in ensemble methods to further enhance predictive performance and reduce variance. While our approach has been tested on the enrichment of Functional Classification Trees (FCTs), Functional K-NN (FKNN), Functional Random Forest (FRF), Functional XGBoost (FXGB), and Functional LightGBM (FLGBM), it could be extended to other tree-based and non-tree-based classifiers, with appropriate considerations emerging from this investigation. Through extensive experimental evaluations on seven real-world datasets and six simulated scenarios, the proposed framework demonstrates notable improvements over traditional approaches, providing new insights into the application of FDA in complex, high-dimensional learning problems.
Authors: Bowen Jing, Hannes St\"ark, Tommi Jaakkola, Bonnie Berger
Abstract: Molecular dynamics (MD) is a powerful technique for studying microscopic phenomena, but its computational cost has driven significant interest in the development of deep learning-based surrogate models. We introduce generative modeling of molecular trajectories as a paradigm for learning flexible multi-task surrogate models of MD from data. By conditioning on appropriately chosen frames of the trajectory, we show such generative models can be adapted to diverse tasks such as forward simulation, transition path sampling, and trajectory upsampling. By alternatively conditioning on part of the molecular system and inpainting the rest, we also demonstrate the first steps towards dynamics-conditioned molecular design. We validate the full set of these capabilities on tetrapeptide simulations and show that our model can produce reasonable ensembles of protein monomers. Altogether, our work illustrates how generative modeling can unlock value from MD data towards diverse downstream tasks that are not straightforward to address with existing methods or even MD itself. Code is available at https://github.com/bjing2016/mdgen.
Authors: Zeyu Li, Hongkun Dou, Shen Fang, Wang Han, Yue Deng, Lijun Yang
Abstract: The reconstruction of physical fields from sparse measurements is pivotal in both scientific research and engineering applications. Traditional methods are increasingly supplemented by deep learning models due to their efficacy in extracting features from data. However, except for the low accuracy on complex physical systems, these models often fail to comply with essential physical constraints, such as governing equations and boundary conditions. To overcome this limitation, we introduce a novel data-driven field reconstruction framework, termed the Physics-aligned Schr\"{o}dinger Bridge (PalSB). This framework leverages a diffusion Schr\"{o}dinger bridge mechanism that is specifically tailored to align with physical constraints. The PalSB approach incorporates a dual-stage training process designed to address both local reconstruction mapping and global physical principles. Additionally, a boundary-aware sampling technique is implemented to ensure adherence to physical boundary conditions. We demonstrate the effectiveness of PalSB through its application to three complex nonlinear systems: cylinder flow from Particle Image Velocimetry experiments, two-dimensional turbulence, and a reaction-diffusion system. The results reveal that PalSB not only achieves higher accuracy but also exhibits enhanced compliance with physical constraints compared to existing methods. This highlights PalSB's capability to generate high-quality representations of intricate physical interactions, showcasing its potential for advancing field reconstruction techniques.
Authors: Antonio Mirarchi, Raul P. Pelaez, Guillem Simeon, Gianni De Fabritiis
Abstract: All-atom molecular simulations offer detailed insights into macromolecular phenomena, but their substantial computational cost hinders the exploration of complex biological processes. We introduce Advanced Machine-learning Atomic Representation Omni-force-field (AMARO), a new neural network potential (NNP) that combines an O(3)-equivariant message-passing neural network architecture, TensorNet, with a coarse-graining map that excludes hydrogen atoms. AMARO demonstrates the feasibility of training coarser NNPs, without prior energy terms, to run stable protein dynamics with scalability and generalization capabilities.
Authors: Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan
Abstract: We develop a solvable model of neural scaling laws beyond the kernel limit. Theoretical analysis of this model shows how performance scales with model size, training time, and the total amount of available data. We identify three scaling regimes corresponding to varying task difficulties: hard, easy, and super-easy tasks. For easy and super-easy target functions, which lie in the reproducing kernel Hilbert space (RKHS) defined by the initial infinite-width Neural Tangent Kernel (NTK), the scaling exponents remain unchanged between feature learning and kernel regime models. For hard tasks, defined as those outside the RKHS of the initial NTK, we demonstrate both analytically and empirically that feature learning can improve scaling with training time and compute, nearly doubling the exponent for hard tasks. This leads to a different compute optimal strategy to scale parameters and training time in the feature learning regime. We support our finding that feature learning improves the scaling law for hard tasks but not for easy and super-easy tasks with experiments on nonlinear MLPs fitting functions with power-law Fourier spectra on the circle and CNNs learning vision tasks.
Authors: Christian Ganh\"or, Marta Moscati, Anna Hausberger, Shah Nawaz, Markus Schedl
Abstract: Most recommender systems adopt collaborative filtering (CF) and provide recommendations based on past collective interactions. Therefore, the performance of CF algorithms degrades when few or no interactions are available, a scenario referred to as cold-start. To address this issue, previous work relies on models leveraging both collaborative data and side information on the users or items. Similar to multimodal learning, these models aim at combining collaborative and content representations in a shared embedding space. In this work, we propose a novel technique for multimodal recommendation, relying on a multimodal Single-Branch embedding network for Recommendation (SiBraR). Leveraging weight-sharing, SiBraR encodes interaction data as well as multimodal side information using the same single-branch embedding network on different modalities. This makes SiBraR effective in scenarios of missing modality, including cold start. Our extensive experiments on large-scale recommendation datasets from three different recommendation domains (music, movie, and e-commerce) and providing multimodal content information (audio, text, image, labels, and interactions) show that SiBraR significantly outperforms CF as well as state-of-the-art content-based recommender systems in cold-start scenarios, and is competitive in warm scenarios. We show that SiBraR's recommendations are accurate in missing modality scenarios, and that the model is able to map different modalities to the same region of the shared embedding space, hence reducing the modality gap.
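A minimal sketch of the single-branch idea, assuming modality-specific input projections feeding one weight-shared embedding network; dimensions and names are illustrative, not SiBraR's actual architecture:

```python
import torch
import torch.nn as nn

class SingleBranchEncoder(nn.Module):
    """One shared embedding branch for all modalities: each modality gets
    only a thin input projection, so any available modality can be mapped
    into the same embedding space by the same weights."""

    def __init__(self, modality_dims: dict, d: int = 128):
        super().__init__()
        self.project = nn.ModuleDict(
            {m: nn.Linear(dim, d) for m, dim in modality_dims.items()})
        self.shared = nn.Sequential(  # the single, weight-shared branch
            nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, modality: str, x):
        return self.shared(self.project[modality](x))

enc = SingleBranchEncoder({"audio": 512, "text": 768, "interactions": 1000})
z_text = enc("text", torch.randn(4, 768))  # any modality, same branch
print(z_text.shape)  # torch.Size([4, 128])
```

Because the branch is shared, an item with a missing modality can still be embedded from whichever modality is present, which is what makes the scheme natural for cold start.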
Authors: Taridzo Chomutare, Aleksandar Babic, Laura-Maria Peltonen, Silja Elunurm, Peter Lundberg, Arne J\"onsson, Emma Eneling, Ciprian-Virgil Gerstenberger, Troels Siggaard, Raivo Kolde, Oskar Jerdhaf, Martin Hansson, Alexandra Makhlysheva, Miroslav Muzny, Erik Ylip\"a\"a, S{\o}ren Brunak, Hercules Dalianis
Abstract: Background: Centralized collection and processing of healthcare data across national borders pose significant challenges, including privacy concerns, data heterogeneity and legal barriers. To address some of these challenges, we formed an interdisciplinary consortium to develop a federated health data network, comprising six institutions across five countries, to facilitate Nordic-Baltic cooperation on secondary use of health data. The objective of this report is to offer early insights into our experiences developing this network. Methods: We used a mixed-method approach, combining both experimental design and implementation science to evaluate the factors affecting the implementation of our network. Results: Technically, our experiments indicate that the network functions without significant performance degradation compared to centralized simulation. Conclusion: While the use of interdisciplinary approaches holds potential to solve challenges associated with establishing such collaborative networks, our findings turn the spotlight on the uncertain regulatory landscape, which is still playing catch-up, and the significant operational costs.
Authors: David Olivares, Pierre Fournier, Pavan Vasishta, Julien Marzat
Abstract: This paper evaluates and compares the performance of model-free and model-based reinforcement learning for the attitude control of fixed-wing unmanned aerial vehicles using PID as a reference point. The comparison focuses on their ability to handle varying flight dynamics and wind disturbances in a simulated environment. Our results show that the Temporal Difference Model Predictive Control agent outperforms both the PID controller and other model-free reinforcement learning methods in terms of tracking accuracy and robustness over different reference difficulties, particularly in nonlinear flight regimes. Furthermore, we introduce actuation fluctuation as a key metric to assess energy efficiency and actuator wear, and we test two different approaches from the literature: action variation penalty and conditioning for action policy smoothness. We also evaluate all control methods when subject to stochastic turbulence and gusts separately, so as to measure their effects on tracking performance, observe their limitations and outline their implications on the Markov decision process formalism.
Authors: Gaoxiang Li, Yu Zhuang
Abstract: The rapid expansion of Internet of Things (IoT) devices demands robust and resource-efficient security solutions. Physically Unclonable Functions (PUFs), which generate unique cryptographic keys from inherent hardware variations, offer a promising approach. However, traditional PUFs like Arbiter PUFs (APUFs) and XOR Arbiter PUFs (XOR-PUFs) are susceptible to machine learning (ML) and reliability-based attacks. In this study, we investigate Component-Differentially Challenged XOR-PUFs (CDC-XPUFs), a less explored variant, to address these vulnerabilities. We propose an optimized CDC-XPUF design that incorporates a pre-selection strategy to enhance reliability and introduces a novel lightweight architecture to reduce hardware overhead. Rigorous testing demonstrates that our design significantly lowers resource consumption, maintains strong resistance to ML attacks, and improves reliability, effectively mitigating reliability-based attacks. These results highlight the potential of CDC-XPUFs as a secure and efficient candidate for widespread deployment in resource-constrained IoT systems.
Authors: Bingyao Liu, Iris Li, Jianhua Yao, Yuan Chen, Guanming Huang, Jiajing Wang
Abstract: This paper takes the graph neural network as the technical framework, integrates the intrinsic connections between enterprise financial indicators, and proposes a model for enterprise credit risk assessment. The main research work includes: first, drawing on prior work, we selected 29 enterprise financial data indicators, abstracted each indicator as a vertex, analyzed the relationships between the indicators in depth, constructed a similarity matrix of indicators, and used the maximum spanning tree algorithm to map each enterprise to a graph structure; second, in the representation learning phase of the mapped graph, a graph neural network model was built to obtain its embedded representation: the feature vector of each node was expanded to 32 dimensions, three GraphSAGE operations were performed on the graph with the results pooled using the Pool operation, and the resulting three feature vectors were averaged to obtain the graph's embedded representation; finally, a classifier was constructed using a two-layer fully connected network to complete the prediction task. Experimental results on real enterprise data show that the proposed model performs well on multi-level credit rating estimation. Furthermore, the tree-structured graph mapping captures the intrinsic connections among the company's indicator data, and according to ROC and other evaluation criteria, the model's classification performance is significant and robust.
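A sketch of the graph-mapping step, assuming absolute Pearson correlation as the indicator similarity (the paper's exact similarity measure may differ; the data below is synthetic):

```python
import networkx as nx
import numpy as np

# Hypothetical indicator matrix: rows = enterprises, cols = 29 indicators.
X = np.random.rand(500, 29)

# Similarity between indicators, here absolute Pearson correlation.
sim = np.abs(np.corrcoef(X, rowvar=False))  # (29, 29)

# Build a complete weighted graph over the 29 indicator vertices and keep
# its maximum spanning tree, as in the paper's graph-structure mapping.
G = nx.Graph()
for i in range(29):
    for j in range(i + 1, 29):
        G.add_edge(i, j, weight=sim[i, j])
tree = nx.maximum_spanning_tree(G)
print(tree.number_of_edges())  # 28 edges connect the 29 indicator vertices
```

Each enterprise then populates the 29 tree vertices with its own indicator values, yielding one graph per enterprise for the GraphSAGE stage.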
Authors: Richard Yue, John E. Ortega
Abstract: Translation memories (TMs) are the backbone for professional translation tools called computer-aided translation (CAT) tools. In order to perform a translation using a CAT tool, a translator uses the TM to gather translations similar to the desired segment to translate (s'). Many CAT tools offer a fuzzy-match algorithm to locate segments (s) in the TM that are close in distance to s'. After locating two similar segments, the CAT tool will present parallel segments (s, t) that contain one segment in the source language along with its translation in the target language. Additionally, CAT tools contain fuzzy-match repair (FMR) techniques that will automatically use the parallel segments from the TM to create new TM entries containing a modified version of the original, with the aim that it will be the translation of s'. Most FMR techniques use machine translation as a way of "repairing" those words that have to be modified. In this article, we show that for a large part of those words which are anchored, we can use other techniques that are based on machine learning approaches such as Word2Vec, BERT, and even ChatGPT. Specifically, we show that for anchored words that follow the continuous bag-of-words (CBOW) paradigm, Word2Vec, BERT, and GPT-4 can be used to achieve similar and, for some cases, better results than neural machine translation for translating anchored words from French to English.
Authors: Richard Yue, John E. Ortega, Kenneth Ward Church
Abstract: The typical workflow for a professional translator to translate a document from its source language (SL) to a target language (TL) is not always focused on what many language models in natural language processing (NLP) do - predict the next word in a series of words. While high-resource languages like English and French are reported to achieve near human parity using common metrics for measurement such as BLEU and COMET, we find that an important step is being missed: the translation of technical terms, specifically acronyms. Some state-of-the-art machine translation systems like Google Translate which are publicly available can be erroneous when dealing with acronyms - as much as 50% of the time in our findings. This article addresses acronym disambiguation for MT systems by proposing an additional step to the SL-TL (FR-EN) translation workflow, where we first offer a new acronym corpus for public consumption and then experiment with a search-based thresholding algorithm that achieves a nearly 10% increase when compared to Google Translate and OpusMT.
Authors: Linzhuang Sun, Hao Liang, Wentao Zhang
Abstract: Large Language Models (LLMs) have exhibited exceptional performance across a broad range of tasks and domains. However, they still encounter difficulties in solving mathematical problems due to the rigorous and logical nature of mathematics. Previous studies have employed techniques such as supervised fine-tuning (SFT), prompt engineering, and search-based methods to improve the mathematical problem-solving abilities of LLMs. Despite these efforts, their performance remains suboptimal and demands substantial computational resources. To address this issue, we propose a novel approach, BEATS, to enhance mathematical problem-solving abilities. Our method leverages newly designed prompts that guide the model to iteratively rewrite, advance by one step, and generate answers based on previous steps. Additionally, we introduce a new back-verification technique that uses LLMs to validate the correctness of the generated answers. Furthermore, we employ a pruning tree search to optimize search time while achieving strong performance. Notably, our method improves Qwen2-7b-Instruct's score from 36.94 to 61.52, outperforming GPT-4's 42.5 on the MATH benchmark.
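A minimal sketch of the back-verification idea, assuming a generic `llm` callable that maps a prompt string to a completion string; the prompt wording is illustrative, not the paper's:

```python
def back_verify(llm, question: str, answer: str) -> bool:
    """Ask the model to substitute its own answer back into the problem
    and check it, returning True only if the check passes."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Substitute the proposed answer back into the problem and check it "
        "step by step. Conclude with exactly one word: VALID or INVALID."
    )
    verdict = llm(prompt).strip().split()[-1]
    return verdict == "VALID"
```

In a tree search, a verifier like this can prune branches whose intermediate answers fail the check before they are expanded further.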
Authors: Janek Haberer, Ali Hojjat, Olaf Landsiedel
Abstract: The architecture of Vision Transformers (ViTs), particularly the Multi-head Attention (MHA) mechanism, imposes substantial hardware demands. Deploying ViTs on devices with varying constraints, such as mobile phones, requires multiple models of different sizes. However, this approach has limitations, such as training and storing each required model separately. This paper introduces HydraViT, a novel approach that addresses these limitations by stacking attention heads to achieve a scalable ViT. By repeatedly varying the embedding dimension in each layer, and with it the corresponding number of attention heads in MHA, during training, HydraViT induces multiple subnetworks. Thereby, HydraViT achieves adaptability across a wide spectrum of hardware environments while maintaining performance. Our experimental results demonstrate the efficacy of HydraViT in achieving a scalable ViT with up to 10 subnetworks, covering a wide range of resource constraints. HydraViT achieves up to 5 p.p. more accuracy with the same GMACs and up to 7 p.p. more accuracy with the same throughput on ImageNet-1K compared to the baselines, making it an effective solution for scenarios where hardware availability is diverse or varies over time. Source code available at https://github.com/ds-kiel/HydraViT.
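A sketch of the head-stacking mechanism, assuming subnetworks are carved out by keeping only the first k heads of a shared attention layer; sizes are illustrative, not HydraViT's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedHeadsAttention(nn.Module):
    """One shared weight set; smaller subnetworks reuse its top-left block,
    i.e. the first k heads and the first k*head_dim embedding dimensions."""

    def __init__(self, max_heads: int = 12, head_dim: int = 64):
        super().__init__()
        self.head_dim = head_dim
        d = max_heads * head_dim
        self.q, self.k, self.v = (nn.Linear(d, d) for _ in range(3))

    def forward(self, x, k_heads: int):
        # x: (batch, tokens, k_heads * head_dim) -- the sliced embedding.
        d_k = k_heads * self.head_dim
        def sliced(layer):  # apply only the top-left block of the weights
            return F.linear(x, layer.weight[:d_k, :d_k], layer.bias[:d_k])
        B, T, _ = x.shape
        q, k, v = (sliced(l).view(B, T, k_heads, self.head_dim).transpose(1, 2)
                   for l in (self.q, self.k, self.v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(B, T, d_k)

attn = StackedHeadsAttention()
y = attn(torch.randn(2, 16, 6 * 64), k_heads=6)  # a 6-head subnetwork
print(y.shape)  # torch.Size([2, 16, 384])
```

Training would sample k_heads per step so that every prefix of heads learns to work as a standalone model, which is what makes the deployed model scalable.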
Authors: Christo Kurisummoottil Thomas, Walid Saad
Abstract: Semantic communications (SC) is an emerging communication paradigm in which wireless devices can send only relevant information from a source of data while relying on computing resources to regenerate missing data points. However, the design of a multi-user SC system becomes more challenging because of the computing and communication overhead required for coordination. Existing solutions for learning the semantic language and performing resource allocation often fail to capture the computing and communication tradeoffs involved in multiuser SC. To address this gap, a novel framework for decentralized computing and communication resource allocation in multiuser SC systems is proposed. The challenge of efficiently allocating communication and computing resources (for reasoning) in a decentralized manner to maximize the quality of task experience for the end users is addressed through the application of Stackelberg hyper game theory. Leveraging the concept of second-level hyper games, novel analytical formulations are developed to model misperceptions of the users about each other's communication and control strategies. Further, equilibrium analysis of the learned resource allocation protocols examines the convergence of the computing and communication strategies to local Stackelberg equilibria, considering misperceptions. Simulation results show that the proposed Stackelberg hyper game results in efficient usage of communication and computing resources while maintaining a high quality of experience for the users, compared to state-of-the-art baselines that do not account for the misperceptions.
Authors: Peilin Wu, Weiji Xie, Jiahang Cao, Hang Lai, Weinan Zhang
Abstract: Reinforcement Learning (RL) has shown remarkable and generalizable capability in legged locomotion through sim-to-real transfer. However, while adaptive methods like domain randomization are expected to make policies more robust to diverse environments, such comprehensiveness potentially detracts from the policy's performance in any specific environment according to the No Free Lunch theorem, leading to a suboptimal solution once deployed in the real world. To address this issue, we propose a lifelong policy adaptation framework named LoopSR, which utilizes a transformer-based encoder to project real-world trajectories into a latent space, and accordingly reconstruct the real-world environments back in simulation for further improvement. An autoencoder architecture and contrastive learning are adopted to better extract the characteristics of real-world dynamics. The simulation parameters for continual training are derived by combining predicted parameters from the decoder with retrieved parameters from the simulation trajectory dataset. By leveraging continual training, LoopSR achieves superior data efficiency compared with strong baselines, requiring only a limited amount of data to achieve strong performance in both sim-to-sim and sim-to-real experiments.
Authors: L. Lao Beyer, S. Karaman
Abstract: Diffusion models have been successfully applied to robotics problems such as manipulation and vehicle path planning. In this work, we explore their application to end-to-end navigation -- including both perception and planning -- by considering the problem of jointly performing global localization and path planning in known but arbitrary 2D environments. In particular, we introduce a diffusion model which produces collision-free paths in a global reference frame given an egocentric LIDAR scan, an arbitrary map, and a desired goal position. To this end, we implement diffusion in the space of paths in SE(2), and describe how to condition the denoising process on both obstacles and sensor observations. In our evaluation, we show that the proposed conditioning techniques enable generalization to realistic maps of considerably different appearance than the training environment, demonstrate our model's ability to accurately describe ambiguous solutions, and run extensive simulation experiments showcasing our model's use as a real-time, end-to-end localization and planning stack.
Authors: Xin Cai, Zhiyuan You, Hailong Zhang, Wentao Liu, Jinwei Gu, Tianfan Xue
Abstract: Lensless cameras offer significant advantages in size, weight, and cost compared to traditional lens-based systems. Without a focusing lens, lensless cameras rely on computational algorithms to recover the scenes from multiplexed measurements. However, current algorithms struggle with inaccurate forward imaging models and insufficient priors to reconstruct high-quality images. To overcome these limitations, we introduce a novel two-stage approach for consistent and photorealistic lensless image reconstruction. The first stage of our approach ensures data consistency by focusing on accurately reconstructing the low-frequency content with a spatially varying deconvolution method that adjusts to changes in the Point Spread Function (PSF) across the camera's field of view. The second stage enhances photorealism by incorporating a generative prior from pre-trained diffusion models. By conditioning on the low-frequency content retrieved in the first stage, the diffusion model effectively reconstructs the high-frequency details that are typically lost in the lensless imaging process, while also maintaining image fidelity. Our method achieves a superior balance between data fidelity and visual quality compared to existing methods, as demonstrated with two popular lensless systems, PhlatCam and DiffuserCam. Project website: https://phocolens.github.io/.
Authors: Ronak Tali, Ali Rabeh, Cheng-Hau Yang, Mehdi Shadkhah, Samundra Karki, Abhisek Upadhyaya, Suriya Dhakshinamoorthy, Marjan Saadati, Soumik Sarkar, Adarsh Krishnamurthy, Chinmay Hegde, Aditya Balu, Baskar Ganapathysubramanian
Abstract: Simulating fluid flow around arbitrary shapes is key to solving various engineering problems. However, simulating flow physics across complex geometries remains numerically challenging and computationally resource-intensive, particularly when using conventional PDE solvers. Machine learning methods offer attractive opportunities to create fast and adaptable PDE solvers. However, benchmark datasets to measure the performance of such methods are scarce, especially for flow physics across complex geometries. We introduce FlowBench, a dataset for neural simulators with over 10K samples, which is currently larger than any publicly available flow physics dataset. FlowBench contains flow simulation data across complex geometries (\textit{parametric vs. non-parametric}), spanning a range of flow conditions (\textit{Reynolds number and Grashof number}), capturing a diverse array of flow phenomena (\textit{steady vs. transient; forced vs. free convection}), and for both 2D and 3D. Each sample is the outcome of a fully resolved, direct numerical simulation using a well-validated simulator framework designed for modeling transport phenomena in complex geometries. For each sample, we include velocity, pressure, and temperature field data at 3 different resolutions and several summary statistics features of engineering relevance (such as coefficients of lift and drag, and Nusselt numbers). Additionally, we include masks and signed distance fields for each shape. We envision that FlowBench will enable evaluating the interplay between complex geometry, coupled flow phenomena, and data sufficiency on the performance of current, and future, neural PDE solvers. We enumerate several evaluation metrics to help rank order the performance of neural PDE solvers. We benchmark the performance of several baseline methods including FNO, CNO, WNO, and DeepONet.
Authors: Soeun Lee, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim
Abstract: Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap ($\textbf{I}$mage-like Retrieval and $\textbf{F}$requency-based Entity Filtering for Zero-shot $\textbf{Cap}$tioning). Through extensive experimentation, our straightforward yet powerful approach has demonstrated its efficacy, outperforming the state-of-the-art methods by a significant margin in both image captioning and video captioning compared to zero-shot captioning based on text-only training.
Authors: Kartik Garg, Sai Shubodh Puligilla, Shishir Kolathaya, Madhava Krishna, Sourav Garg
Abstract: Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the "whole" image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: "the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap". We address this by encoding and searching for "image segments" instead of the whole images. We propose to use open-set image segmentation to decompose an image into `meaningful' entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs connecting a segment with its neighboring segments, dubbed SuperSegment. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole image based retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to ``revisit anything'' by evaluating our method on an object instance retrieval task, which bridges the two disparate areas of research: visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. Source code: https://github.com/AnyLoc/Revisit-Anything.
Authors: Yanming Wan, Yue Wu, Yiping Wang, Jiayuan Mao, Natasha Jaques
Abstract: For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.
Authors: Rob A. J. de Mooij, Josien P. W. Pluim, Cian M. Scannell
Abstract: Self-supervised pretraining (SSP) has shown promising results in learning from large unlabeled datasets and, thus, could be useful for automated cardiovascular magnetic resonance (CMR) short-axis cine segmentation. However, inconsistent reports of the benefits of SSP for segmentation have made it difficult to apply SSP to CMR. Therefore, this study aimed to evaluate SSP methods for CMR cine segmentation. To this end, short-axis cine stacks of 296 subjects (90618 2D slices) were used for unlabeled pretraining with four SSP methods: SimCLR, positional contrastive learning, DINO, and masked image modeling (MIM). Subsets of varying numbers of subjects were used for supervised fine-tuning of 2D models for each SSP method, as well as to train a 2D baseline model from scratch. The fine-tuned models were compared to the baseline using the 3D Dice similarity coefficient (DSC) in a test dataset of 140 subjects. The SSP methods showed no performance gains with the largest supervised fine-tuning subset compared to the baseline (DSC = 0.89). When only 10 subjects (231 2D slices) are available for supervised training, SSP using MIM (DSC = 0.86) improves over training from scratch (DSC = 0.82). This study found that SSP is valuable for CMR cine segmentation when labeled training data is scarce, but does not aid state-of-the-art deep learning methods when ample labeled data is available. Moreover, the choice of SSP method is important. The code is publicly available at: https://github.com/q-cardIA/ssp-cmr-cine-segmentation
Authors: Lucia Gordon, Nikhil Behari, Samuel Collier, Elizabeth Bondi-Kelly, Jackson A. Killian, Catherine Ressijac, Peter Boucher, Andrew Davies, Milind Tambe
Abstract: Much of Earth's charismatic megafauna is endangered by human activities, particularly the rhino, which is at risk of extinction due to the poaching crisis in Africa. Monitoring rhinos' movement is crucial to their protection but has unfortunately proven difficult because rhinos are elusive. Therefore, instead of tracking rhinos, we propose the novel approach of mapping communal defecation sites, called middens, which give information about rhinos' spatial behavior valuable to anti-poaching, management, and reintroduction efforts. This paper provides the first-ever mapping of rhino midden locations by building classifiers to detect them using remotely sensed thermal, RGB, and LiDAR imagery in passive and active learning settings. As existing active learning methods perform poorly due to the extreme class imbalance in our dataset, we design MultimodAL, an active learning system employing a ranking technique and multimodality to achieve competitive performance with passive learning models with 94% fewer labels. Our methods could therefore save over 76 hours in labeling time when used on a similarly-sized dataset. Unexpectedly, our midden map reveals that rhino middens are not randomly distributed throughout the landscape; rather, they are clustered. Consequently, ranger patrols should be targeted at areas with high midden densities to strengthen anti-poaching efforts, in line with UN Target 15.7.
Authors: Yuexi Du, John Onofrey, Nicha C. Dvornek
Abstract: Contrastive Language-Image Pre-training (CLIP) shows promise in medical image analysis but requires substantial data and computational resources. Due to these restrictions, existing CLIP applications in medical imaging focus mainly on modalities like chest X-rays that have abundant image-report data available, leaving many other important modalities under-explored. Here, we propose the first adaptation of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and data imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines for three different tasks on two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% model size compared with the largest baseline.
Authors: Daniel N. Nissani (Nissensohn)
Abstract: After four decades of research, there still exists a Classification accuracy gap of about 20% between our best Unsupervisedly Learned Representations methods and the accuracy rates achieved by intelligent animals. It thus may well be that we are looking in the wrong direction. A possible solution to this puzzle is presented. We demonstrate that Reinforcement Learning can learn representations which achieve the same accuracy as that of animals. Our main modest contribution lies in the observations that: a. when applied to a real world environment Reinforcement Learning does not require labels, and thus may be legitimately considered as Unsupervised Learning, and b. in contrast, when Reinforcement Learning is applied in a simulated environment it does inherently require labels and should thus generally be considered as Supervised Learning. The corollary of these observations is that further search for Unsupervised Learning competitive paradigms which may be trained in simulated environments may be futile.
Authors: Kunchi Li, Jun Wan, Shan Yu
Abstract: Rehearsal approaches in class incremental learning (CIL) suffer from decision boundary overfitting to new classes, which is mainly caused by two factors: insufficiency of old classes data for knowledge distillation and imbalanced data learning between the learned and new classes because of the limited storage memory. In this work, we present a simple but effective approach to tackle these two factors. First, we employ a re-sampling strategy and Mixup Knowledge Distillation (Re-MKD) to improve the performance of KD, which would greatly alleviate the overfitting problem. Specifically, we combine mixup and re-sampling strategies to synthesize adequate data used in KD training that are more consistent with the latent distribution between the learned and new classes. Second, we propose a novel incremental influence balance (IIB) method for CIL to tackle the classification of imbalanced data by extending the influence balance method into the CIL setting, which re-weights samples by their influences to create a proper decision boundary. With these two improvements, we present the effective decision boundary learning algorithm (EDBL) which improves the performance of KD and deals with the imbalanced data learning simultaneously. Experiments show that the proposed EDBL achieves state-of-the-art performances on several CIL benchmarks.
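A sketch of the mixup-distillation idea, assuming mixed replay/new inputs of the same shape are distilled against the previous-task teacher; the hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def mixup_kd_loss(student, teacher, x_old, x_new, alpha=0.2, T=2.0):
    """Mixup-based knowledge distillation: interpolate replayed old-class
    samples with new-class samples, then distill the teacher's soft
    predictions on the mixed inputs into the student."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_old + (1 - lam) * x_new     # same-shape batches
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x_mix) / T, dim=1)
    log_probs = F.log_softmax(student(x_mix) / T, dim=1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
```

The mixed inputs populate the region between old- and new-class distributions, which is exactly where the decision boundary tends to overfit.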
Authors: Hangtao Zhang, Zeming Yao, Leo Yu Zhang, Shengshan Hu, Chao Chen, Alan Liew, Zhetao Li
Abstract: Federated learning (FL) is vulnerable to poisoning attacks, where adversaries corrupt the global aggregation results and cause denial-of-service (DoS). Unlike recent model poisoning attacks that optimize the amplitude of malicious perturbations along certain prescribed directions to cause DoS, we propose a Flexible Model Poisoning Attack (FMPA) that can achieve versatile attack goals. We consider a practical threat scenario where no extra knowledge about the FL system (e.g., aggregation rules or updates on benign devices) is available to adversaries. FMPA exploits the global historical information to construct an estimator that predicts the next round of the global model as a benign reference. It then fine-tunes the reference model to obtain the desired poisoned model with low accuracy and small perturbations. Besides the goal of causing DoS, FMPA can be naturally extended to launch a fine-grained controllable attack, making it possible to precisely reduce the global accuracy. Armed with precise control, malicious FL service providers can gain advantages over their competitors without getting noticed, hence opening a new attack surface in FL other than DoS. Even for the purpose of DoS, experiments show that FMPA significantly decreases the global accuracy, outperforming six state-of-the-art attacks.
Authors: Changlong Wu, Ananth Grama, Wojciech Szpankowski
Abstract: We study online classification of features into labels with general hypothesis classes. In our setting, true labels are determined by some function within the hypothesis class but are corrupted by unknown stochastic noise, and the features are generated adversarially. Predictions are made using observed noisy labels and noiseless features, while the performance is measured via minimax risk when comparing against true labels. The noise mechanism is modeled via a general noise kernel that specifies, for any individual data point, a set of distributions from which the actual noisy label distribution is chosen. We show that minimax risk is tightly characterized (up to a logarithmic factor of the hypothesis class size) by the Hellinger gap of the noisy label distributions induced by the kernel, independent of other properties such as the means and variances of the noise. Our main technique is based on a novel reduction to an online comparison scheme of two hypotheses, along with a new conditional version of Le Cam-Birg\'e testing suitable for online settings. Our work provides the first comprehensive characterization for noisy online classification with guarantees with respect to the ground truth while addressing general noisy observations.
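For reference, the characterization is stated in terms of a gap in the Hellinger distance between candidate noisy-label distributions; the precise gap quantity is defined in the paper, but it builds on the standard squared Hellinger distance:

```latex
H^2(P, Q) \;=\; \frac{1}{2} \int \left( \sqrt{\mathrm{d}P} - \sqrt{\mathrm{d}Q} \right)^2
\;=\; 1 - \int \sqrt{\mathrm{d}P \,\mathrm{d}Q}.
```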
Authors: Mateo Espinosa Zarlenga, Katherine M. Collins, Krishnamurthy Dvijotham, Adrian Weller, Zohreh Shams, Mateja Jamnik
Abstract: Concept Bottleneck Models (CBMs) tackle the opacity of neural architectures by constructing and explaining their predictions using a set of high-level concepts. A special property of these models is that they permit concept interventions, wherein users can correct mispredicted concepts and thus improve the model's performance. Recent work, however, has shown that intervention efficacy can be highly dependent on the order in which concepts are intervened on and on the model's architecture and training hyperparameters. We argue that this is rooted in a CBM's lack of train-time incentives for the model to be appropriately receptive to concept interventions. To address this, we propose Intervention-aware Concept Embedding models (IntCEMs), a novel CBM-based architecture and training paradigm that improves a model's receptiveness to test-time interventions. Our model learns a concept intervention policy in an end-to-end fashion from which it can sample meaningful intervention trajectories at train-time. This conditions IntCEMs to effectively select and receive concept interventions when deployed at test-time. Our experiments show that IntCEMs significantly outperform state-of-the-art concept-interpretable models when provided with test-time concept interventions, demonstrating the effectiveness of our approach.
Authors: Heinrich Peters, Alireza Hashemi, James Rae
Abstract: Machine learning (ML) and artificial intelligence (AI) systems rely heavily on human-annotated data for training and evaluation. A major challenge in this context is the occurrence of annotation errors, as their effects can degrade model performance. This paper presents a predictive error model trained to detect potential errors in search relevance annotation tasks for three industry-scale ML applications (music streaming, video streaming, and mobile apps). Drawing on real-world data from an extensive search relevance annotation program, we demonstrate that errors can be predicted with moderate model performance (AUC=0.65-0.75) and that model performance generalizes well across applications (i.e., a global, task-agnostic model performs on par with task-specific models). In contrast to past research, which has often focused on predicting annotation labels from task-specific features, our model is trained to predict errors directly from a combination of task features and behavioral features derived from the annotation process, in order to achieve a high degree of generalizability. We demonstrate the usefulness of the model in the context of auditing, where prioritizing tasks with high predicted error probabilities considerably increases the amount of corrected annotation errors (e.g., 40% efficiency gains for the music streaming application). These results highlight that behavioral error detection models can yield considerable improvements in the efficiency and quality of data annotation processes. Our findings reveal critical insights into effective error management in the data annotation process, thereby contributing to the broader field of human-in-the-loop ML.
Authors: Takeshi Koshizuka, Masahiro Fujisawa, Yusuke Tanaka, Issei Sato
Abstract: In this paper, we explore the expressivity and trainability of the Fourier Neural Operator (FNO). We establish a mean-field theory for the FNO, analyzing the behavior of the random FNO from an edge-of-chaos perspective. Our investigation into the expressivity of a random FNO involves examining the ordered-chaos phase transition of the network based on the weight distribution. This phase transition demonstrates characteristics unique to the FNO, induced by mode truncation, while also showcasing similarities to those of densely connected networks. Furthermore, we identify a connection between expressivity and trainability: the ordered and chaotic phases correspond to regions of vanishing and exploding gradients, respectively. This finding provides a practical prerequisite for the stable training of the FNO. Our experimental results corroborate our theoretical findings.
Authors: Nianli Peng, Muhang Tian, Brandon Fain
Abstract: We study multi-objective reinforcement learning with nonlinear preferences over trajectories. That is, we maximize the expected value of a nonlinear function over accumulated rewards (expected scalarized return or ESR) in a multi-objective Markov Decision Process (MOMDP). We derive an extended form of Bellman optimality for nonlinear optimization that explicitly considers time and current accumulated reward. Using this formulation, we describe an approximation algorithm for computing an approximately optimal non-stationary policy in pseudopolynomial time for smooth scalarization functions with a constant number of rewards. We prove the approximation analytically and demonstrate the algorithm experimentally, showing that there can be a substantial gap between the optimal policy computed by our algorithm and alternative baselines.
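Schematically, the extended optimality equation augments the state with time and accumulated reward; in notation that is ours rather than the paper's, with scalarization $f$ applied to the accumulated vector reward $w$ at the horizon $T$:

```latex
V_t(s, w) \;=\; \max_{a \in A} \; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
\left[ V_{t+1}\!\left(s',\, w + r(s, a)\right) \right],
\qquad
V_T(s, w) \;=\; f(w).
```

Because $f$ is nonlinear, the optimal action can depend on $t$ and $w$, which is why the resulting policies are non-stationary.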
Authors: Zhou Lu
Abstract: Can a physicist make only a finite number of errors in the eternal quest to uncover the law of nature? This millennium-old philosophical problem, known as inductive inference, lies at the heart of epistemology. Despite its significance to understanding human reasoning, a rigorous justification of inductive inference has remained elusive. At a high level, inductive inference asks whether one can make at most finite errors amidst an infinite sequence of observations, when deducing the correct hypothesis from a given hypothesis class. Historically, the only theoretical guarantee has been that if the hypothesis class is countable, inductive inference is possible, as exemplified by Solomonoff induction for learning Turing machines. In this paper, we provide a tight characterization of inductive inference by establishing a novel link to online learning theory. As our main result, we prove that inductive inference is possible if and only if the hypothesis class is a countable union of online learnable classes, potentially of uncountable size, whether the observations are adaptively chosen or i.i.d. sampled. Moreover, the same condition is also sufficient and necessary in the agnostic setting, where any hypothesis class meeting this criterion enjoys an $\tilde{O}(\sqrt{T})$ regret bound for any time step $T$, while others require an arbitrarily slow rate of regret. Our main technical tool is a novel non-uniform online learning framework, which may be of independent interest.
Authors: Dishank Bansal, Ricky T. Q. Chen, Mustafa Mukadam, Brandon Amos
Abstract: Deep learning models are often deployed in downstream tasks that the training procedure may not be aware of. For example, models solely trained to achieve accurate predictions may struggle to perform well on downstream tasks because seemingly small prediction errors may incur drastic task errors. The standard end-to-end learning approach is to make the task loss differentiable or to introduce a differentiable surrogate that the model can be trained on. In these settings, the task loss needs to be carefully balanced with the prediction loss because they may have conflicting objectives. We propose to take the task loss signal one level deeper than the parameters of the model and use it to learn the parameters of the loss function the model is trained on, which can be done by learning a metric in the prediction space. This approach does not alter the optimal prediction model itself, but rather changes the model learning to emphasize the information important for the downstream task. This enables us to achieve the best of both worlds: a prediction model trained in the original prediction space while also being valuable for the desired downstream task. We validate our approach through experiments conducted in two main settings: 1) decision-focused model learning scenarios involving portfolio optimization and budget allocation, and 2) reinforcement learning in noisy environments with distracting states. The source code to reproduce our experiments is available at https://github.com/facebookresearch/taskmet
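A minimal sketch of a learned prediction-space metric, assuming a diagonal (per-dimension) weighting for simplicity; in the paper the metric parameters are driven by the downstream task loss, which is omitted here:

```python
import torch
import torch.nn as nn

class MetricWeightedLoss(nn.Module):
    """Mahalanobis-style prediction loss with learnable per-dimension
    weights. The weights would be updated from the downstream task loss,
    e.g. by differentiating through the model's training step, so that
    task-relevant prediction dimensions are emphasized."""

    def __init__(self, pred_dim: int):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(pred_dim))

    def forward(self, y_pred, y_true):
        w = torch.exp(self.log_scale)               # positive metric weights
        return ((y_pred - y_true) ** 2 * w).mean()  # weighted squared error

loss_fn = MetricWeightedLoss(pred_dim=5)
loss = loss_fn(torch.randn(8, 5), torch.randn(8, 5))
```

With uniform weights this reduces to ordinary mean squared error, which is why the optimal prediction model itself is unchanged; only the emphasis during learning shifts.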
Authors: Yikun Bai, Rocio Diaz Martin, Abihith Kothapalli, Hengrong Du, Xinran Liu, Soheil Kolouri
Abstract: The Gromov-Wasserstein (GW) distance has gained increasing interest in the machine learning community in recent years, as it allows for the comparison of measures in different metric spaces. To overcome the limitations imposed by the equal mass requirements of the classical GW problem, researchers have begun exploring its application in unbalanced settings. However, Unbalanced GW (UGW) can only be regarded as a discrepancy rather than a rigorous metric/distance between two metric measure spaces (mm-spaces). In this paper, we propose a particular case of the UGW problem, termed Partial Gromov-Wasserstein (PGW). We establish that PGW is a well-defined metric between mm-spaces and discuss its theoretical properties, including the existence of a minimizer for the PGW problem and the relationship between PGW and GW, among others. We then propose two variants of the Frank-Wolfe algorithm for solving the PGW problem and show that they are mathematically and computationally equivalent. Moreover, based on our PGW metric, we introduce the analogous concept of barycenters for mm-spaces. Finally, we validate the effectiveness of our PGW metric and related solvers in applications such as shape matching, shape retrieval, and shape interpolation, comparing them against existing baselines.
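For orientation, the classical GW objective between mm-spaces $(X, d_X, \mu)$ and $(Y, d_Y, \nu)$ is

```latex
\mathrm{GW}^2(\mu, \nu) \;=\; \min_{\pi \in \Pi(\mu, \nu)}
\iint \left| d_X(x, x') - d_Y(y, y') \right|^2 \mathrm{d}\pi(x, y)\, \mathrm{d}\pi(x', y'),
```

and, schematically, PGW replaces the set of couplings $\Pi(\mu, \nu)$ with partial couplings whose marginals are dominated by $\mu$ and $\nu$, penalizing the discarded mass; the precise formulation and its metric properties are established in the paper.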
Authors: Michael S. Yao, Yimeng Zeng, Hamsa Bastani, Jacob Gardner, James C. Gee, Osbert Bastani
Abstract: Offline model-based optimization seeks to optimize against a learned surrogate model without querying the true oracle objective function during optimization. Such tasks are commonly encountered in protein design, robotics, and clinical medicine where evaluating the oracle function is prohibitively expensive. However, inaccurate surrogate model predictions are frequently encountered along offline optimization trajectories. To address this limitation, we propose generative adversarial model-based optimization using adaptive source critic regularization (aSCR) -- a task- and optimizer- agnostic framework for constraining the optimization trajectory to regions of the design space where the surrogate function is reliable. We propose a computationally tractable algorithm to dynamically adjust the strength of this constraint, and show how leveraging aSCR with standard Bayesian optimization outperforms existing methods on a suite of offline generative design tasks. Our code is available at https://github.com/michael-s-yao/gabo
Authors: Wang Jia, Hang Xu
Abstract: Deep Reinforcement Learning (DRL) has emerged as a promising approach for handling highly dynamic and nonlinear Active Flow Control (AFC) problems. However, the computational cost associated with training DRL models presents a significant performance bottleneck. To address this challenge and enable efficient scaling on high-performance computing architectures, this study focuses on optimizing DRL-based algorithms in parallel settings. We validate an existing state-of-the-art DRL framework used for AFC problems and discuss its efficiency bottlenecks. Subsequently, by deconstructing the overall framework and conducting extensive scalability benchmarks for individual components, we investigate various hybrid parallelization configurations and propose efficient parallelization strategies. Moreover, we refine input/output (I/O) operations in multi-environment DRL training to tackle critical overhead associated with data movement. Finally, we demonstrate the optimized framework on a typical AFC problem where near-linear scaling can be obtained for the overall framework. We achieve a significant boost in parallel efficiency from around 49% to approximately 78%, and the training process is accelerated by approximately 47 times using 60 central processing unit (CPU) cores. These findings are expected to provide valuable insights for further advancements in DRL-based AFC studies, which remain a prominent and actively studied research area.
Authors: Zhenyu Sun, Ermin Wei
Abstract: Classical convergence analyses for optimization algorithms rely on the widely-adopted uniform smoothness assumption. However, recent experimental studies have demonstrated that many machine learning problems exhibit non-uniform smoothness, meaning the smoothness factor is a function of the model parameter instead of a universal constant. In particular, it has been observed that the smoothness grows with respect to the gradient norm along the training trajectory. Motivated by this phenomenon, the recently introduced $(L_0, L_1)$-smoothness is a more general notion, compared to traditional $L$-smoothness, that captures such positive relationship between smoothness and gradient norm. Under this type of non-uniform smoothness, existing literature has designed stochastic first-order algorithms by utilizing gradient clipping techniques to obtain the optimal $\mathcal{O}(\epsilon^{-3})$ sample complexity for finding an $\epsilon$-approximate first-order stationary solution. Nevertheless, studies of quasi-Newton methods are still lacking. Given the higher accuracy and greater robustness of quasi-Newton methods, in this paper we propose a fast stochastic quasi-Newton method for the setting where smoothness is non-uniform. Leveraging gradient clipping and variance reduction, our algorithm can achieve the best-known $\mathcal{O}(\epsilon^{-3})$ sample complexity and enjoys convergence speedup with simple hyperparameter tuning. Our numerical experiments show that our proposed algorithm outperforms the state-of-the-art approaches.
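For reference, $(L_0, L_1)$-smoothness requires, for twice-differentiable $f$,

```latex
\left\| \nabla^2 f(x) \right\| \;\le\; L_0 + L_1 \left\| \nabla f(x) \right\|,
```

so the local smoothness is allowed to grow with the gradient norm; a common clipped update of the kind such analyses build on takes steps $x_{k+1} = x_k - \eta \min\{1, \gamma / \|g_k\|\}\, g_k$ for a stochastic gradient estimate $g_k$, which caps the effective step size exactly where the curvature bound blows up.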
Authors: Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L. Caterini, Jesse C. Cresswell
Abstract: In recent years there has been increased interest in understanding the interplay between deep generative models (DGMs) and the manifold hypothesis. Research in this area focuses on understanding the reasons why commonly-used DGMs succeed or fail at learning distributions supported on unknown low-dimensional manifolds, as well as developing new models explicitly designed to account for manifold-supported data. This manifold lens provides both clarity as to why some DGMs (e.g. diffusion models and some generative adversarial networks) empirically surpass others (e.g. likelihood-based models such as variational autoencoders, normalizing flows, or energy-based models) at sample generation, and guidance for devising more performant DGMs. We carry out the first survey of DGMs viewed through this lens, making two novel contributions along the way. First, we formally establish that numerical instability of likelihoods in high ambient dimensions is unavoidable when modelling data with low intrinsic dimension. We then show that DGMs on learned representations of autoencoders can be interpreted as approximately minimizing Wasserstein distance: this result, which applies to latent diffusion models, helps justify their outstanding empirical results. The manifold lens provides a rich perspective from which to understand DGMs, and we aim to make this perspective more accessible and widespread.
Authors: Tian Xie, Xueru Zhang
Abstract: As machine learning (ML) models are increasingly used in social domains to make consequential decisions about humans, they often have the power to reshape data distributions. Humans, as strategic agents, continuously adapt their behaviors in response to the learning system. As populations change dynamically, ML systems may need frequent updates to ensure high performance. However, acquiring high-quality human-annotated samples can be highly challenging and even infeasible in social domains. A common practice to address this issue is using the model itself to annotate unlabeled data samples. This paper investigates the long-term impacts of retraining ML models with model-annotated samples when the samples incorporate human strategic responses. We first formalize the interactions between strategic agents and the model and then analyze how they evolve under such dynamic interactions. We find that agents are increasingly likely to receive positive decisions as the model gets retrained, whereas the proportion of agents with positive labels may decrease over time. We thus propose a refined retraining process to stabilize the dynamics. Last, we examine how algorithmic fairness can be affected by these retraining processes and find that enforcing common fairness constraints at every round may not benefit the disadvantaged group in the long run. Experiments on (semi-)synthetic and real data validate the theoretical findings.
Authors: Francesco Emanuele Stradi, Anna Lunghi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Abstract: In constrained Markov decision processes (CMDPs) with adversarial rewards and constraints, a well-known impossibility result prevents any algorithm from attaining both sublinear regret and sublinear constraint violation, when competing against a best-in-hindsight policy that satisfies constraints on average. In this paper, we show that this negative result can be eased in CMDPs with non-stationary rewards and constraints, by providing algorithms whose performance degrades smoothly as non-stationarity increases. Specifically, we propose algorithms attaining $\tilde{\mathcal{O}} (\sqrt{T} + C)$ regret and positive constraint violation under bandit feedback, where $C$ is a corruption value measuring the environment non-stationarity. This can be $\Theta(T)$ in the worst case, consistently with the impossibility result for adversarial CMDPs. First, we design an algorithm with the desired guarantees when $C$ is known. Then, in the case $C$ is unknown, we show how to obtain the same results by embedding such an algorithm in a general meta-procedure. This is of independent interest, as it can be applied to any non-stationary constrained online learning setting.
Authors: Shuaipeng Li, Penghao Zhao, Hailin Zhang, Xingwu Sun, Hao Wu, Dian Jiao, Weiyan Wang, Chengjun Liu, Zheng Fang, Jinbao Xue, Yangyu Tao, Bin Cui, Di Wang
Abstract: In current deep learning tasks, Adam-style optimizers such as Adam, Adagrad, RMSProp, Adafactor, and Lion have been widely used as alternatives to SGD-style optimizers. These optimizers typically update model parameters using the sign of gradients, resulting in more stable convergence curves. The learning rate and the batch size are the most critical hyperparameters for optimizers, which require careful tuning to enable effective convergence. Previous research has shown that the optimal learning rate increases linearly or follows similar rules with batch size for SGD-style optimizers. However, this conclusion is not applicable to Adam-style optimizers. In this paper, we elucidate the connection between optimal learning rates and batch sizes for Adam-style optimizers through both theoretical analysis and extensive experiments. First, we establish the scaling law between batch sizes and optimal learning rates in the sign-of-gradient case, in which we prove that the optimal learning rate first rises and then falls as the batch size increases. Moreover, the peak value of the surge will gradually move toward the larger batch size as training progresses. Second, we conduct experiments on various CV and NLP tasks and verify the correctness of the scaling law.
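As a minimal sketch of the sign-of-gradient regime the scaling law is derived in (a generic momentum-smoothed sign update, not any specific optimizer's exact rule; the function name is ours):

```python
import numpy as np

def sign_update(w, grad, m, lr=1e-3, beta=0.9):
    """Minimal sign-of-gradient update: parameters move by a fixed step
    lr in the direction of the sign of a momentum-averaged gradient,
    which is what decouples step magnitude from gradient scale."""
    m = beta * m + (1 - beta) * grad  # smooth the stochastic gradient
    return w - lr * np.sign(m), m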
Authors: T. Konstantin Rusch, Nathan Kirk, Michael M. Bronstein, Christiane Lemieux, Daniela Rus
Abstract: Discrepancy is a well-known measure for the irregularity of the distribution of a point set. Point sets with small discrepancy are called low-discrepancy and are known to efficiently fill the space in a uniform manner. Low-discrepancy points play a central role in many problems in science and engineering, including numerical integration, computer vision, machine perception, computer graphics, machine learning, and simulation. In this work, we present the first machine learning approach to generate a new class of low-discrepancy point sets named Message-Passing Monte Carlo (MPMC) points. Motivated by the geometric nature of generating low-discrepancy point sets, we leverage tools from Geometric Deep Learning and base our model on Graph Neural Networks. We further provide an extension of our framework to higher dimensions, which flexibly allows the generation of custom-made points that emphasize the uniformity in specific dimensions that are primarily important for the particular problem at hand. Finally, we demonstrate that our proposed model achieves state-of-the-art performance superior to previous methods by a significant margin. In fact, MPMC points are empirically shown to be either optimal or near-optimal with respect to the discrepancy for low dimension and small number of points, i.e., for which the optimal discrepancy can be determined. Code for generating MPMC points can be found at https://github.com/tk-rusch/MPMC.
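As a concrete illustration of the discrepancy measure being optimized, the following sketch compares the centered L2 discrepancy of i.i.d. random points against scrambled Sobol points using SciPy's quasi-Monte Carlo utilities; this uses standard low-discrepancy baselines, not the MPMC model itself:

```python
import numpy as np
from scipy.stats import qmc

n, d = 128, 2
random_pts = np.random.default_rng(0).random((n, d))       # i.i.d. uniform points
sobol_pts = qmc.Sobol(d, scramble=True, seed=0).random(n)  # low-discrepancy set

# Lower centered-L2 discrepancy indicates a more uniform filling of [0, 1]^d.
print("random:", qmc.discrepancy(random_pts, method="CD"))
print("sobol: ", qmc.discrepancy(sobol_pts, method="CD"))
```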
Authors: William L. Tong, Cengiz Pehlevan
Abstract: In-context learning (ICL), the remarkable ability to solve a task from only input exemplars, is often assumed to be a unique hallmark of Transformer models. By examining commonly employed synthetic ICL tasks, we demonstrate that multi-layer perceptrons (MLPs) can also learn in-context. Moreover, MLPs, and the closely related MLP-Mixer models, learn in-context competitively with Transformers given the same compute budget in this setting. We further show that MLPs outperform Transformers on a series of classical tasks from psychology designed to test relational reasoning, which are closely related to in-context classification. These results underscore a need for studying in-context learning beyond attention-based architectures, while also challenging strong prior arguments about MLPs' limited ability to solve relational tasks. Altogether, our results highlight the unexpected competence of MLPs, and support the growing interest in all-MLP alternatives to task-specific architectures.
Authors: Minghao Xu, Yunteng Geng, Yihang Zhang, Ling Yang, Jian Tang, Wentao Zhang
Abstract: Glycans are basic biomolecules and perform essential functions within living organisms. The rapid increase of functional glycan data provides a good opportunity for machine learning solutions to glycan understanding. However, a standard machine learning benchmark for glycan function prediction is still lacking. In this work, we fill this gap by building a comprehensive benchmark for Glycan Machine Learning (GlycanML). The GlycanML benchmark consists of diverse types of tasks including glycan taxonomy prediction, glycan immunogenicity prediction, glycosylation type prediction, and protein-glycan interaction prediction. Glycans can be represented by both sequences and graphs in GlycanML, which enables us to extensively evaluate sequence-based models and graph neural networks (GNNs) on benchmark tasks. Furthermore, by concurrently performing eight glycan taxonomy prediction tasks, we introduce the GlycanML-MTL testbed for multi-task learning (MTL) algorithms. Experimental results show the superiority of modeling glycans with multi-relational GNNs, and suitable MTL methods can further boost model performance. We provide all datasets and source code at https://github.com/GlycanML/GlycanML and maintain a leaderboard at https://GlycanML.github.io/project
URLs: https://github.com/GlycanML/GlycanML, https://GlycanML.github.io/project
Authors: Thomas Kleine Buening, Aadirupa Saha, Christos Dimitrakakis, Haifeng Xu
Abstract: Motivated by the phenomenon of strategic agents gaming a recommender system to maximize the number of times they are recommended to users, we study a strategic variant of the linear contextual bandit problem, where the arms can strategically misreport privately observed contexts to the learner. We treat the algorithm design problem as one of mechanism design under uncertainty and propose the Optimistic Grim Trigger Mechanism (OptGTM) that incentivizes the agents (i.e., arms) to report their contexts truthfully while simultaneously minimizing regret. We also show that failing to account for the strategic nature of the agents results in linear regret. However, a trade-off between mechanism design and regret minimization appears to be unavoidable. More broadly, this work aims to provide insight into the intersection of online learning and mechanism design.
Authors: Chaolv Zeng, Zhanyu Liu, Guanjie Zheng, Linghe Kong
Abstract: Recent advancements in multivariate time series forecasting have been propelled by Linear-based, Transformer-based, and Convolution-based models, with Transformer-based architectures gaining prominence for their efficacy in temporal and cross-channel mixing. More recently, Mamba, a state space model, has emerged with robust sequence and feature mixing capabilities. However, the suitability of the vanilla Mamba design for time series forecasting remains an open question, particularly due to its inadequate handling of cross-channel dependencies. Capturing cross-channel dependencies is critical in enhancing the performance of multivariate time series prediction. Recent findings show that self-attention excels in capturing cross-channel dependencies, whereas other simpler mechanisms, such as MLP, may degrade model performance. This is counterintuitive, as MLP, being a learnable architecture, should theoretically capture both correlations and irrelevances, potentially leading to neutral or improved performance. Diving into the self-attention mechanism, we attribute the observed degradation in MLP performance to its lack of data dependence and global receptive field, which results in MLP's limited generalization ability. Based on the above insights, we introduce a refined Mamba variant tailored for time series forecasting. Our proposed model, \textbf{CMamba}, incorporates a modified Mamba (M-Mamba) module for temporal dependency modeling, a global data-dependent MLP (GDD-MLP) to effectively capture cross-channel dependencies, and a Channel Mixup mechanism to mitigate overfitting. Comprehensive experiments conducted on seven real-world datasets demonstrate the efficacy of our model in improving forecasting performance.
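As a rough illustration of the Channel Mixup idea (the exact formulation in CMamba may differ; the Beta-mixing choice and tensor shapes here are assumptions), one can convexly combine each sample's channels with a random permutation of them:

```python
import numpy as np

def channel_mixup(x, alpha=0.5, seed=0):
    """Convexly mix channels with a random channel permutation.
    x: (batch, channels, time) multivariate series. A generic sketch
    of channel-level mixup as a regularizer against overfitting."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)        # mixing coefficient in (0, 1)
    perm = rng.permutation(x.shape[1])  # shuffled channel order
    return lam * x + (1 - lam) * x[:, perm, :]
```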
Authors: Hong Xiong, Feng Wu, Leon Deng, Megan Su, Li-wei H Lehman
Abstract: In the context of medical decision making, counterfactual prediction enables clinicians to predict treatment outcomes of interest under alternative courses of therapeutic actions given observed patient history. In this work, we present G-Transformer for counterfactual outcome prediction under dynamic and time-varying treatment strategies. Our approach leverages a Transformer architecture to capture complex, long-range dependencies in time-varying covariates while enabling g-computation, a causal inference method for estimating the effects of dynamic treatment regimes. Specifically, we use a Transformer-based encoder architecture to estimate the conditional distribution of relevant covariates given covariate and treatment history at each time point, and then produce Monte Carlo estimates of counterfactual outcomes by simulating forward patient trajectories under treatment strategies of interest. We evaluate G-Transformer extensively using two simulated longitudinal datasets from mechanistic models, and a real-world sepsis ICU dataset from MIMIC-IV. G-Transformer outperforms both classical and state-of-the-art counterfactual prediction models in these settings. To the best of our knowledge, this is the first Transformer-based architecture that supports g-computation for counterfactual outcome prediction under dynamic and time-varying treatment strategies.
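To make the g-computation step concrete, here is a hedged sketch of the Monte Carlo rollout it implies; `encoder`, `outcome_model`, and `policy` are hypothetical stand-ins for the fitted components, not G-Transformer's actual interface:

```python
import numpy as np

def g_computation_rollout(encoder, outcome_model, history, policy, horizon, n_mc=100):
    """Hedged sketch of g-computation: simulate covariates forward under a
    treatment strategy and average the simulated outcomes. `encoder` models
    p(covariate_t | history); all names here are illustrative."""
    outcomes = []
    for _ in range(n_mc):
        h = list(history)
        for t in range(horizon):
            a_t = policy(h)                    # treatment under the strategy
            x_t = encoder.sample_next(h, a_t)  # draw next covariates
            h.append((x_t, a_t))
        outcomes.append(outcome_model.predict(h))
    return float(np.mean(outcomes))            # counterfactual estimate
```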
Authors: Ferdinand Rewicki, Jakob Gawlikowski, Julia Niebling, Joachim Denzler
Abstract: The detection of abnormal or critical system states is essential in condition monitoring. While much attention is given to promptly identifying anomalies, a retrospective analysis of these anomalies can significantly enhance our comprehension of the underlying causes of observed undesired behavior. This aspect becomes particularly critical when the monitored system is deployed in a vital environment. In this study, we delve into anomalies within the domain of Bio-Regenerative Life Support Systems (BLSS) for space exploration and analyze anomalies found in telemetry data stemming from the EDEN ISS space greenhouse in Antarctica. We employ time series clustering on anomaly detection results to categorize various types of anomalies in both uni- and multivariate settings. We then assess the effectiveness of these methods in identifying systematic anomalous behavior. Additionally, we illustrate that the anomaly detection methods MDI and DAMP produce complementary results, as previously indicated by research.
Authors: Dianhui Wang, Gang Dang
Abstract: Temporal data modelling techniques with neural networks are useful in many domain applications, including time-series forecasting and control engineering. This paper aims at developing a recurrent version of stochastic configuration networks (RSCNs) for problem solving, where we have no underlying assumption on the dynamic orders of the input variables. Given a collection of historical data, we first build an initial RSCN model in the light of a supervisory mechanism, followed by an online update of the output weights by using a projection algorithm. Some theoretical results are established, including the echo state property, the universal approximation property of RSCNs for both offline and online learning, and the convergence of the output weights. The proposed RSCN model is remarkably distinguished from the well-known echo state networks (ESNs) in terms of the way of assigning the input random weight matrix and a special structure of the random feedback matrix. A comprehensive comparison study among the long short-term memory (LSTM) network, the original ESN, and several state-of-the-art ESN methods such as the simple cycle reservoir (SCR), the polynomial ESN (PESN), the leaky-integrator ESN (LIESN) and RSCN is carried out. Numerical results clearly indicate that the proposed RSCN performs favourably across all of the datasets.
Authors: Ayush Kaushal, Tejas Vaidhya, Tejas Pandey, Aaryan Bhagat, Irina Rish
Abstract: Post-training quantization is the leading method for addressing memory-related bottlenecks in LLM inference, but unfortunately, it suffers from significant performance degradation below 4-bit precision. An alternative approach involves training compressed models directly at a low bitwidth (e.g., binary or ternary models). However, the performance, training dynamics, and scaling trends of such models are not yet well understood. To address this issue, we train and openly release the Spectra LLM suite consisting of 54 language models ranging from 99M to 3.9B parameters, trained on 300B tokens. Spectra includes FloatLMs, post-training quantized QuantLMs (3, 4, 6, and 8 bits), and ternary LLMs (TriLMs) - our improved architecture for ternary language modeling, which significantly outperforms previously proposed ternary models of a given size (in bits), matching half-precision models at scale. For example, TriLM 3.9B is (bit-wise) smaller than the half-precision FloatLM 830M, but matches half-precision FloatLM 3.9B in commonsense reasoning and knowledge benchmarks. However, TriLM 3.9B is also as toxic and stereotyping as FloatLM 3.9B, a model six times larger in size. Additionally, TriLM 3.9B lags behind FloatLM in perplexity on validation splits and web-based corpora but performs better on less noisy datasets like Lambada and PennTreeBank. To enhance understanding of low-bitwidth models, we are releasing 500+ intermediate checkpoints of the Spectra suite at \href{https://github.com/NolanoOrg/SpectraSuite}{https://github.com/NolanoOrg/SpectraSuite}.
URLs: https://github.com/NolanoOrg/SpectraSuite, https://github.com/NolanoOrg/SpectraSuite
Authors: Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou
Abstract: We initiate a formal investigation into the design and analysis of LLM-based algorithms, i.e. algorithms that contain one or multiple calls of large language models (LLMs) as sub-routines and critically rely on the capabilities of LLMs. While LLM-based algorithms, ranging from basic LLM calls with prompt engineering to complicated LLM-powered agent systems and compound AI systems, have achieved remarkable empirical success, their design and optimization have mostly relied on heuristics and trial and error, which is largely due to a lack of formal and analytical study for these algorithms. To fill this gap, we start by identifying the computational-graph representation of LLM-based algorithms, the design principle of task decomposition, and some key abstractions, which then facilitate our formal analysis for the accuracy and efficiency of LLM-based algorithms, despite the black-box nature of LLMs. Through extensive analytical and empirical investigation in a series of case studies, we demonstrate that the proposed framework is broadly applicable to a wide range of scenarios and diverse patterns of LLM-based algorithms, such as parallel, hierarchical and recursive task decomposition. Our proposed framework holds promise for advancing LLM-based algorithms, by revealing the reasons behind curious empirical phenomena, guiding the choices of hyperparameters, predicting the empirical performance of algorithms, and inspiring new algorithm design. To promote further study of LLM-based algorithms, we release our source code at https://github.com/modelscope/agentscope/tree/main/examples/paper_llm_based_algorithm.
URLs: https://github.com/modelscope/agentscope/tree/main/examples/paper_llm_based_algorithm.
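As a toy instance of the parallel task-decomposition pattern analyzed above, the following sketch fans sub-tasks out to concurrent LLM calls and aggregates the partial answers with one more call; `llm_call` is a hypothetical placeholder for any LLM sub-routine, not part of the released code:

```python
from concurrent.futures import ThreadPoolExecutor

def llm_call(prompt: str) -> str:
    """Placeholder for a single LLM sub-routine call (hypothetical)."""
    raise NotImplementedError

def parallel_decomposition(task: str, subtasks: list[str]) -> str:
    # Map: solve sub-tasks independently and concurrently.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(llm_call, subtasks))
    # Reduce: aggregate partial answers with one final call.
    return llm_call(f"Combine these partial answers for '{task}': {partials}")
```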
Authors: Francisco B\'erchez-Moreno, V\'ictor M. Vargas, Rafael Ayll\'on-Gavil\'an, David Guijo-Rubio, C\'esar Herv\'as-Mart\'inez, Juan C. Fern\'andez, Pedro A. Guti\'errez
Abstract: dlordinal is a new Python library that unifies many recent deep ordinal classification methodologies available in the literature. Developed with PyTorch as its underlying framework, it implements the top-performing state-of-the-art deep learning techniques for ordinal classification problems. Ordinal approaches are designed to leverage the ordering information present in the target variable. Specifically, it includes loss functions, various output layers, dropout techniques, soft labelling methodologies, and other classification strategies, all of which are appropriately designed to incorporate the ordinal information. Furthermore, as the performance metrics to assess novel proposals in ordinal classification depend on the distance between target and predicted classes in the ordinal scale, suitable ordinal evaluation metrics are also included. dlordinal is distributed under the BSD-3-Clause license and is available at https://github.com/ayrna/dlordinal.
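To illustrate one of the soft-labelling methodologies such a library covers, here is a generic distance-aware soft-label construction for ordinal targets, a common recipe in the ordinal classification literature; this is a sketch of the idea, not dlordinal's actual API:

```python
import numpy as np

def soft_ordinal_targets(y, n_classes, tau=1.0):
    """Distance-aware soft labels for ordinal classification: probability
    mass decays with distance from the true class on the ordinal scale,
    so near-misses are penalized less than far misses."""
    classes = np.arange(n_classes)
    logits = -np.abs(classes - y[:, None]) / tau  # (n, n_classes)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```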
Authors: Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi
Abstract: In-context learning (ICL) refers to a remarkable capability of pretrained large language models, which can learn a new task given a few examples during inference. However, theoretical understanding of ICL is largely under-explored, particularly whether transformers can be trained to generalize to unseen examples in a prompt, which will require the model to acquire contextual knowledge of the prompt for generalization. This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks. The contextual generalization here can be attained via learning the template function for each task in-context, where all template functions lie in a linear space with $m$ basis functions. We analyze the training dynamics of one-layer multi-head transformers trained to predict unlabeled inputs in-context given partially labeled prompts, where the labels contain Gaussian noise and the number of examples in each prompt is not sufficient to determine the template. Under mild assumptions, we show that the training loss for a one-layer multi-head transformer converges linearly to a global minimum. Moreover, the transformer effectively learns to perform ridge regression over the basis functions. To our knowledge, this study is the first provable demonstration that transformers can learn contextual (i.e., template) information to generalize to both unseen examples and tasks when prompts contain only a small number of query-answer pairs.
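The ridge-regression predictor that the trained transformer is shown to approximate can be written in a few lines; here Phi collects the $m$ basis-function features of the labeled prompt examples (variable names are ours):

```python
import numpy as np

def ridge_over_basis(Phi, y, lam=0.1):
    """Ridge regression over m basis functions: given features Phi (n x m)
    of the labeled prompt examples and noisy labels y, return coefficients
    (Phi^T Phi + lam I)^{-1} Phi^T y."""
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)
```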
Authors: Zeyuan Ma, Jiacheng Chen, Hongshu Guo, Yue-Jiao Gong
Abstract: Recent research in Meta-Black-Box Optimization (MetaBBO) has shown that meta-trained neural networks can effectively guide the design of black-box optimizers, significantly reducing the need for expert tuning and delivering robust performance across complex problem distributions. Despite their success, a paradox remains: MetaBBO still relies on human-crafted Exploratory Landscape Analysis features to inform the meta-level agent about the low-level optimization progress. To address the gap, this paper proposes Neural Exploratory Landscape Analysis (NeurELA), a novel framework that dynamically profiles landscape features through a two-stage, attention-based neural network, executed in an entirely end-to-end fashion. NeurELA is pre-trained over a variety of MetaBBO algorithms using a multi-task neuroevolution strategy. Extensive experiments show that NeurELA achieves consistently superior performance when integrated into different and even unseen MetaBBO tasks and can be efficiently fine-tuned for a further performance boost. This advancement marks a pivotal step in making MetaBBO algorithms more autonomous and broadly applicable. The source code of NeurELA can be accessed at https://anonymous.4open.science/r/Neur-ELA-303C.
Authors: Xingyou Song, Qiuyi Zhang, Chansoo Lee, Emily Fertig, Tzu-Kuo Huang, Lior Belenki, Greg Kochanski, Setareh Ariafar, Srinivas Vasudevan, Sagi Perel, Daniel Golovin
Abstract: Google Vizier has performed millions of optimizations and accelerated numerous research and production systems at Google, demonstrating the success of Bayesian optimization as a large-scale service. Over multiple years, its algorithm has been improved considerably, through the collective experiences of numerous research efforts and user feedback. In this technical report, we discuss the implementation details and design choices of the current default algorithm provided by Open Source Vizier. Our experiments on standardized benchmarks reveal its robustness and versatility against well-established industry baselines on multiple practical modes.
Authors: Tianyi Lin, Chi Jin, Michael I. Jordan
Abstract: We provide a unified analysis of two-timescale gradient descent ascent (TTGDA) for solving structured nonconvex minimax optimization problems in the form of $\min_\textbf{x} \max_{\textbf{y} \in Y} f(\textbf{x}, \textbf{y})$, where the objective function $f(\textbf{x}, \textbf{y})$ is nonconvex in $\textbf{x}$ and concave in $\textbf{y}$, and the constraint set $Y \subseteq \mathbb{R}^n$ is convex and bounded. In the convex-concave setting, the single-timescale gradient descent ascent (GDA) algorithm is widely used in applications and has been shown to have strong convergence guarantees. In more general settings, however, it can fail to converge. Our contribution is to design TTGDA algorithms that are effective beyond the convex-concave setting, efficiently finding a stationary point of the function $\Phi(\cdot) := \max_{\textbf{y} \in Y} f(\cdot, \textbf{y})$. We also establish theoretical bounds on the complexity of solving both smooth and nonsmooth nonconvex-concave minimax optimization problems. To the best of our knowledge, this is the first systematic analysis of TTGDA for nonconvex minimax optimization, shedding light on its superior performance in training generative adversarial networks (GANs) and in other real-world application problems.
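A minimal sketch of the two-timescale scheme, with a fast projected ascent step on $\textbf{y}$ and a slow descent step on $\textbf{x}$; step sizes and the projection are illustrative, and the paper's algorithms and analysis are more refined:

```python
def ttgda(grad_x, grad_y, project_Y, x, y, eta_x=1e-3, eta_y=1e-1, iters=1000):
    """Two-timescale GDA: the ascent step size eta_y is taken much larger
    than the descent step size eta_x, which is what distinguishes TTGDA
    from single-timescale GDA (a generic sketch of the scheme)."""
    for _ in range(iters):
        x = x - eta_x * grad_x(x, y)             # slow descent on x
        y = project_Y(y + eta_y * grad_y(x, y))  # fast projected ascent on y
    return x, y
```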
Authors: Qing Li, Runze Gan, Simon Godsill
Abstract: This paper tackles the challenge of multi-sensor multi-object tracking by proposing various decentralised Variational Inference (VI) schemes that match the tracking performance of centralised sensor fusion with only local message exchanges among neighboring sensors. We first establish a centralised VI sensor fusion scheme as a benchmark and analyse the limitations of its decentralised counterpart, which requires sensors to await consensus at each VI iteration. Therefore, we propose a decentralised gradient-based VI framework that optimises the Locally Maximised Evidence Lower Bound (LM-ELBO) instead of the standard ELBO, which reduces the parameter search space and enables faster convergence, making it particularly beneficial for decentralised tracking. The proposed framework is inherently self-evolving, improving with advancements in decentralised optimisation techniques for convergence guarantees and efficiency. Further, we enhance the convergence speed of proposed decentralised schemes using natural gradients and gradient tracking strategies. Results verify that our decentralised VI schemes are empirically equivalent to centralised fusion in tracking performance. Notably, the decentralised natural gradient VI method is the most communication-efficient, with communication costs comparable to suboptimal decentralised strategies while delivering notably higher tracking accuracy.
Authors: Yuchang Sun, Yuexiang Xie, Bolin Ding, Yaliang Li, Jun Zhang
Abstract: Federated learning (FL) has emerged as a promising paradigm for fine-tuning foundation models using distributed data in a privacy-preserving manner. Under limited computational resources, clients often find it more practical to fine-tune a selected subset of layers, rather than the entire model, based on their task-specific data. In this study, we provide a thorough theoretical exploration of selective layer fine-tuning in FL, emphasizing a flexible approach that allows the clients to adjust their selected layers according to their local data and resources. We theoretically demonstrate that the layer selection strategy has a significant impact on model convergence in two critical aspects: the importance of selected layers and the heterogeneous choices across clients. Drawing from these insights, we further propose a strategic layer selection method that utilizes local gradients and regulates layer selections across clients. The extensive experiments on both image and text datasets demonstrate the effectiveness of the proposed strategy compared with several baselines, highlighting its advances in identifying critical layers that adapt to the client heterogeneity and training dynamics in FL.
Authors: Alexandru Vasilache, Jann Krausse, Klaus Knobloch, Juergen Becker
Abstract: Intra-cortical brain-machine interfaces (iBMIs) have the potential to dramatically improve the lives of people with paraplegia by restoring their ability to perform daily activities. However, current iBMIs suffer from scalability and mobility limitations due to bulky hardware and wiring. Wireless iBMIs offer a solution but are constrained by a limited data rate. To overcome this challenge, we are investigating hybrid spiking neural networks for embedded neural decoding in wireless iBMIs. The networks consist of a temporal convolution-based compression followed by recurrent processing and a final interpolation back to the original sequence length. As recurrent units, we explore gated recurrent units (GRUs), leaky integrate-and-fire (LIF) neurons, and a combination of both, spiking GRUs (sGRUs), and analyze their differences in terms of accuracy, footprint, and activation sparsity. To that end, we train decoders on the "Nonhuman Primate Reaching with Multichannel Sensorimotor Cortex Electrophysiology" dataset and evaluate them using the NeuroBench framework, targeting both tracks of the IEEE BioCAS Grand Challenge on Neural Decoding. Our approach achieves high accuracy in predicting velocities of primate reaching movements from multichannel primary motor cortex recordings while maintaining a low number of synaptic operations, surpassing the current baseline models in the NeuroBench framework. This work highlights the potential of hybrid neural networks to facilitate wireless iBMIs with high decoding precision and a substantial increase in the number of monitored neurons, paving the way toward more advanced neuroprosthetic technologies.
Authors: Mattias Cross, Anton Ragni
Abstract: Diffusion Models (DMs) iteratively denoise random samples to produce high-quality data. The iterative sampling process is derived from Stochastic Differential Equations (SDEs), allowing a speed-quality trade-off chosen at inference. Another advantage of sampling with differential equations is exact likelihood computation. These likelihoods have been used to rank unconditional DMs and for out-of-domain classification. Despite the many existing and possible uses of DM likelihoods, the distinct properties captured are unknown, especially in conditional contexts such as Text-To-Image (TTI) or Text-To-Speech synthesis (TTS). Surprisingly, we find that TTS DM likelihoods are agnostic to the text input. TTI likelihood is more expressive but cannot discern confounding prompts. Our results show that applying DMs to conditional tasks reveals inconsistencies, and strengthen the claim that the properties of DM likelihoods remain poorly understood. Although conditional DMs maximise likelihood, the likelihood in question is not as sensitive to the conditioning input as one would expect. This investigation provides a new point of view on diffusion likelihoods.
Authors: Petr Philonenko, Sergey Postovalov
Abstract: The focus of this study is to evaluate the effectiveness of Machine Learning (ML) methods for two-sample testing with right-censored observations. To achieve this, we develop several ML-based methods with varying architectures and implement them as two-sample tests. Each method is an ensemble (stacking) that combines predictions from classical two-sample tests. This paper presents the results of training the proposed ML methods, examines their statistical power compared to classical two-sample tests, analyzes the null distribution of the proposed methods when the null hypothesis is true, and evaluates the significance of the features incorporated into the proposed methods. In total, this work covers 18 methods for two-sample testing under right-censored observations, including the proposed methods and classical well-studied two-sample tests. All results from numerical experiments were obtained from a synthetic dataset generated using the inverse transform sampling method and replicated multiple times through Monte Carlo simulation. To test the two-sample problem with right-censored observations, one can use the proposed two-sample methods (scripts, dataset, and models are available on GitHub and Hugging Face).
Authors: Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, Xin Jin
Abstract: Reinforcement Learning from Human Feedback (RLHF) enhances the alignment between LLMs and human preference. The workflow of RLHF typically involves several models and tasks in a series of distinct stages. Existing RLHF training systems view each task as the smallest execution unit thus overlooking the opportunities for subtask-level optimizations. Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage, and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization in production deployments. RLHFuse breaks the traditional view of RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks, and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to mitigate the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into subtasks of micro-batches. By leveraging the intuition that pipeline execution can be essentially complemented by another pipeline, RLHFuse performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, resulting in fewer pipeline bubbles. In addition, RLHFuse incorporates a series of system optimizations tailored for each stage of RLHF, making it efficient and scalable for our internal product usage. We evaluate RLHFuse on various popular LLMs and the results show that RLHFuse increases the training throughput by up to 3.7x, compared to existing state-of-the-art systems.
Authors: Gleb Bazhenov, Oleg Platonov, Liudmila Prokhorenkova
Abstract: Tabular machine learning is an important field for industry and science. In this field, table rows are usually treated as independent data samples, but additional information about relations between them is sometimes available and can be used to improve predictive performance. Such information can be naturally modeled with a graph, thus tabular machine learning may benefit from graph machine learning methods. However, graph machine learning models are typically evaluated on datasets with homogeneous node features, which have little in common with heterogeneous mixtures of numerical and categorical features present in tabular datasets. Thus, there is a critical difference between the data used in tabular and graph machine learning studies, which does not allow one to understand how successfully graph models can be transferred to tabular data. To bridge this gap, we propose a new benchmark of diverse graphs with heterogeneous tabular node features and realistic prediction tasks. We use this benchmark to evaluate a vast set of models, including simple methods previously overlooked in the literature. Our experiments show that graph neural networks (GNNs) can indeed often bring gains in predictive performance for tabular data, but standard tabular models also can be adapted to work with graph data by using simple feature preprocessing, which sometimes enables them to compete with and even outperform GNNs. Based on our empirical study, we provide insights for researchers and practitioners in both tabular and graph machine learning fields.
Authors: Stefan Haufe, Rick Wilming, Benedict Clark, Rustam Zhumagambetov, Danny Panknin, Ahc\`ene Boubekki
Abstract: The use of machine learning (ML) in critical domains such as medicine poses risks and requires regulation. One requirement is that decisions of ML systems in high-risk applications should be human-understandable. The field of "explainable artificial intelligence" (XAI) seemingly addresses this need. However, in its current form, XAI is unfit to provide quality control for ML; it itself needs scrutiny. Popular XAI methods cannot reliably answer important questions about ML models, their training data, or a given test input. We recapitulate results demonstrating that popular XAI methods systematically attribute importance to input features that are independent of the prediction target. This limits their utility for purposes such as model and data (in)validation, model improvement, and scientific discovery. We argue that the fundamental reason for this limitation is that current XAI methods do not address well-defined problems and are not evaluated against objective criteria of explanation correctness. Researchers should formally define the problems they intend to solve first and then design methods accordingly. This will lead to notions of explanation correctness that can be theoretically verified and objective metrics of explanation performance that can be assessed using ground-truth data.
Authors: Alessio Mascolini, Sebastiano Gaiardelli, Francesco Ponzio, Nicola Dall'Ora, Enrico Macii, Sara Vinco, Santa Di Cataldo, Franco Fummi
Abstract: Detecting complex anomalies on massive amounts of data is a crucial task in Industry 4.0, best addressed by deep learning. However, available solutions are computationally demanding, requiring cloud architectures prone to latency and bandwidth issues. This work presents VARADE, a novel solution implementing a light autoregressive framework based on variational inference, which is best suited for real-time execution on the edge. The proposed approach was validated on a robotic arm, part of a pilot production line, and compared with several state-of-the-art algorithms, obtaining the best trade-off between anomaly detection accuracy, power consumption and inference frequency on two different edge platforms.
Authors: Hong-fu Chou, Vu Nguyen Ha, Prabhu Thiruvasagam, Thanh-Dung Le, Geoffrey Eappen, Ti Ti Nguyen, Luis M. Garces-Socarras, Jorge L. Gonzalez-Rios, Juan Carlos Merlano-Duncan, Symeon Chatzinotas
Abstract: Earth Observation (EO) systems play a crucial role in achieving Sustainable Development Goals by collecting and analyzing vital global data through satellite networks. These systems are essential for tasks like mapping, disaster monitoring, and resource management, but they face challenges in processing and transmitting large volumes of EO data, especially in specialized fields such as agriculture and real-time disaster response. Domain-adapted Large Language Models (LLMs) provide a promising solution by facilitating data fusion between extensive EO data and semantic EO data. By improving integration and interpretation of diverse datasets, LLMs address the challenges of processing specialized information in agriculture and disaster response applications. This fusion enhances the accuracy and relevance of transmitted data. This paper presents a framework for semantic communication in EO satellite networks, aimed at improving data transmission efficiency and overall system performance through cognitive processing techniques. The proposed system employs Discrete Task-Oriented Joint Source-Channel Coding (DT-JSCC) and Semantic Data Augmentation (SA) to focus on relevant information while minimizing communication overhead. By integrating cognitive semantic processing and inter-satellite links, the framework enhances the analysis and transmission of multispectral satellite imagery, improving object detection, pattern recognition, and real-time decision-making. The introduction of Cognitive Semantic Augmentation (CSA) allows satellites to process and transmit semantic information, boosting adaptability to changing environments and application needs. This end-to-end architecture is tailored for next-generation satellite networks, such as those supporting 6G, and demonstrates significant improvements in efficiency and accuracy.
Authors: Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Guha, E. Kelly Buchanan, Mayee Chen, Neel Guha, Christopher R\'e, Azalia Mirhoseini
Abstract: Inference-time techniques are emerging as highly effective tools to increase large language model (LLM) capabilities. However, there is still limited understanding of the best practices for developing systems that combine inference-time techniques with one or more LLMs, with challenges including: (1) effectively allocating inference compute budget, (2) understanding the interactions between different combinations of inference-time techniques and their impact on downstream performance, and (3) efficiently searching over the large space of model choices, inference-time techniques, and their compositions. To address these challenges, we introduce Archon, an automated framework for designing inference-time architectures. Archon defines an extensible design space, encompassing methods such as generation ensembling, multi-sampling, ranking, fusion, critiquing, verification, and unit testing. It then transforms the problem of selecting and combining LLMs and inference-time techniques into a hyperparameter optimization objective. To optimize this objective, we introduce automated Inference-Time Architecture Search (ITAS) algorithms. Given target benchmark(s), an inference compute budget, and available LLMs, ITAS outputs optimized architectures. We evaluate Archon architectures across a wide range of instruction-following and reasoning benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. We show that inference-time architectures automatically designed by Archon outperform strong models such as GPT-4o and Claude 3.5 Sonnet on these benchmarks, achieving an average increase of 15.1 and 11.2 percentage points with all-source models and open-source models, respectively. We make our code and datasets available publicly on Github: https://github.com/ScalingIntelligence/Archon.
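As a toy stand-in for what an inference-time architecture search loop optimizes, the sketch below randomly samples layered combinations of models and techniques under a call budget; `evaluate` is a hypothetical benchmark-scoring function, and Archon's actual ITAS algorithms are more structured than this:

```python
import random

def architecture_random_search(evaluate, models, techniques, budget=50, seed=0):
    """Sample layered (model, technique) combinations and keep the best
    under a fixed evaluation budget. Purely illustrative of the search
    problem, not Archon's released ITAS implementation."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(budget):
        arch = [(rng.choice(models), rng.choice(techniques))
                for _ in range(rng.randint(1, 3))]  # 1-3 layers
        score = evaluate(arch)
        if score > best_score:
            best, best_score = arch, score
    return best, best_score
```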
Authors: Rahul Khorana, Marcus Noack, Jin Qian
Abstract: Developing robust representations of chemical structures that enable models to learn topological inductive biases is challenging. In this manuscript, we present a representation of atomistic systems. We begin by proving that our representation satisfies all structural, geometric, efficiency, and generalizability constraints. Afterward, we provide a general algorithm to encode any atomistic system. Finally, we report performance comparable to state-of-the-art methods on numerous tasks. We open-source all code and datasets. The code and data are available at https://github.com/rahulkhorana/PolyatomicComplexes.
Authors: Shadi Iskander, Nachshon Cohen, Zohar Karnin, Ori Shapira, Sofia Tolmach
Abstract: Training large language models (LLMs) for external tool usage is a rapidly expanding field, with recent research focusing on generating synthetic data to address the shortage of available data. However, the absence of systematic data quality checks poses complications for properly training and testing models. To that end, we propose two approaches for assessing the reliability of data for training LLMs to use external tools. The first approach uses intuitive, human-defined correctness criteria. The second approach uses a model-driven assessment with in-context evaluation. We conduct a thorough evaluation of data quality on two popular benchmarks, followed by an extrinsic evaluation that showcases the impact of data quality on model performance. Our results demonstrate that models trained on high-quality data outperform those trained on unvalidated data, even when trained with a smaller quantity of data. These findings empirically support the significance of assessing and ensuring the reliability of training data for tool-using LLMs.
Authors: Yuanyong Luo, Zhongxing Zhang, Richard Wu, Hu Liu, Ying Jin, Kai Zheng, Minmin Wang, Zhanying He, Guipeng Hu, Luyao Chen, Tianchi Hu, Junsong Wang, Minqi Chen, Mikhaylov Dmitry, Korviakov Vladimir, Bobrin Maxim, Yuhao Hu, Guanfu Chen, Zeyi Huang
Abstract: This preliminary white paper proposes a novel 8-bit floating-point data format HiFloat8 (abbreviated as HiF8) for deep learning. HiF8 features tapered precision. For normal value encoding, it provides 7 exponent values with 3-bit mantissa, 8 exponent values with 2-bit mantissa, and 16 exponent values with 1-bit mantissa. For denormal value encoding, it extends the dynamic range by 7 extra powers of 2, from 31 to 38 binades (notice that FP16 covers 40 binades). Meanwhile, HiF8 encodes all the special values except that positive zero and negative zero are represented by only one bit-pattern. Thanks to the better balance between precision and dynamic range, HiF8 can be simultaneously used in both forward and backward passes of AI training. In this paper, we will describe the definition and rounding methods of HiF8, as well as the tentative training and inference solutions. To demonstrate the efficacy of HiF8, massive simulation results on various neural networks, including traditional neural networks and large language models (LLMs), will also be presented.
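As a quick sanity check of the tapered-precision encoding budget described above (this is our arithmetic reading of the description; the authoritative bit-pattern layout is defined in the paper):

```python
# Normal magnitudes under the stated taper:
# 7 exponents x 3-bit mantissa, 8 x 2-bit, 16 x 1-bit.
normal_magnitudes = 7 * 2**3 + 8 * 2**2 + 16 * 2**1  # 56 + 32 + 32 = 120
signed_normals = 2 * normal_magnitudes                # 240 of the 256 patterns
remaining = 2**8 - signed_normals                     # 16 patterns left over for
print(signed_normals, remaining)                      # the single zero, denormals,
                                                      # and special values
```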
Authors: Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang
Abstract: As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats like INT4, etc. Experimental results show INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared to standard FlashAttention with FP16 and FP8 data formats.
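For context, token-level post-training quantization of the kind INT-FlashAttention builds on can be sketched as symmetric per-token INT8 rounding (a generic illustration, not the released kernels):

```python
import numpy as np

def quantize_per_token_int8(x):
    """Symmetric per-token INT8 quantization. x: (tokens, hidden) activations.
    Each token row gets its own scale, so an outlier in one token does not
    degrade the precision of the others."""
    scale = np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale  # dequantize via q.astype(np.float32) * scale
```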
Authors: Jett Janiak, Jacek Karwowski, Chatrik Singh Mangat, Giorgi Giglemiani, Nora Petrova, Stefan Heimersheim
Abstract: We identify "stable regions" in the residual stream of Transformers, where the model's output remains insensitive to small activation changes, but exhibits high sensitivity at region boundaries. These regions emerge during training and become more defined as training progresses or model size increases. The regions appear to be much larger than previously studied polytopes. Our analysis suggests that these stable regions align with semantic distinctions, where similar prompts cluster within regions, and activations from the same region lead to similar next token predictions. This work provides a promising research direction for understanding the complexity of neural networks, shedding light on training dynamics, and advancing interpretability.
Authors: Xin Ma, Guorong Wu, Seong Jae Hwang, Won Hwa Kim
Abstract: A substantial body of recent literature shows that associations between different brain regions, i.e., brain connectivity, provide early symptoms of neurological disorders. Despite significant efforts made for graph neural network (GNN) techniques, their focus on graph nodes makes the state-of-the-art GNN methods unsuitable for classifying brain connectivity as graphs where the objective is to characterize disease-relevant network dysfunction patterns on graph links. To address this issue, we propose Multi-resolution Edge Network (MENET) to detect disease-specific connectomic benchmarks with high discrimination power across diagnostic categories. The core of MENET is a novel graph edge-wise transform that we propose, which allows us to capture multi-resolution ``connectomic'' features. Using a rich set of the connectomic features, we devise a graph learning framework to jointly select discriminative edges and assign diagnostic labels for graphs. Experiments on two real datasets show that MENET accurately predicts diagnostic labels and identifies brain connectivity patterns highly associated with neurological disorders such as Alzheimer's Disease and Attention-Deficit/Hyperactivity Disorder.
Authors: Boris Shustin, Haim Avron
Abstract: Recent literature has advocated the use of randomized methods for accelerating the solution of various matrix problems arising throughout data science and computational science. One popular strategy for leveraging randomization is to use it as a way to reduce problem size. However, methods based on this strategy lack sufficient accuracy for some applications. Randomized preconditioning is another approach for leveraging randomization, which provides higher accuracy. The main challenge in using randomized preconditioning is the need for an underlying iterative method, thus randomized preconditioning has so far been applied almost exclusively to solving regression problems and linear systems. In this article, we show how to expand the application of randomized preconditioning to another important set of problems prevalent across data science: optimization problems with (generalized) orthogonality constraints. We demonstrate our approach, which is based on the framework of Riemannian optimization and Riemannian preconditioning, on the problem of computing the dominant canonical correlations and on the Fisher linear discriminant analysis problem. For both problems, we evaluate the effect of preconditioning on the computational costs and asymptotic convergence, and demonstrate empirically the utility of our approach.
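For readers unfamiliar with randomized preconditioning in its established setting, here is a minimal sketch-and-precondition least-squares solver (Gaussian sketch plus QR); the article's contribution extends this idea to Riemannian optimization with orthogonality constraints, which is not shown here:

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def sketch_precondition_lstsq(A, b, sketch_rows=None, seed=0):
    """Solve min ||Ax - b|| via randomized preconditioning: sketch A,
    take R from the QR of the sketch, run LSQR on the well-conditioned
    system (A R^{-1}) z = b, then recover x = R^{-1} z."""
    n, d = A.shape
    s = sketch_rows or 4 * d
    S = np.random.default_rng(seed).standard_normal((s, n)) / np.sqrt(s)
    _, R = np.linalg.qr(S @ A)   # preconditioner factor from the sketch
    R_inv = np.linalg.inv(R)
    z = lsqr(A @ R_inv, b)[0]    # iterative solve on preconditioned system
    return R_inv @ z
```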
Authors: Chengshuo Shen, Wei Zheng, Yonghua Ding, Xinkun Ai, Fengming Xue, Yu Zhong, Nengchao Wang, Li Gao, Zhipeng Chen, Zhoujun Yang, Zhongyong Chen, Yuan Pan, J-TEXT team
Abstract: Disruption prediction has made rapid progress in recent years, especially in machine learning (ML)-based methods. Understanding why a predictor makes a certain prediction can be as crucial as the prediction's accuracy for future tokamak disruption predictors. The purpose of most disruption predictors is accuracy or cross-machine capability. However, if a disruption prediction model can be interpreted, it can tell why certain samples are classified as disruption precursors. This allows us to tell the types of incoming disruption and gives us insight into the mechanism of disruption. This paper designs a disruption predictor called Interpretable Disruption Predictor based On Physics-guided feature extraction (IDP-PGFE) on J-TEXT. The prediction performance of the model is effectively improved by extracting physics-guided features. A high-performance model is required to ensure the validity of the interpretation results. The interpretability study of IDP-PGFE provides an understanding of J-TEXT disruption and is generally consistent with existing comprehension of disruption. IDP-PGFE has been applied to the disruption due to continuously increasing density towards density limit experiments on J-TEXT. The time evolution of the PGFE feature contributions demonstrates that the application of ECRH triggers radiation-caused disruption, which lowers the density at disruption, while the application of RMP indeed raises the density limit in J-TEXT. The interpretability study suggests a physical mechanism for density limit disruption: RMPs affect not only the MHD instabilities but also the radiation profile, which delays the density limit disruption.
Authors: Jun Lu
Abstract: The sole aim of this book is to give a self-contained introduction to concepts and mathematical tools in Bayesian matrix decomposition in order to seamlessly introduce matrix decomposition techniques and their applications in subsequent sections. However, we cannot cover all the useful and interesting results concerning Bayesian matrix decomposition and, given the limited scope of this discussion, we omit, for example, a separate analysis of variational inference for carrying out the optimization. We refer the reader to literature in the field of Bayesian analysis for a more detailed introduction to the related fields. This book is primarily a summary of the purpose and significance of important Bayesian matrix decomposition methods, e.g., real-valued decomposition, nonnegative matrix factorization, and Bayesian interpolative decomposition, and of the origin and complexity of the methods, which sheds light on their applications. The mathematical prerequisite is a first course in statistics and linear algebra. Other than this modest background, the development is self-contained, with rigorous proofs provided throughout.
Authors: Thijs van de Laar, Magnus Koudahl, Bert de Vries
Abstract: The Free Energy Principle (FEP) describes (biological) agents as minimising a variational Free Energy (FE) with respect to a generative model of their environment. Active Inference (AIF) is a corollary of the FEP that describes how agents explore and exploit their environment by minimising an expected FE objective. In two related papers, we describe a scalable, epistemic approach to synthetic AIF, by message passing on free-form Forney-style Factor Graphs (FFGs). A companion paper (part I) introduces a Constrained FFG (CFFG) notation that visually represents (generalised) FE objectives for AIF. The current paper (part II) derives message passing algorithms that minimise (generalised) FE objectives on a CFFG by variational calculus. A comparison between simulated Bethe and generalised FE agents illustrates how the message passing approach to synthetic AIF induces epistemic behaviour on a T-maze navigation task. Extension of the T-maze simulation to 1) learning goal statistics, and 2) a multi-agent bargaining setting, illustrate how this approach encourages reuse of nodes and updates in alternative settings. With a full message passing account of synthetic AIF agents, it becomes possible to derive and reuse message updates across models and move closer to industrial applications of synthetic AIF.
Authors: Pratik Nag, Ying Sun, Brian J Reich
Abstract: High spatial resolution wind data are essential for a wide range of applications in climate, oceanographic and meteorological studies. Large-scale spatial interpolation or downscaling of bivariate wind fields having velocity in two dimensions is a challenging task because wind data tend to be non-Gaussian with high spatial variability and heterogeneity. In spatial statistics, cokriging is commonly used for predicting bivariate spatial fields. However, the cokriging predictor is not optimal except for Gaussian processes. Additionally, cokriging is computationally prohibitive for large datasets. In this paper, we propose a method, called bivariate DeepKriging, which is a spatially dependent deep neural network (DNN) with an embedding layer constructed by spatial radial basis functions for bivariate spatial data prediction. We then develop a distribution-free uncertainty quantification method based on bootstrap and ensemble DNN. Our proposed approach outperforms the traditional cokriging predictor with commonly used covariance functions, such as the linear model of co-regionalization and the flexible bivariate Mat\'ern covariance. We demonstrate the computational efficiency and scalability of the proposed DNN model, with computations that are, on average, 20 times faster than those of conventional techniques. We apply the bivariate DeepKriging method to the wind data over the Middle East region at 506,771 locations. The prediction performance of the proposed method is superior to that of the cokriging predictors, and it dramatically reduces computation time.
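The spatial-basis embedding at the heart of such DeepKriging-style architectures can be sketched as a fixed radial-basis feature layer; knot placement and the bandwidth choice below are assumptions for illustration:

```python
import numpy as np

def rbf_embedding(coords, centers, bandwidth=1.0):
    """Radial-basis-function embedding of spatial coordinates, the kind of
    fixed spatial feature layer a DeepKriging-style DNN is built on.
    coords: (n, 2) locations; centers: (k, 2) knot locations (assumed)."""
    d2 = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth**2))  # (n, k) basis features
```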
Authors: Ofir Gordon, Elad Cohen, Hai Victor Habi, Arnon Netzer
Abstract: Quantization is a key method for deploying deep neural networks on edge devices with limited memory and computation resources. Recent improvements in Post-Training Quantization (PTQ) methods were achieved by an additional local optimization process for learning the weight quantization rounding policy. However, a gap exists when employing network-wise optimization with small representative datasets. In this paper, we propose a new method for enhanced PTQ (EPTQ) that employs a network-wise quantization optimization process, which benefits from considering cross-layer dependencies during optimization. EPTQ enables network-wise optimization with a small representative dataset using a novel sample-layer attention score based on a label-free Hessian matrix upper bound. The label-free approach makes our method suitable for the PTQ scheme. We provide a theoretical analysis of this bound and use it to construct a knowledge distillation loss that guides the optimization to focus on the more sensitive layers and samples. In addition, we leverage the Hessian upper bound to improve the selection of weight quantization parameters by focusing on the more sensitive elements in the weight tensors. Empirically, by employing EPTQ we achieve state-of-the-art results on various models, tasks, and datasets, including ImageNet classification, COCO object detection, and Pascal-VOC for semantic segmentation.
Authors: Andrew Nam, Eric Elmoznino, Nikolay Malkin, James McClelland, Yoshua Bengio, Guillaume Lajoie
Abstract: Symbolic systems are powerful frameworks for modeling cognitive processes as they encapsulate the rules and relationships fundamental to many aspects of human reasoning and behavior. Central to these models are systematicity, compositionality, and productivity, making them invaluable in both cognitive science and artificial intelligence. However, certain limitations remain. For instance, the integration of structured symbolic processes and latent sub-symbolic processes has been implemented at the computational level through fiat methods such as quantization or softmax sampling, which assume, rather than derive, the operations underpinning discretization and symbolicization. In this work, we introduce a novel neural stochastic dynamical systems model that integrates attractor dynamics with symbolic representations to model cognitive processes akin to the probabilistic language of thought (PLoT). Our model segments the continuous representational space into discrete basins whose attractor states correspond to symbolic sequences, and it learns, through unsupervised learning rather than pre-defined primitives, representations that reflect the semanticity and compositionality characteristic of symbolic systems. Moreover, like PLoT, our model learns to sample a diverse distribution of attractor states that reflect the mutual information between the input data and the symbolic encodings. This approach establishes a unified framework that integrates both symbolic and sub-symbolic processing through neural dynamics, a neuro-plausible substrate with proven expressivity in AI, offering a more comprehensive model that mirrors the complex duality of cognitive operations.
Authors: Xihuai Wang, Shao Zhang, Wenhao Zhang, Wentao Dong, Jingxiao Chen, Ying Wen, Weinan Zhang
Abstract: Zero-shot coordination (ZSC) is a new cooperative multi-agent reinforcement learning (MARL) challenge that aims to train an ego agent to work with diverse, unseen partners during deployment. The significant difference between the deployment-time partners' distribution and the training partners' distribution determined by the training algorithm makes ZSC a unique out-of-distribution (OOD) generalization challenge. The potential distribution gap between evaluation and deployment-time partners leads to inadequate evaluation, which is exacerbated by the lack of appropriate evaluation metrics. In this paper, we present ZSC-Eval, the first evaluation toolkit and benchmark for ZSC algorithms. ZSC-Eval consists of: 1) Generation of evaluation partner candidates through behavior-preferring rewards to approximate deployment-time partners' distribution; 2) Selection of evaluation partners by Best-Response Diversity (BR-Div); 3) Measurement of generalization performance with various evaluation partners via the Best-Response Proximity (BR-Prox) metric. We use ZSC-Eval to benchmark ZSC algorithms in Overcooked and Google Research Football environments and obtain novel empirical findings. We also conduct a human experiment with current ZSC algorithms to verify ZSC-Eval's consistency with human evaluation. ZSC-Eval is now available at https://github.com/sjtu-marl/ZSC-Eval.
Authors: Francesco Montagna, Atalanti A. Mastakouri, Elias Eulig, Nicoletta Noceti, Lorenzo Rosasco, Dominik Janzing, Bryon Aragam, Francesco Locatello
Abstract: When domain knowledge is limited and experimentation is restricted by ethical, financial, or time constraints, practitioners turn to observational causal discovery methods to recover the causal structure, exploiting the statistical properties of their data. Because causal discovery without further assumptions is an ill-posed problem, each algorithm comes with its own set of usually untestable assumptions, some of which are hard to meet in real datasets. Motivated by these considerations, this paper extensively benchmarks the empirical performance of recent causal discovery methods on observational i.i.d. data generated under different background conditions, allowing for violations of the critical assumptions required by each selected approach. Our experimental findings show that score matching-based methods demonstrate surprising performance in the false positive and false negative rate of the inferred graph in these challenging scenarios, and we provide theoretical insights into their performance. This work is also the first effort to benchmark the stability of causal discovery algorithms with respect to the values of their hyperparameters. Finally, we hope this paper will set a new standard for the evaluation of causal discovery methods and can serve as an accessible entry point for practitioners interested in the field, highlighting the empirical implications of different algorithm choices.
Authors: Marcel Wagenl\"ander, Guo Li, Bo Zhao, Luo Mai, Peter Pietzuch
Abstract: Deep learning (DL) jobs use multi-dimensional parallelism, i.e., combining data, model, and pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may experience changes to their GPU allocation: (i) resource elasticity during training adds or removes GPUs; (ii) hardware maintenance may require redeployment on different GPUs; and (iii) GPU failures force jobs to run with fewer devices. Current DL frameworks tie jobs to a set of GPUs and thus lack support for these scenarios. In particular, they cannot change the multi-dimensional parallelism of an already-running job in an efficient and model-independent way. We describe Scalai, a state management library for DL systems that enables jobs to change their parallelism dynamically after the GPU allocation is updated at runtime. Scalai achieves this through a new abstraction, a parallelizable tensor collection (PTC), that externalizes the job state during training. After a GPU change, Scalai uses the PTC to transform the job state: the PTC repartitions the dataset state under data parallelism and exposes it to DL workers through a virtual file system; and the PTC obtains the model state as partitioned checkpoints and transforms them to reflect the new parallelization configuration. For efficiency, Scalai executes PTC transformations in parallel with minimal data movement between workers. Our experiments show that Scalai enables DL jobs to support dynamic parallelization with low overhead.
Authors: Kenta Tsukahara, Kanji Tanaka
Abstract: A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available for the target workspace. However, this is not necessarily true when a robot travels around the general open world. This work introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot (``student") can ask the other robots it meets at unfamiliar places (``teachers") for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and then used for continual learning of the student model under a domain, class, and vocabulary incremental setup. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions about the teacher model, so that it can handle various types of open-set teachers, including teachers that are uncooperative, untrainable (e.g., image retrieval engines), or black-box (e.g., for data privacy reasons). In this paper, we investigate a ranking function as an instance of such generic models, using a challenging data-free recursive distillation scenario, where a student, once trained, can recursively join the next-generation open teacher set.
Authors: Xuewen Liu, Zhikai Li, Junrui Xiao, Qingyi Gu
Abstract: Diffusion models have achieved great success in image generation tasks through iterative noise estimation. However, the heavy denoising process and complex neural networks hinder their low-latency applications in real-world scenarios. Quantization can effectively reduce model complexity, and post-training quantization (PTQ), which does not require fine-tuning, is highly promising for compressing and accelerating diffusion models. Unfortunately, we find that due to the highly dynamic distribution of activations in different denoising steps, existing PTQ methods for diffusion models suffer from distribution mismatch issues at both the calibration sample level and the reconstruction output level, which makes the performance far from satisfactory, especially in low-bit cases. In this paper, we propose Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models (EDA-DM) to address the above issues. Specifically, at the calibration sample level, we select calibration samples based on their density and variety in the latent space, thus facilitating the alignment of their distribution with that of the overall samples; and at the reconstruction output level, we augment the block reconstruction loss with layer-wise losses, aligning the outputs of the quantized and full-precision models at different network granularities. Extensive experiments demonstrate that EDA-DM significantly outperforms the existing PTQ methods across various models (DDIM, LDM-4, LDM-8, Stable-Diffusion) and different datasets (CIFAR-10, LSUN-Bedroom, LSUN-Church, ImageNet, MS-COCO).
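The output-level idea can be sketched as a block reconstruction loss augmented with per-layer alignment terms; the weighting and names below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def aligned_reconstruction_loss(q_block_out, fp_block_out,
                                q_layer_outs, fp_layer_outs, layer_weight=0.1):
    # Match the block's final output between the quantized (q) and
    # full-precision (fp) models, and additionally match intermediate layer
    # outputs so the two models agree at several network granularities.
    loss = F.mse_loss(q_block_out, fp_block_out)
    for q_l, fp_l in zip(q_layer_outs, fp_layer_outs):
        loss = loss + layer_weight * F.mse_loss(q_l, fp_l)
    return loss

# Toy tensors standing in for activations captured with forward hooks.
q_out, fp_out = torch.randn(4, 64), torch.randn(4, 64)
q_layers = [torch.randn(4, 64) for _ in range(3)]
fp_layers = [torch.randn(4, 64) for _ in range(3)]
print(aligned_reconstruction_loss(q_out, fp_out, q_layers, fp_layers))
```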
Authors: Marcos V. Conde, Gregor Geigle, Radu Timofte
Abstract: Image restoration is a fundamental problem that involves recovering a high-quality clean image from its degraded observation. All-In-One image restoration models can effectively restore images from various types and levels of degradation using degradation-specific information as prompts to guide the restoration model. In this work, we present the first approach that uses human-written instructions to guide the image restoration model. Given natural language prompts, our model can recover high-quality images from their degraded counterparts, considering multiple degradation types. Our method, InstructIR, achieves state-of-the-art results on several restoration tasks including image denoising, deraining, deblurring, dehazing, and (low-light) image enhancement. InstructIR improves +1dB over previous all-in-one restoration methods. Moreover, our dataset and results represent a novel benchmark for new research on text-guided image restoration and enhancement. Our code, datasets and models are available at: https://github.com/mv-lab/InstructIR
Authors: Hengyuan Ma, Wenlian Lu, Jianfeng Feng
Abstract: Combinatorial optimization problems are widespread but inherently challenging due to their discrete nature. The primary limitation of existing methods is that they can only access a small fraction of the solution space at each iteration, resulting in limited efficiency in searching for the global optimum. To overcome this challenge, diverging from conventional efforts of expanding the solver's search scope, we focus on enabling information to actively propagate to the solver through heat diffusion. By transforming the target function while preserving its optima, heat diffusion facilitates information flow from distant regions to the solver, providing more efficient navigation. Utilizing heat diffusion, we propose a framework for solving general combinatorial optimization problems. The proposed methodology demonstrates superior performance across a range of the most challenging and widely encountered combinatorial optimization problems. Echoing recent advancements in harnessing thermodynamics for generative artificial intelligence, our study further reveals its significant potential in advancing combinatorial optimization.
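A classical instance of this principle is Gaussian (heat-kernel) smoothing of the objective with an annealed diffusion time, shown below on a continuous toy problem; this is only a sketch of the underlying idea, not the paper's framework for discrete problems:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Rugged 1-D objective with many local minima.
    return x ** 2 + 2.0 * np.sin(8.0 * x)

def smoothed_grad(x, t, n_samples=256, eps=1e-2):
    # Finite-difference gradient of u(x, t) = E[f(x + sqrt(2t) * xi)], xi ~ N(0, 1).
    # u solves the heat equation with initial condition f, so a large diffusion
    # time t carries information from distant regions of the landscape.
    xi = rng.standard_normal(n_samples)
    u = lambda y: np.mean(f(y + np.sqrt(2.0 * t) * xi))
    return (u(x + eps) - u(x - eps)) / (2.0 * eps)

x = 3.0
for t in np.geomspace(2.0, 1e-3, 200):   # anneal diffusion time toward zero
    x -= 0.05 * smoothed_grad(x, t)
print(round(float(x), 3), round(float(f(x)), 3))
```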
Authors: Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis
Abstract: Training large Deep Neural Network (DNN) models requires thousands of GPUs over the course of several days or weeks. At this scale, failures are frequent and can have a big impact on training throughput. Utilizing spare GPU servers to mitigate performance loss becomes increasingly costly as model sizes grow. ReCycle is a system designed for efficient DNN training in the presence of failures, without relying on spare servers. It exploits the inherent functional redundancy in distributed training systems -- where servers across data-parallel groups store the same model parameters -- and pipeline schedule bubbles within each data-parallel group. When a server fails, ReCycle dynamically re-routes micro-batches to data-parallel peers, allowing for uninterrupted training despite multiple failures. However, this re-routing can create imbalances across pipeline stages, leading to reduced training throughput. To address this, ReCycle introduces two key optimizations that ensure re-routed micro-batches are processed within the original pipeline schedule's bubbles. First, it decouples the backward pass into two phases: one for computing gradients for the input and another for calculating gradients for the parameters. Second, it avoids synchronization across pipeline stages by staggering the optimizer step. Together, these optimizations enable adaptive pipeline schedules that minimize or even eliminate training throughput degradation during failures. We describe a prototype for ReCycle and show that it achieves high training throughput under multiple failures, outperforming recent proposals for fault-tolerant training such as Oobleck and Bamboo by up to $1.46\times$ and $1.64\times$, respectively.
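The backward-pass decoupling can be sketched in a few lines: the gradient with respect to the stage input (needed immediately by the upstream stage) is computed separately from the parameter gradients (which can be deferred into schedule bubbles). The code below is a schematic illustration, not ReCycle's API:

```python
import torch

layer = torch.nn.Linear(16, 16)              # stand-in for one pipeline stage
x = torch.randn(4, 16, requires_grad=True)
out = layer(x)
grad_out = torch.randn_like(out)             # gradient arriving from the next stage

# Phase 1: input gradient only, so the upstream stage can proceed at once.
(dx,) = torch.autograd.grad(out, x, grad_out, retain_graph=True)

# Phase 2 (deferrable into pipeline bubbles): parameter gradients.
dw, db = torch.autograd.grad(out, (layer.weight, layer.bias), grad_out)
print(dx.shape, dw.shape, db.shape)
```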
Authors: Michelle Espranita Liman, Daniel Rueckert, Florian J. Fintelmann, Philip M\"uller
Abstract: Field-of-view (FOV) recovery of truncated chest CT scans is crucial for accurate body composition analysis, which involves quantifying skeletal muscle and subcutaneous adipose tissue (SAT) on CT slices. This, in turn, enables disease prognostication. Here, we present a method for recovering truncated CT slices using generative image outpainting. We train a diffusion model and apply it to truncated CT slices generated by simulating a small FOV. Our model reliably recovers the truncated anatomy and outperforms the previous state-of-the-art despite being trained on 87% less data.
Authors: Tong Zhang, Yingdong Hu, Jiacheng You, Yang Gao
Abstract: Given the high cost of collecting robotic data in the real world, sample efficiency is a consistently compelling pursuit in robotics. In this paper, we introduce SGRv2, an imitation learning framework that enhances sample efficiency through improved visual and action representations. Central to the design of SGRv2 is the incorporation of a critical inductive bias, action locality, which posits that a robot's actions are predominantly influenced by the target object and its interactions with the local environment. Extensive experiments in both simulated and real-world settings demonstrate that action locality is essential for boosting sample efficiency. SGRv2 excels in RLBench tasks with keyframe control using merely 5 demonstrations and surpasses the RVT baseline in 23 of 26 tasks. Furthermore, when evaluated on ManiSkill2 and MimicGen using dense control, SGRv2's success rate is 2.54 times that of SGR. In real-world environments, with only eight demonstrations, SGRv2 can perform a variety of tasks at a markedly higher success rate compared to baseline models. Project website: http://sgrv2-robot.github.io
Authors: Tatsuya Kamijo, Cristian C. Beltran-Hernandez, Masashi Hamaya
Abstract: Automating dexterous, contact-rich manipulation tasks using rigid robots is a significant challenge in robotics. Rigid robots, defined by their actuation through position commands, face issues of excessive contact forces due to their inability to adapt to contact with the environment, potentially causing damage. While compliance control schemes have been introduced to mitigate these issues by controlling forces via external sensors, they are hampered by the need for fine-tuning task-specific controller parameters. Learning from Demonstrations (LfD) offers an intuitive alternative, allowing robots to learn manipulations through observed actions. In this work, we introduce a novel system to enhance the teaching of dexterous, contact-rich manipulations to rigid robots. Our system is twofold: firstly, it incorporates a teleoperation interface utilizing Virtual Reality (VR) controllers, designed to provide an intuitive and cost-effective method for task demonstration with haptic feedback. Secondly, we present Comp-ACT (Compliance Control via Action Chunking with Transformers), a method that leverages the demonstrations to learn variable compliance control from a few demonstrations. Our methods have been validated across various complex contact-rich manipulation tasks using single-arm and bimanual robot setups in simulated and real-world environments, demonstrating the effectiveness of our system in teaching robots dexterous manipulations with enhanced adaptability and safety. Code available at: https://github.com/omron-sinicx/CompACT
Authors: Tanush Chopra, Michael Li, Jacob Haimes
Abstract: When large language models (LLMs) are asked to perform certain tasks, how can we be sure that their learned representations align with reality? We propose a domain-agnostic framework for systematically evaluating distribution shifts in LLMs' decision-making processes, where they are given control of mechanisms governed by pre-defined rules. While individual LLM actions may appear consistent with expected behavior, across a large number of trials, statistically significant distribution shifts can emerge. To test this, we construct a well-defined environment with known outcome logic: blackjack. In more than 1,000 trials, we uncover statistically significant evidence suggesting behavioral misalignment in the learned representations of LLMs.
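Such distribution shifts can be detected with a standard goodness-of-fit test on aggregated action counts; the numbers below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts of LLM actions ("hit", "stand", "double") over many
# trials for one game state, against the action frequencies expected under
# the environment's known outcome logic.
observed = np.array([612, 351, 37])
expected = np.array([0.70, 0.27, 0.03]) * observed.sum()

stat, p = chisquare(observed, expected)
print(f"chi2={stat:.2f}, p={p:.4f}")  # a small p indicates a significant shift
```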
Authors: Jie Gui, Chengze Jiang, Minjing Dong, Kun Tong, Xinli Shi, Yuan Yan Tang, Dacheng Tao
Abstract: While adversarial training is an effective defense method against adversarial attacks, it notably increases the training cost. To this end, fast adversarial training (FAT) has been proposed for efficient training and has become a hot research topic. However, FAT suffers from catastrophic overfitting, which leads to a performance drop compared with multi-step adversarial training, and the cause of catastrophic overfitting remains unclear and underexplored. In this paper, we present an example taxonomy in FAT, which identifies that catastrophic overfitting is caused by the imbalance between the inner and outer optimization in FAT. Furthermore, we investigate the impact of varying degrees of training loss, revealing a correlation between training loss and catastrophic overfitting. Based on these observations, we redesign the loss function in FAT with the proposed dynamic label relaxation to concentrate the loss range and reduce the impact of misclassified examples. Meanwhile, we introduce batch momentum initialization to enhance diversity and prevent catastrophic overfitting in an efficient manner. We also propose Catastrophic Overfitting aware Loss Adaptation (COLA), which employs a separate training strategy for examples based on their loss degree. Our proposed method, named example taxonomy aware FAT (ETA), establishes an improved paradigm for FAT. Comprehensive experiments on four standard datasets demonstrate that ETA achieves state-of-the-art performance.
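For context, the single-step inner maximization at the core of FAT looks as follows (a generic FGSM-style step; the paper's dynamic label relaxation, batch momentum initialization, and COLA build on top of this and are not shown):

```python
import torch
import torch.nn.functional as F

def fgsm_step(model, x, y, eps):
    # One gradient-sign step replaces the multi-step PGD inner loop,
    # which is what makes fast adversarial training fast.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))
x_adv = fgsm_step(model, x, y, eps=8 / 255)
print((x_adv - x).abs().max())  # perturbation stays within eps
```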
Authors: Caleb Lammers, Miles Cranmer, Sam Hadden, Shirley Ho, Norman Murray, Daniel Tamayo
Abstract: Constraining planet formation models based on the observed exoplanet population requires generating large samples of synthetic planetary systems, which can be computationally prohibitive. A significant bottleneck is simulating the giant impact phase, during which planetary embryos evolve gravitationally and combine to form planets, which may themselves experience later collisions. To accelerate giant impact simulations, we present a machine learning (ML) approach to predicting collisional outcomes in multiplanet systems. Trained on more than 500,000 $N$-body simulations of three-planet systems, we develop an ML model that can accurately predict which two planets will experience a collision, along with the state of the post-collision planets, from a short integration of the system's initial conditions. Our model greatly improves on non-ML baselines that rely on metrics from dynamics theory, which struggle to accurately predict which pair of planets will experience a collision. By combining with a model for predicting long-term stability, we create an ML-based giant impact emulator, which can predict the outcomes of giant impact simulations with reasonable accuracy and a speedup of up to four orders of magnitude. We expect our model to enable analyses that would not otherwise be computationally feasible. As such, we release our training code, along with an easy-to-use API for our collision outcome model and giant impact emulator.
Authors: Ziyang Yu, Wenbing Huang, Yang Liu
Abstract: Molecular Dynamics (MD) is crucial in various fields such as materials science, chemistry, and pharmacology to name a few. Conventional MD software struggles with the balance between time cost and prediction accuracy, which restricts its wider application. Recently, data-driven approaches based on deep generative models have been devised for time-coarsened dynamics, which aim at learning dynamics of diverse molecular systems over a long timestep, enjoying both universality and efficiency. Nevertheless, most current methods are designed solely to learn from the data distribution regardless of the underlying Boltzmann distribution, and the physics priors such as energies and forces are constantly overlooked. In this work, we propose a conditional generative model called Force-guided Bridge Matching (FBM), which learns full-atom time-coarsened dynamics and targets the Boltzmann-constrained distribution. With the guidance of our delicately-designed intermediate force field, FBM leverages favourable physics priors into the generation process, giving rise to enhanced simulations. Experiments on two datasets consisting of peptides verify our superiority in terms of comprehensive metrics and demonstrate transferability to unseen systems.
Authors: Elisabeth Fischer, Albin Zehe, Andreas Hotho, Daniel Schl\"or
Abstract: By analyzing sequences of interactions between users and items, sequential recommendation models can learn user intent and make predictions about the next item. Next to item interactions, most systems also record interactions with what we call non-item pages: pages that are not related to specific items but can still provide insights into the user's interests, such as navigation pages. We therefore propose a general way to include these non-item pages in sequential recommendation models to enhance next-item prediction. First, we demonstrate the influence of non-item pages on subsequent interactions using the hypothesis-testing framework HypTrails and propose methods for representing non-item pages in sequential recommendation models. Subsequently, we adapt popular sequential recommender models to integrate non-item pages and investigate their performance with different item representation strategies as well as their ability to handle noisy data. To show the general capability of the models to integrate non-item pages, we create a synthetic dataset for a controlled setting and then evaluate the improvements from including non-item pages on two real-world datasets. Our results show that non-item pages are a valuable source of information, and incorporating them in sequential recommendation models increases the performance of next-item prediction across all analyzed model architectures.
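One simple representation strategy, shown schematically below with made-up sizes, is a joint vocabulary in which item IDs and non-item page IDs share one embedding table, so that mixed sessions can be fed to any sequential recommender unchanged:

```python
import torch
import torch.nn as nn

N_ITEMS, N_PAGES, DIM = 1000, 50, 32
embed = nn.Embedding(N_ITEMS + N_PAGES, DIM)  # items and pages in one table

# A session mixing item interactions with a non-item (e.g., navigation) page;
# non-item page k is offset by N_ITEMS to keep the two ID spaces disjoint.
session = torch.tensor([[3, 17, N_ITEMS + 4, 42]])
tokens = embed(session)                        # (1, 4, 32) model-ready sequence
print(tokens.shape)
```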
Authors: Jan Schnabel, Marco Roth
Abstract: Since the entry of kernel theory in the field of quantum machine learning, quantum kernel methods (QKMs) have gained increasing attention with regard to both probing promising applications and delivering intriguing research insights. Two common approaches for computing the underlying Gram matrix have emerged: fidelity quantum kernels (FQKs) and projected quantum kernels (PQKs). Benchmarking these methods is crucial to gain robust insights and to understand their practical utility. In this work, we present a comprehensive large-scale study examining QKMs based on FQKs and PQKs across a manifold of design choices. Our investigation encompasses both classification and regression tasks for five dataset families and 64 datasets, systematically comparing the use of FQKs and PQKs in quantum support vector machines and kernel ridge regression. This resulted in over 20,000 models that were trained and optimized using a state-of-the-art hyperparameter search to ensure robust and comprehensive insights. We delve into the importance of hyperparameters on model performance scores and support our findings through rigorous correlation analyses. In doing so, we also closely inspect two data encoding strategies. Moreover, we provide an in-depth analysis addressing the design freedom of PQKs and explore the underlying principles responsible for learning. Our goal is not to identify the best-performing model for a specific task but to uncover the mechanisms that lead to effective QKMs and reveal universal patterns.
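For orientation, the two Gram matrix constructions are commonly defined as follows (standard forms from the quantum kernel literature; the paper's exact conventions may differ):

```latex
k^{\mathrm{FQK}}(x, x') \;=\; \big|\langle \psi(x') \mid \psi(x) \rangle\big|^2,
\qquad
k^{\mathrm{PQK}}(x, x') \;=\; \exp\!\Big(-\gamma \sum_{k} \big\|\rho_k(x) - \rho_k(x')\big\|_F^2\Big),
```

where $\psi(x)$ is the data-encoding quantum state, $\rho_k(x)$ the reduced density matrix of qubit $k$, and $\gamma$ a bandwidth hyperparameter.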
Authors: Yasaman Etesam, Hyunmin Cheong, Mohammadmehdi Ataei, Pradeep Kumar Jayaraman
Abstract: Generative AI has made remarkable progress in addressing various design challenges. One prominent area where generative AI could bring significant value is in engineering design. In particular, selecting an optimal set of components and their interfaces to create a mechanical system that meets design requirements is one of the most challenging and time-consuming tasks for engineers. This configuration design task is inherently challenging due to its categorical nature, multiple design requirements a solution must satisfy, and the reliance on physics simulations for evaluating potential solutions. These characteristics entail solving a combinatorial optimization problem with multiple constraints involving black-box functions. To address this challenge, we propose a deep generative model to predict the optimal combination of components and interfaces for a given design problem. To demonstrate our approach, we solve a gear train synthesis problem by first creating a synthetic dataset using a grammar, a parts catalogue, and a physics simulator. We then train a Transformer using this dataset, named GearFormer, which can not only generate quality solutions on its own, but also augment search methods such as an evolutionary algorithm and Monte Carlo tree search. We show that GearFormer outperforms such search methods on their own in terms of satisfying the specified design requirements with orders of magnitude faster generation time. Additionally, we showcase the benefit of hybrid methods that leverage both GearFormer and search methods, which further improve the quality of the solutions.
Authors: Caleb Koch, Carmen Strassle, Li-Yang Tan
Abstract: We connect the problem of properly PAC learning decision trees to the parameterized Nearest Codeword Problem ($k$-NCP). Despite significant effort by the respective communities, algorithmic progress on both problems has been stuck: the fastest known algorithm for the former runs in quasipolynomial time (Ehrenfeucht and Haussler 1989) and the best known approximation ratio for the latter is $O(n/\log n)$ (Berman and Karpinski 2002; Alon, Panigrahy, and Yekhanin 2009). Research on both problems has thus far proceeded independently with no known connections. We show that $\textit{any}$ improvement of Ehrenfeucht and Haussler's algorithm will yield $O(\log n)$-approximation algorithms for $k$-NCP, an exponential improvement of the current state of the art. This can be interpreted either as a new avenue for designing algorithms for $k$-NCP, or as one for establishing the optimality of Ehrenfeucht and Haussler's algorithm. Furthermore, our reduction along with existing inapproximability results for $k$-NCP already rule out polynomial-time algorithms for properly learning decision trees. A notable aspect of our hardness results is that they hold even in the setting of $\textit{weak}$ learning whereas prior ones were limited to the setting of strong learning.
Authors: Yuxin Zhang, Zheng Lin, Zhe Chen, Zihan Fang, Wenjun Zhu, Xianhao Chen, Jin Zhao, Yue Gao
Abstract: Traditional federated learning (FL) frameworks rely heavily on terrestrial networks, where coverage limitations and increasing bandwidth congestion significantly hinder model convergence. Fortunately, the advancement of low-Earth orbit (LEO) satellite networks offers promising new communication avenues to augment traditional terrestrial FL. Despite this potential, the limited satellite-ground communication bandwidth and the heterogeneous operating environments of ground devices (including variations in data, bandwidth, and computing power) pose substantial challenges for effective and robust satellite-assisted FL. To address these challenges, we propose SatFed, a resource-efficient satellite-assisted heterogeneous FL framework. SatFed implements freshness-based model prioritization queues to optimize the use of highly constrained satellite-ground bandwidth, ensuring the transmission of the most critical models. Additionally, a multigraph is constructed to capture real-time heterogeneous relationships between devices, including data distribution, terrestrial bandwidth, and computing capability. This multigraph enables SatFed to aggregate satellite-transmitted models into peer guidance, enhancing local training in heterogeneous environments. Extensive experiments with real-world LEO satellite networks demonstrate that SatFed achieves superior performance and robustness compared to state-of-the-art benchmarks.
Authors: Yixi Wu, Pengfei He, Zehao Wang, Shaowei Wang, Yuan Tian, Tse-Hsun Chen
Abstract: Large language models (LLMs) like GitHub Copilot and ChatGPT have emerged as powerful tools for code generation, significantly enhancing productivity and accelerating software development. However, existing benchmarks primarily focus on general code generation without considering API-oriented code generation, i.e., generating code that invokes APIs from specific libraries. Given the growing demand for API-oriented code generation, there is a pressing need for a systematic and automated approach to evaluate LLMs on API-oriented code generation. To address this gap, we propose AutoAPIEval, a lightweight and automated framework designed to evaluate the capabilities of LLMs in API-oriented code generation. Our framework works with any library that provides API documentation and focuses on two unit tasks, API recommendation and code example generation, along with four metrics to evaluate the generated APIs and code examples: for Task 1, the proportion of incorrect API recommendations; for Task 2, the proportion of code examples in which no specific API is invoked, and the proportions of uncompilable and unexecutable code examples. In addition, we conducted a case study on three LLMs (ChatGPT, MagiCoder, and DeepSeek Coder) and Java Runtime Environment 8 to demonstrate the framework's effectiveness. Our findings reveal substantial variability in LLM performance across tasks, with ChatGPT adhering better to instructions, while sharing similar effectiveness in code example generation with its counterparts (i.e., MagiCoder and DeepSeek Coder). We also identify key factors associated with code quality, such as API popularity and model confidence, and build classifiers that achieve high accuracy in detecting incorrect API recommendations and erroneous code examples. Retrieval-augmented generation enhances the quality of code generated by LLMs, though its effectiveness varies across different LLMs.
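The evaluation metrics are simple proportions and can be sketched as follows (schematic helper functions with hypothetical names; a real harness would compile and execute the generated Java code rather than use stand-in predicates):

```python
def api_error_rate(recommended, library_apis):
    # Task 1: proportion of recommended APIs that do not exist in the library.
    if not recommended:
        return 0.0
    return sum(api not in library_apis for api in recommended) / len(recommended)

def example_failure_rates(examples, invokes_api, compiles):
    # Task 2: proportion of examples that never invoke the target API,
    # and proportion that fail to compile.
    n = len(examples)
    return (sum(not invokes_api(e) for e in examples) / n,
            sum(not compiles(e) for e in examples) / n)

apis = {"java.util.List.of", "java.util.Map.of"}
print(api_error_rate(["java.util.List.of", "java.util.Foo.bar"], apis))
print(example_failure_rates(["List.of(1, 2)", "int x = ;"],
                            invokes_api=lambda e: "List.of" in e,
                            compiles=lambda e: "= ;" not in e))
```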
Authors: Johan Hallberg Szabadv\'ary
Abstract: Conformal prediction is a widely used framework for distribution-free uncertainty quantification, which generates valid prediction sets at a user-defined significance level. However, this framework relies on the assumption that the data-generating distribution is exchangeable, a condition that is frequently violated in time series and other structured data. In such cases, the validity guarantees of conformal prediction break down. Adaptive conformal inference (ACI) has been proposed as a solution for non-exchangeable data, dynamically adjusting the significance level to retain finite-sample guarantees on the marginal coverage error rate. This paper demonstrates that, despite its name, ACI does not strictly require the use of conformal predictors. Instead, it can operate effectively with the more general concept of a confidence predictor, which is often computationally simpler. The key requirement is that larger significance levels correspond to smaller prediction sets, a property known as nested prediction sets. Through experiments on synthetic and real-world data, we investigate whether ACI with conformal predictors offers advantages over confidence predictors. Our results indicate that confidence predictors can perform just as well as, and in some cases better than, conformal predictors, although further empirical studies are needed to determine when one approach may be preferable.
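The ACI recursion itself is one line and, as the paper stresses, only requires that the underlying predictor produce nested sets; a minimal sketch:

```python
def aci_trajectory(target_alpha, gamma, errors):
    # Replay alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t), where
    # err_t = 1 if the prediction set at level alpha_t missed y_t, else 0.
    alpha, traj = target_alpha, []
    for err in errors:
        alpha += gamma * (target_alpha - err)
        traj.append(alpha)
    return traj

# Any nested confidence predictor supplies err_t: form the set at the current
# alpha, observe y_t, record a miss or a hit, then update alpha.
print(aci_trajectory(0.1, gamma=0.05, errors=[0, 0, 1, 0, 1, 0]))
```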
Authors: R. Stuart Geiger, Flynn O'Sullivan, Elsie Wang, Jonathan Lo
Abstract: We conducted controlled experimental bias audits for four versions of ChatGPT, which we asked to recommend an opening offer in salary negotiations for a new hire. We submitted 98,800 prompts to each version, systematically varying the employee's gender, university, and major, and tested prompts in the voice of each side of the negotiation: the employee versus the employer. We find that ChatGPT as a multi-model platform is not robust and consistent enough to be trusted for such a task. We observed statistically significant differences in salary offers when varying gender for all four models, although with smaller gaps than for other attributes tested. The largest gaps were between different model versions and between employee- and employer-voiced prompts. We also observed substantial gaps when varying university and major, but many of the biases were not consistent across model versions. We tested for fictional and fraudulent universities and found wildly inconsistent results across cases and model versions. We make broader contributions to the AI/ML fairness literature. Our scenario and our experimental design differ from mainstream AI/ML auditing efforts in key ways. Bias audits typically test discrimination for protected classes like gender, which we contrast with testing the non-protected classes of university and major. Asking for negotiation advice includes how aggressive one ought to be in a negotiation relative to known empirical salary distributions and scales, which is a deeply contextual and personalized task that has no objective ground truth to validate. These results raise concerns for the specific model versions we tested and for ChatGPT as a multi-model platform in continuous development. Our epistemology does not permit us to definitively certify these models as either generally biased or unbiased on the attributes we test, but our study raises matters of concern for stakeholders to further investigate.
Authors: Yuchen Fang, Sen Na, Michael W. Mahoney, Mladen Kolar
Abstract: In this work, we consider solving optimization problems with a stochastic objective and deterministic equality constraints. We propose a Trust-Region Sequential Quadratic Programming method to find both first- and second-order stationary points. Our method utilizes a random model to represent the objective function, which is constructed from stochastic observations of the objective and is designed to satisfy proper adaptive accuracy conditions with a high but fixed probability. To converge to first-order stationary points, our method computes a gradient step in each iteration defined by minimizing a quadratic approximation of the objective subject to a (relaxed) linear approximation of the problem constraints and a trust-region constraint. To converge to second-order stationary points, our method additionally computes an eigen step to explore the negative curvature of the reduced Hessian matrix, as well as a second-order correction step to address the potential Maratos effect, which arises due to the nonlinearity of the problem constraints. Such an effect may impede the method from moving away from saddle points. Both gradient and eigen step computations leverage a novel parameter-free decomposition of the step and the trust-region radius, accounting for the proportions among the feasibility residual, optimality residual, and negative curvature. We establish global almost sure first- and second-order convergence guarantees for our method, and present computational results on CUTEst problems, regression problems, and saddle-point problems to demonstrate its superiority over existing line-search-based stochastic methods.
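Schematically, the gradient step solves a trust-region quadratic subproblem of the familiar form (generic notation, not the paper's exact formulation):

```latex
\min_{d \in \mathbb{R}^n} \;\; \nabla f(x_k)^\top d + \tfrac{1}{2}\, d^\top H_k\, d
\quad \text{s.t.} \quad c(x_k) + \nabla c(x_k)^\top d = r_k, \qquad \|d\| \le \Delta_k,
```

where $H_k$ is built from stochastic estimates of the objective and $r_k$ relaxes the linearized equality constraints so that the subproblem remains feasible within the trust-region radius $\Delta_k$.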
Authors: Teerapong Panboonyuen
Abstract: Forecasting sea surface currents is essential for applications such as maritime navigation, environmental monitoring, and climate analysis, particularly in regions like the Gulf of Thailand and the Andaman Sea. This paper introduces SEA-ViT, an advanced deep learning model that integrates Vision Transformer (ViT) with bidirectional Gated Recurrent Units (GRUs) to capture spatio-temporal covariance for predicting sea surface currents (U, V) using high-frequency (HF) radar data. The name SEA-ViT is derived from ``Sea Surface Currents Forecasting using Vision Transformer,'' highlighting the model's emphasis on ocean dynamics and its use of the ViT architecture to enhance forecasting capabilities. SEA-ViT is designed to unravel complex dependencies by leveraging a rich dataset spanning over 30 years and incorporating ENSO indices (El Ni\~no, La Ni\~na, and neutral phases) to address the intricate relationship between geographic coordinates and climatic variations. This development enhances the predictive capabilities for sea surface currents, supporting the efforts of the Geo-Informatics and Space Technology Development Agency (GISTDA) in Thailand's maritime regions. The code and pretrained models are available at \url{https://github.com/kaopanboonyuen/gistda-ai-sea-surface-currents}.
URLs: https://github.com/kaopanboonyuen/gistda-ai-sea-surface-currents
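A schematic of the overall wiring (patch tokens from a ViT-style encoder, a bidirectional GRU over the frame sequence, and a head emitting the next (U, V) field); all sizes and layer choices below are illustrative assumptions, not the published architecture:

```python
import torch
import torch.nn as nn

class ViTGRUForecaster(nn.Module):
    def __init__(self, img=32, patch=8, dim=64, heads=4, depth=2, hidden=64):
        super().__init__()
        self.patchify = nn.Conv2d(2, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.gru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2 * img * img)
        self.img = img

    def forward(self, frames):                   # (B, T, 2, H, W) radar history
        B, T = frames.shape[:2]
        x = self.patchify(frames.flatten(0, 1))  # (B*T, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # patch tokens per frame
        x = self.encoder(x).mean(dim=1)          # one embedding per frame
        x, _ = self.gru(x.view(B, T, -1))        # temporal context, both directions
        uv = self.head(x[:, -1])                 # forecast from the final step
        return uv.view(B, 2, self.img, self.img) # next (U, V) field

print(ViTGRUForecaster()(torch.randn(1, 5, 2, 32, 32)).shape)
```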
Authors: Alexander Popov, Alperen Degirmenci, David Wehr, Shashank Hegde, Ryan Oldja, Alexey Kamenev, Bertrand Douillard, David Nist\'er, Urs Muller, Ruchi Bhargava, Stan Birchfield, Nikolai Smolyanskiy
Abstract: We propose the use of latent space generative world models to address the covariate shift problem in autonomous driving. A world model is a neural network capable of predicting an agent's next state given past states and actions. By leveraging a world model during training, the driving policy effectively mitigates covariate shift without requiring an excessive amount of training data. During end-to-end training, our policy learns how to recover from errors by aligning with states observed in human demonstrations, so that at runtime it can recover from perturbations outside the training distribution. Additionally, we introduce a novel transformer-based perception encoder that employs multi-view cross-attention and a learned scene query. We present qualitative and quantitative results, demonstrating significant improvements upon prior state of the art in closed-loop testing in the CARLA simulator, as well as showing the ability to handle perturbations in both CARLA and NVIDIA's DRIVE Sim.