Authors: Philippe Chlenski, Quentin Chu, Raiyan R. Khan, Antonio Khalil Moretti, Itsik Pe'er
Abstract: Decision trees (DTs) and their random forest (RF) extensions are workhorses of classification and regression in Euclidean spaces. However, algorithms for learning in non-Euclidean spaces are still limited. We extend DT and RF algorithms to product manifolds: Cartesian products of several hyperbolic, hyperspherical, or Euclidean components. Such manifolds handle heterogeneous curvature while still factorizing neatly into simpler components, making them compelling embedding spaces for complex datasets. Our novel angular reformulation of DTs respects the geometry of the product manifold, yielding splits that are geodesically convex, maximum-margin, and composable. In the special cases of single-component manifolds, our method simplifies to its Euclidean or hyperbolic counterparts, or introduces hyperspherical DT algorithms, depending on the curvature. We benchmark our method on various classification, regression, and link prediction tasks on synthetic data, graph embeddings, mixed-curvature variational autoencoder latent spaces, and empirical data. Compared to six other classifiers, product DTs and RFs ranked first on 21 of 22 single-manifold benchmarks and 18 of 35 product manifold benchmarks, and placed in the top 2 on 53 of 57 benchmarks overall. This highlights the value of product DTs and RFs as straightforward yet powerful new tools for data analysis in product manifolds. Code for our paper is available at https://github.com/pchlenski/embedders.
Authors: Yikang Chen, Dehui Du, Lili Tian
Abstract: We propose an importance sampling method for tractable and efficient estimation of counterfactual expressions in general settings, named Exogenous Matching. By minimizing a common upper bound of counterfactual estimators, we transform the variance minimization problem into a conditional distribution learning problem, enabling its integration with existing conditional distribution modeling approaches. We validate the theoretical results through experiments under various types and settings of Structural Causal Models (SCMs) and demonstrate that the method outperforms other existing importance sampling methods on counterfactual estimation tasks. We also explore the impact of injecting structural prior knowledge (counterfactual Markov boundaries) on the results. Finally, we apply this method to identifiable proxy SCMs and demonstrate the unbiasedness of the estimates, empirically illustrating the applicability of the method to practical scenarios.
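As a generic illustration of why the choice of proposal distribution matters in importance sampling, the sketch below estimates a rare tail probability under a standard normal. It is not Exogenous Matching: the proposal here is picked by hand rather than learned, and the function names, threshold, and sample size are assumptions made only for this example.

```python
import math
import random


def tail_prob_naive(threshold=3.0, n=20000, seed=0):
    """Plain Monte Carlo estimate of P(X > threshold) for X ~ N(0, 1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n


def tail_prob_importance_sampling(threshold=3.0, n=20000, seed=0):
    """Importance sampling estimate of the same probability.

    Samples come from the hand-picked proposal q = N(threshold, 1) and are
    reweighted by p(x) / q(x) = exp(-threshold * x + threshold**2 / 2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n


if __name__ == "__main__":
    exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))
    print("exact      :", exact)
    print("naive MC   :", tail_prob_naive())
    print("importance :", tail_prob_importance_sampling())
```

The naive estimator sees very few samples beyond the threshold, whereas the shifted proposal places about half of its samples there, which is the variance-reduction effect that a learned proposal aims to obtain automatically.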
Authors: Shuyin Xia, Bolun Shi, Yifan Wang, Jiang Xie, Guoyin Wang, Xinbo Gao
Abstract: Traditional clustering algorithms often focus on the most fine-grained information and achieve clustering by calculating the distance between each pair of data points or implementing other calculations based on points. This approach is inconsistent with the cognitive mechanism of "global precedence" in the human brain, resulting in poor efficiency, generalization ability, and robustness. To address this problem, we propose a new clustering algorithm called granular-ball clustering (GBCT) via granular-ball computing. First, GBCT generates a smaller number of granular-balls to represent the original data, and forms clusters according to the relationship between granular-balls, instead of the traditional point relationship. At the same time, its coarse-grained characteristics are not susceptible to noise, and the algorithm is efficient and robust; in addition, as granular-balls can fit various complex data, GBCT performs much better on non-spherical data sets than other traditional clustering methods. The completely new coarse-granularity representation method and cluster formation mode of GBCT can also be used to improve other traditional methods.
Authors: ZiDong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, Lei Bai
Abstract: \textit{Nature is infinitely resolution-free}. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To address this limitation, we conceptualize images as sequences of tokens with dynamic sizes, rather than traditional methods that perceive images as fixed-resolution grids. This perspective enables a flexible training strategy that seamlessly accommodates various aspect ratios during both training and inference, thus promoting resolution generalization and eliminating biases introduced by image cropping. On this basis, we present the \textbf{Flexible Vision Transformer} (FiT), a transformer architecture specifically designed for generating images with \textit{unrestricted resolutions and aspect ratios}. We further upgrade FiT to FiTv2 with several innovative designs, including Query-Key vector normalization, the AdaLN-LoRA module, a rectified flow scheduler, and a Logit-Normal sampler. Enhanced by a meticulously adjusted network structure, FiTv2 exhibits $2\times$ the convergence speed of FiT. When incorporating advanced training-free extrapolation techniques, FiTv2 demonstrates remarkable adaptability in both resolution extrapolation and diverse resolution generation. Additionally, our exploration of the scalability of the FiTv2 model reveals that larger models exhibit better computational efficiency. Furthermore, we introduce an efficient post-training strategy to adapt a pre-trained model for high-resolution generation. Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions. We have released all code and models at \url{https://github.com/whlzy/FiT} to promote the exploration of diffusion transformer models for arbitrary-resolution image generation.
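To make the "rectified flow scheduler" and "Logit-Normal sampler" concrete, here is a minimal sketch of how such components are commonly combined when drawing a training timestep. It is not taken from the FiT codebase: the function names, the default mean and standard deviation, and the straight-line interpolation convention are all assumptions for illustration.

```python
import math
import random


def sample_logit_normal_timestep(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through a sigmoid.

    With mean=0 the density peaks near t=0.5, so intermediate timesteps
    are visited more often than the two endpoints.
    """
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def interpolate(x_start, x_end, t):
    """Straight-line path between two endpoints, as used in rectified flow.

    Which endpoint is data and which is noise is a convention that varies
    between implementations; this sketch does not fix one.
    """
    return [(1.0 - t) * a + t * b for a, b in zip(x_start, x_end)]


if __name__ == "__main__":
    rng = random.Random(0)
    t = sample_logit_normal_timestep(rng)
    print("t =", t)
    print("x_t =", interpolate([0.0, 2.0], [1.0, 4.0], t))
```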
Authors: Gon\c{c}alo Paulo, Alex Mallen, Caden Juang, Nora Belrose
Abstract: While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which may be more easily interpretable. However, these SAEs can have millions of distinct latent features, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of explanations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a feature, which we find explains features that are not recalled by existing methods. We propose guidelines for generating better explanations that remain valid for a broader set of activating contexts, and discuss pitfalls with existing scoring techniques. We use our explanations to measure the semantic similarity of independently trained SAEs, and find that SAEs trained on nearby layers of the residual stream are highly similar. Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons, even when neurons are sparsified using top-$k$ postprocessing. Our code is available at https://github.com/EleutherAI/sae-auto-interp, and our explanations are available at https://huggingface.co/datasets/EleutherAI/auto_interp_explanations.
URLs: https://github.com/EleutherAI/sae-auto-interp, https://huggingface.co/datasets/EleutherAI/auto_interp_explanations
Authors: Tonghan Wang, Heng Dong, Yanchen Jiang, David C. Parkes, Milind Tambe
Abstract: Multiagent systems grapple with partial observability (PO), and the decentralized POMDP (Dec-POMDP) model highlights the fundamental nature of this challenge. Whereas recent approaches to address PO have appealed to deep learning models, providing a rigorous understanding of how these models and their approximation errors affect agents' handling of PO and their interactions remains a challenge. In addressing this challenge, we investigate reconstructing global states from local action-observation histories in Dec-POMDPs using diffusion models. We first find that diffusion models conditioned on local history represent possible states as stable fixed points. In collectively observable (CO) Dec-POMDPs, individual diffusion models conditioned on agents' local histories share a unique fixed point corresponding to the global state, while in non-CO settings, the shared fixed points yield a distribution of possible states given joint history. We further find that, with deep learning approximation errors, fixed points can deviate from true states and the deviation is negatively correlated with the Jacobian rank. Inspired by this low-rank property, we bound the deviation by constructing a surrogate linear regression model that approximates the local behavior of diffusion models. With this bound, we propose a composite diffusion process iterating over agents with theoretical convergence guarantees to the true state.
Authors: Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, Soummya Kar
Abstract: We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities like sign, quantization, component-wise and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of gradient norm-squared, at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optima, at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, nonlinearity and noise. Compared to state-of-the-art works, which only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art works depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs. Experiments validate our theory, demonstrating noise symmetry in real-life settings and showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.
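The black-box view of the nonlinearity can be sketched as an SGD loop that accepts any map applied to the stochastic gradient. The code below is an illustrative toy, not the authors' implementation: the step-size schedule `a / (t + 1) ** 0.75`, the quadratic cost, the Cauchy noise model, and all function names are assumptions chosen for the example.

```python
import math
import random


def sign_nl(g):
    """Component-wise sign."""
    return [(x > 0) - (x < 0) for x in g]


def componentwise_clip(g, tau=1.0):
    """Clip every coordinate to the interval [-tau, tau]."""
    return [max(-tau, min(tau, x)) for x in g]


def joint_clip(g, tau=1.0):
    """Rescale the whole vector so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [scale * x for x in g]


def nonlinear_sgd(grad_fn, x0, nonlinearity, steps=2000, a=1.0, seed=0):
    """Run x <- x - step_t * nonlinearity(stochastic gradient)."""
    rng = random.Random(seed)
    x = list(x0)
    for t in range(1, steps + 1):
        step = a / (t + 1) ** 0.75  # assumed decaying step size
        d = nonlinearity(grad_fn(x, rng))
        x = [xi - step * di for xi, di in zip(x, d)]
    return x


def noisy_quadratic_grad(x, rng):
    """Gradient of 0.5 * ||x||^2 plus symmetric Cauchy noise (no finite mean)."""
    return [xi + math.tan(math.pi * (rng.random() - 0.5)) for xi in x]


if __name__ == "__main__":
    for nl in (sign_nl, componentwise_clip, joint_clip):
        print(nl.__name__, nonlinear_sgd(noisy_quadratic_grad, [5.0, -5.0], nl))
```

Swapping the `nonlinearity` argument is the only change needed to move between sign, component-wise clipping, and joint clipping, which mirrors the unified treatment described in the abstract.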
Authors: Ihab Bendidi, Shawn Whitfield, Kian Kenyon-Dean, Hanene Ben Yedder, Yassir El Mesbahi, Emmanuel Noutahi, Alisandra K. Denton
Abstract: Understanding the relationships among genes, compounds, and their interactions in living organisms remains limited due to technological constraints and the complexity of biological data. Deep learning has shown promise in exploring these relationships using various data types. However, transcriptomics, which provides detailed insights into cellular states, is still underused due to its high noise levels and limited data availability. Recent advancements in transcriptomics sequencing provide new opportunities to uncover valuable insights, especially with the rise of many new foundation models for transcriptomics, yet no benchmark has been made to robustly evaluate the effectiveness of these rising models for perturbation analysis. This article presents a novel biologically motivated evaluation framework and a hierarchy of perturbation analysis tasks for comparing the performance of pretrained foundation models to each other and to more classical techniques of learning from transcriptomics data. We compile diverse public datasets from different sequencing techniques and cell lines to assess model performance. Our approach identifies scVI and PCA to be far better suited models for understanding biological perturbations in comparison to existing foundation models, especially in their application in real-world scenarios.
Authors: Jinze Zhao
Abstract: As Transformer models grow in complexity, their ability to generalize to novel, compositional tasks becomes crucial. This study challenges conventional wisdom about sparse activation in Sparse Mixture of Experts (SMoE) models when faced with increasingly complex compositional tasks. Through experiments on the SRAVEN symbolic reasoning task and SKILL-MIX benchmark, we demonstrate that activating more experts improves performance on difficult tasks, with the optimal number of activated experts scaling with task complexity. Our findings reveal that pretrained SMoE-based Large Language Models achieve better results by increasing experts-per-token on challenging compositional tasks.
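The "experts-per-token" knob can be illustrated with a minimal top-k routing sketch. This is a generic toy and not the code used in the study: the renormalized softmax over the selected logits is one common convention, and the function names and scalar "experts" are assumptions for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only (assumed convention).
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in order])
    return list(zip(order, weights))


def smoe_layer(x, router_logits, experts, k):
    """Weighted sum of the outputs of the k activated experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


if __name__ == "__main__":
    experts = [lambda v: v, lambda v: 2 * v, lambda v: 3 * v, lambda v: 4 * v]
    logits = [0.1, 2.0, -1.0, 2.0]
    for k in (1, 2, 4):
        print(k, smoe_layer(1.0, logits, experts, k))
```

Raising `k` at inference time, without retraining, is the kind of intervention the abstract refers to when it discusses increasing the number of activated experts.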
Authors: Minhua Lin, Zhiwei Zhang, Enyan Dai, Zongyu Wu, Yilong Wang, Xiang Zhang, Suhang Wang
Abstract: Graph Prompt Learning (GPL) has been introduced as a promising approach that uses prompts to adapt pre-trained GNN models to specific downstream tasks without requiring fine-tuning of the entire model. Despite the advantages of GPL, little attention has been given to its vulnerability to backdoor attacks, where an adversary can manipulate the model's behavior by embedding hidden triggers. Existing graph backdoor attacks rely on modifying model parameters during training, but this approach is impractical in GPL as GNN encoder parameters are frozen after pre-training. Moreover, downstream users may fine-tune their own task models on clean datasets, further complicating the attack. In this paper, we propose TGPA, a backdoor attack framework designed specifically for GPL. TGPA injects backdoors into graph prompts without modifying pre-trained GNN encoders and ensures high attack success rates and clean accuracy. To address the challenge of model fine-tuning by users, we introduce a finetuning-resistant poisoning approach that maintains the effectiveness of the backdoor even after downstream model adjustments. Extensive experiments on multiple datasets under various settings demonstrate the effectiveness of TGPA in compromising GPL models with fixed GNN encoders.
Authors: Renpu Liu, Ruida Zhou, Cong Shen, Jing Yang
Abstract: An intriguing property of the Transformer is its ability to perform in-context learning (ICL), where the Transformer can solve different inference tasks without parameter updating based on the contextual information provided by the corresponding input-output demonstration pairs. It has been theoretically proved that ICL is enabled by the capability of Transformers to perform gradient-descent algorithms (Von Oswald et al., 2023a; Bai et al., 2024). This work takes a step further and shows that Transformers can perform learning-to-optimize (L2O) algorithms. Specifically, for the ICL sparse recovery (formulated as LASSO) tasks, we show that a K-layer Transformer can perform an L2O algorithm with a provable convergence rate linear in K. This provides a new perspective explaining the superior ICL capability of Transformers, even with only a few layers, which cannot be achieved by the standard gradient-descent algorithms. Moreover, unlike the conventional L2O algorithms that require the measurement matrix involved in training to match that in testing, the trained Transformer is able to solve sparse recovery problems generated with different measurement matrices. Besides, Transformers as an L2O algorithm can leverage structural information embedded in the training tasks to accelerate its convergence during ICL, and generalize across different lengths of demonstration pairs, where conventional L2O algorithms typically struggle or fail. Such theoretical findings are supported by our experimental results.
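For reference, the kind of standard gradient-descent baseline that the abstract contrasts with L2O can be sketched as ISTA (iterative soft-thresholding) for the LASSO objective $\frac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1$. This is a textbook routine written in plain Python for illustration, not the Transformer construction from the paper; the function names and the tiny test problem are assumptions.

```python
def soft_threshold(v, thr):
    """Proximal operator of thr * |v|."""
    if v > thr:
        return v - thr
    if v < -thr:
        return v + thr
    return 0.0


def ista(A, y, lam, step, iters=100):
    """ISTA for 0.5 * ||A x - y||^2 + lam * ||x||_1.

    A is a list of rows; step should not exceed 1 / (largest eigenvalue of A^T A).
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # residual r = A x - y
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # gradient of the smooth part, g = A^T r
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # gradient step followed by soft-thresholding
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(ista(identity, [3.0, 0.5], lam=1.0, step=1.0))
```

Each ISTA iteration uses a fixed step size and threshold; learning-to-optimize methods instead learn such per-iteration quantities from data.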
Authors: Ethan Rathbun, Christopher Amato, Alina Oprea
Abstract: Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms against training-time, backdoor poisoning attacks. These attacks induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks rely on arbitrarily large perturbations to the agent's rewards to achieve both of these objectives - leaving them open to detection. Thus, in this work, we propose a new class of backdoor attacks against DRL which achieve state of the art performance while minimally altering the agent's rewards. These ``inception'' attacks train the agent to associate the targeted adversarial behavior with high returns by inducing a disjunction between the agent's chosen action and the true action executed in the environment during training. We formally define these attacks and prove they can achieve both adversarial objectives. We then devise an online inception attack which significantly out-performs prior attacks under bounded reward constraints.
Authors: Allison Lau, Younwoo Choi, Vahid Balazadeh, Keertana Chidambaram, Vasilis Syrgkanis, Rahul G. Krishnan
Abstract: Reinforcement Learning from Human Feedback (RLHF) is widely used to align Language Models (LMs) with human preferences. However, existing approaches often neglect individual user preferences, leading to suboptimal personalization. We present the Preference Pretrained Transformer (PPT), a novel approach for adaptive personalization using online user feedback. PPT leverages the in-context learning capabilities of transformers to dynamically adapt to individual preferences. Our approach consists of two phases: (1) an offline phase where we train a single policy model using a history-dependent loss function, and (2) an online phase where the model adapts to user preferences through in-context learning. We demonstrate PPT's effectiveness in a contextual bandit setting, showing that it achieves personalized adaptation superior to existing methods while significantly reducing the computational costs. Our results suggest the potential of in-context learning for scalable and efficient personalization in large language models.
Authors: \"Omer Faruk Akg\"ul, Rajgopal Kannan, Viktor Prasanna
Abstract: Graphs play a crucial role in data mining and machine learning, representing real-world objects and interactions. As graph datasets grow, managing large, decentralized subgraphs becomes essential, particularly within federated learning frameworks. These frameworks face significant challenges, including missing neighbor information, which can compromise model reliability in safety-critical settings. Deployment of federated learning models trained in such settings necessitates quantifying the uncertainty of the models. This study extends the applicability of Conformal Prediction (CP), a well-established method for uncertainty quantification, to federated graph learning. We specifically tackle the missing links issue in distributed subgraphs to minimize its adverse effects on CP set sizes. We discuss data dependencies across the distributed subgraphs and establish conditions for CP validity and precise test-time coverage. We introduce a Variational Autoencoder-based approach for reconstructing missing neighbors to mitigate the negative impact of missing data. Empirical evaluations on real-world datasets demonstrate the efficacy of our approach, yielding smaller prediction sets while ensuring coverage guarantees.
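As background for the CP set sizes mentioned above, here is a minimal sketch of standard split conformal prediction for classification. It assumes exchangeable calibration and test data and uses the score `1 - p(true label)`; it does not model the federated setting, missing links, or the VAE-based reconstruction from the paper, and the function names are hypothetical.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Finite-sample-corrected quantile of the calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when that rank exceeds n (too few calibration points for this alpha).
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(class_probs, qhat):
    """All labels whose nonconformity score 1 - p(label) is at most qhat."""
    return [label for label, p in enumerate(class_probs) if 1.0 - p <= qhat]


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8]
    qhat = conformal_quantile(scores, alpha=0.5)
    print("qhat =", qhat)
    print("set  =", prediction_set([0.6, 0.3, 0.1], qhat))
```

A larger `qhat` yields larger prediction sets, which is why degraded scores caused by missing neighbor information inflate set sizes.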
Authors: Olena Hrynenko, Andrea Cavallaro
Abstract: Privacy personas capture the differences in user segments with respect to one's knowledge, behavioural patterns, level of self-efficacy, and perception of the importance of privacy protection. Modelling these differences is essential for appropriately choosing personalised communication about privacy (e.g. to increase literacy) and for defining suitable choices for privacy enhancing technologies (PETs). While various privacy personas have been derived in the literature, they group together people who differ from each other in terms of important attributes such as perceived or desired level of control, and motivation to use PET. To address this lack of granularity and comprehensiveness in describing personas, we propose eight personas that we derive by combining qualitative and quantitative analysis of the responses to an interactive educational questionnaire. We design an analysis pipeline that uses divisive hierarchical clustering and Boschloo's statistical test of homogeneity of proportions to ensure that the elicited clusters differ from each other based on a statistical measure. Additionally, we propose a new measure for calculating distances between questionnaire responses that accounts for the type of the question (closed- vs open-ended) used to derive traits. We show that the proposed privacy personas statistically differ from each other. We statistically validate the proposed personas and also compare them with personas in the literature, showing that they provide a more granular and comprehensive understanding of user segments, which will make it possible to better assist users with their privacy needs.
Authors: Mohsen Ghassemi, Alan Mishler, Niccolo Dalmasso, Luhao Zhang, Vamsi K. Potluru, Tucker Balch, Manuela Veloso
Abstract: Conditional demographic parity (CDP) is a measure of the demographic parity of a predictive model or decision process when conditioning on an additional feature or set of features. Many algorithmic fairness techniques exist to target demographic parity, but CDP is much harder to achieve, particularly when the conditioning variable has many levels and/or when the model outputs are continuous. The problem of auditing and enforcing CDP is understudied in the literature. In light of this, we propose novel measures of conditional demographic disparity (CDD) which rely on statistical distances borrowed from the optimal transport literature. We further design and evaluate regularization-based approaches based on these CDD measures. Our methods, FairBiT and FairLeap, allow us to target CDP even when the conditioning variable has many levels. When model outputs are continuous, our methods target full equality of the conditional distributions, unlike other methods that only consider first moments or related proxy quantities. We validate the efficacy of our approaches on real-world datasets.
Authors: Giangiacomo Mercatali, Andre Freitas, Jie Chen
Abstract: Interacting systems are prevalent in nature. It is challenging to accurately predict the dynamics of the system if its constituent components are analyzed independently. We develop a graph-based model that unveils the systemic interactions of time series observed at irregular time points, by using a directed acyclic graph to model the conditional dependencies (a form of causal notation) of the system components and learning this graph in tandem with a continuous-time model that parameterizes the solution curves of ordinary differential equations (ODEs). Our technique, a graph neural flow, leads to substantial enhancements over non-graph-based methods, as well as graph-based methods without the modeling of conditional dependencies. We validate our approach on several tasks, including time series classification and forecasting, to demonstrate its efficacy.
Authors: Bryan L. M. de Oliveira, Murilo L. da Luz, Bruno Brand\~ao, Luana G. B. Martins, Telma W. de L. Soares, Luckeciano C. Melo
Abstract: Learning effective visual representations is crucial in open-world environments where agents encounter diverse and unstructured observations. This ability enables agents to extract meaningful information from raw sensory inputs, like pixels, which is essential for generalization across different tasks. However, evaluating representation learning separately from policy learning remains a challenge in most reinforcement learning (RL) benchmarks. To address this, we introduce the Sliding Puzzles Gym (SPGym), a benchmark that extends the classic 15-tile puzzle with variable grid sizes and observation spaces, including large real-world image datasets. SPGym allows scaling the representation learning challenge while keeping the latent environment dynamics and algorithmic problem fixed, providing a targeted assessment of agents' ability to form compositional and generalizable state representations. Experiments with both model-free and model-based RL algorithms, with and without explicit representation learning components, show that as the representation challenge scales, SPGym effectively distinguishes agents based on their capabilities. Moreover, SPGym reaches difficulty levels where no tested algorithm consistently excels, highlighting key challenges and opportunities for advancing representation learning for decision-making research.
Authors: Shashank Hegde, Gautam Salhotra, Gaurav S. Sukhatme
Abstract: With the increasing availability of open-source robotic data, imitation learning has emerged as a viable approach for both robot manipulation and locomotion. Currently, large generalized policies are trained to predict controls or trajectories using diffusion models, which have the desirable property of learning multimodal action distributions. However, generalizability comes with a cost - namely, larger model size and slower inference. Further, there is a known trade-off between performance and action horizon for Diffusion Policy (i.e., diffusing trajectories): fewer diffusion queries accumulate greater trajectory tracking errors. Thus, it is common practice to run these models at high inference frequency, subject to robot computational constraints. To address these limitations, we propose Latent Weight Diffusion (LWD), a method that uses diffusion to learn a distribution over policies for robotic tasks, rather than over trajectories. Our approach encodes demonstration trajectories into a latent space and then decodes them into policies using a hypernetwork. We employ a diffusion denoising model within this latent space to learn its distribution. We demonstrate that LWD can reconstruct the behaviors of the original policies that generated the trajectory dataset. LWD offers the benefits of considerably smaller policy networks during inference and requires fewer diffusion model queries. When tested on the Metaworld MT10 benchmark, LWD achieves a higher success rate compared to a vanilla multi-task policy, while using models up to ~18x smaller during inference. Additionally, since LWD generates closed-loop policies, we show that it outperforms Diffusion Policy in long action horizon settings, with reduced diffusion queries during rollout.
Authors: Eric Yang, Tomas Garcia, Hannah Williams, Bhawesh Kumar, Martin Ram\'e, Eileen Rivera, Yiran Ma, Jonathan Amar, Caricia Catalani, Yugang Jia
Abstract: Effective management of cardiometabolic conditions requires sustained positive nutrition habits, often hindered by complex and individualized barriers. Direct human management is simply not scalable, while previous attempts aimed at automating nutrition coaching lack the personalization needed to address these diverse challenges. This paper introduces a novel LLM-powered agentic workflow designed to provide personalized nutrition coaching by directly targeting and mitigating patient-specific barriers. Grounded in behavioral science principles, the workflow leverages a comprehensive mapping of nutrition-related barriers to corresponding evidence-based strategies. A specialized LLM agent intentionally probes for and identifies the root cause of a patient's dietary struggles. Subsequently, a separate LLM agent delivers tailored tactics designed to overcome those specific barriers with patient context. We designed and validated our approach through a user study with individuals with cardiometabolic conditions, demonstrating the system's ability to accurately identify barriers and provide personalized guidance. Furthermore, we conducted a large-scale simulation study, grounding on real patient vignettes and expert-validated metrics, to evaluate the system's performance across a wide range of scenarios. Our findings demonstrate the potential of this LLM-powered agentic workflow to improve nutrition coaching by providing personalized, scalable, and behaviorally-informed interventions.
Authors: Hariprasath Govindarajan, Per Sid\'en, Jacob Roll, Fredrik Lindsten
Abstract: A prominent self-supervised learning paradigm is to model the representations as clusters, or more generally as a mixture model. Learning to map the data samples to compact representations and fitting the mixture model simultaneously leads to the representation collapse problem. Regularizing the distribution of data points over the clusters is the prevalent strategy to avoid this issue. While this is sufficient to prevent full representation collapse, we show that a partial prototype collapse problem still exists in the DINO family of methods, which leads to significant redundancies in the prototypes. Such prototype redundancies serve as shortcuts for the method to achieve a marginal latent class distribution that matches the prescribed prior. We show that by encouraging the model to use diverse prototypes, the partial prototype collapse can be mitigated. Effective utilization of the prototypes enables the methods to learn more fine-grained clusters, encouraging more informative representations. We demonstrate that this is especially beneficial when pre-training on a long-tailed fine-grained dataset.
Authors: Indrajit Kalita, Lucia Vilallonga, Yves Atchade
Abstract: With a warming planet, tropical regions are expected to experience the brunt of climate change, with more intense and more volatile rainfall events. Currently, state-of-the-art numerical weather prediction (NWP) models are known to struggle to produce skillful rainfall forecasts in tropical regions of Africa. There is thus a pressing need for improved rainfall forecasting in these regions. Over the last decade or so, the increased availability of large-scale meteorological datasets and the development of powerful machine learning models have opened up new opportunities for data-driven weather forecasting. Focusing on Ghana in this study, we use these tools to develop two U-Net convolutional neural network (CNN) models to predict 24h rainfall at 12h and 30h lead-time. The models were trained using data from the ERA5 reanalysis dataset and the GPM-IMERG dataset. Special attention was paid to interpretability. We developed a novel statistical methodology that allowed us to probe the relative importance of the meteorological variables input to our model, offering useful insights into the factors that drive precipitation in the Ghana region. Empirically, we found that our 12h lead-time model performs on par with, and by some measures better than, the 18h lead-time forecasts produced by the ECMWF (as available in the TIGGE dataset). We also found that combining our data-driven model with classical NWP further improves forecast accuracy.
Authors: Yuval Ran-Milo, Eden Lumbroso, Edo Cohen-Karlik, Raja Giryes, Amir Globerson, Nadav Cohen
Abstract: Structured state space models (SSMs), the core engine behind prominent neural networks such as S4 and Mamba, are linear dynamical systems adhering to a specified structure, most notably diagonal. In contrast to typical neural network modules, whose parameterizations are real, SSMs often use complex parameterizations. Theoretically explaining the benefits of complex parameterizations for SSMs is an open problem. The current paper takes a step towards its resolution, by establishing formal gaps between real and complex diagonal SSMs. Firstly, we prove that while a moderate dimension suffices in order for a complex SSM to express all mappings of a real SSM, a much higher dimension is needed for a real SSM to express mappings of a complex SSM. Secondly, we prove that even if the dimension of a real SSM is high enough to express a given mapping, typically, doing so requires the parameters of the real SSM to hold exponentially large values, which cannot be learned in practice. In contrast, a complex SSM can express any given mapping with moderate parameter values. Experiments corroborate our theory, and suggest a potential extension of the theory that accounts for selectivity, a new architectural feature yielding state of the art performance.
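A small numerical sketch can illustrate the difference between real and complex diagonal parameterizations. The code computes the impulse response of a diagonal SSM, $h[t] = \mathrm{Re}\big(\sum_i C_i A_i^t B_i\big)$; taking the real part of the output, the function name, and the example parameters are assumptions made for illustration and are not drawn from the paper.

```python
import cmath


def diagonal_ssm_impulse_response(A, B, C, length):
    """Impulse response of a diagonal SSM with state x[t+1] = A * x[t] + B * u[t].

    A, B, C are equal-length lists holding the diagonal entries and the
    input/output weights; entries may be real or complex.
    """
    return [
        sum(c * (a ** t) * b for a, b, c in zip(A, B, C)).real
        for t in range(length)
    ]


if __name__ == "__main__":
    # One complex state: a damped oscillation 0.9**t * cos(pi * t / 2).
    complex_a = [0.9 * cmath.exp(1j * cmath.pi / 2)]
    print(diagonal_ssm_impulse_response(complex_a, [1.0], [1.0], 6))

    # One real state with the same modulus: a monotone decay 0.9**t.
    print(diagonal_ssm_impulse_response([0.9], [1.0], [1.0], 6))
```

A single complex state already produces a sign-changing, oscillatory response, whereas a single real state with a positive diagonal entry can only decay monotonically; the paper's results quantify how many real states, and how large their parameters, are needed to close gaps of this kind.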
Authors: Arip Asadulaev, Rostislav Korst, Alexander Korotin, Vage Egiazarian, Andrey Filchenkov, Evgeny Burnaev
Abstract: We propose a novel algorithm for offline reinforcement learning using optimal transport. Typically, in offline reinforcement learning, the data is provided by various experts, some of whom can be sub-optimal. To extract an efficient policy, it is necessary to \emph{stitch} the best behaviors from the dataset. To address this problem, we rethink offline reinforcement learning as an optimal transportation problem. Building on this view, we present an algorithm that aims to find a policy that maps states to a \emph{partial} distribution of the best expert actions for each given state. We evaluate the performance of our algorithm on continuous control problems from the D4RL suite and demonstrate improvements over existing methods.
Authors: Brianna Mueller, W. Nick Street, Stephen Baek, Qihang Lin, Jingyi Yang, Yankun Huang
Abstract: Federated learning (FL) enables multiple clients with distributed data sources to collaboratively train a shared model without compromising data privacy. However, existing FL paradigms face challenges due to heterogeneity in client data distributions and system capabilities. Personalized federated learning (pFL) has been proposed to mitigate these problems, but often requires a shared model architecture and a central entity for parameter aggregation, resulting in scalability and communication issues. More recently, model-heterogeneous FL has gained attention due to its ability to support diverse client models, but existing methods are limited by their dependence on a centralized framework, synchronized training, and publicly available datasets. To address these limitations, we introduce Federated Peer-Adaptive Ensemble Learning (FedPAE), a fully decentralized pFL algorithm that supports model heterogeneity and asynchronous learning. Our approach utilizes a peer-to-peer model sharing mechanism and ensemble selection to achieve a more refined balance between local and global information. Experimental results show that FedPAE outperforms existing state-of-the-art pFL algorithms, effectively managing diverse client capabilities and demonstrating robustness against statistical heterogeneity.
Authors: Shangzhe Li, Zhiao Huang, Hao Su
Abstract: Imitation learning (IL) enables agents to acquire skills directly from expert demonstrations, providing a compelling alternative to reinforcement learning. However, prior online IL approaches struggle with complex tasks characterized by high-dimensional inputs and complex dynamics. In this work, we propose a novel approach to online imitation learning that leverages reward-free world models. Our method learns environmental dynamics entirely in latent spaces without reconstruction, enabling efficient and accurate modeling. We adopt the inverse soft-Q learning objective, reformulating the optimization process in the Q-policy space to mitigate the instability associated with traditional optimization in the reward-policy space. By employing a learned latent dynamics model and planning for control, our approach consistently achieves stable, expert-level performance in tasks with high-dimensional observation or action spaces and intricate dynamics. We evaluate our method on a diverse set of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating superior empirical performance compared to existing approaches.
Authors: Fanyu Meng, Jules Larke, Xin Liu, Zhaodan Kong, Xin Chen, Danielle Lemay, Ilias Tagkopoulos
Abstract: Machine learning is revolutionizing nutrition science by enabling systems to learn from data and make intelligent decisions. However, the complexity of these models often leads to challenges in understanding their decision-making processes, necessitating the development of explainability techniques to foster trust and increase model transparency. An under-explored type of explanation is cohort explanation, which provides explanations to groups of instances with similar characteristics. Unlike traditional methods that focus on individual explanations or global model behavior, cohort explainability bridges the gap by providing unique insights at an intermediate granularity. We propose a novel framework for identifying cohorts within a dataset based on local feature importance scores, aiming to generate concise descriptions of the clusters via tags. We evaluate our framework on a food-based inflammation prediction model and demonstrate that the framework can generate reliable explanations that match domain knowledge.
Authors: Eric Elmoznino, Tom Marty, Tejas Kasetty, Leo Gagnon, Sarthak Mittal, Mahan Fathi, Dhanya Sridhar, Guillaume Lajoie
Abstract: The goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best: a principle called Occam's razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam's razor and in-context learning: an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at https://github.com/3rdCore/PrequentialCode.
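To make the notion of prequential coding in the abstract above concrete, here is a minimal sketch: each symbol is encoded using a predictor fitted only on the symbols that came before it, so the total code length is the summed next-symbol log loss. The Laplace add-one Bernoulli predictor and the function name `prequential_code_length` are assumptions made for illustration; the paper studies sequence models such as Transformers, not this toy predictor.

```python
import math


def prequential_code_length(bits):
    """Bits needed to encode a 0/1 sequence prequentially.

    Symbol t is coded with a predictor fitted on symbols 0..t-1,
    here a Laplace add-one estimate (an illustrative assumption).
    """
    ones = 0
    total_bits = 0.0
    for t, x in enumerate(bits):
        # Predicted probability that the next symbol is 1, from the prefix only.
        p_one = (ones + 1) / (t + 2)
        p = p_one if x == 1 else 1.0 - p_one
        total_bits += -math.log2(p)  # next-symbol log loss, in bits
        ones += x
    return total_bits


regular = [1] * 16    # a sequence a simple model explains well
mixed = [1, 0] * 8    # a sequence this predictor cannot compress

print(prequential_code_length(regular))
print(prequential_code_length(mixed))
```

The sequence that the simple predictor learns quickly gets a much shorter code than the one it cannot model, which is the sense in which a low cumulative next-token loss reflects both a good fit and a simple implicit model.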
Authors: Bharath Muppasani, Protik Nag, Vignesh Narayanan, Biplav Srivastava, Michael N. Huhns
Abstract: In this study, we investigate the under-explored intervention planning aimed at disseminating accurate information within dynamic opinion networks by leveraging learning strategies. Intervention planning involves identifying key nodes (search) and exerting control (e.g., disseminating accurate/official information through the nodes) to mitigate the influence of misinformation. However, as network size increases, the problem becomes computationally intractable. To address this, we first introduce a novel ranking algorithm (search) to identify key nodes for disseminating accurate information, which facilitates the training of neural network (NN) classifiers for scalable and generalized solutions. Second, we address the complexity of label generation (through search) by developing a Reinforcement Learning (RL)-based dynamic planning framework. We investigate NN-based RL planners tailored for dynamic opinion networks governed by two propagation models for the framework. Each model incorporates both binary and continuous opinion and trust representations. Our experimental results demonstrate that our ranking algorithm-based classifiers provide plans that enhance infection rate control, especially with increased action budgets. Moreover, reward strategies focusing on key metrics, such as the number of susceptible nodes and infection rates, outperform those prioritizing faster blocking strategies. Additionally, our findings reveal that Graph Convolutional Networks (GCNs)-based planners facilitate scalable centralized plans that achieve lower infection rates (higher control) across various network scenarios (e.g., Watts-Strogatz topology, varying action budgets, varying initial infected nodes, and varying degree of infected nodes).
Authors: Alberto Del Pia, Dekun Zhou, Yinglun Zhu
Abstract: Sparse Principal Component Analysis (Sparse PCA) is a pivotal tool in data analysis and dimensionality reduction. However, Sparse PCA is a challenging problem in both theory and practice: it is known to be NP-hard and current exact methods generally require exponential runtime. In this paper, we propose a novel framework to efficiently approximate Sparse PCA by (i) approximating the general input covariance matrix with a re-sorted block-diagonal matrix, (ii) solving the Sparse PCA sub-problem in each block, and (iii) reconstructing the solution to the original problem. Our framework is simple and powerful: it can leverage any off-the-shelf Sparse PCA algorithm and achieve significant computational speedups, with a minor additive error that is linear in the approximation error of the block-diagonal matrix. Suppose $g(k, d)$ is the runtime of an algorithm (approximately) solving Sparse PCA in dimension $d$ and with sparsity value $k$. Our framework, when integrated with this algorithm, reduces the runtime to $\mathcal{O}\left(\frac{d}{d^\star} \cdot g(k, d^\star) + d^2\right)$, where $d^\star \leq d$ is the largest block size of the block-diagonal matrix. For instance, integrating our framework with the Branch-and-Bound algorithm reduces the complexity from $g(k, d) = \mathcal{O}(k^3\cdot d^k)$ to $\mathcal{O}(k^3\cdot d \cdot (d^\star)^{k-1})$, demonstrating exponential speedups if $d^\star$ is small. We perform large-scale evaluations on many real-world datasets: for an exact Sparse PCA algorithm, our method achieves an average speedup factor of 93.77, while maintaining an average approximation error of 2.15%; for an approximate Sparse PCA algorithm, our method achieves an average speedup factor of 6.77 and an average approximation error of merely 0.37%.
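The runtime expressions in the abstract above can be compared numerically. The sketch below plugs hypothetical values ($d = 1000$, $d^\star = 50$, $k = 5$) into $g(k, d) = k^3 d^k$ and into $\frac{d}{d^\star} g(k, d^\star) + d^2$, dropping the constants hidden by the $\mathcal{O}$ notation and assuming $d$ is divisible by $d^\star$. The function names are illustrative and the numbers are not results from the paper.

```python
def branch_and_bound_cost(k, d):
    """Operation count k^3 * d^k, with big-O constants dropped."""
    return k**3 * d**k


def block_framework_cost(k, d, d_star):
    """Operation count (d / d*) * g(k, d*) + d^2, with constants dropped.

    Assumes d is divisible by d_star, so there are d // d_star blocks.
    """
    return (d // d_star) * branch_and_bound_cost(k, d_star) + d**2


k, d, d_star = 5, 1000, 50  # hypothetical problem sizes

full = branch_and_bound_cost(k, d)
blocked = block_framework_cost(k, d, d_star)

print(full)
print(blocked)
print(full / blocked)
```

Ignoring the $d^2$ term, the ratio is $(d / d^\star)^{k-1}$, so the saving grows exponentially in $k$ when the largest block is much smaller than the full dimension.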
Authors: Haoyu He, Haozheng Luo, Qi R. Wang
Abstract: Predicting human mobility across multiple cities presents significant challenges due to the complex and diverse spatial-temporal dynamics inherent in different urban environments. In this study, we propose a robust approach to predict human mobility patterns called ST-MoE-BERT. Compared to existing methods, our approach frames the prediction task as a spatial-temporal classification problem. Our methodology integrates the Mixture-of-Experts architecture with the BERT model to capture complex mobility dynamics and perform the downstream human mobility prediction task. Additionally, transfer learning is integrated to address the challenge of data scarcity in cross-city prediction. We demonstrate the effectiveness of the proposed model on GEO-BLEU and DTW, comparing it to several state-of-the-art methods. Notably, ST-MoE-BERT achieves an average improvement of 8.29%.
Authors: Robert Spencer, Surangika Ranathunga, Mikael Boulic, Andries (Hennie) van Heerden, Teo Susnjak
Abstract: This study investigates the application of Transfer Learning (TL) on Transformer architectures to enhance building energy consumption forecasting. Transformers are a relatively new deep learning architecture, which has served as the foundation for groundbreaking technologies such as ChatGPT. While TL has been studied in the past, these studies considered either one TL strategy or used older deep learning models such as Recurrent Neural Networks or Convolutional Neural Networks. Here, we carry out an extensive empirical study on six different TL strategies and analyse their performance under varying feature spaces. In addition to the vanilla Transformer architecture, we also experiment with Informer and PatchTST, specifically designed for time series forecasting. We use 16 datasets from the Building Data Genome Project 2 to create building energy consumption forecasting models. Experiment results reveal that while TL is generally beneficial, especially when the target domain has no data, the exact TL strategy should be selected carefully to gain the maximum benefit. This decision largely depends on the feature space properties such as the recorded weather features. We also note that PatchTST outperforms the other two Transformer variants (vanilla Transformer and Informer). We believe our findings would assist researchers in making informed decisions when using TL and Transformer architectures for building energy consumption forecasting.
Authors: Seong Ho Pahng, Sahand Hormoz
Abstract: Graph Neural Networks (GNNs) traditionally employ a message-passing mechanism that resembles diffusion over undirected graphs, which often leads to homogenization of node features and reduced discriminative power in tasks such as node classification. Our key insight for addressing this limitation is to assign fuzzy edge directions -- that can vary continuously from node $i$ pointing to node $j$ to vice versa -- to the edges of a graph so that features can preferentially flow in one direction between nodes to enable long-range information transmission across the graph. We also introduce a novel complex-valued Laplacian for directed graphs with fuzzy edges where the real and imaginary parts represent information flow in opposite directions. Using this Laplacian, we propose a general framework, called Continuous Edge Direction (CoED) GNN, for learning on graphs with fuzzy edges and prove its expressivity limits using a generalization of the Weisfeiler-Leman (WL) graph isomorphism test for directed graphs with fuzzy edges. Our architecture aggregates neighbor features scaled by the learned edge directions and processes the aggregated messages from in-neighbors and out-neighbors separately alongside the self-features of the nodes. Since continuous edge directions are differentiable, they can be learned jointly with the GNN weights via gradient-based optimization. CoED GNN is particularly well-suited for graph ensemble data where the graph structure remains fixed but multiple realizations of node features are available, such as in gene regulatory networks, web connectivity graphs, and power grids. We demonstrate through extensive experiments on both synthetic and real datasets that learning continuous edge directions significantly improves performance both for undirected and directed graphs compared with existing methods.
Authors: Min Wen, Chengchang Liu, Ahmed Abdelmoniem, Yipeng Zhou, Yuedong Xu
Abstract: Bilevel optimization, crucial for hyperparameter tuning, meta-learning and reinforcement learning, remains less explored in the decentralized learning paradigm, such as decentralized federated learning (DFL). Typically, decentralized bilevel methods rely on both gradients and Hessian matrices to approximate hypergradients of upper-level models. However, acquiring and sharing the second-order oracle is compute and communication intensive. To overcome these challenges, this paper introduces a fully first-order decentralized method for decentralized bilevel optimization, $\text{C}^2$DFB, which is both compute- and communication-efficient. In $\text{C}^2$DFB, each learning node optimizes a min-min-max problem to approximate the hypergradient using gradient information exclusively. To reduce the traffic load in the inner loop that solves the lower-level problem, $\text{C}^2$DFB incorporates a lightweight communication protocol for efficiently transmitting compressed residuals of local parameters. Rigorous theoretical analysis ensures its convergence, indicating $\tilde{\mathcal{O}}(\epsilon^{-4})$ first-order oracle calls. Experiments on hyperparameter tuning and hyper-representation tasks validate the superiority of $\text{C}^2$DFB across various topologies and heterogeneous data distributions.
Authors: Van Tuan Nguyen, Razvan Beuran
Abstract: This paper proposes a novel federated learning approach for improving IoT network intrusion detection. The rise of IoT has expanded the cyber attack surface, making traditional centralized machine learning methods insufficient due to concerns about data availability, computational resources, transfer costs, and especially privacy preservation. A semi-supervised federated learning model was developed to overcome these issues, combining the Shrink Autoencoder and Centroid one-class classifier (SAE-CEN). This approach enhances the performance of intrusion detection by effectively representing normal network data and accurately identifying anomalies in the decentralized strategy. Additionally, a mean square error-based aggregation algorithm (MSEAvg) was introduced to improve global model performance by prioritizing more accurate local models. The results obtained in our experimental setup, which uses various settings relying on the N-BaIoT dataset and Dirichlet distribution, demonstrate significant improvements in real-world heterogeneous IoT networks in detection accuracy from 93.98$\pm$2.90 to 97.30$\pm$0.49, reduced learning costs when requiring only 50\% of gateways participating in the training process, and robustness in large-scale networks.
Authors: Kavinayan P. Sivakumar, Yi Shen, Zachary Bell, Scott Nivison, Boyuan Chen, Michael M. Zavlanos
Abstract: In this paper, we study an inverse reinforcement learning problem that involves learning the reward function of a learning agent using trajectory data collected while this agent is learning its optimal policy. To address this problem, we propose an inverse reinforcement learning method that allows us to estimate the policy parameters of the learning agent, which can then be used to estimate its reward function. Our method relies on a new variant of the behavior cloning algorithm, which we call bundle behavior cloning, and uses a small number of trajectories generated by the learning agent's policy at different points in time to learn a set of policies that match the distribution of actions observed in the sampled trajectories. We then use the cloned policies to train a neural network model that estimates the reward function of the learning agent. We provide a theoretical analysis showing a complexity bound for our method that improves on standard behavior cloning, as well as numerical experiments on a reinforcement learning problem that validate the proposed method.
Authors: Shaoming Xu, Arvind Renganathan, Ankush Khandelwal, Rahul Ghosh, Xiang Li, Licheng Liu, Kshitij Tayal, Peter Harrington, Xiaowei Jia, Zhenong Jin, John Nieber, Vipin Kumar
Abstract: Streamflow, vital for water resource management, is governed by complex hydrological systems involving intermediate processes driven by meteorological forces. While deep learning models have achieved state-of-the-art results in streamflow prediction, their end-to-end single-task learning approach often fails to capture the causal relationships within these systems. To address this, we propose Hierarchical Conditional Multi-Task Learning (HCMTL), a hierarchical approach that jointly models soil water and snowpack processes based on their causal connections to streamflow. HCMTL utilizes task embeddings to connect network modules, enhancing flexibility and expressiveness while capturing unobserved processes beyond soil water and snowpack. It also incorporates the Conditional Mini-Batch strategy to improve long time series modeling. We compare HCMTL with five baselines on a global dataset. HCMTL's superior performance across hundreds of drainage basins over extended periods shows that integrating domain-specific causal knowledge into deep learning enhances both prediction accuracy and interpretability. This is essential for advancing our understanding of complex hydrological systems and supporting efficient water resource management to mitigate natural disasters like droughts and floods.
Authors: Shuyang Wang, Diego Klabjan
Abstract: Recent work by Woodworth et al. (2020) shows that the optimization dynamics of gradient descent for overparameterized problems can be viewed as low-dimensional dual dynamics induced by a mirror map, explaining the implicit regularization phenomenon from the mirror descent perspective. However, the methodology does not apply to algorithms where update directions deviate from true gradients, such as ADAM. We use the mirror descent framework to study the dynamics of smoothed sign descent with a stability constant $\varepsilon$ for regression problems. We propose a mirror map that establishes equivalence to dual dynamics under some assumptions. By studying dual dynamics, we characterize the convergent solution as an approximate KKT point of minimizing a Bregman divergence style function, and show the benefit of tuning the stability constant $\varepsilon$ to reduce the KKT error.
Authors: Kushagra Pandey, Jaideep Pathak, Yilun Xu, Stephan Mandt, Michael Pritchard, Arash Vahdat, Morteza Mardani
Abstract: Diffusion models achieve state-of-the-art generation quality across many applications, but their ability to capture rare or extreme events in heavy-tailed distributions remains unclear. In this work, we show that traditional diffusion and flow-matching models with standard Gaussian priors fail to capture heavy-tailed behavior. We address this by repurposing the diffusion framework for heavy-tail estimation using multivariate Student-t distributions. We develop a tailored perturbation kernel and derive the denoising posterior based on the conditional Student-t distribution for the backward process. Inspired by $\gamma$-divergence for heavy-tailed distributions, we derive a training objective for heavy-tailed denoisers. The resulting framework introduces controllable tail generation using only a single scalar hyperparameter, making it easily tunable for diverse real-world distributions. As specific instantiations of our framework, we introduce t-EDM and t-Flow, extensions of existing diffusion and flow models that employ a Student-t prior. Remarkably, our approach is readily compatible with standard Gaussian diffusion models and requires only minimal code changes. Empirically, we show that our t-EDM and t-Flow outperform standard diffusion models in heavy-tail estimation on high-resolution weather datasets in which generating rare and extreme events is crucial.
Authors: Quang Dang, Murat Kucukosmanoglu, Michael Anoruo, Golshan Kargosha, Sarah Conklin, Justin Brooks
Abstract: Assessing cognitive workload is crucial for human performance as it affects information processing, decision making, and task execution. Pupil size is a valuable indicator of cognitive workload, reflecting changes in attention and arousal governed by the autonomic nervous system. Cognitive events are closely linked to cognitive workload as they activate mental processes and trigger cognitive responses. This study explores the potential of using machine learning to automatically detect cognitive events experienced by individuals. We framed the problem as a binary classification task, focusing on detecting stimulus onset across four cognitive tasks using CNN models and 1-second pupillary data. The results, measured by the Matthews correlation coefficient, ranged from 0.47 to 0.80, depending on the cognitive task. This paper discusses the trade-offs between generalization and specialization, model behavior when encountering unseen stimulus onset times, structural variances among cognitive tasks, factors influencing model predictions, and real-time simulation. These findings highlight the potential of machine learning techniques in detecting cognitive events based on pupil and eye movement responses, contributing to advancements in personalized learning and optimizing neurocognitive workload management.
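For readers unfamiliar with the metric reported in the abstract above, the Matthews correlation coefficient for a binary classifier can be computed from the four confusion-matrix counts. The sketch below uses the standard formula; the function name `mcc`, the convention of returning 0.0 when the denominator is zero, and the example counts are illustrative choices and are not taken from the study.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # If any marginal is empty the coefficient is undefined; 0.0 is a
    # common convention and is assumed here.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a stimulus-onset detector on 100 windows.
print(mcc(tp=40, tn=40, fp=10, fn=10))
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and unlike accuracy it stays informative when the two classes are imbalanced.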
Authors: Felix Krones, Ben Walker, Terry Lyons, Adam Mahdi
Abstract: This work presents our team's (SignalSavants) winning contribution to the 2024 George B. Moody PhysioNet Challenge. The Challenge had two goals: reconstruct ECG signals from printouts and classify them for cardiac diseases. Our focus was the first task. Despite many ECGs being digitally recorded today, paper ECGs remain common throughout the world. Digitising them could help build more diverse datasets and enable automated analyses. However, the presence of varying recording standards and poor image quality requires a data-centric approach for developing robust models that can generalise effectively. Our approach combines the creation of a diverse training set, Hough transform to rotate images, a U-Net based segmentation model to identify individual signals, and mask vectorisation to reconstruct the signals. We assessed the performance of our models using the 10-fold stratified cross-validation (CV) split of 21,799 recordings proposed by the PTB-XL dataset. On the digitisation task, our model achieved an average CV signal-to-noise ratio of 17.02 and an official Challenge score of 12.15 on the hidden set, securing first place in the competition. Our study shows the challenges of building robust, generalisable, digitisation approaches. Such models require large amounts of resources (data, time, and computational power) but have great potential in diversifying the data available.
Authors: Sehun Kim
Abstract: A persistence diagram provides a compact summary of persistent homology, which captures the topological features of a space at different scales. However, due to its nature as a set, incorporating it as a feature into a machine learning framework is challenging. Several methods have been proposed to use persistence diagrams as input for machine learning models, but they often require complex preprocessing steps and extensive hyperparameter tuning. In this paper, we propose a novel transformer architecture called the \textit{Extended Persistence Transformer (xPerT)}, which is more scalable than Persformer, an existing transformer for persistence diagrams. xPerT reduces GPU memory usage by over 90\% and improves accuracy on multiple datasets. Additionally, xPerT does not require complex preprocessing steps or extensive hyperparameter tuning, making it easy to use in practice. Our code is available at https://github.com/sehunfromdaegu/ECG_JEPA.
Authors: Mushir Akhtar, A. Quadir, M. Tanveer, Mohd. Arshad (for the Alzheimer's Disease Neuroimaging Initiative)
Abstract: Alzheimer's disease (AD) is a leading neurodegenerative condition and the primary cause of dementia, characterized by progressive cognitive decline and memory loss. Its progression, marked by shrinkage in the cerebral cortex, is irreversible. Numerous machine learning algorithms have been proposed for the early diagnosis of AD. However, they often struggle with the issues of noise, outliers, and class imbalance. To tackle the aforementioned limitations, in this article, we introduce a novel, robust, and flexible membership scheme called Flexi-Fuzz. This scheme integrates a novel flexible weighting mechanism, class probability, and imbalance ratio. The proposed flexible weighting mechanism assigns the maximum weight to samples within a specific proximity to the center, with a gradual decrease in weight beyond a certain threshold. This approach ensures that samples near the class boundary still receive significant weight, maintaining their influence in the classification process. Class probability is used to mitigate the impact of noisy samples, while the imbalance ratio addresses class imbalance. Leveraging this, we incorporate the proposed Flexi-Fuzz membership scheme into the least squares support vector machines (LSSVM) framework, resulting in a robust and flexible model termed Flexi-Fuzz-LSSVM. We determine the class-center using two methods: the conventional mean approach and an innovative median approach, leading to two model variants, Flexi-Fuzz-LSSVM-I and Flexi-Fuzz-LSSVM-II. To validate the effectiveness of the proposed Flexi-Fuzz-LSSVM models, we evaluated them on benchmark UCI and KEEL datasets, both with and without label noise. Additionally, we tested the models on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset for AD diagnosis. Experimental results demonstrate the superiority of the Flexi-Fuzz-LSSVM models over baseline models.
Authors: Chayan Maitra, Rajat K. De
Abstract: Visualizing high-dimensional datasets through a generalized embedding has long been a challenge. Several methods have been proposed for this purpose, but none has been able to generate a generalized embedding that can both reveal the hidden patterns present in the data and generate realistic high-dimensional samples from it. Motivated by this aspect, in this study, a novel generative model, called G-NeuroDAVIS, has been developed, which is capable of visualizing high-dimensional data through a generalized embedding, and thereby generating new samples. The model leverages advanced generative techniques to produce high-quality embeddings that capture the underlying structure of the data more effectively than existing methods. G-NeuroDAVIS can be trained in both supervised and unsupervised settings. We rigorously evaluated our model through a series of experiments, demonstrating superior performance in classification tasks, which highlights the robustness of the learned representations. Furthermore, the conditional sample generation capability of the model has been described through qualitative assessments, revealing a marked improvement in generating realistic and diverse samples. G-NeuroDAVIS has outperformed the Variational Autoencoder (VAE) significantly in multiple key aspects, including embedding quality, classification performance, and sample generation capability. These results underscore the potential of our generative model to serve as a powerful tool in various applications requiring high-quality data generation and representation learning.
Authors: Runjia Li, Qiwei Di, Quanquan Gu
Abstract: Score-based diffusion models have emerged as powerful techniques for generating samples from high-dimensional data distributions. These models involve a two-phase process: first, injecting noise to transform the data distribution into a known prior distribution, and second, sampling to recover the original data distribution from noise. Among the various sampling methods, deterministic samplers stand out for their enhanced efficiency. However, analyzing these deterministic samplers presents unique challenges, as they preclude the use of established techniques such as Girsanov's theorem, which are only applicable to stochastic samplers. Furthermore, existing analysis for deterministic samplers usually focuses on specific examples, lacking a generalized approach for general forward processes and various deterministic samplers. Our paper addresses these limitations by introducing a unified convergence analysis framework. To demonstrate the power of our framework, we analyze the variance-preserving (VP) forward process with the exponential integrator (EI) scheme, achieving iteration complexity of $\tilde O(d^2/\epsilon)$. Additionally, we provide a detailed analysis of Denoising Diffusion Implicit Models (DDIM)-type samplers, which have been underexplored in previous research, achieving polynomial iteration complexity.
Authors: Manuel Brenner, Christoph Jürgen Hemmer, Zahra Monfared, Daniel Durstewitz
Abstract: Dynamical systems (DS) theory is fundamental for many areas of science and engineering. It can provide deep insights into the behavior of systems evolving in time, as typically described by differential or recursive equations. A common approach to facilitate mathematical tractability and interpretability of DS models involves decomposing nonlinear DS into multiple linear DS separated by switching manifolds, i.e. piecewise linear (PWL) systems. PWL models are popular in engineering and a frequent choice in mathematics for analyzing the topological properties of DS. However, hand-crafting such models is tedious and only possible for very low-dimensional scenarios, while inferring them from data usually gives rise to unnecessarily complex representations with very many linear subregions. Here we introduce Almost-Linear Recurrent Neural Networks (AL-RNNs), which automatically and robustly produce the most parsimonious PWL representations of DS from time series data, using as few PWL nonlinearities as possible. AL-RNNs can be efficiently trained with any SOTA algorithm for dynamical systems reconstruction (DSR), and naturally give rise to a symbolic encoding of the underlying DS that provably preserves important topological properties. We show that for the Lorenz and Rössler systems, AL-RNNs discover, in a purely data-driven way, the known topologically minimal PWL representations of the corresponding chaotic attractors. We further illustrate on two challenging empirical datasets that interpretable symbolic encodings of the dynamics can be achieved, tremendously facilitating mathematical and computational analysis of the underlying systems.
Authors: Daniel Riccio, Genoveffa Tortora, Mara Sangiovanni
Abstract: In many application domains, the proliferation of sensors and devices is generating vast volumes of data, imposing significant pressure on existing data analysis and data mining techniques. Nevertheless, an increase in data volume does not inherently imply an increase in informational content, as a substantial portion may be redundant or represent noise. This challenge is particularly evident in the deep learning domain, where the utility of additional data is contingent on its informativeness. In the absence of such, larger datasets merely exacerbate the computational cost and complexity of the learning process. To address these challenges, we propose RAZOR, a novel instance selection technique designed to extract a significantly smaller yet sufficiently informative subset from a larger set of instances without compromising the learning process. RAZOR has been specifically engineered to be robust, efficient, and scalable, making it suitable for large-scale datasets. Unlike many techniques in the literature, RAZOR is capable of operating in both supervised and unsupervised settings. Experimental results demonstrate that RAZOR outperforms recent state-of-the-art techniques in terms of both effectiveness and efficiency.
Authors: Zhibin Wang, Shipeng Li, Yuhang Zhou, Xue Li, Rong Gu, Nguyen Cam-Tu, Chen Tian, Sheng Zhong
Abstract: Large language models (LLMs) have achieved remarkable performance and are widely deployed in various applications, while the serving of LLM inference has raised concerns about user experience and serving throughput. Accordingly, service level objectives (SLOs) and goodput (the number of requests that meet SLOs per second) have been introduced to evaluate the performance of LLM serving. However, existing metrics fail to capture the nature of user experience. We observe two ridiculous phenomena in existing metrics: 1) delaying token delivery can smooth the tail time between tokens (tail TBT) of a request, and 2) dropping a request that fails to meet the SLOs midway can improve goodput. In this paper, we revisit SLO and goodput metrics in LLM serving and propose a unified metric framework, smooth goodput, which incorporates SLOs and goodput to reflect the nature of user experience in LLM serving. The framework can adapt to the specific goals of different tasks by setting parameters. We re-evaluate the performance of different LLM serving systems under multiple workloads based on this unified framework and provide possible directions for future optimization of existing strategies. We hope that this framework can provide a unified standard for evaluating LLM serving and foster research in the field of LLM serving optimization to move in a cohesive direction.
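The first phenomenon named in this abstract can be illustrated with a small sketch. It is not the paper's metric framework: the function names, the nearest-rank quantile, and the rule that a request meets its SLO when its tail TBT is within a threshold are all assumptions made for illustration.

```python
from typing import List, Sequence


def time_between_tokens(arrivals: Sequence[float]) -> List[float]:
    """Gaps between consecutive token delivery times (seconds)."""
    return [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]


def tail_tbt(arrivals: Sequence[float], q: float = 0.99) -> float:
    """Nearest-rank q-quantile of the gaps (an assumed definition of tail TBT)."""
    gaps = sorted(time_between_tokens(arrivals))
    index = min(len(gaps) - 1, int(q * len(gaps)))
    return gaps[index]


def goodput(requests: Sequence[Sequence[float]], tbt_slo: float, window_s: float) -> float:
    """Requests whose tail TBT meets the SLO, per second of the observation window."""
    met = sum(1 for arrivals in requests if tail_tbt(arrivals) <= tbt_slo)
    return met / window_s


# Hypothetical delivery times: the last token arrives at 0.8 s in both cases.
original = [0.0, 0.1, 0.2, 0.8]   # one long stall before the last token
delayed = [0.0, 0.25, 0.5, 0.8]   # earlier tokens held back to even out the gaps
```

Under this assumed definition, `delayed` has a smaller tail TBT than `original` even though no token is delivered earlier, so with a 0.4 s threshold only the delayed request counts toward goodput.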
Authors: Christopher Holder, Anthony Bagnall, Jason Lines
Abstract: There is a long history of research into time series clustering using distance-based partitional clustering. Many of the most popular algorithms adapt k-means (also known as Lloyd's algorithm) to exploit time dependencies in the data by specifying a time series distance function. However, these algorithms are often presented with k-means configured in various ways, altering key parameters such as the initialisation strategy. This variability makes it difficult to compare studies because k-means is known to be highly sensitive to its configuration. To address this, we propose a standard Lloyd's-based model for time series clustering (TSCL) that adopts an end-to-end approach, incorporating a specialised distance function not only in the assignment step but also in the initialisation and stopping criteria. By doing so, we create a unified structure for comparing seven popular Lloyd's-based TSCL algorithms. This common framework enables us to more easily attribute differences in clustering performance to the distance function itself, rather than variations in the k-means configuration.
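A minimal sketch of the idea described here, in which one distance function drives initialisation, assignment, and the stopping test. It is not the authors' implementation: the farthest-first initialisation, the arithmetic-mean centre update, and all names and defaults are assumptions (elastic distances are usually paired with a matching averaging method instead of the plain mean).

```python
import numpy as np


def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Placeholder distance; any time series distance with this signature fits."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def lloyds(X, k, distance=euclidean, max_iter=50, tol=1e-6, seed=0):
    """Lloyd's algorithm where `distance` is used at every stage."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Initialisation (assumed farthest-first): each new centre is the series
    # farthest, under `distance`, from the centres chosen so far.
    centres = [X[rng.integers(len(X))]]
    while len(centres) < k:
        nearest = np.array([min(distance(x, c) for c in centres) for x in X])
        centres.append(X[int(np.argmax(nearest))])
    centres = np.array(centres)

    for _ in range(max_iter):
        # Assignment: each series joins its closest centre under `distance`.
        labels = np.array(
            [int(np.argmin([distance(x, c) for c in centres])) for x in X]
        )
        # Update (assumed arithmetic mean); an empty cluster keeps its centre.
        new_centres = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
             for j in range(k)]
        )
        # Stopping: largest centre movement, measured with the same distance.
        shift = max(distance(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return labels, centres


series = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [5.0, 5.0, 5.0], [5.1, 5.0, 4.9]])
labels, centres = lloyds(series, k=2)
```

Swapping `euclidean` for another distance changes all three stages at once, which is the property that lets performance differences be attributed to the distance and not to the k-means configuration.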
Authors: Uttam Suman, Mariya Mamajiwala, Mukul Saxena, Ankit Tyagi, Debasish Roy
Abstract: We propose a new stochastic optimizer for non-convex and possibly non-smooth objective functions typically defined over large dimensional design spaces. Towards this, we have tried to bridge noise-assisted global search and faster local convergence, the latter being the characteristic feature of a Newton-like search. Our specific scheme, FINDER (Filtering Informed Newton-like and Derivative-free Evolutionary Recursion), exploits the nonlinear stochastic filtering equations to arrive at a derivative-free update that resembles the Newton search employing the inverse Hessian of the objective function. Following certain simplifications of the update to enable a linear scaling with dimension and a few other enhancements, we apply FINDER to a range of problems, from some IEEE benchmark objective functions, through a couple of archetypal data-driven problems in deep networks, to certain cases of physics-informed deep networks. The performance of the new method vis-\`a-vis the well-known Adam and a few others bears evidence to its promise for large dimensional optimization problems of practical interest.
Authors: Tonglong Wei, Yan Lin, Youfang Lin, Shengnan Guo, Jilin Hu, Gao Cong, Huaiyu Wan
Abstract: Spatiotemporal trajectory data is vital for web-of-things services and is extensively collected and analyzed by web-based hardware and platforms. However, issues such as service interruptions and network instability often lead to sparsely recorded trajectories, resulting in a loss of detailed movement data. As a result, recovering these trajectories to restore missing information becomes essential. Despite progress, several challenges remain unresolved. First, the lack of large-scale dense trajectory data hampers the performance of existing deep learning methods, which rely heavily on abundant data for supervised training. Second, current methods struggle to generalize across sparse trajectories with varying sampling intervals, necessitating separate re-training for each interval and increasing computational costs. Third, external factors crucial for the recovery of missing points are not fully incorporated. To address these challenges, we propose a framework called PTR. This framework mitigates the issue of limited dense trajectory data by leveraging the capabilities of pre-trained language models (PLMs). PTR incorporates an explicit trajectory prompt and is trained on datasets with multiple sampling intervals, enabling it to generalize effectively across different intervals in sparse trajectories. To capture external factors, we introduce an implicit trajectory prompt that models road conditions, providing richer information for recovering missing points. Additionally, we present a trajectory embedder that encodes trajectory points and transforms the embeddings of both observed and missing points into a format comprehensible to PLMs. Experimental results on two public trajectory datasets with three sampling intervals demonstrate the efficacy and scalability of PTR.
Authors: Lukas Tatzel, B\'alint Mucs\'anyi, Osane Hackel, Philipp Hennig
Abstract: Quadratic approximations form a fundamental building block of machine learning methods. For example, second-order optimizers try to find the Newton step into the minimum of a local quadratic proxy to the objective function, and the second-order approximation of a network's loss function can be used to quantify the uncertainty of its outputs via the Laplace approximation. When computations on the entire training set are intractable (typical for deep learning), the relevant quantities are computed on mini-batches. This, however, distorts and biases the shape of the associated stochastic quadratic approximations in an intricate way with detrimental effects on applications. In this paper, we (i) show that this bias introduces a systematic error, (ii) provide a theoretical explanation for it, (iii) explain its relevance for second-order optimization and uncertainty quantification via the Laplace approximation in deep learning, and (iv) develop and evaluate debiasing strategies.
Authors: Sharv Murgai, Hrishikesh Bhagwat, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Abstract: Carbon emissions are rising at an alarming rate, posing a significant threat to global efforts to mitigate climate change. Electric vehicles (EVs) have emerged as a promising solution, but their reliance on lithium-ion batteries introduces the critical challenge of battery degradation. Accurate prediction and forecasting of battery degradation over both short and long time spans are essential for optimizing performance, extending battery life, and ensuring effective long-term energy management. This directly influences the reliability, safety, and sustainability of EVs, supporting their widespread adoption and aligning with key UN SDGs. In this paper, we present a novel approach to the prediction and long-term forecasting of battery degradation using a Scientific Machine Learning (SciML) framework, which integrates domain knowledge with neural networks, offering more interpretable and scientifically grounded solutions for both predicting short-term battery health and forecasting degradation over extended periods. This hybrid approach captures both known and unknown degradation dynamics, improving predictive accuracy while reducing data requirements. We incorporate ground-truth data to inform our models, ensuring that both the predictions and forecasts reflect practical conditions. The model achieved an MSE of 9.90 with the UDE and 11.55 with the NeuralODE, in experimental data, a loss of 1.6986 with the UDE, and an MSE of 2.49 in the NeuralODE, demonstrating the enhanced precision of our approach. This integration of data-driven insights with SciML's strengths in interpretability and scalability allows for robust battery management. By enhancing battery longevity and minimizing waste, our approach contributes to the sustainability of energy systems and accelerates the global transition toward cleaner, more responsible energy solutions, aligning with the UN's SDG agenda.
Authors: Jialin Yu, Yuxiang Zhou, Yulan He, Nevin L. Zhang, Ricardo Silva
Abstract: The fine-tuning of pre-trained language models (PLMs) has been shown to be effective across various domains. By using domain-specific supervised data, the general-purpose representation derived from PLMs can be transformed into a domain-specific representation. However, these methods often fail to generalize to out-of-domain (OOD) data due to their reliance on non-causal representations, often described as spurious features. Existing methods either make use of adjustments with strong assumptions about lack of hidden common causes, or mitigate the effect of spurious features using multi-domain data. In this work, we investigate how fine-tuned pre-trained language models aid generalizability from single-domain scenarios under mild assumptions, targeting more general and practical real-world scenarios. We show that a robust representation can be derived through a so-called causal front-door adjustment, based on a decomposition assumption, using fine-tuned representations as a source of data augmentation. Comprehensive experiments in both synthetic and real-world settings demonstrate the superior generalizability of the proposed method compared to existing approaches. Our work thus sheds light on the domain generalization problem by introducing links between fine-tuning and causal mechanisms into representation learning.
Authors: Mingqian Li, Qiao Han, Yiteng Zhai, Ruifeng Li, Yao Yang, Hongyang Chen
Abstract: In multi-task learning, we often encounter the case when the presence of labels across samples exhibits irregular patterns: samples can be fully labeled, partially labeled or unlabeled. Taking drug analysis as an example, multiple toxicity properties of a drug molecule may not be concurrently available due to experimental limitations. It triggers a demand for a new training and inference mechanism that could accommodate irregularly present labels and maximize the utility of any available label information. In this work, we focus on the two-label learning task, and propose a novel training and inference framework, Dual-Label Learning (DLL). The DLL framework formulates the problem into a dual-function system, in which the two functions should simultaneously satisfy standard supervision, structural duality and probabilistic duality. DLL features a dual-tower model architecture that explicitly captures the information exchange between labels, aimed at maximizing the utility of partially available labels in understanding label correlation. During training, label imputation for missing labels is conducted as part of the forward propagation process, while during inference, labels are regarded as unknowns of a bivariate system of equations and are solved jointly. Theoretical analysis guarantees the feasibility of DLL, and extensive experiments are conducted to verify that by explicitly modeling label correlation and maximizing the utility of available labels, our method makes consistently better predictions than baseline approaches by up to a 10% gain in F1-score or MAPE. Remarkably, our method provided with data at a label missing rate as high as 60% can achieve similar or even better results than baseline approaches at a label missing rate of only 10%.
Authors: Peter A. Wijeratne, Daniel C. Alexander
Abstract: Disease progression models infer group-level temporal trajectories of change in patients' features as a chronic degenerative condition plays out. They provide unique insight into disease biology and staging systems with individual-level clinical utility. Discrete models consider disease progression as a latent permutation of events, where each event corresponds to a feature becoming measurably abnormal. However, permutation inference using traditional maximum likelihood approaches becomes prohibitive due to combinatoric explosion, severely limiting model dimensionality and utility. Here we leverage ideas from optimal transport to model disease progression as a latent permutation matrix of events belonging to the Birkhoff polytope, facilitating fast inference via optimisation of the variational lower bound. This enables a factor of 1000 times faster inference than the current state of the art and, correspondingly, supports models with several orders of magnitude more features than the current state of the art can consider. Experiments demonstrate the increase in speed, accuracy and robustness to noise in simulation. Further experiments with real-world imaging data from two separate datasets, one from Alzheimer's disease patients, the other age-related macular degeneration, showcase, for the first time, pixel-level disease progression events in the brain and eye, respectively. Our method is low compute, interpretable and applicable to any progressive condition and data modality, giving it broad potential clinical utility.
Authors: Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xingwei Wang, Xiaocun Cao, Jie Zhang, Dacheng Tao
Abstract: Model merging-based multitask learning (MTL) offers a promising approach for performing MTL by merging multiple expert models without requiring access to raw training data. However, in this paper, we examine the merged model's representation distribution and uncover a critical issue of "representation bias". This bias arises from a significant distribution gap between the representations of the merged and expert models, leading to the suboptimal performance of the merged MTL model. To address this challenge, we first propose a representation surgery solution called Surgery. Surgery is a lightweight, task-specific module that aligns the final layer representations of the merged model with those of the expert models, effectively alleviating bias and improving the merged model's performance. Despite these improvements, a performance gap remains compared to the traditional MTL method. Further analysis reveals that representation bias phenomena exist at each layer of the merged model, and aligning representations only in the last layer is insufficient for fully reducing systemic bias because biases introduced at each layer can accumulate and interact in complex ways. To tackle this, we then propose a more comprehensive solution, deep representation surgery (also called SurgeryV2), which mitigates representation bias across all layers, and thus bridges the performance gap between model merging-based MTL and traditional MTL. Finally, we design an unsupervised optimization objective to optimize both the Surgery and SurgeryV2 modules. Our experimental results show that incorporating these modules into state-of-the-art (SOTA) model merging schemes leads to significant performance gains. Notably, our SurgeryV2 scheme reaches almost the same level as individual expert models or the traditional MTL model. The code is available at \url{https://github.com/EnnengYang/SurgeryV2}.
Authors: Boning Zhang, Dongzhu Liu, Osvaldo Simeone, Guanchu Wang, Dimitrios Pezaros, Guangxu Zhu
Abstract: To support real-world decision-making, it is crucial for models to be well-calibrated, i.e., to assign reliable confidence estimates to their predictions. Uncertainty quantification is particularly important in personalized federated learning (PFL), as participating clients typically have small local datasets, making it difficult to unambiguously determine optimal model parameters. Bayesian PFL (BPFL) methods can potentially enhance calibration, but they often come with considerable computational and memory requirements due to the need to track the variances of all the individual model parameters. Furthermore, different clients may exhibit heterogeneous uncertainty levels owing to varying local dataset sizes and distributions. To address these challenges, we propose LR-BPFL, a novel BPFL method that learns a global deterministic model along with personalized low-rank Bayesian corrections. To tailor the local model to each client's inherent uncertainty level, LR-BPFL incorporates an adaptive rank selection mechanism. We evaluate LR-BPFL across a variety of datasets, demonstrating its advantages in terms of calibration, accuracy, as well as computational and memory requirements.
Authors: Konstantin Grotov, Artem Borzilov, Maksim Krivobok, Timofey Bryksin, Yaroslav Zharov
Abstract: Computational notebooks have become indispensable tools for research-related development, offering unprecedented interactivity and flexibility in the development process. However, these benefits come at the cost of reproducibility and an increased potential for bugs. With the rise of code-fluent Large Language Models empowered with agentic techniques, smart bug-fixing tools with a high level of autonomy have emerged. However, those tools are tuned for classical script programming and still struggle with non-linear computational notebooks. In this paper, we present an AI agent designed specifically for error resolution in a computational notebook. We have developed an agentic system capable of exploring a notebook environment by interacting with it -- similar to how a user would -- and integrated the system into the JetBrains service for collaborative data science called Datalore. We evaluate our approach against the pre-existing single-action solution by comparing costs and conducting a user study. Users rate the error resolution capabilities of the agentic system higher but experience difficulties with the UI. We share the results of the study and consider them valuable for further improving user-agent collaboration.
Authors: Pauline B\'eraud, Margaux Rioux, Michel Babany, Philippe de La Chevasnerie, Damien Theis, Giacomo Teodori, Chlo\'e Pinguet, Romane Rigaud, Fran\c{c}ois Leclerc
Abstract: Electricity forecasting has been a recurring research topic, as it is key to finding the right balance between production and consumption. While most papers focus on the national or regional scale, few address the household level. Disaggregated forecasting is a common topic in the Machine Learning (ML) literature but lacks the explainability that household energy forecasts require. This paper specifically targets the challenges of forecasting electricity use at the household level. We compare common ML algorithms on household electricity forecasts, weighing their pros and cons, including accuracy and explainability, using well-known key metrics. We also evaluate them against the business challenges specific to this sector, such as explainability and resistance to outliers. We introduce a custom decision tree that aims to provide a fair estimate of energy consumption while being explainable and consistent with human intuition. We show that this novel method allows greater explainability without sacrificing much accuracy. The custom tree methodology can be used in various business use cases but is subject to limitations, such as a lack of resilience to outliers.
Authors: Santanu Rathod, Pietro Lio, Xiao Zhang
Abstract: We develop a novel data-driven framework as an alternative to dynamic flux balance analysis, bypassing the demand for deep domain knowledge and manual efforts to formulate the optimization problem. The proposed framework is end-to-end, which trains a structured neural ODE process (SNODEP) model to estimate flux and balance samples using gene-expression time-series data. SNODEP is designed to circumvent the limitations of the standard neural ODE process model, including restricting the latent and decoder sampling distributions to be normal and lacking structure between context points for calculating the latent, thus more suitable for modeling the underlying dynamics of a metabolic system. Through comprehensive experiments ($156$ in total), we demonstrate that SNODEP not only predicts the unseen time points of real-world gene-expression data and the flux and balance estimates well but can even generalize to more challenging unseen knockout configurations and irregular data sampling scenarios, all essential for metabolic pathway analysis. We hope our work can serve as a catalyst for building more scalable and powerful models for genome-scale metabolic analysis. Our code is available at: \url{https://github.com/TrustMLRG/SNODEP}.
Authors: Nandi Schoots, Alex Jackson, Ali Kholmovaia, Peter McBurney, Murray Shanahan
Abstract: Does the process of training a neural network to solve a task tend to use all of the available weights even when the task could be solved with fewer weights? To address this question we study the effects of pruning fully connected, convolutional and residual models while varying their widths. We find that the proportion of weights that can be pruned without degrading performance is largely invariant to model size. Increasing the width of a model has little effect on the density of the pruned model relative to the increase in absolute size of the pruned network. In particular, we find substantial prunability across a large range of model sizes, where our biggest model is 50 times as wide as our smallest model. We explore three hypotheses that could explain these findings.
Authors: Jialu Tang, Tong Xia, Yuan Lu, Cecilia Mascolo, Aaqib Saeed
Abstract: Electrocardiogram (ECG) interpretation requires specialized expertise, often involving synthesizing insights from ECG signals with complex clinical queries posed in natural language. The scarcity of labeled ECG data coupled with the diverse nature of clinical inquiries presents a significant challenge for developing robust and adaptable ECG diagnostic systems. This work introduces a novel multimodal meta-learning method for few-shot ECG question answering, addressing the challenge of limited labeled data while leveraging the rich knowledge encoded within large language models (LLMs). Our LLM-agnostic approach integrates a pre-trained ECG encoder with a frozen LLM (e.g., LLaMA and Gemma) via a trainable fusion module, enabling the language model to reason about ECG data and generate clinically meaningful answers. Extensive experiments demonstrate superior generalization to unseen diagnostic tasks compared to supervised baselines, achieving notable performance even with limited ECG leads. For instance, in a 5-way 5-shot setting, our method using LLaMA-3.1-8B achieves accuracy of 84.6%, 77.3%, and 69.6% on single verify, choose and query question types, respectively. These results highlight the potential of our method to enhance clinical ECG interpretation by combining signal processing with the nuanced language understanding capabilities of LLMs, particularly in data-constrained scenarios.
Authors: Jannatun Nayeem Pinky, Ramya Akula
Abstract: Cryptocurrencies, as decentralized digital assets, have experienced rapid growth and adoption, with over 23,000 cryptocurrencies and a market capitalization nearing \$1.1 trillion (about \$3,400 per person in the US) as of 2023. This dynamic market presents significant opportunities and risks, highlighting the need for accurate price prediction models to manage volatility. This chapter comprehensively reviews machine learning (ML) techniques applied to cryptocurrency price prediction from 2014 to 2024. We explore various ML algorithms, including linear models, tree-based approaches, and advanced deep learning architectures such as transformers and large language models. Additionally, we examine the role of sentiment analysis in capturing market sentiment from textual data like social media posts and news articles to anticipate price fluctuations. With expertise in optimizing complex systems and processes, industrial engineers are pivotal in enhancing these models. They contribute by applying principles of process optimization, efficiency, and risk mitigation to improve computational performance and data management. This chapter highlights the evolving landscape of cryptocurrency price prediction, the integration of emerging technologies, and the significant role of industrial engineers in refining predictive models. By addressing current limitations and exploring future research directions, this chapter aims to advance the development of more accurate and robust prediction systems, supporting better-informed investment decisions and more stable market behavior.
Authors: Vladimir R. Kostic, Karim Lounici, H\'el\`ene Halconruy, Timoth\'ee Devergne, Pietro Novelli, Massimiliano Pontil
Abstract: Markov processes serve as a universal model for many real-world random processes. This paper presents a data-driven approach for learning these models through the spectral decomposition of the infinitesimal generator (IG) of the Markov semigroup. The unbounded nature of IGs complicates traditional methods such as vector-valued regression and Hilbert-Schmidt operator analysis. Existing techniques, including physics-informed kernel regression, are computationally expensive and limited in scope, with no recovery guarantees for transfer operator methods when the time-lag is small. We propose a novel method that leverages the IG's resolvent, characterized by the Laplace transform of transfer operators. This approach is robust to time-lag variations, ensuring accurate eigenvalue learning even for small time-lags. Our statistical analysis applies to a broader class of Markov processes than current methods while reducing computational complexity from quadratic to linear in the state dimension. Finally, we illustrate the behaviour of our method in two experiments.
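The phrase "resolvent, characterized by the Laplace transform of transfer operators" refers to a standard identity from semigroup theory, written out below. The symbols $L$ (generator), $A_t$ (transfer operator at time-lag $t$), $R_\eta$ (resolvent) and $\eta$ (shift parameter) are notation assumed here for illustration and are not taken from the abstract.

```latex
% Assumed notation: L is the infinitesimal generator, A_t = e^{tL} the transfer
% operator at time-lag t, and \eta > 0 a shift parameter.
\[
  R_\eta \;=\; (\eta I - L)^{-1} \;=\; \int_0^\infty e^{-\eta t}\, A_t \,\mathrm{d}t ,
  \qquad \eta > 0 .
\]
% Eigenpairs carry over: if L f = \lambda f with \operatorname{Re}\lambda \le 0, then
\[
  R_\eta f \;=\; \frac{1}{\eta - \lambda}\, f .
\]
```

Because $R_\eta$ is a bounded operator with the same eigenfunctions as $L$, eigenvalues of the unbounded generator can be read off from those of the resolvent via $\lambda = \eta - 1/\mu$, where $\mu$ is the corresponding eigenvalue of $R_\eta$.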
Authors: Kavinayan P. Sivakumar, Yan Zhang, Zachary Bell, Scott Nivison, Michael M. Zavlanos
Abstract: In this paper, we consider a transfer reinforcement learning problem involving agents with different action spaces. Specifically, for any new unseen task, the goal is to use a successful demonstration of this task by an expert agent in its action space to enable a learner agent to learn an optimal policy in its own different action space with fewer samples than those required if the learner were learning on its own. Existing transfer learning methods across different action spaces either require handcrafted mappings between those action spaces provided by human experts, which can induce bias in the learning procedure, or require the expert agent to share its policy parameters with the learner agent, which does not generalize well to unseen tasks. In this work, we propose a method that learns a subgoal mapping between the expert agent policy and the learner agent policy. Since the expert agent and the learner agent have different action spaces, their optimal policies can have different subgoal trajectories. We learn this subgoal mapping by training a Long Short Term Memory (LSTM) network for a distribution of tasks and then use this mapping to predict the learner subgoal sequence for unseen tasks, thereby improving the speed of learning by biasing the agent's policy towards the predicted learner subgoal sequence. Through numerical experiments, we demonstrate that the proposed learning scheme can effectively find the subgoal mapping underlying the given distribution of tasks. Moreover, letting the learner agent imitate the expert agent's policy with the learnt subgoal mapping can significantly improve the sample efficiency and training time of the learner agent in unseen new tasks.
Authors: Matthew J. Vowels, Mathieu Rochat, Sina Akbari
Abstract: Artificial Neural Networks (ANNs), including fully-connected networks and transformers, are highly flexible and powerful function approximators, widely applied in fields like computer vision and natural language processing. However, their inability to inherently respect causal structures can limit their robustness, making them vulnerable to covariate shift and difficult to interpret or explain. This poses significant challenges for their reliability in real-world applications. In this paper, we introduce Causal Fully-Connected Neural Networks (CFCNs) and Causal Transformers (CaTs), two general model families designed to operate under predefined causal constraints, as specified by a Directed Acyclic Graph (DAG). These models retain the powerful function approximation abilities of traditional neural networks while adhering to the underlying structural constraints, improving robustness, reliability, and interpretability at inference time. This approach opens new avenues for deploying neural networks in more demanding, real-world scenarios where robustness and explainability are critical.
Authors: Seunghan Lee, Kibok Lee, Taeyoung Park
Abstract: Advances in diffusion models for generative artificial intelligence have recently propagated to the time series (TS) domain, demonstrating state-of-the-art performance on various tasks. However, prior works on TS diffusion models often borrow the framework of existing works proposed in other domains without considering the characteristics of TS data, leading to suboptimal performance. In this work, we propose Adaptive Noise schedule for Time series diffusion models (ANT), which automatically predetermines proper noise schedules for given TS datasets based on their statistics representing non-stationarity. Our intuition is that an optimal noise schedule should satisfy the following desiderata: 1) it linearly reduces the non-stationarity of TS data so that all diffusion steps are equally meaningful, 2) the data is corrupted to random noise at the final step, and 3) the number of steps is sufficiently large. The proposed method is practical in that it eliminates the need to search for the optimal noise schedule, at a small additional cost of computing the statistics for given datasets, which can be done offline before training. We validate the effectiveness of our method across various tasks, including TS forecasting, refinement, and generation, on datasets from diverse domains. Code is available at this repository: https://github.com/seunghan96/ANT.
Authors: Owen Cook, Charlie Grimshaw, Ben Wu, Sophie Dillon, Jack Hicks, Luke Jones, Thomas Smith, Matyas Szert, Xingyi Song
Abstract: Misinformation spreads rapidly on social media, confusing the truth and targeting potentially vulnerable people. To effectively mitigate the negative impact of misinformation, it must first be accurately detected before applying a mitigation strategy, such as X's community notes, which is currently a manual process. This study takes a knowledge-based approach to misinformation detection, modelling the problem similarly to one of natural language inference. The EffiARA annotation framework is introduced, aiming to utilise inter- and intra-annotator agreement to understand the reliability of each annotator and influence the training of large language models for classification based on annotator reliability. In assessing the EffiARA annotation framework, the Russo-Ukrainian Conflict Knowledge-Based Misinformation Classification Dataset (RUC-MCD) was developed and made publicly available. This study finds that sample weighting using annotator reliability performs the best, utilising both inter- and intra-annotator agreement and soft-label training. The highest macro-F1 achieved was 0.757 using Llama-3.2-1B and 0.740 using TwHIN-BERT-large.
Authors: Joshua Nathaniel Williams, Anurag Katakkar, Hoda Heidari, J. Zico Kolter
Abstract: Counterfactual explanations have been a popular method of post-hoc explainability for a variety of settings in Machine Learning. Such methods focus on explaining classifiers by generating new data points that are similar to a given reference, while receiving a more desirable prediction. In this work, we investigate a framing for counterfactual generation methods that considers counterfactuals not as independent draws from a region around the reference, but as jointly sampled with the reference from the underlying data distribution. Through this framing, we derive a distance metric, tailored for counterfactual similarity that can be applied to a broad range of settings. Through both quantitative and qualitative analyses of counterfactual generation methods, we show that this framing allows us to express more nuanced dependencies among the covariates.
Authors: Arthur Emanuel de Oliveira Carosia
Abstract: Cryptocurrencies have gained significant attention in recent years due to their decentralized nature and potential for financial innovation. Thus, the ability to accurately predict their prices has become a subject of great interest for investors, traders, and researchers. Some works in the literature show how Bitcoin's market sentiment correlates with its price fluctuations in the market. However, papers that consider the sentiment of the market associated with financial Technical Analysis indicators in order to predict Bitcoin's price are still scarce. In this paper, we present a novel approach for predicting Bitcoin price movements by combining the Fear & Greed Index, a measure of market sentiment, Technical Analysis indicators, and the potential of Machine Learning algorithms. This work represents a preliminary study on the importance of sentiment metrics in cryptocurrency forecasting. Our initial experiments demonstrate promising results considering investment returns, surpassing the Buy & Hold baseline, and offering valuable insights about the combination of indicators of sentiment and market in a cryptocurrency prediction model.
Authors: Ravil Mussabayev, Rustam Mussabayev
Abstract: K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets. To address this limitation, we propose a novel heuristic algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic to optimize K-means clustering for big data. Our approach is based on the sequential optimization of the partial objective function landscapes obtained by restricting the Minimum Sum-of-Squares Clustering (MSSC) formulation to random samples from the original big dataset. Within each landscape, systematically expanding neighborhoods of the currently best (incumbent) solution are explored by reinitializing all degenerate and a varying number of additional centroids. Extensive and rigorous experimentation on a large number of real-world datasets reveals that by transforming the traditional local search into a global one, our algorithm significantly enhances the accuracy and efficiency of K-means clustering in big data environments, becoming the new state of the art in the field.
Authors: Mikhail Mironov, Liudmila Prokhorenkova
Abstract: The concept of diversity is widely used in various applications: from image or molecule generation to recommender systems. Thus, being able to properly measure diversity is important. This paper addresses the problem of quantifying diversity for a set of objects. First, we make a systematic review of existing diversity measures and explore their undesirable behavior in some cases. Based on this review, we formulate three desirable properties (axioms) of a reliable diversity measure: monotonicity, uniqueness, and continuity. We show that none of the existing measures has all three properties and thus these measures are not suitable for quantifying diversity. Then, we construct two examples of measures that have all the desirable properties, thus proving that the list of axioms is not self-contradicting. Unfortunately, the constructed examples are too computationally complex for practical use, thus we pose an open problem of constructing a diversity measure that has all the listed properties and can be computed in practice.
Authors: Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang
Abstract: Large language models of high parameter counts are computationally expensive, yet can be made much more efficient by compressing their weights to very low numerical precision. This can be achieved either through post-training quantization by minimizing local, layer-wise quantization errors, or through quantization-aware fine-tuning by minimizing the global loss function. In this study, we discovered that, under the same data constraint, the former approach nearly always fared worse than the latter, a phenomenon particularly prominent when the numerical precision is very low. We further showed that this difficulty of post-training quantization arose from stark misalignment between optimization of the local and global objective functions. Our findings explain the limited utility of minimizing local quantization error and the importance of direct quantization-aware fine-tuning, in the regime of large models at very low precision.
Authors: Nazanin Nezami, Hadis Anahideh
Abstract: Optimizing costly black-box functions within a constrained evaluation budget presents significant challenges in many real-world applications. Surrogate Optimization (SO) is a common resolution, yet its proprietary nature introduced by the complexity of surrogate models and the sampling core (e.g., acquisition functions) often leads to a lack of explainability and transparency. While existing literature has primarily concentrated on enhancing convergence to global optima, the practical interpretation of newly proposed strategies remains underexplored, especially in batch evaluation settings. In this paper, we propose \emph{Inclusive} Explainability Metrics for Surrogate Optimization (IEMSO), a comprehensive set of model-agnostic metrics designed to enhance the transparency, trustworthiness, and explainability of the SO approaches. Through these metrics, we provide both intermediate and post-hoc explanations to practitioners before and after performing expensive evaluations to gain trust. We consider four primary categories of metrics, each targeting a specific aspect of the SO process: Sampling Core Metrics, Batch Properties Metrics, Optimization Process Metrics, and Feature Importance. Our experimental evaluations demonstrate the significant potential of the proposed metrics across different benchmarks.
Authors: Rachel S. Y. Teo, Tan M. Nguyen
Abstract: Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled scalability in deep learning. SMoE has the potential to exponentially increase parameter count while maintaining the efficiency of the model by only activating a small subset of these parameters for a given sample. However, it has been observed that SMoE suffers from unstable training and has difficulty adapting to new distributions, leading to the model's lack of robustness to data contamination. To overcome these limitations, we first establish a connection between the dynamics of the expert representations in SMoEs and gradient descent on a multi-objective optimization problem. Leveraging our framework, we then integrate momentum into SMoE and propose a new family of SMoEs named MomentumSMoE. We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In particular, we verify the advantages of MomentumSMoE over SMoE on a variety of practical tasks including ImageNet-1K object recognition and WikiText-103 language modeling. We demonstrate the applicability of MomentumSMoE to many types of SMoE models, including those in the Sparse MoE model for vision (V-MoE) and the Generalist Language Model (GLaM). We also show that other advanced momentum-based optimization methods, such as Adam, can be easily incorporated into the MomentumSMoE framework for designing new SMoE models with even better performance, almost negligible additional computation cost, and simple implementations.
Authors: Lihi Idan
Abstract: Unsupervised validation of anomaly-detection models is a highly challenging task. While the common practices for model validation involve a labeled validation set, such validation sets cannot be constructed when the underlying datasets are unlabeled. The lack of robust and efficient unsupervised model-validation techniques presents an acute challenge in the implementation of automated anomaly-detection pipelines, especially when there exists no prior knowledge of the model's performance on similar datasets. This work presents a new paradigm to automated validation of anomaly-detection models, inspired by real-world, collaborative decision-making mechanisms. We focus on two commonly-used, unsupervised model-validation tasks -- model selection and model evaluation -- and provide extensive experimental results that demonstrate the accuracy and robustness of our approach on both tasks.
Authors: Aaron Alvarado Kristanto Julistiono, Davoud Ataee Tarzanagh, Navid Azizan
Abstract: Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.
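The mirror descent update with an $\ell_p$-type potential mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the potential $\psi(w) = \frac{1}{p}\sum_i |w_i|^p$ with $p > 1$ (the $1/p$ scaling is an assumption made here for convenience), so the mirror map is $\nabla\psi(w)_i = \mathrm{sign}(w_i)\,|w_i|^{p-1}$. The function names are hypothetical, and this generic step is not the authors' attention-specific algorithm.

```python
import math


def mirror_map(w, p):
    """Gradient of psi(w) = (1/p) * sum |w_i|^p, applied coordinate-wise."""
    return [math.copysign(abs(x) ** (p - 1), x) for x in w]


def inverse_mirror_map(z, p):
    """Inverse of mirror_map: recovers w from its dual coordinates z."""
    return [math.copysign(abs(v) ** (1.0 / (p - 1)), v) for v in z]


def md_step(w, grad, lr, p):
    """One mirror descent step: a gradient step in dual space, then map back.

    Assumes p > 1 so that the mirror map is invertible.
    """
    if p <= 1:
        raise ValueError("this sketch assumes p > 1")
    z = mirror_map(w, p)
    z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return inverse_mirror_map(z, p)
```

With `p = 2` the mirror map is the identity, so `md_step` reduces to an ordinary gradient descent step; other values of `p` change the geometry of the update.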
Authors: Baran Atalar, Carlee Joe-Wong
Abstract: We consider the contextual combinatorial bandit setting where in each round, the learning agent, e.g., a recommender system, selects a subset of "arms," e.g., products, and observes rewards for both the individual base arms, which are a function of known features (called "context"), and the super arm (the subset of arms), which is a function of the base arm rewards. The agent's goal is to simultaneously learn the unknown reward functions and choose the highest-reward arms. For example, the "reward" may represent a user's probability of clicking on one of the recommended products. Conventional bandit models, however, employ restrictive reward function models in order to obtain performance guarantees. We make use of deep neural networks to estimate and learn the unknown reward functions and propose Neural UCB Clustering (NeUClust), which adopts a clustering approach to select the super arm in every round by exploiting underlying structure in the context space. Unlike prior neural bandit works, NeUClust uses a neural network to estimate the super arm reward and select the super arm, thus eliminating the need for a known optimization oracle. We non-trivially extend prior neural combinatorial bandit works to prove that NeUClust achieves $\widetilde{O}\left(\widetilde{d}\sqrt{T}\right)$ regret, where $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix, $T$ the number of rounds. Experiments on real world recommendation datasets show that NeUClust achieves better regret and reward than other contextual combinatorial and neural bandit algorithms.
Authors: Namid R. Stillman, Rory Baggott
Abstract: Deep generative models are becoming increasingly used as tools for financial analysis. However, it is unclear how these models will influence financial markets, especially when they infer financial value in a semi-autonomous way. In this work, we explore the interplay between deep generative models and market dynamics. We develop a form of virtual traders that use deep generative models to make buy/sell decisions, which we term neuro-symbolic traders, and expose them to a virtual market. Under our framework, neuro-symbolic traders are agents that use vision-language models to discover a model of the fundamental value of an asset. Agents develop this model as a stochastic differential equation, calibrated to market data using gradient descent. We test our neuro-symbolic traders on both synthetic data and real financial time series, including an equity stock, a commodity, and a foreign exchange pair. We then expose several groups of neuro-symbolic traders to a virtual market environment. This market environment allows for feedback between the traders' beliefs about the underlying value and the observed price dynamics. We find that this leads to price suppression compared to the historical data, highlighting a future risk to market stability. Our work is a first step towards quantifying the effect of deep generative agents on market dynamics and sets out some of the potential risks and benefits of this approach in the future.
Authors: Jessica Dai, Nika Haghtalab, Eric Zhao
Abstract: A canonical desideratum for prediction problems is that performance guarantees should hold not just on average over the population, but also for meaningful subpopulations within the overall population. But what constitutes a meaningful subpopulation? In this work, we take the perspective that relevant subpopulations should be defined with respect to the clusters that naturally emerge from the distribution of individuals for which predictions are being made. In this view, a population refers to a mixture model whose components constitute the relevant subpopulations. We suggest two formalisms for capturing per-subgroup guarantees: first, by attributing each individual to the component from which they were most likely drawn, given their features; and second, by attributing each individual to all components in proportion to their relative likelihood of having been drawn from each component. Using online calibration as a case study, we study a variational algorithm that provides guarantees for each of these formalisms by handling all plausible underlying subpopulation structures simultaneously, and achieves an $O(T^{1/2})$ rate even when the subpopulations are not well-separated. In comparison, the more natural cluster-then-predict approach that first recovers the structure of the subpopulations and then makes predictions suffers from an $O(T^{2/3})$ rate and requires the subpopulations to be separable. Along the way, we prove that providing per-subgroup calibration guarantees for underlying clusters can be easier than learning the clusters: separation between median subgroup features is required for the latter but not the former.
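The two attribution formalisms described in the abstract above can be made concrete with a small sketch. It assumes, purely for illustration, a one-dimensional Gaussian mixture with known weights, means, and standard deviations; the function names are hypothetical and nothing here reproduces the paper's calibration algorithm.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def soft_attribution(x, weights, means, stds):
    """Second formalism: attribute x to every component in proportion to
    its relative (posterior) likelihood of having produced x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


def hard_attribution(x, weights, means, stds):
    """First formalism: attribute x to the single most likely component."""
    resp = soft_attribution(x, weights, means, stds)
    return max(range(len(resp)), key=resp.__getitem__)
```

For a point midway between two equally weighted, equal-variance components, the soft attribution splits it evenly, while the hard attribution must pick one index (ties here go to the lowest index).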
Authors: Yang Ba, Michelle V. Mancenido, Rong Pan
Abstract: To enhance the generalization of machine learning models to unseen data, techniques such as dropout, weight decay ($L_2$ regularization), and noise augmentation are commonly employed. While regularization methods (i.e., dropout and weight decay) are geared toward adjusting model parameters to prevent overfitting, data augmentation increases the diversity of the input training set, a method purported to improve accuracy and calibration error. In this paper, we investigate the impact of each of these techniques on the parameter space of neural networks, with the goal of understanding how they alter the weight landscape in transfer learning scenarios. To accomplish this, we employ Random Matrix Theory to analyze the eigenvalue distributions of pre-trained models, fine-tuned using these techniques but using different levels of data diversity, for the same downstream tasks. We observe that diverse data influences the weight landscape in a similar fashion as dropout. Additionally, we compare commonly used data augmentation methods with synthetic data created by generative models. We conclude that synthetic data can bring more diversity into real input data, resulting in a better performance on out-of-distribution test instances.
Authors: Shih-Hsin Wang, Justin Baker, Cory Hauck, Bao Wang
Abstract: The pioneering work of Oono and Suzuki [ICLR, 2020] and Cai and Wang [arXiv:2006.13318] initiated the analysis of the smoothness of graph convolutional network (GCN) features. Their results reveal an intricate empirical correlation between node classification accuracy and the ratio of smooth to non-smooth feature components. However, the optimal ratio that favors node classification is unknown, and the non-smooth features of deep GCNs with ReLU or leaky ReLU activation functions diminish. In this paper, we propose a new strategy to let GCN learn node features with a desired smoothness -- adapting to data and tasks -- to enhance node classification. Our approach has three key steps: (1) We establish a geometric relationship between the input and output of ReLU or leaky ReLU. (2) Building on our geometric insights, we augment the message-passing process of graph convolutional layers (GCLs) with a learnable term to modulate the smoothness of node features with computational efficiency. (3) We investigate the achievable ratio between smooth and non-smooth feature components for GCNs with the augmented message-passing scheme. Our extensive numerical results show that the augmented message-passing schemes significantly improve node classification for GCN and some related models.
Authors: Mohamed Elsayed, Gautham Vasan, A. Rupam Mahmood
Abstract: Natural intelligence processes experience as a continuous stream, sensing, acting, and learning moment-by-moment in real time. Streaming learning, the modus operandi of classic reinforcement learning (RL) algorithms like Q-learning and TD, mimics natural learning by using the most recent sample without storing it. This approach is also ideal for resource-constrained, communication-limited, and privacy-sensitive applications. However, in deep RL, learners almost always use batch updates and replay buffers, making them computationally expensive and incompatible with streaming learning. Although the prevalence of batch deep RL is often attributed to its sample efficiency, a more critical reason for the absence of streaming deep RL is its frequent instability and failure to learn, which we refer to as the stream barrier. This paper introduces the stream-x algorithms, the first class of deep RL algorithms to overcome the stream barrier for both prediction and control and match the sample efficiency of batch RL. Through experiments in Mujoco Gym, DM Control Suite, and Atari Games, we demonstrate the stream barrier in existing algorithms and successful stable learning with our stream-x algorithms: stream Q, stream AC, and stream TD, achieving the best model-free performance in DM Control Dog environments. A set of common techniques underlies the stream-x algorithms, enabling their success with a single set of hyperparameters and allowing for easy extension to other algorithms, thereby reviving streaming RL.
Authors: Chuang Yang, Renhe Jiang, Xiaohang Xu, Chuan Xiao, Kaoru Sezaki
Abstract: Free-space trajectory similarity calculation, e.g., DTW, Hausdorff, and Fréchet, often incurs quadratic time complexity; thus, learning-based methods have been proposed to accelerate the computation. The core idea is to train an encoder to transform trajectories into representation vectors and then compute vector similarity to approximate the ground truth. However, existing methods face dual challenges of effectiveness and efficiency: 1) they all utilize Euclidean distance to compute representation similarity, which leads to a severe curse-of-dimensionality issue -- reducing the distinguishability among representations and significantly affecting the accuracy of subsequent similarity search tasks; 2) most of them are trained in a triplet manner and often necessitate additional information, which degrades efficiency; 3) previous studies, while emphasizing scalability in terms of efficiency, overlooked the deterioration of effectiveness when the dataset size grows. To cope with these issues, we propose a simple, yet accurate, fast, scalable model that only uses a single-layer vanilla transformer encoder as the feature extractor and employs tailored representation similarity functions to approximate various ground truth similarity measures. Extensive experiments demonstrate that our model significantly mitigates the curse-of-dimensionality issue and outperforms state-of-the-art methods in effectiveness, efficiency, and scalability.
Authors: Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi
Abstract: In processing multiple time series, accounting for the individual features of each sequence can be challenging. To address this, modern deep learning methods for time series analysis combine a shared (global) model with local layers, specific to each time series, often implemented as learnable embeddings. Ideally, these local embeddings should encode meaningful representations of the unique dynamics of each sequence. However, when these are learned end-to-end as parameters of a forecasting model, they may end up acting as mere sequence identifiers. Shared processing blocks may then become reliant on such identifiers, limiting their transferability to new contexts. In this paper, we address this issue by investigating methods to regularize the learning of local learnable embeddings for time series processing. Specifically, we perform the first extensive empirical study on the subject and show how such regularizations consistently improve performance in widely adopted architectures. Furthermore, we show that methods preventing the co-adaptation of local and global parameters are particularly effective in this context. This hypothesis is validated by comparing several methods preventing the downstream models from relying on sequence identifiers, going as far as completely resetting the embeddings during training. The obtained results provide an important contribution to understanding the interplay between learnable local parameters and shared processing layers: a key challenge in modern time series processing models and a step toward developing effective foundation models for time series.
Authors: David R. Johnson, Joyce Chew, Siddharth Viswanath, Edward De Brouwer, Deanna Needell, Smita Krishnaswamy, Michael Perlmutter
Abstract: In order to better understand manifold neural networks (MNNs), we introduce Manifold Filter-Combine Networks (MFCNs). The filter-combine framework parallels the popular aggregate-combine paradigm for graph neural networks (GNNs) and naturally suggests many interesting families of MNNs which can be interpreted as the manifold analog of various popular GNNs. We then propose a method for implementing MFCNs on high-dimensional point clouds that relies on approximating the manifold by a sparse graph. We prove that our method is consistent in the sense that it converges to a continuum limit as the number of data points tends to infinity.
Authors: Junyu Cao, Ruijiang Gao, Esmaeil Keyvanshokooh
Abstract: Human doctors frequently recommend actionable recourses that allow patients to modify their conditions to access more effective treatments. Inspired by such healthcare scenarios, we propose the Recourse Linear UCB ($\textsf{RLinUCB}$) algorithm, which optimizes both action selection and feature modifications by balancing exploration and exploitation. We further extend this to the Human-AI Linear Recourse Bandit ($\textsf{HR-Bandit}$), which integrates human expertise to enhance performance. $\textsf{HR-Bandit}$ offers three key guarantees: (i) a warm-start guarantee for improved initial performance, (ii) a human-effort guarantee to minimize required human interactions, and (iii) a robustness guarantee that ensures sublinear regret even when human decisions are suboptimal. Empirical results, including a healthcare case study, validate its superior performance against existing benchmarks.
Authors: Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh
Abstract: The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by \emph{dynamic, non-uniform} compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on heuristics for identifying the "importance" of a given layer towards the loss, based on assumptions such as \emph{error monotonicity}, i.e. that the end-to-end model compression error is proportional to the sum of layer-wise errors. In this paper, we revisit this area, and propose a new and general approach for dynamic compression that is provably optimal in a given input range. We begin from the motivating observation that, in general, \emph{error monotonicity does not hold for LLMs}: compressed models with lower sum of per-layer errors can perform \emph{worse} than models with higher error sums. To address this, we propose a new general evolutionary framework for dynamic LLM compression called EvoPress, which has provable convergence, and low sample and evaluation complexity. We show that these theoretical guarantees lead to highly competitive practical performance for dynamic compression of Llama, Mistral and Phi models. Via EvoPress, we set new state-of-the-art results across all compression approaches: structural pruning (block/layer dropping), unstructured sparsity, as well as quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress.
Authors: Zhepeng Cen, Yao Liu, Siliang Zeng, Pratik Chaudhar, Huzefa Rangwala, George Karypis, Rasool Fakoor
Abstract: Language models are often trained to maximize the likelihood of the next token given past tokens in the training dataset. However, during inference time, they are utilized differently, generating text sequentially and auto-regressively by using previously generated tokens as input to predict the next one. Marginal differences in predictions at each step can cascade over successive steps, resulting in different distributions from what the models were trained for and potentially leading to unpredictable behavior. This paper proposes two simple approaches based on the model's own generations to address this discrepancy between training and inference time. Our first approach is Batch-Scheduled Sampling, where, during training, we stochastically choose between the ground-truth token from the dataset and the model's own generated token as input to predict the next token. This is done in an offline manner, modifying the context window by interleaving ground-truth tokens with those generated by the model. Our second approach is Reference-Answer-based Correction, where we explicitly incorporate a self-correction capability into the model during training. This enables the model to effectively self-correct the gaps between the generated sequences and the ground truth data without relying on an external oracle model. By incorporating our proposed strategies during training, we have observed an overall improvement in performance compared to baseline methods, as demonstrated by our extensive experiments using summarization, general question-answering, and math question-answering tasks.
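The stochastic choice between ground-truth and model-generated tokens described in the abstract above can be sketched as follows. This is a minimal illustration of the interleaving step only, assuming the model's tokens for each position were already generated offline; the function name `mix_tokens` and the parameter `p_model` are hypothetical and do not come from the paper.

```python
import random


def mix_tokens(ground_truth, model_tokens, p_model, rng):
    """Build a training context by interleaving two aligned token sequences.

    At each position the model-generated token is used with probability
    p_model; otherwise the ground-truth token is kept.
    """
    if len(ground_truth) != len(model_tokens):
        raise ValueError("sequences must be aligned position by position")
    return [
        m if rng.random() < p_model else g
        for g, m in zip(ground_truth, model_tokens)
    ]


# Usage: a seeded generator keeps the mixed context reproducible.
rng = random.Random(0)
context = mix_tokens(["the", "cat", "sat"], ["a", "dog", "sat"], 0.3, rng)
```

Setting `p_model` to 0 recovers standard teacher forcing, while 1 feeds the model only its own generations; how the probability should be scheduled over training is left open here.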
Authors: Daiqi Gao, Hsin-Yu Lai, Predrag Klasnja, Susan A. Murphy
Abstract: We consider reinforcement learning (RL) for a class of problems with bagged decision times. A bag contains a finite sequence of consecutive decision times. The transition dynamics are non-Markovian and non-stationary within a bag. Further, all actions within a bag jointly impact a single reward, observed at the end of the bag. Our goal is to construct an online RL algorithm to maximize the discounted sum of the bag-specific rewards. To handle non-Markovian transitions within a bag, we utilize an expert-provided causal directed acyclic graph (DAG). Based on the DAG, we construct the states as a dynamical Bayesian sufficient statistic of the observed history, which results in Markovian state transitions within and across bags. We then frame this problem as a periodic Markov decision process (MDP) that allows non-stationarity within a period. An online RL algorithm based on Bellman-equations for stationary MDPs is generalized to handle periodic MDPs. To justify the proposed RL algorithm, we show that our constructed state achieves the maximal optimal value function among all state constructions for a periodic MDP. Further we prove the Bellman optimality equations for periodic MDPs. We evaluate the proposed method on testbed variants, constructed with real data from a mobile health clinical trial.
Authors: Shengjie Sun, Runze Liu, Jiafei Lyu, Jing-Wen Yang, Liangpeng Zhang, Xiu Li
Abstract: Large Language Models (LLMs) have shown significant potential in designing reward functions for Reinforcement Learning (RL) tasks. However, obtaining high-quality reward code often involves human intervention, numerous LLM queries, or repetitive RL training. To address these issues, we propose CARD, an LLM-driven Reward Design framework that iteratively generates and improves reward function code. Specifically, CARD includes a Coder that generates and verifies the code, while an Evaluator provides dynamic feedback to guide the Coder in improving the code, eliminating the need for human feedback. In addition to process feedback and trajectory feedback, we introduce Trajectory Preference Evaluation (TPE), which evaluates the current reward function based on trajectory preferences. If the code fails the TPE, the Evaluator provides preference feedback, avoiding RL training at every iteration and making the reward function better aligned with the task objective. Empirical results on Meta-World and ManiSkill2 demonstrate that our method achieves an effective balance between task performance and token efficiency, outperforming or matching the baselines across all tasks. On 10 out of 12 tasks, CARD shows better or comparable performance to policies trained with expert-designed rewards, and our method even surpasses the oracle on 3 tasks.
Authors: Anay Pattanaik, Lav R. Varshney
Abstract: This paper considers an online reinforcement learning algorithm that leverages pre-collected data (passive memory) from the environment for online interaction. We show that using passive memory improves performance and further provide theoretical regret guarantees that turn out to be near-minimax optimal. Results show that the quality of passive memory determines sub-optimality of the incurred regret. The proposed approach and results hold in both continuous and discrete state-action spaces.
Authors: Peimeng Guan, Mark A. Davenport
Abstract: Inverse problems aim to reconstruct unseen data from corrupted or perturbed measurements. While most work focuses on improving reconstruction quality, generalization accuracy and robustness are equally important, especially for safety-critical applications. Model-based architectures (MBAs), such as loop unrolling methods, are considered more interpretable and achieve better reconstructions. Empirical evidence suggests that MBAs are more robust to perturbations than black-box solvers, but the accuracy-robustness tradeoff in MBAs remains underexplored. In this work, we propose a simple yet effective training scheme for MBAs, called SGD jittering, which injects noise iteration-wise during reconstruction. We theoretically demonstrate that SGD jittering not only generalizes better than the standard mean squared error training but is also more robust to average-case attacks. We validate SGD jittering using denoising toy examples, seismic deconvolution, and single-coil MRI reconstruction. The proposed method achieves cleaner reconstructions for out-of-distribution data and demonstrates enhanced robustness to adversarial attacks.
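To make "injects noise iteration-wise during reconstruction" concrete, here is a minimal Python sketch of unrolled gradient descent on a least-squares data term with optional Gaussian noise added at each iteration. It is a simplification under stated assumptions, not the authors' SGD jittering scheme: there is no learned network, the forward operator `A`, step size, iteration count, and noise level are hypothetical, and the sketch only shows where the noise enters the loop.

```python
import random


def unrolled_gradient_descent(A, y, steps=200, lr=0.1, jitter_std=0.0, seed=0):
    """Run `steps` gradient iterations on 0.5 * ||A x - y||^2.

    If jitter_std > 0, zero-mean Gaussian noise is added to the iterate
    after every update (the iteration-wise noise injection).
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(steps):
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        x = [x[j] - lr * grad[j] for j in range(n)]
        if jitter_std > 0:
            x = [xj + rng.gauss(0.0, jitter_std) for xj in x]
    return x


A = [[2.0, 0.0], [0.0, 1.0]]  # hypothetical forward operator
y = [2.0, 3.0]                # measurements consistent with x = [1, 3]

x_clean = unrolled_gradient_descent(A, y)
x_jittered = unrolled_gradient_descent(A, y, jitter_std=0.01)
```

Without jitter the iterates converge to the least-squares solution; with jitter they fluctuate in a small neighborhood of it.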
Authors: Joshua Engels, Logan Riggs, Max Tegmark
Abstract: Sparse autoencoders (SAEs) are a promising technique for decomposing language model activations into interpretable linear features. However, current SAEs fall short of completely explaining model performance, resulting in "dark matter": unexplained variance in activations. This work investigates dark matter as an object of study in its own right. Surprisingly, we find that much of SAE dark matter--about half of the error vector itself and >90% of its norm--can be linearly predicted from the initial activation vector. Additionally, we find that the scaling behavior of SAE error norms at a per token level is remarkably predictable: larger SAEs mostly struggle to reconstruct the same contexts as smaller SAEs. We build on the linear representation hypothesis to propose models of activations that might lead to these observations, including postulating a new type of "introduced error"; these insights imply that the part of the SAE error vector that cannot be linearly predicted ("nonlinear" error) might be fundamentally different from the linearly predictable component. To validate this hypothesis, we empirically analyze nonlinear SAE error and show that 1) it contains fewer not yet learned features, 2) SAEs trained on it are quantitatively worse, 3) it helps predict SAE per-token scaling behavior, and 4) it is responsible for a proportional amount of the downstream increase in cross entropy loss when SAE activations are inserted into the model. Finally, we examine two methods to reduce nonlinear SAE error at a fixed sparsity: inference time gradient pursuit, which leads to a very slight decrease in nonlinear error, and linear transformations from earlier layer SAE outputs, which leads to a larger reduction.
Authors: Satoru Tadokoro, Akihiro Yamaguchi, Takao Namiki, Ichiro Tsuda
Abstract: By extending the extreme learning machine with additional control inputs, we achieved almost complete reproduction of bifurcation structures of dynamical systems. The learning ability of the proposed neural network system is striking in that the entire structure of the bifurcations of a target one-parameter family of dynamical systems can be nearly reproduced by training on transient dynamics using only a few parameter values. Moreover, we propose a mechanism to explain this remarkable learning ability and discuss the relationship between the present results and similar results obtained by Kim et al.
Authors: Kuba Weimann, Tim O. F. Conrad
Abstract: Accurate diagnosis of heart arrhythmias requires the interpretation of electrocardiograms (ECG), which capture the electrical activity of the heart. Automating this process through machine learning is challenging due to the need for large annotated datasets, which are difficult and costly to collect. To address this issue, transfer learning is often employed, where models are pre-trained on large datasets and fine-tuned for specific ECG classification tasks with limited labeled data. Self-supervised learning has become a widely adopted pre-training method, enabling models to learn meaningful representations from unlabeled datasets. In this work, we explore the joint-embedding predictive architecture (JEPA) for self-supervised learning from ECG data. Unlike invariance-based methods, JEPA does not rely on hand-crafted data augmentations, and unlike generative methods, it predicts latent features rather than reconstructing input data. We create a large unsupervised pre-training dataset by combining ten public ECG databases, amounting to over one million records. We pre-train Vision Transformers using JEPA on this dataset and fine-tune them on various PTB-XL benchmarks. Our results show that JEPA outperforms existing invariance-based and generative approaches, achieving an AUC of 0.945 on the PTB-XL all statements task. JEPA consistently learns the highest quality representations, as demonstrated in linear evaluations, and proves advantageous for pre-training even in the absence of additional data.
Authors: Diogo Reis Santos, Albert Sund Aillet, Antonio Boiano, Usevalad Milasheuski, Lorenzo Giusti, Marco Di Gennaro, Sanaz Kianoush, Luca Barbieri, Monica Nicoli, Michele Carminati, Alessandro E. C. Redondi, Stefano Savazzi, Luigi Serio
Abstract: The rapid evolution of artificial intelligence (AI) technologies holds transformative potential for the healthcare sector. In critical situations requiring immediate decision-making, healthcare professionals can leverage machine learning (ML) algorithms to prioritize and optimize treatment options, thereby reducing costs and improving patient outcomes. However, the sensitive nature of healthcare data presents significant challenges in terms of privacy and data ownership, hindering data availability and the development of robust algorithms. Federated Learning (FL) addresses these challenges by enabling collaborative training of ML models without the exchange of local data. This paper introduces a novel FL platform designed to support the configuration, monitoring, and management of FL processes. This platform operates on Platform-as-a-Service (PaaS) principles and utilizes the Message Queuing Telemetry Transport (MQTT) publish-subscribe protocol. Considering the production readiness and data sensitivity inherent in clinical environments, we emphasize the security of the proposed FL architecture, addressing potential threats and proposing mitigation strategies to enhance the platform's trustworthiness. The platform has been successfully tested in various operational environments using a publicly available dataset, highlighting its benefits and confirming its efficacy.
Authors: Zhengrui Guo, Fangxu Zhou, Wei Wu, Qichen Sun, Lishuang Feng, Jinzhuo Wang, Hao Chen
Abstract: Modeling the nonlinear dynamics of neuronal populations represents a key pursuit in computational neuroscience. Recent research has increasingly focused on jointly modeling neural activity and behavior to unravel their interconnections. Despite significant efforts, these approaches often necessitate either intricate model designs or oversimplified assumptions. Given the frequent absence of perfectly paired neural-behavioral datasets in real-world scenarios when deploying these models, a critical yet understudied research question emerges: how to develop a model that performs well using only neural activity as input at inference, while benefiting from the insights gained from behavioral signals during training? To this end, we propose BLEND, the behavior-guided neural population dynamics modeling framework via privileged knowledge distillation. By considering behavior as privileged information, we train a teacher model that takes both behavior observations (privileged features) and neural activities (regular features) as inputs. A student model is then distilled using only neural activity. Unlike existing methods, our framework is model-agnostic and avoids making strong assumptions about the relationship between behavior and neural activity. This allows BLEND to enhance existing neural dynamics modeling architectures without developing specialized models from scratch. Extensive experiments across neural population activity modeling and transcriptomic neuron identity prediction tasks demonstrate strong capabilities of BLEND, reporting over 50% improvement in behavioral decoding and over 15% improvement in transcriptomic neuron identity prediction after behavior-guided distillation. Furthermore, we empirically explore various behavior-guided distillation strategies within the BLEND framework and present a comprehensive analysis of effectiveness and implications for model performance.
Authors: Fedor S. Avilov, Roman A. Eremin, Semen A. Budennyy, Innokentiy S. Humonen
Abstract: Although artificial intelligence-assisted modeling of disordered crystals is a widely used and well-tested method of new materials design, the issues of its robustness, reliability, and stability remain unresolved and insufficiently discussed. To highlight this, we composed a series of nested datasets of intermetallic approximants of quasicrystals and trained various machine learning models on them. Our qualitative and, more importantly, quantitative assessment of the differences in the predictions clearly shows that different reasonable changes in the training sample can lead to completely different sets of predicted potentially new materials. We also show the advantage of pre-training and propose a simple yet effective trick of sequential training to increase stability.
Authors: Jipeng Han
Abstract: Program synthesis methods, whether formal or neural-based, lack fine-grained control and flexible modularity, which limits their adaptation to complex software development. These limitations stem from rigid Domain-Specific Language (DSL) frameworks and incorrect neural network predictions. To this end, we propose the Chain of Logic (CoL), which organizes synthesis stages into a chain and provides precise heuristic control to guide the synthesis process. Furthermore, by integrating neural networks with libraries and introducing a Neural Network Feedback Control (NNFC) mechanism, our approach modularizes synthesis and mitigates the impact of neural network mispredictions. Experiments on relational and symbolic synthesis tasks show that CoL significantly enhances the efficiency and reliability of DSL program synthesis across multiple metrics. Specifically, CoL improves accuracy by 70% while reducing tree operations by 91% and time by 95%. Additionally, NNFC further boosts accuracy by 6%, with a 64% reduction in tree operations under challenging conditions such as insufficient training data, increased difficulty, and multidomain synthesis. These improvements confirm COOL as a highly efficient and reliable program synthesis framework.
Authors: Mirna Al-Shetairy, Hanan Hindy, Dina Khattab, Mostafa M. Aref
Abstract: In recent years, interest in vision-language tasks has grown, especially those involving chart interactions. These tasks are inherently multimodal, requiring models to process chart images, accompanying text, underlying data tables, and often user queries. Traditionally, Chart Understanding (CU) relied on heuristics and rule-based systems. However, recent advancements that integrate transformer architectures have significantly improved performance. This paper reviews prominent research in CU, focusing on State-of-The-Art (SoTA) frameworks that employ transformers within End-to-End (E2E) solutions. Relevant benchmarking datasets and evaluation techniques are analyzed. Additionally, this article identifies key challenges and outlines promising future directions for advancing CU solutions. Following the PRISMA guidelines, a comprehensive literature search is conducted across Google Scholar, focusing on publications from Jan'20 to Jun'24. After rigorous screening and quality assessment, 32 studies are selected for in-depth analysis. The CU tasks are categorized into a three-layered paradigm based on the cognitive task required. Recent advancements in the frameworks addressing various CU tasks are also reviewed. Frameworks are categorized into single-task or multi-task based on the number of tasks solvable by the E2E solution. Within multi-task frameworks, pre-trained and prompt-engineering-based techniques are explored. This review overviews leading architectures, datasets, and pre-training tasks. Despite significant progress, challenges remain in OCR dependency, handling low-resolution images, and enhancing visual reasoning. Future directions include addressing these challenges, developing robust benchmarks, and optimizing model efficiency. Additionally, integrating explainable AI techniques and exploring the balance between real and synthetic data are crucial for advancing CU research.
Authors: Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, Elaine Chang, Vaughn Robinson, Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang
Abstract: For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting dangerous activities. We study an open question in this work: does the desired safety refusal, typically enforced in chat contexts, generalize to non-chat and agentic use cases? Unlike chatbots, LLM agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial to refuse harmful instructions. In this work, we primarily focus on red-teaming browser agents, LLMs that manipulate information via web browsers. To this end, we introduce Browser Agent Red teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red-teaming browser agents. BrowserART consists of 100 diverse browser-related harmful behaviors (including original behaviors and ones sourced from HarmBench [Mazeika et al., 2024] and AirBench 2024 [Zeng et al., 2024b]) across both synthetic and real websites. Our empirical study on state-of-the-art browser agents reveals that, while the backbone LLM refuses harmful instructions as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak refusal-trained LLMs in the chat settings transfer effectively to browser agents. With human rewrites, GPT-4o and o1-preview-based browser agents attempted 98 and 63 harmful behaviors (out of 100), respectively. We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety.
Authors: Aviral Srivastava, Sourav Panda
Abstract: As generative AI systems, including large language models (LLMs) and diffusion models, advance rapidly, their growing adoption has led to new and complex security risks often overlooked in traditional AI risk assessment frameworks. This paper introduces a novel formal framework for categorizing and mitigating these emergent security risks by integrating adaptive, real-time monitoring, and dynamic risk mitigation strategies tailored to generative models' unique vulnerabilities. We identify previously under-explored risks, including latent space exploitation, multi-modal cross-attack vectors, and feedback-loop-induced model degradation. Our framework employs a layered approach, incorporating anomaly detection, continuous red-teaming, and real-time adversarial simulation to mitigate these risks. We focus on formal verification methods to ensure model robustness and scalability in the face of evolving threats. Though theoretical, this work sets the stage for future empirical validation by establishing a detailed methodology and metrics for evaluating the performance of risk mitigation strategies in generative AI systems. This framework addresses existing gaps in AI safety, offering a comprehensive road map for future research and implementation.
Authors: Zheng Wang, Wanwan Wang, Yimin Huang, Zhaopeng Peng, Ziqi Yang, Cheng Wang, Xiaoliang Fan
Abstract: In recent years, graph neural networks (GNNs) have been commonly utilized for social recommendation systems. However, real-world scenarios often present challenges related to user privacy and business constraints, inhibiting direct access to valuable social information from other platforms. While many existing methods have tackled matrix factorization-based social recommendations without direct social data access, developing GNN-based federated social recommendation models under similar conditions remains largely unexplored. To address this issue, we propose a novel vertical federated social recommendation method leveraging privacy-preserving two-party graph convolution networks (P4GCN) to enhance recommendation accuracy without requiring direct access to sensitive social information. First, we introduce a Sandwich-Encryption module to ensure comprehensive data privacy during the collaborative computing process. Second, we provide a thorough theoretical analysis of the privacy guarantees, considering the participation of both curious and honest parties. Extensive experiments on four real-world datasets demonstrate that P4GCN outperforms state-of-the-art methods in terms of recommendation accuracy. The code is available at https://github.com/WwZzz/P4GCN.
Authors: Jinluan Yang, Anke Tang, Didi Zhu, Zhengyu Chen, Li Shen, Fei Wu
Abstract: Model merging has gained significant attention as a cost-effective approach to integrate multiple single-task fine-tuned models into a unified one that can perform well on multiple tasks. However, existing model merging techniques primarily focus on resolving conflicts between task-specific models and often overlook potential security threats, particularly the risk of backdoor attacks in the open-source model ecosystem. In this paper, we first investigate the vulnerabilities of existing model merging methods to backdoor attacks, identifying two critical challenges: backdoor succession and backdoor transfer. To address these issues, we propose a novel Defense-Aware Merging (DAM) approach that simultaneously mitigates task interference and backdoor vulnerabilities. Specifically, DAM employs a meta-learning-based optimization method with dual masks to identify a shared and safety-aware subspace for model merging. These masks are alternately optimized: the Task-Shared mask identifies common beneficial parameters across tasks, aiming to preserve task-specific knowledge while reducing interference, while the Backdoor-Detection mask isolates potentially harmful parameters to neutralize security threats. This dual-mask design allows us to carefully balance the preservation of useful knowledge and the removal of potential vulnerabilities. Compared to existing merging methods, DAM achieves a more favorable balance between performance and security, reducing the attack success rate by 2-10 percentage points while sacrificing only about 1% in accuracy. Furthermore, DAM exhibits robust performance and broad applicability across various types of backdoor attacks and the number of compromised models involved in the merging process. We will release the codes and models soon.
Authors: Liam Rankin Sheldon, Dylan Sheldon Rankin, Philip Harris
Abstract: The use of machine learning methods in high energy physics typically relies on large volumes of precise simulation for training. As machine learning models become more complex they can become increasingly sensitive to differences between this simulation and the real data collected by experiments. We present a generic methodology based on contrastive learning which is able to greatly mitigate this negative effect. Crucially, the method does not require prior knowledge of the specifics of the mismodeling. While we demonstrate the efficacy of this technique using the task of jet-tagging at the Large Hadron Collider, it is applicable to a wide array of different tasks both in and out of the field of high energy physics.
Authors: Rachel Ma, Jingyi Qu, Andreea Bobu, Dylan Hadfield-Menell
Abstract: We present an online method for embodied agents to learn and accomplish diverse user goals. Offline methods like RLHF can represent various goals but require large datasets; our approach achieves similar flexibility with online efficiency. We extract natural language goal representations from conversations with Large Language Models (LLMs). We prompt an LLM to role play as a human with different goals and use the corresponding likelihoods to run Bayesian inference over potential goals. As a result, our method can represent uncertainty over complex goals based on unrestricted dialog. We evaluate our method in grocery shopping and home robot assistance domains using a text-based interface and AI2Thor simulation respectively. Results show our method outperforms ablation baselines that lack either explicit goal representation or probabilistic inference.
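The Bayesian inference step mentioned in the abstract above can be sketched as a discrete posterior update over candidate goals. This is a generic application of Bayes' rule, not the authors' code: the goal names and likelihood values below are hypothetical stand-ins for likelihoods that, in the described method, would come from an LLM role-playing a user with each goal.

```python
import math


def posterior_over_goals(prior, log_likelihoods):
    """Apply Bayes' rule over a discrete set of goals, working in log space.

    prior: dict mapping goal -> prior probability
    log_likelihoods: dict mapping goal -> log P(observed utterance | goal)
    """
    log_post = {g: math.log(prior[g]) + log_likelihoods[g] for g in prior}
    shift = max(log_post.values())  # subtract the max for numerical stability
    normalizer = sum(math.exp(v - shift) for v in log_post.values())
    return {g: math.exp(v - shift) / normalizer for g, v in log_post.items()}


# Hypothetical goals and likelihoods, chosen only for illustration.
prior = {"vegetarian_dinner": 0.5, "quick_snack": 0.5}
log_lik = {
    "vegetarian_dinner": math.log(0.8),
    "quick_snack": math.log(0.2),
}
posterior = posterior_over_goals(prior, log_lik)
```

Repeating the update after each dialog turn, with the previous posterior as the new prior, keeps a running estimate of uncertainty over goals.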
Authors: Kuldeep Singh, Simerjot Kaur, Charese Smiley
Abstract: Financial decision-making hinges on the analysis of relevant information embedded in the enormous volume of documents in the financial domain. To address this challenge, we developed FinQAPT, an end-to-end pipeline that streamlines the identification of relevant financial reports based on a query, extracts pertinent context, and leverages Large Language Models (LLMs) to perform downstream tasks. To evaluate the pipeline, we experimented with various techniques to optimize the performance of each module using the FinQA dataset. We introduced a novel clustering-based negative sampling technique to enhance context extraction and a novel prompting method called Dynamic N-shot Prompting to boost the numerical question-answering capabilities of LLMs. At the module level, we achieved state-of-the-art accuracy on FinQA, attaining an accuracy of 80.6%. However, at the pipeline level, we observed decreased performance due to challenges in extracting relevant context from financial reports. We conducted a detailed error analysis of each module and the end-to-end pipeline, pinpointing specific challenges that must be addressed to develop a robust solution for handling complex financial tasks.
Authors: Neale Ratzlaff, Matthew Lyle Olson, Musashi Hinck, Shao-Yen Tseng, Vasudev Lal, Phillip Howard
Abstract: Large Vision Language Models (LVLMs) such as LLaVA have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input image. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we propose a novel debiasing framework for LVLMs by directly ablating biased attributes during text generation to avoid generating text related to protected attributes, or even representing them internally. Our method requires no training and a relatively small amount of representative biased outputs (~1000 samples). Our experiments show that not only can we minimize the propensity of LVLMs to generate text related to protected attributes, but we can even use synthetic data to inform the ablation while retaining captioning performance on real data such as COCO. Furthermore, we find the resulting generations from a debiased LVLM exhibit similar accuracy as a baseline biased model, showing that debiasing effects can be achieved without sacrificing model performance.
Authors: Aoran Zhang, Wenbin Zhou, Liyan Xie, Shixiang Zhu
Abstract: Time series data are crucial across diverse domains such as finance and healthcare, where accurate forecasting and decision-making rely on advanced modeling techniques. While generative models have shown great promise in capturing the intricate dynamics inherent in time series, evaluating their performance remains a major challenge. Traditional evaluation metrics fall short due to the temporal dependencies and potential high dimensionality of the features. In this paper, we propose the REcurrent NeurAL (RENAL) Goodness-of-Fit test, a novel and statistically rigorous framework for evaluating generative time series models. By leveraging recurrent neural networks, we transform the time series into conditionally independent data pairs, enabling the application of a chi-square-based goodness-of-fit test to the temporal dependencies within the data. This approach offers a robust, theoretically grounded solution for assessing the quality of generative models, particularly in settings with limited time sequences. We demonstrate the efficacy of our method across both synthetic and real-world datasets, outperforming existing methods in terms of reliability and accuracy. Our method fills a critical gap in the evaluation of time series generative models, offering a tool that is both practical and adaptable to high-stakes applications.
Authors: Jiping Li, Rishi Sonthalia
Abstract: Random matrix theory has proven to be a valuable tool in analyzing the generalization of linear models. However, the generalization properties of even two-layer neural networks trained by gradient descent remain poorly understood. To understand the generalization performance of such networks, it is crucial to characterize the spectrum of the feature matrix at the hidden layer. Recent work has made progress in this direction by describing the spectrum after a single gradient step, revealing a spiked covariance structure. Yet, the generalization error for linear models with spiked covariances has not been previously determined. This paper addresses this gap by examining two simple models exhibiting spiked covariances. We derive their generalization error in the asymptotic proportional regime. Our analysis demonstrates that the eigenvector and eigenvalue corresponding to the spike significantly influence the generalization error.
Authors: Gabriel Chan, Bart Van Parys, Amine Bennouna
Abstract: We establish a connection between distributionally robust optimization (DRO) and classical robust statistics. We demonstrate that this connection arises naturally in the context of estimation under data corruption, where the goal is to construct "minimal" confidence sets for the unknown data-generating distribution. Specifically, we show that a DRO ambiguity set, based on the Kullback-Leibler divergence and total variation distance, is uniformly minimal, meaning it represents the smallest confidence set that contains the unknown distribution at a given confidence power. Moreover, we prove that when parametric assumptions are imposed on the unknown distribution, the ambiguity set is never larger than a confidence set based on the optimal estimator proposed by Huber. This insight reveals that the commonly observed conservatism of DRO formulations is not intrinsic to these formulations themselves but rather stems from the non-parametric framework in which these formulations are employed.
Authors: Shreya Saha, Ishaan Chadha, Meenakshi Khosla
Abstract: Over the past decade, predictive modeling of neural responses in the primate visual system has advanced significantly, largely driven by various DNN approaches. These include models optimized directly for visual recognition, cross-modal alignment through contrastive objectives, neural response prediction from scratch, and large language model embeddings. Likewise, different readout mechanisms, ranging from fully linear to spatial-feature factorized methods, have been explored for mapping network activations to neural responses. Despite the diversity of these approaches, it remains unclear which method performs best across different visual regions. In this study, we systematically compare these approaches for modeling the human visual system and investigate alternative strategies to improve response predictions. Our findings reveal that for early to mid-level visual areas, response-optimized models with visual inputs offer superior prediction accuracy, while for higher visual regions, embeddings from LLMs based on detailed contextual descriptions of images and task-optimized models pretrained on large vision datasets provide the best fit. Through comparative analysis of these modeling approaches, we identified three distinct regions in the visual cortex: one sensitive primarily to perceptual features of the input that are not captured by linguistic descriptions, another attuned to fine-grained visual details representing semantic information, and a third responsive to abstract, global meanings aligned with linguistic content. We also highlight the critical role of readout mechanisms, proposing a novel scheme that modulates receptive fields and feature maps based on semantic content, resulting in an accuracy boost of 3-23% over existing SOTAs for all models and brain regions. Together, these findings offer key insights into building more precise models of the visual system.
Authors: Bolin Lai, Sam Toyer, Tushar Nagarajan, Rohit Girdhar, Shengxin Zha, James M. Rehg, Kris Kitani, Kristen Grauman, Ruta Desai, Miao Liu
Abstract: Predicting future human behavior is an increasingly popular topic in computer vision, driven by the interest in applications such as autonomous vehicles, digital assistants and human-robot interactions. The literature on behavior prediction spans various tasks, including action anticipation, activity forecasting, intent prediction, goal prediction, and so on. Our survey aims to tie together this fragmented literature, covering recent technical innovations as well as the development of new large-scale datasets for model training and evaluation. We also summarize the widely-used metrics for different tasks and provide a comprehensive performance comparison of existing approaches on eleven action anticipation datasets. This survey serves as not only a reference for contemporary methodologies in action anticipation, but also a guideline for future research direction of this evolving landscape.
Authors: Runshi Tang, Tamara Kolda, Anru R. Zhang
Abstract: This paper presents a canonical polyadic (CP) tensor decomposition that addresses unaligned observations. The mode with unaligned observations is represented using functions in a reproducing kernel Hilbert space (RKHS). We introduce a versatile loss function that effectively accounts for various types of data, including binary, integer-valued, and positive-valued types. Additionally, we propose an optimization algorithm for computing tensor decompositions with unaligned observations, along with a stochastic gradient method to enhance computational efficiency. A sketching algorithm is also introduced to further improve efficiency when using the $\ell_2$ loss function. To demonstrate the efficacy of our methods, we provide illustrative examples using both synthetic data and an early childhood human microbiome dataset.
Authors: Alireza Rezazadeh, Zichao Li, Wei Wei, Yujia Bao
Abstract: Recent advancements in large language models have significantly improved their context windows, yet challenges in effective long-term memory management remain. We introduce MemTree, an algorithm that leverages a dynamic, tree-structured memory representation to optimize the organization, retrieval, and integration of information, akin to human cognitive schemas. MemTree organizes memory hierarchically, with each node encapsulating aggregated textual content, corresponding semantic embeddings, and varying abstraction levels across the tree's depths. Our algorithm dynamically adapts this memory structure by computing and comparing semantic embeddings of new and existing information to enrich the model's context-awareness. This approach allows MemTree to handle complex reasoning and extended interactions more effectively than traditional memory augmentation methods, which often rely on flat lookup tables. Evaluations on benchmarks for multi-turn dialogue understanding and document question answering show that MemTree significantly enhances performance in scenarios that demand structured memory management.
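The embedding-guided insertion described in the abstract above can be pictured with a small Python sketch: a new item descends the tree toward the most similar child while the cosine similarity stays above a threshold, and is attached where the descent stops. This is an illustrative simplification, not the MemTree algorithm itself: the `Node` class, the `insert` function, the fixed threshold, and the two-dimensional embeddings are all hypothetical, and the aggregation of textual content into parent nodes is omitted.

```python
import math
from dataclasses import dataclass, field
from typing import List


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


@dataclass
class Node:
    text: str
    embedding: List[float]
    children: List["Node"] = field(default_factory=list)


def insert(root: Node, text: str, embedding: List[float], threshold: float = 0.8) -> Node:
    """Attach a new node under the deepest sufficiently similar node.

    Returns the parent that the new node was attached to.
    """
    current = root
    while current.children:
        best = max(current.children, key=lambda c: cosine(c.embedding, embedding))
        if cosine(best.embedding, embedding) < threshold:
            break  # no child is similar enough: attach at this level
        current = best
    current.children.append(Node(text, embedding))
    return current


# Toy embeddings, chosen by hand for illustration.
root = Node("root", [0.0, 0.0])
insert(root, "cats", [1.0, 0.0])
insert(root, "dogs", [0.0, 1.0])
parent = insert(root, "kittens", [0.9, 0.1])
```

Here "kittens" lands under "cats" because their embeddings are nearly parallel, while "dogs" stays a sibling at the top level.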
Authors: Panagiotis Theodoropoulos, Nikolaos Komianos, Vincent Pacelli, Guan-Horng Liu, Evangelos A. Theodorou
Abstract: Recent advancements in diffusion bridges for distribution transport problems have heavily relied on matching frameworks, yet existing methods often face a trade-off between scalability and access to optimal pairings during training. Fully unsupervised methods make minimal assumptions but incur high computational costs, limiting their practicality. On the other hand, imposing full supervision of the matching process with optimal pairings improves scalability; however, it can be infeasible in many applications. To strike a balance between scalability and minimal supervision, we introduce Feedback Schr\"{o}dinger Bridge Matching (FSBM), a novel semi-supervised matching framework that incorporates a small portion (less than 8% of the entire dataset) of pre-aligned pairs as state feedback to guide the transport map of non-coupled samples, thereby significantly improving efficiency. This is achieved by formulating a static Entropic Optimal Transport (EOT) problem with an additional term capturing the semi-supervised guidance. The generalized EOT objective is then recast into a dynamic formulation to leverage the scalability of matching frameworks. Extensive experiments demonstrate that FSBM accelerates training and enhances generalization by leveraging coupled pairs guidance, opening new avenues for training matching frameworks with partially aligned datasets.
Authors: Amir Hossein Saberi, Amir Najafi, Ala Emrani, Amin Behjati, Yasaman Zolfimoselo, Mahdi Shadrooy, Abolfazl Motahari, Babak H. Khalaj
Abstract: The aim of this paper is to address the challenge of gradual domain adaptation within a class of manifold-constrained data distributions. In particular, we consider a sequence of $T\ge2$ data distributions $P_1,\ldots,P_T$ undergoing a gradual shift, where each pair of consecutive measures $P_i,P_{i+1}$ is close in Wasserstein distance. We have a supervised dataset of size $n$ sampled from $P_0$, while for the subsequent distributions in the sequence, only unlabeled i.i.d. samples are available. Moreover, we assume that all distributions exhibit a known favorable attribute, such as (but not limited to) having intra-class soft/hard margins. In this context, we propose a methodology rooted in Distributionally Robust Optimization (DRO) with an adaptive Wasserstein radius. We theoretically show that this method guarantees the classification error across all $P_i$s can be suitably bounded. Our bounds rely on a newly introduced compatibility measure, which fully characterizes the error propagation dynamics along the sequence. Specifically, for inadequately constrained distributions, the error can exponentially escalate as we progress through the gradual shifts. Conversely, for appropriately constrained distributions, the error can be demonstrated to be linear or even entirely eradicated. We have substantiated our theoretical findings through several experimental results.
Authors: Mihail Stoian, Alexander van Renen, Jan Kobiolka, Ping-Lin Kuo, Josif Grabocka, Andreas Kipf
Abstract: The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to reduce file sizes further. Yet, current approaches either incur significant scan overheads or require manual specification of correlations, limiting their practicability. We present $\texttt{Virtual}$, a framework that integrates seamlessly with existing open formats to automatically leverage data correlations, achieving substantial compression gains while having minimal scan performance overhead. Experiments on $\texttt{data.gov}$ datasets show that $\texttt{Virtual}$ reduces file sizes by up to 40% compared to Apache Parquet.
Authors: Xiao Liu, Xinchao Liu
Abstract: The Proper Orthogonal Decomposition (POD) computes the optimal basis modes that span a low-dimensional subspace where the Reduced-Order Models (ROMs) reside. Because a governing equation is often parameterized by a set of parameters, challenges immediately arise when one would like to investigate how systems behave differently over the parameter space (in design, control, uncertainty quantification and real-time operations). In this case, the POD basis needs to be updated so that the adapted ROM accurately captures the variation of a system's behavior over its parameter space. This paper proposes a Projected Gaussian Process (pGP) and formulates the problem of adapting the POD basis as a supervised statistical learning problem, for which the goal is to learn a mapping from the parameter space to the Grassmann Manifold that contains the optimal vector subspaces. A mapping is first found between the Euclidean space and the horizontal space of an orthogonal matrix that spans a reference subspace in the Grassmann Manifold. Then, a second mapping from the horizontal space to the Grassmann Manifold is established through the Exponential/Logarithm maps between the manifold and its tangent space. Finally, given a new parameter, the conditional distribution of a vector can be found in the Euclidean space using Gaussian Process (GP) regression, and this distribution is projected to the Grassmann Manifold, yielding the optimal subspace for the new parameter. The proposed statistical learning approach allows us to optimally estimate model parameters given data (i.e., the prediction/interpolation becomes problem-specific), and quantify the uncertainty associated with the prediction. Numerical examples are presented to demonstrate the advantages of the proposed pGP for adapting the POD basis against parameter changes.
Authors: Hao Sui, Bing Chen, Jiale Zhang, Chengcheng Zhu, Di Wu, Qinghua Lu, Guodong Long
Abstract: Recent studies have revealed that GNNs are highly susceptible to multiple adversarial attacks. Among these, graph backdoor attacks pose one of the most prominent threats, where attackers cause models to misclassify by learning the backdoored features with injected triggers and modified target labels during the training phase. Based on the features of the triggers, these attacks can be categorized into out-of-distribution (OOD) and in-distribution (ID) graph backdoor attacks: triggers with notable differences from the clean sample feature distributions constitute OOD backdoor attacks, whereas the triggers in ID backdoor attacks are nearly identical to the clean sample feature distributions. Existing methods can successfully defend against OOD backdoor attacks by comparing the feature distribution of triggers and clean samples but fail to mitigate stealthy ID backdoor attacks. Due to the lack of proper supervision signals, the main task accuracy is negatively affected in defending against ID backdoor attacks. To bridge this gap, we propose DMGNN against OOD and ID graph backdoor attacks that can powerfully eliminate stealthiness to guarantee defense effectiveness and improve the model performance. Specifically, DMGNN can easily identify the hidden ID and OOD triggers via predicting label transitions based on counterfactual explanation. To further filter the diversity of generated explainable graphs and erase the influence of the trigger features, we present a reverse sampling pruning method to screen and discard the triggers directly on the data level. Extensive experimental evaluations on open graph datasets demonstrate that DMGNN far outperforms the state-of-the-art (SOTA) defense methods, reducing the attack success rate to 5% with almost negligible degradation in model performance (within 3.5%).
Authors: Rachel Ma, Lyndon Lam, Benjamin A. Spiegel, Aditya Ganeshan, Roma Patel, Ben Abbatematteo, David Paulius, Stefanie Tellex, George Konidaris
Abstract: It is imperative that robots can understand natural language commands issued by humans. Such commands typically contain verbs that signify what action should be performed on a given object and that are applicable to many objects. We propose a method for generalizing manipulation skills to novel objects using verbs. Our method learns a probabilistic classifier that determines whether a given object trajectory can be described by a specific verb. We show that this classifier accurately generalizes to novel object categories with an average accuracy of 76.69% across 13 object categories and 14 verbs. We then perform policy search over the object kinematics to find an object trajectory that maximizes classifier prediction for a given verb. Our method allows a robot to generate a trajectory for a novel object based on a verb, which can then be used as input to a motion planner. We show that our model can generate trajectories that are usable for executing five verb commands applied to novel instances of two different object categories on a real robot.
Authors: Yonghyun Kim, Alexander Lerch
Abstract: Recent advancements in Automatic Piano Transcription (APT) have significantly improved system performance, but the impact of noisy environments on the system performance remains largely unexplored. This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art APT models and evaluates the performance of the Onsets and Frames model when trained on noise-augmented data. We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
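As a rough illustration of what mixing white noise into a signal at a chosen SNR involves, the following Python sketch scales Gaussian noise so that the ratio of signal power to noise power matches a target in decibels. It is not the authors' pipeline: the function names `add_white_noise` and `measured_snr_db`, the Gaussian noise model, and the synthetic sine input are all assumptions made for this example.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return (noisy_signal, noise) with Gaussian white noise at the target SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    raw = [rng.gauss(0.0, 1.0) for _ in signal]
    raw_power = sum(n * n for n in raw) / len(raw)
    # Rescale the sampled noise so its empirical power equals the target power.
    scale = math.sqrt(noise_power / raw_power)
    noise = [n * scale for n in raw]
    noisy = [x + n for x, n in zip(signal, noise)]
    return noisy, noise


def measured_snr_db(signal, noise):
    """Empirical SNR in dB between a clean signal and a noise sequence."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)


# Toy input: a sine wave standing in for an audio excerpt.
clean = [math.sin(0.1 * i) for i in range(1000)]
noisy, noise = add_white_noise(clean, snr_db=10.0)
print(round(measured_snr_db(clean, noise), 6))
```

Because the noise is rescaled to its empirical power, the measured SNR matches the requested value up to floating-point error; lower `snr_db` values give a noisier signal.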
Authors: Eli N. Weinstein, Elizabeth B. Wood, David M. Blei
Abstract: A central question in human immunology is how a patient's repertoire of T cells impacts disease. Here, we introduce a method to infer the causal effects of T cell receptor (TCR) sequences on patient outcomes using observational TCR repertoire sequencing data and clinical outcomes data. Our approach corrects for unobserved confounders, such as a patient's environment and life history, by using the patient's immature, pre-selection TCR repertoire. The pre-selection repertoire can be estimated from nonproductive TCR data, which is widely available. It is generated by a randomized mutational process, V(D)J recombination, which provides a natural experiment. We show formally how to use the pre-selection repertoire to draw causal inferences, and develop a scalable neural-network estimator for our identification formula. Our method produces an estimate of the effect of interventions that add a specific TCR sequence to patient repertoires. As a demonstration, we use it to analyze the effects of TCRs on COVID-19 severity, uncovering potentially therapeutic TCRs that are (1) observed in patients, (2) bind SARS-CoV-2 antigens in vitro and (3) have strong positive effects on clinical outcomes.
Authors: Muhe Ding, Jianlong Wu, Xue Dong, Xiaojie Li, Pengda Qin, Tian Gan, Liqiang Nie
Abstract: Knowledge distillation is a mainstream algorithm in model compression that transfers knowledge from the larger model (teacher) to the smaller model (student) to improve the performance of the student. Despite many efforts, existing methods mainly investigate the consistency between instance-level feature representation or prediction, which neglects the category-level information and the difficulty of each sample, leading to undesirable performance. To address these issues, we propose a novel preview-based category contrastive learning method for knowledge distillation (PCKD). It first distills the structural knowledge of both instance-level feature correspondence and the relation between instance features and category centers in a contrastive learning fashion, which can explicitly optimize the category representation and explore the distinct correlation between representations of instances and categories, contributing to discriminative category centers and better classification results. Besides, we introduce a novel preview strategy to dynamically determine how much the student should learn from each sample according to its difficulty. Different from existing methods that treat all samples equally and curriculum learning that simply filters out hard samples, our method assigns a small weight to hard instances as a preview to better guide the student training. Extensive experiments on several challenging datasets, including CIFAR-100 and ImageNet, demonstrate the superiority over state-of-the-art methods.
Authors: Yanming Zhang, Akshith Kota, Eric Papenhausen, Klaus Mueller
Abstract: Causal networks are widely used in many fields to model the complex relationships between variables. A recent approach has sought to construct causal networks by leveraging the wisdom of crowds through the collective participation of humans. While this can yield detailed causal networks that model the underlying phenomena quite well, it requires a large number of individuals with domain understanding. We adopt a different approach: leveraging the causal knowledge that large language models, such as OpenAI's GPT-4, have learned by ingesting massive amounts of literature. Within a dedicated visual analytics interface, called CausalChat, users explore single variables or variable pairs recursively to identify causal relations, latent variables, confounders, and mediators, constructing detailed causal networks through conversation. Each probing interaction is translated into a tailored GPT-4 prompt and the response is conveyed through visual representations which are linked to the generated text for explanations. We demonstrate the functionality of CausalChat across diverse data contexts and conduct user studies involving both domain experts and laypersons.
Authors: Gaoyang Pang, Wanchun Liu, Dusit Niyato, Daniel Quevedo, Branka Vucetic, Yonghui Li
Abstract: Wireless Human-Machine Collaboration (WHMC) represents a critical advancement for Industry 5.0, enabling seamless interaction between humans and machines across geographically distributed systems. As the WHMC systems become increasingly important for achieving complex collaborative control tasks, ensuring their stability is essential for practical deployment and long-term operation. Stability analysis certifies how the closed-loop system will behave under model randomness, which is essential for systems operating with wireless communications. However, the fundamental stability analysis of the WHMC systems remains an unexplored challenge due to the intricate interplay between the stochastic nature of wireless communications, dynamic human operations, and the inherent complexities of control system dynamics. This paper establishes a fundamental WHMC model incorporating dual wireless loops for machine and human control. Our framework accounts for practical factors such as short-packet transmissions, fading channels, and advanced HARQ schemes. We model human control lag as a Markov process, which is crucial for capturing the stochastic nature of human interactions. Building on this model, we propose a stochastic cycle-cost-based approach to derive a stability condition for the WHMC system, expressed in terms of wireless channel statistics, human dynamics, and control parameters. Our findings are validated through extensive numerical simulations and a proof-of-concept experiment, where we developed and tested a novel wireless collaborative cart-pole control system. The results confirm the effectiveness of our approach and provide a robust framework for future research on WHMC systems in more complex environments.
Authors: Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, Lingpeng Kong
Abstract: Autoregressive language models, despite their impressive capabilities, struggle with complex reasoning and long-term planning tasks. We introduce discrete diffusion models as a novel solution to these challenges. Through the lens of subgoal imbalance, we demonstrate how diffusion models effectively learn difficult subgoals that elude autoregressive approaches. We propose Multi-granularity Diffusion Modeling (MDM), which prioritizes subgoals based on difficulty during learning. On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MDM significantly outperforms autoregressive models without using search techniques. For instance, MDM achieves 91.5\% and 100\% accuracy on Countdown and Sudoku, respectively, compared to 45.8\% and 20.7\% for autoregressive models. Our work highlights the potential of diffusion-based approaches in advancing AI capabilities for sophisticated language understanding and problem-solving tasks.
Authors: H\'ector Laria, Alex Gomez-Villa, Imad Eddine Marouf, Kai Wang, Bogdan Raducanu, Joost van de Weijer
Abstract: Recent advances in diffusion models have significantly enhanced image generation capabilities. However, customizing these models with new classes often leads to unintended consequences that compromise their reliability. We introduce the concept of open-world forgetting to emphasize the vast scope of these unintended alterations, contrasting it with the well-studied closed-world forgetting, which is measurable by evaluating performance on a limited set of classes or skills. Our research presents the first comprehensive investigation into open-world forgetting in diffusion models, focusing on semantic and appearance drift of representations. We utilize zero-shot classification to analyze semantic drift, revealing that even minor model adaptations lead to unpredictable shifts affecting areas far beyond newly introduced concepts, with dramatic drops in zero-shot classification of up to 60%. Additionally, we observe significant changes in texture and color of generated content when analyzing appearance drift. To address these issues, we propose a mitigation strategy based on functional regularization, designed to preserve original capabilities while accommodating new concepts. Our study aims to raise awareness of unintended changes due to model customization and advocates for the analysis of open-world forgetting in future research on model customization and finetuning methods. Furthermore, we provide insights for developing more robust adaptation methodologies.
Authors: Yujun Zhou, Jingdong Yang, Kehan Guo, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
Abstract: Laboratory accidents pose significant risks to human life and property, underscoring the importance of robust safety protocols. Despite advancements in safety training, laboratory personnel may still unknowingly engage in unsafe practices. With the increasing reliance on large language models (LLMs) for guidance in various fields, including laboratory settings, there is a growing concern about their reliability in critical safety-related decision-making. Unlike trained human researchers, LLMs lack formal lab safety education, raising questions about their ability to provide safe and accurate guidance. Existing research on LLM trustworthiness primarily focuses on issues such as ethical compliance, truthfulness, and fairness but fails to fully cover safety-critical real-world applications, like lab safety. To address this gap, we propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive evaluation framework based on a new taxonomy aligned with Occupational Safety and Health Administration (OSHA) protocols. This benchmark includes 765 multiple-choice questions verified by human experts, assessing the performance of LLMs and vision language models (VLMs) in lab safety contexts. Our evaluations demonstrate that while GPT-4o outperforms human participants, it is still prone to critical errors, highlighting the risks of relying on LLMs in safety-critical environments. Our findings emphasize the need for specialized benchmarks to accurately assess the trustworthiness of LLMs in real-world safety applications.
Authors: Yanhao Jin, Krishnakumar Balasubramanian, Lifeng Lai
Abstract: We theoretically investigate the in-context learning capabilities of transformers in the context of learning mixtures of linear regression models. For the case of two mixtures, we demonstrate the existence of transformers that can achieve an accuracy, relative to the oracle predictor, of order $\mathcal{\tilde{O}}((d/n)^{1/4})$ in the low signal-to-noise ratio (SNR) regime and $\mathcal{\tilde{O}}(\sqrt{d/n})$ in the high SNR regime, where $n$ is the length of the prompt, and $d$ is the dimension of the problem. Additionally, we derive in-context excess risk bounds of order $\mathcal{O}(L/\sqrt{B})$, where $B$ denotes the number of (training) prompts, and $L$ represents the number of attention layers. The order of $L$ depends on whether the SNR is low or high. In the high SNR regime, we extend the results to $K$-component mixture models for finite $K$. Extensive simulations also highlight the advantages of transformers for this task, outperforming other baselines such as the Expectation-Maximization algorithm.
Authors: Xiaochuan Li, Zichun Yu, Chenyan Xiong
Abstract: Synthetic data has been widely used to train large language models, but their generative nature inevitably introduces noisy, non-informative, and misleading learning signals. In this paper, we propose Montessori-Instruct, a novel data synthesis framework that tailors the data synthesis ability of the teacher language model toward the student language model's learning process. Specifically, we utilize local data influence of synthetic training data points on students to characterize students' learning preferences. Then, we train the teacher model with Direct Preference Optimization (DPO) to generate synthetic data tailored toward student learning preferences. Experiments with Llama3-8B-Instruct (teacher) and Llama3-8B (student) on Alpaca Eval and MT-Bench demonstrate that Montessori-Instruct significantly outperforms standard synthesis methods by 18.35\% and 46.24\% relatively. Our method also beats data synthesized by a stronger teacher model, GPT-4o. Further analysis confirms the benefits of teacher's learning to generate more influential training data in the student's improved learning, the advantages of local data influence in accurately measuring student preferences, and the robustness of Montessori-Instruct across different student models. Our code and data are open-sourced at https://github.com/cxcscmu/Montessori-Instruct.
Authors: Michael Ben Ali (IRIT), Omar El-Rifai (IRIT), Imen Megdiche (IRIT, IRIT-SIG, INUC), Andr\'e Peninou (IRIT, IRIT-SIG, UT2J), Olivier Teste (IRIT-SIG, IRIT, UT2J, UT)
Abstract: Over recent years, Federated Learning (FL) has proven to be one of the most promising methods of distributed learning which preserves data privacy. As the method evolved and was confronted with various real-world scenarios, new challenges have emerged. One such challenge is the presence of highly heterogeneous (often referred to as non-IID) data distributions among participants of the FL protocol. A popular solution to this hurdle is Clustered Federated Learning (CFL), which aims to partition clients into groups where the distributions are homogeneous. In the literature, state-of-the-art CFL algorithms are often tested using a few cases of data heterogeneities, without systematically justifying the choices. Further, the taxonomy used for differentiating the different heterogeneity scenarios is not always straightforward. In this paper, we explore the performance of two state-of-the-art CFL algorithms with respect to a proposed taxonomy of data heterogeneities in federated learning (FL). We work with three image classification datasets and analyze the resulting clusters against the heterogeneity classes using extrinsic clustering metrics. Our objective is to provide a clearer understanding of the relationship between CFL performances and data heterogeneity scenarios.
Authors: Sushmita Paul, Jinqiang Yu, Jip J. Dekker, Alexey Ignatiev, Peter J. Stuckey
Abstract: Despite the practical success of Artificial Intelligence (AI), current neural AI algorithms face two significant issues. First, the decisions made by neural architectures are often prone to bias and brittleness. Second, when a chain of reasoning is required, neural systems often perform poorly. Neuro-symbolic artificial intelligence is a promising approach that tackles these (and other) weaknesses by combining the power of neural perception and symbolic reasoning. Meanwhile, the success of AI has made it critical to understand its behaviour, leading to the development of explainable artificial intelligence (XAI). While neuro-symbolic AI systems have important advantages over purely neural AI, we still need to explain their actions, which are obscured by the interactions of the neural and symbolic components. To address the issue, this paper proposes a formal approach to explaining the decisions of neuro-symbolic systems. The approach hinges on the use of formal abductive explanations and on solving the neuro-symbolic explainability problem hierarchically. Namely, it first computes a formal explanation for the symbolic component of the system, which serves to identify a subset of the individual parts of neural information that needs to be explained. This is followed by explaining only those individual neural inputs, independently of each other, which facilitates succinctness of hierarchical formal explanations and helps to increase the overall performance of the approach. Experimental results for a few complex reasoning tasks demonstrate practical efficiency of the proposed approach, in comparison to purely neural systems, from the perspective of explanation size, explanation time, training time, model sizes, and the quality of explanations reported.
Authors: Zia-ur-Rehman, Arif Mahmood, Wenxiong Kang
Abstract: Self-supervised learning systems have gained significant attention in recent years by leveraging clustering-based pseudo-labels to provide supervision without the need for human annotations. However, the noise in these pseudo-labels caused by the clustering methods poses a challenge to the learning process leading to degraded performance. In this work, we propose a pseudo-label refinement (SLR) algorithm to address this issue. The cluster labels from the previous epoch are projected to the current epoch cluster-labels space and a linear combination of the new label and the projected label is computed as a soft refined label containing the information from the previous epoch clusters as well as from the current epoch. In contrast to the common practice of using the maximum value as a cluster/class indicator, we employ hierarchical clustering on these soft pseudo-labels to generate refined hard-labels. This approach better utilizes the information embedded in the soft labels, outperforming the simple maximum value approach for hard label generation. The effectiveness of the proposed SLR algorithm is evaluated in the context of person re-identification (Re-ID) using unsupervised domain adaptation (UDA). Experimental results demonstrate that the modified Re-ID baseline, incorporating the SLR algorithm, achieves significantly improved mean Average Precision (mAP) performance in various UDA tasks, including real-to-synthetic, synthetic-to-real, and different real-to-real scenarios. These findings highlight the efficacy of the SLR algorithm in enhancing the performance of self-supervised learning systems.
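The refinement step described above (project the previous epoch's cluster labels into the current label space, then linearly combine them with the new labels) can be sketched roughly as follows. The abstract does not specify how the projection is computed, so this example assumes a row-normalised overlap matrix between previous and current clusters; that choice, the mixing weight `alpha`, and the function names are all hypothetical. The subsequent hierarchical clustering of the soft labels is not shown.

```python
def overlap_projection(prev_labels, curr_labels, n_prev, n_curr):
    """ASSUMED projection: row a holds the fraction of previous cluster a
    that falls into each current cluster."""
    counts = [[0.0] * n_curr for _ in range(n_prev)]
    for a, b in zip(prev_labels, curr_labels):
        counts[a][b] += 1.0
    projection = []
    for row in counts:
        total = sum(row)
        projection.append([c / total if total else 0.0 for c in row])
    return projection


def refine_soft_labels(prev_labels, curr_labels, n_prev, n_curr, alpha=0.5):
    """Soft label = alpha * one-hot(current label) + (1 - alpha) * projected previous label."""
    projection = overlap_projection(prev_labels, curr_labels, n_prev, n_curr)
    refined = []
    for a, b in zip(prev_labels, curr_labels):
        one_hot = [1.0 if j == b else 0.0 for j in range(n_curr)]
        refined.append(
            [alpha * o + (1.0 - alpha) * p for o, p in zip(one_hot, projection[a])]
        )
    return refined


# Toy example: five samples, two clusters in each epoch.
prev = [0, 0, 0, 1, 1]
curr = [1, 1, 0, 0, 0]
for row in refine_soft_labels(prev, curr, n_prev=2, n_curr=2):
    print([round(v, 3) for v in row])
```

In this toy case, a sample whose previous cluster mostly agrees with its current cluster keeps a confident label, while a sample that moved against the majority of its previous cluster receives a softer one.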
Authors: Chen Zhang, Meizhi Zhong, Qimeng Wang, Xuantao Lu, Zheyu Ye, Chengqiang Lu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang, Dawei Song
Abstract: Long-context efficiency has recently become a trending topic in serving large language models (LLMs). Mixture of depths (MoD) has been proposed as a perfect fit to bring down both latency and memory. In this paper, however, we discover that MoD can barely transform existing LLMs without costly training over an extensive number of tokens. To enable the transformation of any LLM into a MoD one, we show that the top-k operator in MoD should be promoted to a threshold-p operator, and that refinements to architecture and data should be crafted alongside it. All these designs form our method, termed MoDification. Through a comprehensive set of experiments covering model scales from 3B to 70B, we show that MoDification strikes an excellent balance between efficiency and effectiveness. MoDification can achieve up to ~1.2x speedup in latency and ~1.8x reduction in memory compared to the original LLMs, especially in long-context applications.
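To make the difference between the two routing operators concrete, here is a minimal Python sketch that selects which tokens a layer would process from a list of router scores. It is an illustration under stated assumptions, not the MoDification implementation: the function names, the strict `>` comparison, and the example scores are all hypothetical.

```python
def route_top_k(scores, k):
    """Select exactly k token indices with the highest router scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def route_threshold_p(scores, p):
    """Select every token whose router score exceeds the threshold p."""
    return [i for i, s in enumerate(scores) if s > p]


# Hypothetical per-token router scores for one layer.
scores = [0.91, 0.12, 0.55, 0.08, 0.73, 0.49]

print(route_top_k(scores, k=3))          # [0, 2, 4]
print(route_threshold_p(scores, p=0.5))  # [0, 2, 4]
print(route_threshold_p(scores, p=0.1))  # [0, 1, 2, 4, 5]
```

Top-k always keeps a fixed number of tokens and needs the scores of the whole set being ranked, whereas a threshold can be evaluated one token at a time and lets the number of processed tokens vary with the input.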
Authors: Floris Holstege, Bram Wouters, Noud van Giersbergen, Cees Diks
Abstract: A distribution shift between the training and test data can severely harm performance of machine learning models. Importance weighting addresses this issue by assigning different weights to data points during training. We argue that existing heuristics for determining the weights are suboptimal, as they neglect the increase of the variance of the estimated model due to the finite sample size of the training data. We interpret the optimal weights in terms of a bias-variance trade-off, and propose a bi-level optimization procedure in which the weights and model parameters are optimized simultaneously. We apply this optimization to existing importance weighting techniques for last-layer retraining of deep neural networks in the presence of sub-population shifts and show empirically that optimizing weights significantly improves generalization performance.
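The variance cost of importance weighting mentioned above can be illustrated with a small Python sketch. It computes an importance-weighted average loss and Kish's effective sample size, a standard diagnostic that shrinks as weights become more uneven. This is not the authors' bi-level procedure; the function names, the inverse-group-frequency weights, and the toy losses are assumptions for illustration.

```python
def weighted_risk(losses, weights):
    """Importance-weighted average of per-example losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


def effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Toy data: 8 majority-group examples and 2 minority-group examples.
groups = ["maj"] * 8 + ["min"] * 2
losses = [0.2] * 8 + [1.0] * 2

uniform = [1.0] * len(groups)
# Heuristic weights inversely proportional to group size (an assumed baseline).
group_size = {"maj": 8, "min": 2}
inverse_freq = [1.0 / group_size[g] for g in groups]

for name, w in [("uniform", uniform), ("inverse-frequency", inverse_freq)]:
    print(name, round(weighted_risk(losses, w), 3), round(effective_sample_size(w), 3))
```

Up-weighting the minority group makes the weighted risk reflect both groups equally, but the effective sample size drops from 10 to 6.4, which is one way to see the bias-variance trade-off the abstract refers to.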
Authors: Frank Nielsen
Abstract: The symmetric Kullback-Leibler centroid, also called the Jeffreys centroid, of a set of mutually absolutely continuous probability distributions on a measure space provides a notion of centrality which has proven useful in many tasks including information retrieval, information fusion, and clustering in image, video and sound processing. However, the Jeffreys centroid is not available in closed form for sets of categorical or normal distributions, two widely used statistical models, and thus needs to be approximated numerically in practice. In this paper, we first propose the new Jeffreys-Fisher-Rao center, defined as the Fisher-Rao midpoint of the sided Kullback-Leibler centroids, as a plug-in replacement of the Jeffreys centroid. This Jeffreys-Fisher-Rao center admits a generic formula for uni-parameter exponential family distributions and closed-form formulas for categorical and normal distributions, matches exactly the Jeffreys centroid for same-mean normal distributions, and is experimentally observed in practice to be close to the Jeffreys centroid. Second, we define a new type of inductive center generalizing the principle of the Gauss arithmetic-geometric double sequence mean for pairs of densities of any given exponential family. This center is shown experimentally to approximate the Jeffreys centroid very well and is suggested for use when the Jeffreys-Fisher-Rao center is not available in closed form. Moreover, this Gauss-Bregman inductive center always converges and matches the Jeffreys centroid for sets of same-mean normal distributions. We report on our experiments demonstrating the use of the Jeffreys-Fisher-Rao and Gauss-Bregman centers instead of the Jeffreys centroid. Finally, we conclude this work by reinterpreting these fast proxy centers of Jeffreys centroids under the lens of dually flat spaces in information geometry.
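For categorical distributions, the construction "Fisher-Rao midpoint of the sided Kullback-Leibler centroids" can be sketched in a few lines of Python. The sketch assumes the standard closed forms: the two sided KL centroids are the arithmetic mean and the normalized geometric mean, and the Fisher-Rao midpoint is obtained through the square-root embedding onto the sphere. Uniform weights, strictly positive probabilities, and all function names are assumptions of this example rather than details taken from the paper.

```python
import math


def arithmetic_mean(dists):
    """One sided KL centroid: the coordinate-wise arithmetic mean."""
    n = len(dists)
    return [sum(p[i] for p in dists) / n for i in range(len(dists[0]))]


def normalized_geometric_mean(dists):
    """The other sided KL centroid: the geometric mean, renormalized to sum to 1."""
    n = len(dists)
    g = [
        math.exp(sum(math.log(p[i]) for p in dists) / n)
        for i in range(len(dists[0]))
    ]
    z = sum(g)
    return [x / z for x in g]


def fisher_rao_midpoint(p, q):
    """Geodesic midpoint under the Fisher-Rao metric, via the sqrt embedding."""
    s = [(math.sqrt(a) + math.sqrt(b)) ** 2 for a, b in zip(p, q)]
    z = sum(s)
    return [x / z for x in s]


def jeffreys_fisher_rao_center(dists):
    """Fisher-Rao midpoint of the two sided KL centroids."""
    return fisher_rao_midpoint(
        arithmetic_mean(dists), normalized_geometric_mean(dists)
    )


# Two strictly positive categorical distributions on three outcomes.
dists = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
center = jeffreys_fisher_rao_center(dists)
print([round(c, 4) for c in center], round(sum(center), 6))
```

The result is itself a categorical distribution, and when all inputs are identical the center coincides with that common distribution.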
Authors: Taras Kucherenko, Derek Peristy, Judith B\"utepage
Abstract: Animation data is often obtained through optical motion capture systems, which utilize a multitude of cameras to establish the position of optical markers. However, system errors or occlusions can result in missing markers, the manual cleaning of which can be time-consuming. This has sparked interest in machine learning-based solutions for missing marker reconstruction in the academic community. Most academic papers utilize a simplistic mean square error as the main metric. In this paper, we show that this metric does not correlate with subjective perception of the fill quality. We introduce and evaluate a set of better-correlated metrics that can drive progress in the field.
Authors: Iraklis Giannakis, Craig Warren, Antonios Giannopoulos, Georgios Leontidis, Yan Su, Feng Zhou, Javier Martin-Torres, Nectaria Diamanti
Abstract: Ground-penetrating radar (GPR) is a mature geophysical method that has gained increasing popularity in planetary science over the past decade. GPR has been utilised for both Lunar and Martian missions, providing pivotal information regarding the near-surface geology of terrestrial planets. Within that context, numerous processing pipelines have been suggested to address the unique challenges present in planetary setups. These processing pipelines often require manual tuning, resulting in ambiguous outputs open to non-unique interpretations. These pitfalls, combined with the large volume of planetary GPR data (kilometers in magnitude), highlight the necessity for automatic, objective and advanced processing and interpretation schemes. The current paper investigates the potential of deep learning for interpreting and processing GPR data. The one-shot multi-offset configuration is investigated via a coherent numerical case study, showcasing the potential of deep learning for A) reconstructing the dielectric distribution of the near surface of terrestrial planets, and B) filling missing or bad-quality traces. Special care was taken for the numerical data to be both realistic and challenging. Moreover, the generated synthetic data are properly labelled and made publicly available for training future data-driven pipelines and contributing towards developing pre-trained foundation models for GPR.
Authors: Hubert Siuzdak, Florian Grötschla, Luca A. Lanzendörfer
Abstract: Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks. This paper proposes the Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions. By applying a hierarchy of quantizers at variable frame rates, the codec adapts to the audio structure across multiple timescales. This leads to more efficient compression, as demonstrated by extensive objective and subjective evaluations. The code and model weights are open-sourced at https://github.com/hubertsiuzdak/snac.
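As a rough illustration of the cascade idea behind Residual Vector Quantization, the sketch below encodes a single vector with two hand-written codebooks, each stage quantizing the residual left by the previous one. It shows plain single-rate RVQ only, not the multi-scale variant proposed in the abstract; the codebooks and function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest_index(codebook: Codebook, x: Vector) -> int:
    """Index of the codeword closest to x in squared Euclidean distance."""
    def sq_dist(c: Vector) -> float:
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage sees the previous residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest_index(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a fine stage.
COARSE = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
FINE = [[-0.25, 0.25], [0.25, -0.25], [0.0, 0.0]]

if __name__ == "__main__":
    codes = rvq_encode([1.2, 0.8], [COARSE, FINE])
    print(codes, rvq_decode(codes, [COARSE, FINE]))
```

In a learned codec the codebooks are trained and applied per frame; a multi-scale design would additionally run the stages at different frame rates, which this toy example does not attempt.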
Authors: Benoit Oriol
Abstract: The spectrum of the weighted sample covariance shows an asymptotic non-random behavior when the dimension grows with the number of samples. In this setting, we prove that the asymptotic spectral distribution $F$ of the weighted sample covariance has a continuous density on $\mathbb{R}^*$. We then address the practical problem of numerically finding this density. We propose a procedure to compute it, to determine the support of $F$, and to define an efficient grid on it. We use this procedure to design the $\textit{WeSpeR}$ algorithm, which estimates the spectral density and retrieves the true spectral covariance spectrum. Empirical tests confirm the good properties of the $\textit{WeSpeR}$ algorithm.
Authors: Benoit Oriol
Abstract: We compute asymptotic non-linear shrinkage formulas for covariance and precision matrix estimators for weighted sample covariances, in the spirit of Ledoit and Péché. We detail explicitly the formulas for exponentially-weighted sample covariances. These new tools pave the way for applying non-linear shrinkage methods to weighted sample covariances. We show experimentally the performance of the asymptotic shrinkage formulas. Finally, we test the robustness of the theory to heavy-tailed distributions.
Authors: Cynthia Maldonado-Garcia, Arezoo Zakeri, Alejandro F Frangi, Nishant Ravikumar
Abstract: Early identification of patients at risk of cardiovascular diseases (CVD) is crucial for effective preventive care, reducing healthcare burden, and improving patients' quality of life. This study demonstrates the potential of retinal optical coherence tomography (OCT) imaging combined with fundus photographs for identifying future adverse cardiac events. We used data from 977 patients who experienced CVD within a 5-year interval post-image acquisition, alongside 1,877 control participants without CVD, totaling 2,854 subjects. We propose a novel binary classification network based on a Multi-channel Variational Autoencoder (MCVAE), which learns a latent embedding of patients' fundus and OCT images to classify individuals into two groups: those likely to develop CVD in the future and those who are not. Our model, trained on both imaging modalities, achieved promising results (AUROC 0.78 +/- 0.02, accuracy 0.68 +/- 0.002, precision 0.74 +/- 0.02, sensitivity 0.73 +/- 0.02, and specificity 0.68 +/- 0.01), demonstrating its efficacy in identifying patients at risk of future CVD events based on their retinal images. This study highlights the potential of retinal OCT imaging and fundus photographs as cost-effective, non-invasive alternatives for predicting cardiovascular disease risk. The widespread availability of these imaging techniques in optometry practices and hospitals further enhances their potential for large-scale CVD risk screening. Our findings contribute to the development of standardized, accessible methods for early CVD risk identification, potentially improving preventive care strategies and patient outcomes.
Authors: Rui Hu, Qian He, Gaofeng He, Jiedong Zhuang, Huang Chen, Huafeng Liu, Huamin Wang
Abstract: Modeling and producing lifelike clothed human images has attracted researchers' attention from different areas for decades, owing to the complexity of highly articulated and structured content. Rendering algorithms decompose and simulate the imaging process of a camera, but are limited by the accuracy of modeled variables and the efficiency of computation. Generative models can produce impressively vivid human images, but still lack controllability and editability. This paper studies photorealism enhancement of rendered images, leveraging generative power from diffusion models on the controlled basis of rendering. We introduce a novel framework to translate rendered images into their realistic counterparts, which consists of two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). In DKI, we adopt positive (real) domain finetuning and negative (rendered) domain embedding to inject knowledge into a pretrained Text-to-image (T2I) diffusion model. In RIG, we generate the realistic image corresponding to the input rendered image, with a Texture-preserving Attention Control (TAC) to preserve fine-grained clothing textures, exploiting the decoupled features encoded in the UNet structure. Additionally, we introduce the SynFashion dataset, featuring high-quality digital clothing images with diverse textures. Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation.
Authors: Rabea Khatun, Wahia Tasnim, Maksuda Akter, Md Manowarul Islam, Md. Ashraf Uddin, Md. Zulfiker Mahmud, Saurav Chandra Das
Abstract: Gallbladder cancer (GBC) is the most frequent cause of disease among biliary tract neoplasms. Identifying the molecular mechanisms and biomarkers linked to GBC progression has been a significant challenge in scientific research. Few recent studies have explored the roles of biomarkers in GBC. Our study aimed to identify biomarkers in GBC using machine learning (ML) and bioinformatics techniques. We compared GBC tumor samples with normal samples to identify differentially expressed genes (DEGs) from two microarray datasets (GSE100363, GSE139682) obtained from the NCBI GEO database. A total of 146 DEGs were found, with 39 up-regulated and 107 down-regulated genes. Functional enrichment analysis of these DEGs was performed using Gene Ontology (GO) terms and REACTOME pathways through DAVID. The protein-protein interaction network was constructed using the STRING database. To identify hub genes, we applied three ranking algorithms: Degree, MNC, and Closeness Centrality. The intersection of hub genes from these algorithms yielded 11 hub genes. Simultaneously, two feature selection methods (Pearson correlation and recursive feature elimination) were used to identify significant gene subsets. We then developed ML models using SVM and RF on the GSE100363 dataset, with validation on GSE139682, to determine the gene subset that best distinguishes GBC samples. The hub genes outperformed the other gene subsets. Finally, NTRK2, COL14A1, SCN4B, ATP1A2, SLC17A7, SLIT3, COL7A1, CLDN4, CLEC3B, ADCYAP1R1, and MFAP4 were identified as crucial genes, with SLIT3, COL7A1, and CLDN4 being strongly linked to GBC development and prediction.
Authors: Peiwen Li, Menghua Wu
Abstract: Perturbation experiments allow biologists to discover causal relationships between variables of interest, but the sparsity and high dimensionality of these data pose significant challenges for causal structure learning algorithms. Biological knowledge graphs can bootstrap the inference of causal structures in these situations, but since they compile vastly diverse information, they can bias predictions towards well-studied systems. Alternatively, amortized causal structure learning algorithms encode inductive biases through data simulation and train supervised models to recapitulate these synthetic graphs. However, realistically simulating biology is arguably even harder than understanding a specific system. In this work, we take inspiration from both strategies and propose an amortized algorithm for refining domain knowledge, based on data observations. On real and synthetic datasets, we show that our approach outperforms baselines in recovering ground truth causal graphs and identifying errors in the prior knowledge with limited interventional data.
Authors: Andrea Bulgarelli, Elia Cellini, Karl Jansen, Stefan Kühn, Alessandro Nada, Shinichi Nakajima, Kim A. Nicoli, Marco Panero
Abstract: We introduce a novel technique to numerically calculate Rényi entanglement entropies in lattice quantum field theory using generative models. We describe how flow-based approaches can be combined with the replica trick using a custom neural-network architecture around a lattice defect connecting two replicas. Numerical tests for the $\phi^4$ scalar field theory in two and three dimensions demonstrate that our technique outperforms state-of-the-art Monte Carlo calculations, and exhibits a promising scaling with the defect size.
Authors: Paul Gavrikov, Shashank Agnihotri, Margret Keuper, Janis Keuper
Abstract: Not all learnable parameters (e.g., weights) contribute equally to a neural network's decision function. In fact, entire layers' parameters can sometimes be reset to random values with little to no impact on the model's decisions. We revisit earlier studies that examined how architecture and task complexity influence this phenomenon and ask: is this phenomenon also affected by how we train the model? We conducted experimental evaluations on a diverse set of ImageNet-1k classification models to explore this, keeping the architecture and training data constant but varying the training pipeline. Our findings reveal that the training method strongly influences which layers become critical to the decision function for a given task. For example, improved training regimes and self-supervised training increase the importance of early layers while significantly under-utilizing deeper layers. In contrast, methods such as adversarial training display an opposite trend. Our preliminary results extend previous findings, offering a more nuanced understanding of the inner mechanics of neural networks. Code: https://github.com/paulgavrikov/layer_criticality
Authors: Cody Clop, Yannick Teglia
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in generating coherent text but remain limited by the static nature of their training data. Retrieval Augmented Generation (RAG) addresses this issue by combining LLMs with up-to-date information retrieval, but also expands the attack surface of the system. This paper investigates prompt injection attacks on RAG, focusing on malicious objectives beyond misinformation, such as inserting harmful links, promoting unauthorized services, and initiating denial-of-service behaviors. We build upon existing corpus poisoning techniques and propose a novel backdoor attack aimed at the fine-tuning process of the dense retriever component. Our experiments reveal that corpus poisoning can achieve significant attack success rates through the injection of a small number of compromised documents into the retriever corpus. In contrast, backdoor attacks demonstrate even higher success rates but necessitate a more complex setup, as the victim must fine-tune the retriever using the attacker-poisoned dataset.
Authors: Hugh Dance, Peter Orbanz, Arthur Gretton
Abstract: Accurate uncertainty quantification for causal effects is essential for robust decision making in complex systems, but remains challenging in non-parametric settings. One promising framework represents conditional distributions in a reproducing kernel Hilbert space and places Gaussian process priors on them to infer posteriors on causal effects, but requires restrictive nuclear dominant kernels and approximations that lead to unreliable uncertainty estimates. In this work, we introduce a method, IMPspec, that addresses these limitations via a spectral representation of the Hilbert space. We show that posteriors in this model can be obtained explicitly, by extending a result in Hilbert space regression theory. We also learn the spectral representation to optimise posterior calibration. Our method achieves state-of-the-art performance in uncertainty quantification and causal Bayesian optimisation across simulations and a healthcare application.
Authors: Maksuda Akter, Rabea Khatun, Md. Alamin Talukder, Md. Manowarul Islam, Md. Ashraf Uddin
Abstract: Skin cancer is a serious and potentially fatal disease caused by DNA damage. Early detection significantly increases survival rates, making accurate diagnosis crucial. In this groundbreaking study, we present a hybrid framework based on Deep Learning (DL) that achieves precise classification of benign and malignant skin lesions. Our approach begins with dataset preprocessing to enhance classification accuracy, followed by training two separate pre-trained DL models, InceptionV3 and DenseNet121. By fusing the results of each model using the weighted sum rule, our system achieves exceptional accuracy rates. Specifically, we achieve a 92.27% detection accuracy rate, 92.33% sensitivity, 92.22% specificity, 90.81% precision, and 91.57% F1-score, outperforming existing models and demonstrating the robustness and trustworthiness of our hybrid approach. Our study represents a significant advance in skin cancer diagnosis and provides a promising foundation for further research in the field. With the potential to save countless lives through earlier detection, our hybrid deep-learning approach is a game-changer in the fight against skin cancer.
Authors: Lakshmideepakreddy Manda, Shaoru Chen, Mahyar Fazlyab
Abstract: Learning-based approaches for constructing Control Barrier Functions (CBFs) are increasingly being explored for safety-critical control systems. However, these methods typically require complete retraining when applied to unseen environments, limiting their adaptability. To address this, we propose a self-supervised deep operator learning framework that learns the mapping from environmental parameters to the corresponding CBF, rather than learning the CBF directly. Our approach leverages the residual of a parametric Partial Differential Equation (PDE), where the solution defines a parametric CBF approximating the maximal control invariant set. This framework accommodates complex safety constraints, higher relative degrees, and actuation limits. We demonstrate the effectiveness of the method through numerical experiments on navigation tasks involving dynamic obstacles.
Authors: Qiyuan Chen, Raed Al Kontar
Abstract: This paper introduces a framework for Bayesian Optimization (BO) with metric movement costs, addressing a critical challenge in practical applications where input alterations incur varying costs. Our approach is a convenient plug-in that seamlessly integrates with the existing literature on batched algorithms, where designs within batches are observed following the solution of a Traveling Salesman Problem. The proposed method provides a theoretical guarantee of convergence in terms of movement costs for BO. Empirically, our method effectively reduces average movement costs over time while maintaining comparable regret performance to conventional BO methods. This framework also shows promise for broader applications in various bandit settings with movement costs.
Authors: Jérome Eertmans, Enrico Maria Vittuci, Vittorio Degli Esposti, Laurent Jacques, Claude Oestges
Abstract: With the increasing presence of dynamic scenarios, such as Vehicle-to-Vehicle communications, radio propagation modeling tools must adapt to the rapidly changing nature of the radio channel. Recently, both Differentiable and Dynamic Ray Tracing frameworks have emerged to address these challenges. However, there is often confusion about how these approaches differ and which one should be used in specific contexts. In this paper, we provide an overview of these two techniques and a comparative analysis against two state-of-the-art tools: 3DSCAT from UniBo and Sionna from NVIDIA. To provide a more precise characterization of the scope of these methods, we introduce a novel simulation-based metric, the Multipath Lifetime Map, which enables the evaluation of spatial and temporal coherence in radio channels based only on the geometrical description of the environment. Finally, our metrics are evaluated on a classic urban street canyon scenario, yielding results similar to those obtained from measurement campaigns.
Authors: Weichun Xia, Jiaxin Jiang, Lei Shi
Abstract: We introduce a novel diffusion-based spectral algorithm to tackle regression analysis on high-dimensional data, particularly data embedded within lower-dimensional manifolds. Traditional spectral algorithms often fall short in such contexts, primarily due to their reliance on predetermined kernel functions, which inadequately address the complex structures inherent in manifold-based data. By employing graph Laplacian approximation, our method uses the local estimation property of the heat kernel, offering an adaptive, data-driven approach to overcome this obstacle. Another distinct advantage of our algorithm lies in its semi-supervised learning framework, enabling it to fully use the additional unlabeled data. This ability enhances performance by allowing the algorithm to explore the spectrum and curvature of the data manifold, providing a more comprehensive understanding of the dataset. Moreover, our algorithm performs in an entirely data-driven manner, operating directly within the intrinsic manifold structure of the data, without requiring any predefined manifold information. We provide a convergence analysis of our algorithm. Our findings reveal that the algorithm achieves a convergence rate that depends solely on the intrinsic dimension of the underlying manifold, thereby avoiding the curse of dimensionality associated with the higher ambient dimension.
Authors: Thennal D K, Tim Fischer, Chris Biemann
Abstract: Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning the last $p\%$ of the layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30\% of layers with negligible impact on performance and up to 80\% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose $\text{L}^3 \text{Prune}$, a novel layer-pruning strategy based on the model's initial loss that provides two optimal pruning configurations: a large variant with negligible performance loss and a small variant for resource-constrained settings. On average, the large variant prunes 21\% of the parameters with a $-0.3$ performance drop, and the small variant only suffers from a $-5.1$ decrease while pruning 74\% of the model. We consider these results strong evidence that LLMs are overparameterized for text embedding tasks, and can be easily pruned.
Authors: Francesca Bartolucci, Marcello Carioni, José A. Iglesias, Yury Korolev, Emanuele Naldi, Stefano Vigogna
Abstract: We revisit the mean field parametrization of shallow neural networks, using signed measures on unbounded parameter spaces and duality pairings that take into account the regularity and growth of activation functions. This setting directly leads to the use of unbalanced Kantorovich-Rubinstein norms defined by duality with Lipschitz functions, and of spaces of measures dual to those of continuous functions with controlled growth. These make transparent the need for total variation and moment bounds or penalization to obtain existence of minimizers of variational formulations, under which we prove a compactness result in strong Kantorovich-Rubinstein norm, and in the absence of which we show several examples demonstrating undesirable behavior. Further, the Kantorovich-Rubinstein setting enables us to combine the advantages of a completely linear parametrization and the ensuing reproducing kernel Banach space framework with optimal transport insights. We showcase this synergy with representer theorems and uniform large data limits for empirical risk minimization, and in proposed formulations for distillation and fusion applications.
Authors: Colin Dirren, Mattia Bianchi, Panagiotis D. Grontas, John Lygeros, Florian Dörfler
Abstract: We study the convex-concave bilinear saddle-point problem $\min_x \max_y f(x) + y^\top Ax - g(y)$, where both, only one, or none of the functions $f$ and $g$ are strongly convex, and suitable rank conditions on the matrix $A$ hold. The solution of this problem is at the core of many machine learning tasks. By employing tools from operator theory, we systematically prove the contractivity (in turn, the linear convergence) of several first-order primal-dual algorithms, including the Chambolle-Pock method. Our approach results in concise and elegant proofs, and it yields new convergence guarantees and tighter bounds compared to known results.
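For intuition about the saddle-point problem above, here is a minimal sketch of the Chambolle-Pock iteration on a scalar instance, assuming $f(x) = \frac{\mu_f}{2}x^2$, $g(y) = \frac{\mu_g}{2}y^2$ and $A = a$, so the unique saddle point is $(0, 0)$. The quadratic choices, step sizes, and function name are assumptions for illustration, not the setting or analysis of the paper.

```python
from typing import Tuple


def chambolle_pock_scalar(a: float = 1.0,
                          mu_f: float = 1.0,
                          mu_g: float = 1.0,
                          tau: float = 0.5,
                          sigma: float = 0.5,
                          x0: float = 1.0,
                          y0: float = 1.0,
                          iters: int = 200) -> Tuple[float, float]:
    """Chambolle-Pock on min_x max_y (mu_f/2) x^2 + a*x*y - (mu_g/2) y^2.

    Illustrative assumption: step sizes satisfy tau * sigma * a**2 < 1.
    For a quadratic h(u) = (mu/2) u^2, prox_{t h}(v) = v / (1 + t * mu).
    """
    x, y = x0, y0
    for _ in range(iters):
        # Primal step: proximal map of tau*f at x - tau * A^T y.
        x_new = (x - tau * a * y) / (1.0 + tau * mu_f)
        # Extrapolation of the primal iterate.
        x_bar = 2.0 * x_new - x
        # Dual step: proximal map of sigma*g at y + sigma * A x_bar.
        y = (y + sigma * a * x_bar) / (1.0 + sigma * mu_g)
        x = x_new
    return x, y


if __name__ == "__main__":
    print(chambolle_pock_scalar())
```

With the default values the iteration is a fixed linear map whose eigenvalues have modulus below one, so the iterates contract to the saddle point at a linear rate, which is the kind of behavior the abstract refers to.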
Authors: Arman Adibi, Sanjeev Kulkarni, H. Vincent Poor, Taposh Banerjee, Vahid Tarokh
Abstract: This paper addresses the problem of detecting changes when only unnormalized pre- and post-change distributions are accessible. This situation arises in many scenarios in physics, such as ferromagnetism, crystallography, magneto-hydrodynamics, and thermodynamics, where the energy models are difficult to normalize. Our approach is based on the estimation of the Cumulative Sum (CUSUM) statistic, which is known to produce optimal performance. We first present an intuitively appealing approximation method. Unfortunately, this produces a biased estimator of the CUSUM statistic and may cause performance degradation. We then propose the Log-Partition Approximation Cumulative Sum (LPA-CUSUM) algorithm based on thermodynamic integration (TI) in order to estimate the log-ratio of normalizing constants of pre- and post-change distributions. It is proved that this approach gives an unbiased estimate of the log-partition function and the CUSUM statistic, and leads to asymptotically optimal performance. Moreover, we derive a relationship between the required sample size for thermodynamic integration and the desired detection delay performance, offering guidelines for practical parameter selection. Numerical studies are provided demonstrating the efficacy of our approach.
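The classical CUSUM recursion that the abstract builds on can be sketched as follows, assuming fully normalized Gaussian pre- and post-change densities so that the log-likelihood ratio is available exactly. This is the textbook statistic only; the LPA-CUSUM estimator for unnormalized models, which requires estimating the log-ratio of normalizing constants, is not shown, and all names and parameter values are hypothetical.

```python
import math
from typing import Iterable, Optional


def gaussian_log_pdf(x: float, mean: float, std: float) -> float:
    """Log-density of a normal distribution N(mean, std^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


def cusum_detect(samples: Iterable[float],
                 pre_mean: float,
                 post_mean: float,
                 std: float,
                 threshold: float) -> Optional[int]:
    """Return the first index at which the CUSUM statistic reaches the threshold.

    The statistic follows W_t = max(0, W_{t-1} + log q(x_t) - log p(x_t)),
    where p is the pre-change and q the post-change density.
    Returns None if no alarm is raised.
    """
    stat = 0.0
    for t, x in enumerate(samples):
        llr = gaussian_log_pdf(x, post_mean, std) - gaussian_log_pdf(x, pre_mean, std)
        stat = max(0.0, stat + llr)
        if stat >= threshold:
            return t
    return None


if __name__ == "__main__":
    # Toy stream: the mean shifts from 0 to 1 at index 5.
    stream = [0.0] * 5 + [1.0] * 5
    print(cusum_detect(stream, pre_mean=0.0, post_mean=1.0, std=1.0, threshold=1.75))
```

When the densities are only known up to normalizing constants, the log-likelihood ratio in the recursion picks up an unknown additive term, which is the gap the proposed method is designed to close.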
Authors: Mariusz Wisniewski, Paraskevas Chatzithanos, Weisi Guo, Antonios Tsourdos
Abstract: Deep reinforcement learning (DRL) is used to enable autonomous navigation in unknown environments. Most research assumes perfect sensor data, but real-world environments may contain natural and artificial sensor noise and denial. Here, we present a benchmark of both well-used and emerging DRL algorithms in a navigation task with configurable sensor denial effects. In particular, we are interested in comparing how different DRL methods (e.g. model-free PPO vs. model-based DreamerV3) are affected by sensor denial. We show that DreamerV3 outperforms other methods in the visual end-to-end navigation task with a dynamic goal, which the other methods are not able to learn. Furthermore, DreamerV3 generally outperforms other methods in sensor-denied environments. In order to improve robustness, we use adversarial training and demonstrate improved performance in denied environments, although this generally comes with a performance cost on the vanilla environments. We anticipate this benchmark of different DRL methods and the usage of adversarial training to be a starting point for the development of more elaborate navigation strategies that are capable of dealing with uncertain and denied sensor readings.
Authors: Ameya Daigavane, Bodhi P. Vani, Saeed Saremi, Joseph Kleinhenz, Joshua Rackers
Abstract: Conformational ensembles of protein structures are immensely important both to understanding protein function, and for drug discovery in novel modalities such as cryptic pockets. Current techniques for sampling ensembles are computationally inefficient, or do not transfer to systems outside their training data. We present walk-Jump Accelerated Molecular ensembles with Universal Noise (JAMUN), a step towards the goal of efficiently sampling the Boltzmann distribution of arbitrary proteins. By extending Walk-Jump Sampling to point clouds, JAMUN enables ensemble generation at orders of magnitude faster rates than traditional molecular dynamics or state-of-the-art ML methods. Further, JAMUN is able to predict the stable basins of small peptides that were not seen during training.
Authors: Ce Sui, Deaglan J. Bartlett, Shivam Pandey, Harry Desmond, Pedro G. Ferreira, Benjamin D. Wandelt
Abstract: Current and future large scale structure surveys aim to constrain the neutrino mass and the equation of state of dark energy. We aim to construct accurate and interpretable symbolic approximations to the linear and nonlinear matter power spectra as a function of cosmological parameters in extended $\Lambda$CDM models which contain massive neutrinos and non-constant equations of state for dark energy. This constitutes an extension of the syren-halofit emulators to incorporate these two effects, which we call syren-new (SYmbolic-Regression-ENhanced power spectrum emulator with NEutrinos and $W_0-w_a$). We also obtain a simple approximation to the derived parameter $\sigma_8$ as a function of the cosmological parameters for these models. Our results for the linear power spectrum are designed to emulate CLASS, whereas for the nonlinear case we aim to match the results of EuclidEmulator2. We compare our results to existing emulators and $N$-body simulations. Our analytic emulators for $\sigma_8$, the linear and nonlinear power spectra achieve root mean squared errors of 0.1%, 0.3% and 1.3%, respectively, across a wide range of cosmological parameters, redshifts and wavenumbers. We verify that emulator-related discrepancies are subdominant compared to observational errors and other modelling uncertainties when computing shear power spectra for LSST-like surveys. Our expressions have similar accuracy to existing (numerical) emulators, but are at least an order of magnitude faster, both on a CPU and GPU. Our work greatly improves the accuracy, speed and range of applicability of current symbolic approximations to the linear and nonlinear matter power spectra. We provide publicly available code for all symbolic approximations found.
Authors: Chun Yin Kong, Picasso Vasquez, Makan Farhoodimoghadam, Chris Brandt, Titus C. Brown, Krystle L. Reagan, Allison Zwingenberger, Stefan M. Keller
Abstract: In the rapidly evolving landscape of veterinary healthcare, integrating machine learning (ML) clinical decision-making tools with electronic health records (EHRs) promises to improve diagnostic accuracy and patient care. However, the seamless integration of ML classifiers into existing EHRs in veterinary medicine is frequently hindered by the rigidity of EHR systems or the limited availability of IT resources. To address this shortcoming, we present Anna, a freely-available software solution that provides ML classifier results for EHR laboratory data in real-time.
Authors: Sandeep Nagar, Girish Varma
Abstract: The inverse of an invertible convolution is an important operation that comes up in Normalizing Flows, Image Deblurring, etc. The naive algorithm for backpropagation of this operation using Gaussian elimination has running time $O(n^3)$ where $n$ is the number of pixels in the image. We give a fast parallel backpropagation algorithm with running time $O(\sqrt{n})$ for a square image and provide a GPU implementation of the same. Inverse Convolutions are usually used in Normalizing Flows in the sampling pass, making them slow. We propose to use Inverse Convolutions in the forward (image to latent vector) pass of the Normalizing Flow. Since the sampling pass is the inverse of the forward pass, it will use convolutions only, resulting in efficient sampling times. We use our parallel backpropagation algorithm for optimizing the inverse convolution layer, which also results in fast training times. We implement this approach in various Normalizing Flow backbones, resulting in our Inverse-Flow models. We benchmark Inverse-Flow on standard datasets and show significantly improved sampling times with similar bits per dimension compared to previous models.
Authors: Maitreya Prafulla Chitale, Uday Bindal, Rajakrishnan Rajkumar, Rahul Mishra
Abstract: Summarizing movie screenplays presents a unique set of challenges compared to standard document summarization. Screenplays are not only lengthy, but also feature a complex interplay of characters, dialogues, and scenes, with numerous direct and subtle relationships and contextual nuances that are difficult for machine learning models to accurately capture and comprehend. Recent attempts at screenplay summarization focus on fine-tuning transformer-based pre-trained models, but these models often fall short in capturing long-term dependencies and latent relationships, and frequently encounter the "lost in the middle" issue. To address these challenges, we introduce DiscoGraMS, a novel resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph). This approach is well-suited for various downstream tasks, such as summarization, question-answering, and salience detection. The model aims to preserve all salient information, offering a more comprehensive and faithful representation of the screenplay's content. We further explore a baseline method that combines the CaD Graph with the corresponding movie script through a late fusion of graph and text modalities, and we present promising initial results.
Authors: Rodrigo González Laiz, Tobias Schmidt, Steffen Schneider
Abstract: Self-supervised learning (SSL) approaches have brought tremendous success across many tasks and domains. It has been argued that these successes can be attributed to a link between SSL and identifiable representation learning: Temporal structure and auxiliary variables ensure that latent representations are related to the true underlying generative factors of the data. Here, we deepen this connection and show that SSL can perform system identification in latent space. We propose DynCL, a framework to uncover linear, switching linear and non-linear dynamics under a non-linear observation model, give theoretical guarantees and validate them empirically.
Authors: Hongming Zhang, Ke Sun, Bo Xu, Linglong Kong, Martin M\"uller
Abstract: In deep reinforcement learning (RL) systems, abnormal states pose significant risks by potentially triggering unpredictable behaviors and unsafe actions, thus impeding the deployment of RL systems in real-world scenarios. It is crucial for reliable decision-making systems to have the capability to cast an alert whenever they encounter unfamiliar observations that they are not equipped to handle. In this paper, we propose a novel Mahalanobis distance-based (MD) anomaly detection framework, called \textit{MDX}, for deep RL algorithms. MDX simultaneously addresses random, adversarial, and out-of-distribution (OOD) state outliers in both offline and online settings. It utilizes Mahalanobis distance within class-conditional distributions for each action and operates within a statistical hypothesis testing framework under the Gaussian assumption. We further extend it to robust and distribution-free versions by incorporating Robust MD and conformal inference techniques. Through extensive experiments on classical control environments, Atari games, and autonomous driving scenarios, we demonstrate the effectiveness of our MD-based detection framework. MDX offers a simple, unified, and practical anomaly detection tool for enhancing the safety and reliability of RL systems in real-world applications.
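The class-conditional Mahalanobis test described in the abstract above can be illustrated with a minimal sketch. This is not the MDX implementation: the function names, the ridge term, the synthetic features, and the fixed threshold are assumptions made for illustration. Under the Gaussian assumption the squared distance is approximately chi-square distributed, so the threshold below is the 0.99 quantile for 2 degrees of freedom.

```python
import numpy as np

def fit_class_gaussians(features, actions):
    """Estimate a mean and an inverse covariance for each action class."""
    stats = {}
    for a in np.unique(actions):
        x = features[actions == a]
        mean = x.mean(axis=0)
        # Small ridge term keeps the covariance invertible (assumption of this sketch).
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
        stats[int(a)] = (mean, np.linalg.inv(cov))
    return stats

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance of x from a class-conditional Gaussian."""
    d = x - mean
    return float(d @ cov_inv @ d)

def is_anomalous(x, action, stats, threshold):
    """Flag x as an outlier for the given action if its distance exceeds the threshold."""
    mean, cov_inv = stats[action]
    return mahalanobis_sq(x, mean, cov_inv) > threshold

# Synthetic stand-in for state features and the actions taken in them.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 2))
actions = rng.integers(0, 2, size=500)
stats = fit_class_gaussians(features, actions)

# 0.99 quantile of the chi-square distribution with 2 degrees of freedom.
THRESHOLD = 9.21
```

A distribution-free variant could replace the chi-square threshold with an empirical quantile of distances on held-out data, which is the general idea behind conformal calibration.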
Authors: Philipp Baumann, Dorit S. Hochbaum
Abstract: We study here the semi-supervised $k$-clustering problem where information is available on whether pairs of objects are in the same or in different clusters. This information is either available with certainty or with a limited level of confidence. We introduce the PCCC (Pairwise-Confidence-Constraints-Clustering) algorithm, which iteratively assigns objects to clusters while accounting for the information provided on the pairs of objects. Our algorithm uses integer programming for the assignment of objects, which allows relationships to be included as hard constraints that are guaranteed to be satisfied or as soft constraints that can be violated subject to a penalty. This flexibility distinguishes our algorithm from the state-of-the-art, in which all pairwise constraints are either considered hard, or all are considered soft. We developed an enhanced multi-start approach and a model-size reduction technique for the integer program that contribute to the effectiveness and the efficiency of the algorithm. Unlike existing algorithms, our algorithm scales to large-scale instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints (which are the most challenging constraints to incorporate). We compare the PCCC algorithm with state-of-the-art approaches in an extensive computational study. Even though the PCCC algorithm is more general than the state-of-the-art approaches in its applicability, it outperforms the state-of-the-art approaches on instances with all hard or all soft constraints both in terms of runtime and various metrics of solution quality. The code of the PCCC algorithm is publicly available on GitHub.
Authors: Paul Louis, Shweta Ann Jacob, Amirali Salehi-Abari
Abstract: Link prediction on graphs is a fundamental problem. Subgraph representation learning approaches (SGRLs), by transforming link prediction to graph classification on the subgraphs around the links, have achieved state-of-the-art performance in link prediction. However, SGRLs are computationally expensive, and not scalable to large-scale graphs due to expensive subgraph-level operations. To unlock the scalability of SGRLs, we propose a new class of SGRLs, that we call Scalable Simplified SGRL (S3GRL). Aimed at faster training and inference, S3GRL simplifies the message passing and aggregation operations in each link's subgraph. S3GRL, as a scalability framework, accommodates various subgraph sampling strategies and diffusion operators to emulate computationally-expensive SGRLs. We propose multiple instances of S3GRL and empirically study them on small to large-scale graphs. Our extensive experiments demonstrate that the proposed S3GRL models scale up SGRLs without significant performance compromise (even with considerable gains in some cases), while offering substantially lower computational footprints (e.g., multi-fold inference and training speedup).
Authors: Oskar Nordenfors, Fredrik Ohlsson, Axel Flinth
Abstract: We investigate the optimization of neural networks on symmetric data, and compare the strategy of constraining the architecture to be equivariant to that of using data augmentation. Our analysis reveals that the relative geometry of the admissible and the equivariant layers, respectively, plays a key role. Under natural assumptions on the data, network, loss, and group of symmetries, we show that compatibility of the spaces of admissible layers and equivariant layers, in the sense that the corresponding orthogonal projections commute, implies that the sets of equivariant stationary points are identical for the two strategies. If the linear layers of the network are also given a unitary parametrization, the set of equivariant layers is even invariant under the gradient flow for augmented models. Our analysis, however, also reveals that even in the latter situation, stationary points may be unstable for augmented training although they are stable for the manifestly equivariant models.
Authors: Li Zeng, Xiaoliang Wan, Tao Zhou
Abstract: In this paper, we develop an invertible mapping, called B-KRnet, on a bounded domain and apply it to density estimation/approximation for data or the solutions of PDEs such as the Fokker-Planck equation and the Keller-Segel equation. Similar to KRnet, the structure of B-KRnet adapts the pseudo-triangular structure into a normalizing flow model. The main difference between B-KRnet and KRnet is that B-KRnet is defined on a hypercube while KRnet is defined on the whole space; in other words, a new mechanism is introduced in B-KRnet to maintain exact invertibility. Using B-KRnet as a transport map, we obtain an explicit probability density function (PDF) model that corresponds to the pushforward of a prior (uniform) distribution on the hypercube. It can be directly applied to density estimation when only data are available. By coupling KRnet and B-KRnet, we define a deep generative model on a high-dimensional domain where some dimensions are bounded and other dimensions are unbounded. A typical case is the solution of the stationary kinetic Fokker-Planck equation, which is a PDF of position and momentum. Based on B-KRnet, we develop an adaptive learning approach to approximate partial differential equations whose solutions are PDFs or can be treated as PDFs. A variety of numerical experiments is presented to demonstrate the effectiveness of B-KRnet.
Authors: Herbert Woisetschl\"ager, Alexander Isenko, Ruben Mayer, Shiqiang Wang, Hans-Arno Jacobsen
Abstract: Federated Learning (FL) has become a viable technique for realizing privacy-enhancing distributed deep learning on the network edge. Heterogeneous hardware, unreliable client devices, and energy constraints often characterize edge computing systems. In this paper, we propose FLEdge, which complements existing FL benchmarks by enabling a systematic evaluation of client capabilities. We focus on computational and communication bottlenecks, client behavior, and data security implications. Our experiments with models varying from 14K to 80M trainable parameters are carried out on dedicated hardware with emulated network characteristics and client behavior. We find that state-of-the-art embedded hardware has significant memory bottlenecks, leading to 4x longer processing times than on modern data center GPUs.
Authors: John Lazarsfeld, Dan Alistarh
Abstract: We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action's corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which may inform its choice in the next round. We introduce and analyze families of memoryless and time-independent protocols for this setting, inspired by opinion dynamics that are well-studied for other algorithmic tasks in the GOSSIP model. For stationary reward settings, we prove for the first time that these simple protocols exhibit best-of-both-worlds behavior, simultaneously obtaining constant cumulative regret scaling like $R(T)/T = \widetilde O(1/T)$, and also reaching consensus on the highest-mean action within $\widetilde O(\sqrt{n})$ rounds. We obtain these results by showing a new connection between the global evolution of these decentralized protocols and a class of zero-sum multiplicative weights update processes. Using this connection, we establish a general framework for analyzing the population-level regret and other properties of our protocols. Finally, we show our protocols are also surprisingly robust to adversarial rewards, and in this regime we obtain sublinear regret scaling like $R(T)/T = \widetilde O(1/\sqrt{T})$ as long as the number of rounds does not grow too fast as a function of $n$.
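For readers unfamiliar with multiplicative weights updates, the following is a minimal sketch of the textbook centralized rule, not the decentralized GOSSIP protocols or the zero-sum variant analyzed in the paper. The step size, the deterministic rewards, and the function name `mwu_step` are assumptions chosen for illustration.

```python
import numpy as np

def mwu_step(p, rewards, eta=0.1):
    """One multiplicative weights update: scale each weight by (1 + eta * reward), then renormalize."""
    w = p * (1.0 + eta * rewards)
    return w / w.sum()

# Hypothetical mean rewards of three actions; used directly as deterministic rewards here.
means = np.array([0.2, 0.5, 0.8])
p = np.full(3, 1.0 / 3.0)  # start from the uniform distribution over actions
for _ in range(200):
    p = mwu_step(p, means)
```

In this toy run the distribution concentrates on the highest-mean action, which is the population-level behavior that consensus results of this kind describe.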
Authors: Pengfei Li, Jianyi Yang, Adam Wierman, Shaolei Ren
Abstract: This paper studies decentralized online convex optimization in a networked multi-agent system and proposes a novel algorithm, Learning-Augmented Decentralized Online optimization (LADO), for individual agents to select actions only based on local online information. LADO leverages a baseline policy to safeguard online actions for worst-case robustness guarantees, while staying close to the machine learning (ML) policy for average performance improvement. In stark contrast with the existing learning-augmented online algorithms that focus on centralized settings, LADO achieves strong robustness guarantees in a decentralized setting. We also prove the average cost bound for LADO, revealing the tradeoff between average performance and worst-case robustness and demonstrating the advantage of training the ML policy by explicitly considering the robustness requirement.
Authors: Francesco Demelas, Joseph Le Roux, Mathieu Lacroix, Axel Parmentier
Abstract: Lagrangian relaxation stands among the most efficient approaches for solving Mixed Integer Linear Programs (MILPs) with difficult constraints. Given any duals for these constraints, called Lagrangian Multipliers (LMs), it returns a bound on the optimal value of the MILP, and Lagrangian methods seek the LMs giving the best such bound. But these methods generally rely on iterative algorithms resembling gradient descent to maximize the concave piecewise linear dual function: the computational burden grows quickly with the number of relaxed constraints. We introduce a deep learning approach that bypasses the descent, effectively amortizing the local, per instance, optimization. A probabilistic encoder based on a graph convolutional network computes high-dimensional representations of relaxed constraints in MILP instances. A decoder then turns these representations into LMs. We train the encoder and decoder jointly by directly optimizing the bound obtained from the predicted multipliers. Numerical experiments show that our approach closes up to 85~\% of the gap between the continuous relaxation and the best Lagrangian bound, and provides a high-quality warm start for descent-based Lagrangian methods.
Authors: Zheng Lin, Zhe Chen, Zihan Fang, Xianhao Chen, Xiong Wang, Yue Gao
Abstract: Recently, a large number of Low Earth Orbit (LEO) satellites have been launched and deployed successfully in space by commercial companies, such as SpaceX. Because LEO satellites are equipped with multimodal sensors, they serve not only for communication but also for various machine learning applications, such as space modulation recognition and remote sensing image classification. However, the ground station (GS) may be incapable of downloading such a large volume of raw sensing data for centralized model training due to the limited contact time with LEO satellites (e.g., 5 minutes). Therefore, federated learning (FL) has emerged as a promising solution to address this problem via on-device training. Unfortunately, enabling FL on LEO satellites still poses three critical challenges: i) heterogeneous computing and memory capabilities, ii) limited uplink rate, and iii) model staleness. To this end, we propose FedSN, a general FL framework that tackles the above challenges and fully explores data diversity on LEO satellites. Specifically, we first present a novel sub-structure scheme to enable heterogeneous local model training considering different computing, memory, and communication constraints on LEO satellites. Additionally, we propose a pseudo-synchronous model aggregation strategy to dynamically schedule model aggregation for compensating model staleness. To further demonstrate the effectiveness of FedSN, we evaluate it using space modulation recognition and remote sensing image classification tasks by leveraging the data from real-world satellite networks. Extensive experimental results demonstrate that the FedSN framework achieves higher accuracy and lower computing and communication overhead than the state-of-the-art benchmarks, and confirm the effectiveness of each component in FedSN.
Authors: Hang Chen, Yuchuan Jang, Weijie Zhou, Cristian Meo, Ziwei Chen, Dianbo Liu
Abstract: Individuals, despite having varied life experiences and learning processes, can communicate effectively through languages. This study aims to explore the efficiency of language as a communication medium. We put forth two specific hypotheses: First, discrete messages are more effective than continuous ones when agents have diverse personal experiences. Second, communications using multiple discrete tokens are more advantageous than those using a single token. To validate these hypotheses, we designed multi-agent machine learning experiments to assess communication efficiency using various information transmission methods between speakers and listeners. Our empirical findings indicate that, in scenarios where agents are exposed to different data, communicating through sentences composed of discrete tokens offers the best inter-agent communication efficiency. The limitations of our findings include the lack of systematic advantages over other, more sophisticated encoder-decoder models such as variational autoencoders, and the lack of evaluation on non-image datasets, which we leave for future studies.
Authors: Heng Chang, Jiangnan Ye, Alejo Lopez Avila, Jinhua Du, Jia Li
Abstract: Graph Neural Networks (GNNs) have achieved great success in Knowledge Graph Completion (KGC) by modelling how entities and relations interact in recent years. However, the explanation of the predicted facts has not caught the necessary attention. Proper explanations for the results of GNN-based KGC models increase model transparency and help researchers develop more reliable models. Existing practices for explaining KGC tasks rely on instance/subgraph-based approaches, while in some scenarios, paths can provide more user-friendly and interpretable explanations. Nonetheless, the methods for generating path-based explanations for KGs have not been well-explored. To address this gap, we propose Power-Link, the first path-based KGC explainer that explores GNN-based models. We design a novel simplified graph-powering technique, which enables the generation of path-based explanations with a fully parallelisable and memory-efficient training scheme. We further introduce three new metrics for quantitative evaluation of the explanations, together with a qualitative human evaluation. Extensive experiments demonstrate that Power-Link outperforms the SOTA baselines in interpretability, efficiency, and scalability.
Authors: Qilong Yan, Yufeng Zhang, Jinghao Zhang, Jingpu Duan, Jian Yin
Abstract: Graph Neural Networks (GNNs) have made significant advancements in node classification, but their success relies on sufficient labeled nodes per class in the training data. Real-world graph data often exhibits a long-tail distribution with sparse labels, emphasizing the importance of GNNs' ability in few-shot node classification, which entails categorizing nodes with limited data. Traditional episodic meta-learning approaches have shown promise in this domain, but they face an inherent limitation: random and uniform task assignment ignores task difficulty levels and might lead the model to converge to suboptimal solutions. This could lead the meta-learner to face complex tasks too soon, hindering proper learning. Ideally, the meta-learner should start with simple concepts and advance to more complex ones, like human learning. We therefore introduce CPT, a novel two-stage curriculum learning method that aligns task difficulty with the meta-learner's progressive competence, enhancing overall performance. Specifically, in CPT's initial stage, the focus is on simpler tasks, fostering foundational skills for engaging with complex tasks later. Importantly, the second stage dynamically adjusts task difficulty based on the meta-learner's growing competence, aiming for optimal knowledge acquisition. Extensive experiments on popular node classification datasets demonstrate significant improvements of our strategy over existing methods.
Authors: Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, Mingsheng Long
Abstract: Deep learning has contributed remarkably to the advancement of time series analysis. Still, deep models can encounter performance bottlenecks in real-world data-scarce scenarios, which can be concealed due to the performance saturation with small models on current benchmarks. Meanwhile, large models have demonstrated great powers in these scenarios through large-scale pre-training. Continuous progress has been achieved with the emergence of large language models, exhibiting unprecedented abilities such as few-shot generalization, scalability, and task generality, which are however absent in small deep models. To change the status quo of training scenario-specific small models from scratch, this paper aims at the early development of large time series models (LTSM). During pre-training, we curate large-scale datasets with up to 1 billion time points, unify heterogeneous time series into single-series sequence (S3) format, and develop the GPT-style architecture toward LTSMs. To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task. The outcome of this study is a Time Series Transformer (Timer), which is generative pre-trained by next token prediction and adapted to various downstream tasks with promising capabilities as an LTSM. Code and datasets are available at: https://github.com/thuml/Large-Time-Series-Model.
Authors: Jin-Hwa Kim
Abstract: Recent advancements in visualizing deep neural networks provide insights into their structures and mesh extraction from Continuous Piecewise Affine (CPWA) functions. Meanwhile, developments in neural surface representation learning incorporate non-linear positional encoding, addressing issues like spectral bias; however, this poses challenges in applying mesh extraction techniques based on CPWA functions. Focusing on trilinear interpolating methods as positional encoding, we present theoretical insights and an analytical mesh extraction, showing the transformation of hypersurfaces to flat planes within the trilinear region under the eikonal constraint. Moreover, we introduce a method for approximating intersecting points among three hypersurfaces contributing to broader applications. We empirically validate correctness and parsimony through chamfer distance and efficiency, and angular distance, while examining the correlation between the eikonal loss and the planarity of the hypersurfaces.
Authors: Daniel Beaglehole, Peter S\'uken\'ik, Marco Mondelli, Mikhail Belkin
Abstract: Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a variety of settings, its emergence is typically explained via data-agnostic approaches, such as the unconstrained features model. In this work, we introduce a data-dependent setting where DNC forms due to feature learning through the average gradient outer product (AGOP). The AGOP is defined with respect to a learned predictor and is equal to the uncentered covariance matrix of its input-output gradients averaged over the training dataset. The Deep Recursive Feature Machine (Deep RFM) is a method that constructs a neural network by iteratively mapping the data with the AGOP and applying an untrained random feature map. We demonstrate empirically that DNC occurs in Deep RFM across standard settings as a consequence of the projection with the AGOP matrix computed at each layer. Further, we theoretically explain DNC in Deep RFM in an asymptotic setting and as a result of kernel learning. We then provide evidence that this mechanism holds for neural networks more generally. In particular, we show that the right singular vectors and values of the weights can be responsible for the majority of within-class variability collapse for DNNs trained in the feature learning regime. As observed in recent work, this singular structure is highly correlated with that of the AGOP.
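The AGOP definition in the abstract above (the uncentered covariance of a predictor's input-output gradients, averaged over the training set) can be made concrete with a short sketch. The predictor $f(x) = \tanh(Wx)$, the random data, and the function name `agop` are assumptions for illustration; the orientation $J^\top J$ (an input-dimension square matrix) is also an assumed convention.

```python
import numpy as np

def agop(W, X):
    """Average gradient outer product of f(x) = tanh(W x) over the rows of X."""
    d = W.shape[1]
    total = np.zeros((d, d))
    for x in X:
        h = np.tanh(W @ x)
        # Jacobian of f with respect to the input x, shape (k, d).
        J = (1.0 - h**2)[:, None] * W
        total += J.T @ J
    return total / len(X)

# Hypothetical predictor weights and training inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
X = rng.normal(size=(50, 4))
M = agop(W, X)  # symmetric positive semi-definite matrix of shape (4, 4)
```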
Authors: Chaolong Ying, Xinjian Zhao, Tianshu Yu
Abstract: Recently, there has been an emerging trend to integrate persistent homology (PH) into graph neural networks (GNNs) to enrich expressive power. However, naively plugging PH features into GNN layers always results in marginal improvement with low interpretability. In this paper, we investigate a novel mechanism for injecting global topological invariance into pooling layers using PH, motivated by the observation that the filtration operation in PH naturally aligns with graph pooling in a cut-off manner. In this fashion, message passing in the coarsened graph acts along persistent pooled topology, leading to improved performance. Experimentally, we apply our mechanism to a collection of graph pooling methods and observe consistent and substantial performance gain over several popular datasets, demonstrating its wide applicability and flexibility.
Authors: Nazar Buzun, Maksim Bobrin, Dmitry V. Dylov
Abstract: We present a new training procedure for Neural Optimal Transport (NOT), capable of accurately and efficiently estimating the optimal transportation plan via a specific regularization on dual Kantorovich potentials. The main bottleneck of existing NOT solvers is associated with the procedure of finding a near-exact approximation of the conjugate operator (i.e., the c-transform), which is done either by optimizing over non-convex max-min objectives or by the computationally intensive fine-tuning of the initial approximated prediction. We resolve both issues by proposing a new, theoretically justified loss in the form of expectile regularization, which enforces binding conditions on the learning process of dual potentials. Such a regularization provides the upper bound estimation over the distribution of possible conjugate potentials and makes the learning stable, completely eliminating the need for additional extensive fine-tuning. The proposed method, called Expectile-Regularised Neural Optimal Transport (ENOT), outperforms previous state-of-the-art approaches on the established Wasserstein-2 benchmark tasks by a large margin (up to a 3-fold improvement in quality and up to a 10-fold improvement in runtime). Moreover, we showcase the performance of ENOT for varying cost functions on different tasks such as image generation, showing the robustness of the proposed algorithm. The OTT-JAX library includes our implementation of the ENOT algorithm: https://ott-jax.readthedocs.io/en/latest/tutorials/ENOT.html
URLs: https://ott-jax.readthedocs.io/en/latest/tutorials/ENOT.html
Authors: Michael H. Shaham, Taskin Padir
Abstract: Platooning of autonomous vehicles has the potential to increase safety and fuel efficiency on highways. The goal of platooning is to have each vehicle drive at a specified speed (set by the leader) while maintaining a safe distance from its neighbors. Many prior works have analyzed various controllers for platooning, most commonly linear feedback and distributed model predictive controllers. In this work, we introduce an algorithm for learning a stable, safe, distributed controller for a heterogeneous platoon. Our algorithm relies on recent developments in learning neural network stability certificates. We train a controller for autonomous platooning in simulation and evaluate its performance on hardware with a platoon of four F1Tenth vehicles. We then perform further analysis in simulation with a platoon of 100 vehicles. Experimental results demonstrate the practicality of the algorithm and the learned controller by comparing the performance of the neural network controller to linear feedback and distributed model predictive controllers.
Authors: Thomas Guyet, Pierre Pinson, Enoal Gesny
Abstract: Improving the future of healthcare starts by better understanding the current actual practices in hospital settings. This motivates the objective of discovering typical care pathways from patient data. Revealing typical care pathways can be achieved through clustering. The difficulty in clustering care pathways, represented by sequences of timestamped events, lies in defining a semantically appropriate metric and clustering algorithms. In this article, we adapt two methods developed for time series to the clustering of timed sequences: the drop-DTW metric and the DBA approach for the construction of averaged time sequences. These methods are then applied in clustering algorithms to propose original and sound clustering algorithms for timed sequences. This approach is experimented with and evaluated on synthetic and real-world data.
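As background for the abstract above, here is a minimal sketch of classic dynamic time warping (DTW) on numeric sequences. It is not the drop-DTW metric used by the authors, which additionally allows elements to be dropped, and it ignores the event types and timestamps of real care pathways. The function name `dtw_distance` and the absolute-difference cost are assumptions.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two numeric sequences using absolute difference."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

For instance, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated element can be aligned to the same point, which is the property that makes DTW-style metrics attractive for sequences of different lengths.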
Authors: Artur Grigorev, Adriana-Simona Mihaita, Khaled Saleh, Yuming Ou
Abstract: We propose the IncidentResponseGPT framework, a novel system that applies generative artificial intelligence (AI) to potentially enhance the efficiency and effectiveness of traffic incident response. The model synthesizes region-specific incident response guidelines and generates incident response plans adapted to a specific area, aiming to expedite decision-making for traffic management authorities. The approach aims to shorten incident resolution times by suggesting recommendations (e.g., optimal rerouting strategies and estimates of resource needs) that minimize the overall impact on the urban traffic network. The system suggests specific actions, including dynamic lane closures, optimized rerouting, and dispatching appropriate emergency resources. We use the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) to rank the generated response plans on criteria such as impact minimization and resource efficiency, based on their proximity to a human-proposed solution.
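The TOPSIS ranking step mentioned in the abstract above can be sketched as follows. This is the standard textbook procedure, in which the ideal point is built from the best value of each criterion; the abstract instead measures proximity to a human-proposed solution, which would replace the `ideal` vector. The plans, criteria, weights, and function name are hypothetical.

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Return TOPSIS closeness scores in [0, 1]; higher is better."""
    m = np.asarray(matrix, dtype=float)
    # Vector-normalize each criterion column, then apply the criterion weights.
    v = m / np.linalg.norm(m, axis=0) * np.asarray(weights, dtype=float)
    benefit = np.asarray(benefit, dtype=bool)
    # Ideal and anti-ideal points: max for benefit criteria, min for cost criteria.
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_best = np.linalg.norm(v - ideal, axis=1)
    d_worst = np.linalg.norm(v - anti, axis=1)
    return d_worst / (d_best + d_worst)

# Hypothetical response plans: [impact reduction (benefit), resources used (cost)].
plans = [[0.9, 3.0], [0.5, 5.0], [0.7, 4.0]]
scores = topsis(plans, weights=[0.5, 0.5], benefit=[True, False])
ranking = np.argsort(-scores)  # plan indices from best to worst
```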
Authors: David J. Schodt
Abstract: Bayesian Neural Networks (BNNs) extend traditional neural networks to provide uncertainties associated with their outputs. On the forward pass through a BNN, predictions (and their uncertainties) are made either by Monte Carlo sampling network weights from the learned posterior or by analytically propagating statistical moments through the network. Though flexible, Monte Carlo sampling is computationally expensive and can be infeasible or impractical under resource constraints or for large networks. While moment propagation can ameliorate the computational costs of BNN inference, it can be difficult or impossible for networks with arbitrary nonlinearities, thereby restricting the possible set of network layers permitted with such a scheme. In this work, we demonstrate a simple yet effective approach for propagating statistical moments through arbitrary nonlinearities with only 3 deterministic samples, enabling few-sample variational inference of BNNs without restricting the set of network layers used. Furthermore, we leverage this approach to demonstrate a novel nonlinear activation function that we use to inject physics-informed prior information into output nodes of a BNN.
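To illustrate how moments can be pushed through an arbitrary nonlinearity with three deterministic evaluations, here is a sketch based on three-point Gauss-Hermite quadrature for a scalar Gaussian input. This is an assumed stand-in chosen for illustration and may differ from the sampling scheme the author actually uses; the function name `propagate_moments` is hypothetical.

```python
import math

def propagate_moments(f, mean, std):
    """Approximate the mean and variance of f(x) for x ~ N(mean, std^2) using 3 points."""
    # Three-point Gauss-Hermite nodes and weights for a Gaussian input.
    offset = math.sqrt(3.0) * std
    points = (mean - offset, mean, mean + offset)
    weights = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)
    values = [f(p) for p in points]
    out_mean = sum(w * v for w, v in zip(weights, values))
    out_var = sum(w * (v - out_mean) ** 2 for w, v in zip(weights, values))
    return out_mean, out_var

# Example nonlinearity: squaring a Gaussian with mean 1 and standard deviation 2.
m, v = propagate_moments(lambda x: x * x, 1.0, 2.0)
```

The rule integrates polynomials up to degree five exactly, so the squaring example recovers the exact mean and variance; for general activations it is only an approximation.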
Authors: Nikola B. Kovachki, Samuel Lanthaler, Hrushikesh Mhaskar
Abstract: Operator learning has emerged as a new paradigm for the data-driven approximation of nonlinear operators. Despite its empirical success, the theoretical underpinnings governing the conditions for efficient operator learning remain incomplete. The present work develops theory to study the data complexity of operator learning, complementing existing research on the parametric complexity. We investigate the fundamental question: How many input/output samples are needed in operator learning to achieve a desired accuracy $\epsilon$? This question is addressed from the point of view of $n$-widths, and this work makes two key contributions. The first contribution is to derive lower bounds on $n$-widths for general classes of Lipschitz and Fr\'echet differentiable operators. These bounds rigorously demonstrate a ``curse of data-complexity'', revealing that learning on such general classes requires a sample size exponential in the inverse of the desired accuracy $\epsilon$. The second contribution of this work is to show that ``parametric efficiency'' implies ``data efficiency''; using the Fourier neural operator (FNO) as a case study, we show rigorously that on a narrower class of operators, efficiently approximated by FNO in terms of the number of tunable parameters, efficient operator learning is attainable in data complexity as well. Specifically, we show that if only an algebraically increasing number of tunable parameters is needed to reach a desired approximation accuracy, then an algebraically bounded number of data samples is also sufficient to achieve the same accuracy.
Authors: Damien Martins Gomes, Yanlei Zhang, Eugene Belilovsky, Guy Wolf, Mahdi S. Hosseini
Abstract: First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by employing diagonal matrix preconditioning of the stochastic gradient during training. Despite the widespread use of first-order methods, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts, e.g., Adam and SGD. However, their practicality in training DNNs is still limited due to increased per-iteration computations and suboptimal accuracy compared to first-order methods. We present AdaFisher, an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in the second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification and language modelling, and that it stands out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code is available from https://github.com/AtlasAnalyticsLab/AdaFisher.
Authors: Yuankai Luo, Hongkang Li, Qijiong Liu, Lei Shi, Xiao-Ming Wu
Abstract: We present a novel end-to-end framework that generates highly compact (typically 6-15 dimensions), discrete (int4 type), and interpretable node representations, termed node identifiers (node IDs), to tackle inference challenges on large-scale graphs. By employing vector quantization, we compress continuous node embeddings from multiple layers of a Graph Neural Network (GNN) into discrete codes, applicable under both self-supervised and supervised learning paradigms. These node IDs capture high-level abstractions of graph data and offer interpretability that traditional GNN embeddings lack. Extensive experiments on 34 datasets, encompassing node classification, graph classification, link prediction, and attributed graph clustering tasks, demonstrate that the generated node IDs significantly enhance speed and memory efficiency while achieving competitive performance compared to current state-of-the-art methods.
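The vector quantization step described in the abstract above can be sketched as a nearest-codeword lookup per GNN layer, with the per-layer codes concatenated into a node ID. The codebook, the embeddings, and the function names are assumptions for illustration; how the codebooks are learned is not covered here.

```python
import numpy as np

def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codeword (squared Euclidean distance)."""
    diffs = embeddings[:, None, :] - codebook[None, :, :]
    return (diffs**2).sum(axis=-1).argmin(axis=1)

def node_ids(layer_embeddings, codebooks):
    """Stack the per-layer code indices into one discrete ID per node."""
    codes = [quantize(e, c) for e, c in zip(layer_embeddings, codebooks)]
    return np.stack(codes, axis=1)

# Hypothetical codebook shared by two layers, and embeddings of three nodes per layer.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
layer1 = np.array([[0.1, -0.1], [4.8, 5.2], [0.9, 1.2]])
layer2 = np.array([[1.1, 0.9], [0.2, 0.0], [5.0, 4.0]])
ids = node_ids([layer1, layer2], [codebook, codebook])
```

With at most 16 codewords per codebook, each code index fits in an int4, which matches the compact representation the abstract describes.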
Authors: Itamar Zimerman, Ameen Ali, Lior Wolf
Abstract: Recent advances in efficient sequence modeling have led to attention-free layers, such as Mamba, RWKV, and various gated RNNs, all featuring sub-quadratic complexity in sequence length and excellent scaling properties, enabling the construction of a new type of foundation models. In this paper, we present a unified view of these models, formulating such layers as implicit causal self-attention layers. The formulation includes most of their sub-components and is not limited to a specific part of the architecture. The framework compares the underlying mechanisms on similar grounds for different layers and provides a direct means for applying explainability methods. Our experiments show that our attention matrices and attribution method outperform an alternative and a more limited formulation that was recently proposed for Mamba. For the other architectures for which our method is the first to provide such a view, our method is effective and competitive in the relevant metrics compared to the results obtained by state-of-the-art Transformer explainability methods. Our code is publicly available.
Authors: Loukas Kavouras, Eleni Psaroudaki, Konstantinos Tsopelas, Dimitrios Rontogiannis, Nikolaos Theologitis, Dimitris Sacharidis, Giorgos Giannopoulos, Dimitrios Tomaras, Kleopatra Markou, Dimitrios Gunopulos, Dimitris Fotakis, Ioannis Emiris
Abstract: The widespread deployment of machine learning systems in critical real-world decision-making applications has highlighted the urgent need for counterfactual explainability methods that operate effectively. Global counterfactual explanations, expressed as actions to offer recourse, aim to provide succinct explanations and insights applicable to large population subgroups. Effectiveness is measured by the fraction of the population that is provided recourse, ensuring that the actions benefit as many individuals as possible. Keeping the cost of actions low ensures the proposed recourse actions remain practical and actionable. Limiting the number of actions that provide global counterfactuals is essential to maximize interpretability. The primary challenge, therefore, is balancing these trade-offs, i.e., maximizing effectiveness, minimizing cost, while maintaining a small number of actions. We introduce GLANCE, a versatile and adaptive framework, comprising two algorithms, that allows the careful balancing of the trade-offs among the three key objectives, with the size objective functioning as a tunable parameter to keep the actions few and easy to interpret. C-GLANCE employs a clustering approach that considers both the feature space and the space of counterfactual actions, thereby accounting for the distribution of points in a way that aligns with the structure of the model. T-GLANCE provides additional features to enhance flexibility. It employs a tree-based approach, that allows users to specify split features, to build a decision tree with a single counterfactual action at each node that can be used as a subgroup policy. Our extensive experimental evaluation demonstrates that our method consistently shows greater robustness and performance compared to existing methods across various datasets and models.
Authors: Michail Theologitis, Georgios Frangias, Georgios Anestis, Vasilis Samoladas, Antonios Deligiannakis
Abstract: The ever-growing volume and decentralized nature of data, coupled with the need to harness this data and generate knowledge from it, have led to the extensive use of distributed deep learning (DDL) techniques for training. These techniques rely on local training that is performed at the distributed nodes based on locally collected data, followed by a periodic synchronization process that combines these models to create a global model. However, frequent synchronization of DL models, encompassing millions to many billions of parameters, creates a communication bottleneck, severely hindering scalability. Worse yet, DDL algorithms typically waste valuable bandwidth, and make themselves less practical in bandwidth-constrained federated settings, by relying on overly simplistic, periodic, and rigid synchronization schedules. These drawbacks also have a direct impact on the time required for the training process, necessitating excessive time for data communication. To address these shortcomings, we propose Federated Dynamic Averaging (FDA), a communication-efficient DDL strategy that dynamically triggers synchronization based on the value of the model variance. In essence, the costly synchronization step is triggered only if the local models, which are initialized from a common global model after each synchronization, have significantly diverged. This decision is facilitated by the communication of a small local state from each distributed node/worker. Through extensive experiments across a wide range of learning tasks we demonstrate that FDA reduces communication cost by orders of magnitude, compared to both traditional and cutting-edge communication-efficient algorithms. Additionally, we show that FDA maintains robust performance across diverse data heterogeneity settings.
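The triggering rule in the abstract above can be sketched as follows. This is an illustration only: it computes the model variance exactly from the full local parameter vectors, whereas the abstract states that the decision relies on a small local state communicated by each worker. The function names and the threshold parameter are assumptions made for this sketch.

```python
def model_variance(local_models):
    """Average squared distance of the local parameter vectors from their mean."""
    k = len(local_models)
    dim = len(local_models[0])
    mean = [sum(m[j] for m in local_models) / k for j in range(dim)]
    return sum(
        sum((m[j] - mean[j]) ** 2 for j in range(dim)) for m in local_models
    ) / k


def should_synchronize(local_models, threshold):
    """Trigger the costly synchronization only once the models have diverged."""
    return model_variance(local_models) > threshold


# Two workers whose flattened parameters have drifted apart along one coordinate.
workers = [[0.0, 0.0], [2.0, 0.0]]
print(model_variance(workers))            # 1.0
print(should_synchronize(workers, 0.5))   # True
```

Right after a synchronization all local models are equal, so the variance is zero and no further communication is triggered until local training makes the models drift apart.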
Authors: Wenzhuo Tang, Haitao Mao, Danial Dervovic, Ivan Brugere, Saumitra Mishra, Yuying Xie, Jiliang Tang
Abstract: Models for natural language and images benefit from data scaling behavior: the more data fed into the model, the better they perform. This 'better with more' phenomenon enables the effectiveness of large-scale pre-training on vast amounts of data. However, current graph pre-training methods struggle to scale up data due to heterogeneity across graphs. To achieve effective data scaling, we aim to develop a general model that is able to capture diverse data patterns of graphs and can be utilized to adaptively help the downstream tasks. To this end, we propose UniAug, a universal graph structure augmentor built on a diffusion model. We first pre-train a discrete diffusion model on thousands of graphs across domains to learn the graph structural patterns. In the downstream phase, we provide adaptive enhancement by conducting graph structure augmentation with the help of the pre-trained diffusion model via guided generation. By leveraging the pre-trained diffusion model for structure augmentation, we consistently achieve performance improvements across various downstream tasks in a plug-and-play manner. To the best of our knowledge, this study represents the first demonstration of a data-scaling graph structure augmentor on graphs across domains.
Authors: Charlie Hou, Akshat Shrivastava, Hongyuan Zhan, Rylan Conway, Trang Le, Adithya Sagar, Giulia Fanti, Daniel Lazar
Abstract: On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ($\epsilon=1.29$, $\epsilon=7.58$). We achieve these results while using 9$\times$ fewer rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round. Second, finetuning large models on PrE-Text's DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text.
Authors: Zhong Li, Simon Geisler, Yuhang Wang, Stephan Günnemann, Matthijs van Leeuwen
Abstract: Predictions made by graph neural networks (GNNs) usually lack interpretability due to their complex computational behavior and the abstract nature of graphs. In an attempt to tackle this, many GNN explanation methods have emerged. Their goal is to explain a model's predictions and thereby obtain trust when GNN models are deployed in decision critical applications. Most GNN explanation methods work in a post-hoc manner and provide explanations in the form of a small subset of important edges and/or nodes. In this paper we demonstrate that these explanations can unfortunately not be trusted, as common GNN explanation methods turn out to be highly susceptible to adversarial perturbations. That is, even small perturbations of the original graph structure that preserve the model's predictions may yield drastically different explanations. This calls into question the trustworthiness and practical utility of post-hoc explanation methods for GNNs. To be able to attack GNN explanation models, we devise a novel attack method dubbed \textit{GXAttack}, the first \textit{optimization-based} adversarial white-box attack method for post-hoc GNN explanations under such settings. Due to the devastating effectiveness of our attack, we call for an adversarial evaluation of future GNN explainers to demonstrate their robustness. For reproducibility, our code is available via GitHub.
Authors: Julia Gastinger, Shenyang Huang, Mikhail Galkin, Erfan Loghmani, Ali Parviz, Farimah Poursafaei, Jacob Danovitch, Emanuele Rossi, Ioannis Koutis, Heiner Stuckenschmidt, Reihaneh Rabbany, Guillaume Rabusseau
Abstract: Multi-relational temporal graphs are powerful tools for modeling real-world data, capturing the evolving and interconnected nature of entities over time. Recently, many novel models have been proposed for ML on such graphs, intensifying the need for robust evaluation and standardized benchmark datasets. However, the availability of such resources remains scarce and evaluation faces added complexity due to reproducibility issues in experimental protocols. To address these challenges, we introduce Temporal Graph Benchmark 2.0 (TGB 2.0), a novel benchmarking framework tailored for evaluating methods for predicting future links on Temporal Knowledge Graphs and Temporal Heterogeneous Graphs with a focus on large-scale datasets, extending the Temporal Graph Benchmark. TGB 2.0 facilitates comprehensive evaluations by presenting eight novel datasets spanning five domains with up to 53 million edges. TGB 2.0 datasets are significantly larger than existing datasets in terms of number of nodes, edges, or timestamps. In addition, TGB 2.0 provides a reproducible and realistic evaluation pipeline for multi-relational temporal graphs. Through extensive experimentation, we observe that 1) leveraging edge-type information is crucial to obtain high performance, 2) simple heuristic baselines are often competitive with more complex methods, and 3) most methods fail to run on our largest datasets, highlighting the need for research on more scalable methods.
Authors: Bhrij Patel, Vishnu Sashank Dorbala, Amrit Singh Bedi, Dinesh Manocha
Abstract: Large language models (LLMs) have grown in popularity due to their natural language interface and pre-trained knowledge, leading to rapidly increasing success in question-answering (QA) tasks. More recently, multi-agent systems with LLM-based agents (Multi-LLM) have been increasingly utilized for QA. In these scenarios, the models may each answer the question and reach a consensus, or each model may be specialized to answer questions from a different domain. However, most prior work dealing with Multi-LLM QA has focused on scenarios where the models are asked in a zero-shot manner or are given information sources to extract the answer. To answer questions about an unknown environment, embodied exploration of the environment is first needed. This skill is necessary for personalizing embodied AI to environments such as households. There is a lack of insight into whether a Multi-LLM system can handle question-answering based on observations from embodied exploration. In this work, we address this gap by investigating the use of Multi-Embodied LLM Explorers (MELE) for QA in an unknown environment. Multiple LLM-based agents independently explore and then answer queries about a household environment. We analyze different aggregation methods to generate a single, final answer for each query: debating, majority voting, and training a central answer module (CAM). Using CAM, we observe a $46\%$ higher accuracy compared with the other, non-learning-based aggregation methods. We provide code and the query dataset for further research.
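Of the aggregation methods named in the abstract above, majority voting is the simplest to write down. The sketch below assumes that each agent returns its answer as a string. The function name and the tie-breaking rule (the answer seen first wins) are choices made for this illustration and are not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the agents' answers.

    Ties go to the answer that appears first in the list, because Counter
    keeps insertion order and most_common() sorts stably.
    """
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


# Three hypothetical agents answer the same yes/no query about a household.
print(majority_vote(["yes", "no", "yes"]))  # yes
```

Unlike a trained central answer module, this rule needs no training data, which is why the abstract groups it with the non-learning-based methods.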
Authors: Rudrajit Das, Inderjit S. Dhillon, Alessandro Epasto, Adel Javanmard, Jieming Mao, Vahab Mirrokni, Sujay Sanghavi, Peilin Zhong
Abstract: The performance of a model trained with \textit{noisy labels} is often improved by simply \textit{retraining} the model with its own predicted \textit{hard} labels (i.e., $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP) which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at \textit{no extra privacy cost}; we call this \textit{consensus-based retraining}. As an example, when training ResNet-18 on CIFAR-100 with $\epsilon=3$ label DP, we obtain $6.4\%$ improvement in accuracy with consensus-based retraining.
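The selection rule behind consensus-based retraining, as described in the abstract above, can be sketched in a few lines: keep only the samples whose predicted hard label agrees with the given (noisy) label, then retrain on that subset. The function name and the toy data below are hypothetical, and the training and retraining steps themselves are left out.

```python
def consensus_subset(features, given_labels, predicted_labels):
    """Keep the samples whose predicted hard label matches the given label."""
    return [
        (x, y)
        for x, y, p in zip(features, given_labels, predicted_labels)
        if y == p
    ]


# Toy data: the model disagrees with the given label on the second sample.
features = [[0.2, 1.0], [0.9, 0.1], [0.4, 0.7]]
given = [1, 1, 0]        # possibly corrupted labels, e.g. from label DP
predicted = [1, 0, 0]    # hard labels predicted by the initially trained model
print(consensus_subset(features, given, predicted))
```

The filter only reads the given labels and the model's own predictions, both of which are already available after the first round of training.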
Authors: Antonio Terpin, Nicolas Lanzetti, Martin Gadea, Florian Dörfler
Abstract: Diffusion regulates numerous natural processes and the dynamics of many successful generative models. Existing models to learn the diffusion terms from observational data rely on complex bilevel optimization problems and model only the drift of the system. We propose a new simple model, JKOnet*, which bypasses the complexity of existing architectures while presenting significantly enhanced representational capabilities: JKOnet* recovers the potential, interaction, and internal energy components of the underlying diffusion process. JKOnet* minimizes a simple quadratic loss and outperforms other baselines in terms of sample efficiency, computational complexity, and accuracy. Additionally, JKOnet* provides a closed-form optimal solution for linearly parametrized functionals, and, when applied to predict the evolution of cellular processes from real-world data, it achieves state-of-the-art accuracy at a fraction of the computational cost of all existing methods. Our methodology is based on the interpretation of diffusion processes as energy-minimizing trajectories in the probability space via the so-called JKO scheme, which we study via its first-order optimality conditions.
Authors: Mengdan Zhu, Raasikh Kanjiani, Jiahui Lu, Andrew Choi, Qirui Ye, Liang Zhao
Abstract: Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces \textit{LatentExplainer}, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. \textit{LatentExplainer} tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. Our approach perturbs latent variables, interpreting changes in generated data, and uses multi-modal large language models (MLLMs) to produce human-understandable explanations. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations for latent variables. The results highlight the effectiveness of incorporating inductive biases and uncertainty quantification, significantly enhancing model interpretability.
Authors: Haohao Qu, Liangbo Ning, Rui An, Wenqi Fan, Tyler Derr, Hui Liu, Xin Xu, Qing Li
Abstract: As one of the most representative DL techniques, Transformer architecture has empowered numerous advanced models, especially the large language models (LLMs) that comprise billions of parameters, becoming a cornerstone in deep learning. Despite the impressive achievements, Transformers still face inherent limitations, particularly the time-consuming inference resulting from the quadratic computation complexity of attention calculation. Recently, a novel architecture named Mamba, drawing inspiration from classical state space models (SSMs), has emerged as a promising alternative for building foundation models, delivering comparable modeling abilities to Transformers while preserving near-linear scalability concerning sequence length. This has sparked an increasing number of studies actively exploring Mamba's potential to achieve impressive performance across diverse domains. Given such rapid evolution, there is a critical need for a systematic review that consolidates existing Mamba-empowered models, offering a comprehensive understanding of this emerging model architecture. In this survey, we therefore conduct an in-depth investigation of recent Mamba-associated studies, covering three main aspects: the advancements of Mamba-based models, the techniques of adapting Mamba to diverse data, and the applications where Mamba can excel. Specifically, we first review the foundational knowledge of various representative deep learning models and the details of Mamba-1&2 as preliminaries. Then, to showcase the significance of Mamba for AI, we comprehensively review the related studies focusing on Mamba models' architecture design, data adaptability, and applications. Finally, we present a discussion of current limitations and explore various promising research directions to provide deeper insights for future investigations.
Authors: Hanqiu Chen, Xuebin Yao, Pradeep Subedi, Cong Hao
Abstract: Edge computing is a distributed computing paradigm that collects and processes data at or near the source of data generation. On-device learning at the edge relies on device-to-device wireless communication to facilitate real-time data sharing and collaborative decision-making among multiple devices. This significantly improves the adaptability of the edge computing system to changing environments. However, as the scale of the edge computing system grows, communication among devices is becoming the bottleneck, because the limited bandwidth of wireless communication leads to large data transfer latency. To reduce the amount of device-to-device data transmission and accelerate on-device learning, in this paper, we propose Residual-INR, a fog computing-based communication-efficient on-device learning framework that utilizes implicit neural representation (INR) to compress images/videos into neural network weights. Residual-INR enhances data transfer efficiency by collecting JPEG images from edge devices, compressing them into INR format at the fog node, and redistributing them for on-device learning. By using a smaller INR for full image encoding and a separate object INR for high-quality object region reconstruction through residual encoding, our technique can reduce the encoding redundancy while maintaining the object quality. Residual-INR is a promising solution for edge on-device learning because it reduces data transmission by up to 5.16x across a network of 10 edge devices. It also facilitates CPU-free accelerated on-device learning, achieving up to 2.9x speedup without sacrificing accuracy. Our code is available at: https://github.com/sharclab/Residual-INR.
Authors: Giulio Biroli, Marc Mézard
Abstract: This paper studies Kernel Density Estimation for a high-dimensional distribution $\rho(x)$. Traditional approaches have focused on the limit of large number of data points $n$ and fixed dimension $d$. We analyze instead the regime where both the number $n$ of data points $y_i$ and their dimensionality $d$ grow with a fixed ratio $\alpha=(\log n)/d$. Our study reveals three distinct statistical regimes for the kernel-based estimate of the density $\hat \rho_h^{\mathcal {D}}(x)=\frac{1}{n h^d}\sum_{i=1}^n K\left(\frac{x-y_i}{h}\right)$, depending on the bandwidth $h$: a classical regime for large bandwidth where the Central Limit Theorem (CLT) holds, which is akin to the one found in traditional approaches. Below a certain value of the bandwidth, $h_{CLT}(\alpha)$, we find that the CLT breaks down. The statistics of $\hat\rho_h^{\mathcal {D}}(x)$ for a fixed $x$ drawn from $\rho(x)$ is given by a heavy-tailed distribution (an alpha-stable distribution). In particular, below a value $h_G(\alpha)$, we find that $\hat\rho_h^{\mathcal {D}}(x)$ is governed by extreme value statistics: only a few points in the database matter and give the dominant contribution to the density estimator. We provide a detailed analysis for high-dimensional multivariate Gaussian data. We show that the optimal bandwidth threshold based on Kullback-Leibler divergence lies in the new statistical regime identified in this paper. As known by practitioners, when decreasing the bandwidth a kernel density estimate changes from a smooth curve to a collection of peaks centred on the data points. Our findings reveal that this general phenomenon is related to sharp transitions between phases characterized by different statistical properties, and offer new insights for Kernel density estimation in high-dimensional settings.
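The estimator $\hat\rho_h^{\mathcal{D}}(x)$ in the abstract above can be evaluated directly. The sketch below assumes a standard Gaussian kernel $K(u)=(2\pi)^{-d/2}e^{-\|u\|^2/2}$, which the abstract does not specify. The helper `largest_term_share` is a hypothetical diagnostic added here to show how a single data point comes to dominate the sum as the bandwidth shrinks.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel in d = len(u) dimensions (assumed kernel choice)."""
    d = len(u)
    return math.exp(-0.5 * sum(c * c for c in u)) / (2 * math.pi) ** (d / 2)


def kernel_terms(x, data, h):
    """The summands K((x - y_i) / h), one per data point y_i."""
    return [gaussian_kernel([(xj - yj) / h for xj, yj in zip(x, y)]) for y in data]


def kde(x, data, h):
    """Density estimate (1 / (n h^d)) * sum_i K((x - y_i) / h)."""
    n, d = len(data), len(x)
    return sum(kernel_terms(x, data, h)) / (n * h ** d)


def largest_term_share(x, data, h):
    """Fraction of the sum carried by the single largest term.

    For very small h in high dimension every term can underflow to 0.0,
    in which case this division fails; the toy values below avoid that.
    """
    terms = kernel_terms(x, data, h)
    return max(terms) / sum(terms)


data = [[0.0], [1.0]]
print(largest_term_share([0.1], data, 10.0))   # close to 0.5: both points contribute
print(largest_term_share([0.1], data, 0.05))   # close to 1.0: the nearest point dominates
```

This one-dimensional toy case only illustrates the qualitative effect of the bandwidth. The regimes analyzed in the paper concern the joint limit of large $n$ and $d$, which the sketch does not reproduce.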
Authors: Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai
Abstract: Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks. In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.
Authors: Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Ting Liu, Bing Qin
Abstract: LLMs' performance on complex tasks is still unsatisfactory. A key issue is that presently LLMs learn in a data-driven schema, while the instructions about these complex tasks are both scarce and hard to collect or construct. On the contrary, a prominent phenomenon is that LLMs can learn rather fast on simpler tasks with adequate prior knowledge captured during the pretraining stage. Thus, if the prerequisite and mechanism of such rapid generalization could be elucidated, it could enhance the efficiency and effectiveness of the LLM's ability to learn complex tasks. In this paper, we therefore employ a gradient-based method to dissect how SFT adapts LLMs to downstream tasks, from the perspective of attention patterns. We find that: (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly impact activation patterns after SFT on a small number of samples. Based on these insights, experiments are conducted to actually enhance the efficiency and effectiveness of SFT.
Authors: Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang
Abstract: Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs. At its core, we introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization, effectively reducing data redundancy. Building on this, we implement an arbitrary precision matrix multiplication scheme that decomposes and recovers matrices at the bit level, enabling flexible precision while maximizing GPU Tensor Core utilization. Furthermore, we develop an efficient matrix preprocessing method that optimizes data layout for subsequent computations. Finally, we design a data recovery-oriented memory management system that strategically utilizes fast shared memory, significantly enhancing kernel execution speed and minimizing memory access latency. Experimental results demonstrate our approach's effectiveness, with up to 2.4$\times$ speedup in matrix multiplication compared to NVIDIA's CUTLASS. When integrated into LLMs, we achieve up to 6.7$\times$ inference acceleration. These improvements significantly enhance LLM inference efficiency, enabling broader and more responsive applications of LLMs.
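The idea of decomposing and recovering a matrix product at the bit level, mentioned in the abstract above, can be illustrated with ordinary unsigned integers. This sketch is not the paper's bipolar-INT format and does not use GPU Tensor Cores. It only shows that a product of low-precision integer matrices equals a shifted sum of products of 1-bit matrices. All function names are hypothetical.

```python
def bit_planes(matrix, bits):
    """Split a matrix of unsigned ints into `bits` binary matrices, LSB first."""
    return [[[(v >> b) & 1 for v in row] for row in matrix] for b in range(bits)]


def matmul(a, b):
    """Plain integer matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def bitwise_matmul(a, b, a_bits, b_bits):
    """Recover a @ b from 1-bit products: sum over (i, j) of 2^(i+j) * A_i @ B_j."""
    rows, cols = len(a), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i, a_plane in enumerate(bit_planes(a, a_bits)):
        for j, b_plane in enumerate(bit_planes(b, b_bits)):
            partial = matmul(a_plane, b_plane)  # product of two binary matrices
            for r in range(rows):
                for c in range(cols):
                    out[r][c] += partial[r][c] << (i + j)
    return out


a = [[3, 1], [2, 0]]  # entries fit in 2 bits
b = [[1, 2], [3, 0]]  # entries fit in 2 bits
print(bitwise_matmul(a, b, 2, 2))  # [[6, 6], [2, 4]], same as matmul(a, b)
```

Because each operand has its own bit width, the two matrices can use different precisions, and every partial product involves only binary matrices.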
Authors: Ji Liu, Jiaxiang Ren, Ruoming Jin, Zijie Zhang, Yang Zhou, Patrick Valduriez, Dejing Dou
Abstract: As a promising paradigm to collaboratively train models with decentralized data, Federated Learning (FL) can be exploited to fine-tune Large Language Models (LLMs). As LLMs are huge in size and the scale of the training data increases significantly, this leads to tremendous computation and communication costs. The training data is generally non-Independent and Identically Distributed (non-IID), which requires adaptive data processing within each device. Although Low Rank Adaptation (LoRA) can significantly reduce the scale of parameters to update in the fine-tuning process, it still takes unaffordable time to transfer the low-rank parameters of all the layers in LLMs. In this paper, we propose a Fisher Information-based Efficient Curriculum Federated Learning framework (FibecFed) with two novel methods, i.e., adaptive federated curriculum learning and efficient sparse parameter update. First, we propose a Fisher information-based method to adaptively sample data within each device to improve the effectiveness of the FL fine-tuning process. Second, we dynamically select the proper layers for global aggregation and sparse parameters for local update with LoRA so as to improve the efficiency of the FL fine-tuning process. Extensive experimental results based on 10 datasets demonstrate that FibecFed yields excellent performance (up to 45.35% in terms of accuracy) and superb fine-tuning speed (up to 98.61% faster) compared with 17 baseline approaches.
Authors: Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, Anima Anandkumar
Abstract: Large Language Models (LLMs) have been successful in mathematical reasoning tasks such as formal theorem proving when integrated with interactive proof assistants like Lean. Existing approaches involve training or fine-tuning an LLM on a specific dataset to perform well on particular domains, such as undergraduate-level mathematics. These methods struggle with generalizability to advanced mathematics. A fundamental limitation is that these approaches operate on static domains, failing to capture how mathematicians often work across multiple domains and projects simultaneously or cyclically. We present LeanAgent, a novel lifelong learning framework for theorem proving that continuously generalizes to and improves on ever-expanding mathematical knowledge without forgetting previously learned knowledge. LeanAgent introduces several key innovations, including a curriculum learning strategy that optimizes the learning trajectory in terms of mathematical difficulty, a dynamic database for efficient management of evolving mathematical knowledge, and progressive training to balance stability and plasticity. LeanAgent successfully proves 162 theorems previously unproved by humans across 23 diverse Lean repositories, many from advanced mathematics. It performs significantly better than the static LLM baseline, proving challenging theorems in domains like abstract algebra and algebraic topology while showcasing a clear progression of learning from basic concepts to advanced topics. In addition, we analyze LeanAgent's superior performance on key lifelong learning metrics. LeanAgent achieves exceptional scores in stability and backward transfer, where learning new tasks improves performance on previously learned tasks. This emphasizes LeanAgent's continuous generalizability and improvement, explaining its superior theorem-proving performance.
Authors: Haolin Liu, Artin Tajdini, Andrew Wagenmaker, Chen-Yu Wei
Abstract: In linear bandits, how can a learner effectively learn when facing corrupted rewards? While significant work has explored this question, a holistic understanding across different adversarial models and corruption measures is lacking, as is a full characterization of the minimax regret bounds. In this work, we compare two types of corruptions commonly considered: strong corruption, where the corruption level depends on the action chosen by the learner, and weak corruption, where the corruption level does not depend on the action chosen by the learner. We provide a unified framework to analyze these corruptions. For stochastic linear bandits, we fully characterize the gap between the minimax regret under strong and weak corruptions. We also initiate the study of corrupted adversarial linear bandits, obtaining upper and lower bounds with matching dependencies on the corruption level. Next, we reveal a connection between corruption-robust learning and learning with gap-dependent misspecification, a setting first studied by Liu et al. (2023a), where the misspecification level of an action or policy is proportional to its suboptimality. We present a general reduction that enables any corruption-robust algorithm to handle gap-dependent misspecification. This allows us to recover the results of Liu et al. (2023a) in a black-box manner and significantly generalize them to settings like linear MDPs, yielding the first results for gap-dependent misspecification in reinforcement learning. However, this general reduction does not attain the optimal rate for gap-dependent misspecification. Motivated by this, we develop a specialized algorithm that achieves optimal bounds for gap-dependent misspecification in linear bandits, thus answering an open question posed by Liu et al. (2023a).
Authors: Rohan Alur, Loren Laine, Darrick K. Li, Dennis Shung, Manish Raghavan, Devavrat Shah
Abstract: We introduce a novel framework for human-AI collaboration in prediction and decision tasks. Our approach leverages human judgment to distinguish inputs which are algorithmically indistinguishable, or "look the same" to any feasible predictive algorithm. We argue that this framing clarifies the problem of human-AI collaboration in prediction and decision tasks, as experts often form judgments by drawing on information which is not encoded in an algorithm's training data. Algorithmic indistinguishability yields a natural test for assessing whether experts incorporate this kind of "side information", and further provides a simple but principled method for selectively incorporating human feedback into algorithmic predictions. We show that this method provably improves the performance of any feasible algorithmic predictor and precisely quantify this improvement. We demonstrate the utility of our framework in a case study of emergency room triage decisions, where we find that although algorithmic risk scores are highly competitive with physicians, there is strong evidence that physician judgments provide signal which could not be replicated by any predictive algorithm. This insight yields a range of natural decision rules which leverage the complementary strengths of human experts and predictive algorithms.
Authors: Devdhar Patel, Hava Siegelmann
Abstract: Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a "temporal recall" mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Additionally, we compare SRL with model-based online planning, showing that SRL achieves superior FAS while leveraging the same model during training that online planners use for planning.
Authors: Stefano Alberto Russo, Giuliano Taffoni, Luca Bortolussi
Abstract: Timeseria is an object-oriented time series processing library implemented in Python, which aims at making it easier to manipulate time series data and to build statistical and machine learning models on top of it. Unlike common data analysis frameworks, it builds up from well defined and reusable logical units (objects), which can be easily combined together in order to ensure a high level of consistency. Thanks to this approach, Timeseria can address by design several non-trivial issues often underestimated, such as handling data losses, non-uniform sampling rates, differences between aggregated data and punctual observations, time zones, daylight saving times, and more. Timeseria comes with a comprehensive set of base data structures, common data manipulation operations, and extensible models for data reconstruction, forecasting and anomaly detection. It also integrates a powerful plotting engine capable of handling even millions of data points.
Authors: Morris Yau, Ekin Akyürek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas
Abstract: Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of universal Turing machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, learning key-value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
Authors: Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen
Abstract: Training Large Language Models (LLMs) efficiently at scale presents a formidable challenge, driven by their ever-increasing computational demands and the need for enhanced performance. In this work, we introduce Liger-Kernel, an open-sourced set of Triton kernels developed specifically for LLM training. With kernel optimization techniques like kernel operation fusing and input chunking, our kernels achieve on average a 20% increase in training throughput and a 60% reduction in GPU memory usage for popular LLMs compared to HuggingFace implementations. In addition, Liger-Kernel is designed with modularity, accessibility, and adaptability in mind, catering to both casual and expert users. Comprehensive benchmarks and integration tests are built in to ensure compatibility, performance, correctness, and convergence across diverse computing environments and model architectures. The source code is available under a permissive license at: github.com/linkedin/Liger-Kernel.
Authors: Zhitong Xu, Da Long, Yiming Xu, Guang Yang, Shandian Zhe, Houman Owhadi
Abstract: This paper introduces a novel kernel learning framework toward efficiently solving nonlinear partial differential equations (PDEs). In contrast to the state-of-the-art kernel solver that embeds differential operators within kernels, posing challenges with a large number of collocation points, our approach eliminates these operators from the kernel. We model the solution using a standard kernel interpolation form and differentiate the interpolant to compute the derivatives. Our framework obviates the need for complex Gram matrix construction between solutions and their derivatives, allowing for a straightforward implementation and scalable computation. As an instance, we allocate the collocation points on a grid and adopt a product kernel, which yields a Kronecker product structure in the interpolation. This structure enables us to avoid computing the full Gram matrix, reducing costs and scaling efficiently to a large number of collocation points. We provide a proof of the convergence and rate analysis of our method under appropriate regularity assumptions. In numerical experiments, we demonstrate the advantages of our method in solving several benchmark PDEs.
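The Kronecker structure mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' solver: it only shows, for a plain interpolation problem on a 2-D grid with a product kernel, how the linear system can be solved from the two small per-dimension Gram matrices without forming the full one. The kernel choice, lengthscale, grid sizes, jitter, and all function names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=0.2):
    """Squared-exponential kernel matrix between 1-D point sets x and y (assumed kernel)."""
    d = x[:, None] - y[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def kron_solve(K1, K2, Y):
    """Solve (K1 kron K2) vec(A) = vec(Y) using only the two small factors.

    With row-major vec, (K1 kron K2) vec(A) = vec(K1 @ A @ K2.T),
    so A = K1^{-1} @ Y @ K2^{-T}.
    """
    left = np.linalg.solve(K1, Y)          # K1^{-1} Y
    return np.linalg.solve(K2, left.T).T   # left @ K2^{-T}

# Hypothetical grid of collocation points: 6 x 5.
x1 = np.linspace(0.0, 1.0, 6)
x2 = np.linspace(0.0, 1.0, 5)
jitter = 1e-8  # small diagonal term for numerical stability (assumption)
K1 = rbf_kernel(x1, x1) + jitter * np.eye(len(x1))
K2 = rbf_kernel(x2, x2) + jitter * np.eye(len(x2))

# Function values on the grid that the interpolant should reproduce.
Y = np.sin(2 * np.pi * x1)[:, None] * np.cos(np.pi * x2)[None, :]

# Interpolation coefficients, computed without building the 30 x 30 Gram matrix.
A = kron_solve(K1, K2, Y)
```

Here the two solves cost on the order of the cube of each grid dimension rather than the cube of their product, which is the kind of saving the abstract attributes to the grid-plus-product-kernel setup.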
Authors: Jiacheng Cen, Anyi Li, Ning Lin, Yuxiang Ren, Zihe Wang, Wenbing Huang
Abstract: Equivariant Graph Neural Networks (GNNs) that incorporate E(3) symmetry have achieved significant success in various scientific applications. As one of the most successful models, EGNN leverages a simple scalarization technique to perform equivariant message passing over only Cartesian vectors (i.e., 1st-degree steerable vectors), enjoying greater efficiency and efficacy compared to equivariant GNNs using higher-degree steerable vectors. This success suggests that higher-degree representations might be unnecessary. In this paper, we disprove this hypothesis by exploring the expressivity of equivariant GNNs on symmetric structures, including $k$-fold rotations and regular polyhedra. We theoretically demonstrate that equivariant GNNs will always degenerate to a zero function if the degree of the output representations is fixed to 1 or other specific values. Based on this theoretical insight, we propose HEGNN, a high-degree version of EGNN to increase the expressivity by incorporating high-degree steerable vectors while maintaining EGNN's efficiency through the scalarization trick. Our extensive experiments demonstrate that HEGNN not only aligns with our theoretical analyses on toy datasets consisting of symmetric structures, but also shows substantial improvements on more complicated datasets such as $N$-body and MD17. Our theoretical findings and empirical results potentially open up new possibilities for the research of equivariant GNNs.
Authors: Panayiotis Petousis (on behalf of CURE-CKD), David Gordon (on behalf of CURE-CKD), Susanne B. Nicholas (on behalf of CURE-CKD), Alex A. T. Bui (on behalf of CURE-CKD)
Abstract: Randomized controlled trials (RCTs) are the standard for evaluating the effectiveness of clinical interventions. To address the limitations of RCTs on real-world populations, we developed a methodology that uses a large observational electronic health record (EHR) dataset. Principles of regression discontinuity (rd) were used to derive randomized data subsets to test expert-driven interventions using dynamic Bayesian Networks (DBNs) do-operations. This combined method was applied to a chronic kidney disease (CKD) cohort of more than two million individuals and used to understand the associational and causal relationships of CKD variables with respect to a surrogate outcome of >=40% decline in estimated glomerular filtration rate (eGFR). The associational and causal analyses depicted similar findings across DBNs from two independent healthcare systems. The associational analysis showed that the most influential variables were eGFR, urine albumin-to-creatinine ratio, and pulse pressure, whereas the causal analysis showed eGFR as the most influential variable, followed by modifiable factors such as medications that may impact kidney function over time. This methodology demonstrates how real-world EHR data can be used to provide population-level insights to inform improved healthcare delivery.
Authors: Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang
Abstract: Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size. In contrast to the existence of practical scaling laws governing pre-training, the quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice. In this work, we attempted to close this gap for post-training weight quantization of LLMs by conducting a systematic empirical study on multiple LLM families quantized to numerous low-precision tensor data types using popular weight quantization techniques. We identified key scaling factors pertaining to characteristics of the local loss landscape, based on which the performance of quantized LLMs can be reasonably well predicted by a statistical model.
Authors: Idan Attias, Steve Hanneke, Arvind Ramaswami
Abstract: We present novel reductions from sample compression schemes in multiclass classification, regression, and adversarially robust learning settings to binary sample compression schemes. Assuming we have a compression scheme for binary classes of size $f(d_\mathrm{VC})$, where $d_\mathrm{VC}$ is the VC dimension, then we have the following results: (1) If the binary compression scheme is a majority-vote or a stable compression scheme, then there exists a multiclass compression scheme of size $O(f(d_\mathrm{G}))$, where $d_\mathrm{G}$ is the graph dimension. Moreover, for general binary compression schemes, we obtain a compression of size $O(f(d_\mathrm{G})\log|Y|)$, where $Y$ is the label space. (2) If the binary compression scheme is a majority-vote or a stable compression scheme, then there exists an $\epsilon$-approximate compression scheme for regression over $[0,1]$-valued functions of size $O(f(d_\mathrm{P}))$, where $d_\mathrm{P}$ is the pseudo-dimension. For general binary compression schemes, we obtain a compression of size $O(f(d_\mathrm{P})\log(1/\epsilon))$. These results would have significant implications if the sample compression conjecture, which posits that any binary concept class with a finite VC dimension admits a binary compression scheme of size $O(d_\mathrm{VC})$, is resolved (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995; Warmuth, 2003). Our results would then extend the proof of the conjecture immediately to other settings. We establish similar results for adversarially robust learning and also provide an example of a concept class that is robustly learnable but has no bounded-size compression scheme, demonstrating that learnability is not equivalent to having a compression scheme independent of the sample size, unlike in binary classification, where compression of size $2^{O(d_\mathrm{VC})}$ is attainable (Moran and Yehudayoff, 2016).
Authors: Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang
Abstract: Prior methods propose to offset the escalating costs of modern foundation models by dropping specific parts of their contexts with hand-designed rules, while attempting to preserve their original performance. We overcome this trade-off with Neural Attention Memory Models (NAMMs), introducing a learned network for memory management that improves both the performance and efficiency of transformers. We evolve NAMMs atop pre-trained transformers to provide different latent contexts focusing on the most relevant information for individual layers and attention heads. NAMMs are universally applicable to any model using self-attention as they condition exclusively on the values in the produced attention matrices. Learning NAMMs on a small set of problems, we achieve substantial performance improvements across multiple long-context benchmarks while cutting the model's input contexts up to a fraction of the original sizes. We show the generality of our conditioning enables zero-shot transfer of NAMMs trained only on language to entirely new transformer architectures even across input modalities, with their benefits carrying over to vision and reinforcement learning.
Authors: Chuanyu Tang, Yilong Chen, Zhenyu Zhang, Junyuan Shang, Wenyuan Zhang, Yong Huang, Tingwen Liu
Abstract: Low-Rank Adaptation (LoRA) drives research to align its performance with full fine-tuning. However, significant challenges remain: (1) Simply increasing the rank size of LoRA does not effectively capture high-rank information, which leads to a performance bottleneck. (2) MoE-style LoRA methods substantially increase parameters and inference latency, contradicting the goals of efficient fine-tuning and ease of application. To address these challenges, we introduce Mixture of Ranks (MoR), which learns rank-specific information for different tasks based on input and efficiently integrates multi-rank information. We first propose a new framework that equates the integration of multiple LoRAs to expanding the rank of LoRA. Moreover, we hypothesize that low-rank LoRA already captures sufficient intrinsic information, and MoR can derive high-rank information through mathematical transformations of the low-rank components. Thus, MoR reduces the learning difficulty of LoRA and enhances its multi-task capabilities. MoR achieves impressive results, delivering a 1.31% performance improvement while using only 93.93% of the parameters compared to baseline methods.
Authors: Zawar Hussain, Quan Z. Sheng, Wei Emma Zhang, Jorge Ortiz, Seyedamin Pouriyeh
Abstract: Quality sleep is very important for a healthy life. Nowadays, many people around the world are not getting enough sleep, which is having negative impacts on their lifestyles. Sleep monitoring studies have therefore become an important tool for understanding sleep behavior. The gold standard method for sleep analysis is polysomnography (PSG) conducted in a clinical environment, but this method is both expensive and complex for long-term use. With the advancements in the field of sensors and the introduction of off-the-shelf technologies, unobtrusive solutions are becoming common as alternatives for in-home sleep monitoring. Various solutions have been proposed using both wearable and non-wearable methods, which are cheap and easy to use for in-home sleep monitoring. In this paper, we present a comprehensive survey of the latest research (2015 and after) conducted in various categories of sleep monitoring, including sleep stage classification, sleep posture recognition, sleep disorder detection, and vital signs monitoring. We review the latest works using the non-invasive approach and cover both wearable and non-wearable methods. We discuss the design approaches and key attributes of the work presented and provide an extensive analysis based on 10 key factors, to give a comprehensive overview of the recent developments and trends in all four categories of sleep monitoring. We also present some publicly available datasets for different categories of sleep monitoring. Finally, we discuss several open issues and provide future research directions in the area of sleep monitoring.
Authors: Yaoyiran Li, Fangyu Liu, Nigel Collier, Anna Korhonen, Ivan Vulić
Abstract: Word translation or bilingual lexicon induction (BLI) is a key cross-lingual task, aiming to bridge the lexical gap between different languages. In this work, we propose a robust and effective two-stage contrastive learning framework for the BLI task. In Stage C1, we propose to refine standard cross-lingual linear maps between static word embeddings (WEs) via a contrastive learning objective; we also show how to integrate it into the self-learning procedure for even more refined cross-lingual maps. In Stage C2, we conduct BLI-oriented contrastive fine-tuning of mBERT, unlocking its word translation capability. We also show that static WEs induced from the 'C2-tuned' mBERT complement static WEs from Stage C1. Comprehensive experiments on standard BLI datasets for diverse languages and different experimental setups demonstrate substantial gains achieved by our framework. While the BLI method from Stage C1 already yields substantial gains over all state-of-the-art BLI methods in our comparison, even stronger improvements are obtained with the full two-stage framework: e.g., we report gains for 112/112 BLI setups, spanning 28 language pairs.
Authors: Changxi Zhu, Mehdi Dastani, Shihan Wang
Abstract: Communication is an effective mechanism for coordinating the behaviors of multiple agents, broadening their views of the environment, and supporting their collaboration. In the field of multi-agent deep reinforcement learning (MADRL), agents can improve the overall learning performance and achieve their objectives through communication. Agents can communicate various types of messages, either to all agents or to specific agent groups, or conditioned on specific constraints. Despite the growing body of research work in MADRL with communication (Comm-MADRL), there is a lack of a systematic and structural approach to distinguish and classify existing Comm-MADRL approaches. In this paper, we survey recent works in the Comm-MADRL field and consider various aspects of communication that can play a role in designing and developing multi-agent reinforcement learning systems. With these aspects in mind, we propose 9 dimensions along which Comm-MADRL approaches can be analyzed, developed, and compared. By projecting existing works into the multi-dimensional space, we discover interesting trends. We also propose some novel directions for designing future Comm-MADRL systems through exploring possible combinations of the dimensions.
Authors: Yaoyiran Li, Fangyu Liu, Ivan Vulić, Anna Korhonen
Abstract: Bilingual lexicon induction (BLI) with limited bilingual supervision is a crucial yet challenging task in multilingual NLP. Current state-of-the-art BLI methods rely on the induction of cross-lingual word embeddings (CLWEs) to capture cross-lingual word similarities; such CLWEs are obtained 1) via traditional static models (e.g., VecMap), or 2) by extracting type-level CLWEs from multilingual pretrained language models (mPLMs), or 3) through combining the former two options. In this work, we propose a novel semi-supervised post-hoc reranking method termed BLICEr (BLI with Cross-Encoder Reranking), applicable to any precalculated CLWE space, which improves their BLI capability. The key idea is to 'extract' cross-lingual lexical knowledge from mPLMs, and then combine it with the original CLWEs. This crucial step is done via 1) creating a word similarity dataset, comprising positive word pairs (i.e., true translations) and hard negative pairs induced from the original CLWE space, and then 2) fine-tuning an mPLM (e.g., mBERT or XLM-R) in a cross-encoder manner to predict the similarity scores. At inference, we 3) combine the similarity score from the original CLWE space with the score from the BLI-tuned cross-encoder. BLICEr establishes new state-of-the-art results on two standard BLI benchmarks spanning a wide spectrum of diverse languages: it substantially outperforms a series of strong baselines across the board. We also validate the robustness of BLICEr with different CLWEs.
Authors: Neil K. Chada, Quanjun Lang, Fei Lu, Xiong Wang
Abstract: Kernels are efficient in representing nonlocal dependence and they are widely used to design operators between function spaces. Thus, learning kernels in operators from data is an inverse problem of general interest. Due to the nonlocal dependence, the inverse problem can be severely ill-posed with a data-dependent singular inversion operator. The Bayesian approach overcomes the ill-posedness through a non-degenerate prior. However, a fixed non-degenerate prior leads to a divergent posterior mean when the observation noise becomes small, if the data induces a perturbation in the eigenspace of zero eigenvalues of the inversion operator. We introduce a data-adaptive prior to achieve a stable posterior whose mean always has a small noise limit. The data-adaptive prior's covariance is the inversion operator with a hyper-parameter selected adaptive to data by the L-curve method. Furthermore, we provide a detailed analysis on the computational practice of the data-adaptive prior, and demonstrate it on Toeplitz matrices and integral operators. Numerical tests show that a fixed prior can lead to a divergent posterior mean in the presence of any of the four types of errors: discretization error, model error, partial observation and wrong noise assumption. In contrast, the data-adaptive prior always attains posterior means with small noise limits.
Authors: Dongyue Guo, Zheng Zhang, Bo Yang, Jianwei Zhang, Hongyu Yang, Yi Lin
Abstract: The booming air transportation industry inevitably increases air traffic controllers' workload, causing unexpected human factor-related incidents. Current air traffic control systems fail to consider spoken instructions for traffic prediction, bringing significant challenges in detecting human errors during real-time traffic operations. Here, we present an automation paradigm integrating controlling intent into the information processing loop through the spoken instruction-aware flight trajectory prediction framework. A 3-stage progressive multi-modal learning paradigm is proposed to address the modality gap between the trajectory and spoken instructions, as well as to minimize the data requirements. Experiments on a real-world dataset show the proposed framework achieves flight trajectory prediction with high predictability and timeliness, obtaining over 20% relative reduction in mean deviation error. Moreover, the generalizability of the proposed framework is also confirmed by various model architectures. The proposed framework can formulate fully automated information processing in real-world air traffic applications, supporting human error detection and enhancing aviation safety.
Authors: Senrui Chen, Changhun Oh, Sisi Zhou, Hsin-Yuan Huang, Liang Jiang
Abstract: Quantum entanglement is a crucial resource for learning properties from nature, but a precise characterization of its advantage can be challenging. In this work, we consider learning algorithms without entanglement to be those that only utilize states, measurements, and operations that are separable between the main system of interest and an ancillary system. Interestingly, we show that these algorithms are equivalent to those that apply quantum circuits on the main system interleaved with mid-circuit measurements and classical feedforward. Within this setting, we prove a tight lower bound for Pauli channel learning without entanglement that closes the gap between the best-known upper and lower bound. In particular, we show that $\Theta(2^n\varepsilon^{-2})$ rounds of measurements are required to estimate each eigenvalue of an $n$-qubit Pauli channel to $\varepsilon$ error with high probability when learning without entanglement. In contrast, a learning algorithm with entanglement only needs $\Theta(\varepsilon^{-2})$ copies of the Pauli channel. The tight lower bound strengthens the foundation for an experimental demonstration of entanglement-enhanced advantages for Pauli noise characterization.
Authors: Kangxian Xie, Jiancheng Yang, Donglai Wei, Ziqiao Weng, Pascal Fua
Abstract: Pulmonary diseases rank prominently among the principal causes of death worldwide. Curing them will require, among other things, a better understanding of the complex 3D tree-shaped structures within the pulmonary system, such as airways, arteries, and veins. Traditional approaches using high-resolution image stacks and standard CNNs on dense voxel grids face challenges in computational efficiency, limited resolution, local context, and inadequate preservation of shape topology. Our method addresses these issues by shifting from dense voxel to sparse point representation, offering better memory efficiency and global context utilization. However, the inherent sparsity in point representation can lead to a loss of crucial connectivity in tree-shaped structures. To mitigate this, we introduce graph learning on skeletonized structures, incorporating differentiable feature fusion for improved topology and long-distance context capture. Furthermore, we employ an implicit function for efficient conversion of sparse representations into dense reconstructions end-to-end. The proposed method not only delivers state-of-the-art performance in labeling accuracy, both overall and at key locations, but also enables efficient inference and the generation of closed surface shapes. Addressing data scarcity in this field, we have also curated a comprehensive dataset to validate our approach. Data and code are available at https://github.com/M3DV/pulmonary-tree-labeling.
Authors: Marien Renaud, Jiaming Liu, Valentin de Bortoli, Andrés Almansa, Ulugbek S. Kamilov
Abstract: Posterior sampling has been shown to be a powerful Bayesian approach for solving imaging inverse problems. The recent plug-and-play unadjusted Langevin algorithm (PnP-ULA) has emerged as a promising method for Monte Carlo sampling and minimum mean squared error (MMSE) estimation by combining physical measurement models with deep-learning priors specified using image denoisers. However, the intricate relationship between the sampling distribution of PnP-ULA and the mismatched data-fidelity and denoiser has not been theoretically analyzed. We address this gap by proposing a posterior-L2 pseudometric and using it to quantify an explicit error bound for PnP-ULA under mismatched posterior distribution. We numerically validate our theory on several inverse problems such as sampling from Gaussian mixture models and image deblurring. Our results suggest that the sensitivity of the sampling distribution of PnP-ULA to a mismatch in the measurement model and the denoiser can be precisely characterized.
Authors: Ralph Peeters, Aaron Steiner, Christian Bizer
Abstract: Entity matching is the task of deciding whether two entity descriptions refer to the same real-world entity. Entity matching is a central step in most data integration pipelines. Many state-of-the-art entity matching methods rely on pre-trained language models (PLMs) such as BERT or RoBERTa. Two major drawbacks of these models for entity matching are that (i) the models require significant amounts of task-specific training data and (ii) the fine-tuned models are not robust concerning out-of-distribution entities. This paper investigates using generative large language models (LLMs) as a less task-specific training data-dependent and more robust alternative to PLM-based matchers. The study covers hosted and open-source LLMs which can be run locally. We evaluate these models in a zero-shot scenario and a scenario where task-specific training data is available. We compare different prompt designs and the prompt sensitivity of the models. We show that there is no single best prompt but that the prompt needs to be tuned for each model/dataset combination. We further investigate (i) the selection of in-context demonstrations, (ii) the generation of matching rules, as well as (iii) fine-tuning LLMs using the same pool of training data. Our experiments show that the best LLMs require no or only a few training examples to perform comparably to PLMs that were fine-tuned using thousands of examples. LLM-based matchers further exhibit higher robustness to unseen entities. We show that GPT4 can generate structured explanations for matching decisions and can automatically identify potential causes of matching errors by analyzing explanations of wrong decisions. We demonstrate that the model can generate meaningful textual descriptions of the identified error classes, which can help data engineers to improve entity matching pipelines.
Authors: Jacob Tanner, Sina Mansour L., Ludovico Coletta, Alessandro Gozzi, Richard F. Betzel
Abstract: Recent theoretical and experimental work in neuroscience has focused on the representational and dynamical character of neural manifolds, that is, subspaces in neural activity space wherein many neurons coactivate. Importantly, neural populations studied under this "neural manifold hypothesis" are continuous and not cleanly divided into separate neural populations. This perspective clashes with the "modular hypothesis" of brain organization, wherein neural elements maintain an "all-or-nothing" affiliation with modules. In line with this modular hypothesis, recent research on recurrent neural networks suggests that multi-task networks become modular across training, such that different modules specialize for task-general dynamical motifs. If the modular hypothesis is true, then it would be important to use a dimensionality reduction technique that captures modular structure. Here, we investigate the features of such a method. We leverage RNNs as a model system to study the character of modular neural populations, using a community detection method from network science known as modularity maximization to partition neurons into distinct modules. These partitions allow us to ask the following question: do these modular boundaries matter to the system? ...
Authors: Jean Ogier du Terrail, Quentin Klopfenstein, Honghao Li, Imke Mayer, Nicolas Loiseau, Mohammad Hallal, Michael Debouver, Thibault Camalon, Thibault Fouqueray, Jorge Arellano Castro, Zahia Yanes, Laetitia Dahan, Julien Taïeb, Pierre Laurent-Puig, Jean-Baptiste Bachet, Shulin Zhao, Remy Nicolle, Jérome Cros, Daniel Gonzalez, Robert Carreras-Torres, Adelaida Garcia Velasco, Kawther Abdilleh, Sudheer Doss, Félix Balazard, Mathieu Andreux
Abstract: External control arms (ECA) can inform the early clinical development of experimental drugs and provide efficacy evidence for regulatory approval. However, the main challenge in implementing ECA lies in accessing real-world or historical clinical trials data. Indeed, regulations protecting patients' rights by strictly controlling data processing make pooling data from multiple sources in a central server often difficult. To address these limitations, we develop a new method, 'FedECA' that leverages federated learning (FL) to enable inverse probability of treatment weighting (IPTW) for time-to-event outcomes on separate cohorts without needing to pool data. To showcase the potential of FedECA, we apply it in different settings of increasing complexity culminating with a real-world use-case in which FedECA provides evidence for a differential effect between two drugs that would have otherwise gone unnoticed. By sharing our code, we hope FedECA will foster the creation of federated research networks and thus accelerate drug development.
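As a rough illustration of the IPTW step named in the abstract above, the sketch below computes standard inverse-probability-of-treatment weights from already estimated propensity scores. It is not the FedECA implementation and leaves out the federated part entirely; the function name `iptw_weights` and the clipping bound `eps` are assumptions made for this example.

```python
def iptw_weights(treated, propensity, eps=1e-6):
    """Return one inverse-probability-of-treatment weight per patient.

    treated    -- list of 0/1 treatment indicators
    propensity -- list of estimated P(treated = 1 | covariates)
    eps        -- clipping bound that keeps weights finite (assumption of this sketch)
    """
    if len(treated) != len(propensity):
        raise ValueError("treated and propensity must have the same length")
    weights = []
    for t, e in zip(treated, propensity):
        e = min(max(e, eps), 1.0 - eps)  # clip the score away from 0 and 1
        # Treated patients are weighted by 1/e, controls by 1/(1 - e).
        weights.append(1.0 / e if t == 1 else 1.0 / (1.0 - e))
    return weights


if __name__ == "__main__":
    # Two treated and two control patients with made-up propensity scores.
    print(iptw_weights([1, 1, 0, 0], [0.5, 0.25, 0.75, 0.5]))
```

Patients whose observed treatment was unlikely given their covariates receive larger weights, which is what rebalances the two arms before a weighted time-to-event analysis.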
Authors: Eva Giboulot, Teddy Furon
Abstract: Watermarking is a technical means to dissuade malfeasant usage of Large Language Models. This paper proposes a novel watermarking scheme, so-called WaterMax, that enjoys high detectability while sustaining the quality of the generated text of the original LLM. Its new design leaves the LLM untouched (no modification of the weights, logits, temperature, or sampling technique). Unlike the watermarking techniques in the literature, which inherently provoke a trade-off between quality and robustness, WaterMax balances robustness and complexity. Its performance is both theoretically proven and experimentally validated. It outperforms all the SotA techniques under the most complete benchmark suite. Code available at https://github.com/eva-giboulot/WaterMax.
Authors: Manish Chandra, Debasis Ganguly, Iadh Ounis
Abstract: In-context learning (ICL) refers to the process of adding a small number of localized examples (ones that are semantically similar to the input) from a training set of labelled data to an LLM's prompt, with the objective of steering the generative process to improve downstream task performance. Existing ICL approaches use an identical number of examples (a pre-configured hyper-parameter) for each data instance. Our work alleviates the limitations of this 'one size fits all' approach by dynamically predicting the number of examples for each data instance to be used in few-shot inference with LLMs. In particular, we employ a multi-label classifier, the parameters of which are fitted using a training set, where the label for each instance in the training set indicates if using a specific value of k (number of most similar examples from 0 up to a maximum value) leads to correct k-shot downstream predictions. Our experiments on a number of text classification benchmarks show that our approach, AICL, substantially outperforms standard ICL by up to 17%.
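The labelling scheme described in the abstract above can be sketched in a few lines. This is an illustrative reading, not the authors' code: the function names are hypothetical, the classifier itself is left out, and the rule for turning per-k scores into a single k (smallest k above a threshold) is an assumption of this example.

```python
def build_k_labels(correct_by_k):
    """Turn per-k correctness flags into a multi-label target vector.

    correct_by_k -- list of booleans; entry k says whether k-shot inference
                    gave the right answer for this training instance.
    """
    return [1 if ok else 0 for ok in correct_by_k]


def choose_k(scores, threshold=0.5):
    """Pick the number of in-context examples for one test instance.

    scores    -- per-k scores from some multi-label classifier (hypothetical)
    threshold -- minimum score for a k to count as usable (assumption)
    Returns the smallest k whose score clears the threshold, and falls back
    to the highest-scoring k when none does.
    """
    for k, score in enumerate(scores):
        if score >= threshold:
            return k
    return max(range(len(scores)), key=lambda k: scores[k])


if __name__ == "__main__":
    # An instance answered wrongly with 0 shots but correctly with 1 or 2.
    print(build_k_labels([False, True, True]))
    print(choose_k([0.2, 0.7, 0.9]))
```

Preferring the smallest adequate k keeps prompts short; other selection rules fit the same labels equally well.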
Authors: Norman Di Palo, Edward Johns
Abstract: We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). Despite being trained only on language, we show that these Transformers excel at translating tokenised visual keypoint observations into action trajectories, performing on par with or better than state-of-the-art imitation learning (diffusion policies) in the low-data regime on a suite of real-world, everyday tasks. Rather than operating in the language domain as is typical, KAT leverages text-based Transformers to operate in the vision and action domains to learn general patterns in demonstration data for highly efficient imitation learning, indicating promising new avenues for repurposing natural language models for embodied tasks. Videos are available at https://www.robot-learning.uk/keypoint-action-tokens.
Authors: Pouya Pezeshkpour, Estevam Hruschka
Abstract: Utilizing large language models (LLMs) to rank a set of items has become a common approach in recommendation and retrieval systems. Typically, these systems focus on ordering a substantial number of documents in a monotonic order based on a given query. However, real-world scenarios often present a different challenge: ranking a comparatively smaller set of items, but according to a variety of diverse and occasionally conflicting conditions. In this paper, we define and explore the task of multi-conditional ranking by introducing MCRank, a benchmark tailored for assessing multi-conditional ranking across various item types and conditions. Our analysis of LLMs using MCRank indicates a significant decrease in performance as the number and complexity of items and conditions grow. To overcome this limitation, we propose a novel decomposed reasoning method, consisting of EXtracting and Sorting the conditions, and then Iteratively Ranking the items (EXSIR). Our extensive experiments show that this decomposed reasoning method enhances LLMs' performance significantly, achieving up to a 14.4% improvement over existing LLMs. We also provide a detailed analysis of LLMs' performance across various condition categories, and examine the effectiveness of the decomposition step. Furthermore, we compare our method with existing approaches such as Chain-of-Thought and existing ranking models, demonstrating the superiority of our approach and the complexity of the MCR task. We released our dataset and code.
Authors: Javier Marin
Abstract: Understanding how consumers react to advertising inputs is essential for marketers aiming to optimize advertising strategies and improve campaign effectiveness. This study examines the complex nature of consumer behaviour by applying theoretical frameworks derived from physics and social psychology. We present an innovative equation that captures the relation between spending on advertising and consumer response, using concepts such as symmetries, scaling laws, and phase transitions. By validating our equation against well-known models such as the Michaelis-Menten and Hill equations, we prove its effectiveness in accurately representing the complexity of consumer response dynamics. The analysis emphasizes the importance of key model parameters, such as marketing effectiveness, response sensitivity, and behavioural sensitivity, in influencing consumer behaviour. The work explores the practical implications for advertisers and marketers, and discusses the limitations and future research directions. In summary, this study provides a thorough framework for comprehending and forecasting consumer reactions to advertising, which has implications for optimizing advertising strategies and allocating resources.
Authors: Dayu Yang
Abstract: Merger and Acquisition (M&A) activities play a vital role in market consolidation and restructuring. For acquiring companies, M&A serves as a key investment strategy, with one primary goal being to attain complementarities that enhance market power in competitive industries. In addition to intrinsic factors, a firm's M&A behavior is influenced by the M&A activities of its peers, a phenomenon known as the "peer effect." However, existing research often fails to capture the rich interdependencies among M&A events within industry networks. An effective M&A predictive model should offer deal-level predictions without requiring ad-hoc feature engineering or data rebalancing. Such a model would predict the M&A behaviors of rival firms and provide specific recommendations for both bidder and target firms. However, most current models only predict one side of an M&A deal, lack firm-specific recommendations, and rely on arbitrary time intervals that impair predictive accuracy. Additionally, due to the sparsity of M&A events, existing models require data rebalancing, which introduces bias and limits their real-world applicability. To address these challenges, we propose a Temporal Dynamic Industry Network (TDIN) model, leveraging temporal point processes and deep learning to capture complex M&A interdependencies without ad-hoc data adjustments. The temporal point process framework inherently models event sparsity, eliminating the need for data rebalancing. Empirical evaluations on M&A data from January 1997 to December 2020 validate the effectiveness of our approach in predicting M&A events and offering actionable, deal-level recommendations.
Authors: Gabriel Arpino, Xiaoqi Liu, Julia Gontarek, Ramji Venkataramanan
Abstract: We consider the problem of localizing change points in a generalized linear model (GLM), a model that covers many widely studied problems in statistical learning including linear, logistic, and rectified linear regression. We propose a novel and computationally efficient Approximate Message Passing (AMP) algorithm for estimating both the signals and the change point locations, and rigorously characterize its performance in the high-dimensional limit where the number of parameters $p$ is proportional to the number of samples $n$. This characterization is in terms of a state evolution recursion, which allows us to precisely compute performance measures such as the asymptotic Hausdorff error of our change point estimates, and allows us to tailor the algorithm to take advantage of any prior structural information on the signals and change points. Moreover, we show how our AMP iterates can be used to efficiently compute a Bayesian posterior distribution over the change point locations in the high-dimensional limit. We validate our theory via numerical experiments, and demonstrate the favorable performance of our estimators on both synthetic and real data in the settings of linear, logistic, and rectified linear regression.
Authors: Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
Abstract: We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with the remaining layers of the model. Our proposed self-speculative decoding approach has a smaller memory footprint than other speculative decoding approaches and benefits from shared compute and activations of the draft and verification stages. We run experiments on different Llama model sizes on different types of training: pretraining from scratch, continual pretraining, finetuning on a specific data domain, and finetuning on a specific task. We implement our inference solution and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x on coding, and 2.0x on the TOPv2 semantic parsing task. We open source our code and checkpoints at https://github.com/facebookresearch/LayerSkip.
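The draft-then-verify loop behind speculative decoding, which the abstract above applies within a single model, might look roughly like the toy below. It is a sketch under strong assumptions: `draft_next` and `full_next` are hypothetical callables standing in for the early-exit path and the full model, decoding is greedy, and verification runs token by token rather than in one batched pass that reuses the draft's activations.

```python
def self_speculative_decode(prompt, draft_next, full_next, num_new, draft_len=4):
    """Greedy draft-then-verify decoding over token lists.

    draft_next -- callable(tokens) -> next token from the cheap early-exit path
    full_next  -- callable(tokens) -> next token from the full model
    Both callables are hypothetical stand-ins for a real model.
    """
    tokens = list(prompt)
    target = len(tokens) + num_new
    while len(tokens) < target:
        # Draft stage: propose up to draft_len tokens with the cheap path.
        draft = []
        while len(draft) < draft_len and len(tokens) + len(draft) < target:
            draft.append(draft_next(tokens + draft))
        # Verify stage: the full model re-predicts each drafted position.
        # Matching tokens are accepted; the first mismatch is replaced by
        # the full model's token and the rest of the draft is discarded.
        for tok in draft:
            expected = full_next(tokens)
            tokens.append(expected)
            if expected != tok:
                break
    return tokens


if __name__ == "__main__":
    full = lambda t: len(t) % 3   # toy "full model"
    draft = lambda t: len(t) % 2  # toy "early exit" that is sometimes wrong
    print(self_speculative_decode([0], draft, full, num_new=5))
```

With greedy decoding the output is identical to what the full model alone would produce; the draft only changes how many full-model steps can be checked together.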
Authors: Eduard Poesina, Cornelia Caragea, Radu Tudor Ionescu
Abstract: Natural language inference (NLI), the task of recognizing the entailment relationship in sentence pairs, is an actively studied topic serving as a proxy for natural language understanding. Despite the relevance of the task in building conversational agents and improving text classification, machine translation and other NLP tasks, to the best of our knowledge, there is no publicly available NLI corpus for the Romanian language. To this end, we introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs, which are obtained via distant supervision, and 6K validation and test sentence pairs, which are manually annotated with the correct labels. We conduct experiments with multiple machine learning methods based on distant learning, ranging from shallow models based on word embeddings to transformer-based neural networks, to establish a set of competitive baselines. Furthermore, we improve on the best model by employing a new curriculum learning strategy based on data cartography. Our dataset and code to reproduce the baselines are available at https://github.com/Eduard6421/RONLI.
Authors: Ivan Gvozdanović, Sonja Petrović
Abstract: We consider the problem of constructing exact goodness-of-fit tests for discrete exponential family models. This classical problem remains practically unsolved for many types of structured or sparse data, as it rests on a computationally difficult core task: to produce a reliable sample from lattice points in a high-dimensional polytope. We translate the problem into a Markov decision process and demonstrate a reinforcement learning approach for learning 'good moves' for sampling. We illustrate the approach on data sets and models for which traditional MCMC samplers converge too slowly due to problem size, sparsity structure, and the requirement to use prohibitive non-linear algebra computations in the process. The differentiating factor is the use of scalable tools from linear algebra in the context of theoretical guarantees provided by non-linear algebra. Our algorithm is based on an actor-critic sampling scheme, with provable convergence. The discovered moves can be used to efficiently obtain an exchangeable sample, significantly cutting computational times with regard to statistical testing.
Authors: Miao Zhang, Ziming Wang, Runtian Xing, Kui Xiao, Zhifei Li, Yan Zhang, Chang Tang
Abstract: Cognitive diagnosis is a fundamental and critical task in learning assessment, which aims to infer students' proficiency on knowledge concepts from their response logs. Existing work assumes that each knowledge concept will certainly be tested and covered by multiple exercises. However, in both online and offline courses, it is hardly feasible to completely cover all knowledge concepts in several exercises. Restricted tests lead to undiscovered knowledge deficits, especially for untested knowledge concepts (UKCs). In this paper, we propose a novel framework for cognitive diagnosis on untested knowledge, called Disentangling Heterogeneous Knowledge Cognitive Diagnosis (DisKCD). Specifically, we leverage course grades, exercise questions, and learning resources to learn the potential representations of students, exercises, and knowledge concepts. In particular, knowledge concepts are disentangled into tested and untested based on the limited set of actual exercises. We construct a heterogeneous relation graph network via students, exercises, tested knowledge concepts (TKCs), and UKCs. Then, through a hierarchical heterogeneous message-passing mechanism, the fine-grained relations are incorporated into the embeddings of the entities. Finally, the embeddings will be applied to multiple existing cognitive diagnosis models to infer students' proficiency on UKCs. Experimental results on real-world datasets show that the proposed model can effectively improve the performance of the task of diagnosing students' proficiency on UKCs. Our code is available at https://github.com/Hubuers/DisKCD.
Authors: Yichun Hu, Nathan Kallus, Xiaojie Mao, Yanchen Wu
Abstract: Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients and thereby improve average-cost performance. An example is the stochastic shortest path problem with random edge costs (e.g., traffic) and contextual features (e.g., lagged traffic, weather). Existing work on CLO assumes the data has fully observed cost coefficient vectors, but in many applications, we can only see the realized cost of a historical decision, that is, just one projection of the random cost coefficient vector, to which we refer as bandit feedback. We study a class of offline learning algorithms for CLO with bandit feedback, which we term induced empirical risk minimization (IERM), where we fit a predictive model to directly optimize the downstream performance of the policy it induces. We show a fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of the optimization estimate, and we develop computationally tractable surrogate losses. A byproduct of our theory, of independent interest, is a fast-rate regret bound for IERM with full feedback and a misspecified policy class. We compare the performance of different modeling choices numerically using a stochastic shortest path example and provide practical insights from the empirical results.
Authors: Friedemann Zenke, Axel Laborieux
Abstract: Humans and animals learn throughout life. Such continual learning is crucial for intelligence. In this chapter, we examine the pivotal role plasticity mechanisms with complex internal synaptic dynamics could play in enabling this ability in neural networks. By surveying theoretical research, we highlight two fundamental enablers for continual learning. First, synaptic plasticity mechanisms must maintain and evolve an internal state over several behaviorally relevant timescales. Second, plasticity algorithms must leverage the internal state to intelligently regulate plasticity at individual synapses to facilitate the seamless integration of new memories while avoiding detrimental interference with existing ones. Our chapter covers successful applications of these principles to deep neural networks and underscores the significance of synaptic metaplasticity in sustaining continual learning capabilities. Finally, we outline avenues for further research to understand the brain's superb continual learning abilities and harness similar mechanisms for artificial intelligence systems.
Authors: Robert Graf, Paul-Sören Platzek, Evamaria Olga Riedel, Constanze Ramschütz, Sophie Starck, Hendrik Kristian Möller, Matan Atad, Henry Völzke, Robin Bülow, Carsten Oliver Schmidt, Julia Rüdebusch, Matthias Jung, Marco Reisert, Jakob Weiss, Maximilian Löffler, Fabian Bamberg, Bene Wiestler, Johannes C. Paetzold, Daniel Rueckert, Jan Stefan Kirschke
Abstract: Objectives: To present a publicly available torso segmentation network for large epidemiology datasets on volumetric interpolated breath-hold examination (VIBE) images. Materials & Methods: We extracted preliminary segmentations from TotalSegmentator, spine, and body composition networks for VIBE images, then improved them iteratively and retrained a nnUNet network. Using subsets of NAKO (85 subjects) and UK Biobank (16 subjects), we evaluated with the Dice score on a holdout set (12 subjects) and an existing organ segmentation approach (1000 subjects), generating 71 semantic segmentation types for VIBE images. We provide an additional network that segments 22 individual vertebra types. Results: We achieved an average Dice score of 0.89 ± 0.07 over all 71 segmentation labels. We scored a Dice score > 0.90 on the abdominal organs, except for the pancreas with a Dice of 0.70. Conclusion: Our work offers a detailed and refined publicly available full torso segmentation on VIBE images.
Authors: Ziyu Wang, Chris Holmes
Abstract: Applications of large language models often involve the generation of free-form responses, in which case uncertainty quantification becomes challenging. This is due to the need to identify task-specific uncertainties (e.g., about the semantics), which appear difficult to define in general cases. This work addresses these challenges from a perspective of Bayesian decision theory, starting from the assumption that our utility is characterized by a similarity measure that compares a generated response with a hypothetical true response. We discuss how this assumption enables principled quantification of the model's subjective uncertainty and its calibration. We further derive a measure for epistemic uncertainty, based on a missing data perspective and its characterization as an excess risk. The proposed methods can be applied to black-box language models. We illustrate the methods on question answering and machine translation tasks. Our experiments provide a principled evaluation of task-specific calibration, and demonstrate that epistemic uncertainty offers a promising deferral strategy for efficient data acquisition in in-context learning.
Authors: Rohit Jena, Pratik Chaudhari, James C. Gee
Abstract: Deep Learning in Image Registration (DLIR) methods have been tremendously successful in image registration due to their speed and ability to incorporate weak label supervision at training time. However, DLIR methods forego many of the benefits of classical optimization-based methods. The functional nature of deep networks does not guarantee that the predicted transformation is a local minimum of the registration objective, the representation of the transformation (displacement/velocity field/affine) is fixed, and the networks are not robust to domain shift. Our method aims to bridge this gap between classical and learning methods by incorporating optimization as a layer in a deep network. A deep network is trained to predict multi-scale dense feature images that are registered using a black box iterative optimization solver. This optimal warp is then used to minimize image and label alignment errors. By implicitly differentiating end-to-end through an iterative optimization solver, our learned features are registration and label-aware, and the warp functions are guaranteed to be local minima of the registration objective in the feature space. Our framework shows excellent performance on in-domain datasets, and is agnostic to domain shift such as anisotropy and varying intensity profiles. For the first time, our method allows switching between arbitrary transformation representations (free-form to diffeomorphic) at test time with zero retraining. End-to-end feature learning also facilitates interpretability of features, and out-of-the-box promptability using additional label-fidelity terms at inference.
Authors: Yuyan Liu, Sirui Ding, Sheng Zhou, Wenqi Fan, Qiaoyu Tan
Abstract: Molecular property prediction (MPP) is a fundamental and crucial task in drug discovery. However, prior methods are limited by the requirement for a large number of labeled molecules and their restricted ability to generalize for unseen and new tasks, both of which are essential for real-world applications. To address these challenges, we present MolecularGPT for few-shot MPP. From a perspective on instruction tuning, we fine-tune large language models (LLMs) based on curated molecular instructions spanning over 1000 property prediction tasks. This enables building a versatile and specialized LLM that can be adapted to novel MPP tasks without any fine-tuning through zero- and few-shot in-context learning (ICL). MolecularGPT exhibits competitive in-context reasoning capabilities across 10 downstream evaluation datasets, setting new benchmarks for few-shot molecular prediction tasks. More importantly, with just two-shot examples, MolecularGPT can outperform standard supervised graph neural network methods on 4 out of 7 datasets. It also outperforms state-of-the-art LLM baselines in the zero-shot setting, with up to a 15.7% increase in classification accuracy and a decrease of 17.9 on regression metrics (e.g., RMSE). This study demonstrates the potential of LLMs as effective few-shot molecular property predictors. The code is available at https://github.com/NYUSHCS/MolecularGPT.
Authors: Jirui Qi, Gabriele Sarti, Raquel Fernández, Arianna Bisazza
Abstract: Ensuring the verifiability of model answers is a fundamental challenge for retrieval-augmented generation (RAG) in the question answering (QA) domain. Recently, self-citation prompting was proposed to make large language models (LLMs) generate citations to supporting documents along with their answers. However, self-citing LLMs often struggle to match the required format, refer to non-existent sources, and fail to faithfully reflect LLMs' context usage throughout the generation. In this work, we present MIRAGE (Model Internals-based RAG Explanations), a plug-and-play approach using model internals for faithful answer attribution in RAG applications. MIRAGE detects context-sensitive answer tokens and pairs them with retrieved documents contributing to their prediction via saliency methods. We evaluate our proposed approach on a multilingual extractive QA dataset, finding high agreement with human answer attribution. On open-ended QA, MIRAGE achieves citation quality and efficiency comparable to self-citation while also allowing for a finer-grained control of attribution parameters. Our qualitative evaluation highlights the faithfulness of MIRAGE's attributions and underscores the promising application of model internals for RAG answer attribution.
Authors: Zhuoxiao Chen, Junjie Meng, Mahsa Baktashmotlagh, Yonggang Zhang, Zi Huang, Yadan Luo
Abstract: LiDAR-based 3D object detection is crucial for various applications but often experiences performance degradation in real-world deployments due to domain shifts. While most studies focus on cross-dataset shifts, such as changes in environments and object geometries, practical corruptions from sensor variations and weather conditions remain underexplored. In this work, we propose a novel online test-time adaptation framework for 3D detectors that effectively tackles these shifts, including a challenging cross-corruption scenario where cross-dataset shifts and corruptions co-occur. By leveraging long-term knowledge from previous test batches, our approach mitigates catastrophic forgetting and adapts effectively to diverse shifts. Specifically, we propose a Model Synergy (MOS) strategy that dynamically selects historical checkpoints with diverse knowledge and assembles them to best accommodate the current test batch. This assembly is directed by our proposed Synergy Weights (SW), which perform a weighted averaging of the selected checkpoints, minimizing redundancy in the composite model. The SWs are computed by evaluating the similarity of predicted bounding boxes on the test data and the independence of features between checkpoint pairs in the model bank. To maintain an efficient and informative model bank, we discard checkpoints with the lowest average SW scores, replacing them with newly updated models. Our method was rigorously tested against existing test-time adaptation strategies across three datasets and eight types of corruptions, demonstrating superior adaptability to dynamic scenes and conditions. Notably, it achieved a 67.3% improvement in a challenging cross-corruption scenario, offering a more comprehensive benchmark for adaptation. The source code will be made publicly available.
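The checkpoint-assembly step described in the abstract above (weighted averaging of selected checkpoints, then discarding the weakest one from the model bank) can be sketched as follows. This is only an illustration: the weights are taken as given rather than computed from box similarity and feature independence, checkpoints are plain dicts of float lists, and the names `merge_checkpoints` and `drop_weakest` as well as the weight normalisation are assumptions of this example.

```python
def merge_checkpoints(checkpoints, weights):
    """Weighted average of parameter dicts (name -> list of floats).

    The weights are normalised to sum to one (an assumption of this sketch).
    """
    if len(checkpoints) != len(weights) or not checkpoints:
        raise ValueError("need one weight per checkpoint and at least one checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    merged = {}
    for name, values in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][j] for ckpt, w in zip(checkpoints, weights)) / total
            for j in range(len(values))
        ]
    return merged


def drop_weakest(bank, avg_scores):
    """Return the bank without the checkpoint that has the lowest average score."""
    worst = min(range(len(bank)), key=lambda i: avg_scores[i])
    return [ckpt for i, ckpt in enumerate(bank) if i != worst]


if __name__ == "__main__":
    a = {"w": [0.0, 2.0]}
    b = {"w": [4.0, 6.0]}
    print(merge_checkpoints([a, b], [1.0, 3.0]))
    print(drop_weakest(["ckpt_a", "ckpt_b", "ckpt_c"], [0.5, 0.1, 0.9]))
```

The freed slot in the bank would then be filled with the newly updated model, keeping the bank size fixed.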
Authors: Hao Liu, Junze Tony Ye, Jose Blanchet, Nian Si
Abstract: We introduce ScoreFusion, a theoretically grounded method for fusing multiple pre-trained diffusion models that are assumed to generate from auxiliary populations. ScoreFusion is particularly useful for enhancing the generative modeling of a target population with limited observed data. Our starting point considers the family of KL barycenters of the auxiliary populations, which is proven to be an optimal parametric class in the KL sense, but difficult to learn. Nevertheless, by recasting the learning problem as score matching in denoising diffusion, we obtain a tractable way of computing the optimal KL barycenter weights. We prove a dimension-free sample complexity bound in total variation distance, provided that the auxiliary models are well fitted for their own task and the auxiliary tasks combined capture the target well. We also explain a connection of the practice of checkpoint merging in AI art creation to an approximation of our KL-barycenter-based fusion approach. However, our fusion method differs in key aspects, allowing generation of new populations, as we illustrate in experiments.
Authors: Nan Xu, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
Abstract: Motivated by the in-context learning (ICL) capabilities of large language models (LLMs), multimodal LLMs with an additional visual modality have also been shown to exhibit similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively little work has been done to investigate the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Guided by task-specific modality impact, we recommend modality-driven demonstration strategies to boost ICL performance. We also find that models may follow inductive biases from multimodal ICL even if they are rarely seen in or contradict semantic priors from pretraining data. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks.
Authors: Juraj Bodik, Olivier C. Pasche
Abstract: We introduce a rigorous mathematical framework for Granger causality in extremes, designed to identify causal links from extreme events in time series. Granger causality plays a pivotal role in uncovering directional relationships among time-varying variables. While this notion gains heightened importance during extreme and highly volatile periods, state-of-the-art methods primarily focus on causality within the body of the distribution, often overlooking causal mechanisms that manifest only during extreme events. Our framework is designed to infer causality mainly from extreme events by leveraging the causal tail coefficient. We establish equivalences between causality in extremes and other causal concepts, including (classical) Granger causality, Sims causality, and structural causality. We prove other key properties of Granger causality in extremes and show that the framework is especially helpful under the presence of hidden confounders. We also propose a novel inference method for detecting the presence of Granger causality in extremes from data. Our method is model-free, can handle non-linear and high-dimensional time series, outperforms current state-of-the-art methods in all considered setups, both in performance and speed, and was found to uncover coherent effects when applied to financial and extreme weather observations.
Authors: Andrea Eirale, Matteo Leonetti, Marcello Chiaberge
Abstract: Achieving social acceptance is one of the main goals of Social Robotic Navigation. Although this topic has received increasing interest in recent years, most of the research has focused on driving the robotic agent along obstacle-free trajectories, planning around estimates of future human motion to respect personal distances and optimize navigation. However, social interactions in everyday life are also dictated by norms that do not strictly depend on movement, such as when standing at the end of a queue rather than cutting it. In this paper, we propose a novel method to recognize common social scenarios and modify a traditional planner's cost function to adapt to them. This solution enables the robot to carry out different social navigation behaviors that would not arise otherwise, maintaining the robustness of traditional navigation. Our approach allows the robot to learn different social norms with a single learned model, rather than having different modules for each task. As a proof of concept, we consider the tasks of queuing and respecting the interaction spaces of groups of people talking to one another, but the method can be extended to other human activities that do not involve motion.
Authors: Mladen Popovi\'c, Maruf A. Dhali, Lambert Schomaker, Johannes van der Plicht, Kaare Lund Rasmussen, Jacopo La Nasa, Ilaria Degano, Maria Perla Colombini, Eibert Tigchelaar
Abstract: Determining the chronology of ancient handwritten manuscripts is essential for reconstructing the evolution of ideas. For the Dead Sea Scrolls, this is particularly important. However, there is an almost complete lack of date-bearing manuscripts evenly distributed across the timeline and written in similar scripts available for palaeographic comparison. Here, we present Enoch, a state-of-the-art AI-based date-prediction model, trained on the basis of new radiocarbon-dated samples of the scrolls. Enoch uses established handwriting-style descriptors and applies Bayesian ridge regression. The challenge of this study is that the number of radiocarbon-dated manuscripts is small, while current machine learning requires an abundance of training data. We show that by using combined angular and allographic writing style feature vectors and applying Bayesian ridge regression, Enoch could predict the radiocarbon-based dates from style, supported by leave-one-out validation, with varied MAEs of 27.9 to 30.7 years relative to the radiocarbon dating. Enoch was then used to estimate the dates of 135 unseen manuscripts, revealing that 79 per cent of the samples were considered 'realistic' upon palaeographic post-hoc evaluation. We present a new chronology of the scrolls. The radiocarbon ranges and Enoch's style-based predictions are often older than the traditionally assumed palaeographic estimates. In the range of 300-50 BCE, Enoch's date prediction provides an improved granularity. The study is in line with current developments in multimodal machine-learning techniques, and the methods can be used for date prediction in other partially-dated manuscript collections. This research shows how Enoch's quantitative, probability-based approach can be a tool for palaeographers and historians, re-dating ancient Jewish key texts and contributing to current debates on Jewish and Christian origins.
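The leave-one-out evaluation mentioned in the abstract above can be illustrated with a minimal sketch. This is not the Enoch pipeline: it assumes a single scalar style feature instead of the combined angular and allographic feature vectors, and it uses a ridge fit with a fixed penalty `lam` (the posterior mean of a Bayesian linear model with fixed Gaussian prior and noise precisions) rather than a Bayesian ridge regression that learns those precisions. The function names and the toy data are hypothetical.

```python
from statistics import mean

def fit_ridge_1d(xs, ys, lam):
    """Ridge fit of y ~ w * x + b for one feature (intercept not penalised)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w = sxy / (sxx + lam)
    return w, y_bar - w * x_bar

def loo_mae(xs, ys, lam=1.0):
    """Mean absolute error when each sample is predicted by a model
    trained on all the other samples."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        w, b = fit_ridge_1d(train_x, train_y, lam)
        errors.append(abs(w * xs[i] + b - ys[i]))
    return mean(errors)

# Made-up data: one style score per manuscript and a made-up date for each.
style_scores = [0.0, 1.0, 2.0, 3.0, 4.0]
dates = [2.0 * x + 1.0 for x in style_scores]
print(loo_mae(style_scores, dates, lam=1.0))
```

Leave-one-out is a natural choice when, as the abstract notes, only a small number of labelled samples is available, because every sample is used for both training and testing.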
Authors: Aras Selvi, Eleonora Kreacic, Mohsen Ghassemi, Vamsi Potluru, Tucker Balch, Manuela Veloso
Abstract: Adversarially robust optimization (ARO) has become the de facto standard for training models to defend against adversarial attacks during testing. However, despite their robustness, these models often suffer from severe overfitting. To mitigate this issue, several successful approaches have been proposed, including replacing the empirical distribution in training with: (i) a worst-case distribution within an ambiguity set, leading to a distributionally robust (DR) counterpart of ARO; or (ii) a mixture of the empirical distribution with one derived from an auxiliary dataset (e.g., synthetic, external, or out-of-domain). Building on the first approach, we explore the Wasserstein DR counterpart of ARO for logistic regression and show it admits a tractable convex optimization reformulation. Adopting the second approach, we enhance the DR framework by intersecting its ambiguity set with one constructed from an auxiliary dataset, which yields significant improvements when the Wasserstein distance between the data-generating and auxiliary distributions can be estimated. We analyze the resulting optimization problem, develop efficient solutions, and show that our method outperforms benchmark approaches on standard datasets.
Authors: Shuze Chen, David Simchi-Levi, Chonghuan Wang
Abstract: Using randomized experiments to evaluate the effect of short-term treatments on short-term outcomes is well understood and has become the gold standard in industrial practice. However, as service systems become increasingly dynamical and personalized, much focus is shifting toward maximizing long-term cumulative outcomes, such as customer lifetime value, through lifetime exposure to interventions. To bridge this gap, we investigate randomized experiments within dynamical systems modeled as Markov Decision Processes (MDPs). Our goal is to assess the impact of treatment and control policies on long-term cumulative rewards from relatively short-term observations. We first develop optimal inference techniques for assessing the effects of general treatment patterns. Furthermore, recognizing that many real-world treatments tend to be fine-grained and localized for practical efficiency and operational convenience, we then propose methods to harness this localized structure by sharing information on the non-targeted states. Our new estimator effectively overcomes the variance lower bound for general treatments while matching the more stringent lower bound incorporating the local treatment structure. Furthermore, our estimator can optimally achieve a linear reduction with the number of test arms for a major part of the variance. Finally, we explore scenarios with perfect knowledge of the control arm and design estimators that further improve inference efficiency.
Authors: Neil Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin
Abstract: Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy in the training process. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near zero, test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is not specific to neural architectures nor gradient descent-based optimization methods. Furthermore, our work provides more evidence for AGOP as a key mechanism for feature learning in neural networks.
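The abstract above names the Fourier Multiplication Algorithm without spelling it out. The sketch below follows the description commonly given in prior interpretability work for the special case of modular addition. It is an illustration of that idea only, not of what RFM or any trained network literally computes, and the function name is hypothetical. Each frequency contributes cos(w(a+b-c)), built from products of sines and cosines of the inputs, and the sum over frequencies peaks exactly at c = (a + b) mod p.

```python
import math

def fourier_mod_add(a, b, p):
    """Return (a + b) mod p by scoring every candidate c with a sum of cosines."""
    logits = []
    for c in range(p):
        total = 0.0
        for k in range(1, p):
            w = 2 * math.pi * k / p
            # Angle-addition identities: products of per-input features.
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # cos(w * (a + b - c)): equals 1 for every k when c == (a + b) mod p.
            total += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        logits.append(total)
    return max(range(p), key=logits.__getitem__)

print(fourier_mod_add(5, 4, 7))
```

For the correct residue the score is p - 1, while every other candidate scores -1, so the argmax is unambiguous.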
Authors: Yunyi Shen, Renato Berlinghieri, Tamara Broderick
Abstract: Practitioners often aim to infer an unobserved population trajectory using sample snapshots at multiple time points. E.g., given single-cell sequencing data, scientists would like to learn how gene expression changes over a cell's life cycle. But sequencing any cell destroys that cell. So we can access data for any particular cell only at a single time point, but we have data across many cells. The deep learning community has recently explored using Schrödinger bridges (SBs) and their extensions in similar settings. However, existing methods either (1) interpolate between just two time points or (2) require a single fixed reference dynamic (often set to Brownian motion within SBs). But learning piecewise from adjacent time points can fail to capture long-term dependencies. And practitioners are typically able to specify a model family for the reference dynamic but not the exact values of the parameters within it. So we propose a new method that (1) learns the unobserved trajectories from sample snapshots across multiple time points and (2) requires specification only of a family of reference dynamics, not a single fixed one. We demonstrate the advantages of our method on simulated and real data.
Authors: Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
Abstract: In this work, we improved the analysis of the running time of SparseGPT [Frantar, Alistarh ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ for any $a \in [0, 1]$, where $\omega$ is the exponent of matrix multiplication. In particular, for the current $\omega \approx 2.371$ [Alman, Duan, Williams, Xu, Xu, Zhou 2024], our running time boils down to $O(d^{2.53})$. This running time is due to the analysis of the lazy update behavior in iterative maintenance problems such as [Deng, Song, Weinstein 2022; Brand, Song, Zhou ICML 2024].
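As a sanity check on the bound in the abstract above, the two endpoints of $a$ can be worked out using only standard facts about rectangular matrix multiplication, namely $\omega(1,1,0) = 2$ (a matrix-vector product) and $\omega(1,1,1) = \omega$. This is a sketch of why an intermediate $a$ is needed; it does not derive the tuned value behind the $O(d^{2.53})$ figure.

```latex
\begin{align*}
a = 0:\quad & d^{\omega} + d^{2+o(1)} + d^{1+\omega(1,1,0)}
            = d^{\omega} + d^{2+o(1)} + d^{3} = O(d^{3}),\\
a = 1:\quad & d^{\omega} + d^{3+o(1)} + d^{1+\omega(1,1,1)-1}
            = d^{\omega} + d^{3+o(1)} + d^{\omega} = O(d^{3+o(1)}).
\end{align*}
```

Both endpoints fall back to roughly cubic time. The second term grows with $a$ while the third shrinks, so a subcubic bound comes from choosing $a$ strictly between 0 and 1 where the two roughly balance, that is, where $2 + a \approx 1 + \omega(1,1,a) - a$.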
Authors: Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke
Abstract: Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
Authors: Tonmoy Roy, Md Robiul Islam, Asif Ahammad Miazee, Anika Antara, Al Amin, Sunjim Hossain
Abstract: Over the years, the number of users of social media has increased drastically. People frequently share their thoughts through social platforms, and this leads to an increase in hate content. In this virtual community, individuals share their views, express their feelings, and post photos, videos, blogs, and more. Social networking sites like Facebook and Twitter provide platforms to share vast amounts of content with a single click. However, these platforms do not impose restrictions on the uploaded content, which may include abusive language and explicit images unsuitable for social media. To resolve this issue, a new approach is needed to separate out inappropriate content. Numerous studies have been done to automate the process. In this paper, we propose a new Bi-GRU-CNN model to classify whether the text is offensive or not. The combination of the Bi-GRU and CNN models outperforms the existing model.
Authors: Dylan Zhang, Justin Wang, Francois Charton
Abstract: Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we rigorously examine the key factors that enable models to generalize to unseen instructions, providing insights to guide the collection of data for instruction-tuning. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization only emerges when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model's adaptability. We further extend our analysis to real-world scenarios, including fine-tuning of specialist and generalist models. In both cases, we demonstrate that 1) better performance can be achieved by increasing the diversity of an established dataset while keeping the data size constant, and 2) when scaling up the data, diversifying the semantics of instructions is more effective than simply increasing the quantity of similar data. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.
Authors: Zhuoran Zhang, Yongxiang Li, Zijian Kan, Keyuan Cheng, Lijie Hu, Di Wang
Abstract: The locate-then-edit paradigm has shown significant promise for knowledge editing (KE) in Large Language Models (LLMs). While previous methods perform well on single-hop fact recall tasks, they consistently struggle with multi-hop factual recall tasks involving newly edited knowledge. In this paper, leveraging tools in mechanistic interpretability, we first identify that in multi-hop tasks, LLMs tend to retrieve implicit subject knowledge from deeper MLP layers, unlike single-hop tasks, which rely on earlier layers. This distinction explains the poor performance of current methods in multi-hop queries, as they primarily focus on editing shallow layers, leaving deeper layers unchanged. To address this, we propose IFMET, a novel locate-then-edit KE approach designed to edit both shallow and deep MLP layers. IFMET employs multi-hop editing prompts and supplementary sets to locate and modify knowledge across different reasoning stages. Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, effectively overcoming the limitations of previous locate-then-edit methods.
Authors: Friedrich Wolf-Monheim
Abstract: Convolutional neural networks (CNNs) are widely used in computer vision. They can be used not only for conventional digital image material to recognize patterns, but also for feature extraction from digital imagery representing spectral and rhythm features extracted from time-domain digital audio signals for the acoustic classification of sounds. Different spectral and rhythm feature representations like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCCs), cyclic tempograms, short-time Fourier transform (STFT) chromagrams, constant-Q transform (CQT) chromagrams and chroma energy normalized statistics (CENS) chromagrams are investigated in terms of the audio classification performance using a deep convolutional neural network. It can be clearly shown that the mel-scaled spectrograms and the mel-frequency cepstral coefficients (MFCCs) perform significantly better than the other spectral and rhythm features investigated in this research for audio classification tasks using deep CNNs. The experiments were carried out with the aid of the ESC-50 dataset with 2,000 labeled environmental audio recordings.
Authors: Peifan Jiang, Xuben Wang, Shuang Wang, Fei Deng, Kunpeng Wang, Bin Wang, Yuhan Yang, Islam Fadel
Abstract: Magnetotelluric deep learning (DL) inversion methods based on joint data-driven and physics-driven approaches have become a hot topic in recent years. When mapping observation data (or forward modeling data) to the resistivity model using neural networks (NNs), incorporating the error (loss) term of the inversion resistivity's forward modeling response, which introduces physical information about electromagnetic field propagation, can significantly enhance the inversion accuracy. To efficiently achieve data-physical dual-driven MT deep learning inversion for large-scale 3-D MT data, we propose using DL forward modeling networks to compute this portion of the loss. This approach introduces pseudo-physical information through the forward modeling of NN simulation, further guiding the inversion network fitting. Specifically, we first pre-train the forward modeling networks as fixed forward modeling operators, then transfer and integrate them into the inversion network training, and finally optimize the inversion network by minimizing the multinomial loss. Theoretical experimental results indicate that despite some simulation errors in DL forward modeling, the introduced pseudo-physical information still enhances inversion accuracy and significantly mitigates the overfitting problem during training. Additionally, we propose a new input mode that involves masking and adding noise to the data, simulating the field data environment of 3-D MT inversion, thereby making the method more flexible and effective for practical applications.
Authors: Yiwei Zhang, Rouzbeh Behnia, Attila A. Yavuz, Reza Ebrahimi, Elisa Bertino
Abstract: Federated learning enables the collaborative learning of a global model on diverse data, preserving data locality and eliminating the need to transfer user data to a central server. However, data privacy remains vulnerable, as attacks can target user training data by exploiting the updates sent by users during each learning iteration. Secure aggregation protocols are designed to mask/encrypt user updates and enable a central server to aggregate the masked information. MicroSecAgg (PoPETS 2024) proposes a single server secure aggregation protocol that aims to mitigate the high communication complexity of the existing approaches by enabling a one-time setup of the secret to be re-used in multiple training iterations. In this paper, we identify a security flaw in the MicroSecAgg that undermines its privacy guarantees. We detail the security flaw and our attack, demonstrating how an adversary can exploit predictable masking values to compromise user privacy. Our findings highlight the critical need for enhanced security measures in secure aggregation protocols, particularly the implementation of dynamic and unpredictable masking strategies. We propose potential countermeasures to mitigate these vulnerabilities and ensure robust privacy protection in the secure aggregation frameworks.
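To make the risk of predictable masks concrete, here is a generic sketch of pairwise additive masking. It is not the MicroSecAgg protocol and not the attack from the paper, whose details the abstract above does not give. It only shows, under simplified assumptions (two users, integer updates, one shared mask, hypothetical names throughout), why a mask that is reused across rounds lets the server learn how a user's update changed.

```python
import secrets

MOD = 2 ** 32  # all arithmetic is done modulo a fixed group size

def add_vec(xs, ys):
    return [(x + y) % MOD for x, y in zip(xs, ys)]

def sub_vec(xs, ys):
    return [(x - y) % MOD for x, y in zip(xs, ys)]

def random_mask(n):
    return [secrets.randbelow(MOD) for _ in range(n)]

# Round 1: Alice adds the shared mask, Bob subtracts it.
mask = random_mask(3)
alice_r1, bob_r1 = [5, 1, 7], [2, 2, 2]
masked_alice_r1 = add_vec(alice_r1, mask)
masked_bob_r1 = sub_vec(bob_r1, mask)
# The server sums the masked vectors; the masks cancel in the aggregate.
aggregate_r1 = add_vec(masked_alice_r1, masked_bob_r1)

# Round 2: the SAME mask is reused (the unsafe assumption being illustrated).
alice_r2 = [6, 1, 9]
masked_alice_r2 = add_vec(alice_r2, mask)
# Subtracting Alice's two masked messages removes the mask entirely.
leaked_delta = sub_vec(masked_alice_r2, masked_alice_r1)

print(aggregate_r1)   # sum of the true round-1 updates
print(leaked_delta)   # Alice's round-2 update minus her round-1 update
```

Drawing a fresh, unpredictable mask in every round removes this particular leak, which is in line with the abstract's call for dynamic and unpredictable masking strategies.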
Authors: Xinyuan Wang, Victor Shea-Jay Huang, Renmiao Chen, Hao Wang, Chengwei Pan, Lei Sha, Minlie Huang
Abstract: While large language models (LLMs) exhibit remarkable capabilities across various tasks, they encounter potential security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on single objectives can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, an innovative black-box attack framework with multi-objective optimization, aiming to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. BlackDAN leverages Multiobjective Evolutionary Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks across multiple objectives including ASR, stealthiness, and semantic relevance. By integrating mechanisms like mutation, crossover, and Pareto-dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring jailbreak responses are both relevant and less detectable.
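The Pareto-dominance step that NSGA-II relies on can be sketched in a few lines. This is a generic illustration of non-dominated filtering, not BlackDAN's implementation. It assumes every objective is scored so that higher is better, and the candidate score tuples are made up.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximised)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Indices of candidates that no other candidate dominates."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]

# Hypothetical (objective_1, objective_2, objective_3) scores for four candidates.
candidates = [
    (0.9, 0.2, 0.5),
    (0.6, 0.8, 0.7),
    (0.5, 0.7, 0.6),  # dominated by the second candidate
    (0.9, 0.2, 0.4),  # dominated by the first candidate
]
print(pareto_front(candidates))  # → [0, 1]
```

NSGA-II repeats this filtering to rank the whole population into successive fronts and then breaks ties within a front by crowding distance. Only the first front is computed here.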
Authors: Abhijit Manatkar, Ashlesha Akella, Parthivi Gupta, Krishnasuri Narayanam
Abstract: Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. Automated Data Exploration (ADE) systems use goal-oriented methods with Large Language Models and Reinforcement Learning towards full automation. However, these methods require human involvement to anticipate goals that may limit insight extraction, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions in iterations, refining them from previous iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets.
Authors: Zhengyan Shi, Sander Land, Acyr Locatelli, Matthieu Geist, Max Bartolo
Abstract: Direct Alignment Algorithms (DAAs), such as Direct Preference Optimisation (DPO) and Identity Preference Optimisation (IPO), have emerged as alternatives to online Reinforcement Learning from Human Feedback (RLHF) algorithms such as Proximal Policy Optimisation (PPO) for aligning language models to human preferences, without the need for explicit reward modelling. These methods generally aim to increase the likelihood of generating better (preferred) completions while discouraging worse (non-preferred) ones, while staying close to the original model's behaviour. In this work, we explore the relationship between completion likelihood and model performance in state-of-the-art DAAs, and identify a critical issue of likelihood over-optimisation. Contrary to expectations, we find that higher likelihood of better completions and larger margins between better and worse completion likelihoods do not necessarily lead to better performance, and may even degrade it. Our analysis reveals that while higher likelihood correlates with better memorisation of factual knowledge patterns, a slightly lower completion likelihood tends to improve output diversity, thus leading to better generalisation to unseen scenarios. Moreover, we identify two key indicators that signal when over-optimised output diversity begins to harm performance: Decreasing Entropy over Top-k Tokens and Diminishing Top-k Probability Mass. Our experimental results validate that these indicators are reliable signs of declining performance under different regularisations, helping prevent over-optimisation and improve alignment with human preferences.
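The two indicators named in the abstract above can be computed from a single next-token distribution. The sketch below is one plausible reading, not the paper's exact definition: it assumes the entropy is taken over the top-k probabilities after renormalising them to sum to one, and the function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def top_k_indicators(logits, k):
    """Return (entropy over the renormalised top-k tokens, top-k probability mass)."""
    top = sorted(softmax(logits), reverse=True)[:k]
    mass = sum(top)
    renorm = [p / mass for p in top]
    entropy = -sum(p * math.log(p) for p in renorm if p > 0)
    return entropy, mass

flat = top_k_indicators([0.0, 0.0, 0.0, 0.0], k=2)
peaked = top_k_indicators([5.0, 0.0, 0.0, 0.0], k=2)
print(flat, peaked)
```

In practice these values would be averaged over positions and prompts and tracked across checkpoints, so that a sustained drop in either one can be flagged during training.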
Authors: Alison N. O'Connor, Stephen E. Ryan, Gauri Vaidya, Paul Harford, Meghana Kshirsagar
Abstract: Increased healthcare demand is significantly straining European services. Digital solutions including advanced modelling techniques offer a promising solution to optimising patient flow without impacting day-to-day healthcare provision. In this work we outline an ongoing project that aims to optimise healthcare resources using agent-based simulations.
Authors: Jiayi Guo, Zan Chen, Yingrui Ji, Liyun Zhang, Daqin Luo, Zhigang Li, Yiqin Shen
Abstract: Automated Machine Learning (AutoML) has simplified complex ML processes such as data pre-processing, model selection, and hyper-parameter searching. However, traditional AutoML frameworks focus solely on discriminative tasks, often falling short in tackling AutoML for generative models. Additionally, these frameworks lack interpretability and user engagement during the training process, primarily due to the absence of human-centered design. This leads to a lack of transparency in final decision-making and limited user control, potentially reducing trust and adoption of AutoML methods. To address these limitations, we introduce UniAutoML, a human-centered AutoML framework that leverages Large Language Models (LLMs) to unify AutoML for both discriminative (e.g., Transformers and CNNs for classification or regression tasks) and generative tasks (e.g., fine-tuning diffusion models or LLMs). The human-centered design of UniAutoML innovatively features a conversational user interface (CUI) that facilitates natural language interactions, providing users with real-time guidance, feedback, and progress updates for better interpretability. This design enhances transparency and user control throughout the AutoML training process, allowing users to seamlessly break down or modify the model being trained. To mitigate potential risks associated with LLM-generated content, UniAutoML incorporates a safety guardline that filters inputs and censors outputs. We evaluated UniAutoML's performance and usability through experiments on eight diverse datasets and user studies involving 25 participants, demonstrating that UniAutoML not only enhances performance but also improves user control and trust. Our human-centered design bridges the gap between AutoML capabilities and user understanding, making ML more accessible to a broader audience.
Authors: Thomas Uriot
Abstract: Current advances in Natural Language Processing (NLP) have made it increasingly feasible to build applications leveraging textual data. Generally, the core of these applications relies on having a good semantic representation of text into vectors, via embedding models. However, it has been shown that these embeddings capture and perpetuate biases already present in text. While a few techniques have been proposed to debias embeddings, they do not take advantage of the recent advances in context understanding of modern embedding models. In this paper, we fill this gap by conducting a review of 19 embedding models by quantifying their biases and how well they respond to context injection as a means of debiasing. We show that higher performing models are more prone to capturing biases, but are also better at incorporating context. Surprisingly, we find that while models can easily embed affirmative semantics, they fail at embedding neutral semantics. Finally, in a retrieval task, we show that biases in embeddings can lead to non-desirable outcomes. We use our new-found insights to design a simple algorithm for top $k$ retrieval, where $k$ is dynamically selected. We show that our algorithm is able to retrieve all relevant gendered and neutral chunks.
Authors: Jameson Merkow, Felix J. Dorfner, Xiyu Yang, Alexander Ersoy, Giridhar Dasegowda, Mannudeep Kalra, Matthew P. Lungren, Christopher P. Bridge, Ivan Tarapov
Abstract: The integration of artificial intelligence (AI) into medical imaging has advanced clinical diagnostics but poses challenges in managing model drift and ensuring long-term reliability. To address these challenges, we develop MMC+, an enhanced framework for scalable drift monitoring, building upon the CheXstray framework that introduced real-time drift detection for medical imaging AI models using multi-modal data concordance. This work extends the original framework's methodologies, providing a more scalable and adaptable solution for real-world healthcare settings and offering a reliable and cost-effective alternative to continuous performance monitoring, addressing limitations of both continuous and periodic monitoring methods. MMC+ introduces critical improvements to the original framework, including more robust handling of diverse data streams, improved scalability with the integration of foundation models like MedImageInsight for high-dimensional image embeddings without site-specific training, and the introduction of uncertainty bounds to better capture drift in dynamic clinical environments. Validated with real-world data from Massachusetts General Hospital during the COVID-19 pandemic, MMC+ effectively detects significant data shifts and correlates them with model performance changes. While not directly predicting performance degradation, MMC+ serves as an early warning system, indicating when AI systems may deviate from acceptable performance bounds and enabling timely interventions. By emphasizing the importance of monitoring diverse data streams and evaluating data shifts alongside model performance, this work contributes to the broader adoption and integration of AI solutions in clinical settings.
Authors: Tangwen Qian, Junhe Li, Yile Chen, Gao Cong, Tao Sun, Fei Wang, Yongjun Xu
Abstract: Modeling trajectory data with generic-purpose dense representations has become a prevalent paradigm for various downstream applications, such as trajectory classification, travel time estimation and similarity computation. However, existing methods typically rely on trajectories from a single spatial view, limiting their ability to capture the rich contextual information that is crucial for gaining deeper insights into movement patterns across different geospatial contexts. To this end, we propose MVTraj, a novel multi-view modeling method for trajectory representation learning. MVTraj integrates diverse contextual knowledge, from GPS to road network and points-of-interest to provide a more comprehensive understanding of trajectory data. To align the learning process across multiple views, we utilize GPS trajectories as a bridge and employ self-supervised pretext tasks to capture and distinguish movement patterns across different spatial views. Following this, we treat trajectories from different views as distinct modalities and apply a hierarchical cross-modal interaction module to fuse the representations, thereby enriching the knowledge derived from multiple sources. Extensive experiments on real-world datasets demonstrate that MVTraj significantly outperforms existing baselines in tasks associated with various spatial views, validating its effectiveness and practical utility in spatio-temporal modeling.
Authors: Luyao Zou, Yu Min Park, Chu Myaet Thwal, Yan Kyaw Tun, Zhu Han, Choong Seon Hong
Abstract: Low Earth orbit (LEO) satellites are capable of gathering abundant Earth observation data (EOD) to enable different Internet of Things (IoT) applications. However, to accomplish an effective EOD processing mechanism, it is imperative to investigate: 1) the challenge of processing the observed data without transmitting those large-size data to the ground because the connection between the satellites and the ground stations is intermittent, and 2) the challenge of processing the non-independent and identically distributed (non-IID) satellite data. In this paper, to cope with those challenges, we propose an orbit-based spectral clustering-assisted clustered federated self-knowledge distillation (OSC-FSKD) approach for each orbit of an LEO satellite constellation, which retains the advantage of FL that the observed data does not need to be sent to the ground. Specifically, we introduce normalized Laplacian-based spectral clustering (NLSC) into federated learning (FL) to create clustered FL in each round to address the challenge resulting from non-IID data. In particular, NLSC is adopted to dynamically group clients into several clusters based on cosine similarities calculated from model updates. In addition, self-knowledge distillation is utilized to construct each local client, where the most recently updated local model is used to guide current local model training. Experiments demonstrate that the observation accuracy obtained by the proposed method is respectively 1.01x, 2.15x, 1.10x, and 1.03x higher than that of the pFedSD, FedProx, FedAU, and FedALA approaches using the SAT4 dataset. The proposed method also shows superiority when using other datasets.
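The first half of the clustering step described above, going from model updates to a normalized Laplacian, can be sketched as follows. This is a minimal illustration under stated assumptions, not the OSC-FSKD implementation: updates are flattened into plain lists, negative cosine similarities are clipped to zero so that the affinity matrix is non-negative, and all names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalized_laplacian(updates):
    """L = I - D^{-1/2} W D^{-1/2}, where W holds clipped cosine similarities."""
    n = len(updates)
    # Affinity matrix: clipped cosine similarity, no self-loops.
    w = [[0.0 if i == j else max(0.0, cosine(updates[i], updates[j]))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in w]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                lap[i][j] = 1.0 if deg[i] > 0 else 0.0
            elif deg[i] > 0 and deg[j] > 0:
                lap[i][j] = -w[i][j] / math.sqrt(deg[i] * deg[j])
    return lap

# Toy updates from four clients: two point one way, two point another.
updates = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
for row in normalized_laplacian(updates):
    print(row)
```

Spectral clustering would then take the eigenvectors of this matrix for the smallest eigenvalues and run k-means on their rows. That step needs a linear algebra library and is left out of the sketch.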
Authors: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh
Abstract: Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X's model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.