Authors: Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fr\"anken, Chelsea Finn, Alon Albalak
Abstract: Reinforcement Learning from Human Feedback (RLHF) has greatly improved the performance of modern Large Language Models (LLMs). The RLHF process is resource-intensive and technically challenging, generally requiring a large collection of human preference labels over model-generated outputs. Reinforcement Learning from AI Feedback (RLAIF) addresses this data collection challenge by leveraging synthetic preferences generated by an LLM. However, recent work has shown that synthetic preference labels may not align well with human preference judgments. To address this, we propose a hybrid approach that unifies RLHF and RLAIF methodologies. We introduce GenRM, an iterative algorithm that trains an LLM on self-generated reasoning traces, leading to synthetic preference labels matching human preference judgments. Empirically, we show that zero-shot LLM-based judgments under-perform compared to Bradley-Terry reward models on in-distribution tasks (between 9-36%). In contrast, GenRM achieves in-distribution accuracy comparable to Bradley-Terry models, while significantly outperforming them on out-of-distribution tasks (between 10-45%). Moreover, GenRM surpasses the performance of using LLMs as judges on both in-distribution (by 9-31%) and out-of-distribution tasks (by 2-6%). Our results show that combining the strengths of RLHF and RLAIF offers a promising approach for improving the quality of synthetic preference labels.
Authors: Ameet Gadekar, Aristides Gionis, Suhas Thejaswi
Abstract: Data summarization tasks are often modeled as $k$-clustering problems, where the goal is to choose $k$ data points, called cluster centers, that best represent the dataset by minimizing a clustering objective. A popular objective is to minimize the maximum distance between any data point and its nearest center, which is formalized as the $k$-center problem. While in some applications all data points can be chosen as centers, in the general setting, centers must be chosen from a predefined subset of points, referred to as facilities or suppliers; this is known as the $k$-supplier problem. In this work, we focus on fair data summarization modeled as the fair $k$-supplier problem, where data consists of several groups, and a minimum number of centers must be selected from each group while minimizing the $k$-supplier objective. The groups can be disjoint or overlapping, leading to two distinct problem variants each with different computational complexity. We present $3$-approximation algorithms for both variants, improving the previously known factor of $5$. For disjoint groups, our algorithm runs in polynomial time, while for overlapping groups, we present a fixed-parameter tractable algorithm, where the exponential runtime depends only on the number of groups and centers. We show that these approximation factors match the theoretical lower bounds, assuming standard complexity theory conjectures. Finally, using an open-source implementation, we demonstrate the scalability of our algorithms on large synthetic datasets and assess the price of fairness on real-world data, comparing solution quality with and without fairness constraints.
Authors: Arham Khan, Todd Nief, Nathaniel Hudson, Mansi Sakarvadia, Daniel Grzenda, Aswathy Ajith, Jordan Pettyjohn, Kyle Chard, Ian Foster
Abstract: Understanding neural networks is crucial to creating reliable and trustworthy deep learning models. Most contemporary research in interpretability analyzes just one model at a time via causal intervention or activation analysis. Yet despite successes, these methods leave significant gaps in our understanding of the training behaviors of neural networks, how their inner representations emerge, and how we can predictably associate model components with task-specific behaviors. Seeking new insights from work in related fields, we survey the literature on model merging, a field that aims to combine the abilities of various neural networks by merging their parameters and identifying task-specific model components in the process. We analyze the model merging literature through the lens of loss landscape geometry, an approach that enables us to connect observations from empirical studies on interpretability, security, model merging, and loss landscape analysis to phenomena that govern neural network training and the emergence of their inner representations. To systematize knowledge in this area, we present a novel taxonomy of model merging techniques organized by their core algorithmic principles. Additionally, we distill repeated empirical observations from the literature in these fields into characterizations of four major aspects of loss landscape geometry: mode convexity, determinism, directedness, and connectivity. We argue that by improving our understanding of the principles underlying model merging and loss landscape geometry, this work contributes to the goal of ensuring secure and trustworthy machine learning in practice.
Authors: Qidong Yang, Jonathan Giezendanner, Daniel Salles Civitarese, Johannes Jakubik, Eric Schmitt, Anirban Chandra, Jeremy Vila, Detlef Hohl, Chris Hill, Campbell Watson, Sherrie Wang
Abstract: Urgent applications like wildfire management and renewable energy generation require precise, localized weather forecasts near the Earth's surface. However, weather forecast products from machine learning or numerical weather models are currently generated on a global regular grid, on which a naive interpolation cannot accurately reflect fine-grained weather patterns close to the ground. In this work, we train a heterogeneous graph neural network (GNN) end-to-end to downscale gridded forecasts to off-grid locations of interest. This multi-modal GNN takes advantage of local historical weather observations (e.g., wind, temperature) to correct the gridded weather forecast at different lead times towards locally accurate forecasts. Each data modality is modeled as a different type of node in the graph. Using message passing, the node at the prediction location aggregates information from its heterogeneous neighbor nodes. Experiments using weather stations across the Northeastern United States show that our model outperforms a range of data-driven and non-data-driven off-grid forecasting methods. Our approach demonstrates how the gap between global large-scale weather models and locally accurate predictions can be bridged to inform localized decision-making.
Authors: Phillip Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
Abstract: Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability -- which, in part, aims to identify model components (circuits) associated with specific interpretable mechanisms that make up a model capability -- can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the lookup-table mechanism for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt the latent knowledge in the model more than any other baseline, making unlearning more robust to various attacks.
Authors: Aayush Agrawal, Aniruddh Sikdar, Rajini Makam, Suresh Sundaram, Suresh Kumar Besai, Mahesh Gopi
Abstract: Underwater mine detection with deep learning suffers from limitations due to the scarcity of real-world data. This scarcity leads to overfitting, where models perform well on training data but poorly on unseen data. This paper proposes a Syn2Real (Synthetic to Real) domain generalization approach using diffusion models to address this challenge. We demonstrate that synthetic data generated with noise by DDPM and DDIM models, even if not perfectly realistic, can effectively augment real-world samples for training. The residual noise in the final sampled images improves the model's ability to generalize to real-world data with inherent noise and high variation. The baseline Mask-RCNN model, when trained on a combination of synthetic and original training datasets, exhibited approximately a 60% increase in Average Precision (AP) compared to being trained solely on the original training data. This significant improvement highlights the potential of Syn2Real domain generalization for underwater mine detection tasks.
Authors: Ali Borji
Abstract: The study conducted by Shumailov et al. (2024) demonstrates that repeatedly training a generative model on synthetic data leads to model collapse. This finding has generated considerable interest and debate, particularly given that current models have nearly exhausted the available data. In this work, we investigate the effects of fitting a distribution (through Kernel Density Estimation, or KDE) or a model to the data, followed by repeated sampling from it. Our objective is to develop a theoretical understanding of the phenomenon observed by Shumailov et al. (2024). Our results indicate that the outcomes reported are a statistical phenomenon and may be unavoidable.
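As a minimal toy illustration of the fit-then-sample loop analyzed here (an assumption: one-dimensional data with a Gaussian KDE as the fitted model, not the paper's actual experimental setup), one can fit a distribution to samples, resample, and refit repeatedly while tracking how the sample statistics drift across generations:

```python
# Minimal sketch (assumptions: 1-D data, Gaussian KDE as the fitted model).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)   # generation 0: "real" data

for generation in range(1, 11):
    kde = gaussian_kde(data)        # fit a distribution to the current samples
    data = kde.resample(1000)[0]    # draw the next, fully synthetic generation
    # Watch how the sample statistics drift away from the original N(0, 1).
    print(f"gen {generation:2d}: mean={data.mean():+.3f}  std={data.std():.3f}")
```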
Authors: Costin-Andrei Oncescu, Sanket Purandare, Stratos Idreos, Sham Kakade
Abstract: While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L\log^2L)$ time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the architecture. Empirically, we provide a proof-of-concept implementation for Hyena, which achieves up to a $1.6\times$ end-to-end improvement over standard inference by speeding up the position-mixing part by up to $50\times$.
Authors: Jinzhu Luo, Dingyang Chen, Qi Zhang
Abstract: Data augmentation creates new data points by transforming the original ones for a reinforcement learning (RL) agent to learn from, which has been shown to be effective for the objective of improving the data efficiency of RL for continuous control. Prior work towards this objective has been largely restricted to perturbation-based data augmentation where new data points are created by perturbing the original ones, which has been impressively effective for tasks where the RL agent observes control states as images with perturbations including random cropping, shifting, etc. This work focuses on state-based control, where the RL agent can directly observe raw kinematic and task features, and considers an alternative data augmentation applied to these features based on Euclidean symmetries under transformations like rotations. We show that the default state features used in existing benchmark tasks that are based on joint configurations are not amenable to Euclidean transformations. We therefore advocate using state features based on configurations of the limbs (i.e., the rigid bodies connected by the joints) that instead provide rich augmented data under Euclidean transformations. With minimal hyperparameter tuning, we show this new Euclidean data augmentation strategy significantly improves both data efficiency and asymptotic performance of RL on a wide range of continuous control tasks.
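A minimal sketch of this kind of Euclidean augmentation, under the assumption that observations and actions are concatenations of 3-D limb/force vectors (shapes and names below are illustrative, not the benchmark's actual interface):

```python
import numpy as np

def rotate_z(theta: float) -> np.ndarray:
    """3-D rotation matrix about the z (vertical) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def augment(obs: np.ndarray, act: np.ndarray, rng: np.random.Generator):
    """Apply the same random rotation to every 3-D vector block in obs and act.

    Assumes obs and act are concatenations of 3-D quantities (limb positions,
    limb velocities, force vectors, ...), which is what makes them equivariant
    under rotations, unlike joint-angle features.
    """
    R = rotate_z(rng.uniform(0.0, 2.0 * np.pi))
    obs_aug = (obs.reshape(-1, 3) @ R.T).reshape(-1)
    act_aug = (act.reshape(-1, 3) @ R.T).reshape(-1)
    return obs_aug, act_aug

rng = np.random.default_rng(0)
obs = rng.normal(size=12)   # e.g. 4 limb vectors
act = rng.normal(size=6)    # e.g. 2 force vectors
print(augment(obs, act, rng))
```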
Authors: Stefan Jaeger
Abstract: Contemporary machine learning methods will try to approach the Bayes error, as it is the lowest possible error any model can achieve. This paper postulates that any decision is composed of not one but two Bayesian decisions and that decision-making is, therefore, a double-Bayesian process. The paper shows how this duality implies intrinsic uncertainty in decisions and how it incorporates explainability. The proposed approach understands that Bayesian learning is tantamount to finding a base for a logarithmic function measuring uncertainty, with solutions being fixed points. Furthermore, following this approach, the golden ratio describes possible solutions satisfying Bayes' theorem. The double-Bayesian framework suggests using a learning rate and momentum weight with values similar to those used in the literature to train neural networks with stochastic gradient descent.
Authors: Nazanin Fouladgar, Marjan Alirezaie, Kary Fr\"amling
Abstract: Local explanation of machine learning (ML) models has recently received significant attention due to its ability to reduce ambiguities about why the models make specific decisions. Extensive efforts have been invested to address explainability for different data types, particularly images. However, the work on multivariate time series data is limited. A possible reason is that the conflation of time and other variables in time series data can cause the generated explanations to be incomprehensible to humans. In addition, some efforts on time series fall short of providing accurate explanations as they either ignore context in the time domain or impose differentiability requirements on the ML models. Such restrictions impede their ability to provide valid explanations in real-world applications and non-differentiable ML settings. In this paper, we propose a swapping--sliding decision explanation for multivariate time series classifiers, called SSET. The proposal consists of swapping and sliding stages, by which salient sub-sequences causing significant drops in the prediction score are presented as explanations. In the former stage, the important variables are detected by swapping the series of interest with close train data from target classes. In the latter stage, the salient observations of these variables are explored by sliding a window over each time step. Additionally, the model measures the importance of different variables over time in a novel way characterized by multiple factors. We apply SSET to the affect detection domain, where evaluations are performed on two real-world physiological time series datasets, WESAD and MAHNOB-HCI, and a deep convolutional classifier, CN-Waterfall. This classifier has shown superior performance to prior models in detecting human affective states. Comparing SSET with several benchmarks, including LIME, integrated gradients, and Dynamask, we found...
Authors: David Farr, Nico Manzonelli, Iain Cruickshank, Kate Starbird, Jevin West
Abstract: The ability of large language models (LLMs) to perform zero-shot classification makes them viable solutions for data annotation in rapidly evolving domains where quality labeled data is often scarce and costly to obtain. However, the large-scale deployment of LLMs can be prohibitively expensive. This paper introduces an LLM chain ensemble methodology that aligns multiple LLMs in a sequence, routing data subsets to subsequent models based on classification uncertainty. This approach leverages the strengths of individual LLMs within a broader system, allowing each model to handle data points where it exhibits the highest confidence, while forwarding more complex cases to potentially more robust models. Our results show that the chain ensemble method often exceeds the performance of the best individual model in the chain and achieves substantial cost savings, making LLM chain ensembles a practical and efficient solution for large-scale data annotation challenges.
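A minimal sketch of confidence-based routing through a chain of models; the classify interface, the threshold, and the model ordering below are illustrative assumptions, not the paper's exact design:

```python
from typing import Callable, Tuple, List

# Each classifier is assumed to return (label, confidence) for a piece of text.
Classifier = Callable[[str], Tuple[str, float]]

def chain_ensemble(text: str, chain: List[Classifier], threshold: float = 0.9) -> str:
    """Route an example down the chain until some model is confident enough."""
    for model in chain[:-1]:
        label, confidence = model(text)
        if confidence >= threshold:
            return label          # accept a confident early answer (cheaper model)
    return chain[-1](text)[0]     # the final (typically strongest) model always answers
```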
Authors: Arka Daw, Megan Hong-Thanh Chung, Maria Mahbub, Amir Sadovnik
Abstract: Machine learning models are known to be vulnerable to adversarial attacks, but traditional attacks have mostly focused on single modalities. With the rise of large multi-modal models (LMMs) like CLIP, which combine vision and language capabilities, new vulnerabilities have emerged. However, prior work in multimodal targeted attacks aims to completely change the model's output to what the adversary wants. In many realistic scenarios, an adversary might seek to make only subtle modifications to the output, so that the changes go unnoticed by downstream models or even by humans. We introduce Hiding-in-Plain-Sight (HiPS) attacks, a novel class of adversarial attacks that subtly modifies model predictions by selectively concealing target object(s), as if the target object were absent from the scene. We propose two HiPS attack variants, HiPS-cls and HiPS-cap, and demonstrate their effectiveness in transferring to downstream image captioning models, such as CLIP-Cap, for targeted object removal from image captions.
Authors: Idan Attias, Steve Hanneke, Arvind Ramaswami
Abstract: We present novel reductions from sample compression schemes in multiclass classification, regression, and adversarially robust learning settings to binary sample compression schemes. Assuming we have a compression scheme for binary classes of size $f(d_\mathrm{VC})$, where $d_\mathrm{VC}$ is the VC dimension, then we have the following results: (1) If the binary compression scheme is a majority-vote or a stable compression scheme, then there exists a multiclass compression scheme of size $O(f(d_\mathrm{G}))$, where $d_\mathrm{G}$ is the graph dimension. Moreover, for general binary compression schemes, we obtain a compression of size $O(f(d_\mathrm{G})\log|Y|)$, where $Y$ is the label space. (2) If the binary compression scheme is a majority-vote or a stable compression scheme, then there exists an $\epsilon$-approximate compression scheme for regression over $[0,1]$-valued functions of size $O(f(d_\mathrm{P}))$, where $d_\mathrm{P}$ is the pseudo-dimension. For general binary compression schemes, we obtain a compression of size $O(f(d_\mathrm{P})\log(1/\epsilon))$. These results would have significant implications if the sample compression conjecture, which posits that any binary concept class with a finite VC dimension admits a binary compression scheme of size $O(d_\mathrm{VC})$, is resolved (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995; Warmuth, 2003). Our results would then extend the proof of the conjecture immediately to other settings. We establish similar results for adversarially robust learning and also provide an example of a concept class that is robustly learnable but has no bounded-size compression scheme, demonstrating that learnability is not equivalent to having a compression scheme independent of the sample size, unlike in binary classification, where compression of size $2^{O(d_\mathrm{VC})}$ is attainable (Moran and Yehudayoff, 2016).
Authors: Evelyn Ma, Chao Pan, Rasoul Etesami, Han Zhao, Olgica Milenkovic
Abstract: The performance of Transfer Learning (TL) heavily relies on effective pretraining, which demands large datasets and substantial computational resources. As a result, executing TL is often challenging for individual model developers. Federated Learning (FL) addresses these issues by facilitating collaborations among clients, expanding the dataset indirectly, distributing computational costs, and preserving privacy. However, key challenges remain unresolved. First, existing FL methods tend to optimize transferability only within local domains, neglecting the global learning domain. Second, most approaches rely on indirect transferability metrics, which do not accurately reflect the final target loss or true degree of transferability. To address these gaps, we propose two enhancements to FL. First, we introduce a client-server exchange protocol that leverages cross-client Jacobian (gradient) norms to boost transferability. Second, we increase the average Jacobian norm across clients at the server, using this as a local regularizer to reduce cross-client Jacobian variance. Our transferable federated algorithm, termed FedGTST (Federated Global Transferability via Statistics Tuning), demonstrates that increasing the average Jacobian and reducing its variance allows for tighter control of the target loss. This leads to an upper bound on the target loss in terms of the source loss and source-target domain discrepancy. Extensive experiments on datasets such as MNIST to MNIST-M and CIFAR10 to SVHN show that FedGTST outperforms relevant baselines, including FedSR. On the second dataset pair, FedGTST improves accuracy by 9.8% over FedSR and 7.6% over FedIIR when LeNet is used as the backbone.
Authors: Tong Liu, Hadi Meidani
Abstract: Supply chain networks are critical to the operational efficiency of industries, yet their increasing complexity presents significant challenges in mapping relationships and identifying the roles of various entities. Traditional methods for constructing supply chain networks rely heavily on structured datasets and manual data collection, limiting their scope and efficiency. In contrast, recent advancements in Natural Language Processing (NLP) and large language models (LLMs) offer new opportunities for discovering and analyzing supply chain networks using unstructured text data. This paper proposes a novel approach that leverages LLMs to extract and process raw textual information from publicly available sources to construct a comprehensive supply chain graph. We focus on the civil engineering sector as a case study, demonstrating how LLMs can uncover hidden relationships among companies, projects, and other entities. Additionally, we fine-tune an LLM to classify entities within the supply chain graph, providing detailed insights into their roles and relationships. The results show that domain-specific fine-tuning improves classification accuracy, highlighting the potential of LLMs for industry-specific supply chain analysis. Our contributions include the development of a supply chain graph for the civil engineering sector, as well as a fine-tuned LLM model that enhances entity classification and understanding of supply chain networks.
Authors: Moritz Willig, Tim Nelson Tobiasch, Florian Peter Busch, Jonas Seng, Devendra Singh Dhami, Kristian Kersting
Abstract: Most work on causality in machine learning assumes that causal relationships are driven by a constant underlying process. However, the flexibility of agents' actions or tipping points in the environmental process can change the qualitative dynamics of the system. As a result, new causal relationships may emerge, while existing ones change or disappear, resulting in an altered causal graph. To analyze these qualitative changes on the causal graph, we propose the concept of meta-causal states, which groups classical causal models into clusters based on equivalent qualitative behavior and consolidates specific mechanism parameterizations. We demonstrate how meta-causal states can be inferred from observed agent behavior, and discuss potential methods for disentangling these states from unlabeled data. Finally, we apply our analysis to a dynamical system, showing that meta-causal states can also emerge from inherent system dynamics, and thus constitute more than a context-dependent framework in which mechanisms emerge only as a result of external factors.
Authors: Nandan Kumar Jha, Brandon Reagen
Abstract: The pervasiveness of proprietary language models has raised privacy concerns for users' sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively high communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis to understand the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and reducing the FLOP count. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to 4.23$\times$ communication and 1.94$\times$ latency reduction. We validate the effectiveness of AERO by benchmarking it against the state-of-the-art.
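As a rough illustration only (the exact formulation, sign, and weighting of AERO's entropy regularizer are not reproduced here and are treated as assumptions), an attention-entropy term can be computed from the attention probabilities and added to the training loss:

```python
import torch

def mean_attention_entropy(attn_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn_probs: (batch, heads, query, key), with rows already softmax-normalized."""
    entropy = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # entropy per attention row
    return entropy.mean()

# Hypothetical usage; the sign and weight of the regularizer are design choices:
# loss = lm_loss + reg_weight * mean_attention_entropy(attn_probs)
```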
Authors: Youpeng Li, Xinda Wang, Fuxun Yu, Lichao Sun, Wenbin Zhang, Xuyu Wang
Abstract: Federated learning (FL), an emerging distributed machine learning paradigm, has been applied to various privacy-preserving scenarios. However, due to its distributed nature, FL faces two key issues: the non-independent and identical distribution (non-IID) of user data and vulnerability to Byzantine threats. To address these challenges, in this paper, we propose FedCAP, a robust FL framework against both data heterogeneity and Byzantine attacks. The core of FedCAP is a model update calibration mechanism to help a server capture the differences in the direction and magnitude of model updates among clients. Furthermore, we design a customized model aggregation rule that facilitates collaborative training among similar clients while accelerating the model deterioration of malicious clients. With a Euclidean norm-based anomaly detection mechanism, the server can quickly identify and permanently remove malicious clients. Moreover, the impact of data heterogeneity and Byzantine attacks can be further mitigated through personalization on the client side. We conduct extensive experiments, comparing multiple state-of-the-art baselines, to demonstrate that FedCAP performs well in several non-IID settings and shows strong robustness under a series of poisoning attacks.
Authors: Peng Xia, Kangyu Zhu, Haoran Li, Tianze Wang, Weijia Shi, Sheng Wang, Linjun Zhang, James Zou, Huaxiu Yao
Abstract: Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. However, the limited amount of high-quality data and the distribution shift between training and deployment data limit the applicability of fine-tuning methods. Although RAG is lightweight and effective, existing RAG-based approaches do not generalize sufficiently across different medical domains and can potentially cause misalignment issues, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive retrieved contexts selection method, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when introducing retrieved contexts. Experimental results across five medical datasets (covering radiology, ophthalmology, and pathology) on medical VQA and report generation demonstrate that MMed-RAG can achieve an average improvement of 43.8% in the factual accuracy of Med-LVLMs. Our data and code are available at https://github.com/richard-peng-xia/MMed-RAG.
Authors: Jie Ren, Kangrui Chen, Chen Chen, Vikash Sehwag, Yue Xing, Jiliang Tang, Lingjuan Lyu
Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have made significant advancements in a wide range of natural language processing and vision-language tasks. Access to large web-scale datasets has been a key factor in their success. However, concerns have been raised about the unauthorized use of copyrighted materials and potential copyright infringement. Existing methods, such as sample-level Membership Inference Attacks (MIA) and distribution-based dataset inference, distinguish member data (data used for training) and non-member data by leveraging the common observation that models tend to memorize and show greater confidence in member data. Nevertheless, these methods face challenges when applied to LLMs and VLMs, such as the requirement for ground-truth member data or non-member data that shares the same distribution as the test data. In this paper, we propose a novel dataset-level membership inference method based on Self-Comparison. We find that a member prefix followed by a non-member suffix (paraphrased from a member suffix) can further trigger the model's memorization on training data. Instead of directly comparing member and non-member data, we introduce paraphrasing to the second half of the sequence and evaluate how the likelihood changes before and after paraphrasing. Unlike prior approaches, our method does not require access to ground-truth member data or non-member data in identical distribution, making it more practical. Extensive experiments demonstrate that our proposed method outperforms traditional MIA and dataset inference techniques across various datasets and models, including public models, fine-tuned models, and API-based commercial models.
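A minimal sketch of the self-comparison score, assuming a Hugging Face causal LM and an external paraphrase() helper (both the helper and any decision threshold are placeholders; the paper's dataset-level aggregation is more involved):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def suffix_log_likelihood(prefix: str, suffix: str) -> float:
    """Average log-likelihood of the suffix tokens conditioned on the prefix."""
    ids = tok(prefix + suffix, return_tensors="pt").input_ids
    n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    targets = ids[:, 1:]
    token_ll = logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll[:, n_prefix - 1:].mean().item()        # score only the suffix tokens

def self_comparison_score(prefix: str, suffix: str, paraphrase) -> float:
    """Drop in likelihood after paraphrasing the suffix; larger drops suggest membership."""
    return suffix_log_likelihood(prefix, suffix) - suffix_log_likelihood(prefix, paraphrase(suffix))
```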
Authors: Sajjad Ghiasvand, Yifan Yang, Zhiyu Xue, Mahnoosh Alizadeh, Zheng Zhang, Ramtin Pedarsani
Abstract: Parameter-efficient fine-tuning (PEFT) methods typically assume that Large Language Models (LLMs) are trained on data from a single device or client. However, real-world scenarios often require fine-tuning these models on private data distributed across multiple devices. Federated Learning (FL) offers an appealing solution by preserving user privacy, as sensitive data remains on local devices during training. Nonetheless, integrating PEFT methods into FL introduces two main challenges: communication overhead and data heterogeneity. In this paper, we introduce FedTT and FedTT+, methods for adapting LLMs by integrating tensorized adapters into client-side models' encoder/decoder blocks. FedTT is versatile and can be applied to both cross-silo FL and large-scale cross-device FL. FedTT+, an extension of FedTT tailored for cross-silo FL, enhances robustness against data heterogeneity by adaptively freezing portions of tensor factors, further reducing the number of trainable parameters. Experiments on BERT and LLaMA models demonstrate that our proposed methods successfully address data heterogeneity challenges and perform on par or even better than existing federated PEFT approaches while achieving up to 10$\times$ reduction in communication cost.
Authors: Jakub Grudzien Kuba, Pieter Abbeel, Sergey Levine
Abstract: Expressive large-scale neural networks enable training powerful models for prediction tasks. However, in many engineering and science domains, such models are intended to be used not just for prediction, but for design -- e.g., creating new proteins that serve as effective therapeutics, or creating new materials or chemicals that maximize a downstream performance measure. Thus, researchers have recently taken an interest in building deep learning methods that solve offline \emph{model-based optimization} (MBO) problems, in which design candidates are optimized with respect to surrogate models learned from offline data. However, straightforward application of predictive models that are effective at predicting in-distribution properties of a design is not necessarily best suited for use in creating new designs. Thus, the most successful algorithms that tackle MBO draw inspiration from reinforcement learning and generative modeling to meet the in-distribution constraints. Meanwhile, recent theoretical works have observed that exploiting the structure of the target black-box function is an effective strategy for solving MBO from offline data. Unfortunately, discovering such structure remains an open problem. In this paper, following first principles, we develop a model that learns the structure of an MBO task and empirically leads to improved designs. To this end, we introduce \emph{Cliqueformer} -- a scalable transformer-based architecture that learns the black-box function's structure in the form of its \emph{functional graphical model} (FGM), thus bypassing the problem of distribution shift, previously tackled by conservative approaches. We evaluate Cliqueformer on various tasks, ranging from high-dimensional black-box functions from MBO literature to real-world tasks of chemical and genetic design, consistently demonstrating its state-of-the-art performance.
Authors: Emilio Calvano, Nika Haghtalab, Ellen Vitercik, Eric Zhao
Abstract: The content selection problem of digital services is often modeled as a decision-process where a service chooses, over multiple rounds, an arm to pull from a set of arms that each return a certain reward. This classical model does not account for the possibility that users disengage when dissatisfied and thus fails to capture an important trade-off between choosing content that promotes future engagement versus immediate reward. In this work, we introduce a model for the content selection problem where dissatisfied users may disengage and where the content that maximizes immediate reward does not necessarily maximize the odds of future user engagement. We show that when each arm's expected reward and its effect on user satisfaction are linearly related, an optimal content selection policy can be computed efficiently with dynamic programming under natural assumptions about the complexity of the users' engagement patterns. Moreover, we show that in an online learning setting where users with unknown engagement patterns arrive, there is a variant of Hedge that attains a $\tfrac 12$-competitive ratio regret bound. We also use our model to identify key primitives that determine how digital services should weigh engagement against revenue. For example, when it is more difficult for users to rejoin a service they are disengaged from, digital services naturally see a reduced payoff but user engagement may -- counterintuitively -- increase.
Authors: Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck
Abstract: Autoregressive models have demonstrated an unprecedented ability to model the intricacies of natural language. However, they continue to struggle with generating complex outputs that adhere to logical constraints. Sampling from a fully-independent distribution subject to a constraint is hard. Sampling from an autoregressive distribution subject to a constraint is doubly hard: We have to contend not only with the hardness of the constraint but also the distribution's lack of structure. We propose a tractable probabilistic approach that performs Bayesian conditioning to draw samples subject to a constraint. Our approach considers the entire sequence, leading to a more globally optimal constrained generation than current greedy methods. Starting from a model sample, we induce a local, factorized distribution which we can tractably condition on the constraint. To generate samples that satisfy the constraint, we sample from the conditional distribution, correct for biases in the samples and resample. The resulting samples closely approximate the target distribution and are guaranteed to satisfy the constraints. We evaluate our approach on several tasks, including LLM detoxification and solving Sudoku puzzles. We show that by disallowing a list of toxic expressions our approach is able to steer the model's outputs away from toxic generations, outperforming similar approaches to detoxification. We conclude by showing that our approach achieves perfect accuracy on Sudoku, compared to <50% for GPT-4o and Gemini 1.5.
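As a toy illustration of the underlying sample-reweight-resample pattern only (synthetic stand-in distributions, not the paper's LLM setting or its local factorized construction), one can sample from a constraint-respecting factorized proposal, importance-correct against a target, and resample:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len, n_samples = 5, 4, 2000

def log_p(seq):
    """Synthetic unnormalized joint log-probability (stand-in for the target model)."""
    return float(-0.7 * np.sum(seq))   # prefers small token ids

# Factorized proposal q(x) = prod_t q_t(x_t), conditioned on the constraint
# "token 0 never appears" by zeroing its mass and renormalizing per position.
q = rng.dirichlet(np.ones(vocab), size=seq_len)
q[:, 0] = 0.0
q /= q.sum(axis=1, keepdims=True)

samples = np.stack([rng.choice(vocab, p=q[t], size=n_samples) for t in range(seq_len)], axis=1)
log_q = np.log(q[np.arange(seq_len), samples]).sum(axis=1)
log_w = np.array([log_p(s) for s in samples]) - log_q            # importance weights p/q
w = np.exp(log_w - log_w.max()); w /= w.sum()
resampled = samples[rng.choice(n_samples, p=w, size=n_samples)]  # bias-correcting resample
assert (resampled != 0).all()  # every sample satisfies the constraint by construction
```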
Authors: Handi Zhang, Langchen Liu, Lu Lu
Abstract: By leveraging neural networks, the emerging field of scientific machine learning (SciML) offers novel approaches to address complex problems governed by partial differential equations (PDEs). In practical applications, challenges arise due to the distributed nature of data, concerns about data privacy, or the impracticality of transferring large volumes of data. Federated learning (FL), a decentralized framework that enables the collaborative training of a global model while preserving data privacy, offers a solution to the challenges posed by isolated data pools and sensitive data issues. This paper explores the integration of FL and SciML to approximate complex functions and solve differential equations. We propose two novel models: federated physics-informed neural networks (FedPINN) and federated deep operator networks (FedDeepONet). We further introduce various data generation methods to control the degree of non-independent and identically distributed (non-iid) data and utilize the 1-Wasserstein distance to quantify data heterogeneity in function approximation and PDE learning. We systematically investigate the relationship between data heterogeneity and federated model performance. Additionally, we propose a measure of weight divergence and develop a theoretical framework to establish growth bounds for weight divergence in federated learning compared to traditional centralized learning. To demonstrate the effectiveness of our methods, we conducted 10 experiments, including 2 on function approximation, 5 PDE problems on FedPINN, and 3 PDE problems on FedDeepONet. These experiments demonstrate that the proposed federated methods surpass models trained only on local data and achieve accuracy competitive with centralized models trained on all data.
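A minimal sketch of quantifying heterogeneity with the 1-Wasserstein distance, under the simplifying assumption that each client's data reduces to a one-dimensional sample of input locations (client splits below are illustrative):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
client_a = rng.uniform(0.0, 0.5, size=500)   # client A samples only the left half of the domain
client_b = rng.uniform(0.5, 1.0, size=500)   # client B samples only the right half
client_c = rng.uniform(0.0, 1.0, size=500)   # client C samples the whole domain (iid-like)

print("A vs B:", wasserstein_distance(client_a, client_b))   # strongly heterogeneous
print("A vs C:", wasserstein_distance(client_a, client_c))   # milder heterogeneity
```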
Authors: Khiem Le, Nitesh V. Chawla
Abstract: Molecule optimization is a critical task in drug discovery to optimize desired properties of a given molecule through chemical modification. Despite Large Language Models (LLMs) holding the potential to efficiently simulate this task by using natural language to direct the optimization, straightforwardly utilizing them shows limited performance. In this work, we facilitate utilizing LLMs in an iterative paradigm by proposing a simple yet highly effective domain feedback provider, namely $\text{Re}^2$DF. In detail, $\text{Re}^2$DF harnesses an external toolkit, RDKit, to handle molecule hallucination when the modified molecule is chemically invalid. Otherwise, its desired properties are computed and compared to the original one, establishing reliable domain feedback with correct direction and distance towards the objective, followed by a retrieved example, to explicitly guide the LLM to refine the modified molecule. We conduct experiments across both single- and multi-property objectives with 2 thresholds, where $\text{Re}^2$DF shows significant improvements. Particularly, for 20 single-property objectives, $\text{Re}^2$DF enhances the Hit ratio by 16.95\% and 20.76\% under loose and strict thresholds, respectively. For 32 multi-property objectives, $\text{Re}^2$DF enhances the Hit ratio by 6.04\% and 5.25\%.
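A minimal sketch of such RDKit-based feedback, assuming logP as the property under optimization and a hypothetical feedback-message format (the paper's actual prompts and property set are not reproduced here):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def domain_feedback(original_smiles: str, modified_smiles: str, target_delta: float) -> str:
    """Return a textual feedback message for the LLM's proposed modification."""
    mol = Chem.MolFromSmiles(modified_smiles)   # returns None for invalid SMILES
    if mol is None:
        return "The modified molecule is chemically invalid; propose a valid SMILES."
    orig = Chem.MolFromSmiles(original_smiles)
    delta = Descriptors.MolLogP(mol) - Descriptors.MolLogP(orig)
    if delta >= target_delta:
        return "Objective met."
    return (f"logP changed by {delta:.2f}, short of the target increase of "
            f"{target_delta:.2f}; modify further in the same direction.")
```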
Authors: Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang
Abstract: Prior methods propose to offset the escalating costs of modern foundation models by dropping specific parts of their contexts with hand-designed rules, while attempting to preserve their original performance. We overcome this trade-off with Neural Attention Memory Models (NAMMs), introducing a learned network for memory management that improves both the performance and efficiency of transformers. We evolve NAMMs atop pre-trained transformers to provide different latent contexts focusing on the most relevant information for individual layers and attention heads. NAMMs are universally applicable to any model using self-attention as they condition exclusively on the values in the produced attention matrices. Learning NAMMs on a small set of problems, we achieve substantial performance improvements across multiple long-context benchmarks while cutting the model's input contexts down to a fraction of their original sizes. We show the generality of our conditioning enables zero-shot transfer of NAMMs trained only on language to entirely new transformer architectures even across input modalities, with their benefits carrying over to vision and reinforcement learning.
Authors: Cheng Huang, Pan Mu, Cong Bai, Peter AG Watson
Abstract: Precipitation from tropical cyclones (TCs) can cause disasters such as flooding, mudslides, and landslides. Predicting such precipitation in advance is crucial, giving people time to prepare and defend against these precipitation-induced disasters. Developing deep learning (DL) rainfall prediction methods offers a new way to predict potential disasters. However, one problem is that most existing methods suffer from cumulative errors and lack physical consistency. Second, these methods overlook the importance of meteorological factors in TC rainfall and their integration with the numerical weather prediction (NWP) model. Therefore, we propose Tropical Cyclone Precipitation Diffusion (TCP-Diffusion), a multi-modal model for global tropical cyclone precipitation forecasting. It forecasts TC rainfall around the TC center for the next 12 hours at 3-hour resolution based on past rainfall observations and multi-modal environmental variables. Adjacent residual prediction (ARP) changes the training target from the absolute rainfall value to the rainfall trend, making our model aware of rainfall changes, reducing cumulative errors, and ensuring physical consistency. Considering the influence of TC-related meteorological factors and the useful information from NWP model forecasts, we propose a multi-model framework with specialized encoders to extract richer information from environmental variables and results provided by NWP models. The results of extensive experiments show that our method outperforms other DL methods and the NWP method from the European Centre for Medium-Range Weather Forecasts (ECMWF).
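A minimal sketch of adjacent residual prediction on a toy array (shapes and data are illustrative): train on differences between adjacent steps, then reconstruct absolute rainfall by cumulative summation from the last observation:

```python
import numpy as np

rain = np.random.default_rng(0).gamma(2.0, 1.0, size=8)   # toy 3-hourly rainfall sequence
targets = np.diff(rain)                                    # residual (trend) training targets
reconstructed = rain[0] + np.cumsum(targets)               # invert the residuals
assert np.allclose(reconstructed, rain[1:])                # recovers the absolute values
```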
Authors: Ziwei Yang, Zheng Chen, Xin Liu, Rikuto Kotoge, Peng Chen, Yasuko Matsubara, Yasushi Sakurai, Jimeng Sun
Abstract: Retrieving gene functional networks from knowledge databases presents a challenge due to the mismatch between disease networks and subtype-specific variations. Current solutions, including statistical and deep learning methods, often fail to effectively integrate gene interaction knowledge from databases or explicitly learn subtype-specific interactions. To address this mismatch, we propose GeSubNet, which learns a unified representation capable of predicting gene interactions while distinguishing between different disease subtypes. Graphs generated by such representations can be considered subtype-specific networks. GeSubNet is a multi-step representation learning framework with three modules: First, a deep generative model learns distinct disease subtypes from patient gene expression profiles. Second, a graph neural network captures representations of prior gene networks from knowledge databases, ensuring accurate physical gene interactions. Finally, we integrate these two representations using an inference loss that leverages graph generation capabilities, conditioned on the patient separation loss, to refine subtype-specific information in the learned representation. GeSubNet consistently outperforms traditional methods, with average improvements of 30.6%, 21.0%, 20.1%, and 56.6% across four graph evaluation metrics, averaged over four cancer datasets. Particularly, we conduct a biological simulation experiment to assess how the behavior of selected genes from over 11,000 candidates affects subtypes or patient distributions. The results show that the generated network has the potential to identify subtype-specific genes with an 83% likelihood of impacting patient distribution shifts. The GeSubNet resource is available: https://anonymous.4open.science/r/GeSubNet/
Authors: Fanyu Meng, Xin Liu, Zhaodan Kong, Xin Chen
Abstract: eXplainable Artificial Intelligence (XAI) has garnered significant attention for enhancing transparency and trust in machine learning models. However, most existing explanation techniques focus either on offering a holistic view of the explainee model (global explanation) or on individual instances (local explanation), while the middle ground, i.e., cohort-based explanation, is less explored. Cohort explanations offer insights into the explainee's behavior on a specific group or cohort of instances, enabling a deeper understanding of model decisions within a defined context. In this paper, we discuss the unique challenges and opportunities associated with measuring cohort explanations, define their desired properties, and create a generalized framework for generating cohort explanations based on supervised clustering.
Authors: George I. Kamberov
Abstract: Many machine learning (ML) classifiers are claimed to outperform humans, but they still make mistakes that humans do not. The most notorious examples of such mistakes are adversarial visual metamers. This paper aims to define and investigate the phenomenon of adversarial Doppelgangers (AD), which includes adversarial visual metamers, and to compare the performance and robustness of ML classifiers to human performance. We find that AD are inputs that are close to each other with respect to a perceptual metric defined in this paper. AD are qualitatively different from the usual adversarial examples. The vast majority of classifiers are vulnerable to AD and robustness-accuracy trade-offs may not improve them. Some classification problems may not admit any AD-robust classifiers because the underlying classes are ambiguous. We provide criteria that can be used to determine whether a classification problem is well defined or not; describe the structure and attributes of an AD-robust classifier; introduce and explore the notions of conceptual entropy and regions of conceptual ambiguity for classifiers that are vulnerable to AD attacks, along with methods to bound the AD fooling rate of an attack. We define the notion of classifiers that exhibit hypersensitive behavior, that is, classifiers whose only mistakes are adversarial Doppelgangers. Improving the AD robustness of hypersensitive classifiers is equivalent to improving accuracy. We identify conditions guaranteeing that all classifiers with sufficiently high accuracy are hypersensitive. Our findings are aimed at significant improvements in the reliability and security of machine learning systems.
Authors: Al Zadid Sultan Bin Habib, Kesheng Wang, Mary-Anne Hartley, Gianfranco Doretto, Donald A. Adjeroh
Abstract: Effective analysis of tabular data still poses a significant problem in deep learning, mainly because features in tabular datasets are often heterogeneous and have different levels of relevance. This work introduces TabSeq, a novel framework for the sequential ordering of features, addressing the vital necessity to optimize the learning process. Features are not always equally informative, and for certain deep learning models, their random arrangement can hinder the model's learning capacity. Finding the optimum sequence order for such features could improve the deep learning models' learning process. The novel feature ordering technique we provide in this work is based on clustering and incorporates both local ordering and global ordering. It is designed to be used with a multi-head attention mechanism in a denoising autoencoder network. Our framework uses clustering to align comparable features and improve data organization. Multi-head attention focuses on essential characteristics, whereas the denoising autoencoder highlights important aspects by rebuilding from distorted inputs. This method improves the capability to learn from tabular data while lowering redundancy. Our experiments on a raw antibody microarray dataset and two other real-world biomedical datasets demonstrate improved performance through appropriate feature sequence rearrangement, validating the impact of feature ordering. These results demonstrate that feature ordering can be a viable approach to improved deep learning of tabular data.
Authors: Gabriel Wu, Jacob Hilton
Abstract: We consider the problem of low probability estimation: given a machine learning model and a formally-specified input distribution, how can we estimate the probability of a binary property of the model's output, even when that probability is too small to estimate by random sampling? This problem is motivated by the need to improve worst-case performance, which distribution shift can make much more likely. We study low probability estimation in the context of argmax sampling from small transformer language models. We compare two types of methods: importance sampling, which involves searching for inputs giving rise to the rare output, and activation extrapolation, which involves extrapolating a probability distribution fit to the model's logits. We find that importance sampling outperforms activation extrapolation, but both outperform naive sampling. Finally, we explain how minimizing the probability estimate of an undesirable behavior generalizes adversarial training, and argue that new methods for low probability estimation are needed to provide stronger guarantees about worst-case performance.
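A toy contrast between naive Monte Carlo and importance sampling for a rare event, offered as a stand-in illustration only (a Gaussian tail probability rather than the paper's transformer setting; the proposal shift is an illustrative choice):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
threshold = 4.0                      # P(X > 4) for X ~ N(0, 1) is about 3.2e-5
n = 20_000

naive = (rng.normal(size=n) > threshold).mean()           # usually 0 at this sample size

proposal_mean = threshold                                  # shift the proposal onto the rare region
x = rng.normal(loc=proposal_mean, size=n)
weights = norm.pdf(x) / norm.pdf(x, loc=proposal_mean)     # importance weights p(x) / q(x)
importance = (weights * (x > threshold)).mean()

print(f"true={1 - norm.cdf(threshold):.2e}  naive={naive:.2e}  importance={importance:.2e}")
```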
Authors: Qian Tao, Wenyuan Yu, Jingren Zhou
Abstract: Large language models have shown exceptional capabilities in a wide range of tasks, such as text generation and video generation, among others. However, due to their massive parameter count, these models often require substantial storage space, imposing significant constraints on the machines deploying LLMs. To overcome this limitation, one research direction proposes to compress the models using integer replacements for floating-point numbers, in a process known as Quantization. Some recent studies suggest quantizing the key and value cache (KV Cache) of LLMs, and designing quantization techniques that treat the key and value matrices equivalently. This work delves deeper into the asymmetric structural roles of the KV Cache: the transformer's output loss is more sensitive to the quantization of key matrices than to that of value matrices. We conduct a systematic examination of the attention output error resulting from key and value quantization. This phenomenon inspires us to propose an asymmetric quantization strategy. Our approach allows for 1-bit quantization of the KV cache by implementing distinct configurations for key and value matrices. We carry out experiments across a variety of datasets, demonstrating that our proposed model allows for the quantization of up to 75% of decoder layers with 1 bit, while simultaneously maintaining performance levels comparable to those of models with floating-point parameters.
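A minimal sketch of treating keys and values asymmetrically during fake-quantization; the bit-widths and the grouping choices below are illustrative assumptions, not the paper's exact recipe:

```python
import torch

def quantize(x: torch.Tensor, bits: int, dim=None) -> torch.Tensor:
    """Uniform symmetric fake-quantization; dim=None means a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    if dim is None:
        scale = x.abs().max().clamp(min=1e-8) / qmax                   # one scale per tensor
    else:
        scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

keys = torch.randn(8, 128, 64)     # (heads, tokens, head_dim)
values = torch.randn(8, 128, 64)

keys_q = quantize(keys, bits=4, dim=1)   # finer config: per-head, per-channel scales
values_q = quantize(values, bits=2)      # coarser config: a single per-tensor scale
print((keys - keys_q).abs().mean(), (values - values_q).abs().mean())
```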
Authors: Alex Mallen, Nora Belrose
Abstract: Scalable oversight studies methods of training and evaluating AI systems in domains where human judgement is unreliable or expensive, such as scientific research and software engineering in complex codebases. Recent work in this area by Burns et al. (2023) suggests that Language Models (LMs) pretrained on internet-scale corpora exhibit an inductive bias toward producing correct answers, even when finetuned on error-prone labels produced by a smaller language model. This suggests that massive pretraining combined with finetuning on imperfect human labels may be a solid baseline method for scalable oversight. In the real world, however, label quality is not fixed: practitioners face a quantity-quality tradeoff when generating finetuning data. In this paper, we explore the microeconomics of the quantity-quality tradeoff on binary NLP classification tasks used in Burns et al. (2023). We find that there are three regimes of eliciting classification knowledge from pretrained models using supervised finetuning: quantity-dominant, quality-dominant, and a mixed regime involving the use of low- and high-quality data together to attain higher accuracy at a lower cost than using either alone. We explore sample-efficient elicitation methods that make use of two datasets of differing qualities, and establish a Pareto frontier of scalable elicitation methods that optimally trade off labeling cost and classifier performance.
Authors: Ruohan Wang, Zilong Wang, Ziyang Song, David Buckeridge, Yue Li
Abstract: Automatic subphenotyping from electronic health records (EHRs) provides numerous opportunities to understand diseases with unique subgroups and enhance personalized medicine for patients. However, existing machine learning algorithms either focus on specific diseases for better interpretability or produce coarse-grained phenotype topics without considering nuanced disease patterns. In this study, we propose a guided topic model, MixEHR-Nest, to infer sub-phenotype topics from thousands of diseases using multi-modal EHR data. Specifically, MixEHR-Nest detects multiple subtopics from each phenotype topic, whose prior is guided by the expert-curated phenotype concepts such as Phenotype Codes (PheCodes) or Clinical Classification Software (CCS) codes. We evaluated MixEHR-Nest on two EHR datasets: (1) the MIMIC-III dataset, consisting of over 38 thousand patients from the intensive care unit (ICU) of Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA; (2) the healthcare administrative database PopHR, comprising 1.3 million patients from Montreal, Canada. Experimental results demonstrate that MixEHR-Nest can identify subphenotypes with distinct patterns within each phenotype, which are predictive for disease progression and severity. For example, MixEHR-Nest distinguishes between type 1 and type 2 diabetes by inferring subphenotypes using CCS codes, which do not differentiate these two subtype concepts. Additionally, MixEHR-Nest not only improved the prediction accuracy of short-term mortality of ICU patients and initial insulin treatment in diabetic patients but also revealed the contributions of subphenotypes. For longitudinal analysis, MixEHR-Nest identified subphenotypes of distinct age prevalence under the same phenotypes, such as asthma, leukemia, epilepsy, and depression. The MixEHR-Nest software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-Nest.
Authors: Juan Diego Toscano, Vivek Oommen, Alan John Varghese, Zongren Zou, Nazanin Ahmadi Daryakenari, Chenxi Wu, George Em Karniadakis
Abstract: Physics-Informed Neural Networks (PINNs) have emerged as a key tool in Scientific Machine Learning since their introduction in 2017, enabling the efficient solution of ordinary and partial differential equations using sparse measurements. Over the past few years, significant advancements have been made in the training and optimization of PINNs, covering aspects such as network architectures, adaptive refinement, domain decomposition, and the use of adaptive weights and activation functions. A notable recent development is the Physics-Informed Kolmogorov-Arnold Networks (PIKANs), which leverage a representation model originally proposed by Kolmogorov in 1957, offering a promising alternative to traditional PINNs. In this review, we provide a comprehensive overview of the latest advancements in PINNs, focusing on improvements in network design, feature expansion, optimization techniques, uncertainty quantification, and theoretical insights. We also survey key applications across a range of fields, including biomedicine, fluid and solid mechanics, geophysics, dynamical systems, heat transfer, chemical engineering, and beyond. Finally, we review computational frameworks and software tools developed by both academia and industry to support PINN research and applications.
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu
Abstract: State Space Models (SSMs) have emerged as an appealing alternative to Transformers for large language models, achieving state-of-the-art accuracy with constant memory complexity, which allows them to hold longer context lengths than attention-based networks. The superior computational efficiency of SSMs in long sequence modeling positions them favorably over Transformers in many scenarios. However, improving the efficiency of SSMs on request-intensive cloud-serving and resource-limited edge applications is still a formidable task. SSM quantization is a possible solution to this problem, making SSMs more suitable for wide deployment, while still maintaining their accuracy. Quantization is a common technique to reduce the model size and to utilize the low bit-width acceleration features on modern computing units, yet existing quantization techniques are poorly suited for SSMs. Most notably, SSMs have highly sensitive feature maps within the selective scan mechanism (i.e., linear recurrence) and massive outliers in the output activations, which are not present in the output of token-mixing in the self-attention modules. To address this issue, we propose a static 8-bit per-tensor SSM quantization method which suppresses the maximum values of the input activations to the selective SSM for finer quantization precision and quantizes the output activations in an outlier-free space with Hadamard transform. Our 8-bit weight-activation quantized Mamba 2.8B SSM benefits from hardware acceleration and achieves a 1.72$\times$ lower generation latency on an Nvidia Orin Nano 8G, with only a 0.9% drop in average accuracy on zero-shot tasks. The experiments demonstrate the effectiveness and practical applicability of our approach for deploying SSM-based models of all sizes on both cloud and edge platforms.
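A minimal NumPy sketch of two of the named ingredients, per-tensor 8-bit quantization and a Hadamard rotation that spreads outliers across channels before quantizing (sizes and the outlier setup are illustrative, not the paper's calibration procedure):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)   # orthonormal, so the rotation can be undone exactly

def quantize_per_tensor(x: np.ndarray, bits: int = 8) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

d = 64
x = np.random.default_rng(0).normal(size=(128, d))
x[:, 3] *= 50.0                                  # one massive outlier channel

H = hadamard(d)
direct = quantize_per_tensor(x)                  # the outlier dominates the scale
rotated = quantize_per_tensor(x @ H) @ H.T       # quantize in the Hadamard-rotated space
print(np.abs(x - direct).mean(), np.abs(x - rotated).mean())
```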
Authors: Ryotaro Shimizu, Takashi Wada, Yu Wang, Johannes Kruse, Sean O'Brien, Sai HtaungKham, Linxin Song, Yuya Yoshikawa, Yuki Saito, Fugee Tsung, Masayuki Goto, Julian McAuley
Abstract: Recent research on explainable recommendation generally frames the task as a standard text generation problem, and evaluates models simply based on the textual similarity between the predicted and ground-truth explanations. However, this approach fails to consider one crucial aspect of the systems: whether their outputs accurately reflect the users' (post-purchase) sentiments, i.e., whether and why they would like and/or dislike the recommended items. To shed light on this issue, we introduce new datasets and evaluation methods that focus on the users' sentiments. Specifically, we construct the datasets by explicitly extracting users' positive and negative opinions from their post-purchase reviews using an LLM, and propose to evaluate systems based on whether the generated explanations 1) align well with the users' sentiments, and 2) accurately identify both positive and negative opinions of users on the target items. We benchmark several recent models on our datasets and demonstrate that achieving strong performance on existing metrics does not ensure that the generated explanations align well with the users' sentiments. Lastly, we find that existing models can provide more sentiment-aware explanations when the users' (predicted) ratings for the target items are directly fed into the models as input. We will release our code and datasets upon acceptance.
Authors: Jintao Zhang, Mingyue Cheng, Xiaoyu Tao, Zhiding Liu, Daoyu Wang
Abstract: Time series forecasting is vital in numerous web applications, influencing critical decision-making across industries. While diffusion models have recently gained increasing popularity for this task, we argue they suffer from a significant drawback: indiscriminate noise addition to the original time series followed by denoising, which can obscure the underlying evolving trend and complicate forecasting. To address this limitation, we propose a novel flexible decoupled framework (FDF) that learns high-quality time series representations for enhanced forecasting performance. A key characteristic of our approach is that it leverages the inherent inductive bias of time series data by decomposing it into trend and seasonal components, each modeled separately to enable decoupled analysis and modeling. Specifically, we propose an innovative Conditional Denoising Seasonal Module (CDSM) within the diffusion model, which leverages statistical information from the historical window to conditionally model the complex seasonal component. Notably, we incorporate a Polynomial Trend Module (PTM) to effectively capture the smooth trend component, thereby enhancing the model's ability to represent temporal dependencies. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance over existing methods and highlighting its flexibility in time series forecasting. We hope our work can bring a new perspective to time series forecasting. We intend to make our code publicly available as open-source in the future.
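As a rough illustration of the trend/seasonal decoupling described above (not the authors' code; the polynomial degree and the synthetic series are assumptions for the example), one can split a series into a low-degree polynomial trend and a seasonal residual that would then be modeled separately:

```python
import numpy as np

def decompose(series, degree=3):
    """Split a 1-D series into a smooth polynomial trend and a seasonal residual.

    A minimal stand-in for the trend/seasonal decoupling described above: the trend
    is a low-degree polynomial fit (in the spirit of a Polynomial Trend Module),
    and the residual is treated as the seasonal component to be modeled separately
    (e.g., by a conditional denoising module).
    """
    t = np.arange(len(series), dtype=float)
    coeffs = np.polyfit(t, series, deg=degree)   # least-squares polynomial fit
    trend = np.polyval(coeffs, t)
    seasonal = series - trend                    # residual, modeled separately
    return trend, seasonal

# Toy example: linear trend plus a sinusoidal seasonal pattern plus noise.
rng = np.random.default_rng(42)
t = np.arange(200, dtype=float)
series = 0.05 * t + np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=200)

trend, seasonal = decompose(series, degree=2)
print(np.round(trend[:3], 2), np.round(float(seasonal.std()), 3))
```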
Authors: Dian Meng, Bohao Xing, Xinlei Huang, Yanran Liu, Yijun Zhou, Yongjun xiao, Zitong Yu, Xubin Zheng
Abstract: Single-cell multi-omics (scMulti-omics) refers to paired multimodal data, such as Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq), where the regulation of each cell is measured across different modalities, i.e., genes and proteins. scMulti-omics can reveal heterogeneity inside tumors and help understand the distinct genetic properties of diverse cell types, which is crucial to targeted therapy. Currently, deep learning methods based on attention structures in the bioinformatics area face two challenges. The first challenge is the vast number of genes in a single cell. Traditional attention-based modules struggle to effectively leverage all gene information due to their limited capacity for long-context learning and high-complexity computing. The second challenge is that genes in the human genome are ordered and influence each other's expression. Most existing methods ignore this sequential information. The recently introduced Test-Time Training (TTT) layer is a novel sequence modeling approach, particularly suitable for handling long contexts like genomics data, because the TTT layer is a linear-complexity sequence modeling structure and is better suited to data with sequential relationships. In this paper, we propose scFusionTTT, a novel method for Single-Cell multimodal omics Fusion with a TTT-based masked autoencoder. Of note, we combine the order information of genes and proteins in the human genome with the TTT layer, fuse multimodal omics, and enhance unimodal omics analysis. Finally, the model employs a three-stage training strategy, which yielded the best performance across most metrics in four multimodal omics datasets and four unimodal omics datasets, demonstrating the superior performance of our model. The dataset and code will be available on https://github.com/DM0815/scFusionTTT.
Authors: Xu Han, Yuancheng Sun, Kai Chen, Kang Liu, Qiwei Ye
Abstract: Coarse-grained (CG) molecular dynamics simulations offer computational efficiency for exploring protein conformational ensembles and thermodynamic properties. Though coarse representations enable large-scale simulations across extended temporal and spatial ranges, the sacrifice of atomic-level details limits their utility in tasks such as ligand docking and protein-protein interaction prediction. Backmapping, the process of reconstructing all-atom structures from coarse-grained representations, is crucial for recovering these fine details. While recent machine learning methods have made strides in protein structure generation, challenges persist in reconstructing diverse atomistic conformations that maintain geometric accuracy and chemical validity. In this paper, we present Latent Diffusion Backmapping (LDB), a novel approach leveraging denoising diffusion within latent space to address these challenges. By combining discrete latent encoding with diffusion, LDB bypasses the need for equivariant and internal coordinate manipulation, significantly simplifying the training and sampling processes as well as facilitating better and wider exploration in configuration space. We evaluate LDB's state-of-the-art performance on three distinct protein datasets, demonstrating its ability to efficiently reconstruct structures with high structural accuracy and chemical validity. Moreover, LDB shows exceptional versatility in capturing diverse protein ensembles, highlighting its capability to explore intricate conformational spaces. Our results position LDB as a powerful and scalable approach for backmapping, effectively bridging the gap between CG simulations and atomic-level analyses in computational biology.
Authors: Jake Robertson, Thorsten Schmidt, Frank Hutter, Noor Awad
Abstract: Fairness-aware Machine Learning (FairML) applications are often characterized by complex social objectives and legal requirements, frequently involving multiple, potentially conflicting notions of fairness. Despite the well-known Impossibility Theorem of Fairness and extensive theoretical research on the statistical and socio-technical trade-offs between fairness metrics, many FairML tools still optimize or constrain for a single fairness objective. However, this one-sided optimization can inadvertently lead to violations of other relevant notions of fairness. In this socio-technical and empirical study, we frame fairness as a many-objective (MaO) problem by treating fairness metrics as conflicting objectives. We introduce ManyFairHPO, a human-in-the-loop, fairness-aware model selection framework that enables practitioners to effectively navigate complex and nuanced fairness objective landscapes. ManyFairHPO aids in the identification, evaluation, and balancing of fairness metric conflicts and their related social consequences, leading to more informed and socially responsible model-selection decisions. Through a comprehensive empirical evaluation and a case study on the Law School Admissions problem, we demonstrate the effectiveness of ManyFairHPO in balancing multiple fairness objectives, mitigating risks such as self-fulfilling prophecies, and providing interpretable insights to guide stakeholders in making fairness-aware modeling decisions.
Authors: Xiaoyan Hu, Ho-fung Leung, Farzan Farnia
Abstract: Selecting a sample generation scheme from multiple text-based generative models is typically addressed by choosing the model that maximizes an averaged evaluation score. However, this score-based selection overlooks the possibility that different models achieve the best generation performance for different types of text prompts. An online identification of the best generation model for various input prompts can reduce the costs associated with querying sub-optimal models. In this work, we explore the possibility of varying rankings of text-based generative models for different text prompts and propose an online learning framework to predict the best data generation model for a given input prompt. The proposed framework adapts the kernelized contextual bandit (CB) methodology to a CB setting with shared context variables across arms, utilizing the generated data to update a kernel-based function that predicts which model will achieve the highest score for unseen text prompts. Additionally, we apply random Fourier features (RFF) to the kernelized CB algorithm to accelerate the online learning process and establish a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound for the proposed RFF-based CB algorithm over $T$ iterations. Our numerical experiments on real and simulated text-to-image and image-to-text generative models show that RFF-UCB successfully identifies the best generation model across different sample types.
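A minimal sketch of the kind of RFF-based contextual bandit described above is given below. It is our own illustrative reading of the abstract, with the context dimension, RBF bandwidth, number of random features, and exploration weight all chosen arbitrarily; it should not be taken as the authors' RFF-UCB algorithm.

```python
import numpy as np

class RFFUCB:
    """Random-Fourier-feature UCB sketch for picking a generation model per prompt.

    Prompts are context vectors; each arm (model) keeps a Bayesian linear-regression
    posterior over RFF features approximating an RBF kernel.
    """

    def __init__(self, n_arms, dim, n_features=64, gamma=1.0, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Random Fourier features for an RBF kernel with bandwidth parameter gamma.
        self.W = rng.normal(scale=np.sqrt(gamma), size=(dim, n_features))
        self.b = rng.uniform(0, 2 * np.pi, size=n_features)
        self.alpha = alpha
        self.A = [np.eye(n_features) for _ in range(n_arms)]    # per-arm precision
        self.c = [np.zeros(n_features) for _ in range(n_arms)]  # per-arm reward sums

    def _phi(self, x):
        return np.sqrt(2.0 / self.W.shape[1]) * np.cos(x @ self.W + self.b)

    def select(self, x):
        phi = self._phi(x)
        scores = []
        for A, c in zip(self.A, self.c):
            A_inv = np.linalg.inv(A)
            mean = phi @ (A_inv @ c)                 # posterior mean score
            bonus = self.alpha * np.sqrt(phi @ A_inv @ phi)  # exploration bonus
            scores.append(mean + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        phi = self._phi(x)
        self.A[arm] += np.outer(phi, phi)
        self.c[arm] += reward * phi

# Toy usage: two hypothetical "models" whose quality depends on the prompt embedding.
rng = np.random.default_rng(1)
bandit = RFFUCB(n_arms=2, dim=8)
for _ in range(200):
    x = rng.normal(size=8)
    arm = bandit.select(x)
    true_best = int(x[0] > 0)                        # hypothetical ground truth
    reward = float(arm == true_best) + 0.1 * rng.normal()
    bandit.update(arm, x, reward)
```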
Authors: Prakhar Dixit, Tim Oates
Abstract: Many students struggle with math word problems (MWPs), often finding it difficult to identify key information and select the appropriate mathematical operations. Schema-based instruction (SBI) is an evidence-based strategy that helps students categorize problems based on their structure, improving problem-solving accuracy. Building on this, we propose a Schema-Based Instruction Retrieval-Augmented Generation (SBI-RAG) framework that incorporates a large language model (LLM). Our approach emphasizes step-by-step reasoning by leveraging schemas to guide solution generation. We evaluate its performance on the GSM8K dataset, comparing it with GPT-4 and GPT-3.5 Turbo, and introduce a "reasoning score" metric to assess solution quality. Our findings suggest that SBI-RAG enhances reasoning clarity and problem-solving accuracy, potentially providing educational benefits for students.
Authors: Mingda Lu, Zitian Ao, Chao Wang, Sudhakar Prasad, Raymond H. Chan
Abstract: For the 3D localization problem using point spread function (PSF) engineering, we propose a novel enhancement of our previously introduced localization neural network, LocNet. The improved network is a physics-informed neural network (PINN) that we call PiLocNet. Previous works on the localization problem may be categorized separately into model-based optimization and neural network approaches. Our PiLocNet combines the unique strengths of both approaches by incorporating forward-model-based information into the network via a data-fitting loss term that constrains the neural network to yield results that are physically sensible. We additionally incorporate certain regularization terms from the variational method, which further improves the robustness of the network in the presence of image noise, as we show for the Poisson and Gaussian noise models. This framework accords interpretability to the neural network, and the results we obtain show its superiority. Although the paper focuses on the use of single-lobe rotating PSF to encode the full 3D source location, we expect the method to be widely applicable to other PSFs and imaging problems that are constrained by known forward processes.
Authors: Janine Strotherm, Barbara Hammer
Abstract: As relevant examples such as future criminal detection software [1] show, the fairness of AI-based decision support tools that affect the social domain constitutes an important area of research. In this contribution, we investigate applications of AI to socioeconomically relevant infrastructures such as water distribution networks (WDNs), where fairness issues have yet to gain a foothold. To establish the notion of fairness in this domain, we propose an appropriate definition of protected groups and of group fairness in WDNs as an extension of existing definitions. We demonstrate that typical methods for the detection of leakages in WDNs are unfair in this sense. We therefore propose a remedy to increase fairness, which can be applied even to the non-differentiable ensemble classification methods used in this context.
Authors: David Hoffmann, Kailash Budhathoki, Matthaeus Kleindessner
Abstract: The evolving capabilities of large language models are accompanied by growing sizes and deployment costs, necessitating effective inference optimisation techniques. We propose a novel pruning method utilising centrality measures from graph theory, reducing both the computational requirements and the memory footprint of these models. Specifically, we devise a method for creating a weighted directed acyclic graph representation of multilayer perceptrons, to which we apply a modified version of the weighted PageRank centrality measure to compute node importance scores. In combination with uniform pruning, this leads to structured sparsity. We call this pruning method MLPRank. Furthermore, we introduce an extension to decoder-only transformer models and call it LLMRank. Both variants demonstrate strong performance, with MLPRank on average yielding 6.09% higher accuracy retention than three popular baselines, and LLMRank 13.42% higher than two popular baselines.
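The following sketch illustrates, under our own simplifying assumptions, the core idea of scoring hidden units with weighted PageRank over a graph built from absolute weight magnitudes and then pruning whole units (structured sparsity). A two-layer MLP stands in for the multilayer perceptrons of a real model; the details are illustrative and do not reproduce MLPRank itself.

```python
import numpy as np

def weighted_pagerank(adj, damping=0.85, n_iter=100):
    """Power iteration for weighted PageRank on a (possibly asymmetric) adjacency matrix."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0
    P = adj / out                                    # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        r = (1 - damping) / n + damping * (P.T @ r)
    return r

def prune_hidden_units(W1, W2, keep_ratio=0.5):
    """Score hidden units of a 2-layer MLP (x -> W1 -> hidden -> W2 -> y) and prune.

    Nodes are inputs, hidden units, and outputs; edge weights are absolute weights.
    """
    d_in, d_h = W1.shape
    d_out = W2.shape[1]
    n = d_in + d_h + d_out
    adj = np.zeros((n, n))
    adj[:d_in, d_in:d_in + d_h] = np.abs(W1)
    adj[d_in:d_in + d_h, d_in + d_h:] = np.abs(W2)
    scores = weighted_pagerank(adj)[d_in:d_in + d_h]  # centrality of hidden nodes
    k = max(1, int(keep_ratio * d_h))
    keep = np.argsort(scores)[-k:]                    # structured sparsity: drop whole units
    return W1[:, keep], W2[keep, :], keep

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 32)), rng.normal(size=(32, 4))
W1p, W2p, kept = prune_hidden_units(W1, W2, keep_ratio=0.25)
print("kept hidden units:", len(kept))
```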
Authors: Chongyang Wan, Shunbo Lei, Yuan Luo
Abstract: The increasing severity of climate change necessitates an urgent transition to renewable energy sources, making the large-scale adoption of wind energy crucial for mitigating environmental impact. However, the inherent uncertainty of wind power poses challenges for grid stability, underscoring the need for accurate wind energy prediction models to enable effective power system planning and operation. While many existing studies on wind power prediction focus on short-term forecasting, they often overlook the importance of long-term predictions. Long-term wind power forecasting is essential for effective power grid dispatch and market transactions, as it requires careful consideration of weather features such as wind speed and direction, which directly influence power output. Consequently, methods designed for short-term predictions may lead to inaccurate results and high computational costs in long-term settings. To address these limitations, we propose a novel approach called Hybrid Frequency Feature Enhancement Inverted Transformer (Hiformer). Hiformer introduces a unique structure that integrates signal decomposition technology with weather feature extraction technique to enhance the modeling of correlations between meteorological conditions and wind power generation. Additionally, Hiformer employs an encoder-only architecture, which reduces the computational complexity associated with long-term wind power forecasting. Compared to the state-of-the-art methods, Hiformer: (i) can improve the prediction accuracy by up to 52.5\%; and (ii) can reduce computational time by up to 68.5\%.
Authors: ChaoRong Li, XuDong Ling, YiLan Xue, Wenjie Luo, LiHong Zhu, FengQing Qin, Yaodong Zhou, Yuanyuan Huang
Abstract: Short-term precipitation forecasting remains challenging due to the difficulty in capturing long-term spatiotemporal dependencies. Current deep learning methods fall short in establishing effective dependencies between conditions and forecast results, while also lacking interpretability. To address this issue, we propose a precipitation nowcasting model, the Diffusion Transformer with Causal Attention (DTCA). Our model leverages the Transformer and combines causal attention mechanisms to establish spatiotemporal queries between conditional information (causes) and forecast results (effects). This design enables the model to effectively capture long-term dependencies, allowing forecast results to maintain strong causal relationships with input conditions over a wide range of time and space. We explore four variants of spatiotemporal information interactions for DTCA, demonstrating that global spatiotemporal labeling interactions yield the best performance. In addition, we introduce a Channel-To-Batch shift operation to further enhance the model's ability to represent complex rainfall dynamics. We conducted experiments on two datasets. Compared to state-of-the-art U-Net-based methods, our approach improved the CSI (Critical Success Index) for predicting heavy precipitation by approximately 15% and 8%, respectively, achieving state-of-the-art performance.
Authors: Rushi Shah, Mingyuan Yan, Michael Curtis Mozer, Dianbo Liu
Abstract: Discrete representations play a crucial role in many deep learning architectures, yet their non-differentiable nature poses significant challenges for gradient-based optimization. To address this issue, various gradient estimators have been developed, including the Straight-Through Gumbel-Softmax (ST-GS) estimator, which combines the Straight-Through Estimator (STE) and the Gumbel-based reparameterization trick. However, the performance of ST-GS is highly sensitive to temperature, with its selection often compromising gradient fidelity. In this work, we propose a simple yet effective extension to ST-GS by employing decoupled temperatures for forward and backward passes, which we refer to as "Decoupled ST-GS". We show that our approach significantly enhances the original ST-GS through extensive experiments across multiple tasks and datasets. We further investigate the impact of our method on gradient fidelity from multiple perspectives, including the gradient gap and the bias-variance trade-off of estimated gradients. Our findings contribute to the ongoing effort to improve discrete optimization in deep learning, offering a practical solution that balances simplicity and effectiveness.
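A compact PyTorch-style sketch of the decoupled-temperature idea is shown below: the forward pass produces a (hard) sample at one temperature while gradients flow through a relaxation computed at a different temperature. Function and argument names are ours, and the snippet illustrates the mechanism rather than the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def decoupled_st_gumbel_softmax(logits, tau_fwd=1.0, tau_bwd=0.5, hard=True):
    """Straight-Through Gumbel-Softmax with separate forward/backward temperatures.

    The forward pass uses tau_fwd to produce the sample; the backward pass uses the
    relaxation computed at tau_bwd, so gradient fidelity can be tuned independently.
    """
    gumbels = -torch.empty_like(logits).exponential_().log()   # Gumbel(0, 1) noise
    y_fwd = F.softmax((logits + gumbels) / tau_fwd, dim=-1)    # used for the sample
    y_bwd = F.softmax((logits + gumbels) / tau_bwd, dim=-1)    # used for the gradient
    if hard:
        index = y_fwd.argmax(dim=-1, keepdim=True)
        y_out = torch.zeros_like(logits).scatter_(-1, index, 1.0)
    else:
        y_out = y_fwd.detach()
    # Forward value is y_out; the gradient flows only through the tau_bwd relaxation.
    return y_out + y_bwd - y_bwd.detach()

# Toy usage: gradients w.r.t. logits are computed at the backward temperature.
torch.manual_seed(0)
logits = torch.randn(4, 10, requires_grad=True)
values = torch.randn(10)                       # hypothetical per-category values
sample = decoupled_st_gumbel_softmax(logits, tau_fwd=1.0, tau_bwd=0.3)
loss = (sample * values).sum()                 # downstream objective
loss.backward()
print(sample.shape, logits.grad.abs().mean().item())
```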
Authors: Hongfan Gao, Wangmeng Shen, Xiangfei Qiu, Ronghui Xu, Jilin Hu, Bin Yang
Abstract: Probabilistic time series imputation has been widely applied in real-world scenarios due to its ability to estimate the uncertainty of imputation results. Meanwhile, denoising diffusion probabilistic models (DDPMs) have achieved great success in probabilistic time series imputation tasks thanks to their power to model complex distributions. However, current DDPM-based probabilistic time series imputation methodologies are confronted with two types of challenges: 1)~\textit{The backbone modules of the denoising parts are not capable of achieving sequence modeling with low time complexity.} 2)~\textit{The architecture of the denoising modules cannot handle the inter-variable and bidirectional dependencies in the time series imputation problem effectively.} To address the first challenge, we integrate the computationally efficient state space model, namely Mamba, as the backbone denoising module for DDPMs. To tackle the second challenge, we carefully devise several SSM-based blocks for bidirectional modeling and inter-variable relation understanding. Experimental results demonstrate that our approach achieves state-of-the-art time series imputation results on multiple datasets, different missing scenarios, and missing ratios.
Authors: Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt
Abstract: High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation, and points out promising avenues for future work.
Authors: Rosemary He, Ichiro Takeuchi
Abstract: Alzheimer's Disease is challenging to diagnose due to our limited understanding of its mechanism and the large heterogeneity among patients. Neurodegeneration is studied widely as a biomarker for clinical diagnosis, and it can be measured from time series MRI progression. On the other hand, generative AI has shown promise in anomaly detection in medical imaging and has been used for tasks including tumor detection. However, testing the reliability of such data-driven methods is non-trivial due to the issue of double-dipping in hypothesis testing. In this work, we propose to solve this issue with selective inference and develop a reliable generative AI method for Alzheimer's prediction. We show that, compared to traditional statistical methods with highly inflated p-values, selective inference successfully controls the false discovery rate under the desired alpha level while retaining statistical power. In practice, our pipeline could assist clinicians in Alzheimer's diagnosis and early intervention.
Authors: Kangkang Lu, Yanhua Yu, Zhiyong Huang, Jia Li, Yuling Wang, Meiyu Liang, Xiting Qin, Yimeng Ren, Tat-Seng Chua, Xidian Wang
Abstract: Graph Neural Networks (GNNs) have garnered significant scholarly attention for their powerful capabilities in modeling graph structures. Despite this, two primary challenges persist: heterogeneity and heterophily. Existing studies often address heterogeneous and heterophilic graphs separately, leaving a research gap in the understanding of heterogeneous heterophilic graphs, i.e., those that feature diverse node or relation types together with dissimilar connected nodes. To address this gap, we investigate the application of spectral graph filters within heterogeneous graphs. Specifically, we propose a Heterogeneous Heterophilic Spectral Graph Neural Network (H2SGNN), which employs a dual-module approach: local independent filtering and global hybrid filtering. The local independent filtering module applies polynomial filters to each subgraph independently to adapt to varying levels of homophily, while the global hybrid filtering module captures interactions across different subgraphs. Extensive empirical evaluations on four real-world datasets demonstrate the superiority of H2SGNN compared to state-of-the-art methods.
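To make the "polynomial filters on each subgraph" idea concrete, here is a small NumPy sketch (our own simplification, not H2SGNN) that applies a polynomial of the normalized Laplacian to node features of a relation-specific subgraph; in a heterogeneous graph, one such filter would be learned per subgraph and the outputs combined.

```python
import numpy as np

def normalized_laplacian(adj):
    # Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    return np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def polynomial_filter(adj, X, coeffs):
    """Apply sum_k coeffs[k] * L^k to node features X (one relation-specific subgraph)."""
    L = normalized_laplacian(adj)
    out = np.zeros_like(X)
    LkX = X.copy()
    for c in coeffs:
        out = out + c * LkX
        LkX = L @ LkX
    return out

# Toy usage: two subgraphs over the same nodes, each with its own filter coefficients,
# combined by a simple sum (a stand-in for the global hybrid filtering step).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
adj_a = np.array([[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
                  [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]], dtype=float)
adj_b = (rng.random((6, 6)) > 0.6).astype(float)
adj_b = np.triu(adj_b, 1)
adj_b = adj_b + adj_b.T
H = polynomial_filter(adj_a, X, coeffs=[1.0, -0.5]) + polynomial_filter(adj_b, X, coeffs=[0.5, 0.5, 0.2])
print(H.shape)
```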
Authors: Shuwen Sun, Lihong Feng, Peter Benner
Abstract: Numerically solving a large parametric nonlinear dynamical system is challenging due to its high complexity and the high computational costs. In recent years, machine-learning-aided surrogates are being actively researched. However, many methods fail in accurately generalizing in the entire time interval $[0, T]$, when the training data is available only in a training time interval $[0, T_0]$, with $T_0 < T$.
Authors: Yingtao Ren, Yu-Cheng Chang, Thomas Do, Zehong Cao, Chin-Teng Lin
Abstract: Fuzzy Neural Networks (FNNs) are effective machine learning models for classification tasks, commonly based on the Takagi-Sugeno-Kang (TSK) fuzzy system. However, when faced with high-dimensional data, especially with noise, FNNs encounter challenges such as vanishing gradients, excessive fuzzy rules, and limited access to prior knowledge. To address these challenges, we propose a novel fuzzy system, the Self-Constructing Multi-Expert Fuzzy System (SOME-FS). It combines two learning strategies: mixed structure learning and multi-expert advanced learning. The former enables each base classifier to effectively determine its structure without requiring prior knowledge, while the latter tackles the issue of vanishing gradients by enabling each rule to focus on its local region, thereby enhancing the robustness of the fuzzy classifiers. The overall ensemble architecture enhances the stability and prediction performance of the fuzzy system. Our experimental results demonstrate that the proposed SOME-FS is effective in high-dimensional tabular data, especially in dealing with uncertainty. Moreover, our stable rule mining process can identify concise and core rules learned by the SOME-FS.
Authors: Opeyemi Sheu Alamu, Bismar Jorge Gutierrez Choque, Syed Wajeeh Abbs Rizvi, Samah Badr Hammed, Isameldin Elamin Medani, Md Kamrul Siam, Waqar Ahmad Tahir
Abstract: Breast cancer remains a significant global health challenge, with prognosis and treatment decisions largely dependent on clinical characteristics. Accurate prediction of patient outcomes is crucial for personalized treatment strategies. This study employs survival analysis techniques, including Cox proportional hazards and parametric survival models, to enhance the prediction of the log odds of survival in breast cancer patients. Clinical variables such as tumor size, hormone receptor status, HER2 status, age, and treatment history were analyzed to assess their impact on survival outcomes. Data from 1557 breast cancer patients were obtained from a publicly available dataset provided by the University College Hospital, Ibadan, Nigeria. This dataset was preprocessed and analyzed using both univariate and multivariate approaches to evaluate survival outcomes. Kaplan-Meier survival curves were generated to visualize survival probabilities, while the Cox proportional hazards model identified key risk factors influencing mortality. The results showed that older age, larger tumor size, and HER2-positive status were significantly associated with an increased risk of mortality. In contrast, estrogen receptor positivity and breast-conserving surgery were linked to better survival outcomes. The findings suggest that integrating these clinical variables into predictive models improves the accuracy of survival predictions, helping to identify high-risk patients who may benefit from more aggressive interventions. This study demonstrates the potential of survival analysis in optimizing breast cancer care, particularly in resource-limited settings. Future research should focus on integrating genomic data and real-world clinical outcomes to further refine these models.
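For readers who want to reproduce this style of analysis, the open-source lifelines library provides Kaplan-Meier and Cox proportional hazards fitters. The snippet below is a self-contained sketch on synthetic stand-in data; all column names and effect sizes are hypothetical and do not correspond to the study's dataset from University College Hospital, Ibadan.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter

# Synthetic stand-in for the clinical variables named above (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
age = rng.integers(30, 80, size=n)
tumor_size = rng.gamma(2.0, 1.2, size=n)
her2 = rng.integers(0, 2, size=n)
er = rng.integers(0, 2, size=n)
# Assumed hazard structure: older age, larger tumors, HER2+ raise risk; ER+ lowers it.
risk = 0.03 * (age - 55) + 0.4 * tumor_size + 0.6 * her2 - 0.5 * er
time = rng.exponential(scale=36 * np.exp(-risk))
censor = rng.exponential(scale=48, size=n)
df = pd.DataFrame({
    "survival_months": np.minimum(time, censor),
    "event_observed": (time <= censor).astype(int),   # 1 = death observed, 0 = censored
    "age": age,
    "tumor_size_cm": tumor_size,
    "her2_positive": her2,
    "er_positive": er,
})

# Kaplan-Meier curve for the whole cohort.
km = KaplanMeierFitter()
km.fit(df["survival_months"], event_observed=df["event_observed"])

# Cox proportional hazards model: hazard ratios for each clinical covariate.
cph = CoxPHFitter()
cph.fit(df, duration_col="survival_months", event_col="event_observed")
cph.print_summary()
```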
Authors: Chuanyu Tang, Yilong Chen, Zhenyu Zhang, Junyuan Shang, Wenyuan Zhang, Yong Huang, Tingwen Liu
Abstract: Low-Rank Adaptation (LoRA) drives research to align its performance with full fine-tuning. However, significant challenges remain: (1) Simply increasing the rank size of LoRA does not effectively capture high-rank information, which leads to a performance bottleneck. (2) MoE-style LoRA methods substantially increase parameters and inference latency, contradicting the goals of efficient fine-tuning and ease of application. To address these challenges, we introduce Mixture of Ranks (MoR), which learns rank-specific information for different tasks based on the input and efficiently integrates multi-rank information. We first propose a new framework that equates the integration of multiple LoRAs to expanding the rank of LoRA. Moreover, we hypothesize that low-rank LoRA already captures sufficient intrinsic information, and that MoR can derive high-rank information through mathematical transformations of the low-rank components. Thus, MoR reduces the learning difficulty of LoRA and enhances its multi-task capabilities. MoR achieves impressive results, delivering a 1.31\% performance improvement while using only 93.93\% of the parameters compared to baseline methods.
Authors: Dimitrios Kelesis, Dimitris Fotakis, Georgios Paliouras
Abstract: In this work we investigate an observation made by Kipf \& Welling, who suggested that untrained GCNs can generate meaningful node embeddings. In particular, we investigate the effect of training only a single layer of a GCN, while keeping the rest of the layers frozen. We propose a basis on which the effect of the untrained layers and their contribution to the generation of embeddings can be predicted. Moreover, we show that network width influences the dissimilarity of node embeddings produced after the initial node features pass through the untrained part of the model. Additionally, we establish a connection between partially trained GCNs and oversmoothing, showing that they are capable of reducing it. We verify our theoretical results experimentally and show the benefits of using deep networks that resist oversmoothing, in a ``cold start'' scenario, where there is a lack of feature information for unlabeled nodes.
Authors: Zhanpeng Wang, Shenghao Li, Chen Wang, Shuting Cao, Na Lei, Zhongxuan Luo
Abstract: In recent years, the knowledge surrounding diffusion models (DMs) has grown significantly, though several theoretical gaps remain. Particularly noteworthy is the prior error, defined as the discrepancy between the termination distribution of the forward process and the initial distribution of the reverse process. To address these deficiencies, this paper explores the deeper relationship between optimal transport (OT) theory and DMs with discrete initial distributions. Specifically, we demonstrate that the two stages of DMs fundamentally involve computing time-dependent OT. However, the unavoidable prior error results in deviations during the reverse process under quadratic transport cost. By proving that, as the diffusion termination time increases, the probability flow exponentially converges to the gradient of the solution to the classical Monge-Amp\`ere equation, we establish a vital link between these fields. Therefore, static OT emerges as the most intrinsic single-step method for bridging this potential theoretical gap. Additionally, we apply these insights to accelerate sampling in both unconditional and conditional generation scenarios. Experimental results across multiple image datasets validate the effectiveness of our approach.
Authors: Guangming Huang, Yunfei Long, Cunjin Luo, Sheng Liu
Abstract: Supervised contrastive learning has been explored in making use of label information for multi-label classification, but determining positive samples in the multi-label scenario remains challenging. Previous studies have examined strategies for identifying positive samples, considering the label overlap proportion between anchors and samples. However, they ignore the various relations between given anchors and samples, as well as how to dynamically adjust the weights in contrastive loss functions based on different relations, leading to great ambiguity. In this paper, we introduce five distinct relations between multi-label samples and propose a Similarity-Dissimilarity Loss with contrastive learning for multi-label classification. Our loss function re-weights the loss by computing the similarity and dissimilarity between positive samples and a given anchor based on the introduced relations. We mainly conduct experiments for multi-label text classification on MIMIC datasets, then further extend the evaluation to MS-COCO. The experimental results show that our proposed loss effectively improves the performance of all encoders under the supervised contrastive learning paradigm, demonstrating its effectiveness and robustness.
Authors: Jinyang Liu, Tessa Steensgaard, Marvin N. Wright, Niklas Pfister, Munir Hiabu
Abstract: Many existing interpretation methods are based on Partial Dependence (PD) functions that, for a pre-trained machine learning model, capture how a subset of the features affects the predictions by averaging over the remaining features. Notable methods include Shapley additive explanations (SHAP) which computes feature contributions based on a game theoretical interpretation and PD plots (i.e., 1-dim PD functions) that capture average marginal main effects. Recent work has connected these approaches using a functional decomposition and argues that SHAP values can be misleading since they merge main and interaction effects into a single local effect. A major advantage of SHAP compared to other PD-based interpretations, however, has been the availability of fast estimation techniques, such as \texttt{TreeSHAP}. In this paper, we propose a new tree-based estimator, \texttt{FastPD}, which efficiently estimates arbitrary PD functions. We show that \texttt{FastPD} consistently estimates the desired population quantity -- in contrast to path-dependent \texttt{TreeSHAP} which is inconsistent when features are correlated. For moderately deep trees, \texttt{FastPD} improves the complexity of existing methods from quadratic to linear in the number of observations. By estimating PD functions for arbitrary feature subsets, \texttt{FastPD} can be used to extract PD-based interpretations such as SHAP, PD plots and higher order interaction effects.
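As a point of reference for what a PD function is, the brute-force estimator below computes a 1-dim partial dependence curve by intervening on one feature and averaging model predictions over the remaining features; this is the population quantity that tree-based estimators such as the proposed \texttt{FastPD} compute far more efficiently. The snippet is our own generic illustration, not \texttt{FastPD}.

```python
import numpy as np

def partial_dependence_1d(model_predict, X, feature_idx, grid=None):
    """Brute-force 1-dim partial dependence: average predictions over remaining features.

    Requires O(n_grid * n_samples) model evaluations; fast tree-based estimators aim
    to compute the same quantity without this cost.
    """
    if grid is None:
        grid = np.unique(np.quantile(X[:, feature_idx], np.linspace(0, 1, 20)))
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = v            # intervene on the feature of interest
        pd_values.append(model_predict(X_mod).mean())
    return grid, np.array(pd_values)

# Toy usage with a hypothetical fitted model's predict function.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
model_predict = lambda Z: Z[:, 0] ** 2 + 0.5 * Z[:, 1]   # stand-in for model.predict
grid, pd_vals = partial_dependence_1d(model_predict, X, feature_idx=0)
print(np.round(pd_vals[:5], 3))
```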
Authors: Hao Mark Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos I. Venieris
Abstract: In spite of the great potential of large language models (LLMs) across various tasks, their deployment on resource-constrained devices remains challenging due to their excessive computational and memory demands. Quantization has emerged as an effective solution by storing weights in reduced precision. However, utilizing low precisions (i.e.~2/3-bit) to substantially alleviate the memory-boundedness of LLM decoding still suffers from a prohibitive performance drop. In this work, we argue that existing approaches fail to explore the diversity in computational patterns, redundancy, and sensitivity to approximations of the different phases of LLM inference, resorting to a uniform quantization policy throughout. Instead, we propose a novel phase-aware method that selectively allocates precision during different phases of LLM inference, achieving both strong context extraction during prefill and efficient memory bandwidth utilization during decoding. To further address the memory-boundedness of the decoding phase, we introduce Progressive Mixed-Precision Decoding (PMPD), a technique that enables the gradual lowering of precision deeper in the generated sequence, together with a spectrum of precision-switching schedulers that dynamically drive the precision-lowering decisions in either a task-adaptive or prompt-adaptive manner. Extensive evaluation across diverse language tasks shows that when targeting Nvidia GPUs, PMPD achieves a 1.4$-$12.2$\times$ speedup in matrix-vector multiplications over fp16 models, while when targeting an LLM-optimized NPU, our approach delivers a throughput gain of 3.8$-$8.0$\times$ over fp16 models and up to 1.54$\times$ over uniform quantization approaches while preserving the output quality.
Authors: Riccardo Poiani, Nicole Nobili, Alberto Maria Metelli, Marcello Restelli
Abstract: Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms (e.g., policy gradient methods). In this context, the designer of the learning system specifies an interaction budget that the agent usually spends by collecting trajectories of fixed length within a simulator. However, is this data collection strategy the best option? To answer this question, in this paper, we propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths, i.e., \emph{truncated}. Specifically, this surrogate shows the sub-optimality of the fixed-length trajectory schedule. Furthermore, it suggests that adaptive data collection strategies that spend the available budget sequentially can allocate a larger portion of transitions in timesteps in which more accurate sampling is required to reduce the error of the final estimate. Building on these findings, we present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO). The main intuition behind RIDO is to split the available interaction budget into mini-batches. At each round, the agent determines the most convenient schedule of trajectories that minimizes an empirical and robust version of the surrogate of the estimator's error. After discussing the theoretical properties of our method, we conclude by assessing its performance across multiple domains. Our results show that RIDO can adapt its trajectory schedule toward timesteps where more sampling is required to increase the quality of the final estimation.
Authors: Michele Guerra, Simone Scardapane, Filippo Maria Bianchi
Abstract: Spatiotemporal graph neural networks (STGNNs) have shown promising results in many domains, from forecasting to epidemiology. However, understanding the dynamics learned by these models and explaining their behaviour is significantly more complex than for models dealing with static data. Inspired by Koopman theory, which allows a simpler description of intricate, nonlinear dynamical systems, we introduce an explainability approach for temporal graphs. We present two methods to interpret the STGNN's decision process and identify the most relevant spatial and temporal patterns in the input for the task at hand. The first relies on dynamic mode decomposition (DMD), a Koopman-inspired dimensionality reduction method. The second relies on sparse identification of nonlinear dynamics (SINDy), a popular method for discovering governing equations, which we use for the first time as a general tool for explainability. We show how our methods can correctly identify interpretable features such as infection times and infected nodes in the context of dissemination processes.
Authors: Alessandro Micheli, M\'elodie Monod
Abstract: This paper tackles the challenge of learning non-Markovian optimal execution strategies in dynamic financial markets. We introduce a novel actor-critic algorithm based on Deep Deterministic Policy Gradient (DDPG) to address this issue, with a focus on transient price impact modeled by a general decay kernel. Through numerical experiments with various decay kernels, we show that our algorithm successfully approximates the optimal execution strategy. Additionally, the proposed algorithm demonstrates adaptability to evolving market conditions, where parameters fluctuate over time. Our findings also show that modern reinforcement learning algorithms can provide a solution that reduces the need for frequent and inefficient human intervention in optimal execution tasks.
Authors: Yoav Alon, Cristina David
Abstract: Large Language Models (LLMs) were shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture in which a Reinforcement Learning (RL) Agent guides an LLM's space exploration: (1) the Agent has access to domain-specific information and can therefore make decisions about the quality of candidate solutions based on specific and relevant metrics that were not explicitly considered by the LLM's training objective; (2) the LLM can focus on generating immediate next steps, without the need for long-term planning. We allow non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task, and compare it against Chain of Thought (CoT) and Tree of Thoughts (ToT). We assess both the downstream task, i.e., binary classification, and the intermediate reasoning steps. Our approach compares positively against CoT and ToT.
Authors: Andreas Opedal, Haruki Shirakami, Bernhard Sch\"olkopf, Abulhair Saparov, Mrinmaya Sachan
Abstract: Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to problems that are more complex than the ones on which they have been trained. Empirical investigations of such questions are impeded by two major flaws of current evaluations: (i) much of the evaluation data is contaminated, in the sense that it has already been seen during training, and (ii) benchmark datasets do not capture how problem proofs may be arbitrarily complex in various ways. As a step towards addressing these issues, we present a framework for evaluating LLMs on problems that have arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problems that follow fixed proof specifications -- along with chain-of-thought reasoning annotations -- enabling systematic studies on generalization with respect to arithmetic proof complexity. We apply MathGAP to analyze how in-context learning interacts with generalization to problems that have more complex proofs. We find that among the models tested, most show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for GPT-4o. Surprisingly, providing in-context examples from the same distribution as the test set is not always beneficial for performance. In particular, zero-shot prompting as well as demonstrating a diverse range of examples that are less complex than the test data sometimes yield similar or higher accuracies.
Authors: Marco Spinaci, Marek Polewczyk, Johannes Hoffart, Markus C. Kohler, Sam Thelin, Tassilo Klein
Abstract: Self-supervised learning on tabular data seeks to apply advances from natural language and image domains to the diverse domain of tables. However, current techniques often struggle with integrating multi-domain data and require data cleaning or specific structural requirements, limiting the scalability of pre-training datasets. We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing. This simple yet powerful approach can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks. This work offers a practical advancement in self-supervised learning for large-scale tabular data.
Authors: Guy Blanc, Gregory Valiant
Abstract: We resolve a fundamental question about the ability to perform a statistical task, such as learning, when an adversary corrupts the sample. Such adversaries are specified by the types of corruption they can make and their level of knowledge about the sample. The latter distinguishes between sample-adaptive adversaries which know the contents of the sample when choosing the corruption, and sample-oblivious adversaries, which do not. We prove that for all types of corruptions, sample-adaptive and sample-oblivious adversaries are \emph{equivalent} up to polynomial factors in the sample size. This resolves the main open question introduced by \cite{BLMT22} and further explored in \cite{CHLLN23}. Specifically, consider any algorithm $A$ that solves a statistical task even when a sample-oblivious adversary corrupts its input. We show that there is an algorithm $A'$ that solves the same task when the corresponding sample-adaptive adversary corrupts its input. The construction of $A'$ is simple and maintains the computational efficiency of $A$: It requests a polynomially larger sample than $A$ uses and then runs $A$ on a uniformly random subsample. One of our main technical tools is a new structural result relating two distributions defined on sunflowers which may be of independent interest.
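The reduction described above is simple enough to state in a few lines of Python. The sketch below (with an arbitrary blowup factor, unlike the polynomial factor analyzed in the paper) requests a larger corrupted sample and runs the original algorithm on a uniformly random subsample.

```python
import random

def run_adaptive_robust(A, request_sample, n, blowup=4):
    """Sketch of the reduction described above: build A' from A.

    A' requests a (polynomially) larger corrupted sample, then runs the original
    algorithm A on a uniformly random subsample of the size A expects. The blowup
    factor here is illustrative only.
    """
    big_sample = request_sample(blowup * n)       # adversary may adaptively corrupt this
    subsample = random.sample(big_sample, n)      # uniform random subsample, without replacement
    return A(subsample)

# Toy usage: A computes a median; request_sample yields (possibly corrupted) draws.
rng = random.Random(0)
A = lambda sample: sorted(sample)[len(sample) // 2]
request_sample = lambda m: [rng.gauss(0.0, 1.0) for _ in range(m)]
print(run_adaptive_robust(A, request_sample, n=101))
```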
Authors: Jesus Garcia Fernandez, Nasir Ahmad, Marcel van Gerven
Abstract: Learning is a fundamental property of intelligent systems, observed across biological organisms and engineered systems. While modern intelligent systems typically rely on gradient descent for learning, the need for exact gradients and complex information flow makes its implementation in biological and neuromorphic systems challenging. This has motivated the exploration of alternative learning mechanisms that can operate locally and do not rely on exact gradients. In this work, we introduce a novel approach that leverages noise in the parameters of the system and global reinforcement signals. Using an Ornstein-Uhlenbeck process with adaptive dynamics, our method balances exploration and exploitation during learning, driven by deviations from error predictions, akin to reward prediction error. Operating in continuous time, Ornstein-Uhlenbeck adaptation (OUA) is proposed as a general mechanism for learning in dynamic, time-evolving environments. We validate our approach across diverse tasks, including supervised learning and reinforcement learning in feedforward and recurrent systems. Additionally, we demonstrate that it can perform meta-learning, adjusting hyper-parameters autonomously. Our results indicate that OUA provides a viable alternative to traditional gradient-based methods, with potential applications in neuromorphic computing. It also hints at a possible mechanism for noise-driven learning in the brain, where stochastic neurotransmitter release may guide synaptic adjustments.
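A minimal sketch of such noise-driven learning, assuming our own simplified update rule and hyperparameters (not the paper's exact OUA dynamics), perturbs parameters with an Ornstein-Uhlenbeck process and drifts the process mean toward perturbations whose error beats a running error prediction:

```python
import numpy as np

def oua_optimize(loss_fn, theta, steps=5000, dt=0.05, sigma=0.5, tau=1.0, lr=0.2, seed=0):
    """Noise-driven, gradient-free learning with an Ornstein-Uhlenbeck process.

    Parameters are perturbed by OU noise around a mean; when the observed error beats
    a running error prediction (a reward-prediction-error-like signal), the mean drifts
    toward the perturbed parameters. All hyperparameter names are illustrative.
    """
    rng = np.random.default_rng(seed)
    mu = theta.copy()                 # OU mean (the "learned" parameters)
    x = theta.copy()                  # current perturbed parameters
    err_pred = loss_fn(theta)         # running prediction of the error
    for _ in range(steps):
        # OU dynamics around the current mean (explore).
        x += (mu - x) * dt / tau + sigma * np.sqrt(dt) * rng.normal(size=x.shape)
        err = loss_fn(x)
        rpe = err_pred - err          # positive when the perturbation beat expectations
        if rpe > 0:
            mu += lr * (x - mu)       # exploit: move the mean toward the good perturbation
        err_pred += 0.1 * (err - err_pred)
    return mu

# Toy usage: minimize a quadratic without gradients.
loss = lambda w: float(np.sum((w - np.array([2.0, -1.0])) ** 2))
w_final = oua_optimize(loss, np.zeros(2))
print(np.round(w_final, 2))
```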
Authors: Eliahu Horwitz, Bar Cavia, Jonathan Kahana, Yedid Hoshen
Abstract: The increasing availability of public models raises the question: can we train neural networks that use other networks as input? This paper learns to represent models within a joint space that embeds both model weights and language. However, machine learning on model weights is challenging as model weights often exhibit significant variation unrelated to the models' semantic properties (nuisance variation). We identify a key property of real-world models: most public models belong to a small set of Model Trees, where all models within a tree are fine-tuned from a common ancestor (e.g., a foundation model). Importantly, we find that within each tree there is less nuisance variation between models. For example, while classifying models according to their training dataset generally requires complex architectures, in our case, even a linear classifier trained on a single layer is often effective. While effective, linear layers are computationally expensive as model weights are very high dimensional. To address this, we introduce Probing Experts (ProbeX), a theoretically motivated, lightweight probing method. Notably, ProbeX is the first probing method designed to learn from the weights of just a single model layer. We also construct and release a dataset that simulates the structure of public model repositories. Our results show that ProbeX can effectively map the weights of large models into a shared weight-language embedding space. Furthermore, we demonstrate the impressive generalization of our method, achieving zero-shot model classification and retrieval.
Authors: Benjamin Leblanc, Mathieu Bazinet, Nathaniel D'Amours, Alexandre Drouin, Pascal Germain
Abstract: Reconstruction functions are pivotal in sample compression theory, a framework for deriving tight generalization bounds. From a small sample of the training set (the compression set) and an optional stream of information (the message), they recover a predictor previously learned from the whole training set. While usually fixed, we propose to learn reconstruction functions. To facilitate the optimization and increase the expressiveness of the message, we derive a new sample compression generalization bound for real-valued messages. From this theoretical analysis, we then present a new hypernetwork architecture that outputs predictors with tight generalization guarantees when trained using an original meta-learning framework. The results of promising preliminary experiments are then reported.
Authors: Zhiqiang Kou, Haoyuan Xuan, Jing Wang, Yuheng Jia, Xin Geng
Abstract: Label Distribution Learning (LDL) is a novel machine learning paradigm that addresses the problem of label ambiguity and has found widespread applications. Obtaining complete label distributions in real-world scenarios is challenging, which has led to the emergence of Incomplete Label Distribution Learning (InLDL). However, the existing InLDL methods overlook a crucial aspect of LDL data: the inherent imbalance in label distributions. To address this limitation, we propose \textbf{Incomplete and Imbalance Label Distribution Learning (I\(^2\)LDL)}, a framework that simultaneously handles incomplete labels and imbalanced label distributions. Our method decomposes the label distribution matrix into a low-rank component for frequent labels and a sparse component for rare labels, effectively capturing the structure of both head and tail labels. We optimize the model using the Alternating Direction Method of Multipliers (ADMM) and derive generalization error bounds via Rademacher complexity, providing strong theoretical guarantees. Extensive experiments on 15 real-world datasets demonstrate the effectiveness and robustness of our proposed framework compared to existing InLDL methods.
Authors: Yida Xiong, Kun Li, Weiwei Liu, Jia Wu, Bo Du, Shirui Pan, Wenbin Hu
Abstract: Molecular optimization (MO) is a crucial stage in drug discovery in which task-oriented generated molecules are optimized to meet practical industrial requirements. Existing mainstream MO approaches primarily utilize external property predictors to guide iterative property optimization. However, learning all molecular samples in the vast chemical space is unrealistic for predictors. As a result, errors and noise are inevitably introduced during property prediction due to the nature of approximation. This leads to discrepancy accumulation, reduced generalization, and suboptimal molecular candidates. In this paper, we propose a text-guided multi-property molecular optimization method utilizing a transformer-based diffusion language model (TransDLM). TransDLM leverages standardized chemical nomenclature as semantic representations of molecules and implicitly embeds property requirements into textual descriptions, thereby preventing error propagation during the diffusion process. Guided by physically and chemically detailed textual descriptions, TransDLM samples and optimizes encoded source molecules, retaining core scaffolds of source molecules and ensuring structural similarities. Moreover, TransDLM enables simultaneous sampling of multiple molecules, making it ideal for scalable, efficient large-scale optimization through distributed computation on web platforms. Furthermore, our approach surpasses state-of-the-art methods in optimizing molecular structural similarity and enhancing chemical properties on the benchmark dataset. The code is available at: https://anonymous.4open.science/r/TransDLM-A901.
Authors: Clayton Souza Leite, Henry Mauranen, Aziza Zhanabatyrova, Yu Xiao
Abstract: Transformers have excelled in natural language processing and computer vision, paving their way to sensor-based Human Activity Recognition (HAR). Previous studies show that transformers outperform their counterparts exclusively when they harness abundant data or employ compute-intensive optimization algorithms. However, neither of these scenarios is viable in sensor-based HAR due to the scarcity of data in this field and the frequent need to perform training and inference on resource-constrained devices. Our extensive investigation into various implementations of transformer-based versus non-transformer-based HAR using wearable sensors, encompassing more than 500 experiments, corroborates these concerns. We observe that transformer-based solutions pose higher computational demands, consistently yield inferior performance, and experience significant performance degradation when quantized to accommodate resource-constrained devices. Additionally, transformers demonstrate lower robustness to adversarial attacks, posing a potential threat to user trust in HAR.
Authors: Patrik Okanovic, Andreas Kirsch, Jannes Kasper, Torsten Hoefler, Andreas Krause, Nezihe Merve G\"urel
Abstract: With the multitude of pretrained models available thanks to the advancements in large-scale supervised and self-supervised learning, choosing the right model is becoming increasingly pivotal in the machine learning lifecycle. However, much like the training process, choosing the best pretrained off-the-shelf model for raw, unlabeled data is a labor-intensive task. To overcome this, we introduce MODEL SELECTOR, a framework for label-efficient selection of pretrained classifiers. Given a pool of unlabeled target data, MODEL SELECTOR samples a small subset of highly informative examples for labeling, in order to efficiently identify the best pretrained model for deployment on this target dataset. Through extensive experiments, we demonstrate that MODEL SELECTOR drastically reduces the need for labeled data while consistently picking the best or near-best performing model. Across 18 model collections on 16 different datasets, comprising over 1,500 pretrained models, MODEL SELECTOR reduces the labeling cost by up to 94.15% to identify the best model compared to the cost of the strongest baseline. Our results further highlight the robustness of MODEL SELECTOR in model selection, as it reduces the labeling cost by up to 72.41% when selecting a near-best model, whose accuracy is only within 1% of the best model.
Authors: Alexandra Bazarova, Evgenia Romanenkova, Alexey Zaytsev
Abstract: Change point detection (CPD) methods aim to identify abrupt shifts in the distribution of input data streams. Accurate estimators for this task are crucial across various real-world scenarios. Yet, traditional unsupervised CPD techniques face significant limitations, often relying on strong assumptions or suffering from low expressive power due to inherent model simplicity. In contrast, representation learning methods overcome these drawbacks by offering flexibility and the ability to capture the full complexity of the data without imposing restrictive assumptions. However, these approaches are still emerging in the CPD field and lack robust theoretical foundations to ensure their reliability. Our work addresses this gap by integrating the expressive power of representation learning with the groundedness of traditional CPD techniques. We adopt spectral normalization (SN) for deep representation learning in CPD tasks and prove that the embeddings after SN are highly informative for CPD. Our method significantly outperforms current state-of-the-art methods during the comprehensive evaluation via three standard CPD datasets.
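The snippet below sketches, in PyTorch, the kind of spectrally normalized encoder the approach relies on, together with a simple embedding-shift score over sliding windows; the architecture, window size, and scoring rule are our own illustrative choices rather than the paper's method.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class SNEncoder(nn.Module):
    """Encoder with spectral normalization on every linear layer.

    Spectral normalization bounds each layer's Lipschitz constant, the property
    leveraged above to make embedding-space distances informative for CPD.
    """
    def __init__(self, in_dim, emb_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(in_dim, 64)), nn.ReLU(),
            spectral_norm(nn.Linear(64, emb_dim)),
        )

    def forward(self, x):
        return self.net(x)

def embedding_shift_scores(encoder, series, window=20):
    """Score each time step by the distance between embeddings of adjacent windows."""
    with torch.no_grad():
        scores = []
        for t in range(window, len(series) - window):
            left = encoder(series[t - window:t]).mean(dim=0)
            right = encoder(series[t:t + window]).mean(dim=0)
            scores.append(torch.norm(right - left).item())
    return scores

# Toy usage: a mean shift halfway through a noisy multivariate signal.
torch.manual_seed(0)
series = torch.cat([torch.randn(100, 3), torch.randn(100, 3) + 2.0])
enc = SNEncoder(in_dim=3)
scores = embedding_shift_scores(enc, series)
print("peak score index (offset by window):", int(torch.tensor(scores).argmax()))
```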
Authors: Girish Narayanswamy, Xin Liu, Kumar Ayush, Yuzhe Yang, Xuhai Xu, Shun Liao, Jake Garrison, Shyam Tailor, Jake Sunshine, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Samy Abdel-Ghaffar, Daniel McDuff
Abstract: Wearable sensors have become ubiquitous thanks to a variety of health tracking features. The resulting continuous and longitudinal measurements from everyday life generate large volumes of data; however, making sense of these observations for scientific and actionable insights is non-trivial. Inspired by the empirical success of generative modeling, where large neural networks learn powerful representations from vast amounts of text, image, video, or audio data, we investigate the scaling properties of sensor foundation models across compute, data, and model size. Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, electrodermal activity, accelerometer, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM, a multimodal foundation model built on the largest wearable-signals dataset with the most extensive range of sensor modalities to date. Our results establish the scaling laws of LSM for tasks such as imputation, interpolation and extrapolation, both across time and sensor modalities. Moreover, we highlight how LSM enables sample-efficient downstream learning for tasks like exercise and activity recognition.
Authors: Chenyu Wang, Masatoshi Uehara, Yichun He, Amy Wang, Tommaso Biancalani, Avantika Lal, Tommi Jaakkola, Sergey Levine, Hanchen Wang, Aviv Regev
Abstract: Recent studies have demonstrated the strong empirical performance of diffusion models on discrete sequences across domains from natural language to biological sequence generation. For example, in the protein inverse folding task, conditional diffusion models have achieved impressive results in generating natural-like sequences that fold back into the original structure. However, practical design tasks often require not only modeling a conditional distribution but also optimizing specific task objectives. For instance, we may prefer protein sequences with high stability. To address this, we consider the scenario where we have pre-trained discrete diffusion models that can generate natural-like sequences, as well as reward models that map sequences to task objectives. We then formulate the reward maximization problem within discrete diffusion models, analogous to reinforcement learning (RL), while minimizing the KL divergence against pretrained diffusion models to preserve naturalness. To solve this RL problem, we propose a novel algorithm, DRAKES, that enables direct backpropagation of rewards through entire trajectories generated by diffusion models, by making the originally non-differentiable trajectories differentiable using the Gumbel-Softmax trick. Our theoretical analysis indicates that our approach can generate sequences that are both natural-like and yield high rewards. While similar tasks have been recently explored in diffusion models for continuous domains, our work addresses unique algorithmic and theoretical challenges specific to discrete diffusion models, which arise from their foundation in continuous-time Markov chains rather than Brownian motion. Finally, we demonstrate the effectiveness of DRAKES in generating DNA and protein sequences that optimize enhancer activity and protein stability, respectively, important tasks for gene therapies and protein-based therapeutics.
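The Gumbel-Softmax step mentioned above can be illustrated with a toy example; the alphabet size, sequence length, and linear "reward model" below are placeholders, not DRAKES itself:

```python
import torch
import torch.nn.functional as F

def differentiable_sample(logits, tau=1.0):
    """Relax a categorical draw with Gumbel-Softmax so gradients can flow.

    hard=True returns one-hot samples in the forward pass while the backward
    pass uses the soft relaxation (straight-through estimator).
    """
    return F.gumbel_softmax(logits, tau=tau, hard=True)

# Toy setup: logits over a 4-letter alphabet for a length-8 sequence and a
# stand-in differentiable reward (a fixed linear score per letter).
logits = torch.randn(8, 4, requires_grad=True)
reward_weights = torch.randn(4)

one_hot_seq = differentiable_sample(logits)      # (8, 4), differentiable
reward = (one_hot_seq @ reward_weights).sum()    # surrogate task objective
reward.backward()                                # gradients reach `logits`
```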
Authors: Hagen Holthusen, Tim Brepols, Kevin Linka, Ellen Kuhl
Abstract: Soft biological tissues exhibit a tendency to maintain a preferred state of tensile stress, known as tensional homeostasis, which is restored even after external mechanical stimuli. This macroscopic behavior can be described using the theory of kinematic growth, where the deformation gradient is multiplicatively decomposed into an elastic part and a part related to growth and remodeling. Recently, the concept of homeostatic surfaces was introduced to define the state of homeostasis and the evolution equations for inelastic deformations. However, identifying the optimal model and material parameters to accurately capture the macroscopic behavior of inelastic materials can only be accomplished with significant expertise, is often time-consuming, and is prone to error, regardless of the specific inelastic phenomenon. To address this challenge, machine learning algorithms with built-in physics offer significant potential. In this work, we extend our inelastic Constitutive Artificial Neural Networks (iCANNs) by incorporating kinematic growth and homeostatic surfaces to discover the scalar model equations, namely the Helmholtz free energy and the pseudo potential. The latter describes the state of homeostasis in a smeared sense. We evaluate the ability of the proposed network to learn from experimentally obtained tissue equivalent data at the material point level, assess its predictive accuracy beyond the training regime, and discuss its current limitations when applied at the structural level. Our source code, data, examples, and an implementation of the corresponding material subroutine are made accessible to the public at https://doi.org/10.5281/zenodo.13946282.
Authors: Wei Huang, Richard Combes, Hind Castel-Taleb, Badii Jouaber
Abstract: We propose a model for the virtual function placement problem and several novel algorithms using ideas based on multi-armed bandits. We prove that these algorithms learn the optimal placement policy rapidly, and their regret grows at a rate at most $O( N M \sqrt{T\ln T} )$ while respecting the feasibility constraints with high probability. We show through numerical experiments that these algorithms have both good practical performance and modest computational complexity. Using the proposed acceleration technique, they can be used to learn in large networks where computational power is limited. Our experiments are fully reproducible, and the code is publicly available.
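For intuition, a generic UCB1 loop is sketched below; it is not the authors' placement algorithm and ignores the feasibility constraints, treating each candidate placement as an arm with rewards in [0, 1]:

```python
import math
import random

def ucb1(num_arms, horizon, pull):
    """Generic UCB1: `pull(arm)` returns a reward in [0, 1]."""
    counts = [0] * num_arms
    means = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:                     # initialization: play each arm once
            arm = t - 1
        else:
            arm = max(range(num_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return means

# Toy environment: each arm is a candidate placement with an unknown success rate.
true_means = [0.3, 0.5, 0.7]
estimates = ucb1(3, 5000, lambda a: 1.0 if random.random() < true_means[a] else 0.0)
```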
Authors: Mustofa Ahmed, Abdul Muntakim, Nawrin Tabassum, Mohammad Asifur Rahim, Faisal Muhammad Shah
Abstract: Depression detection using deep learning models has been widely explored in previous studies, especially due to the large amounts of data available from social media posts. These posts provide valuable information about individuals' mental health conditions and can be leveraged to train models and identify patterns in the data. However, distributed learning approaches have not been extensively explored in this domain. In this study, we adopt Federated Learning (FL) to facilitate decentralized training on smartphones while protecting user data privacy. We train three neural network architectures--GRU, RNN, and LSTM--on Reddit posts to detect signs of depression and evaluate their performance under heterogeneous FL settings. To optimize the training process, we leverage a common tokenizer across all client devices, which reduces the computational load. Additionally, we analyze resource consumption and communication costs on smartphones to assess their impact in a real-world FL environment. Our experimental results demonstrate that the federated models achieve comparable performance to the centralized models. This study highlights the potential of FL for decentralized mental health prediction by providing a secure and efficient model training process on edge devices.
Authors: Kaipeng Zheng, Weiran Huang, Wanli Ouyang, Han-Sen Zhong, Yuqiang Li
Abstract: Atomic structure analysis of crystalline materials is a paramount endeavor in both chemical and material sciences. This sophisticated technique necessitates not only a solid foundation in crystallography but also a profound comprehension of the intricacies of the accompanying software, posing a significant challenge in meeting the rigorous daily demands. For the first time, we confront this challenge head-on by harnessing the power of deep learning for ultra-precise structural analysis at the full-atom level. To validate the performance of the model, named CrystalX, we employed a vast dataset comprising over 50,000 X-ray diffraction measurements derived from authentic experiments, demonstrating performance that is commensurate with human experts and adept at deciphering intricate geometric patterns. Remarkably, CrystalX revealed that even peer-reviewed publications can harbor errors that are stealthy to human scrutiny, yet CrystalX adeptly rectifies them. This deep learning model revolutionizes the time frame for crystal structure analysis, slashing it down to seconds. It has already been successfully applied in the structure analysis of newly discovered compounds in the latest research without human intervention. Overall, CrystalX marks the beginning of a new era in automating routine structural analysis within self-driving laboratories.
Authors: Vinod Raman, Ambuj Tewari
Abstract: We study generation through the lens of statistical learning theory. First, we abstract and formalize the results of Gold [1967], Angluin [1979, 1980], and Kleinberg and Mullainathan [2024] for language identification/generation in the limit in terms of a binary hypothesis class defined over an abstract instance space. Then, we formalize a different paradigm of generation studied by Kleinberg and Mullainathan [2024], which we call ``uniform generation," and provide a characterization of which hypothesis classes are uniformly generatable. As is standard in statistical learning theory, our characterization is in terms of the finiteness of a new combinatorial dimension we call the Closure dimension. By doing so, we are able to compare generatability with predictability (captured via PAC and online learnability) and show that these two properties of hypothesis classes are \emph{incompatible} - there are classes that are generatable but not predictable and vice versa.
Authors: Bernhard Bermeitinger, Tomas Hrycej, Massimo Pavone, Julianus Kath, Siegfried Handschuh
Abstract: Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV). The essential innovation of this architecture is the Attention Mechanism, which solves the problem of extracting relevant context information from long sequences in NLP and realistic scenes in CV. A classical neural network component, a Multi-Layer Perceptron (MLP), complements the attention mechanism. Its necessity is frequently justified by its capability of modeling nonlinear relationships. However, the attention mechanism itself is nonlinear through its internal use of similarity measures. A possible hypothesis is that this nonlinearity is sufficient for modeling typical application problems. As the MLPs usually contain the most trainable parameters of the whole model, their omission would substantially reduce the parameter set size. Further components can also be reorganized to reduce the number of parameters. Under some conditions, query and key matrices can be collapsed into a single matrix of the same size. The same is true about value and projection matrices, which can also be omitted without eliminating the substance of the attention mechanism. Initially, the similarity measure was defined asymmetrically, with peculiar properties such as a token possibly being dissimilar to itself. A possible symmetric definition requires only half of the parameters. We have laid the groundwork by testing widespread CV benchmarks: MNIST and CIFAR-10. The tests have shown that simplified transformer architectures (a) without MLP, (b) with collapsed matrices, and (c) symmetric similarity matrices exhibit similar performance to the original architecture, saving up to 90% of parameters without hurting the classification performance.
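The collapsed and symmetric similarity described above can be sketched as follows; the dimensions and the single-head, value-free form are illustrative assumptions rather than the exact architecture tested:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollapsedSymmetricAttention(nn.Module):
    """Sketch: query/key collapsed into one matrix W, similarity made symmetric
    via x W^T W x^T, and no value/projection matrices or MLP. Sizes are placeholders."""
    def __init__(self, dim, rank):
        super().__init__()
        self.W = nn.Parameter(torch.randn(rank, dim) / dim ** 0.5)

    def forward(self, x):                        # x: (batch, tokens, dim)
        q = x @ self.W.t()                       # one shared projection
        scores = q @ q.transpose(-2, -1)         # symmetric similarity
        attn = F.softmax(scores / q.shape[-1] ** 0.5, dim=-1)
        return attn @ x                          # values are the inputs themselves

layer = CollapsedSymmetricAttention(dim=64, rank=32)
out = layer(torch.randn(2, 10, 64))              # (2, 10, 64)
```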
Authors: Minxing Zheng, Shixiang Zhu
Abstract: Generative models have shown significant promise in critical domains such as medical diagnosis, autonomous driving, and climate science, where reliable decision-making hinges on accurate uncertainty quantification. While probabilistic conformal prediction (PCP) offers a powerful framework for this purpose, its coverage efficiency -- the size of the uncertainty set -- is limited when dealing with complex underlying distributions and a finite number of generated samples. In this paper, we propose a novel PCP framework that enhances efficiency by first vectorizing the non-conformity scores with ranked samples and then optimizing the shape of the prediction set by varying the quantiles for samples at the same rank. Our method delivers valid coverage while producing discontinuous and more efficient prediction sets, making it particularly suited for high-stakes applications. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets.
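For context, a basic probabilistic conformal calibration step (without the vectorized, rank-optimized refinement proposed above) might look like this; the data and the one-dimensional outcome space are illustrative:

```python
import numpy as np

def pcp_radius(cal_samples, cal_targets, alpha=0.1):
    """Calibrate the ball radius for a plain probabilistic conformal set.

    cal_samples: (n, m) generated samples per calibration point (1-D outcomes)
    cal_targets: (n,)   observed outcomes
    The prediction set at test time is the union of balls of this radius
    around the m generated samples; coverage is ~ 1 - alpha.
    """
    scores = np.abs(cal_samples - cal_targets[:, None]).min(axis=1)
    n = len(scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return np.sort(scores)[k - 1]

rng = np.random.default_rng(0)
cal_targets = rng.normal(size=500)
cal_samples = cal_targets[:, None] + rng.normal(scale=0.5, size=(500, 20))
radius = pcp_radius(cal_samples, cal_targets)
```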
Authors: Yue Huang, Zhaoxian Wu, Shiqian Ma, Qing Ling
Abstract: Stochastic approximation (SA) that involves multiple coupled sequences, known as multiple-sequence SA (MSSA), finds diverse applications in the fields of signal processing and machine learning. However, existing theoretical understandings of MSSA are limited: the multi-timescale analysis implies a slow convergence rate, whereas the single-timescale analysis relies on a stringent fixed point smoothness assumption. This paper establishes tighter single-timescale analysis for MSSA, without assuming smoothness of the fixed points. Our theoretical findings reveal that, when all involved operators are strongly monotone, MSSA converges at a rate of $\tilde{\mathcal{O}}(K^{-1})$, where $K$ denotes the total number of iterations. In addition, when all involved operators are strongly monotone except for the main one, MSSA converges at a rate of $\mathcal{O}(K^{-\frac{1}{2}})$. These theoretical findings align with those established for single-sequence SA. Applying these theoretical findings to bilevel optimization and communication-efficient distributed learning offers relaxed assumptions and/or simpler algorithms with performance guarantees, as validated by numerical experiments.
Authors: Yuchen Liang, Peizhong Ju, Yingbin Liang, Ness Shroff
Abstract: The denoising diffusion model has recently emerged as a powerful generative technique, capable of transforming noise into meaningful data. While theoretical convergence guarantees for diffusion models are well established when the target distribution aligns with the training distribution, practical scenarios often present mismatches. One common case is in zero-shot conditional diffusion sampling, where the target conditional distribution is different from the (unconditional) training distribution. These score-mismatched diffusion models remain largely unexplored from a theoretical perspective. In this paper, we present the first performance guarantee with explicit dimensional dependencies for general score-mismatched diffusion samplers, focusing on target distributions with finite second moments. We show that score mismatches result in an asymptotic distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions. This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise. Interestingly, the derived convergence upper bound offers useful guidance for designing a novel bias-optimal zero-shot sampler in linear conditional models that minimizes the asymptotic bias. For such bias-optimal samplers, we further establish convergence guarantees with explicit dependencies on dimension and conditioning, applied to several interesting target distributions, including those with bounded support and Gaussian mixtures. Our findings are supported by numerical studies.
Authors: Albert Gong, Kyuseong Choi, Raaz Dwivedi
Abstract: The kernel thinning (KT) algorithm of Dwivedi & Mackey (2024) provides a better-than-i.i.d. compression of a generic set of points. By generating high-fidelity coresets of size significantly smaller than the input points, KT is known to speed up unsupervised tasks like Monte Carlo integration, uncertainty quantification, and non-parametric hypothesis testing, with minimal loss in statistical accuracy. In this work, we generalize the KT algorithm to speed up supervised learning problems involving kernel methods. Specifically, we combine two classical algorithms--Nadaraya-Watson (NW) regression or kernel smoothing, and kernel ridge regression (KRR)--with KT to provide a quadratic speed-up in both training and inference times. We show how distribution compression with KT in each setting reduces to constructing an appropriate kernel, and introduce the Kernel-Thinned NW and Kernel-Thinned KRR estimators. We prove that KT-based regression estimators enjoy significantly superior computational efficiency over the full-data estimators and improved statistical efficiency over i.i.d. subsampling of the training data. En route, we also provide a novel multiplicative error guarantee for compressing with KT. We validate our design choices with both simulations and real data experiments.
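A minimal Nadaraya-Watson-on-a-coreset sketch follows; uniform subsampling stands in for the kernel thinning step (KT would select a better-than-i.i.d. coreset), and the Gaussian kernel and bandwidth are placeholders:

```python
import numpy as np

def nw_predict(x_core, y_core, x_query, bandwidth=0.3):
    """Nadaraya-Watson regression with a Gaussian kernel over a small coreset."""
    d2 = (x_query[:, None] - x_core[None, :]) ** 2
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    return (w @ y_core) / w.sum(axis=1)

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 4096)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

# Stand-in for kernel thinning: a uniformly subsampled coreset of the data.
idx = rng.choice(x.size, size=64, replace=False)
y_hat = nw_predict(x[idx], y[idx], np.linspace(-3, 3, 50))
```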
Authors: Guibin Zhang, Haonan Dong, Yuchen Zhang, Zhixun Li, Dingshuo Chen, Kai Wang, Tianlong Chen, Yuxuan Liang, Dawei Cheng, Kun Wang
Abstract: Training high-quality deep models necessitates vast amounts of data, resulting in overwhelming computational and memory demands. Recently, data pruning, distillation, and coreset selection have been developed to streamline data volume by retaining, synthesizing, or selecting a small yet informative subset from the full set. Among these methods, data pruning incurs the least additional training cost and offers the most practical acceleration benefits. However, it is the most vulnerable, often suffering significant performance degradation with imbalanced or biased data schema, thus raising concerns about its accuracy and reliability in on-device deployment. Therefore, there is a looming need for a new data pruning paradigm that maintains the efficiency of previous practices while ensuring balance and robustness. Unlike the fields of computer vision and natural language processing, where mature solutions have been developed to address these issues, graph neural networks (GNNs) continue to struggle with increasingly large-scale, imbalanced, and noisy datasets, lacking a unified dataset pruning solution. To this end, we introduce a novel dynamic soft-pruning method, GDeR, designed to update the training ``basket'' during the process using trainable prototypes. GDeR first constructs a well-modeled graph embedding hypersphere and then samples \textit{representative, balanced, and unbiased subsets} from this embedding space, which achieves what we call Graph Training Debugging. Extensive experiments on five datasets across three GNN backbones demonstrate that GDeR (I) achieves or surpasses the performance of the full dataset with 30%~50% fewer training samples, (II) attains up to a 2.81x lossless training speedup, and (III) outperforms state-of-the-art pruning methods in imbalanced training and noisy training scenarios by 0.3%~4.3% and 3.6%~7.8%, respectively.
Authors: Raisa Bentay Hossain, Farid Ahmed, Kazuma Kobayashi, Seid Koric, Diab Abueidda, Syed Bahauddin Alam
Abstract: Effective real-time monitoring techniques are crucial for detecting material degradation and maintaining the structural integrity of nuclear systems to ensure both safety and operational efficiency. Traditional physical sensor systems face limitations such as installation challenges, high costs, and difficulties in measuring critical parameters in hard-to-reach or harsh environments, often resulting in incomplete data coverage. Machine learning-driven virtual sensors offer a promising solution by enhancing physical sensor capabilities to monitor critical degradation indicators like pressure, velocity, and turbulence. However, conventional machine learning models struggle with real-time monitoring due to the high-dimensional nature of reactor data and the need for frequent retraining. This paper explores the use of Deep Operator Networks (DeepONet) within a digital twin (DT) framework to predict key thermal-hydraulic parameters in the hot leg of an AP-1000 Pressurized Water Reactor (PWR). In this study, DeepONet is trained with different operational conditions, which relaxes the requirement of continuous retraining, making it suitable for online and real-time prediction components for DT. Our results show that DeepONet achieves accurate predictions with low mean squared error and relative L2 error and can make predictions on unknown data 160,000 times faster than traditional finite element (FE) simulations. This speed and accuracy make DeepONet a powerful tool for tracking conditions that contribute to material degradation in real-time, enhancing reactor safety and longevity.
Authors: Argyrios Gerogiannis, Yu-Han Huang, Venugopal V. Veeravalli
Abstract: We study the problem of Non-Stationary Reinforcement Learning (NS-RL) without prior knowledge about the system's non-stationarity. A state-of-the-art, black-box algorithm, known as MASTER, is considered, with a focus on identifying the conditions under which it can achieve its stated goals. Specifically, we prove that MASTER's non-stationarity detection mechanism is not triggered for practical choices of horizon, leading to performance akin to a random restarting algorithm. Moreover, we show that the regret bound for MASTER, while being order optimal, stays above the worst-case linear regret until unreasonably large values of the horizon. To validate these observations, MASTER is tested for the special case of piecewise stationary multi-armed bandits, along with methods that employ random restarting, and others that use quickest change detection to restart. A simple, order-optimal random restarting algorithm that has prior knowledge of the non-stationarity is proposed as a baseline. The behavior of the MASTER algorithm is validated in simulations, and it is shown that methods employing quickest change detection are more robust and consistently outperform MASTER and other random restarting approaches.
Authors: Priyam Ganguly, Isha Mukherjee
Abstract: In retail sales forecasting, accurately predicting future sales is crucial for inventory management and strategic planning. Traditional methods like linear regression (LR) often fall short due to the complexity of sales data, which includes seasonality and numerous product families. Recent advancements in machine learning (ML) provide more robust alternatives. This research benefits from the power of ML, particularly Random Forest (RF), Gradient Boosting (GB), Support Vector Regression (SVR), and XGBoost, to improve prediction accuracy. Despite advancements, a significant gap exists in handling complex datasets with high seasonality and multiple product families. The proposed solution involves implementing and optimizing an RF model, leveraging hyperparameter tuning through randomized search cross-validation. This approach addresses the complexities of the dataset, capturing intricate patterns that traditional methods miss. The optimized RF model achieved an R-squared value of 0.945, substantially higher than the initial RF model and traditional LR, which had an R-squared of 0.531. The model reduced the root mean squared logarithmic error (RMSLE) to 1.172, demonstrating its superior predictive capability. The optimized RF model outperformed cutting-edge models like Gradient Boosting (R-squared: 0.942), SVR (R-squared: 0.940), and XGBoost (R-squared: 0.939), with lower mean squared error (MSE) and mean absolute error (MAE) values. The results demonstrate that the optimized RF model excels in forecasting retail sales, handling the dataset's complexity with higher accuracy and reliability. This research highlights the importance of advanced ML techniques in predictive analytics, offering a significant improvement over traditional methods and other contemporary models.
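The randomized-search tuning described above follows a standard scikit-learn pattern; the synthetic data and the search space below are assumptions, not the study's dataset or grid:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                       # placeholder features
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [100, 200, 400],
        "max_depth": [None, 8, 16],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=10,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```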
Authors: Michelangelo Olmo Nogara Notarianni, Filippo Leveni, Diego Stucchi, Luca Frittoli, Giacomo Boracchi
Abstract: We present Kernel-QuantTree Exponentially Weighted Moving Average (KQT-EWMA), a non-parametric change-detection algorithm that combines the Kernel-QuantTree (KQT) histogram and the EWMA statistic to monitor multivariate data streams online. The resulting monitoring scheme is very flexible, since histograms can be used to model any stationary distribution, and practical, since the distribution of test statistics does not depend on the distribution of the datastream in stationary conditions (non-parametric monitoring). KQT-EWMA enables controlling false alarms by operating at a pre-determined Average Run Length ($ARL_0$), which measures the expected number of stationary samples to be monitored before triggering a false alarm. The latter peculiarity is in contrast with most non-parametric change-detection tests, which can rarely control the $ARL_0$ a priori. Our experiments on synthetic and real-world datasets demonstrate that KQT-EWMA can control $ARL_0$ while achieving detection delays comparable to or lower than state-of-the-art methods designed to work in the same conditions.
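A stripped-down EWMA monitor conveys the flavor of the scheme; here the per-sample statistic is the raw (standardized) observation and the threshold is a fixed multiple of the asymptotic EWMA standard deviation, rather than the KQT histogram statistic and an $ARL_0$-calibrated threshold:

```python
import numpy as np

def ewma_monitor(stream, lam=0.1, k=4.0):
    """Return the first index where the EWMA of the stream leaves +/- k * sigma_z."""
    z, limit = 0.0, k * np.sqrt(lam / (2 - lam))
    for t, s in enumerate(stream):
        z = lam * s + (1 - lam) * z
        if abs(z) > limit:
            return t
    return None

rng = np.random.default_rng(0)
pre = rng.normal(size=300)               # stationary regime (standardized)
post = rng.normal(loc=1.5, size=300)     # shifted regime
# With a clear mean shift at index 300, the alarm should fire shortly after it.
print(ewma_monitor(np.concatenate([pre, post])))
```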
Authors: Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu
Abstract: Proteins are essential macromolecules defined by their amino acid sequences, which determine their three-dimensional structures and, consequently, their functions in all living organisms. Therefore, generative protein modeling necessitates a multimodal approach to simultaneously model, understand, and generate both sequences and structures. However, existing methods typically use separate models for each modality, limiting their ability to capture the intricate relationships between sequence and structure. This results in suboptimal performance in tasks that require joint understanding and generation of both modalities. In this paper, we introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. To enable structural learning with the language model, 3D coordinates are converted to discrete tokens using a lookup-free quantization-based tokenizer. By training on both experimental and high-quality synthetic structures, DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. We also implement an efficient warm-up strategy to exploit the connection between large-scale evolutionary data and structural inductive biases from pre-trained sequence-based protein language models. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures, eliminating the need for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive performance in various conditional generation tasks, including folding, inverse folding, and scaffolding with multimodal motif inputs, as well as providing structure-aware representations for predictive tasks.
Authors: Ilya Kaufman, Omri Azencot
Abstract: Transformer models have consistently achieved remarkable results in various domains such as natural language processing and computer vision. However, despite ongoing research efforts to better understand these models, the field still lacks a comprehensive understanding. This is particularly true for deep time series forecasting methods, where analysis and understanding work is relatively limited. Time series data, unlike image and text information, can be more challenging to interpret and analyze. To address this, we approach the problem from a manifold learning perspective, assuming that the latent representations of time series forecasting models lie on or near a low-dimensional manifold. In our study, we focus on analyzing the geometric features of these latent data manifolds, including intrinsic dimension and principal curvatures. Our findings reveal that deep transformer models exhibit similar geometric behavior across layers, and these geometric features are correlated with model performance. Additionally, we observe that untrained models initially have different structures, but they rapidly converge during training. By leveraging our geometric analysis and differentiable tools, we can potentially design new and improved deep forecasting neural networks. This approach complements existing analysis studies and contributes to a better understanding of transformer models in the context of time series forecasting. Code is released at https://github.com/azencot-group/GATLM.
Authors: Da Long, Zhitong Xu, Guang Yang, Akil Narayan, Shandian Zhe
Abstract: Modern physics simulation often involves multiple functions of interest, and traditional numerical approaches are known to be complex and computationally costly. While machine learning-based surrogate models can offer significant cost reductions, most focus on a single task, such as forward prediction, and typically lack uncertainty quantification -- an essential component in many applications. To overcome these limitations, we propose Arbitrarily-Conditioned Multi-Functional Diffusion (ACMFD), a versatile probabilistic surrogate model for multi-physics emulation. ACMFD can perform a wide range of tasks within a single framework, including forward prediction, various inverse problems, and simulating data for entire systems or subsets of quantities conditioned on others. Specifically, we extend the standard Denoising Diffusion Probabilistic Model (DDPM) for multi-functional generation by modeling noise as Gaussian processes (GP). We then introduce an innovative denoising loss. The training involves randomly sampling the conditioned part and fitting the corresponding predicted noise to zero, enabling ACMFD to flexibly generate function values conditioned on any other functions or quantities. To enable efficient training and sampling, and to flexibly handle irregularly sampled data, we use GPs to interpolate function samples onto a grid, inducing a Kronecker product structure for efficient computation. We demonstrate the advantages of ACMFD across several fundamental multi-physics systems.
Authors: Patrik Zavoral, Du\v{s}an Vari\v{s}, Ond\v{r}ej Bojar
Abstract: The Transformer model has a tendency to overfit various aspects of the training data, such as the overall sequence length. We study elementary string edit functions using a defined set of error indicators to interpret the behaviour of the sequence-to-sequence Transformer. We show that generalization to shorter sequences is often possible, but confirm that longer sequences are highly problematic, although partially correct answers are often obtained. Additionally, we find that other structural characteristics of the sequences, such as subsegment length, may be equally important. We hypothesize that the models learn algorithmic aspects of the tasks simultaneously with structural aspects, but that when the two come into conflict, the Transformer unfortunately often prefers to adhere to the structural aspects.
Authors: Takeru Miyato, Sindy L\"owe, Andreas Geiger, Max Welling
Abstract: It has long been known in both neuroscience and AI that ``binding'' between neurons leads to a form of competitive learning where representations are compressed in order to represent more abstract concepts in deeper layers of the network. More recently, it was also hypothesized that dynamic (spatiotemporal) representations play an important role in both neuroscience and AI. Building on these ideas, we introduce Artificial Kuramoto Oscillatory Neurons (AKOrN) as a dynamical alternative to threshold units, which can be combined with arbitrary connectivity designs such as fully connected, convolutional, or attentive mechanisms. Our generalized Kuramoto updates bind neurons together through their synchronization dynamics. We show that this idea provides performance improvements across a wide spectrum of tasks such as unsupervised object discovery, adversarial robustness, calibrated uncertainty quantification, and reasoning. We believe that these empirical results show the importance of rethinking our assumptions at the most basic neuronal level of neural representation, and in particular show the importance of dynamical representations.
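The underlying Kuramoto dynamics can be illustrated with a classical (non-learned) simulation; AKOrN replaces these scalar phases with learned, connectivity-dependent oscillatory units, which is not shown here:

```python
import numpy as np

def kuramoto_step(theta, omega, coupling, dt=0.05):
    """One Euler step of d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    phase_diff = theta[None, :] - theta[:, None]        # theta_j - theta_i
    return theta + dt * (omega + coupling * np.sin(phase_diff).mean(axis=1))

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 32)                   # initial phases
omega = rng.normal(scale=0.1, size=32)                  # natural frequencies
for _ in range(200):
    theta = kuramoto_step(theta, omega, coupling=2.0)

# Order parameter near 1 means the oscillators have synchronized ("bound" together).
print(abs(np.exp(1j * theta).mean()))
```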
Authors: Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet
Abstract: With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in the same instance at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales. After validating the relevance of rationale-parsed skills and inferring skills for $46$k instances over $12$ benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is $18\%$ more accurate in "computing molar mass", but $19\%$ less accurate in "applying constitutional law", despite the overall accuracies of the three models differing by a mere $0.4\%$. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a $3\%$ accuracy improvement over our $12$ dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
Authors: Hui Yuan, Yifan Zeng, Yue Wu, Huazheng Wang, Mengdi Wang, Liu Leqi
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model (LM) alignment. At its core, RLHF uses a margin-based loss for preference optimization, specifying ideal LM behavior only by the difference between preferred and dispreferred responses. In this paper, we identify a common pitfall of margin-based methods -- the under-specification of ideal LM behavior on preferred and dispreferred responses individually, which leads to two unintended consequences as the margin increases: (1) The probability of dispreferred (e.g., unsafe) responses may increase, resulting in potential safety alignment failures. (2) The probability of preferred responses may decrease, even when those responses are ideal. We demystify the reasons behind these problematic behaviors: margin-based losses couple the change in the preferred probability to the gradient of the dispreferred one, and vice versa, often preventing the preferred probability from increasing while the dispreferred one decreases, and thus causing a synchronized increase or decrease in both probabilities. We term this effect, inherent in margin-based objectives, gradient entanglement. Formally, we derive conditions for general margin-based alignment objectives under which gradient entanglement becomes concerning: the inner product of the gradients of preferred and dispreferred log-probabilities is large relative to the individual gradient norms. We theoretically investigate why such inner products can be large when aligning language models and empirically validate our findings. Empirical implications of our framework extend to explaining important differences in the training dynamics of various preference optimization algorithms, and suggesting potential algorithm designs to mitigate the under-specification issue of margin-based methods and thereby improving language model alignment.
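The entanglement diagnostic above (the inner product of the two gradients relative to their norms) can be computed directly; the linear "policy" below is a toy stand-in for a language model's log-probabilities:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Linear(16, 1)                      # toy scorer; an LM in practice
x_chosen, x_rejected = torch.randn(16), torch.randn(16)

logp_chosen = policy(x_chosen).squeeze()       # stand-in for log pi(preferred)
logp_rejected = policy(x_rejected).squeeze()   # stand-in for log pi(dispreferred)

g_c = torch.autograd.grad(logp_chosen, list(policy.parameters()))
g_r = torch.autograd.grad(logp_rejected, list(policy.parameters()))

flat = lambda grads: torch.cat([g.reshape(-1) for g in grads])
gc, gr = flat(g_c), flat(g_r)

# Gradient entanglement is concerning when this normalized inner product is
# large (cosine close to 1), coupling the updates to the two probabilities.
print((gc @ gr) / (gc.norm() * gr.norm()))
```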
Authors: Kajetan Schweighofer, Adrian Arnaiz-Rodriguez, Sepp Hochreiter, Nuria Oliver
Abstract: Ensembles of Deep Neural Networks, Deep Ensembles, are widely used as a simple way to boost predictive performance. However, their impact on algorithmic fairness is not well understood yet. Algorithmic fairness investigates how a model's performance varies across different groups, typically defined by protected attributes such as age, gender, or race. In this work, we investigate the interplay between the performance gains from Deep Ensembles and fairness. Our analysis reveals that they unevenly favor different groups in what we refer to as a disparate benefits effect. We empirically investigate this effect with Deep Ensembles applied to popular facial analysis and medical imaging datasets, where protected group attributes are given, and find that it occurs for multiple established group fairness metrics, including statistical parity and equal opportunity. Furthermore, we identify the per-group difference in predictive diversity of ensemble members as the potential cause of the disparate benefits effect. Finally, we evaluate different approaches to reduce unfairness due to the disparate benefits effect. Our findings show that post-processing is an effective method to mitigate this unfairness while preserving the improved performance of Deep Ensembles.
Authors: Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
Abstract: Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
Authors: Chen Bo Calvin Zhang, Zhang-Wei Hong, Aldo Pacchiano, Pulkit Agrawal
Abstract: Reward shaping is a critical component in reinforcement learning (RL), particularly for complex tasks where sparse rewards can hinder learning. While shaping rewards have been introduced to provide additional guidance, selecting effective shaping functions remains challenging and computationally expensive. This paper introduces Online Reward Selection and Policy Optimization (ORSO), a novel approach that frames shaping reward selection as an online model selection problem. ORSO employs principled exploration strategies to automatically identify promising shaping reward functions without human intervention, balancing exploration and exploitation with provable regret guarantees. We demonstrate ORSO's effectiveness across various continuous control tasks using the Isaac Gym simulator. Compared to traditional methods that fully evaluate each shaping reward function, ORSO significantly improves sample efficiency, reduces computational time, and consistently identifies high-quality reward functions that produce policies comparable to those generated by domain experts through hand-engineered rewards.
Authors: Qiaoyu Tang, Le Yu, Bowen Yu, Hongyu Lin, Keming Lu, Yaojie Lu, Xianpei Han, Le Sun
Abstract: Post-training has emerged as a crucial paradigm for adapting large-scale pre-trained models to various tasks, whose effects are fully reflected by delta parameters (i.e., the disparity between post-trained and pre-trained parameters). While numerous studies have explored delta parameter properties via operations like pruning, quantization, low-rank approximation, and extrapolation, a unified framework for systematically examining these characteristics has been lacking. In this paper, we propose a novel perspective based on Riemann sum approximation of the loss function to elucidate delta parameter editing operations. Our analysis categorizes existing methods into three classes based on their post-editing performance: competitive, decreased, and improved, explaining how they are expressed by the Riemann sum approximation term and how they alter the model performance. Extensive experiments on both visual and language models, including ViT, LLaMA 3, Qwen 2, and Mistral, corroborate our theoretical findings. Furthermore, we introduce extensions to existing techniques like DARE and BitDelta, highlighting their limitations in leveraging the properties of delta parameters and reorganizing them into general expressions to enhance the applicability and effectiveness of delta parameter editing in post-trained models.
Authors: Bruno Mlodozeniec, Runa Eschenhagen, Juhan Bae, Alexander Immer, David Krueger, Richard Turner
Abstract: Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by developing an \textit{influence functions} framework. Influence function-based data attribution methods approximate how a model's output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we systematically develop K-FAC approximations based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We recast previously proposed methods as specific design choices in our framework and show that our recommended method outperforms previous data attribution approaches on common evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.
Authors: Yifeng Wang, Xueying Zhan, Siyu Huang
Abstract: As deep learning continues to evolve, the need for data efficiency becomes increasingly important. Considering labeling large datasets is both time-consuming and expensive, active learning (AL) provides a promising solution to this challenge by iteratively selecting the most informative subsets of examples to train deep neural networks, thereby reducing the labeling cost. However, the effectiveness of different AL algorithms can vary significantly across data scenarios, and determining which AL algorithm best fits a given task remains a challenging problem. This work presents the first differentiable AL strategy search method, named AutoAL, which is designed on top of existing AL sampling strategies. AutoAL consists of two neural nets, named SearchNet and FitNet, which are optimized concurrently under a differentiable bi-level optimization framework. For any given task, SearchNet and FitNet are iteratively co-optimized using the labeled data, learning how well a set of candidate AL algorithms perform on that task. With the optimal AL strategies identified, SearchNet selects a small subset from the unlabeled pool for querying their annotations, enabling efficient training of the task model. Experimental results demonstrate that AutoAL consistently achieves superior accuracy compared to all candidate AL algorithms and other selective AL approaches, showcasing its potential for adapting and integrating multiple existing AL methods across diverse tasks and domains. Code will be available at: https://github.com/haizailache999/AutoAL.
Authors: Runzhe Wu, Yiding Chen, Gokul Swamy, Kiant\'e Brantley, Wen Sun
Abstract: Adversarial Imitation Learning is traditionally framed as a two-player zero-sum game between a learner and an adversarially chosen cost function, and can therefore be thought of as the sequential generalization of a Generative Adversarial Network (GAN). A prominent example of this framework is Generative Adversarial Imitation Learning (GAIL). However, in recent years, diffusion models have emerged as a non-adversarial alternative to GANs that merely require training a score function via regression, yet produce generations of a higher quality. In response, we investigate how to lift insights from diffusion modeling to the sequential setting. We propose diffusing states and performing score-matching along diffused states to measure the discrepancy between the expert's and learner's states. Thus, our approach only requires training score functions to predict noises via standard regression, making it significantly easier and more stable to train than adversarial methods. Theoretically, we prove first- and second-order instance-dependent bounds with linear scaling in the horizon, proving that our approach avoids the compounding errors that stymie offline approaches to imitation learning. Empirically, we show our approach outperforms GAN-style imitation learning baselines across various continuous control problems, including complex tasks like controlling humanoids to walk, sit, and crawl.
Authors: Guhao Feng, Kai Yang, Yuntian Gu, Xinyue Ai, Shengjie Luo, Jiacheng Sun, Di He, Zhenguo Li, Liwei Wang
Abstract: Despite the remarkable success of Transformer-based Large Language Models (LLMs) across various domains, understanding and enhancing their mathematical capabilities remains a significant challenge. In this paper, we conduct a rigorous theoretical analysis of LLMs' mathematical abilities, with a specific focus on their arithmetic performances. We identify numerical precision as a key factor that influences their effectiveness in mathematical tasks. Our results show that Transformers operating with low numerical precision fail to address arithmetic tasks, such as iterated addition and integer multiplication, unless the model size grows super-polynomially with respect to the input length. In contrast, Transformers with standard numerical precision can efficiently handle these tasks with significantly smaller model sizes. We further support our theoretical findings through empirical experiments that explore the impact of varying numerical precision on arithmetic tasks, providing valuable insights for improving the mathematical reasoning capabilities of LLMs.
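A small numerical illustration (not the paper's theoretical construction) of how low-precision accumulation breaks iterated addition:

```python
import numpy as np

# Add 1e-3 ten thousand times. In float16 the running sum eventually stops
# growing once the increment falls below half the gap between representable
# numbers, so the result lands far from the exact answer of 10.0; float32
# stays close.
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(10_000):
    acc16 = np.float16(acc16 + np.float16(1e-3))
    acc32 = np.float32(acc32 + np.float32(1e-3))

print(float(acc32), float(acc16))   # float16 stalls well below 10.0
```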
Authors: Kateryna Lutsai, Christoph H. Lampert
Abstract: This research aims to solve the tweet/user geolocation prediction task and provide a flexible methodology for the geotagging of textual big data. The suggested approach implements neural networks for natural language processing (NLP) to estimate the location as coordinate pairs (longitude, latitude) and two-dimensional Gaussian Mixture Models (GMMs). The proposed models have been fine-tuned on a Twitter dataset using pretrained Bidirectional Encoder Representations from Transformers (BERT) as base models. Performance metrics show a median error of less than 30 km on the worldwide-level dataset and less than 15 km on the US-level dataset for models trained and evaluated on text features of tweets' content and metadata context. Our source code and data are available at https://github.com/K4TEL/geo-twitter.git
Authors: Fengyu Gao, Ruida Zhou, Tianhao Wang, Cong Shen, Jing Yang
Abstract: Large Language Models (LLMs) rely on the contextual information embedded in examples/demonstrations to perform in-context learning (ICL). To mitigate the risk of LLMs potentially leaking private information contained in examples in the prompt, we introduce a novel data-adaptive differentially private algorithm called AdaDPSyn to generate synthetic examples from the private dataset and then use these synthetic examples to perform ICL. The objective of AdaDPSyn is to adaptively adjust the noise level in the data synthesis mechanism according to the inherent statistical properties of the data, thereby preserving high ICL accuracy while maintaining formal differential privacy guarantees. A key innovation in AdaDPSyn is the Precision-Focused Iterative Radius Reduction technique, which dynamically refines the aggregation radius - the scope of data grouping for noise addition - based on patterns observed in data clustering, thereby minimizing the amount of additive noise. We conduct extensive experiments on standard benchmarks and compare AdaDPSyn with DP few-shot generation algorithm (Tang et al., 2023). The experiments demonstrate that AdaDPSyn not only outperforms DP few-shot generation, but also maintains high accuracy levels close to those of non-private baselines, providing an effective solution for ICL with privacy protection.
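The role of the aggregation radius can be seen in a generic Gaussian-mechanism sketch (clip to a radius, add noise scaled to that radius); this omits AdaDPSyn's iterative radius refinement and its privacy accounting:

```python
import numpy as np

def private_mean(embeddings, radius, epsilon, delta):
    """DP mean of vectors clipped to `radius` via the Gaussian mechanism.

    Sensitivity of the mean is 2 * radius / n, so a smaller radius directly
    shrinks the additive noise -- the intuition behind radius reduction.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    clipped = embeddings * np.minimum(1.0, radius / norms)
    sensitivity = 2 * radius / len(embeddings)
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return clipped.mean(axis=0) + np.random.normal(scale=sigma, size=embeddings.shape[1])

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))          # placeholder embeddings of examples
print(private_mean(emb, radius=3.0, epsilon=1.0, delta=1e-5))
```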
Authors: Wei Shi, Chen Fu, Qi Xu, Sanjian Chen, Jizhe Zhang, Qinqin Zhu, Zhigang Hua, Shuang Yang
Abstract: Ads supply personalization aims to balance the revenue and user engagement, two long-term objectives in social media ads, by tailoring the ad quantity and density. In the industry-scale system, the challenge for ads supply lies in modeling the counterfactual effects of a conservative supply treatment (e.g., a small density change) over an extended duration. In this paper, we present a streamlined framework for personalized ad supply. This framework optimally utilizes information from data collection policies through the doubly robust learning. Consequently, it significantly improves the accuracy of long-term treatment effect estimates. Additionally, its low-complexity design not only results in computational cost savings compared to existing methods, but also makes it scalable for billion-scale applications. Through both offline experiments and online production tests, the framework consistently demonstrated significant improvements in top-line business metrics over months. The framework has been fully deployed to live traffic in one of the world's largest social media platforms.
Authors: Mythreyi Velmurugan, Chun Ouyang, Yue Xu, Renuka Sindhgatta, Bemali Wickramanayake, Catarina Moreira
Abstract: Explainable Artificial Intelligence (XAI) techniques are used to provide transparency to complex, opaque predictive models. However, these techniques are often designed for image and text data, and it is unclear how fit-for-purpose they are when applied to tabular data. As XAI techniques are rarely evaluated in settings with tabular data, the applicability of existing evaluation criteria and methods is also unclear and needs (re-)examination. For example, some works suggest that evaluation methods may unduly influence the evaluation results when using tabular data. This lack of clarity on evaluation procedures can lead to reduced transparency and ineffective use of XAI techniques in real world settings. In this study, we examine literature on XAI evaluation to derive guidelines on functionally-grounded assessment of local, post hoc XAI techniques. We identify 20 evaluation criteria and associated evaluation methods, and derive guidelines on when and how each criterion should be evaluated. We also identify key research gaps to be addressed by future work. Our study contributes to the body of knowledge on XAI evaluation through in-depth examination of functionally-grounded XAI evaluation protocols, and has laid the groundwork for future research on XAI evaluation.
Authors: Alison N. O'Connor, Stephen E. Ryan, Gauri Vaidya, Paul Harford, Meghana Kshirsagar
Abstract: Increased healthcare demand is significantly straining European services. Digital solutions including advanced modelling techniques offer a promising solution to optimising patient flow without impacting day-to-day healthcare provision. In this work we outline an ongoing project that aims to optimise healthcare resources using agent-based simulations.
Authors: Arya Chakraborty, Auhona Basu
Abstract: The financial domain presents a complex environment for stock market prediction, characterized by volatile patterns and the influence of multifaceted data sources. Traditional models have leveraged either Convolutional Neural Networks (CNN) for spatial feature extraction or Long Short-Term Memory (LSTM) networks for capturing temporal dependencies, with limited integration of external textual data. This paper proposes a novel Two-Level Conv-LSTM Neural Network integrated with a Large Language Model (LLM) for comprehensive stock advising. The model harnesses the strengths of Conv-LSTM for analyzing time-series data and LLM for processing and understanding textual information from financial news, social media, and reports. In the first level, convolutional layers are employed to identify local patterns in historical stock prices and technical indicators, followed by LSTM layers to capture the temporal dynamics. The second level integrates the output with an LLM that analyzes sentiment and contextual information from textual data, providing a holistic view of market conditions. The combined approach aims to improve prediction accuracy and provide contextually rich stock advising.
Authors: Saima Rathore, Jatin Chaudhary, Boning Tong, Selen Bozkurt
Abstract: Cerebral microbleeds, markers of brain damage from vascular and amyloid pathologies, are linked to cognitive decline in aging, but their role in Alzheimer's disease (AD) onset and progression remains unclear. This study aimed to explore whether the presence and location of lobar microbleeds are associated with amyloid-$\beta$ (A$\beta$)-PET, tau tangle formation (tau-PET), and longitudinal cognitive decline. We analyzed 1,573 ADNI participants with MR imaging data and information on the number and location of microbleeds. Associations between lobar microbleeds and pathology, cerebrospinal fluid (CSF), genetics, and cognition were examined, focusing on regional microbleeds and domain-specific cognitive decline using ordinary least-squares regression while adjusting for covariates. Cognitive decline was assessed with ADAS-Cog11 and its domain-specific sub-scores. Participants underwent neuropsychological testing at least twice, with a minimum two-year interval between assessments. Among the 1,573 participants (692 women, mean age 71.23 years), 373 participants had microbleeds. The presence of microbleeds was linked to cognitive decline, particularly in the semantic, language, and praxis domains for those with temporal lobe microbleeds. Microbleeds in the overall cortex were associated with language decline. Pathologically, temporal lobe microbleeds were associated with increased tau in the overall cortex, while cortical microbleeds were linked to elevated A$\beta$ in the temporal, parietal, and frontal regions. In this mixed population, microbleeds were connected to longitudinal cognitive decline, especially in semantic and language domains, and were associated with higher baseline A$\beta$ and tau pathology. These findings suggest that lobar microbleeds should be included in AD diagnostic and prognostic evaluations.
Authors: Yanan Zhang, Jiangmeng Li, Lixiang Liu, Wenwen Qiang
Abstract: Foundational Vision-Language models such as CLIP have exhibited impressive generalization in downstream tasks. However, CLIP suffers from a two-level misalignment issue, i.e., task misalignment and data misalignment, when adapting to specific tasks. Soft prompt tuning has mitigated the task misalignment, yet the data misalignment remains a challenge. To analyze the impacts of the data misalignment, we revisit the pre-training and adaptation processes of CLIP and develop a structural causal model. We discover that while we expect to capture task-relevant information for downstream tasks accurately, the task-irrelevant knowledge impacts the prediction results and hampers the modeling of the true relationships between the images and the predicted classes. As task-irrelevant knowledge is unobservable, we leverage the front-door adjustment and propose Causality-Guided Semantic Decoupling and Classification (CDC) to mitigate the interference of task-irrelevant knowledge. Specifically, we decouple semantics contained in the data of downstream tasks and perform classification based on each semantic. Furthermore, we employ the Dempster-Shafer evidence theory to evaluate the uncertainty of each prediction generated by diverse semantics. Experiments conducted in multiple different settings have consistently demonstrated the effectiveness of CDC.
Authors: Haruki Yonekura, Ren Ozeki, Hamada Rizk, Hirozumi Yamaguchi
Abstract: This paper presents a novel system for reconstructing high-resolution GPS trajectory data from truncated or synthetic low-resolution inputs, addressing the critical challenge of balancing data utility with privacy preservation in mobility applications. The system integrates transformer-based encoder-decoder models with graph convolutional networks (GCNs) to effectively capture both the temporal dependencies of trajectory data and the spatial relationships in road networks. By combining these techniques, the system is able to recover fine-grained trajectory details that are lost through data truncation or rounding, a common practice to protect user privacy. We evaluate the system on the Beijing trajectory dataset, demonstrating its superior performance over traditional map-matching algorithms and LSTM-based synthetic data generation methods. The proposed model achieves an average Fr\'echet distance of 0.198 km, significantly outperforming map-matching algorithms (0.632 km) and synthetic trajectory models (0.498 km). The results show that the system is not only capable of accurately reconstructing real-world trajectories but also generalizes effectively to synthetic data. These findings suggest that the system can be deployed in urban mobility applications, providing both high accuracy and robust privacy protection.
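The reported errors use the Fréchet distance; the standard discrete variant can be computed with a simple dynamic program (the coordinates and units below are illustrative, not the Beijing data):

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Frechet distance between polylines p and q of shape (n, 2)."""
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(ca[i - 1, j] if i > 0 else np.inf,
                       ca[i, j - 1] if j > 0 else np.inf,
                       ca[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            ca[i, j] = max(prev, d[i, j])
    return ca[-1, -1]

a = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0]])   # reconstructed trajectory
b = np.array([[0.0, 0.2], [1.0, 0.3], [2.0, 0.1]])   # ground-truth trajectory
print(discrete_frechet(a, b))
```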
Authors: Francisco M. Calatrava-Nicol\'as, Oscar Martinez Mozos
Abstract: We present a new adversarial deep learning framework for the problem of human activity recognition (HAR) using inertial sensors worn by people. Our framework incorporates a novel adversarial activity-based discrimination task that addresses inter-person variability, i.e., the fact that different people perform the same activity in different ways. Overall, our proposed framework outperforms previous approaches on three HAR datasets using a leave-one-(person)-out cross-validation (LOOCV) benchmark. Additional results demonstrate that our discrimination task yields better classification results compared to previous tasks within the same adversarial framework.
Authors: Marte Eggen, Inga Str\"umke
Abstract: Navigation is a fundamental cognitive skill extensively studied in neuroscientific experiments and has lately gained substantial interest in artificial intelligence research. Recreating the task solved by rodents in the well-established Morris Water Maze (MWM) experiment, this work applies a transformer-based architecture using deep reinforcement learning -- an approach previously unexplored in this context -- to navigate a 2D version of the maze. Specifically, the agent leverages a decoder-only transformer architecture serving as a deep Q-network performing effective decision making in the partially observable environment. We demonstrate that the proposed architecture enables the agent to efficiently learn spatial navigation strategies, overcoming challenges associated with a limited field of vision, corresponding to the visual information available to a rodent in the MWM. Demonstrating the potential of transformer-based models for enhancing navigation performance in partially observable environments, this work suggests promising avenues for future research in artificial agents whose behavior resembles that of biological agents. Finally, the flexibility of the transformer architecture in supporting varying input sequence lengths opens opportunities for gaining increased understanding of the artificial agent's inner representation of the environment.
Authors: Marc Rigter, Tarun Gupta, Agrin Hilmkil, Chao Ma
Abstract: Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce and therefore scaling-up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely-available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed-source, which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation. Our results demonstrate that if utilized correctly, pretrained video models have the potential to be powerful tools for embodied AI.
Authors: Saeed Saviz Naeini, Reda Snaiki, Teng Wu
Abstract: Coastal regions in North America face major threats from storm surges caused by hurricanes and nor'easters. Traditional numerical models, while accurate, are computationally expensive, limiting their practicality for real-time predictions. Recently, deep learning techniques have been developed for efficient simulation of time-dependent storm surge. To resolve the small scales of storm surge in both time and space over a long duration and a large area, these simulations typically need to employ oversized neural networks that struggle with the accumulation of prediction errors over successive time steps. To address these challenges, this study introduces a hierarchical deep neural network (HDNN) combined with a convolutional autoencoder (CAE) to accurately and efficiently predict storm surge time series. The CAE reduces the dimensionality of storm surge data, streamlining the learning process. HDNNs then map storm parameters to the low-dimensional representation of storm surge, allowing for sequential predictions across different time scales. Specifically, the current-level neural network is utilized to predict future states with a relatively large time step, which are passed as inputs to the next-level neural network for smaller time-step predictions. This process continues sequentially for all time steps. The results from different-level neural networks across various time steps are then stacked to acquire the entire time series of storm surge. The simulated low-dimensional representations are finally decoded back into storm surge time series. The proposed model was trained and tested using synthetic data from the North Atlantic Comprehensive Coastal Study. Results demonstrate its excellent performance in handling high-dimensional surge data while mitigating the accumulation of prediction errors over time, making it a promising tool for advancing storm surge prediction.
Authors: Belguutei Ariuntugs, Kehelwala Dewage Gayan Madurang
Abstract: In the data-driven world of actuarial science, machine learning (ML) plays a crucial role in predictive modeling, enhancing risk assessment and pricing strategies. Neural networks, specifically combined actuarial neural networks (CANN), are vital for tasks such as mortality forecasting and pricing. However, optimizing hyperparameters (e.g., learning rates, layers) is essential for resource efficiency. This study utilizes a factorial design and response surface methodology (RSM) to optimize CANN performance. RSM effectively explores the hyperparameter space and captures potential curvature, outperforming traditional grid search. Our results show accurate performance predictions, identifying critical hyperparameters. By dropping statistically insignificant hyperparameters, we reduced runs from 288 to 188, with negligible loss in accuracy, achieving near-optimal out-of-sample Poisson deviance loss.
Authors: Dwipam Katariya, Juan Manuel Origgi, Yage Wang, Thomas Caputo
Abstract: Users engage with financial services companies through multiple channels, often interacting with mobile applications, web platforms, call centers, and physical locations to service their accounts. The resulting interactions are recorded at heterogeneous temporal resolutions across these domains. This multi-channel data can be combined and encoded to create a comprehensive representation of the customer's journey for accurate intent prediction. This demands sequential learning solutions. NMT transformers achieve state-of-the-art sequential representation learning by encoding context and decoding for the next best action to represent long-range dependencies. However, three major challenges exist when combining multi-domain sequences within an encoder-decoder transformer architecture for intent prediction applications: a) aligning sequences with different sampling rates, b) learning temporal dynamics across multi-variate, multi-domain sequences, and c) combining dynamic and static sequences. We propose an encoder-decoder transformer model to address these challenges for contextual and sequential intent prediction in financial servicing applications. Our experiments show significant improvement over the existing tabular method.
Authors: Prasad Chaudhari, Aman Kumar, Chandravardhan Singh Raghaw, Mohammad Zia Ur Rehman, Nagendra Kumar
Abstract: Sentiment analysis and emotion recognition in videos are challenging tasks, given the diversity and complexity of the information conveyed in different modalities. Developing a highly competent framework that effectively addresses the distinct characteristics across various modalities is a primary concern in this domain. Previous studies on combined multimodal sentiment and emotion analysis often overlooked effective fusion for modality integration, intermodal contextual congruity, and the optimization of concatenated feature spaces, leading to suboptimal architectures. This paper presents a novel framework that leverages the multi-modal contextual information from utterances and applies metaheuristic algorithms to learn the contributing features for utterance-level sentiment and emotion prediction. Our Graph-enhanced Cross-Modal Infusion with a Metaheuristic-Driven Network (GCM-Net) integrates graph sampling and aggregation to recalibrate the modality features for video sentiment and emotion prediction. GCM-Net includes a cross-modal attention module determining intermodal interactions and utterance relevance. A harmonic optimization module employing a metaheuristic algorithm combines attended features, allowing for handling both single and multi-utterance inputs. To show the effectiveness of our approach, we have conducted extensive evaluations on three prominent multi-modal benchmark datasets, CMU MOSI, CMU MOSEI, and IEMOCAP. The experimental results demonstrate the efficacy of our proposed approach, showcasing accuracies of 91.56% and 86.95% for sentiment analysis on the MOSI and MOSEI datasets. We have performed emotion analysis for the IEMOCAP dataset, achieving an accuracy of 85.66%, which signifies substantial performance enhancements over existing methods.
Authors: Maxx Richard Rahman, Ruoxuan Liu, Wolfgang Maass
Abstract: Anomaly detection in clinical time-series holds significant potential in identifying suspicious patterns in different biological parameters. In this paper, we propose a targeted method that incorporates the clinical domain knowledge into LLMs to improve their ability to detect anomalies. We introduce the Metabolism Pathway-driven Prompting (MPP) method, which integrates the information about metabolic pathways to better capture the structural and temporal changes in biological samples. We applied our method for doping detection in sports, focusing on steroid metabolism, and evaluated using real-world data from athletes. The results show that our method improves anomaly detection performance by leveraging metabolic context, providing a more nuanced and accurate prediction of suspicious samples in athletes' profiles.
Authors: Adam Roegiest, Radha Chitta
Abstract: Finding answers to legal questions about clauses in contracts is an important form of analysis in many legal workflows (e.g., understanding market trends, due diligence, risk mitigation), but being able to do this at scale is even more important. Prior work showed that it is possible to use large language models with simple zero-shot prompts to generate structured answers to questions, which can later be incorporated into legal workflows. Such prompts, while effective on simple and straightforward clauses, fail to perform when the clauses are long and contain information not relevant to the question. In this paper, we propose two-stage prompt chaining to produce structured answers to multiple-choice and multiple-select questions and show that it is more effective than simple prompts on more nuanced legal text. We analyze situations where this technique works well and areas where further refinement is needed, especially when the underlying linguistic variations are more than can be captured by simply specifying possible answers. Finally, we discuss future research that seeks to refine this work by improving stage-one results, making them more question-specific.
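As a rough sketch of the two-stage prompt-chaining idea described above (not the paper's exact prompts), the snippet below first asks the model to extract the clause text relevant to the question and then asks it to answer from that extract only. Here call_llm is a hypothetical stand-in for whatever LLM API is used, and the prompt wording and option formatting are illustrative assumptions.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's API."""
    raise NotImplementedError

def answer_clause_question(clause: str, question: str, options: list[str]) -> str:
    # Stage 1: extract only the portion of the clause relevant to the question,
    # so long clauses with extraneous detail do not distract the second stage.
    extraction_prompt = (
        "Quote the sentences from the clause below that are relevant to the question.\n"
        f"Clause: {clause}\nQuestion: {question}\nRelevant sentences:"
    )
    relevant_text = call_llm(extraction_prompt)

    # Stage 2: answer the multiple-choice question using only the extracted text,
    # constraining the output to the allowed options for a structured answer.
    answer_prompt = (
        f"Text: {relevant_text}\nQuestion: {question}\n"
        f"Answer with exactly one of: {', '.join(options)}.\nAnswer:"
    )
    return call_llm(answer_prompt).strip()

The intermediate extraction step is what distinguishes this from a single zero-shot prompt: the second call sees a short, question-focused passage rather than the full clause.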
Authors: Jiayi Guo, Liyun Zhang, Yiqin Shen
Abstract: Automated Machine Learning (AutoML) has simplified complex ML processes such as data pre-processing, model selection, and hyper-parameter searching. However, traditional AutoML frameworks focus solely on discriminative tasks, often falling short in tackling AutoML for generative models. Additionally, these frameworks lack interpretability and user engagement during the training process, primarily due to the absence of human-centered design. This leads to a lack of transparency in final decision-making and limited user control, potentially reducing trust and adoption of AutoML methods. To address these limitations, we introduce UniAutoML, a human-centered AutoML framework that leverages Large Language Models (LLMs) to unify AutoML for both discriminative (e.g., Transformers and CNNs for classification or regression tasks) and generative tasks (e.g., fine-tuning diffusion models or LLMs). The human-centered design of UniAutoML innovatively features a conversational user interface (CUI) that facilitates natural language interactions, providing users with real-time guidance, feedback, and progress updates for better interpretability. This design enhances transparency and user control throughout the AutoML training process, allowing users to seamlessly break down or modify the model being trained. To mitigate potential risks associated with LLM-generated content, UniAutoML incorporates a safety guardline that filters inputs and censors outputs. We evaluated UniAutoML's performance and usability through experiments on eight diverse datasets and user studies involving 25 participants, demonstrating that UniAutoML not only enhances performance but also improves user control and trust. Our human-centered design bridges the gap between AutoML capabilities and user understanding, making ML more accessible to a broader audience.
Authors: Jian Chen, Ruiyi Zhang, Yufan Zhou, Jennifer Healey, Jiuxiang Gu, Zhiqiang Xu, Changyou Chen
Abstract: Automatic generation of graphical layouts is crucial for many real-world applications, including designing posters, flyers, advertisements, and graphical user interfaces. Given the incredible ability of Large language models (LLMs) in both natural language understanding and generation, we believe that we could customize an LLM to help people create compelling graphical layouts starting with only text instructions from the user. We call our method TextLap (text-based layout planning). It uses a curated instruction-based layout planning dataset (InsLap) to customize LLMs as a graphic designer. We demonstrate the effectiveness of TextLap and show that it outperforms strong baselines, including GPT-4 based methods, for image generation and graphical design benchmarks.
Authors: Ruiqing Yan, Linghan Zheng, Xingbo Du, Han Zou, Yufeng Guo, Jianfei Yang
Abstract: Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention mechanism's memory overhead. We observe that certain attention heads exhibit a distribution where the attention weights concentrate on tokens near the query token, termed recency aware, which focuses on local and short-range dependencies. Leveraging this insight, we propose RecurFormer, a novel architecture that replaces these attention heads with linear recurrent neural networks (RNNs), specifically the Mamba architecture. This replacement reduces the cache size without evicting tokens, thus maintaining generation quality. RecurFormer retains the ability to model long-range dependencies through the remaining attention heads and allows for reusing pre-trained Transformer-based LLM weights with continual training. Experiments demonstrate that RecurFormer matches the original model's performance while significantly enhancing inference efficiency. Our approach provides a practical solution to the computational challenges of Transformer-based LLM inference, making it highly attractive for tasks involving long inputs.
Authors: Vasileios Saketos, Despina-Athanasia Pantazi, Manolis Koubarakis
Abstract: We develop four versions of GreekLegalRoBERTa, which are four large language models trained on Greek legal and nonlegal text. We show that our models surpass the performance of GreekLegalBERT, GreekLegalBERT-v2, and GreekBERT in two tasks involving Greek legal documents: named entity recognition and multi-class legal topic classification. We view our work as a contribution to the study of domain-specific NLP tasks in low-resource languages, like Greek, using modern NLP techniques and methodologies.
Authors: Mahmood Hegazy
Abstract: Large language models (LLMs) excel in natural language generation but often confidently produce incorrect responses, especially in tasks like mathematical reasoning. Chain-of-thought prompting, self-verification, and multi-agent debate are among the strategies proposed to improve the reasoning and factual accuracy of LLMs. Building on Du et al.'s multi-agent debate framework, we find that multi-agent debate helps at any model scale, and that diversity of thought elicits stronger reasoning in debating LLMs. Across various model sizes, performance on mathematical reasoning tasks benefits most when diverse trained models are used. Remarkably, after 4 rounds of debate, a diverse set of medium-capacity models (Gemini-Pro, Mixtral 7BX8, and PaLM 2-M) outperforms GPT-4 on the GSM-8K benchmark, scoring 91% accuracy. By comparison, when 3 instances of Gemini-Pro are used, performance only reaches 82%. Finally, this diverse set of medium-capacity models sets a new state-of-the-art performance on the ASDiv benchmark (94%). These results underscore the idea that the future of AI is agentic, with diverse cooperating agents yielding emergent capabilities beyond even the most powerful individual models.
Authors: Yanlin Zhang, Ning Li, Quan Gan, Weinan Zhang, David Wipf, Minjie Wang
Abstract: Crafting effective features is a crucial yet labor-intensive and domain-specific task within machine learning pipelines. Fortunately, recent advancements in Large Language Models (LLMs) have shown promise in automating various data science tasks, including feature engineering. But despite this potential, evaluations thus far are primarily based on the end performance of a complete ML pipeline, providing limited insight into precisely how LLMs behave relative to human experts in feature engineering. To address this gap, we propose ELF-Gym, a framework for Evaluating LLM-generated Features. We curated a new dataset from historical Kaggle competitions, including 251 "golden" features used by top-performing teams. ELF-Gym then quantitatively evaluates LLM-generated features by measuring their impact on downstream model performance as well as their alignment with expert-crafted features through semantic and functional similarity assessments. This approach provides a more comprehensive evaluation of disparities between LLMs and human experts, while offering valuable insights into specific areas where LLMs may have room for improvement. For example, using ELF-Gym we empirically demonstrate that, in the best-case scenario, LLMs can semantically capture approximately 56% of the golden features, but at the more demanding implementation level this overlap drops to 13%. Moreover, in other cases LLMs may fail completely, particularly on datasets that require complex features, indicating broad potential pathways for improvement.
Authors: Di Wu, Siyuan Li, Chen Feng, Lu Cao, Yue Zhang, Jie Yang, Mohamad Sawan
Abstract: Recent advancements in brain-computer interfaces (BCIs) have enabled the decoding of lexical tones from intracranial recordings, offering the potential to restore the communication abilities of speech-impaired tonal language speakers. However, data heterogeneity induced by both physiological and instrumental factors poses a significant challenge for unified invasive brain tone decoding. Traditional subject-specific models, which operate under a heterogeneous decoding paradigm, fail to capture generalized neural representations and cannot effectively leverage data across subjects. To address these limitations, we introduce Homogeneity-Heterogeneity Disentangled Learning for neural Representations (H2DiLR), a novel framework that disentangles and learns both the homogeneity and heterogeneity from intracranial recordings across multiple subjects. To evaluate H2DiLR, we collected stereoelectroencephalography (sEEG) data from multiple participants reading Mandarin materials comprising 407 syllables, representing nearly all Mandarin characters. Extensive experiments demonstrate that H2DiLR, as a unified decoding paradigm, significantly outperforms the conventional heterogeneous decoding approach. Furthermore, we empirically confirm that H2DiLR effectively captures both homogeneity and heterogeneity during neural representation learning.
Authors: Agasthya Gangavarapu, Ananya Gangavarapu
Abstract: Since the onset of COVID-19, rural communities worldwide have faced significant challenges in accessing healthcare due to the migration of experienced medical professionals to urban centers. Semi-trained caregivers, such as Community Health Workers (CHWs) and Registered Medical Practitioners (RMPs), have stepped in to fill this gap, but often lack formal training. This paper proposes an advanced agentic medical assistant system designed to improve healthcare delivery in rural areas by utilizing Large Language Models (LLMs) and agentic approaches. The system is composed of five crucial components: translation, medical complexity assessment, expert network integration, final medical advice generation, and response simplification. Our innovative framework ensures context-sensitive, adaptive, and reliable medical assistance, capable of clinical triaging, diagnostics, and identifying cases requiring specialist intervention. The system is designed to handle cultural nuances and varying literacy levels, providing clear and actionable medical advice in local languages. Evaluation results using the MedQA, PubMedQA, and JAMA datasets demonstrate that this integrated approach significantly enhances the effectiveness of rural healthcare workers, making healthcare more accessible and understandable for underserved populations. All code and supplemental materials associated with the paper and IMAS are available at https://github.com/uheal/imas.
Authors: Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Hui Xiong, Ranjay Krishna
Abstract: Despite the remarkable success of Large Language Models (LLMs), evaluating their outputs' quality regarding preference remains a critical challenge. Existing works usually leverage a powerful LLM (e.g., GPT4) as the judge for comparing LLMs' outputs pairwise, yet such model-based evaluators are vulnerable to conflicting preferences, i.e., output A is judged better than B, B better than C, but C better than A, causing contradictory evaluation results. To improve model-based preference evaluation, we introduce GED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for better, non-contradictory evaluation results. In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process to eliminate cyclic inconsistencies, ensuring a directed acyclic graph (DAG) structure. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments across ten benchmark datasets show that GED outperforms baseline methods in model ranking, response selection, and model alignment tasks. Notably, GED combines weaker evaluators like Llama3-8B, Mistral-7B, and Qwen2-7B to surpass the performance of stronger evaluators like Qwen2-72B, highlighting its ability to enhance evaluation reliability and improve model performance.
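To make the aggregation-then-denoising idea concrete, the sketch below pools pairwise judgments from several evaluators into a weighted preference graph and then greedily drops the weakest edge on each remaining cycle until the graph is a DAG. This greedy rule is an illustrative stand-in for GED's denoising step, not the paper's exact procedure.

import networkx as nx

def ensemble_and_denoise(judgments):
    """judgments: list of (winner, loser) pairs pooled from several evaluators."""
    G = nx.DiGraph()
    for winner, loser in judgments:            # aggregate votes into edge weights
        if G.has_edge(winner, loser):
            G[winner][loser]["weight"] += 1
        else:
            G.add_edge(winner, loser, weight=1)

    # Greedily remove the weakest edge on any remaining cycle until the graph is a DAG.
    while not nx.is_directed_acyclic_graph(G):
        cycle = nx.find_cycle(G)
        u, v = min(cycle, key=lambda e: G[e[0]][e[1]]["weight"])[:2]
        G.remove_edge(u, v)
    return list(nx.topological_sort(G))        # a consistent ranking of the responses

votes = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "B"), ("A", "C")]
print(ensemble_and_denoise(votes))

The resulting topological order can then be used for model ranking or response selection without the contradictions present in the raw judgments.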
Authors: Andrei Cosmin Redis, Mohammadreza Fani Sani, Bahram Zarrin, Andrea Burattin
Abstract: Large language models (LLMs) hold promise for generating plans for complex tasks, but their effectiveness is limited by sequential execution, lack of control flow models, and difficulties in skill retrieval. Addressing these issues is crucial for improving the efficiency and interpretability of plan generation as LLMs become more central to automation and decision-making. We introduce a novel approach to skill learning in LLMs by integrating process mining techniques, leveraging process discovery for skill acquisition, process models for skill storage, and conformance checking for skill retrieval. Our methods enhance text-based plan generation by enabling flexible skill discovery, parallel execution, and improved interpretability. Experimental results suggest the effectiveness of our approach, with our skill retrieval method surpassing state-of-the-art accuracy baselines under specific conditions.
Authors: Michele Laurelli
Abstract: Proton-boron (p-11B) fusion presents a promising path towards sustainable, neutron-free energy generation. However, its implementation is hindered by extreme operational conditions, such as plasma temperatures exceeding billions of degrees and the complexity of controlling high-energy particles. Traditional control systems face significant challenges in managing the highly dynamic and non-linear behavior of the plasma. In this paper, we propose a novel approach utilizing backpropagation-based neural networks to autonomously control key parameters in a proton-boron fusion reactor. Our method leverages real-time feedback and learning from physical data to adapt to changing plasma conditions, offering a potential breakthrough in stable and efficient p-11B fusion. Furthermore, we expand on the scalability and generalization of our approach to other fusion systems and future AI technologies.
Authors: JongWoo Kim, SeongYeub Chu, Bryan Wong, Mun Yi
Abstract: Knowledge Tracing (KT) is vital in educational data mining, enabling personalized learning by tracking learners' knowledge states and forecasting their academic outcomes. This study introduces the LOKT (Large Language Model Option-weighted Knowledge Tracing) model to address the cold-start problem, where only limited historical data is available, using large language models (LLMs). While traditional KT models have incorporated option weights, our research extends this by integrating these weights into an LLM-based KT framework. Moving beyond the binary classification of correct and incorrect responses, we emphasize that different types of incorrect answers offer valuable insights into a learner's knowledge state. By converting these responses into text-based ordinal categories, we enable LLMs to assess learner understanding with greater clarity, although our approach focuses on the final knowledge state rather than the progression of learning over time. Using five public datasets, we demonstrate that the LOKT model sustains high predictive accuracy even with limited data, effectively addressing both "learner cold-start" and "system cold-start" scenarios. These findings showcase LOKT's potential to enhance LLM-based learning tools and support early-stage personalization.
Authors: Thomas Uriot
Abstract: Current advances in NLP have made it increasingly feasible to build applications leveraging textual data. Generally, the core of these applications relies on having a good semantic representation of text as vectors, via specialized embedding models. However, it has been shown that these embeddings capture and perpetuate biases already present in text. While a few techniques have been proposed to debias embeddings, they do not take advantage of the recent advances in context understanding of the modern embedding models. In this paper, we fill this gap by conducting a review of 19 embedding models, quantifying their biases and how well they respond to context injection as a means of debiasing. We show that higher-performing embedding models are more prone to capturing common biases, but are also better able to incorporate context. Surprisingly, we find that while models can easily embed affirmative context, they fail at embedding neutral semantics. Finally, in a retrieval task, we show that biases in embeddings can lead to undesirable outcomes. We use our new-found insights to design a simple algorithm for top $k$ retrieval where $k$ is dynamically selected.
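The abstract closes with a top-$k$ retrieval algorithm in which $k$ is selected dynamically. The snippet below is one plausible, hedged reading of such a rule: cut the ranked list at the sharpest similarity drop. The gap heuristic, the max_k cap, and the assumption of precomputed embeddings are ours, not the paper's.

import numpy as np

def dynamic_top_k(query_emb, doc_embs, max_k=20, gap_factor=2.0):
    """Return indices of retrieved documents, cutting off at the largest
    similarity gap instead of using a fixed k. The gap rule is illustrative."""
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12
    )
    order = np.argsort(-sims)[:max_k]
    ranked = sims[order]
    gaps = ranked[:-1] - ranked[1:]                  # drop-off between neighbours
    if len(gaps) == 0 or gaps.max() < gap_factor * gaps.mean():
        return order                                 # no clear elbow: keep max_k
    k = int(np.argmax(gaps)) + 1                     # cut at the sharpest drop
    return order[:k]

A rule of this kind lets clearly relevant documents through while truncating the tail where similarity scores flatten out, which is where biased but irrelevant documents tend to appear.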
Authors: Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng
Abstract: The KV-Cache technique has become the standard for the inference of large language models (LLMs). It caches states of self-attention to avoid recomputation. Yet, it is widely criticized that KV-Cache can become a bottleneck of the LLM inference system, especially when confronted with ultra-large models and long-context queries. A natural remedy is to discard the KV-Cache for less important tokens, with StreamingLLM as an example, but the used static eviction strategies cannot flexibly adapt to varying contexts. Remedies like H2O leverage accumulative attention scores to perform dynamic eviction but suffer from the attention bias issue in capturing contextual information. This paper bridges this gap by devising a parameterized KV-Cache eviction mechanism, dubbed as Attention-Gate, which accepts the whole context as input and yields eviction flags for each token to realize in-context eviction. The subsequent self-attention module proceeds according to the flags and only the KV states for the remaining tokens need to be cached. The Attention-Gates can vary among different heads and layers and be trivially plugged into pre-trained LLMs, tuned by cost-effective continual pre-training or supervised fine-tuning objectives to acquire what to discard. The computational and memory overhead introduced by Attention-Gates is minimal. Our method is validated across multiple tasks, demonstrating both efficiency and adaptability. After a highly efficient continual pre-training, it achieves higher average accuracy and evicts more tokens compared to traditional training-free methods. In supervised fine-tuning, it not only evicts many tokens but also outperforms LoRA-finetuned LLMs on some datasets, such as RTE, where it improves accuracy by 13.9% while evicting 62.8% of tokens, showing that effective eviction of redundant tokens can even enhance performance.
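A minimal PyTorch sketch of the in-context eviction idea, assuming a per-head gate that scores every token from the full context and keeps only tokens whose score clears a threshold. The module shape, the sigmoid-plus-threshold parameterization, and the toy cache pruning are illustrative assumptions rather than the paper's exact Attention-Gate design.

import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Per-head, per-token eviction flags computed from the whole context."""
    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, num_heads)

    def forward(self, hidden_states, threshold=0.5):
        # hidden_states: [batch, seq_len, hidden_dim]
        probs = torch.sigmoid(self.score(hidden_states))   # [batch, seq, heads]
        return probs > threshold                            # boolean keep-flags

# Toy usage: keep only the flagged tokens' key/value states for one head.
gate = AttentionGate(hidden_dim=64, num_heads=4)
h = torch.randn(1, 10, 64)
keys = torch.randn(1, 4, 10, 16)       # [batch, heads, seq, head_dim]
keep = gate(h)                          # [1, 10, 4]
head0_keep = keep[0, :, 0]              # flags for head 0
pruned_keys = keys[0, 0][head0_keep]    # cache only the retained tokens
print(pruned_keys.shape)

In the paper the gates are trained (via continual pre-training or supervised fine-tuning) so that the flags reflect which tokens a head actually needs; the hard threshold here only illustrates the inference-time pruning step.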
Authors: Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi
Abstract: The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a modular approach to activation steering. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion, providing inference-time control over instruction following. Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present. Additionally, we explore the compositionality of activation steering, successfully applying multiple instructions simultaneously. Finally, we demonstrate that steering vectors computed on instruction-tuned models can transfer to improve base models. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
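A hedged sketch of the steering-vector recipe described above: the vector is the difference of mean residual-stream activations captured with and without the instruction, and it is added back at a chosen layer through a forward hook at inference time. The layer index, scaling factor, and activation-capture step are placeholders; only the difference-of-means construction comes from the abstract.

import torch

def steering_vector(acts_with, acts_without):
    """acts_*: tensors of shape [num_examples, hidden_dim] holding residual-stream
    activations at a chosen layer, captured with and without the instruction."""
    return acts_with.mean(dim=0) - acts_without.mean(dim=0)

def add_steering_hook(layer_module, vec, scale=1.0):
    """Add the steering vector to a layer's output at inference time."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# Usage sketch (layer index, capture utilities, and scale are assumptions):
# vec = steering_vector(acts_with_instruction, acts_without_instruction)
# handle = add_steering_hook(model.model.layers[15], vec, scale=4.0)
# ... model.generate(...) ...
# handle.remove()

Because the intervention is additive, multiple instruction vectors can in principle be summed before being injected, which is the compositional behaviour the abstract explores.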
Authors: Sahar Iravani, Tim . O . F Conrad
Abstract: Table processing, a key task in natural language processing, has significantly benefited from recent advancements in language models (LMs). However, the capabilities of LMs in table-to-text generation, which transforms structured data into coherent narrative text, require an in-depth investigation, especially with current open-source models. This study explores the effectiveness of various in-context learning strategies in LMs across benchmark datasets, focusing on the impact of providing examples to the model. More importantly, we examine a real-world use case, offering valuable insights into practical applications. To complement traditional evaluation metrics, we employ a large language model (LLM) self-evaluation approach using chain-of-thought reasoning and assess its correlation with human-aligned metrics like BERTScore. Our findings highlight the significant impact of examples in improving table-to-text generation and suggest that, while LLM self-evaluation has potential, its current alignment with human judgment could be enhanced. This points to the need for more reliable evaluation methods.
Authors: Yifei He, Alon Benhaim, Barun Patra, Praneetha Vaddamanu, Sanchit Ahuja, Parul Chopra, Vishrav Chaudhary, Han Zhao, Xia Song
Abstract: We propose a novel scaling law for general-purpose decoder-only language models (LMs) trained on multilingual data, addressing the problem of balancing languages during multilingual pretraining. A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. To address this, we shift the focus from individual languages to language families. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio, independent of other languages in the mixture. This insight simplifies the complexity of multilingual scaling and makes the analysis scalable to an arbitrary number of languages. Building on this hypothesis, we derive a power-law relationship that links performance with dataset size, model size, and sampling ratios. This relationship enables us to predict performance across various combinations of the above three quantities, and derive the optimal sampling ratios at different model scales. To demonstrate the effectiveness and accuracy of our proposed scaling law, we perform a large-scale empirical study, training more than 100 models on 23 languages spanning 5 language families. Our experiments show that the optimal sampling ratios derived from small models (85M parameters) generalize effectively to models that are several orders of magnitude larger (1.2B parameters), offering a resource-efficient approach for multilingual LM training at scale.
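To illustrate the kind of per-family power law described above, the sketch below fits a family's test cross-entropy as a function of its sampling ratio at a fixed model and data size using scipy's curve_fit. The specific functional form (an offset power law) and the toy numbers are assumptions for illustration, not the paper's fitted law.

import numpy as np
from scipy.optimize import curve_fit

def family_loss(p, c, alpha, l_inf):
    # Hypothesized form: loss decays as a power law in the family's sampling ratio p.
    return c * p ** (-alpha) + l_inf

# Sampling ratios and measured test cross-entropy for one language family (toy numbers).
ratios = np.array([0.05, 0.1, 0.2, 0.4, 0.8])
losses = np.array([3.10, 2.70, 2.42, 2.21, 2.08])

params, _ = curve_fit(family_loss, ratios, losses, p0=(1.0, 0.5, 2.0), maxfev=10000)
c, alpha, l_inf = params
print(f"fit: loss ~= {c:.2f} * p^(-{alpha:.2f}) + {l_inf:.2f}")
# The fitted curves for all families can then be combined to choose sampling
# ratios that balance marginal loss reductions across the mixture.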
Authors: Mohammad Reza Rezaei, Maziar Hafezi, Amit Satpathy, Lovell Hodge, Ebrahim Pourjafari
Abstract: Recent advancements in QA with LLMs, such as GPT-4, have shown limitations in handling complex multi-hop queries. We propose AT-RAG, a novel multistep RAG incorporating topic modeling for efficient document retrieval and reasoning. Using BERTopic, our model dynamically assigns topics to queries, improving retrieval accuracy and efficiency. We evaluated AT-RAG on multi-hop QA benchmark datasets and a medical case-study QA task. Results show significant improvements in correctness, completeness, and relevance compared to existing methods. AT-RAG reduces retrieval time while maintaining high precision, making it suitable for general QA tasks and complex domain-specific challenges such as medical QA. The integration of topic filtering and iterative reasoning enables our model to handle intricate queries efficiently, which makes it suitable for applications that require nuanced information retrieval and decision-making.
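A minimal sketch of topic-filtered retrieval with BERTopic, in the spirit of the dynamic topic assignment described above: documents and the query are assigned topics, and the retrieval pool is restricted to same-topic documents before the usual retriever runs. The helper names, the min_topic_size value, and the outlier-topic fallback are illustrative assumptions; the iterative multi-step reasoning of AT-RAG is omitted.

from bertopic import BERTopic

def build_topic_index(documents):
    topic_model = BERTopic(min_topic_size=5)
    topics, _ = topic_model.fit_transform(documents)     # one topic id per document
    return topic_model, topics

def topic_filtered_candidates(topic_model, topics, documents, query):
    query_topics, _ = topic_model.transform([query])     # topic id for the query
    q_topic = query_topics[0]
    # Restrict the retrieval pool to documents sharing the query's topic
    # (fall back to the full corpus if the query lands in the outlier topic -1).
    if q_topic == -1:
        return documents
    return [d for d, t in zip(documents, topics) if t == q_topic]

Restricting the candidate pool this way is what reduces retrieval time: the dense retriever only scores documents that share the query's topic.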
Authors: Siu Lun Chau, Antonin Schrab, Arthur Gretton, Dino Sejdinovic, Krikamol Muandet
Abstract: We introduce credal two-sample testing, a new hypothesis testing framework for comparing credal sets -- convex sets of probability measures where each element captures aleatoric uncertainty and the set itself represents epistemic uncertainty that arises from the modeller's partial ignorance. Classical two-sample tests, which rely on comparing precise distributions, fail to address epistemic uncertainty due to partial ignorance. To bridge this gap, we generalise two-sample tests to compare credal sets, enabling reasoning for equality, inclusion, intersection, and mutual exclusivity, each offering unique insights into the modeller's epistemic beliefs. We formalise these tests as two-sample tests with nuisance parameters and introduce the first permutation-based solution for this class of problems, significantly improving upon existing methods. Our approach properly incorporates the modeller's epistemic uncertainty into hypothesis testing, leading to more robust and credible conclusions, with kernel-based implementations for real-world applications.
Authors: Dhrumil Patel, Daniel Koch, Saahil Patel, Mark M. Wilde
Abstract: Estimating the ground-state energy of Hamiltonians is a fundamental task for which it is believed that quantum computers can be helpful. Several approaches have been proposed toward this goal, including algorithms based on quantum phase estimation and hybrid quantum-classical optimizers involving parameterized quantum circuits, the latter falling under the umbrella of the variational quantum eigensolver. Here, we analyze the performance of quantum Boltzmann machines for this task, which is a less explored ansatz based on parameterized thermal states and which is not known to suffer from the barren-plateau problem. We delineate a hybrid quantum-classical algorithm for this task and rigorously prove that it converges to an $\varepsilon$-approximate stationary point of the energy function optimized over parameter space, while using a number of parameterized-thermal-state samples that is polynomial in $\varepsilon^{-1}$, the number of parameters, and the norm of the Hamiltonian being optimized. Our algorithm estimates the gradient of the energy function efficiently by means of a novel quantum circuit construction that combines classical sampling, Hamiltonian simulation, and the Hadamard test, thus overcoming a key obstacle to quantum Boltzmann machine learning that has been left open since [Amin \textit{et al.}, Phys.~Rev.~X \textbf{8}, 021050 (2018)]. Additionally supporting our main claims are calculations of the gradient and Hessian of the energy function, as well as an upper bound on the matrix elements of the latter that is used in the convergence analysis.
Authors: Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Pang Wei Koh, Jesse Dodge, Pradeep Dasigi
Abstract: Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we investigate the effectiveness of adding new skills to preexisting models by training on the new skills in isolation and later merging with the general model (e.g. using task vectors). In experiments focusing on scientific literature understanding, safety, and coding, we find that the parallel-train-then-merge procedure, which is significantly cheaper than retraining the models on updated data mixtures, is often comparably effective. Our experiments also show that parallel training is especially well-suited for enabling safety features in LMs relative to continued finetuning and retraining, as it dramatically improves model compliance with safe prompts while preserving its ability to refuse dangerous or harmful prompts.
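A hedged sketch of the parallel-train-then-merge procedure with task vectors: each skill checkpoint's difference from the base weights is added back (optionally scaled) to the general model. The scaling coefficient and the skip over non-floating-point buffers are illustrative choices, not necessarily those used in the paper.

import torch

def merge_with_task_vectors(base_state, skill_states, scale=1.0):
    """base_state: state_dict of the general model.
    skill_states: list of state_dicts fine-tuned on individual skills."""
    merged = {k: v.clone() for k, v in base_state.items()}
    for skill in skill_states:
        for name, base_param in base_state.items():
            if not torch.is_floating_point(base_param):
                continue                                  # skip integer buffers etc.
            task_vector = skill[name] - base_param        # what this skill changed
            merged[name] += scale * task_vector           # add it to the general model
    return merged

# Usage sketch:
# merged_state = merge_with_task_vectors(base.state_dict(),
#                                        [safety.state_dict(), coding.state_dict()],
#                                        scale=0.5)
# base.load_state_dict(merged_state)

Because each skill is trained in isolation, adding a new skill later only requires fine-tuning one new checkpoint and re-running the cheap merge, rather than retraining on the full updated data mixture.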
Authors: Lu Pang, Tao Sun, Weimin Lyu, Haibin Ling, Chao Chen
Abstract: Recently, backdoor attack has become an increasing security threat to deep neural networks and drawn the attention of researchers. Backdoor attacks exploit vulnerabilities in third-party pretrained models during the training phase, enabling them to behave normally for clean samples and mispredict for samples with specific triggers. Existing backdoor attacks mainly focus on balanced datasets. However, real-world datasets often follow long-tailed distributions. In this paper, for the first time, we explore backdoor attack on such datasets. Specifically, we first analyze the influence of data imbalance on backdoor attack. Based on our analysis, we propose an effective backdoor attack named Dynamic Data Augmentation Operation (D$^2$AO). We design D$^2$AO selectors to select operations depending jointly on the class, sample type (clean vs. backdoored) and sample features. Meanwhile, we develop a trigger generator to generate sample-specific triggers. Through simultaneous optimization of the backdoored model and trigger generator, guided by dynamic data augmentation operation selectors, we achieve significant advancements. Extensive experiments demonstrate that our method can achieve the state-of-the-art attack performance while preserving the clean accuracy.
Authors: Faizan Faisal, Umair Yousaf
Abstract: We present LEGAL-UQA, the first Urdu legal question-answering dataset derived from Pakistan's constitution. This parallel English-Urdu dataset includes 619 question-answer pairs, each with corresponding legal article contexts, addressing the need for domain-specific NLP resources in low-resource languages. We describe the dataset creation process, including OCR extraction, manual refinement, and GPT-4-assisted translation and generation of QA pairs. Our experiments evaluate the latest generalist language and embedding models on LEGAL-UQA, with Claude-3.5-Sonnet achieving 99.19% human-evaluated accuracy. We fine-tune mt5-large-UQA-1.0, highlighting the challenges of adapting multilingual models to specialized domains. Additionally, we assess retrieval performance, finding OpenAI's text-embedding-3-large outperforms Mistral's mistral-embed. LEGAL-UQA bridges the gap between global NLP advancements and localized applications, particularly in constitutional law, and lays the foundation for improved legal information access in Pakistan.
Authors: Zhaocheng Zhu
Abstract: Reasoning, the ability to logically draw conclusions from existing knowledge, is a hallmark of humans. Together with perception, they constitute the two major themes of artificial intelligence. While deep learning has pushed the limit of perception beyond human-level performance, progress in reasoning domains lags far behind. One fundamental reason is that reasoning problems usually have flexible structures for both knowledge and queries, and many existing models only perform well on structures seen during training. Here we aim to push the boundary of reasoning models by devising algorithms that generalize across knowledge and query structures, as well as systems that accelerate development on structured data. This thesis consists of three parts. In Part I, we study models that can inductively generalize to unseen knowledge graphs with new entity and relation vocabularies. For new entities, we propose a framework that learns neural operators in a dynamic programming algorithm computing path representations. For relations, we construct a relation graph to capture the interactions between relations, thereby converting new relations into new entities. In Part II, we propose two solutions for generalizing across multi-step queries on knowledge graphs and text, respectively. For knowledge graphs, we show that multi-step queries can be solved by multiple calls of graph neural networks and fuzzy logic operations. For text, we devise an algorithm to learn explicit knowledge as textual rules to improve large language models on multi-step queries. In Part III, we propose two systems to facilitate machine learning development on structured data. Our library treats structured data as first-class citizens and removes the barrier for developing algorithms on structured data. Our node embedding system solves the GPU memory bottleneck of embedding matrices and scales to graphs with billions of nodes.
Authors: Akshara Prabhakar, Yuanzhi Li, Karthik Narasimhan, Sham Kakade, Eran Malach, Samy Jelassi
Abstract: Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient fine-tuning of Large Language Models (LLMs). We study how different LoRA modules can be merged to achieve skill composition -- testing the performance of the merged model on a target task that involves combining multiple skills, each skill coming from a single LoRA. This setup is favorable when it is difficult to obtain training data for the target task and when it can be decomposed into multiple skills. First, we identify practically occurring use-cases that can be studied under the realm of skill composition, e.g. solving hard math-word problems with code, creating a bot to answer questions on proprietary manuals or about domain-specialized corpora. Our main contribution is to show that concatenation of LoRAs (CAT), which optimally averages LoRAs that were individually trained on different skills, outperforms existing model- and data- merging techniques; for instance on math-word problems, CAT beats these methods by an average of 43% and 12% respectively. Thus, this paper advocates model merging as an efficient way to solve compositional tasks and underscores CAT as a simple, compute-friendly and effective procedure. To our knowledge, this is the first work demonstrating the superiority of model merging over data mixing for binary skill composition tasks.
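A small sketch of what concatenating LoRAs means for a single weight matrix: stacking the A factors and weighting the B factors so that the combined low-rank update equals a weighted sum of the individual updates. The equal weights below are illustrative; CAT chooses the averaging weights optimally, which this sketch does not reproduce.

import torch

def concat_loras(lora_list, weights):
    """lora_list: [(A_i, B_i)] with A_i of shape [r_i, d_in] and B_i of shape [d_out, r_i],
    so each update is B_i @ A_i. Returns a single (A, B) whose update equals the
    weighted sum of the individual updates."""
    A_cat = torch.cat([A for A, _ in lora_list], dim=0)                   # [sum r_i, d_in]
    B_cat = torch.cat([w * B for (_, B), w in zip(lora_list, weights)], dim=1)
    return A_cat, B_cat                                                    # B_cat @ A_cat = sum_i w_i B_i A_i

# Quick check on random matrices.
A1, B1 = torch.randn(4, 16), torch.randn(16, 4)
A2, B2 = torch.randn(8, 16), torch.randn(16, 8)
A, B = concat_loras([(A1, B1), (A2, B2)], weights=[0.5, 0.5])
assert torch.allclose(B @ A, 0.5 * B1 @ A1 + 0.5 * B2 @ A2, atol=1e-4)

Unlike naively averaging the A and B factors separately (which does not average the resulting updates), concatenation preserves each skill's low-rank update exactly, at the cost of a larger combined rank.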
Authors: Jiaqi Han, Minkai Xu, Aaron Lou, Haotian Ye, Stefano Ermon
Abstract: Generative models have shown great promise in generating 3D geometric systems, which is a fundamental problem in many natural science domains such as molecule and protein design. However, existing approaches only operate on static structures, neglecting the fact that physical systems are always dynamic in nature. In this work, we propose geometric trajectory diffusion models (GeoTDM), the first diffusion model for modeling the temporal distribution of 3D geometric trajectories. Modeling such distribution is challenging as it requires capturing both the complex spatial interactions with physical symmetries and temporal correspondence encapsulated in the dynamics. We theoretically justify that diffusion models with equivariant temporal kernels can lead to density with desired symmetry, and develop a novel transition kernel leveraging SE(3)-equivariant spatial convolution and temporal attention. Furthermore, to induce an expressive trajectory distribution for conditional generation, we introduce a generalized learnable geometric prior into the forward diffusion process to enhance temporal conditioning. We conduct extensive experiments on both unconditional and conditional generation in various scenarios, including physical simulation, molecular dynamics, and pedestrian motion. Empirical results on a wide suite of metrics demonstrate that GeoTDM can generate realistic geometric trajectories with significantly higher quality.
Authors: Asir Saadat, Tasmia Binte Sogir, Md Taukir Azam Chowdhury, Syem Aziz
Abstract: Large language models (LLMs) are increasingly relied upon to solve complex mathematical word problems. However, being susceptible to hallucination, they may generate inaccurate results when presented with unanswerable questions, raising concerns about their potential harm. While GPT models are now widely used and trusted, the exploration of how they can effectively abstain from answering unanswerable math problems and the enhancement of their abstention capabilities have not been rigorously investigated. In this paper, we investigate whether GPTs can appropriately respond to unanswerable math word problems by applying prompts typically used in solvable mathematical scenarios. Our experiments utilize the Unanswerable Word Math Problem (UWMP) dataset, directly leveraging GPT model APIs. Evaluation metrics are introduced, which integrate three key factors: abstention, correctness, and confidence. Our findings reveal critical gaps in GPT models and the hallucinations they suffer from on unsolvable problems, highlighting the need for improved models capable of better managing uncertainty and complex reasoning in math word problem-solving contexts.
Authors: Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad
Abstract: Despite the significant influx of prompt-tuning techniques for generative vision-language models (VLMs), it remains unclear how sensitive these models are to lexical and semantic alterations in prompts. In this paper, we evaluate the ability of generative VLMs to understand lexical and semantic changes in text using the SugarCrepe++ dataset. We analyze the sensitivity of VLMs to lexical alterations in prompts without corresponding semantic changes. Our findings demonstrate that generative VLMs are highly sensitive to such alterations. Additionally, we show that this vulnerability affects the performance of techniques aimed at achieving consistency in their outputs.
Authors: Claudia Shi, Nicolas Beltran-Velez, Achille Nazaret, Carolina Zheng, Adri\`a Garriga-Alonso, Andrew Jesson, Maggie Makar, David M. Blei
Abstract: Large language models (LLMs) demonstrate surprising capabilities, but we do not understand how they are implemented. One hypothesis suggests that these capabilities are primarily executed by small subnetworks within the LLM, known as circuits. But how can we evaluate this hypothesis? In this paper, we formalize a set of criteria that a circuit is hypothesized to meet and develop a suite of hypothesis tests to evaluate how well circuits satisfy them. The criteria focus on the extent to which the LLM's behavior is preserved, the degree of localization of this behavior, and whether the circuit is minimal. We apply these tests to six circuits described in the research literature. We find that synthetic circuits -- circuits that are hard-coded in the model -- align with the idealized properties. Circuits discovered in Transformer models satisfy the criteria to varying degrees. To facilitate future empirical studies of circuits, we created the \textit{circuitry} package, a wrapper around the \textit{TransformerLens} library, which abstracts away lower-level manipulations of hooks and activations. The software is available at \url{https://github.com/blei-lab/circuitry}.
Authors: Adrian Ciotinga, YooJung Choi
Abstract: We introduce a novel optimal transport framework for probabilistic circuits (PCs). While it has been shown recently that divergences between distributions represented as certain classes of PCs can be computed tractably, to the best of our knowledge, there is no existing approach to compute the Wasserstein distance between probability distributions given by PCs. We consider a Wasserstein-type distance that restricts the coupling measure of the associated optimal transport problem to be a probabilistic circuit. We then develop an algorithm for computing this distance by solving a series of small linear programs and derive the circuit conditions under which this is tractable. Furthermore, we show that we can also retrieve the optimal transport plan between the PCs from the solutions to these linear programming problems. We then consider the empirical Wasserstein distance between a PC and a dataset, and show that we can estimate the PC parameters to minimize this distance through an efficient iterative algorithm.
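The distance above is computed by solving a series of small linear programs over the circuit structure. The snippet below shows only the generic discrete optimal-transport LP that such subproblems resemble, solved with scipy's linprog; it does not implement the circuit decomposition, and the marginals and cost matrix are toy values.

import numpy as np
from scipy.optimize import linprog

def discrete_wasserstein(p, q, cost):
    """Solve min <cost, T> s.t. T 1 = p, T^T 1 = q, T >= 0 as a linear program.
    p: (n,), q: (m,), cost: (n, m)."""
    n, m = cost.shape
    # Row-sum constraints (n of them) followed by column-sum constraints (m of them).
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    plan = res.x.reshape(n, m)          # the optimal transport plan
    return res.fun, plan

p = np.array([0.5, 0.5])
q = np.array([0.25, 0.75])
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
dist, plan = discrete_wasserstein(p, q, cost)
print(dist, plan)

The returned plan is the discrete analogue of the coupling measure; in the paper that coupling is itself constrained to be a probabilistic circuit, which is what makes the computation tractable for structured distributions.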
Authors: Ryan Murray, Adam Pickarski
Abstract: This work considers large-data asymptotics for t-distributed stochastic neighbor embedding (tSNE), a widely-used non-linear dimension reduction algorithm. We identify an appropriate continuum limit of the tSNE objective function, which can be viewed as a combination of a kernel-based repulsion and an asymptotically-vanishing Laplacian-type regularizer. As a consequence, we show that embeddings of the original tSNE algorithm cannot have any consistent limit as $n \to \infty$. We propose a rescaled model which mitigates the asymptotic decay of the attractive energy, and which does have a consistent limit.
Authors: Jeongyeol Kwon, Luke Dotson, Yudong Chen, Qiaomin Xie
Abstract: Previous studies on two-timescale stochastic approximation (SA) mainly focused on bounding mean-squared errors under diminishing stepsize schemes. In this work, we investigate {\it constant} stepsize schemes through the lens of Markov processes, proving that the iterates of both timescales converge to a unique joint stationary distribution in the Wasserstein metric. We derive explicit geometric and non-asymptotic convergence rates, as well as the variance and bias introduced by constant stepsizes in the presence of Markovian noise. Specifically, with two constant stepsizes $\alpha < \beta$, we show that the biases scale linearly with both stepsizes as $\Theta(\alpha)+\Theta(\beta)$ up to higher-order terms, while the variance of the slower iterate (resp., faster iterate) scales only with its own stepsize as $O(\alpha)$ (resp., $O(\beta)$). Unlike previous work, our results require no additional assumptions such as $\beta^2 \ll \alpha$ or extra dependence on dimensions. These fine-grained characterizations allow tail-averaging and extrapolation techniques to reduce variance and bias, improving the mean-squared error bound to $O(\beta^4 + \frac{1}{t})$ for both iterates.
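A toy simulation of the setting above: two coupled iterates with constant stepsizes $\alpha < \beta$, i.i.d. noise standing in for the paper's Markovian noise, and tail averaging over the second half of the run to illustrate the variance and bias reduction discussed. The linear toy dynamics are our choice, not the paper's.

import numpy as np

rng = np.random.default_rng(0)
alpha, beta, T = 0.01, 0.1, 20000           # slow and fast constant stepsizes
x, y = 0.0, 0.0                              # slow iterate x, fast iterate y
xs, ys = [], []

for t in range(T):
    noise_x, noise_y = rng.normal(scale=0.5, size=2)
    # Fast timescale tracks y*(x) = x; slow timescale drives x toward 1 using y.
    y += beta * (x - y + noise_y)
    x += alpha * (1.0 - y + noise_x)
    xs.append(x); ys.append(y)

tail = T // 2
x_bar = np.mean(xs[tail:])                   # tail averaging reduces variance
y_bar = np.mean(ys[tail:])
print(f"last iterates: x={x:.3f}, y={y:.3f}; tail averages: x_bar={x_bar:.3f}, y_bar={y_bar:.3f}")

The last raw iterates fluctuate on the order of the stepsizes, while the tail averages concentrate much more tightly around the fixed point (x, y) = (1, 1), matching the qualitative behaviour the bounds describe.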
Authors: Samuel Kiegeland, Ethan Gotlieb Wilcox, Afra Amini, David Robert Reich, Ryan Cotterell
Abstract: Numerous previous studies have sought to determine to what extent language models, pretrained on natural language text, can serve as useful models of human cognition. In this paper, we are interested in the opposite question: whether we can directly optimize a language model to be a useful cognitive model by aligning it to human psychometric data. To achieve this, we introduce a novel alignment technique in which we fine-tune a language model to implicitly optimize the parameters of a linear regressor that directly predicts humans' reading times of in-context linguistic units, e.g., phonemes, morphemes, or words, using surprisal estimates derived from the language model. Using words as a test case, we evaluate our technique across multiple model sizes and datasets and find that it improves language models' psychometric predictive power. However, we find an inverse relationship between psychometric power and a model's performance on downstream NLP tasks as well as its perplexity on held-out test data. While this latter trend has been observed before (Oh et al., 2022; Shain et al., 2024), we are the first to induce it by manipulating a model's alignment to psychometric data.
Authors: Dhananjay Ashok, Jonathan May
Abstract: Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Question Answering (QA) by studying the effects of incrementally replacing human generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be reliably improved by including as few as 125 human generated data points. We show that matching the performance gain of just a little additional human data (only 200 points) requires an order of magnitude more synthetic data and estimate price ratios at which human annotation would be a more cost-effective solution. Our results suggest that even when human annotation at scale is infeasible, there is great value to having a small proportion of the dataset being human generated.
Authors: Lai Wei, Ambuj Tewari, Michael A. Cianfrocco
Abstract: We introduce a novel extension of the contextual bandit problem, where new sets of arms can be requested with stochastic time delays and associated costs. In this setting, the learner can select multiple arms from a decision set, with each selection taking one unit of time. The problem is framed as a special case of semi-Markov decision processes (SMDPs). The arm contexts, request times, and costs are assumed to follow an unknown distribution. We consider the regret of an online learning algorithm with respect to the optimal policy that achieves the maximum average reward. By leveraging the Bellman optimality equation, we design algorithms that can effectively select arms and determine the appropriate time to request new arms, thereby minimizing their regret. Under the realizability assumption, we analyze the proposed algorithms and demonstrate that their regret upper bounds align with established results in the contextual bandit literature. We validate the algorithms through experiments on simulated data and a movie recommendation dataset, showing that their performance is consistent with theoretical analyses.
Authors: Jacob Feitelberg, Kyuseong Choi, Anish Agarwal, Raaz Dwivedi
Abstract: We introduce the problem of distributional matrix completion: Given a sparsely observed matrix of empirical distributions, we seek to impute the true distributions associated with both observed and unobserved matrix entries. This is a generalization of traditional matrix completion where the observations per matrix entry are scalar valued. To do so, we utilize tools from optimal transport to generalize the nearest neighbors method to the distributional setting. Under a suitable latent factor model on probability distributions, we establish that our method recovers the distributions in the Wasserstein norm. We demonstrate through simulations that our method is able to (i) provide better distributional estimates for an entry compared to using observed samples for that entry alone, (ii) yield accurate estimates of distributional quantities such as standard deviation and value-at-risk, and (iii) inherently support heteroscedastic noise. We also prove novel asymptotic results for Wasserstein barycenters over one-dimensional distributions.
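For intuition, the following minimal sketch shows the one-dimensional building blocks, assuming equal-size empirical samples: the 2-Wasserstein distance reduces to an L2 distance between sorted samples, and a barycenter of neighboring entries can be formed by averaging quantile functions. The data and neighbor choice are illustrative assumptions, not the paper's estimator.

    import numpy as np

    def w2_distance(x, y):
        # 2-Wasserstein distance between 1-D empirical distributions of equal size.
        xs, ys = np.sort(x), np.sort(y)
        return float(np.sqrt(np.mean((xs - ys) ** 2)))

    def w2_barycenter(samples):
        # Barycenter of several 1-D empirical distributions via quantile averaging.
        return np.stack([np.sort(s) for s in samples]).mean(axis=0)

    rng = np.random.default_rng(0)
    neighbors = [rng.normal(loc=m, scale=1.0, size=1000) for m in (0.0, 0.5, 1.0)]
    imputed = w2_barycenter(neighbors)                    # estimate for a missing entry
    truth = rng.normal(loc=0.5, scale=1.0, size=1000)
    print("W2(imputed, truth) =", round(w2_distance(imputed, truth), 3))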
Authors: Nashrah Haque, Xiang Li, Zhehui Chen, Yanzhao Wu, Lei Yu, Arun Iyengar, Wenqi Wei
Abstract: We propose a novel framework, Stable Diffusion-based Momentum Integrated Adversarial Examples (SD-MIAE), for generating adversarial examples that can effectively mislead neural network classifiers while maintaining visual imperceptibility and preserving the semantic similarity to the original class label. Our method leverages the text-to-image generation capabilities of the Stable Diffusion model by manipulating token embeddings corresponding to the specified class in its latent space. These token embeddings guide the generation of adversarial images that maintain high visual fidelity. The SD-MIAE framework consists of two phases: (1) an initial adversarial optimization phase that modifies token embeddings to produce misclassified yet natural-looking images and (2) a momentum-based optimization phase that refines the adversarial perturbations. By introducing momentum, our approach stabilizes the optimization of perturbations across iterations, enhancing both the misclassification rate and visual fidelity of the generated adversarial examples. Experimental results demonstrate that SD-MIAE achieves a high misclassification rate of 79%, improving by 35% over the state-of-the-art method while preserving the imperceptibility of adversarial perturbations and the semantic similarity to the original class label, making it a practical method for robust adversarial evaluation.
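The momentum-based refinement phase can be illustrated schematically; the sketch below uses a toy linear classifier and a toy embedding instead of Stable Diffusion token embeddings, and the step size, decay factor, and perturbation budget are assumed values rather than the paper's settings.

    import torch

    torch.manual_seed(0)
    classifier = torch.nn.Linear(16, 10)             # stand-in for the victim classifier
    embedding = torch.randn(1, 16)                   # stand-in for a token embedding
    label = torch.tensor([3])                        # original (true) class

    delta = torch.zeros_like(embedding, requires_grad=True)
    momentum = torch.zeros_like(embedding)
    mu, step, eps = 0.9, 0.05, 0.5                   # decay, step size, L_inf budget

    for _ in range(20):
        loss = torch.nn.functional.cross_entropy(classifier(embedding + delta), label)
        loss.backward()
        grad = delta.grad.detach()
        # Accumulate a normalized gradient into the momentum buffer to stabilize updates.
        momentum = mu * momentum + grad / grad.abs().sum().clamp_min(1e-12)
        with torch.no_grad():
            delta += step * momentum.sign()          # ascend the loss to induce misclassification
            delta.clamp_(-eps, eps)                  # keep the perturbation small
        delta.grad.zero_()

    print("predicted class after attack:", classifier(embedding + delta).argmax(dim=1).item())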
Authors: Felix J. Yu, Nicholas Kamp, Carlos A. Arg\"uelles
Abstract: Neutrino telescopes detect rare interactions of particles produced in some of the most extreme environments in the Universe. This is accomplished by instrumenting a cubic-kilometer volume of naturally occurring transparent medium with light sensors. Given their substantial size and the high frequency of background interactions, these telescopes amass an enormous quantity of large variance, high-dimensional data. These attributes create substantial challenges for analyzing and reconstructing interactions, particularly when utilizing machine learning (ML) techniques. In this paper, we present a novel approach, called om2vec, that employs transformer-based variational autoencoders to efficiently represent neutrino telescope events by learning compact and descriptive latent representations. We demonstrate that these latent representations offer enhanced flexibility and improved computational efficiency, thereby facilitating downstream tasks in data analysis.
Authors: Hossein Nasiri, Seda Dogan-Tusha, Muhammad Iqbal Rochman, Monisha Ghosh
Abstract: Robust classification of the operational environment of wireless devices is becoming increasingly important for wireless network optimization, particularly in a shared spectrum environment. Distinguishing between indoor and outdoor devices can enhance reliability and improve coexistence with existing outdoor incumbents. For instance, the unlicensed but shared 6 GHz band (5.925 - 7.125 GHz) enables sharing by imposing lower transmit power for indoor unlicensed devices and a spectrum coordination requirement for outdoor devices. Further, indoor devices are prohibited from using battery power, external antennas, and weatherization to prevent outdoor operations. As these rules may be circumvented, we propose a robust indoor/outdoor classification method that leverages the fact that the radio-frequency environment faced by a device is quite different indoors and outdoors. We first collect signal strength data from all cellular and Wi-Fi bands that can be received by a smartphone in various environments (indoor interior, indoor near windows, and outdoors), along with GPS accuracy, and then evaluate three machine learning (ML) methods: deep neural network (DNN), decision tree, and random forest, to perform classification into these three categories. Our results indicate that the DNN model performs best, particularly in minimizing the most important classification error, that of classifying outdoor devices as indoor interior devices.
Authors: Mathis Gerdes, Pim de Haan, Roberto Bondesan, Miranda C. N. Cheng
Abstract: Continuous normalizing flows are known to be highly expressive and flexible, which allows for easier incorporation of large symmetries and makes them a powerful tool for sampling in lattice field theories. Building on previous work, we present a general continuous normalizing flow architecture for matrix Lie groups that is equivariant under group transformations. We apply this to lattice gauge theories in two dimensions as a proof-of-principle and demonstrate competitive performance, showing its potential as a tool for future lattice sampling tasks.
Authors: Yusuke Endo, Koujin Takeda
Abstract: We propose a new method of independent component analysis (ICA) in order to extract appropriate features from high-dimensional data. In general, matrix factorization methods, including ICA, have a problem regarding the interpretability of extracted features. To improve interpretability, a sparsity constraint on the factorized matrix is considered helpful. With this background, we construct a new ICA method with sparsity. In our method, an L1-regularization term is added to the cost function of ICA, and minimization of the cost function is performed by the difference-of-convex-functions (DC) algorithm. To validate our proposed method, we apply it to synthetic data and real functional magnetic resonance imaging data.
Authors: Jameson Merkow, Felix J. Dorfner, Xiyu Yang, Alexander Ersoy, Giridhar Dasegowda, Mannudeep Kalra, Matthew P. Lungren, Christopher P. Bridge, Ivan Tarapov
Abstract: The integration of artificial intelligence (AI) into medical imaging has advanced clinical diagnostics but poses challenges in managing model drift and ensuring long-term reliability. To address these challenges, we develop MMC+, an enhanced framework for scalable drift monitoring, building upon the CheXstray framework that introduced real-time drift detection for medical imaging AI models using multi-modal data concordance. This work extends the original framework's methodologies, providing a more scalable and adaptable solution for real-world healthcare settings, and offers a reliable, cost-effective alternative to continuous performance monitoring that addresses the limitations of both continuous and periodic monitoring methods. MMC+ introduces critical improvements to the original framework, including more robust handling of diverse data streams, improved scalability with the integration of foundation models like MedImageInsight for high-dimensional image embeddings without site-specific training, and the introduction of uncertainty bounds to better capture drift in dynamic clinical environments. Validated with real-world data from Massachusetts General Hospital during the COVID-19 pandemic, MMC+ effectively detects significant data shifts and correlates them with model performance changes. While not directly predicting performance degradation, MMC+ serves as an early warning system, indicating when AI systems may deviate from acceptable performance bounds and enabling timely interventions. By emphasizing the importance of monitoring diverse data streams and evaluating data shifts alongside model performance, this work contributes to the broader adoption and integration of AI solutions in clinical settings.
Authors: Ashish Seth, Ramaneswaran Selvakumar, S Sakshi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
Abstract: In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised learning approach for speech representation learning. In contrast to prior methods that use random masking schemes for Masked Acoustic Modeling (MAM), we introduce a novel selective and adaptive masking strategy. Specifically, during SSL training, we progressively introduce harder regions to the model for reconstruction. Our approach automatically selects hard regions and is built on the observation that the reconstruction loss of individual frames in MAM can provide natural signals to judge the difficulty of solving the MAM pretext task for that frame. To identify these hard regions, we employ a teacher model that first predicts the frame-wise losses and then decides which frames to mask. By simultaneously learning to create challenging problems (identifying harder frames) and to solve them, the model learns more effective representations and thereby acquires a more comprehensive understanding of speech. Quantitatively, EH-MAM outperforms several state-of-the-art baselines across various low-resource speech recognition and SUPERB benchmarks by 5%-10%. Additionally, we conduct a thorough analysis to show that the regions masked by EH-MAM effectively capture useful context across speech frames.
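As a rough illustration of the selective masking step, the sketch below masks the frames a teacher deems hardest according to predicted per-frame losses; the mask ratio, tensor shapes, and the random "teacher" scores are assumptions for illustration only, not the authors' exact procedure.

    import torch

    torch.manual_seed(0)
    num_frames, mask_ratio = 100, 0.3
    predicted_frame_loss = torch.rand(num_frames)     # teacher's difficulty estimate per frame

    # Mask the frames judged hardest (largest predicted reconstruction loss).
    num_masked = int(mask_ratio * num_frames)
    hard = predicted_frame_loss.topk(num_masked).indices
    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[hard] = True

    # The student model would then be trained to reconstruct only the masked frames.
    print(f"masking {int(mask.sum())} of {num_frames} frames")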
Authors: Tangwen Qian, Junhe Li, Yile Chen, Gao Cong, Tao Sun, Fei Wang, Yongjun Xu
Abstract: Modeling trajectory data with generic-purpose dense representations has become a prevalent paradigm for various downstream applications, such as trajectory classification, travel time estimation and similarity computation. However, existing methods typically rely on trajectories from a single spatial view, limiting their ability to capture the rich contextual information that is crucial for gaining deeper insights into movement patterns across different geospatial contexts. To this end, we propose MVTraj, a novel multi-view modeling method for trajectory representation learning. MVTraj integrates diverse contextual knowledge, from GPS to road network and points-of-interest to provide a more comprehensive understanding of trajectory data. To align the learning process across multiple views, we utilize GPS trajectories as a bridge and employ self-supervised pretext tasks to capture and distinguish movement patterns across different spatial views. Following this, we treat trajectories from different views as distinct modalities and apply a hierarchical cross-modal interaction module to fuse the representations, thereby enriching the knowledge derived from multiple sources. Extensive experiments on real-world datasets demonstrate that MVTraj significantly outperforms existing baselines in tasks associated with various spatial views, validating its effectiveness and practical utility in spatio-temporal modeling.
Authors: Yun-Yen Chuang, Hung-Min Hsu, Kevin Lin, Chen-Sheng Gu, Ling Zhen Li, Ray-I Chang, Hung-yi Lee
Abstract: The diffusion model, a new generative modeling paradigm, has achieved significant success in generating images, audio, video, and text. It has been adapted for sequence-to-sequence text generation (Seq2Seq) through DiffuSeq, termed S2S Diffusion. Existing S2S-Diffusion models predominantly rely on fixed or hand-crafted rules to schedule noise during the diffusion and denoising processes. However, these models are limited by non-contextualized noise, which fails to fully consider the characteristics of Seq2Seq tasks. In this paper, we propose the Meta-DiffuB framework - a novel scheduler-exploiter S2S-Diffusion paradigm designed to overcome the limitations of existing S2S-Diffusion models. We employ Meta-Exploration to train an additional scheduler model dedicated to scheduling contextualized noise for each sentence. Our exploiter model, an S2S-Diffusion model, leverages the noise scheduled by our scheduler model for updating and generation. Meta-DiffuB achieves state-of-the-art performance compared to previous S2S-Diffusion models and fine-tuned pre-trained language models (PLMs) across four Seq2Seq benchmark datasets. We further investigate and visualize the impact of Meta-DiffuB's noise scheduling on the generation of sentences with varying difficulties. Additionally, our scheduler model can function as a "plug-and-play" model to enhance DiffuSeq without the need for fine-tuning during the inference stage.
Authors: Caigao Jiang, Xiang Shu, Hong Qian, Xingyu Lu, Jun Zhou, Aimin Zhou, Yang Yu
Abstract: Optimization problems are prevalent across various scenarios. Formulating and then solving optimization problems described by natural language often requires highly specialized human expertise, which can block the widespread application of optimization-based decision making. To automate problem formulation and solving, leveraging large language models (LLMs) has emerged as a promising approach. However, this approach suffers from limited optimization generalization. Namely, the accuracy of most current LLM-based methods and the generality of optimization problem types that they can model are still limited. In this paper, we propose a unified learning-based framework called LLMOPT to boost optimization generalization. Starting from the natural language descriptions of optimization problems and a pre-trained LLM, LLMOPT constructs the introduced five-element formulation as a universal model for learning to define diverse optimization problem types. Then, LLMOPT employs multi-instruction tuning to enhance both problem formalization and solver code generation accuracy and generality. After that, to prevent hallucinations in LLMs, such as sacrificing solving accuracy to avoid execution errors, a model alignment and self-correction mechanism is adopted in LLMOPT. We evaluate the optimization generalization ability of LLMOPT and competing methods across six real-world datasets covering roughly 20 fields, such as health, environment, energy, and manufacturing. Extensive experimental results show that LLMOPT is able to model various optimization problem types such as linear/nonlinear programming, mixed integer programming and combinatorial optimization, and achieves a notable 11.08% average solving accuracy improvement compared with the state-of-the-art methods. The code is available at https://github.com/caigaojiang/LLMOPT.
Authors: Weishan Cai, Wenjun Ma, Yuncheng Jiang
Abstract: The success of current Entity Alignment (EA) methods depends largely on the supervision information provided by labeled data. Considering the cost of labeled data, most supervised methods are difficult to apply in practical scenarios. Therefore, an increasing number of works based on contrastive learning, active learning, or other deep learning techniques have been developed to address the performance bottleneck caused by the lack of labeled data. However, existing unsupervised EA methods still have limitations: either their modeling complexity is high or they cannot balance the effectiveness and practicality of alignment. To overcome these issues, we propose a Simplifying and Learnable graph convolutional attention network for Unsupervised Knowledge Graph alignment (SLU). Specifically, we first introduce LCAT, a new and simple framework, as the backbone network to model the graph structure of the two KGs. Then we design a reconstruction method of relation structure based on potential matching relations to efficiently filter invalid neighborhood information of aligned entities, improving the usability and scalability of SLU. Furthermore, a similarity function based on consistency is proposed to better measure the similarity of candidate entity pairs. Finally, we conduct extensive experiments on three datasets of different sizes (15K and 100K) and different types (cross-lingual and monolingual) to verify the superiority of SLU. Experimental results show that SLU significantly improves alignment accuracy, outperforming 25 supervised or unsupervised methods and improving Hits@1 by 6.4% over the best baseline in the best case.
Authors: Kexuan Shi, Hai Chen, Leheng Zhang, Shuhang Gu
Abstract: Implicit Neural Representations (INRs), as a versatile representation paradigm, have achieved success in various computer vision tasks. Due to the spectral bias of vanilla multi-layer perceptrons (MLPs), existing methods focus on designing MLPs with sophisticated architectures or repurposing training techniques for highly accurate INRs. In this paper, we delve into the linear dynamics model of MLPs and theoretically identify the empirical Neural Tangent Kernel (eNTK) matrix as a reliable link between spectral bias and training dynamics. Based on the eNTK matrix, we propose a practical inductive gradient adjustment method, which can purposefully improve the spectral bias via an inductive generalization of the eNTK-based gradient transformation matrix. We evaluate our method on different INR tasks with various INR architectures and compare it to existing training techniques. The superior representation performance clearly validates the advantage of our proposed method. Armed with our gradient adjustment method, better INRs with more enhanced texture details and sharpened edges can be learned from data through tailored improvements to the spectral bias.
Authors: Yu-Neng Chuang, Helen Zhou, Prathusha Kameswara Sarma, Parikshit Gopalan, John Boccio, Sara Bolouki, Xia Hu
Abstract: Large language models (LLMs) have demonstrated impressive performance on several tasks and are increasingly deployed in real-world applications. However, especially in high-stakes settings, it becomes vital to know when the output of an LLM may be unreliable. Depending on whether an answer is trustworthy, a system can then choose to route the question to another expert, or otherwise fall back on a safe default behavior. In this work, we study the extent to which LLMs can reliably indicate confidence in their answers, and how this notion of confidence can translate into downstream accuracy gains. We propose Self-REF, a lightweight training strategy to teach LLMs to express confidence in whether their answers are correct in a reliable manner. Self-REF introduces confidence tokens into the LLM, from which a confidence score can be extracted. Compared to conventional approaches such as verbalizing confidence and examining token probabilities, we demonstrate empirically that confidence tokens show significant improvements in downstream routing and rejection learning tasks.
Authors: Roman Soletskyi, Marylou Gabri\'e, Bruno Loureiro
Abstract: While deep learning has expanded the possibilities for highly expressive variational families, the practical benefits of these tools for variational inference (VI) are often limited by the minimization of the traditional Kullback-Leibler objective, which can yield suboptimal solutions. A major challenge in this context is \emph{mode collapse}: the phenomenon where a model concentrates on a few modes of the target distribution during training, despite being statistically capable of expressing them all. In this work, we carry out a theoretical investigation of mode collapse for the gradient flow on Gaussian mixture models. We identify the key low-dimensional statistics characterizing the flow, and derive a closed set of low-dimensional equations governing their evolution. Leveraging this compact description, we show that mode collapse is present even in statistically favorable scenarios, and identify two key mechanisms driving it: mean alignment and vanishing weight. Our theoretical findings are consistent with the implementation of VI using normalizing flows, a class of popular generative models, thereby offering practical insights.
Authors: Ruud JG van Sloun
Abstract: Ultrasound (US) has the unique potential to offer access to medical imaging to anyone, everywhere. Devices have become ultra-portable and cost-effective, akin to the stethoscope. Nevertheless, US image quality and diagnostic efficacy are still highly operator- and patient-dependent. In difficult-to-image patients, image quality is often insufficient for reliable diagnosis. In this paper, we put forth that US imaging systems can be recast as information-seeking agents that engage in reciprocal interactions with their anatomical environment. Such agents autonomously adapt their transmit-receive sequences to fully personalize imaging and actively maximize information gain in-situ. To that end, we will show that the sequence of pulse-echo experiments that a US system performs can be interpreted as a perception-action loop: the action is the data acquisition, probing tissue with acoustic waves and recording reflections at the detection array, and perception is the inference of the anatomical and/or functional state, potentially including associated diagnostic quantities. We then equip systems with a mechanism to actively reduce uncertainty and maximize diagnostic value across a sequence of experiments, treating action and perception jointly using Bayesian inference given generative models of the environment and action-conditional pulse-echo observations. Since the representation capacity of the generative models dictates both the quality of inferred anatomical states and the effectiveness of inferred sequences of future imaging actions, we will be greatly leveraging the enormous advances in deep generative modelling that are currently disrupting many fields and society at large. Finally, we show some examples of cognitive, closed-loop, US systems that perform active beamsteering and adaptive scanline selection, based on deep generative models that track anatomical belief states.
Authors: Isack Lee, Haebin Seong
Abstract: Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as `jailbreaks', where malicious inputs can coerce LLMs into generating harmful content. To address these issues, many LLM developers have implemented various safety measures to align these models. This alignment involves several techniques, including data filtering during pre-training, supervised fine-tuning, reinforcement learning from human feedback, and red-teaming exercises. These methods often introduce deliberate and intentional biases similar to Political Correctness (PC) to ensure the ethical behavior of LLMs. In this paper, we delve into the intentional biases injected into LLMs for safety purposes and examine methods to circumvent these safety alignment techniques. Notably, these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of PCJailbreak, highlighting the inherent risks posed by these safety-induced biases. Additionally, we propose an efficient defense method PCDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. PCDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize the urgent need for LLM developers to adopt a more responsible approach when designing and implementing safety measures.
Authors: Yu Yuan, Lili Zhao, Kai Zhang, Guangting Zheng, Qi Liu
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in various natural language processing tasks. However, LLMs may rely on dataset biases as shortcuts for prediction, which can significantly impair their robustness and generalization capabilities. This paper presents Shortcut Suite, a comprehensive test suite designed to evaluate the impact of shortcuts on LLMs' performance, incorporating six shortcut types, five evaluation metrics, and four prompting strategies. Our extensive experiments yield several key findings: 1) LLMs demonstrate varying reliance on shortcuts for downstream tasks, significantly impairing their performance. 2) Larger LLMs are more likely to utilize shortcuts under zero-shot and few-shot in-context learning prompts. 3) Chain-of-thought prompting notably reduces shortcut reliance and outperforms other prompting strategies, while few-shot prompts generally underperform compared to zero-shot prompts. 4) LLMs often exhibit overconfidence in their predictions, especially when dealing with datasets that contain shortcuts. 5) LLMs generally have a lower explanation quality in shortcut-laden datasets, with errors falling into three types: distraction, disguised comprehension, and logical fallacy. Our findings offer new insights for evaluating robustness and generalization in LLMs and suggest potential directions for mitigating the reliance on shortcuts. The code is available at \url{https://github.com/yyhappier/ShortcutSuite.git}.
Authors: Vijay Prakash Dwivedi, Viktor Schlegel, Andy T. Liu, Thanh-Tung Nguyen, Abhinav Ramesh Kashyap, Jeng Wei, Wei-Hsian Yin, Stefan Winkler, Robby T. Tan
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various domains, including healthcare. However, their ability to effectively represent structured non-textual data, such as the alphanumeric medical codes used in records like ICD-10 or SNOMED-CT, is limited and has been particularly exposed in recent research. This paper examines the challenges LLMs face in processing medical codes due to the shortcomings of current tokenization methods. As a result, we introduce the UniStruct architecture to design a multimodal medical foundation model of unstructured text and structured data, which addresses these challenges by adapting subword tokenization techniques specifically for the structured medical codes. Our approach is validated through model pre-training on both an extensive internal medical database and a public repository of structured medical records. Trained on over 1 billion tokens on the internal medical database, the proposed model achieves up to a 23% improvement in evaluation metrics, with around 2% gain attributed to our proposed tokenization. Additionally, when evaluated on the EHRSHOT public benchmark with a 1/1000 fraction of the pre-training data, the UniStruct model improves performance on over 42% of the downstream tasks. Our approach not only enhances the representation and generalization capabilities of patient-centric models but also bridges a critical gap in representation learning models' ability to handle complex structured medical data, alongside unstructured text.
Authors: Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue
Abstract: The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, a lack of user-specific knowledge still restricts their application in users' daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for MLLMs' personalization. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar, and other attributes. (b) Retrieve: When the user initiates a conversation, RAP will retrieve relevant information from the database using a multimodal retriever. (c) Generate: The input query and retrieved concepts' information are fed into MLLMs to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on the dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models are available at https://github.com/Hoar012/RAP-MLLM.
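The Remember / Retrieve / Generate loop can be sketched with a toy key-value store and a toy lexical retriever; the stored records, the retriever, and the prompt format below are illustrative assumptions and not the released RAP implementation.

    # Remember: user-specific concepts stored in a small key-value database.
    database = {
        "my dog Rex": "Rex is the user's golden retriever, age 3.",
        "my office": "The user's office is room 4.12 on the third floor.",
    }

    def score(key, query):
        # Toy lexical-overlap retriever standing in for the multimodal retriever.
        return len(set(key.lower().split()) & set(query.lower().split()))

    # Retrieve: pick the stored concept most relevant to the incoming query.
    query = "What breed is my dog Rex?"
    best_key = max(database, key=lambda k: score(k, query))

    # Generate: feed the retrieved record together with the query to the (M)LLM.
    prompt = f"User-specific context: {database[best_key]}\nQuestion: {query}"
    print(prompt)

In this sketch, editing a concept amounts to updating the corresponding database entry, which is what makes real-time personalization possible without retraining.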
Authors: Kyuseong Choi, Jacob Feitelberg, Anish Agarwal, Raaz Dwivedi
Abstract: Consider a setting with multiple units (e.g., individuals, cohorts, geographic locations) and outcomes (e.g., treatments, times, items), where the goal is to learn a multivariate distribution for each unit-outcome entry, such as the distribution of a user's weekly spend and engagement under a specific mobile app version. A common challenge is the prevalence of missing not at random data, where observations are available only for certain unit-outcome combinations and the observation availability can be correlated with the properties of distributions themselves, i.e., there is unobserved confounding. An additional challenge is that for any observed unit-outcome entry, we only have a finite number of samples from the underlying distribution. We tackle these two challenges by casting the problem into a novel distributional matrix completion framework and introduce a kernel based distributional generalization of nearest neighbors to estimate the underlying distributions. By leveraging maximum mean discrepancies and a suitable factor model on the kernel mean embeddings of the underlying distributions, we establish consistent recovery of the underlying distributions even when data is missing not at random and positivity constraints are violated. Furthermore, we demonstrate that our nearest neighbors approach is robust to heteroscedastic noise, provided we have access to two or more measurements for the observed unit-outcome entries, a robustness not present in prior works on nearest neighbors with single measurements.
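The kernel-based comparison of distributions used here rests on the maximum mean discrepancy (MMD); a minimal sketch with a Gaussian kernel follows, where the bandwidth and synthetic samples are illustrative assumptions.

    import numpy as np

    def rbf_kernel(x, y, bandwidth=1.0):
        d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))

    def mmd_squared(x, y, bandwidth=1.0):
        # Biased estimator of the squared MMD between two sample sets.
        return (rbf_kernel(x, x, bandwidth).mean()
                + rbf_kernel(y, y, bandwidth).mean()
                - 2.0 * rbf_kernel(x, y, bandwidth).mean())

    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=(300, 2))
    b = rng.normal(0.0, 1.0, size=(300, 2))          # same distribution as a
    c = rng.normal(0.5, 1.0, size=(300, 2))          # shifted distribution
    print("MMD^2(a, b) =", round(mmd_squared(a, b), 4))
    print("MMD^2(a, c) =", round(mmd_squared(a, c), 4))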
Authors: Fatih Dogangun, Serdar Bahar, Yigit Yildirim, Bora Toprak Temir, Emre Ugur, Mustafa Doga Dogan
Abstract: As robots continue to enter various sectors beyond traditional industrial applications, the need for intuitive robot training and interaction systems becomes increasingly important. This paper introduces Robotic Augmented Reality for Machine Programming (RAMPA), a system that utilizes the capabilities of state-of-the-art and commercially available AR headsets, e.g., Meta Quest 3, to facilitate the application of Programming from Demonstration (PfD) approaches on industrial robotic arms, such as Universal Robots UR10. Our approach enables in-situ data recording, visualization, and fine-tuning of skill demonstrations directly within the user's physical environment. RAMPA addresses critical challenges of PfD, such as safety concerns, programming barriers, and the inefficiency of collecting demonstrations on the actual hardware. The performance of our system is evaluated against the traditional method of kinesthetic control in teaching three different robotic manipulation tasks and analyzed with quantitative metrics, measuring task performance and completion time, trajectory smoothness, system usability, user experience, and task load using standardized surveys. Our findings indicate a substantial advancement in how robotic tasks are taught and refined, promising improvements in operational safety, efficiency, and user engagement in robotic programming.
Authors: Abhishek Gupta, Amruta Parulekar, Sameep Chattopadhyay, Preethi Jyothi
Abstract: Automatic speech recognition (ASR) for low-resource languages remains a challenge due to the scarcity of labeled training data. Parameter-efficient fine-tuning and text-only adaptation are two popular methods that have been used to address such low-resource settings. In this work, we investigate how these techniques can be effectively combined using a multilingual multimodal model like SeamlessM4T. Multimodal models are able to leverage unlabeled text via text-only adaptation with further parameter-efficient ASR fine-tuning, thus boosting ASR performance. We also show cross-lingual transfer from a high-resource language, achieving up to a relative 17% WER reduction over a baseline in a zero-shot setting without any labeled speech.
Authors: Luca Rolshoven, Vishvaksenan Rasiah, Srinanda Br\"ugger Bose, Matthias St\"urmer, Joel Niklaus
Abstract: Legal research is a time-consuming task that most lawyers face on a daily basis. A large part of legal research entails looking up relevant caselaw and relating it to the case at hand. Lawyers heavily rely on summaries (also called headnotes) to find the right cases quickly. However, not all decisions are annotated with headnotes and writing them is time-consuming. Automated headnote creation has the potential to make hundreds of thousands of decisions more accessible for legal research in Switzerland alone. To kickstart this, we introduce the Swiss Leading Decision Summarization (SLDS) dataset, a novel cross-lingual resource featuring 18K court rulings from the Swiss Federal Supreme Court (SFSC), in German, French, and Italian, along with German headnotes. We fine-tune and evaluate three mT5 variants, along with proprietary models. Our analysis highlights that while proprietary models perform well in zero-shot and one-shot settings, fine-tuned smaller models still provide a strong competitive edge. We publicly release the dataset to facilitate further research in multilingual legal summarization and the development of assistive technologies for legal professionals.
Authors: Ronja Stern, Ken Kawamura, Matthias St\"urmer, Ilias Chalkidis, Joel Niklaus
Abstract: Predicting case criticality helps legal professionals in the court system manage large volumes of case law. This paper introduces the Criticality Prediction dataset, a new resource for evaluating the potential influence of Swiss Federal Supreme Court decisions on future jurisprudence. Unlike existing approaches that rely on resource-intensive manual annotations, we semi-automatically derive labels leading to a much larger dataset than otherwise possible. Our dataset features a two-tier labeling system: (1) the LD-Label, which identifies cases published as Leading Decisions (LD), and (2) the Citation-Label, which ranks cases by their citation frequency and recency. This allows for a more nuanced evaluation of case importance. We evaluate several multilingual models, including fine-tuned variants and large language models, and find that fine-tuned models consistently outperform zero-shot baselines, demonstrating the need for task-specific adaptation. Our contributions include the introduction of this task and the release of a multilingual dataset to the research community.
Authors: Dibyanayan Bandyopadhyay, Mohammed Hasanuzzaman, Asif Ekbal
Abstract: Detecting offensive memes is crucial, yet standard deep neural network systems often remain opaque. Various input attribution-based methods attempt to interpret their behavior, but they face challenges with implicitly offensive memes and non-causal attributions. To address these issues, we propose a framework based on a Structural Causal Model (SCM). In this framework, VisualBERT is trained to predict the class of an input meme based on both meme input and causal concepts, allowing for transparent interpretation. Our qualitative evaluation demonstrates the framework's effectiveness in understanding model behavior, particularly in determining whether the model was right due to the right reason, and in identifying reasons behind misclassification. Additionally, quantitative analysis assesses the significance of proposed modelling choices, such as de-confounding, adversarial learning, and dynamic routing, and compares them with input attribution methods. Surprisingly, we find that input attribution methods do not guarantee causality within our framework, raising questions about their reliability in safety-critical applications. The project page is at: https://newcodevelop.github.io/causality_adventure/
Authors: Ke Duan, Kai Yang, Houde Liu, Xueqian Wang
Abstract: In reinforcement learning, agents collect state information and rewards through environmental interactions, which is essential for policy refinement. This process is notably time-consuming, especially in complex robotic simulations and real-world applications. Traditional algorithms usually re-engage with the environment after processing a single batch of samples, thereby failing to fully capitalize on historical data. However, frequently observed states, with reliable value estimates, require minimal updates; in contrast, rarely observed states necessitate more intensive updates to achieve accurate value estimates. To address uneven sample utilization, we propose Novelty-guided Sample Reuse (NSR). NSR provides extra updates for infrequent, novel states and skips additional updates for frequent states, maximizing sample use before interacting with the environment again. Our experiments show that NSR improves the convergence rate and success rate of algorithms without significantly increasing time consumption. Our code is publicly available at https://github.com/ppksigs/NSR-DDPG-HER.
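A minimal sketch of the reuse rule follows, assuming novelty is approximated by a simple visit count over coarsely discretized states; the discretization and the update budget are illustrative assumptions rather than NSR's actual novelty measure.

    import numpy as np

    rng = np.random.default_rng(0)
    visit_counts = {}

    def num_updates(state, max_updates=4):
        # More gradient updates for rarely seen (novel) states, fewer for frequent ones.
        key = tuple(np.round(state, 1))               # coarse discretization as a novelty proxy
        visit_counts[key] = visit_counts.get(key, 0) + 1
        novelty = 1.0 / np.sqrt(visit_counts[key])
        return max(1, int(round(max_updates * novelty)))

    batch = np.vstack([rng.normal(size=2) for _ in range(3)] * 2)  # each state appears twice
    print([num_updates(s) for s in batch])            # repeated states receive fewer updates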
Authors: Rahimanuddin Shaik, Katikela Sreeharsha Kishore
Abstract: Text generation is the automated process of producing written or spoken language using computational methods. It involves generating coherent and contextually relevant text based on predefined rules or learned patterns. However, challenges in text generation arise from maintaining coherence, ensuring diversity and creativity, and avoiding biases or inappropriate content. This paper develops a novel approach to improve text generation in the context of joint Natural Language Generation (NLG) and Natural Language Understanding (NLU) learning. The data is prepared by gathering and preprocessing annotated datasets, including cleaning, tokenization, stemming, and stop-word removal. Feature extraction techniques such as POS tagging, bag-of-words, and Term Frequency-Inverse Document Frequency (TF-IDF) are applied. Transformer-based encoders and decoders are used, capturing long-range dependencies and improving source-target sequence modelling. Pre-trained language models like Optimized BERT are incorporated, along with a Hybrid Redfox Artificial Hummingbird Algorithm (HRAHA). Reinforcement learning with policy gradient techniques, semi-supervised training, improved attention mechanisms, and differentiable approximations like the straight-through Gumbel-Softmax estimator are employed to fine-tune the models and handle complex linguistic tasks effectively. The proposed model is implemented using Python.
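The straight-through Gumbel-Softmax estimator mentioned above can be illustrated with PyTorch's built-in implementation; the logits below are random placeholders for a generator's vocabulary scores, and the temperature is an assumed value.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    logits = torch.randn(4, 10, requires_grad=True)   # (batch, vocabulary) scores

    # hard=True yields one-hot samples in the forward pass while gradients flow
    # through the soft relaxation (the straight-through trick).
    samples = F.gumbel_softmax(logits, tau=0.5, hard=True)
    loss = samples.sum()                              # placeholder downstream objective
    loss.backward()

    print("first sample (one-hot):", samples[0].tolist())
    print("gradient reaches logits:", logits.grad is not None)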
Authors: Dominik Hirner, Friedrich Fraundorfer
Abstract: Stereo estimation has made many advancements in recent years with the introduction of deep learning. However, the traditional supervised approach to deep learning requires the creation of accurate and plentiful ground-truth data, which is expensive to create and not available in many situations. This is especially true for remote sensing applications, where there is an excess of available data without proper ground truth. To tackle this problem, we propose a self-supervised CNN with self-improving adaptive abilities. In the first iteration, the created disparity map is inaccurate and noisy. Leveraging the left-right consistency check, we get a sparse but more accurate disparity map, which is used as an initial pseudo ground-truth. This pseudo ground-truth is then adapted and updated after every epoch in the training step of the network. We use the sum of inconsistent points in order to track the network convergence. The code for our method is publicly available at: https://github.com/thedodo/SAda-Net
URLs: https://github.com/thedodo/SAda-Net
Authors: Efimia Panagiotaki, Georgi Pramatarov, Lars Kunze, Daniele De Martini
Abstract: This paper proposes a method for on-demand scenario generation in simulation, grounded on real-world data. Evaluating the behaviour of Autonomous Vehicles (AVs) in both safety-critical and regular scenarios is essential for assessing their robustness before real-world deployment. By integrating scenarios derived from real-world datasets into the simulation, we enhance the plausibility and validity of testing sets. This work introduces a novel approach that employs temporal scene graphs to capture evolving spatiotemporal relationships among scene entities from a real-world dataset, enabling the generation of dynamic scenarios in simulation through Graph Neural Networks (GNNs). User-defined action and criticality conditioning are used to ensure flexible, tailored scenario creation. Our model significantly outperforms the benchmarks in accurately predicting links corresponding to the requested scenarios. We further evaluate the validity and compatibility of our generated scenarios in an off-the-shelf simulator.
Authors: Muhammad Saad Nawaz, Thomas Dallmann, Torsten Schoen, Dirk Heberling
Abstract: For the validation and verification of automotive radars, datasets of realistic traffic scenarios are required, which, however, are laborious to acquire. In this paper, we introduce radar scene synthesis using GANs as an alternative to real dataset acquisition and simulation-based approaches. We train a PointNet++ based GAN model to generate realistic radar point cloud scenes and use a binary classifier to evaluate the performance of scenes generated using this model against a test set of real scenes. We demonstrate that our GAN model achieves similar performance (~87%) to the real scenes test set.
Authors: Luyao Zou, Yu Min Park, Chu Myaet Thwal, Yan Kyaw Tun, Zhu Han, Choong Seon Hong
Abstract: Low Earth orbit (LEO) satellites are capable of gathering abundant Earth observation data (EOD) to enable different Internet of Things (IoT) applications. However, to accomplish an effective EOD processing mechanism, it is imperative to investigate: 1) the challenge of processing the observed data without transmitting those large-size data to the ground because the connection between the satellites and the ground stations is intermittent, and 2) the challenge of processing the non-independent and identically distributed (non-IID) satellite data. In this paper, to cope with those challenges, we propose an orbit-based spectral clustering-assisted clustered federated self-knowledge distillation (OSC-FSKD) approach for each orbit of an LEO satellite constellation, which retains the advantage of federated learning (FL) that the observed data does not need to be sent to the ground. Specifically, we introduce normalized Laplacian-based spectral clustering (NLSC) into FL to create clustered FL in each round to address the challenge resulting from non-IID data. Particularly, NLSC is adopted to dynamically group clients into several clusters based on cosine similarities calculated by model updates. In addition, self-knowledge distillation is utilized to construct each local client, where the most recently updated local model is used to guide current local model training. Experiments demonstrate that the observation accuracy obtained by the proposed method is 1.01x, 2.15x, 1.10x, and 1.03x higher than that of the pFedSD, FedProx, FedAU, and FedALA approaches, respectively, on the SAT4 dataset. The proposed method also shows superiority when using other datasets.
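The clustering step can be sketched as follows: clients are grouped by the cosine similarity of their model updates using spectral clustering on a precomputed affinity matrix. The toy updates, the shift of the affinity into a non-negative range, and the cluster count are illustrative assumptions, not the paper's configuration.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    rng = np.random.default_rng(0)
    # Toy "model updates" for 12 clients drawn around two distinct directions.
    base = rng.normal(size=(2, 50))
    updates = np.vstack([base[i % 2] + 0.1 * rng.normal(size=50) for i in range(12)])

    # Cosine-similarity affinity, shifted into [0, 1] so it is non-negative.
    unit = updates / np.linalg.norm(updates, axis=1, keepdims=True)
    affinity = (unit @ unit.T + 1.0) / 2.0

    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(affinity)
    print("cluster assignments:", labels)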
Authors: Shaikat Galib, Shanshan Wang, Guanshuo Xu, Pascal Pfeiffer, Ryan Chesler, Mark Landry, Sri Satish Ambati
Abstract: Smaller vision-language models (VLMs) are becoming increasingly important for privacy-focused, on-device applications due to their ability to run efficiently on consumer hardware for processing enterprise commercial documents and images. These models require strong language understanding and visual capabilities to enhance human-machine interaction. To address this need, we present H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs using 240 hours of compute on 8 x H100 GPUs. H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition, achieving state-of-the-art performance on the Text Recognition portion of OCRBench and surpassing much larger models in this area. Additionally, we are releasing H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, exhibiting highly competitive metrics across various academic benchmarks. Both models build upon our prior work with H2O-Danube language models, extending their capabilities into the visual domain. We release them under the Apache 2.0 license, making VLMs accessible to everyone, democratizing document AI and visual LLMs.
Authors: Yiming Wang, Pei Zhang, Baosong Yang, Derek F. Wong, Rui Wang
Abstract: LLM self-evaluation relies on the LLM's own ability to estimate response correctness, which can greatly improve its deployment reliability. In this research track, we propose the Chain-of-Embedding (CoE) in the latent space to enable LLMs to perform output-free self-evaluation. CoE consists of all progressive hidden states produced during inference, which can be treated as the latent thinking path of LLMs. We find that the CoE features differ between correct and incorrect responses, and these discrepancies help us estimate LLM response correctness. Experiments on four diverse domains and seven LLMs fully demonstrate the effectiveness of our method. Meanwhile, its label-free design, which requires no training, and its millisecond-level computational cost ensure real-time feedback in large-scale scenarios. More importantly, we provide interesting insights into LLM response correctness from the perspective of hidden state changes inside LLMs.
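As a rough illustration of how a latent "thinking path" can be summarized, the sketch below uses synthetic per-layer hidden states of a final token and measures the step sizes between consecutive layers; the actual CoE features are the paper's own design, and the shapes and statistics here are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    num_layers, hidden_dim = 33, 128                              # e.g., embedding layer + 32 blocks
    hidden_states = rng.normal(size=(num_layers, hidden_dim))     # final-token state at each layer

    # Step sizes along the layer-by-layer trajectory in latent space.
    steps = np.linalg.norm(np.diff(hidden_states, axis=0), axis=1)
    print("mean step:", round(float(steps.mean()), 3),
          "| trajectory length:", round(float(steps.sum()), 2))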
Authors: Shengbin Ye, Meng Li
Abstract: Symbolic regression (SR) is a powerful technique for discovering symbolic expressions that characterize nonlinear relationships in data, gaining increasing attention for its interpretability, compactness, and robustness. However, existing SR methods do not scale to datasets with a large number of input variables (referred to as extreme-scale SR), which are common in modern scientific applications. This ``large $p$'' setting, often accompanied by measurement error, leads to slow performance of SR methods and overly complex expressions that are difficult to interpret. To address this scalability challenge, we propose a method called PAN+SR, which combines a key idea of ab initio nonparametric variable selection with SR to efficiently pre-screen large input spaces and reduce search complexity while maintaining accuracy. The use of nonparametric methods eliminates model misspecification, supporting a strategy called parametric-assisted nonparametric (PAN). We also extend SRBench, an open-source benchmarking platform, by incorporating high-dimensional regression problems with various signal-to-noise ratios. Our results demonstrate that PAN+SR consistently enhances the performance of 17 contemporary SR methods, enabling several to achieve state-of-the-art performance on these challenging datasets.
Authors: Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li
Abstract: Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented, leading to harmful generations. In light of this, recent research on safety mechanisms has emerged, revealing that when safety representations or components are suppressed, the safety capabilities of LLMs are compromised. However, existing research tends to overlook the safety impact of multi-head attention mechanisms, despite their crucial role in various model functionalities. Hence, in this paper, we aim to explore the connection between standard attention mechanisms and safety capability to fill this gap in safety-related mechanistic interpretability. We propose a novel metric tailored for multi-head attention, the Safety Head ImPortant Score (Ships), to assess individual heads' contributions to model safety. Based on this, we generalize Ships to the dataset level and further introduce the Safety Attention Head AttRibution Algorithm (Sahara) to attribute the critical safety attention heads inside the model. Our findings show that specific attention heads have a significant impact on safety. Ablating a single safety head allows an aligned model (e.g., Llama-2-7b-chat) to respond to 16 times more harmful queries, while only modifying 0.006% of the parameters, in contrast to the ~5% modification required in previous studies. More importantly, through comprehensive experiments, we demonstrate that attention heads primarily function as feature extractors for safety and that models fine-tuned from the same base model exhibit overlapping safety heads. Together, our attribution approach and findings provide a novel perspective for unpacking the black box of safety mechanisms within large models.
Authors: Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Dimitry Vengertsev, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, Yuming Du
Abstract: We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.
Authors: Gen Li, Yuchen Jiao
Abstract: Score-based diffusion models have achieved remarkable empirical performance in the field of machine learning and artificial intelligence for their ability to generate high-quality new data instances from complex distributions. Improving our understanding of diffusion models, most notably convergence analysis for such models, has attracted a lot of interest. Despite many theoretical attempts, there still exists a significant gap between theory and practice. To close this gap, we establish an iteration complexity at the order of $d^{1/3}\varepsilon^{-2/3}$, which is better than $d^{5/12}\varepsilon^{-1}$, the best known complexity achieved before our work. This convergence analysis is based on a randomized midpoint method, which was first proposed for log-concave sampling (Shen and Lee, 2019), and then extended to diffusion models by Gupta et al. (2024). Our theory accommodates $\varepsilon$-accurate score estimates, and does not require log-concavity of the target distribution. Moreover, the algorithm can also be parallelized to run in only $O(\log^2(d/\varepsilon))$ parallel rounds in a similar way to prior works.
Authors: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh
Abstract: Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any real-world benchmark designed to optimize and standardize evaluations across input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show that our approach effectively aligns benchmark samples with real-world task distributions and that the model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98). We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.
Authors: Walker Byrnes, Miroslav Bogdanovic, Avi Balakirsky, Stephen Balakirsky, Animesh Garg
Abstract: Intelligent and reliable task planning is a core capability for generalized robotics, requiring a descriptive domain representation that sufficiently models all object and state information for the scene. We present CLIMB, a continual learning framework for robot task planning that leverages foundation models and execution feedback to guide domain model construction. CLIMB can build a model from a natural language description, learn non-obvious predicates while solving tasks, and store that information for future problems. We demonstrate the ability of CLIMB to improve performance in common planning environments compared to baseline methods. We also develop the BlocksWorld++ domain, a simulated environment with an easily usable real counterpart, together with a curriculum of tasks with progressing difficulty for evaluating continual learning. Additional details and demonstrations for this system can be found at https://plan-with-climb.github.io/.
Authors: Antonio Sclocchi, Alessandro Favero, Noam Itzhak Levi, Matthieu Wyart
Abstract: High-dimensional data must be highly structured to be learnable. Although the compositional and hierarchical nature of data is often put forward to explain learnability, quantitative measurements establishing these properties are scarce. Likewise, accessing the latent variables underlying such a data structure remains a challenge. In this work, we show that forward-backward experiments in diffusion-based models, where data is noised and then denoised to generate new samples, are a promising tool to probe the latent structure of data. We predict in simple hierarchical models that, in this process, changes in data occur by correlated chunks, with a length scale that diverges at a noise level where a phase transition is known to take place. Remarkably, we confirm this prediction in both text and image datasets using state-of-the-art diffusion models. Our results show how latent variable changes manifest in the data and establish how to measure these effects in real data using diffusion models.
Authors: Arvid Frydenlund
Abstract: The recently introduced path-star task is a minimal task designed to exemplify limitations to the abilities of language models (Bachmann and Nagarajan, 2024). It involves a path-star graph where multiple arms radiate from a single starting node and each node is unique. Given the start node and a specified target node that ends an arm, the task is to generate the arm containing that target node. This is straightforward for a human but surprisingly difficult for language models, which did not outperform the random baseline. The authors hypothesized this is due to a deficiency in teacher-forcing and the next-token prediction paradigm. We demonstrate the task is learnable using teacher-forcing in alternative settings and that the issue is partially due to representation. We introduce a regularization method using structured samples of the same graph but with differing target nodes, improving results across a variety of model types. We provide RASP proofs showing the task is theoretically solvable. Finally, we find settings where an encoder-only model can consistently solve the task.
Authors: Or Ordentlich, Yury Polyanskiy
Abstract: Recent work in the machine learning community has proposed multiple methods for performing lossy compression (quantization) of large matrices. This quantization is important for accelerating matrix multiplication (a main component of large language models), which is often bottlenecked by the speed of loading these matrices from memory. Unlike classical vector quantization and rate-distortion theory, the goal of these new compression algorithms is to approximate not the matrices themselves, but their matrix product. Specifically, given a pair of real matrices $A,B$, an encoder (compressor) is applied to each of them independently, producing descriptions with $R$ bits per entry. These representations are subsequently used by the decoder to estimate the matrix product $A^\top B$. In this work, we provide a non-asymptotic lower bound on the mean squared error of this approximation (as a function of rate $R$) for the case of matrices $A,B$ with iid Gaussian entries. Algorithmically, we construct a universal quantizer based on nested lattices with an explicit guarantee of approximation error for any (non-random) pair of matrices $A$, $B$ in terms of only the Frobenius norms $\|A\|_F, \|B\|_F$ and $\|A^\top B\|_F$. For iid Gaussian matrices our quantizer achieves the lower bound and is, thus, asymptotically optimal. A practical low-complexity version of our quantizer achieves performance quite close to optimal. In information-theoretic terms we derive the rate-distortion function for matrix multiplication of iid Gaussian matrices.
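To make the setting concrete, the sketch below implements a naive per-entry scalar quantizer at rate $R$ bits per entry and measures the error of the reconstructed product $A^\top B$. It illustrates only the compression-for-matrix-multiplication setup, not the nested-lattice quantizer proposed in the paper; the clipping range and matrix sizes are arbitrary choices.

```python
import numpy as np

def scalar_quantize(M, R, clip=4.0):
    """Uniform scalar quantizer: R bits per entry over the range [-clip, clip].

    A naive baseline illustrating the rate-R setting; the paper's scheme uses
    nested lattices instead of independent scalar quantization.
    """
    levels = 2 ** R
    step = 2 * clip / levels
    idx = np.clip(np.round((M + clip) / step - 0.5), 0, levels - 1)
    return (idx + 0.5) * step - clip  # reconstruct at bin centers

def product_mse(n, d, R, seed=0):
    """MSE of the estimated product A^T B when A and B are quantized at rate R."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, n))
    B = rng.standard_normal((d, n))
    A_hat, B_hat = scalar_quantize(A, R), scalar_quantize(B, R)
    err = A_hat.T @ B_hat - A.T @ B
    return float(np.mean(err ** 2))

if __name__ == "__main__":
    for R in (1, 2, 4, 8):
        print(f"R={R} bits/entry, per-entry MSE of A^T B: {product_mse(64, 256, R):.4f}")
```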
Authors: Limei Wang, Kaveh Hassani, Si Zhang, Dongqi Fu, Baichuan Yuan, Weilin Cong, Zhigang Hua, Hao Wu, Ning Yao, Bo Long
Abstract: Transformers serve as the backbone architectures of Foundational Models, where a domain-specific tokenizer helps them adapt to various domains. Graph Transformers (GTs) have recently emerged as a leading model in geometric deep learning, outperforming Graph Neural Networks (GNNs) in various graph learning tasks. However, the development of tokenizers for graphs has lagged behind other modalities, with existing approaches relying on heuristics or GNNs co-trained with Transformers. To address this, we introduce GQT (\textbf{G}raph \textbf{Q}uantized \textbf{T}okenizer), which decouples tokenizer training from Transformer training by leveraging multi-task graph self-supervised learning, yielding robust and generalizable graph tokens. Furthermore, the GQT utilizes Residual Vector Quantization (RVQ) to learn hierarchical discrete tokens, resulting in significantly reduced memory requirements and improved generalization capabilities. By combining the GQT with token modulation, a Transformer encoder achieves state-of-the-art performance on 16 out of 18 benchmarks, including large-scale homophilic and heterophilic datasets. The code is available at: https://github.com/limei0307/graph-tokenizer
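The residual vector quantization step can be illustrated with a minimal sketch. The codebooks below are random placeholders, purely for illustration, whereas the actual GQT learns its codebooks jointly with multi-task graph self-supervision.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: quantize x with a cascade of codebooks.

    Each stage picks the nearest codeword to the current residual, so an
    embedding is represented by one discrete index per stage, i.e., a
    hierarchical token. Codebooks here are random placeholders.
    """
    residual = x.copy()
    indices, reconstruction = [], np.zeros_like(x)
    for C in codebooks:                                  # C: (codebook_size, dim)
        dists = np.linalg.norm(residual[None, :] - C, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        reconstruction += C[k]
        residual = residual - C[k]
    return indices, reconstruction

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_stages, codebook_size = 16, 3, 32
    codebooks = [rng.standard_normal((codebook_size, dim)) for _ in range(n_stages)]
    x = rng.standard_normal(dim)
    tokens, x_hat = rvq_encode(x, codebooks)
    print("hierarchical token:", tokens, "| reconstruction error:", np.linalg.norm(x - x_hat))
```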
Authors: Ernesto Arganda, Marcela Carena, Mart\'in de los Rios, Andres D. Perez, Duncan Rocha, Rosa M. Sand\'a Seoane, Carlos E. M. Wagner
Abstract: The search for weakly interacting massive particles (WIMPs) is one of the main objectives of the High Luminosity Large Hadron Collider (HL-LHC). In this work we use Machine Learning (ML) techniques to explore WIMP radiative decays into a Dark Matter (DM) candidate in a supersymmetric framework. The minimal supersymmetric WIMP sector includes the lightest neutralino that can provide the observed DM relic density through its co-annihilation with the second lightest neutralino and lightest chargino. Moreover, the direct DM detection cross section rates fulfill current experimental bounds and provide discovery targets for the same region of model parameters in which the radiative decay of the second lightest neutralino into a photon and the lightest neutralino is enhanced. This strongly motivates the search for radiatively decaying neutralinos which, however, suffers from strong backgrounds. We investigate the LHC reach in the search for these radiatively decaying particles by means of cut-based and ML methods and estimate its discovery potential in this well-motivated, new physics scenario.
Authors: Abhijith Jayakumar, Andrey Y. Lokhov, Sidhant Misra, Marc Vuffray
Abstract: Markov chain samplers designed to sample from multi-variable distributions often undesirably get stuck in specific regions of their state space. This causes such samplers to approximately sample from a metastable distribution which is usually quite different from the desired, stationary distribution of the chain. We show that single-variable conditionals of metastable distributions of reversible Markov chain samplers that satisfy a strong metastability condition are on average very close to those of the true distribution. This holds even when the metastable distribution is far away from the true model in terms of global metrics like Kullback-Leibler divergence or total variation distance. This property allows us to learn the true model using a conditional likelihood based estimator, even when the samples come from a metastable distribution concentrated in a small region of the state space. Explicit examples of such metastable states can be constructed from regions that effectively bottleneck the probability flow and cause poor mixing of the Markov chain. For specific cases of binary pairwise undirected graphical models, we extend our results to further rigorously show that data coming from metastable states can be used to learn the parameters of the energy function and recover the structure of the model.
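A minimal sketch of the kind of single-variable conditional (pseudolikelihood-style) estimator that this result supports, for a binary pairwise model. The solver, regularization strength, and placeholder data below are illustrative assumptions rather than the paper's exact estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_conditionals(samples):
    """Fit single-variable conditionals of a binary pairwise model from samples in {-1,+1}.

    For each node i we fit a logistic model for P(s_i = +1 | s_{-i}); for an
    Ising-type energy the logit is linear in the other spins, so the learned
    coefficients estimate the couplings (up to a constant factor, without
    symmetrization). The point of the result above is that such conditional
    estimates remain informative even when `samples` come from a metastable,
    poorly mixed chain rather than the true stationary distribution.
    """
    n, p = samples.shape
    couplings = np.zeros((p, p))
    for i in range(p):
        X = np.delete(samples, i, axis=1)
        y = (samples[:, i] > 0).astype(int)
        clf = LogisticRegression(C=1e4, max_iter=2000).fit(X, y)  # weak regularization
        couplings[i, np.arange(p) != i] = clf.coef_[0]
    return couplings

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spins = np.sign(rng.standard_normal((5000, 8)))   # placeholder samples (independent spins)
    print(fit_conditionals(spins).round(2))           # near-zero couplings for independent data
```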
Authors: Mohamed Nomeir, Pasan Dissanayake, Shreya Meel, Sanghamitra Dutta, Sennur Ulukus
Abstract: Transparency and explainability are two extremely important aspects to be considered when employing black-box machine learning models in high-stakes applications. Providing counterfactual explanations is one way of catering to this requirement. However, this also poses a threat to the privacy of both the institution that is providing the explanation and the user who is requesting it. In this work, we propose multiple schemes inspired by private information retrieval (PIR) techniques which ensure the \emph{user's privacy} when retrieving counterfactual explanations. We present a scheme which retrieves the \emph{exact} nearest neighbor counterfactual explanation from a database of accepted points while achieving perfect (information-theoretic) privacy for the user. While the scheme achieves perfect privacy for the user, some leakage on the database is inevitable, which we quantify using a mutual information based metric. Furthermore, we propose strategies to reduce this leakage to achieve an advanced degree of database privacy. We extend these schemes to incorporate the user's preferences on transforming their attributes, so that a more actionable explanation can be received. Since our schemes rely on finite field arithmetic, we empirically validate our schemes on real datasets to understand the trade-off between the accuracy and the finite field sizes.
Authors: Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, Sergey Levine
Abstract: Large, general-purpose robotic policies trained on diverse demonstration datasets have been shown to be remarkably effective both for controlling a variety of robots in a range of different scenes, and for acquiring broad repertoires of manipulation skills. However, the data that such policies are trained on is generally of mixed quality -- not only are human-collected demonstrations unlikely to perform the task perfectly, but the larger the dataset is, the harder it is to curate only the highest quality examples. It also remains unclear how optimal data from one embodiment is for training on another embodiment. In this paper, we present a general and broadly applicable approach that enhances the performance of such generalist robot policies at deployment time by re-ranking their actions according to a value function learned via offline RL. This approach, which we call Value-Guided Policy Steering (V-GPS), is compatible with a wide range of different generalist policies, without needing to fine-tune or even access the weights of the policy. We show that the same value function can improve the performance of five different state-of-the-art policies with different architectures, even though they were trained on distinct datasets, attaining consistent performance improvement on multiple robotic platforms across a total of 12 tasks. Code and videos can be found at: https://nakamotoo.github.io/V-GPS
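A minimal sketch of deployment-time action re-ranking with a learned value function, in the spirit of V-GPS. The policy, Q-function, sample count, and tensor shapes below are placeholders, not the released implementation.

```python
import torch

@torch.no_grad()
def steer_action(policy, q_function, observation, num_samples=16):
    """Value-guided steering: sample candidate actions from a frozen generalist
    policy and execute the one the offline-RL value function ranks highest.
    Neither the policy nor the value function is fine-tuned here."""
    obs = observation.unsqueeze(0).repeat(num_samples, 1)     # (K, obs_dim)
    actions = policy(obs)                                     # (K, act_dim), stochastic samples
    scores = q_function(obs, actions).squeeze(-1)             # (K,)
    return actions[scores.argmax()]

if __name__ == "__main__":
    obs_dim, act_dim = 10, 4
    # Placeholder stand-ins for a generalist policy and a learned Q-function.
    policy = lambda o: torch.tanh(torch.randn(o.shape[0], act_dim))
    q_fn = lambda o, a: o.sum(dim=1, keepdim=True) - a.norm(dim=1, keepdim=True)
    best = steer_action(policy, q_fn, torch.randn(obs_dim))
    print("chosen action:", best)
```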
Authors: Xuan Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin
Abstract: Recent advancements in large language models (LLMs) have extended their capabilities to handle long contexts. However, increasing the number of model layers and the length of input sequences significantly escalates the memory required to store key-value (KV) cache, posing challenges for efficient inference. To mitigate this issue, we present SimLayerKV, a simple yet effective method that reduces inter-layer KV cache redundancies by selectively dropping cache in identified lazy layers. Our approach is based on the observation that certain layers in long-context LLMs exhibit "lazy" behavior, contributing less to modeling long-range dependencies compared to non-lazy layers. By analyzing attention weight patterns, we find that the behavior of these lazy layers is consistent across tokens during generation for a given input. This insight motivates our SimLayerKV, which identifies lazy layers and reduces their KV cache accordingly. SimLayerKV is training-free, generalizable, and can be implemented with only seven lines of code. We conduct extensive experiments on three representative LLMs, i.e., LLaMA2-7B, LLaMA3-8B, and Mistral-7B, across 16 tasks from the LongBench benchmark. The results demonstrate that SimLayerKV achieves a KV cache compression ratio of 5$\times$ with only a 1.2% performance drop when combined with 4-bit quantization. Our code is available at https://github.com/sail-sg/SimLayerKV.
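The sketch below illustrates the general idea of flagging lazy layers from attention patterns and trimming their KV cache. The specific criterion, threshold, and cache layout are illustrative assumptions, not the paper's calibrated rule or its seven-line implementation.

```python
import numpy as np

def identify_lazy_layers(attention, recent_window=4, threshold=0.6):
    """Flag layers whose attention mass concentrates on the initial and most
    recent tokens ("lazy" layers). `attention[l, h]` holds the last query's
    attention over all keys for layer l and head h, shape
    (num_layers, num_heads, seq_len). Criterion and threshold are illustrative.
    """
    lazy = []
    for layer in range(attention.shape[0]):
        w = attention[layer].mean(axis=0)                   # average over heads
        trivial_mass = w[0] + w[-recent_window:].sum()      # initial token + recent window
        if trivial_mass > threshold:
            lazy.append(layer)
    return lazy

def prune_kv_cache(kv_cache, lazy_layers, keep_recent=4):
    """Keep only the initial token and the most recent tokens in the KV cache
    of lazy layers; non-lazy layers keep their full cache."""
    pruned = {}
    for layer, (k, v) in kv_cache.items():                  # k, v: (seq_len, head_dim)
        if layer in lazy_layers:
            keep = np.r_[0, np.arange(k.shape[0] - keep_recent, k.shape[0])]
            pruned[layer] = (k[keep], v[keep])
        else:
            pruned[layer] = (k, v)
    return pruned

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = rng.dirichlet(np.ones(32), size=(8, 4))          # (layers, heads, seq_len)
    lazy = identify_lazy_layers(attn)
    cache = {l: (rng.standard_normal((32, 64)), rng.standard_normal((32, 64))) for l in range(8)}
    pruned = prune_kv_cache(cache, lazy)
    print("lazy layers:", lazy, "| cached tokens per layer:", {l: kv[0].shape[0] for l, kv in pruned.items()})
```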
Authors: Florian H\"ubler, Ilyas Fatkhullin, Niao He
Abstract: Recent empirical evidence indicates that many machine learning applications involve heavy-tailed gradient noise, which challenges the standard assumptions of bounded variance in stochastic optimization. Gradient clipping has emerged as a popular tool to handle this heavy-tailed noise, as it achieves good performance in this setting both theoretically and practically. However, our current theoretical understanding of non-convex gradient clipping has three main shortcomings. First, the theory hinges on large, increasing clipping thresholds, which are in stark contrast to the small constant clipping thresholds employed in practice. Second, clipping thresholds require knowledge of problem-dependent parameters to guarantee convergence. Lastly, even with this knowledge, current sampling complexity upper bounds for the method are sub-optimal in nearly all parameters. To address these issues, we study convergence of Normalized SGD (NSGD). First, we establish a parameter-free sample complexity for NSGD of $\mathcal{O}\left(\varepsilon^{-\frac{2p}{p-1}}\right)$ to find an $\varepsilon$-stationary point. Furthermore, we prove tightness of this result, by providing a matching algorithm-specific lower bound. In the setting where all problem parameters are known, we show this complexity is improved to $\mathcal{O}\left(\varepsilon^{-\frac{3p-2}{p-1}}\right)$, matching the previously known lower bound for all first-order methods in all problem-dependent parameters. Finally, we establish high-probability convergence of NSGD with a mild logarithmic dependence on the failure probability. Our work complements the studies of gradient clipping under heavy-tailed noise, improving the sample complexities of existing algorithms and offering an alternative mechanism to achieve high-probability convergence.
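A minimal sketch of the Normalized SGD update analyzed here, run on a toy objective with heavy-tailed gradient noise. Momentum and mini-batching, which the paper also treats, are omitted, and the step size and test problem are illustrative choices.

```python
import numpy as np

def nsgd(grad_fn, x0, step_size=0.05, num_steps=200, eps=1e-12, seed=0):
    """Normalized SGD: take a fixed-length step in the direction of the
    stochastic gradient. Normalization bounds each update even when the
    gradient noise is heavy-tailed, playing the role usually assigned to
    clipping."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_steps):
        g = grad_fn(x, rng)
        x -= step_size * g / (np.linalg.norm(g) + eps)
    return x

if __name__ == "__main__":
    # Quadratic objective with heavy-tailed (Student-t, df=2) gradient noise.
    def noisy_grad(x, rng):
        return 2 * x + rng.standard_t(df=2, size=x.shape)
    x_final = nsgd(noisy_grad, x0=np.full(10, 5.0))
    print("final iterate norm:", np.linalg.norm(x_final))
```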
Authors: Zizhao Chen, Mustafa Omer Gul, Yiwei Chen, Gloria Geng, Anne Wu, Yoav Artzi
Abstract: Multi-turn interactions between large language models (LLMs) and users naturally include implicit feedback signals. If an LLM responds in an unexpected way to an instruction, the user is likely to signal it by rephrasing the request, expressing frustration, or pivoting to an alternative task. Such signals are task-independent and occupy a relatively constrained subspace of language, allowing the LLM to identify them even if it fails on the actual task. This creates an avenue for continually learning from interactions without additional annotations. We introduce ReSpect, a method to learn from such signals in past interactions via retrospection. We deploy ReSpect in a new multimodal interaction scenario, where humans instruct an LLM to solve an abstract reasoning task with a combinatorial solution space. Through thousands of interactions with humans, we show how ReSpect gradually improves task completion rate from 31% to 82%, all without any external annotation.
Authors: Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian
Abstract: Scaling up autoregressive models in vision has not proven as beneficial as in large language models. In this work, we investigate this scaling problem in the context of text-to-image generation, focusing on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed raster order using BERT- or GPT-like transformer architectures. Our empirical results show that, while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends. Models based on continuous tokens achieve significantly better visual quality than those using discrete tokens. Furthermore, the generation order and attention mechanisms significantly affect the GenEval score: random-order models achieve notably better GenEval scores compared to raster-order models. Inspired by these findings, we train Fluid, a random-order autoregressive model on continuous tokens. The Fluid 10.5B model achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K, and a 0.69 overall score on the GenEval benchmark. We hope our findings and results will encourage future efforts to further bridge the scaling gap between vision and language models.
Authors: Christopher Tran, Keith Burghardt, Kristina Lerman, Elena Zheleva
Abstract: Estimating how a treatment affects different individuals, known as heterogeneous treatment effect estimation, is an important problem in empirical sciences. In the last few years, there has been considerable interest in adapting machine learning algorithms to the problem of estimating heterogeneous effects from observational and experimental data. However, these algorithms often make strong assumptions about the observed features in the data and ignore the structure of the underlying causal model, which can lead to biased estimation. At the same time, the underlying causal mechanism is rarely known in real-world datasets, making it hard to take it into consideration. In this work, we provide a survey of state-of-the-art data-driven methods for heterogeneous treatment effect estimation using machine learning, broadly categorizing them as methods that focus on counterfactual prediction and methods that directly estimate the causal effect. We also provide an overview of a third category of methods which rely on structural causal models and learn the model structure from data. Our empirical evaluation under various underlying structural model mechanisms shows the advantages and deficiencies of existing estimators and of the metrics for measuring their performance.
Authors: Tianyi Wang, Rodney Beard, John Hawkins, Rohitash Chandra
Abstract: The gross domestic product (GDP) is the most widely used indicator in macroeconomics and the main tool for measuring a country's economic output. Due to the diversity and complexity of the world economy, a wide range of models have been used, but there are challenges in making decadal GDP forecasts given unexpected changes such as the emergence of catastrophic world events, including pandemics and wars. Deep learning models are well suited for modelling temporal sequences and time series forecasting. In this paper, we develop a deep learning framework to forecast the GDP growth rate of the world economy over a decade. We use the Penn World Table as the data source, featuring 13 countries prior to the COVID-19 pandemic, such as Australia, China, India, and the United States. We present a recursive deep learning framework to predict the GDP growth rate over the next ten years. We test prominent deep learning models and compare their results with traditional econometric models for selected developed and developing countries. Our decadal forecasts reveal that most of the developed countries would experience an economic growth slowdown, stagnation, and even recession within five years (2020-2024). Furthermore, our model forecasts show that only China, France, and India would experience stable GDP growth.
Authors: Thomas Carta, Cl\'ement Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer
Abstract: Recent works successfully leveraged Large Language Models' (LLM) abilities to capture abstract knowledge about the world's physics to solve decision-making problems. Yet, the alignment between LLMs' knowledge and the environment can be wrong and limit functional competence due to a lack of grounding. In this paper, we study an approach (named GLAM) to achieve this alignment through functional grounding: we consider an agent using an LLM as a policy that is progressively updated as the agent interacts with the environment, leveraging online Reinforcement Learning to improve its performance to solve goals. Using an interactive textual environment designed to study higher-level forms of functional grounding, and a set of spatial and navigation tasks, we study several scientific questions: 1) Can LLMs boost sample efficiency for online learning of various RL tasks? 2) How can they boost different forms of generalization? 3) What is the impact of online learning? We study these questions by functionally grounding several variants (size, architecture) of FLAN-T5.
Authors: Juyoung Yun, Sol Choi, Francois Rameau, Byungkon Kang, Zhoulai Fu
Abstract: With the increasing complexity of machine learning models, managing computational resources like memory and processing power has become a critical concern. Mixed precision techniques, which leverage different numerical precisions during model training and inference to optimize resource usage, have been widely adopted. However, access to hardware that supports lower precision formats (e.g., FP8 or FP4) remains limited, especially for practitioners with hardware constraints. For many with limited resources, the available options are restricted to using 32-bit, 16-bit, or a combination of the two. While it is commonly believed that 16-bit precision can achieve results comparable to full (32-bit) precision, this study is the first to systematically validate this assumption through both rigorous theoretical analysis and extensive empirical evaluation. Our theoretical formalization of floating-point errors and classification tolerance provides new insights into the conditions under which 16-bit precision can approximate 32-bit results. This study fills a critical gap, proving for the first time that standalone 16-bit precision neural networks match 32-bit and mixed-precision in accuracy while boosting computational speed. Given the widespread availability of 16-bit across GPUs, these findings are especially valuable for machine learning practitioners with limited hardware resources to make informed decisions.
Authors: John Abascal, Stanley Wu, Alina Oprea, Jonathan Ullman
Abstract: Transfer learning has become an increasingly popular technique in machine learning as a way to leverage a pretrained model trained for one task to assist with building a finetuned model for a related task. This paradigm has been especially popular for $\textit{privacy}$ in machine learning, where the pretrained model is considered public, and only the data for finetuning is considered sensitive. However, there are reasons to believe that the data used for pretraining is still sensitive, making it essential to understand how much information the finetuned model leaks about the pretraining data. In this work we propose a new membership-inference threat model where the adversary only has access to the finetuned model and would like to infer the membership of the pretraining data. To realize this threat model, we implement a novel metaclassifier-based attack, $\textbf{TMI}$, that leverages the influence of memorized pretraining samples on predictions in the downstream task. We evaluate $\textbf{TMI}$ on both vision and natural language tasks across multiple transfer learning settings, including finetuning with differential privacy. Through our evaluation, we find that $\textbf{TMI}$ can successfully infer membership of pretraining examples using query access to the finetuned model. An open-source implementation of $\textbf{TMI}$ can be found on GitHub: https://github.com/johnmath/tmi-pets24.
Authors: Gerry Windiarto Mohamad Dunda, Shenghui Song
Abstract: Federated learning (FL) has garnered considerable attention due to its privacy-preserving feature. Nonetheless, the lack of freedom in managing user data can lead to group fairness issues, where models are biased towards sensitive factors such as race or gender. To tackle this issue, this paper proposes a novel algorithm, fair federated averaging with augmented Lagrangian method (FFALM), designed explicitly to address group fairness issues in FL. Specifically, we impose a fairness constraint on the training objective and solve the minimax reformulation of the constrained optimization problem. Then, we derive the theoretical upper bound for the convergence rate of FFALM. The effectiveness of FFALM in improving fairness is shown empirically on CelebA and UTKFace datasets in the presence of severe statistical heterogeneity.
Authors: Ameya Anjarlekar, Rasoul Etesami, R. Srikant
Abstract: We address the challenge of solving machine learning tasks using data from privacy-sensitive sellers. Since the data is private, we design a data market that incentivizes sellers to provide their data in exchange for payments. Therefore our objective is to design a mechanism that optimizes a weighted combination of test loss, seller privacy, and payment, striking a balance between building a good privacy-preserving ML model and minimizing payments to the sellers. To achieve this, we first propose an approach to solve logistic regression with known heterogeneous differential privacy guarantees. Building on these results and leveraging standard mechanism design theory, we develop a two-step optimization framework. We further extend this approach to an online algorithm that handles the sequential arrival of sellers.
Authors: Zezhi Shao, Fei Wang, Yongjun Xu, Wei Wei, Chengqing Yu, Zhao Zhang, Di Yao, Tao Sun, Guangyin Jin, Xin Cao, Gao Cong, Christian S. Jensen, Xueqi Cheng
Abstract: Multivariate Time Series (MTS) analysis is crucial to understanding and managing complex systems, such as traffic and energy systems, and a variety of approaches to MTS forecasting have been proposed recently. However, we often observe inconsistent or seemingly contradictory performance findings across different studies. This hinders our understanding of the merits of different approaches and slows down progress. We address the need for means of assessing MTS forecasting proposals reliably and fairly, in turn enabling better exploitation of MTS as seen in different applications. Specifically, we first propose BasicTS+, a benchmark designed to enable fair, comprehensive, and reproducible comparison of MTS forecasting solutions. BasicTS+ establishes a unified training pipeline and reasonable settings, enabling an unbiased evaluation. Second, we identify the heterogeneity across different MTS as an important consideration and enable classification of MTS based on their temporal and spatial characteristics. Disregarding this heterogeneity is a prime reason for difficulties in selecting the most promising technical directions. Third, we apply BasicTS+ along with rich datasets to assess the capabilities of more than 45 MTS forecasting solutions. This provides readers with an overall picture of the cutting-edge research on MTS forecasting. The code can be accessed at https://github.com/GestaltCogTeam/BasicTS.
Authors: Tong Liu, Hadi Meidani
Abstract: The traffic assignment problem is one of the significant components of traffic flow analysis, for which various solution approaches have been proposed. However, deploying these approaches for large-scale networks poses significant challenges. In this paper, we leverage the power of heterogeneous graph neural networks to propose a novel end-to-end surrogate model for traffic assignment, specifically user equilibrium traffic assignment problems. Our model integrates an adaptive graph attention mechanism with auxiliary "virtual" links connecting origin-destination node pairs. This integration enables the model to capture spatial traffic patterns across different links. By incorporating the node-based flow conservation law into the overall loss function, the model ensures that the predictions comply with flow conservation principles, resulting in highly accurate predictions for both link flow and flow-capacity ratios. We present numerical experiments on urban transportation networks and show that the proposed heterogeneous graph neural network model outperforms other conventional neural network models in terms of convergence rate and prediction accuracy. Notably, by introducing two different training strategies, the proposed heterogeneous graph neural network model can also be generalized to different network topologies. This approach offers a promising solution for complex traffic flow analysis and prediction, enhancing our understanding and management of a wide range of transportation systems.
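A minimal sketch of a node-based flow-conservation penalty of the kind described above, built from a node-link incidence matrix. The toy network and the mean-squared form of the penalty are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def flow_conservation_loss(link_flows, incidence, demand):
    """Penalize violation of node-based flow conservation.

    `incidence` is the (num_nodes, num_links) node-link incidence matrix
    (+1 if the link leaves the node, -1 if it enters), and `demand` is the net
    flow each node must source (positive) or sink (negative). For a feasible
    assignment, incidence @ link_flows equals demand at every node.
    """
    residual = incidence @ link_flows - demand
    return float(np.mean(residual ** 2))

if __name__ == "__main__":
    # Toy 3-node network with links 0->1, 1->2, 0->2; two units travel from node 0 to node 2.
    incidence = np.array([[ 1,  0,  1],
                          [-1,  1,  0],
                          [ 0, -1, -1]], dtype=float)
    demand = np.array([2.0, 0.0, -2.0])
    feasible = np.array([1.0, 1.0, 1.0])        # one unit via node 1, one unit direct
    infeasible = np.array([2.0, 0.5, 0.0])
    print(flow_conservation_loss(feasible, incidence, demand))    # ~0.0
    print(flow_conservation_loss(infeasible, incidence, demand))  # > 0
```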
Authors: Ikumi Okubo, Keisuke Sugiura, Hiroki Matsutani
Abstract: Transformers have been adopted for image recognition tasks and shown to outperform CNNs and RNNs, but they suffer from high training cost and computational complexity. To address these issues, a hybrid approach has become a recent research trend, which replaces a part of ResNet with an MHSA (Multi-Head Self-Attention). In this paper, we propose a lightweight hybrid model which uses Neural ODE (Ordinary Differential Equation) as a backbone instead of ResNet so that we can increase the number of iterations of building blocks while reusing the same parameters, mitigating the increase in parameter size per iteration. The proposed model is deployed on a modest-sized FPGA device for edge computing. The model is further quantized by a QAT (Quantization Aware Training) scheme to reduce FPGA resource utilization while suppressing the accuracy loss. The quantized model achieves 79.68% top-1 accuracy on the STL10 dataset, which contains 96$\times$96 pixel images. The weights of the feature extraction network are stored on-chip to minimize the memory transfer overhead; by eliminating this overhead, inference can be executed seamlessly, leading to accelerated inference. The proposed FPGA implementation accelerates the backbone and MHSA parts by 34.01$\times$, and achieves an overall 9.85$\times$ speedup when taking into account the software pre- and post-processing. The FPGA acceleration leads to 7.10$\times$ better energy efficiency compared to the ARM Cortex-A53 CPU. The proposed lightweight Transformer model is demonstrated on the Xilinx ZCU104 board for the image recognition of 96$\times$96 pixel images in this paper and can be applied to different image sizes by modifying the pre-processing layer.
Authors: Jakub Grudzien Kuba, Masatoshi Uehara, Pieter Abbeel, Sergey Levine
Abstract: While machine learning models are typically trained to solve prediction problems, we might often want to use them for optimization problems. For example, given a dataset of proteins and their corresponding fluorescence levels, we might want to optimize for a new protein with the highest possible fluorescence. This kind of data-driven optimization (DDO) presents a range of challenges beyond those in standard prediction problems, since we need models that successfully predict the performance of new designs that are better than the best designs seen in the training set. It is not clear theoretically when existing approaches can even perform better than the naive approach that simply selects the best design in the dataset. In this paper, we study how structure can enable sample-efficient data-driven optimization. To formalize the notion of structure, we introduce functional graphical models (FGMs) and show theoretically how they can provide for principled data-driven optimization by decomposing the original high-dimensional optimization problem into smaller sub-problems. This allows us to derive much more practical regret bounds for DDO, and the result implies that DDO with FGMs can achieve nearly optimal designs in situations where naive approaches fail due to insufficient coverage of the offline data. We further present a data-driven optimization algorithm that infers the FGM structure itself, either over the original input variables or a latent variable representation of the inputs.
Authors: Florian E. Dorner, Moritz Hardt
Abstract: We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on collecting a single label for more samples. Our result follows from a non-trivial application of Cram\'er's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where they overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound.
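The simulation below sketches the comparison at the heart of the result: spending a fixed annotation budget on single noisy labels for many samples versus majority votes over fewer samples, and checking which allocation identifies the better classifier more often. The classifier accuracies, flip probability, and budget are illustrative choices, not values from the paper.

```python
import numpy as np

def simulate(acc_a=0.80, acc_b=0.77, flip_prob=0.3, budget=3000, votes=3,
             trials=2000, seed=0):
    """Monte Carlo check of the budget question: how often does each labeling
    strategy rank the truly better classifier (A) above B on noisy labels?"""
    rng = np.random.default_rng(seed)
    correct = {"single": 0, "majority": 0}
    for _ in range(trials):
        for scheme, n, k in (("single", budget, 1), ("majority", budget // votes, votes)):
            truth = rng.integers(0, 2, n)
            pred_a = np.where(rng.random(n) < acc_a, truth, 1 - truth)
            pred_b = np.where(rng.random(n) < acc_b, truth, 1 - truth)
            noisy = (rng.random((n, k)) < flip_prob) ^ truth[:, None]  # k noisy labels each
            label = (noisy.mean(axis=1) > 0.5).astype(int)             # majority vote
            if (pred_a == label).mean() > (pred_b == label).mean():
                correct[scheme] += 1
    return {s: c / trials for s, c in correct.items()}

if __name__ == "__main__":
    print(simulate())   # fraction of trials in which each scheme correctly picks classifier A
```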
Authors: Naihao Deng, Zhenjie Sun, Ruiqi He, Aman Sikka, Yulong Chen, Lin Ma, Yue Zhang, Rada Mihalcea
Abstract: In this paper, we investigate the effectiveness of various LLMs in interpreting tabular data through different prompting strategies and data formats. Our analyses extend across six benchmarks for table-related tasks such as question-answering and fact-checking. We introduce for the first time the assessment of LLMs' performance on image-based table representations. Specifically, we compare five text-based and three image-based table representations, demonstrating the role of representation and prompting on LLM performance. Our study provides insights into the effective use of LLMs on table-related tasks.
Authors: Shashwat Goel, Ameya Prabhu, Philip Torr, Ponnurangam Kumaraguru, Amartya Sanyal
Abstract: Machine Learning models increasingly face data integrity challenges due to the use of large-scale training datasets drawn from the Internet. We study what model developers can do if they detect that some data was manipulated or incorrect. Such manipulated data can cause adverse effects including vulnerability to backdoored samples, systemic biases, and reduced accuracy on certain input domains. Realistically, all manipulated training samples cannot be identified, and only a small, representative subset of the affected data can be flagged. We formalize Corrective Machine Unlearning as the problem of mitigating the impact of data affected by unknown manipulations on a trained model, only having identified a subset of the corrupted data. We demonstrate that the problem of corrective unlearning has significantly different requirements from traditional privacy-oriented unlearning. We find most existing unlearning methods, including retraining-from-scratch without the deletion set, require most of the manipulated data to be identified for effective corrective unlearning. However, one approach, Selective Synaptic Dampening, achieves limited success, unlearning adverse effects with just a small portion of the manipulated samples in our setting, which shows encouraging signs for future progress. We hope our work spurs research towards developing better methods for corrective unlearning and offers practitioners a new strategy to handle data integrity challenges arising from web-scale training. Code is available at https://github.com/drimpossible/corrective-unlearning-bench.
Authors: Shintaro Tamai, Masayuki Numao, Ken-ichi Fukui
Abstract: Recently, with growing health awareness, novel methods have emerged that allow individuals to monitor sleep at home. Utilizing sleep sounds offers advantages over conventional methods like smartwatches, being non-intrusive and capable of detecting various physiological activities. This study aims to construct a machine learning-based sleep assessment model that provides evidence-based assessments, such as identifying poor sleep due to frequent movement during sleep onset. Our pipeline extracts sleep sound events, derives latent representations using a VAE, clusters them with a GMM, and trains an LSTM for subjective sleep assessment, achieving a high accuracy of 94.8% in distinguishing sleep satisfaction. Moreover, TimeSHAP revealed differences in impactful sound event types and timings for different individuals.
Authors: Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle
Abstract: Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. We also find that inference cost increases linearly in the many-shot regime, and frontier LLMs benefit from many-shot ICL to varying degrees. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
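As a rough illustration of the two prompt variants, the sketch below builds a Reinforced ICL prompt from model-generated rationales (in practice these may be filtered for answer correctness) and an Unsupervised ICL prompt from questions only. The formatting and field names are assumptions, not the paper's exact templates.

```python
def reinforced_icl_prompt(problems, rationales, answers, query):
    """Many-shot prompt using model-generated chain-of-thought rationales in
    place of human-written examples. Formatting is illustrative."""
    shots = [f"Q: {q}\nReasoning: {r}\nA: {a}"
             for q, r, a in zip(problems, rationales, answers)]
    return "\n\n".join(shots + [f"Q: {query}\nReasoning:"])

def unsupervised_icl_prompt(problems, query):
    """Many-shot prompt containing only domain-specific questions, with no
    rationales or answers."""
    return "\n\n".join([f"Q: {q}" for q in problems] + [f"Q: {query}\nReasoning:"])

if __name__ == "__main__":
    probs = ["2 + 2 = ?", "3 * 5 = ?"]
    rats = ["Adding two and two gives 4.", "Five added three times gives 15."]
    ans = ["4", "15"]
    print(reinforced_icl_prompt(probs, rats, ans, "7 - 3 = ?"))
    print("---")
    print(unsupervised_icl_prompt(probs, "7 - 3 = ?"))
```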
Authors: Miruna Cretu, Charles Harris, Ilia Igashov, Arne Schneuing, Marwin Segler, Bruno Correia, Julien Roy, Emmanuel Bengio, Pietro Li\`o
Abstract: Generative models see increasing use in computer-aided drug design. However, while performing well at capturing distributions of molecular motifs, they often produce synthetically inaccessible molecules. To address this, we introduce SynFlowNet, a GFlowNet model whose action space uses chemical reactions and buyable reactants to sequentially build new molecules. By incorporating forward synthesis as an explicit constraint of the generative mechanism, we aim at bridging the gap between in silico molecular generation and real world synthesis capabilities. We evaluate our approach using synthetic accessibility scores and an independent retrosynthesis tool to assess the synthesizability of our compounds, and motivate the choice of GFlowNets through considerable improvement in sample diversity compared to baselines. Additionally, we identify challenges with reaction encodings that can complicate traversal of the MDP in the backward direction. To address this, we introduce various strategies for learning the GFlowNet backward policy and thus demonstrate how additional constraints can be integrated into the GFlowNet MDP framework. This approach enables our model to successfully identify synthesis pathways for previously unseen molecules.
Authors: Xitong Zhang, Ismail R. Alkhouri, Rongrong Wang
Abstract: Deep Neural Networks (DNNs) have achieved remarkable success in addressing many previously unsolvable tasks. However, the storage and computational requirements associated with DNNs pose a challenge for deploying these trained models on resource-limited devices. Therefore, a plethora of compression and pruning techniques have been proposed in recent years. Low-rank decomposition techniques are among the approaches most utilized to address this problem. Compared to post-training compression, compression-promoted training is still under-explored. In this paper, we present a theoretically-justified technique termed Low-Rank Induced Training (LoRITa), that promotes low-rankness through the composition of linear layers and compresses by using singular value truncation. This is achieved without the need to change the structure at inference time or require constrained and/or additional optimization, other than the standard weight decay regularization. Moreover, LoRITa eliminates the need to (i) initialize with pre-trained models, (ii) specify rank selection prior to training, and (iii) compute SVD in each iteration. Our experimental results (i) demonstrate the effectiveness of our approach using MNIST on Fully Connected Networks, CIFAR10 on Vision Transformers, and CIFAR10/100 and ImageNet on Convolutional Neural Networks, and (ii) illustrate that we achieve either competitive or state-of-the-art results when compared to leading structured pruning and low-rank training methods in terms of FLOPs and parameters drop. Our code is available at \url{https://github.com/XitongSystem/LoRITa/tree/main}.
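A minimal sketch of the underlying idea of composing linear layers during training and compressing by singular value truncation afterwards. The architecture, rank, and data below are placeholders, not the paper's exact recipe or hyperparameters.

```python
import torch
import torch.nn as nn

class ComposedLinear(nn.Module):
    """One logical linear layer parameterized as a product of two linear maps.
    Trained with ordinary weight decay, the product tends to have rapidly
    decaying singular values, which the later truncation step exploits."""
    def __init__(self, in_features, out_features, hidden=None):
        super().__init__()
        hidden = hidden or max(in_features, out_features)
        self.a = nn.Linear(in_features, hidden, bias=False)
        self.b = nn.Linear(hidden, out_features, bias=True)

    def forward(self, x):
        return self.b(self.a(x))

    def compress(self, rank):
        """Collapse the composition into one matrix and keep the top-`rank`
        singular directions (post-training compression step)."""
        W = self.b.weight @ self.a.weight                      # (out, in)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        layer = nn.Linear(W.shape[1], W.shape[0], bias=True)
        with torch.no_grad():
            layer.weight.copy_(U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank])
            layer.bias.copy_(self.b.bias)
        return layer

if __name__ == "__main__":
    layer = ComposedLinear(64, 32)       # untrained, for illustration only
    x = torch.randn(8, 64)
    compressed = layer.compress(rank=8)
    print("max output difference:", (layer(x) - compressed(x)).abs().max().item())
```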
Authors: Andr\'e Correia, Lu\'is A. Alexandre
Abstract: Recent advancements in imitation learning have been largely fueled by the integration of sequence models, which provide a structured flow of information to effectively mimic task behaviours. Currently, the Decision Transformer (DT) and, subsequently, the Hierarchical Decision Transformer (HDT) are Transformer-based approaches to learning task policies. Recently, the Mamba architecture has been shown to outperform Transformers across various task domains. In this work, we introduce two novel methods, Decision Mamba (DM) and Hierarchical Decision Mamba (HDM), aimed at improving upon the performance of the Transformer models. Through extensive experimentation across diverse environments such as OpenAI Gym and D4RL, leveraging varying demonstration data sets, we demonstrate the superiority of Mamba models over their Transformer counterparts in a majority of tasks. Results show that DM outperforms other methods in most settings. The code can be found at https://github.com/meowatthemoon/DecisionMamba.
Authors: Xiangyu Dong, Xingyi Zhang, Yanni Sun, Lei Chen, Mingxuan Yuan, Sibo Wang
Abstract: The smoothing issue in graph learning leads to indistinguishable node representations, posing significant challenges for graph-related tasks. However, our experiments reveal that this problem can uncover underlying properties of node anomaly detection (NAD) that previous research has missed. We introduce Individual Smoothing Patterns (ISP) and Neighborhood Smoothing Patterns (NSP), which indicate that the representations of anomalous nodes are harder to smooth than those of normal ones. In addition, we explore the theoretical implications of these patterns, demonstrating the potential benefits of ISP and NSP for NAD tasks. Motivated by these findings, we propose SmoothGNN, a novel unsupervised NAD framework. First, we design a learning component to explicitly capture ISP for detecting node anomalies. Second, we design a spectral graph neural network to implicitly learn ISP to enhance detection. Third, we design an effective coefficient based on our findings that NSP can serve as coefficients for node representations, aiding in the identification of anomalous nodes. Furthermore, we devise a novel anomaly measure to calculate loss functions and anomalous scores for nodes, reflecting the properties of NAD using ISP and NSP. Extensive experiments on 9 real datasets show that SmoothGNN outperforms the best rival by an average of 14.66% in AUC and 7.28% in Average Precision, with 75x running time speedup, validating the effectiveness and efficiency of our framework.
Authors: Yige Hong, Qiaomin Xie, Yudong Chen, Weina Wang
Abstract: We consider the infinite-horizon average-reward restless bandit problem. We propose a novel \emph{two-set policy} that maintains two dynamic subsets of arms: one subset of arms has a nearly optimal state distribution and takes actions according to an Optimal Local Control routine; the other subset of arms is driven towards the optimal state distribution and gradually merged into the first subset. We show that our two-set policy is asymptotically optimal with an $O(\exp(-C N))$ optimality gap for an $N$-armed problem, under the mild assumptions of aperiodic-unichain, non-degeneracy, and local stability. Our policy is the first to achieve \emph{exponential asymptotic optimality} under the above set of easy-to-verify assumptions, whereas prior work either requires a strong \emph{global attractor} assumption or only achieves an $O(1/\sqrt{N})$ optimality gap. We further discuss obstacles in weakening the assumptions by demonstrating examples where exponential asymptotic optimality is not achievable when any of the three assumptions is violated. Notably, we prove a lower bound for a large class of locally unstable restless bandits, showing that local stability is particularly fundamental for exponential asymptotic optimality. Finally, we use simulations to demonstrate that the two-set policy outperforms previous policies on certain RB problems and performs competitively overall.
Authors: Christoph Kern, Michael Kim, Angela Zhou
Abstract: Estimating heterogeneous treatment effects is important to tailor treatments to those individuals who would most likely benefit. However, conditional average treatment effect predictors may often be trained on one population but deployed on different, possibly unknown populations. We use methodology for learning multi-accurate predictors to post-process CATE T-learners (differenced regressions) to become robust to unknown covariate shifts at the time of deployment. The method works in general for pseudo-outcome regression, such as the DR-learner. We show how this approach can combine (large) confounded observational and (smaller) randomized datasets by learning a confounded predictor from the observational dataset, and auditing for multi-accuracy on the randomized controlled trial. We show improvements in bias and mean squared error in simulations with increasingly larger covariate shift, and on a semi-synthetic case study of a parallel large observational study and smaller randomized controlled experiment. Overall, we establish a connection between methods developed for multi-distribution learning and achieve appealing desiderata (e.g. external validity) in causal inference and machine learning.
Authors: Alexander H\"agele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, Martin Jaggi
Abstract: Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup as well as future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative -- constant learning rate and cooldowns -- and find that it scales predictably and reliably similar to cosine. Additionally, we show that stochastic weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales. Importantly, with these findings we demonstrate that scaling experiments can be performed with significantly reduced compute and GPU hours by utilizing fewer but reusable training runs. Our code is available at \url{https://github.com/epfml/schedules-and-scaling/}.
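A minimal sketch of the schedule family studied here (linear warmup, constant plateau, linear cooldown). The warmup length and cooldown fraction are illustrative defaults, not the paper's tuned values.

```python
def constant_with_cooldown(step, total_steps, peak_lr=3e-4,
                           warmup_steps=1000, cooldown_frac=0.2):
    """Learning rate at `step`: linear warmup to `peak_lr`, a constant plateau,
    then a linear cooldown over the final `cooldown_frac` of training. Because
    the plateau is flat, runs of different lengths share the same trajectory
    until their respective cooldowns begin, unlike a cosine schedule."""
    cooldown_steps = int(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < cooldown_start:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / cooldown_steps)

if __name__ == "__main__":
    for s in (0, 500, 5000, 40000, 49999):
        print(s, round(constant_with_cooldown(s, total_steps=50000), 6))
```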
Authors: Aozhong Zhang, Naigang Wang, Yanxia Deng, Xin Li, Zi Yang, Penghang Yin
Abstract: In this paper, we present a simple optimization-based preprocessing technique called Weight Magnitude Reduction (MagR) to improve the performance of post-training quantization. For each linear layer, we adjust the pre-trained floating-point weights by solving an $\ell_\infty$-regularized optimization problem. This process greatly diminishes the maximum magnitude of the weights and smooths out outliers, while preserving the layer's output. The preprocessed weights are centered more towards zero, which facilitates the subsequent quantization process. To implement MagR, we address the $\ell_\infty$-regularization by employing an efficient proximal gradient descent algorithm. Unlike existing preprocessing methods that involve linear transformations and subsequent post-processing steps, which can introduce significant overhead at inference time, MagR functions as a non-linear transformation, eliminating the need for any additional post-processing. This ensures that MagR introduces no overhead whatsoever during inference. Our experiments demonstrate that MagR achieves state-of-the-art performance on the Llama family of models. For example, we achieve a Wikitext2 perplexity of 5.95 on the LLaMA2-70B model for per-channel INT2 weight quantization without incurring any inference overhead.
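A minimal sketch of an $\ell_\infty$-regularized, output-preserving preprocessing step in the spirit of MagR, solved by proximal gradient descent. The exact objective form, calibration data, step size, and regularization strength are illustrative assumptions rather than the paper's algorithm or hyperparameters.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of a vector onto the l1 ball of the given radius."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (cssv - radius))[0][-1]
    theta = (cssv[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(v, lam):
    """Proximal operator of lam * ||.||_inf via Moreau decomposition."""
    return v - project_l1_ball(v, lam)

def magnitude_reduce(X, w0, alpha=2.0, num_steps=5000):
    """Shrink the max magnitude of one weight column w0 while keeping the layer
    output on calibration activations X close to the original:
        min_w ||X w - X w0||^2 + alpha * ||w||_inf,
    solved with proximal gradient descent (a schematic version of the idea)."""
    L = 2.0 * np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part
    eta = 1.0 / L
    w, target = w0.copy(), X @ w0
    for _ in range(num_steps):
        grad = 2.0 * X.T @ (X @ w - target)
        w = prox_linf(w - eta * grad, eta * alpha)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((32, 64))            # fewer calibration samples than weight dims
    w0 = rng.standard_normal(64); w0[3] = 8.0    # a weight column with an outlier entry
    w = magnitude_reduce(X, w0)
    print("max |w| before:", np.abs(w0).max(), "after:", np.abs(w).max())
    print("relative output change:", np.linalg.norm(X @ w - X @ w0) / np.linalg.norm(X @ w0))
```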
Authors: Shivam Gupta, Linda Cai, Sitan Chen
Abstract: Sampling algorithms play an important role in controlling the quality and runtime of diffusion model inference. In recent years, a number of works~\cite{chen2023sampling,chen2023ode,benton2023error,lee2022convergence} have proposed schemes for diffusion sampling with provable guarantees; these works show that for essentially any data distribution, one can approximately sample in polynomial time given a sufficiently accurate estimate of its score functions at different noise levels. In this work, we propose a new scheme inspired by Shen and Lee's randomized midpoint method for log-concave sampling~\cite{ShenL19}. We prove that this approach achieves the best known dimension dependence for sampling from arbitrary smooth distributions in total variation distance ($\widetilde O(d^{5/12})$ compared to $\widetilde O(\sqrt{d})$ from prior work). We also show that our algorithm can be parallelized to run in only $\widetilde O(\log^2 d)$ parallel rounds, constituting the first provable guarantees for parallel sampling with diffusion models. As a byproduct of our methods, for the well-studied problem of log-concave sampling in total variation distance, we give an algorithm and simple analysis achieving dimension dependence $\widetilde O(d^{5/12})$ compared to $\widetilde O(\sqrt{d})$ from prior work.
Authors: Riccardo Salami, Pietro Buzzega, Matteo Mosconi, Mattia Verasani, Simone Calderara
Abstract: Federated Learning (FL) aims at unburdening the training of deep models by distributing computation across multiple devices (clients) while safeguarding data privacy. On top of that, Federated Continual Learning (FCL) also accounts for data distribution evolving over time, mirroring the dynamic nature of real-world environments. In this work, we shed light on the Incremental and Federated biases that naturally emerge in FCL. While the former is a known problem in Continual Learning, stemming from the prioritization of recently introduced classes, the latter (i.e., the bias towards local distributions) remains relatively unexplored. Our proposal constrains both biases in the last layer by efficiently fine-tuning a pre-trained backbone using learnable prompts, resulting in clients that produce less biased representations and more biased classifiers. Therefore, instead of solely relying on parameter aggregation, we also leverage generative prototypes to effectively balance the predictions of the global model. Our method improves on the current State Of The Art, providing an average increase of +7.9% in accuracy.
Authors: Sunwoong Yang, Ricardo Vinuesa, Namwoo Kang
Abstract: This study aims to overcome the limitations of conventional deep-learning approaches based on convolutional neural networks in complex geometries and unstructured meshes by exploring the potential of Graph U-Nets for unsteady flow-field prediction. We present a comprehensive investigation of Graph U-Nets, originally developed for classification tasks, now tailored for mesh-agnostic spatio-temporal forecasting of fluid dynamics. Our focus is on enhancing their performance through systematic hyperparameter tuning and architectural modifications. We propose novel approaches to improve mesh-agnostic spatio-temporal prediction of transient flow fields using Graph U-Nets, enabling accurate prediction on diverse mesh configurations. Key enhancements to the Graph U-Net architecture, including the Gaussian-mixture-model convolutional operator and noise injection approaches, provide increased flexibility in modeling node dynamics: the former reduces prediction error by 95\% compared to conventional convolutional operators, while the latter improves long-term prediction robustness, resulting in an error reduction of 86\%. We demonstrate the effectiveness of these enhancements in both transductive and inductive learning settings, showcasing the adaptability of Graph U-Nets to various flow conditions and mesh structures. This work contributes to the field of reduced-order modeling for computational fluid dynamics by establishing Graph U-Nets as a viable and flexible alternative to convolutional neural networks, capable of accurately and efficiently predicting complex fluid flow phenomena across diverse scenarios.
Authors: Stefan Gerd Fritsch, Cennet Oguz, Vitor Fortes Rey, Lala Ray, Maximilian Kiefer-Emmanouilidis, Paul Lukowicz
Abstract: Human Activity Recognition (HAR) is a longstanding problem in AI with applications in a broad range of areas, including healthcare, sports and fitness, security, and more. The performance of HAR in real-world settings is strongly dependent on the type and quality of the input signal that can be acquired. Given an unobstructed, high-quality camera view of a scene, computer vision systems, in particular in conjunction with foundation models, can today fairly reliably distinguish complex activities. On the other hand, recognition using modalities such as wearable sensors (which are often more broadly available, e.g., in mobile phones and smartwatches) is a more difficult problem, as the signals often contain less information and labeled training data is more difficult to acquire. To alleviate the need for labeled data, we introduce our comprehensive Fitness Multimodal Activity Dataset (FiMAD) in this work, which can be used with the proposed pre-training method MuJo (Multimodal Joint Feature Space Learning) to enhance HAR performance across various modalities. FiMAD was created using YouTube fitness videos and contains parallel video, language, pose, and simulated IMU sensor data. MuJo utilizes this dataset to learn a joint feature space for these modalities. We show that classifiers pre-trained on FiMAD can increase the performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH. For instance, on MM-Fit, we achieve a Macro F1-Score of up to 0.855 when fine-tuning on only 2% of the training data and 0.942 when utilizing the full training set for classification tasks. We compare our approach to other self-supervised methods and show that, unlike them, ours consistently improves on the baseline network performance and provides better data efficiency.
Authors: Son Nguyen, Lizhang Chen, Bo Liu, Qiang Liu
Abstract: Modern deep learning heavily depends on adaptive optimizers such as Adam and its variants, which are renowned for their capacity to handle model scaling and streamline hyperparameter tuning. However, these algorithms typically experience high memory overhead caused by the accumulation of optimization states, leading to a critical challenge in training large-scale network models. In this study, we introduce a novel adaptive optimizer, H-Fac, which incorporates a memory-efficient factorization approach to address this challenge. By employing a rank-1 parameterization for both momentum and scaling parameter estimators, H-Fac reduces memory costs to a sublinear level while maintaining competitive performance across a wide range of architectures. We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings in optimization dynamics and convergence guarantees. These optimization algorithms are designed to be both straightforward and adaptable, facilitating easy implementation in diverse settings.
Authors: Massimiliano Lupo Pasini, Jong Youl Choi, Kshitij Mehta, Pei Zhang, David Rogers, Jonghyun Bae, Khaled Z. Ibrahim, Ashwin M. Aji, Karl W. Schulz, Jorda Polo, Prasanna Balaprakash
Abstract: We present our work on developing and training scalable, trustworthy, and energy-efficient predictive graph foundation models (GFMs) using HydraGNN, a multi-headed graph convolutional neural network architecture. HydraGNN expands the boundaries of graph neural network (GNN) computations in both training scale and data diversity. It abstracts over message passing algorithms, allowing both reproduction of and comparison across algorithmic innovations that define nearest-neighbor convolution in GNNs. This work discusses a series of optimizations that have allowed scaling up the GFMs training to tens of thousands of GPUs on datasets that consist of hundreds of millions of graphs. Using over 154 million atomistic structures for training, we illustrate the performance of our approach along with the lessons learned on two state-of-the-art United States Department of Energy (US-DOE) supercomputers, namely the Perlmutter petascale system at the National Energy Research Scientific Computing Center and the Frontier exascale system at Oak Ridge Leadership Computing Facility. The HydraGNN architecture enables the GFM to achieve near-linear strong scaling performance using more than 2,000 GPUs on Perlmutter and 16,000 GPUs on Frontier. Hyperparameter optimization (HPO) was performed on over 64,000 Graphic Compute Dies (GCDs) on Frontier to select GFM architectures with high accuracy. Each HPO trial was ranked based on both accuracy and energy consumption. The training of an ensemble of highest-ranked GFM architectures (selected with judicious balance between accuracy and energy consumption) continued until convergence to establish uncertainty quantification (UQ) capabilities with ensemble learning. Our contributions establish core capabilities for rapidly developing, training, and deploying further GFMs using large-scale computational resources to enable AI-accelerated materials discovery and design.
Authors: Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, Mohamed S Abdelfattah
Abstract: The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated efficiency techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs because the permanent removal of attention heads or neurons from LLMs can significantly degrade accuracy. Prior work has attempted to model contextual sparsity using neural networks trained to predict activation magnitudes, which can be used to dynamically prune structures with low predicted activation magnitude. In this paper, we look beyond magnitude-based pruning criteria to assess attention head and neuron importance in LLMs. We develop a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns, resulting in over 15% improvement in end-to-end accuracy compared to prior methods. In addition, ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu framework. These enhancements are validated on Llama-2 and OPT models with up to 30 billion parameters. Our code is available at \href{https://github.com/abdelfattah-lab/shadow_llm/}{ShadowLLM}.
Authors: Katherine M. Collins, Najoung Kim, Yonatan Bitton, Verena Rieser, Shayegan Omidshafiei, Yushi Hu, Sherol Chen, Senjuti Dutta, Minsuk Chang, Kimin Lee, Youwei Liang, Georgina Evans, Sahil Singla, Gang Li, Adrian Weller, Junfeng He, Deepak Ramachandran, Krishnamurthy Dj Dvijotham
Abstract: Human feedback plays a critical role in learning and refining reward models for text-to-image generation, but the optimal form the feedback should take for learning an accurate reward function has not been conclusively established. This paper investigates the effectiveness of fine-grained feedback which captures nuanced distinctions in image quality and prompt-alignment, compared to traditional coarse-grained feedback (for example, thumbs up/down or ranking between a set of options). While fine-grained feedback holds promise, particularly for systems catering to diverse societal preferences, we show that demonstrating its superiority to coarse-grained feedback is not automatic. Through experiments on real and synthetic preference data, we surface the complexities of building effective models due to the interplay of model choice, feedback type, and the alignment between human judgment and computational interpretation. We identify key challenges in eliciting and utilizing fine-grained feedback, prompting a reassessment of its assumed benefits and practicality. Our findings -- e.g., that fine-grained feedback can lead to worse models for a fixed budget, in some settings; however, in controlled settings with known attributes, fine grained rewards can indeed be more helpful -- call for careful consideration of feedback attributes and potentially beckon novel modeling approaches to appropriately unlock the potential value of fine-grained feedback in-the-wild.
Authors: Moritz Vandenhirtz, Sonia Laguna, Ri\v{c}ards Marcinkevi\v{c}s, Julia E. Vogt
Abstract: Concept Bottleneck Models (CBMs) have emerged as a promising interpretable method whose final prediction is based on intermediate, human-understandable concepts rather than the raw input. Through time-consuming manual interventions, a user can correct wrongly predicted concept values to enhance the model's downstream performance. We propose Stochastic Concept Bottleneck Models (SCBMs), a novel approach that models concept dependencies. In SCBMs, a single-concept intervention affects all correlated concepts, thereby improving intervention effectiveness. Unlike previous approaches that model the concept relations via an autoregressive structure, we introduce an explicit, distributional parameterization that allows SCBMs to retain the CBMs' efficient training and inference procedure. Additionally, we leverage the parameterization to derive an effective intervention strategy based on the confidence region. We show empirically on synthetic tabular and natural image datasets that our approach improves intervention effectiveness significantly. Notably, we showcase the versatility and usability of SCBMs by examining a setting with CLIP-inferred concepts, alleviating the need for manual concept annotations.
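As a toy illustration of how an explicit distributional parameterization lets a single intervention propagate to correlated concepts, the sketch below conditions a multivariate Gaussian over concept logits on one intervened concept; SCBMs' concrete parameterization and confidence-region intervention strategy differ in detail.

import numpy as np

def intervene(mu, Sigma, idx, value):
    # Condition a Gaussian over concept logits on concept `idx` being set to
    # `value`; correlated concepts shift via the conditional mean (illustrative only).
    mask = np.arange(len(mu)) != idx
    shifted = mu[mask] + Sigma[mask, idx] / Sigma[idx, idx] * (value - mu[idx])
    new_mu = mu.copy()
    new_mu[mask] = shifted
    new_mu[idx] = value
    return new_mu

mu = np.array([0.2, -0.5, 1.0])                 # predicted concept logits
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])             # learned concept covariance
print(intervene(mu, Sigma, idx=0, value=3.0))   # concept 1 moves with concept 0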
Authors: Yali Fu, Jindong Li, Jiahong Liu, Qianli Xing, Qi Wang, Irwin King
Abstract: Unsupervised graph-level anomaly detection (UGAD) has garnered increasing attention in recent years due to its significance. Most existing methods that rely on traditional GNNs mainly consider pairwise relationships between first-order neighbors, which is insufficient to capture the complex high-order dependencies often associated with anomalies. This limitation underscores the necessity of exploring high-order node interactions in UGAD. In addition, most previous works ignore the underlying properties (e.g., hierarchy and power-law structure) which are common in real-world graph datasets and therefore are indispensable factors in the UGAD task. In this paper, we propose a novel Dual Hyperbolic Contrastive Learning for Unsupervised Graph-Level Anomaly Detection (HC-GLAD in short). To exploit high-order node group information, we construct hypergraphs based on pre-designed gold motifs and subsequently perform hypergraph convolution. Furthermore, to preserve the hierarchy of real-world graphs, we introduce hyperbolic geometry into this field and conduct both graph and hypergraph embedding learning in hyperbolic space with the hyperboloid model. To the best of our knowledge, this is the first work to simultaneously apply hypergraph with node group information and hyperbolic geometry in this field. Extensive experiments on 13 real-world datasets of different fields demonstrate the superiority of HC-GLAD on the UGAD task. The code is available at https://github.com/Yali-F/HC-GLAD.
Authors: Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, Huaxiu Yao
Abstract: The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has enhanced medical diagnosis. However, current Med-LVLMs frequently encounter factual issues, often generating responses that do not align with established medical facts. Retrieval-Augmented Generation (RAG), which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges. First, limited retrieved contexts might not cover all necessary information, while excessive retrieval can introduce irrelevant and inaccurate references, interfering with the model's generation. Second, in cases where the model originally responds correctly, applying RAG can lead to an over-reliance on retrieved contexts, resulting in incorrect answers. To address these issues, we propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the calibrated selection of the number of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model, balancing its dependence on inherent knowledge and retrieved contexts for generation. We demonstrate the effectiveness of RULE on medical VQA and report generation tasks across three datasets, achieving an average improvement of 47.4% in factual accuracy. We publicly release our benchmark and code in https://github.com/richard-peng-xia/RULE.
Authors: Onur Efe, Arkadas Ozakin
Abstract: While modern deep learning methods have shown great promise in the problem of earthquake detection, the most successful methods so far have been based on supervised learning, which requires large datasets with ground-truth labels. The curation of such datasets is both time consuming and prone to systematic biases, which result in difficulties with cross-dataset generalization, hindering general applicability. In this paper, we develop an unsupervised method for earthquake detection that learns to detect earthquakes from raw waveforms, without access to ground truth labels. The performance is comparable to, and in some cases better than, some state-of-the-art supervised methods. Moreover, the method has strong \emph{cross-dataset generalization} performance. The algorithm utilizes deep autoencoders that learn to reproduce the waveforms after a data-compressive bottleneck and uses a simple, cross-covariance-based triggering algorithm at the bottleneck for labeling. The approach has the potential to be useful for time series datasets from other domains.
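A toy sketch of the triggering idea only: two bottleneck code sequences (here synthetic, standing in for the outputs of the trained autoencoder on two channels, an assumption on our part) are compared via normalized cross-covariance, and a detection is declared when the peak exceeds a threshold.

import numpy as np

def trigger(code_a, code_b, threshold=0.6):
    # Normalized cross-covariance between two bottleneck code sequences;
    # toy illustration, not the authors' exact triggering rule.
    a = (code_a - code_a.mean()) / (code_a.std() + 1e-12)
    b = (code_b - code_b.mean()) / (code_b.std() + 1e-12)
    xcov = np.correlate(a, b, mode="full") / len(a)
    return xcov.max() > threshold, xcov.max()

rng = np.random.default_rng(1)
common = np.sin(np.linspace(0, 20, 500))                   # shared event signature
code_a = common + 0.3 * rng.normal(size=500)               # channel A bottleneck code
code_b = np.roll(common, 7) + 0.3 * rng.normal(size=500)   # channel B, lagged
print(trigger(code_a, code_b))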
Authors: Cheems Wang, Yiqin Lv, Yixiu Mao, Yun Qu, Yi Xu, Xiangyang Ji
Abstract: Meta-learning is a practical learning paradigm to transfer skills across tasks from a few examples. Nevertheless, the existence of task distribution shifts tends to weaken meta-learners' generalization capability, particularly when the task distribution is naively hand-crafted or based on simple priors that fail to cover typical scenarios sufficiently. Here, we explicitly model the task distribution with a generative model placed over task identifiers and propose to robustify fast adaptation through adversarial training. Our approach, which can be interpreted as a Stackelberg game, not only uncovers the task structure during problem-solving from an explicit generative model but also theoretically increases adaptation robustness in the worst case. This work has practical implications, particularly in dealing with task distribution shifts in meta-learning, and contributes theoretical insights to the field. Our method demonstrates its robustness in the presence of task subpopulation shifts and improved performance over SOTA baselines in extensive experiments. The project is available at https://sites.google.com/view/ar-metalearn.
Authors: Jinluan Yang, Zhengyu Chen, Teng Xiao, Wenqiao Zhang, Yong Lin, Kun Kuang
Abstract: Heterophilic Graph Neural Networks (HGNNs) have shown promising results for semi-supervised learning tasks on graphs. Notably, most real-world heterophilic graphs are composed of a mixture of nodes with different neighbor patterns, exhibiting local node-level homophilic and heterophilic structures. However, existing works are only devoted to designing better HGNN backbones or architectures for node classification tasks on heterophilic and homophilic graph benchmarks simultaneously, and their analyses of HGNN performance with respect to nodes are only based on the determined data distribution without exploring the effect caused by this structural difference between training and testing nodes. How to learn invariant node representations on heterophilic graphs to handle this structure difference or distribution shifts remains unexplored. In this paper, we first discuss the limitations of previous graph-based invariant learning methods from the perspective of data augmentation. Then, we propose \textbf{HEI}, a framework capable of generating invariant node representations through incorporating heterophily information to infer latent environments without augmentation, which are then used for invariant prediction, under heterophilic graph structure distribution shifts. We theoretically show that our proposed method can achieve guaranteed performance under heterophilic graph structure distribution shifts. Extensive experiments on various benchmarks and backbones can also demonstrate the effectiveness of our method compared with existing state-of-the-art baselines.
Authors: Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, Ricky T. Q. Chen
Abstract: Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there have not been many theoretically-sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specific memoryless noise schedule must be enforced during fine-tuning, in order to account for the dependency between the noise variable and the generated samples. We also propose a new algorithm named Adjoint Matching which outperforms existing SOC algorithms, by casting SOC problems as a regression problem. We find that our approach significantly improves over existing methods for reward fine-tuning, achieving better consistency, realism, and generalization to unseen human preference reward models, while retaining sample diversity.
Authors: Gokul Bhusal, Kevin Miller, Ekaterina Merkurjev
Abstract: Active learning enhances the performance of machine learning methods, particularly in semi-supervised cases, by judiciously selecting a limited number of unlabeled data points for labeling, with the goal of improving the performance of an underlying classifier. In this work, we introduce the Multiclass Active Learning with Auction Dynamics on Graphs (MALADY) framework which leverages the auction dynamics algorithm on similarity graphs for efficient active learning. In particular, we generalize the auction dynamics algorithm on similarity graphs for semi-supervised learning in [24] to incorporate a more general optimization functional. Moreover, we introduce a novel active learning acquisition function that uses the dual variable of the auction algorithm to measure the uncertainty in the classifier to prioritize queries near the decision boundaries between different classes. Lastly, using experiments on classification tasks, we evaluate the performance of our proposed method and show that it exceeds that of comparison algorithms.
Authors: Yubo Li, Saba Al-Sayouri, Rema Padman
Abstract: This study explores the potential of utilizing administrative claims data, combined with advanced machine learning and deep learning techniques, to predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major health insurance organization to develop prediction models for multiple observation windows using traditional machine learning methods such as Random Forest and XGBoost as well as deep learning approaches such as Long Short-Term Memory (LSTM) networks. Our findings demonstrate that the LSTM model, particularly with a 24-month observation window, exhibits superior performance in predicting ESRD progression, outperforming existing models in the literature. We further apply SHapley Additive exPlanations (SHAP) analysis to enhance interpretability, providing insights into the impact of individual features on predictions at the individual patient level. This study underscores the value of leveraging administrative claims data for CKD management and predicting ESRD progression.
Authors: Micha{\l} Wili\'nski, Mononito Goswami, Nina \.Zukowska, Willa Potosnak, Artur Dubrawski
Abstract: Time series foundation models (TSFMs) promise to be powerful tools for a wide range of applications. However, their internal representations and learned concepts are still not well understood. In this study, we investigate the structure and redundancy of representations across various TSFMs, examining the self-similarity of model layers within and across different model sizes. This analysis reveals block-like redundancy in the representations, which can be utilized for informed pruning to improve inference speed and efficiency. Additionally, we explore the concepts learned by these models - such as periodicity and trends - and how these can be manipulated through latent space steering to influence model behavior. Our experiments show that steering interventions can introduce new features, e.g., adding periodicity or trends to signals that initially lacked them. These findings underscore the value of representational analysis for optimizing models and demonstrate how conceptual steering offers new possibilities for more controlled and efficient time series analysis with TSFMs.
Authors: East Sun, Yan Wang, Lan Tian
Abstract: We introduce Block-Attention, an attention mechanism designed to address the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios. Traditional approaches often encode the entire context. Instead, Block-Attention divides retrieved documents into discrete blocks, with each block independently calculating key-value (KV) states except for the final block. In RAG scenarios, by defining each passage as a block, Block-Attention enables us to reuse the KV states of passages that have been seen before, thereby significantly reducing the latency and the computation overhead during inference. The implementation of Block-Attention involves block segmentation, position re-encoding, and fine-tuning the LLM to adapt to the Block-Attention mechanism. Experiments on four RAG benchmarks demonstrate that after block fine-tuning, the Block-Attention model achieves performance comparable to self-attention models (68.4\% vs 67.9\% on Llama3) or even superior performance (62.8\% vs 59.6\% on Mistral). Notably, Block-Attention significantly reduces the time to first token (TTFT) and floating point operations (FLOPs) to a very low level. It only takes 45 ms to output the first token for an input sequence with a total length of 32K. Compared to the self-attention models, the time consumption and corresponding FLOPs are reduced by 98.7\% and 99.8\%, respectively.
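A minimal single-head sketch of the KV-reuse idea: each retrieved passage's key-value states are computed independently (and hence cacheable), and only the final block attends over the concatenation. Position re-encoding, causal masking, and the fine-tuning step are omitted.

import numpy as np

rng = np.random.default_rng(0)
d = 64
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def kv_states(block_tokens):
    # KV states for one block, computed independently of other blocks,
    # so they can be cached and reused whenever the same passage is retrieved.
    return block_tokens @ Wk, block_tokens @ Wv

def final_block_attention(final_tokens, cached_kv):
    # Only the final block attends over its own KV plus all cached blocks.
    K = np.concatenate([kv[0] for kv in cached_kv] + [final_tokens @ Wk])
    V = np.concatenate([kv[1] for kv in cached_kv] + [final_tokens @ Wv])
    Q = final_tokens @ Wq
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

passages = [rng.normal(size=(128, d)) for _ in range(3)]    # retrieved blocks
cache = [kv_states(p) for p in passages]                     # reusable KV cache
question = rng.normal(size=(16, d))                           # final block
print(final_block_attention(question, cache).shape)           # (16, 64)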
Authors: Mushir Akhtar, Ritik Mishra, M. Tanveer, Mohd. Arshad
Abstract: Random vector functional link (RVFL), a variant of single-layer feedforward neural network (SLFN), has garnered significant attention due to its lower computational cost and robustness to overfitting. Despite its advantages, the RVFL network's reliance on the square error loss function makes it highly sensitive to outliers and noise, leading to degraded model performance in real-world applications. To remedy it, we propose the incorporation of the HawkEye loss (H-loss) function into the RVFL framework. The H-loss function features nice mathematical properties, including smoothness and boundedness, while simultaneously incorporating an insensitive zone. Each characteristic brings its own advantages: 1) Boundedness limits the impact of extreme errors, enhancing robustness against outliers; 2) Smoothness facilitates the use of gradient-based optimization algorithms, ensuring stable and efficient convergence; and 3) The insensitive zone mitigates the effect of minor discrepancies and noise. Leveraging the H-loss function, we embed it into the RVFL framework and develop a novel robust RVFL model termed H-RVFL. Notably, this work addresses a significant gap, as no bounded loss function has been incorporated into RVFL to date. The non-convex optimization of the proposed H-RVFL is effectively addressed by the Nesterov accelerated gradient (NAG) algorithm, whose computational complexity is also discussed. The proposed H-RVFL model's effectiveness is validated through extensive experiments on $40$ benchmark datasets from UCI and KEEL repositories, with and without label noise. The results highlight significant improvements in robustness and efficiency, establishing the H-RVFL model as a powerful tool for applications in noisy and outlier-prone environments.
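A toy sketch of the overall construction, with one loud caveat: the loss below is a generic bounded, smooth surrogate with an insensitive zone, not the actual HawkEye loss, and the RVFL dimensions and NAG hyperparameters are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

def bounded_loss_grad(err, eps=0.1):
    # Bounded, smooth loss with an insensitive zone (a stand-in surrogate,
    # not the exact H-loss): L(e) = 1 - exp(-max(|e| - eps, 0)^2).
    m = np.maximum(np.abs(err) - eps, 0.0)
    loss = 1.0 - np.exp(-m ** 2)
    grad = np.exp(-m ** 2) * 2.0 * m * np.sign(err)
    return loss.mean(), grad / len(err)

# Random vector functional link: fixed random hidden layer plus direct links.
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
y[:10] += 20.0                                   # inject gross outliers
W, b = rng.normal(size=(5, 50)), rng.normal(size=50)
D = np.hstack([X, 1.0 / (1.0 + np.exp(-(X @ W + b))), np.ones((200, 1))])

# Nesterov accelerated gradient on the (non-convex) bounded loss.
beta = np.zeros(D.shape[1])
v = np.zeros_like(beta)
momentum, lr = 0.9, 0.05
for _ in range(500):
    look = beta + momentum * v
    _, g_err = bounded_loss_grad(D @ look - y)
    v = momentum * v - lr * (D.T @ g_err)
    beta = beta + v
print(bounded_loss_grad(D @ beta - y)[0])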
Authors: Timo Braun, Anders Kvellestad, Riccardo De Bin
Abstract: We introduce GPTreeO, a flexible R package for scalable Gaussian process (GP) regression, particularly tailored to continual learning problems. GPTreeO builds upon the Dividing Local Gaussian Processes (DLGP) algorithm, in which a binary tree of local GP regressors is dynamically constructed using a continual stream of input data. In GPTreeO we extend the original DLGP algorithm by allowing continual optimisation of the GP hyperparameters, incorporating uncertainty calibration, and introducing new strategies for how the local partitions are created. Moreover, the modular code structure allows users to interface their favourite GP library to perform the local GP regression in GPTreeO. The flexibility of GPTreeO gives the user fine-grained control of the balance between computational speed, accuracy, stability and smoothness. We conduct a sensitivity analysis to show how GPTreeO's configurable features impact the regression performance in a continual learning setting.
Authors: Jie Shen, Xiaoyu Li
Abstract: Understanding noise tolerance of learning algorithms under certain conditions is a central quest in learning theory. In this work, we study the problem of computationally efficient PAC learning of halfspaces in the presence of malicious noise, where an adversary can corrupt both instances and labels of training samples. The best-known noise tolerance either depends on a target error rate under distributional assumptions or on a margin parameter under large-margin conditions. In this work, we show that when both types of conditions are satisfied, it is possible to achieve {\em constant} noise tolerance by minimizing a reweighted hinge loss. Our key ingredients include: 1) an efficient algorithm that finds weights to control the gradient deterioration from corrupted samples, and 2) a new analysis on the robustness of the hinge loss equipped with such weights.
Authors: Tao Ren, Zishi Zhang, Jinyang Jiang, Guanghao Li, Zeliang Zhang, Mingqian Feng, Yijie Peng
Abstract: Given the limitations of backpropagation, perturbation-based gradient computation methods have recently gained attention for learning with only forward passes, also referred to as queries. Conventional forward learning consumes enormous queries on each data point for accurate gradient estimation through Monte Carlo sampling, which hinders the scalability of those algorithms. However, not all data points deserve equal queries for gradient estimation. In this paper, we study the problem of improving the forward learning efficiency from a novel perspective: how to reduce the gradient estimation variance with minimum cost? For this, we propose to allocate the optimal number of queries over each data point in one batch during training to achieve a good balance between estimation accuracy and computational efficiency. Specifically, with a simplified proxy objective and a reparameterization technique, we derive a novel plug-and-play query allocator with minimal parameters. We provide theoretical results verifying its optimality. We conduct extensive experiments for fine-tuning Vision Transformers on various datasets and further deploy the allocator to two black-box applications: prompt tuning and multimodal alignment for foundation models. All findings demonstrate that our proposed allocator significantly enhances the scalability of forward-learning algorithms, paving the way for real-world applications.
Authors: Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, Anima Anandkumar
Abstract: Large Language Models (LLMs) have been successful in mathematical reasoning tasks such as formal theorem proving when integrated with interactive proof assistants like Lean. Existing approaches involve training or fine-tuning an LLM on a specific dataset to perform well on particular domains, such as undergraduate-level mathematics. These methods struggle with generalizability to advanced mathematics. A fundamental limitation is that these approaches operate on static domains, failing to capture how mathematicians often work across multiple domains and projects simultaneously or cyclically. We present LeanAgent, a novel lifelong learning framework for theorem proving that continuously generalizes to and improves on ever-expanding mathematical knowledge without forgetting previously learned knowledge. LeanAgent introduces several key innovations, including a curriculum learning strategy that optimizes the learning trajectory in terms of mathematical difficulty, a dynamic database for efficient management of evolving mathematical knowledge, and progressive training to balance stability and plasticity. LeanAgent successfully proves 162 theorems previously unproved by humans across 23 diverse Lean repositories, many from advanced mathematics. It performs significantly better than the static LLM baseline, proving challenging theorems in domains like abstract algebra and algebraic topology while showcasing a clear progression of learning from basic concepts to advanced topics. In addition, we analyze LeanAgent's superior performance on key lifelong learning metrics. LeanAgent achieves exceptional scores in stability and backward transfer, where learning new tasks improves performance on previously learned tasks. This emphasizes LeanAgent's continuous generalizability and improvement, explaining its superior theorem-proving performance.
Authors: Mohamamd Zavid Parvez, Rafiqul Islam, Md Zahidul Islam
Abstract: In a hyperconnected environment, medical institutions are particularly concerned with data privacy when sharing and transmitting sensitive patient information due to the risk of data breaches, where malicious actors could intercept sensitive information. A collaborative learning framework, including transfer, federated, and incremental learning, can generate efficient, secure, and scalable models while requiring less computation, maintaining patient data privacy, and ensuring an up-to-date model. This study aims to address the detection of COVID-19 using chest X-ray images through a proposed collaborative learning framework called CL3. Initially, transfer learning is employed, leveraging knowledge from a pre-trained model as the starting global model. Local models from different medical institutes are then integrated, and a new global model is constructed to adapt to any data drift observed in the local models. Additionally, incremental learning is considered, allowing continuous adaptation to new medical data without forgetting previously learned information. Experimental results demonstrate that the CL3 framework achieved a global accuracy of 89.99% when using Xception with a batch size of 16 after being trained for six federated communication rounds. A demo of the CL3 framework is available at https://github.com/zavidparvez/CL3-Collaborative-Approach to ensure reproducibility.
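A simplified stand-in for the aggregation step, assuming plain FedAvg weighting by institute dataset size starting from a pre-trained global model; the incremental-learning and drift-adaptation components of CL3 are not shown.

import numpy as np

def federated_average(local_weights, sizes):
    # Aggregate per-institute model weights into a new global model,
    # weighting each institute by its dataset size (plain FedAvg).
    total = sum(sizes)
    return [sum(w[k] * (s / total) for w, s in zip(local_weights, sizes))
            for k in range(len(local_weights[0]))]

rng = np.random.default_rng(0)
global_model = [rng.normal(size=(8, 4)), rng.normal(size=4)]     # pre-trained start
locals_ = [[w + 0.01 * rng.normal(size=w.shape) for w in global_model]
           for _ in range(3)]                                     # per-institute updates
new_global = federated_average(locals_, sizes=[120, 300, 80])
print([w.shape for w in new_global])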
Authors: Ran Liu, Zhongzhou Liu, Xiaoli Li, Yuan Fang
Abstract: Knowledge graphs (KGs) are instrumental in various real-world applications, yet they often suffer from incompleteness due to missing relations. To predict instances for novel relations with limited training examples, few-shot relation learning approaches have emerged, utilizing techniques such as meta-learning. However, the assumption is that novel relations in meta-testing and base relations in meta-training are independently and identically distributed, which may not hold in practice. To address the limitation, we propose RelAdapter, a context-aware adapter for few-shot relation learning in KGs designed to enhance the adaptation process in meta-learning. First, RelAdapter is equipped with a lightweight adapter module that facilitates relation-specific, tunable adaptation of meta-knowledge in a parameter-efficient manner. Second, RelAdapter is enriched with contextual information about the target relation, enabling enhanced adaptation to each distinct relation. Extensive experiments on three benchmark KGs validate the superiority of RelAdapter over state-of-the-art methods.
Authors: Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Li Shen
Abstract: Harmful fine-tuning attacks pose a serious threat to online fine-tuning services. Vaccine, a recent alignment-stage defense, applies uniform perturbation to all layers of embedding to make the model robust to the simulated embedding drift. However, applying layer-wise uniform perturbation may lead to excess perturbations for some particular safety-irrelevant layers, resulting in defense performance degradation and unnecessary memory consumption. To address this limitation, we propose Targeted Vaccine (T-Vaccine), a memory-efficient safety alignment method that applies perturbation to only selected layers of the model. T-Vaccine follows two core steps: First, it uses gradient norm as a statistical metric to identify the safety-critical layers. Second, instead of applying uniform perturbation across all layers, T-Vaccine only applies perturbation to the safety-critical layers while keeping other layers frozen during training. Results show that T-Vaccine outperforms Vaccine in terms of both defense effectiveness and resource efficiency. Comparisons with other defense baselines, e.g., RepNoise and TAR, also demonstrate the superiority of T-Vaccine. Notably, T-Vaccine is the first defense that can address harmful fine-tuning issues for 7B pre-trained models trained on consumer GPUs with limited memory (e.g., RTX 4090). Our code is available at https://github.com/Lslland/T-Vaccine.
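A minimal sketch of the layer-selection step on a toy stand-in model: layers are scored by gradient norm on (stand-in) alignment data and only the top-k are kept trainable; the embedding perturbation that T-Vaccine then applies to the selected layers is not reproduced here.

import torch
import torch.nn as nn

# Toy stand-in for an LLM: a stack of linear "layers" (illustrative only).
model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(8)])
x = torch.randn(16, 64)
target = torch.randn(16, 64)

# Step 1: score layers by gradient norm on stand-in alignment data.
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
scores = [layer.weight.grad.norm().item() for layer in model]
k = 3
critical = sorted(range(len(model)), key=lambda i: -scores[i])[:k]
print("safety-critical layers:", sorted(critical))

# Step 2: only the selected layers stay trainable (and would receive the
# perturbation); the rest are frozen, saving memory during alignment training.
for i, layer in enumerate(model):
    for p in layer.parameters():
        p.requires_grad_(i in critical)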
Authors: Andrew Jesson, Yiding Jiang
Abstract: We demonstrate that recent advances in reinforcement learning (RL), combined with simple architectural changes, significantly improve generalization on the ProcGen benchmark. These changes are frame stacking, replacing 2D convolutional layers with 3D convolutional layers, and scaling up the number of convolutional kernels per layer. Experimental results using a single set of hyperparameters across all environments show a 37.9\% reduction in the optimality gap compared to the baseline (from 0.58 to 0.36). This performance matches or exceeds current state-of-the-art methods. The proposed changes are largely orthogonal and therefore complementary to the existing approaches for improving generalization in RL, and our results suggest that further exploration in this direction could yield substantial improvements in addressing generalization challenges in deep reinforcement learning.
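A minimal sketch of the two architectural ideas, frame stacking plus 3D convolutions, on a toy encoder; kernel sizes and widths are arbitrary and not the paper's configuration.

import torch
import torch.nn as nn

class Stacked3DEncoder(nn.Module):
    # Toy encoder for frame-stacked observations: the stack of T frames is
    # treated as a depth dimension and processed with 3D convolutions.
    def __init__(self, channels=3, kernels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, kernels, kernel_size=(2, 5, 5), stride=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(kernels, kernels, kernel_size=(2, 3, 3), stride=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )

    def forward(self, obs):                     # obs: (batch, C, T, H, W)
        return self.net(obs)

enc = Stacked3DEncoder()
obs = torch.randn(8, 3, 4, 64, 64)             # 4 stacked RGB frames
print(enc(obs).shape)                            # torch.Size([8, 64])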
Authors: Christofel Rio Goenawan
Abstract: In the modern world, the development of Artificial Intelligence (AI) has contributed to improvements in various areas, including automation, computer vision, fraud detection, and more. AI can be leveraged to enhance the efficiency of Autonomous Smart Traffic Management (ASTM) systems and reduce traffic congestion rates. This paper presents an ASTM system that uses AI to improve traffic flow rates. The system employs the YOLO V5 Convolutional Neural Network to detect vehicles in traffic management images. Additionally, it predicts the number of vehicles for the next 12 hours using a Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM). The Smart Traffic Management Cycle Length Analysis manages the traffic cycle length based on these vehicle predictions, aided by AI. The RNN-LSTM model predicts vehicle numbers over the next 12 hours with a Mean Squared Error (MSE) of 4.521 and a Root Mean Squared Error (RMSE) of 2.232 vehicles. After simulating the ASTM system in the CARLA simulation environment, we found that the Traffic Management Congestion Flow Rate with ASTM (21 vehicles per minute) is 50\% higher than the rate without it (around 15 vehicles per minute). Additionally, the Traffic Management Vehicle Pass Delay with ASTM (5 seconds per vehicle) is 70\% lower than without it (around 12 seconds per vehicle). These results demonstrate that the ASTM system using AI can increase traffic flow by 50\% and reduce vehicle pass delays by 70\%.
Authors: Xinran Liu, Roc\'io D\'iaz Mart\'in, Yikun Bai, Ashkan Shahbazi, Matthew Thorpe, Akram Aldroubi, Soheil Kolouri
Abstract: The optimal transport (OT) problem has gained significant traction in modern machine learning for its ability to: (1) provide versatile metrics, such as Wasserstein distances and their variants, and (2) determine optimal couplings between probability measures. To reduce the computational complexity of OT solvers, methods like entropic regularization and sliced optimal transport have been proposed. The sliced OT framework improves efficiency by comparing one-dimensional projections (slices) of high-dimensional distributions. However, despite their computational efficiency, sliced-Wasserstein approaches lack a transportation plan between the input measures, limiting their use in scenarios requiring explicit coupling. In this paper, we address two key questions: Can a transportation plan be constructed between two probability measures using the sliced transport framework? If so, can this plan be used to define a metric between the measures? We propose a "lifting" operation to extend one-dimensional optimal transport plans back to the original space of the measures. By computing the expectation of these lifted plans, we derive a new transportation plan, termed expected sliced transport (EST) plans. We prove that using the EST plan to weight the sum of the individual Euclidean costs for moving from one point to another results in a valid metric between the input discrete probability measures. We demonstrate the connection between our approach and the recently proposed min-SWGG, along with illustrative numerical examples that support our theoretical findings.
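A minimal sketch for two discrete uniform measures of equal size, where each 1D optimal plan is a sorting permutation that is lifted back to the original space and averaged over random slices; the general construction and the proof that the resulting cost is a metric are in the paper.

import numpy as np

rng = np.random.default_rng(0)
n, d, n_slices = 50, 3, 200
X = rng.normal(size=(n, d))                  # first discrete uniform measure
Y = rng.normal(size=(n, d)) + 1.0            # second discrete uniform measure

# Lift each 1D optimal plan (a sorting permutation for uniform measures)
# and average over random slices.
plan = np.zeros((n, n))
for _ in range(n_slices):
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)
    ix, iy = np.argsort(X @ theta), np.argsort(Y @ theta)
    plan[ix, iy] += 1.0 / (n * n_slices)     # lifted 1D plan, mass 1/n per pair

# `plan` is a valid coupling: rows and columns each sum to 1/n.
print(plan.sum(axis=0).round(3)[:5], plan.sum(axis=1).round(3)[:5])

cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
print("expected sliced transport cost:", (plan * cost).sum())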
Authors: Yingsong Luo, Ling Chen
Abstract: Large language models (LLMs) excel in various tasks but face deployment challenges due to hardware constraints. We propose density-aware post-training weight-only quantization (DAQ), which has two stages: 1) density-centric alignment, which identifies the center of high-density weights and centers the dynamic range on this point to align high-density weight regions with floating-point high-precision regions; 2) learnable dynamic range adjustment, which adjusts the dynamic range by optimizing quantization parameters (i.e., scale and zero-point) based on the impact of weights on the model output. Experiments on LLaMA and LLaMA-2 show that DAQ consistently outperforms the best baseline method, reducing perplexity loss by an average of 22.8% on LLaMA and 19.6% on LLaMA-2. Our code is available at https://github.com/LuoYingSong/DAQ.
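A rough sketch of the stage-1 idea only, assuming the high-density center is taken as the histogram mode and the range is symmetric around it; the learnable dynamic range adjustment of stage 2 and DAQ's exact procedure are not reproduced.

import numpy as np

def density_centered_quant(w, bits=4, grid=512):
    # Center the quantization range on the mode of the weight density,
    # then quantize uniformly to 2**bits levels (illustrative only).
    hist, edges = np.histogram(w, bins=grid)
    center = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])
    half_range = np.max(np.abs(w - center))
    levels = 2 ** bits
    scale = 2 * half_range / (levels - 1)
    q = np.clip(np.round((w - center) / scale), -(levels // 2), levels // 2 - 1)
    return q * scale + center                 # dequantized weights

rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(0.02, 0.01, 9000), rng.normal(0.3, 0.05, 1000)])
w_q = density_centered_quant(w)
print("mse:", np.mean((w - w_q) ** 2))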
Authors: Rustem Islamov, Niccol\`o Ajroldi, Antonio Orvieto, Aurelien Lucchi
Abstract: Optimization methods play a crucial role in modern machine learning, powering the remarkable empirical achievements of deep learning models. These successes are even more remarkable given the complex non-convex nature of the loss landscape of these models. Yet, ensuring the convergence of optimization methods requires specific structural conditions on the objective function that are rarely satisfied in practice. One prominent example is the widely recognized Polyak-Lojasiewicz (PL) inequality, which has gained considerable attention in recent years. However, validating such assumptions for deep neural networks entails substantial and often impractical levels of over-parametrization. In order to address this limitation, we propose a novel class of functions that can characterize the loss landscape of modern deep models without requiring extensive over-parametrization and can also include saddle points. Crucially, we prove that gradient-based optimizers possess theoretical guarantees of convergence under this assumption. Finally, we validate the soundness of our new function class through both theoretical analysis and empirical experimentation across a diverse range of deep learning models.
Authors: Sameep Chattopadhyay, Pulkit Paliwal, Sai Shankar Narasimhan, Shubhankar Agarwal, Sandeep P. Chinchali
Abstract: Time series forecasts are often influenced by exogenous contextual features in addition to their corresponding history. For example, in financial settings, it is hard to accurately predict a stock price without considering public sentiments and policy decisions in the form of news articles, tweets, etc. Though this is common knowledge, the current state-of-the-art (SOTA) forecasting models fail to incorporate such contextual information, owing to its heterogeneity and multimodal nature. To address this, we introduce ContextFormer, a novel plug-and-play method to surgically integrate multimodal contextual information into existing pre-trained forecasting models. ContextFormer effectively distills forecast-specific information from rich multimodal contexts, including categorical, continuous, time-varying, and even textual information, to significantly enhance the performance of existing base forecasters. ContextFormer outperforms SOTA forecasting models by up to 30% on a range of real-world datasets spanning energy, traffic, environmental, and financial domains.
Authors: Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao
Abstract: Recent self-rewarding large language models (LLMs) have successfully applied LLM-as-a-Judge to iteratively improve alignment performance without the need for human annotations of preference data. These methods commonly utilize the same LLM to act as both the policy model (which generates responses) and the reward model (which scores and ranks those responses). The ranked responses are then used as preference pairs to train the LLM via direct alignment technologies (e.g., DPO). However, throughout this process there is no guarantee that the rewarding and ranking are accurate, which is critical for obtaining accurate rewards and high-quality preference data. Empirical results from relatively small LLMs (e.g., 7B parameters) also indicate that improvements from self-rewarding may diminish after several iterations in certain situations, which we hypothesize is due to accumulated bias in the reward system. This bias can lead to unreliable preference data for training the LLM. To address this issue, we first formulate and analyze the generalized iterative preference fine-tuning framework for self-rewarding language models. We then introduce regularization into this generalized framework to mitigate overconfident preference labeling in the self-rewarding process. Based on this theoretical insight, we propose a Consistency Regularized sElf-rewarding lAnguage Model (CREAM) that leverages the rewarding consistency across different iterations to regularize the self-rewarding training, helping the model to learn from more reliable preference data. With this explicit regularization, our empirical results demonstrate the superiority of CREAM in improving both reward consistency and alignment performance. The code is publicly available at https://github.com/Raibows/CREAM.
Authors: Kangle Deng, Andrew Liu, Jun-Yan Zhu, Deva Ramanan
Abstract: A commonly observed failure mode of Neural Radiance Field (NeRF) is fitting incorrect geometries when given an insufficient number of input views. One potential reason is that standard volumetric rendering does not enforce the constraint that most of a scene's geometry consist of empty space and opaque surfaces. We formalize the above assumption through DS-NeRF (Depth-supervised Neural Radiance Fields), a loss for learning radiance fields that takes advantage of readily-available depth supervision. We leverage the fact that current NeRF pipelines require images with known camera poses that are typically estimated by running structure-from-motion (SFM). Crucially, SFM also produces sparse 3D points that can be used as "free" depth supervision during training: we add a loss to encourage the distribution of a ray's terminating depth to match a given 3D keypoint, incorporating depth uncertainty. DS-NeRF can render better images given fewer training views while training 2-3x faster. Further, we show that our loss is compatible with other recently proposed NeRF methods, demonstrating that depth is a cheap and easily digestible supervisory signal. Finally, we find that DS-NeRF can support other types of depth supervision such as scanned depth sensors and RGB-D reconstruction outputs.
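A simple surrogate of the depth supervision idea: penalize the ray termination weights by how unlikely their depths are under a Gaussian centered at the SFM keypoint depth; the paper's exact KL-based formulation differs in detail.

import numpy as np

def depth_supervision_loss(t, w, z_keypoint, sigma):
    # Encourage the ray termination distribution w (over sample depths t)
    # to concentrate near the SFM keypoint depth z_keypoint with std sigma.
    log_p = -0.5 * ((t - z_keypoint) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return -(w * log_p).sum()

t = np.linspace(0.5, 5.0, 128)                        # sample depths along a ray
w_good = np.exp(-0.5 * ((t - 2.0) / 0.05) ** 2)
w_good /= w_good.sum()                                 # concentrated termination
w_bad = np.ones_like(t) / len(t)                       # diffuse termination
print(depth_supervision_loss(t, w_good, z_keypoint=2.0, sigma=0.1))
print(depth_supervision_loss(t, w_bad, z_keypoint=2.0, sigma=0.1))   # larger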
Authors: Youwei Huang, Sen Fang, Jianwen Li, Jiachun Tao, Bin Hu, Tao Zhang
Abstract: In recent years, researchers in the software security field have focused on detecting vulnerabilities in smart contracts to avoid significant losses of crypto assets on the blockchain. Despite early successes in this domain, detecting developers' intents in smart contracts is a more pressing issue, as malicious intents have resulted in substantial financial losses. Unfortunately, existing research lacks effective methods for detecting development intents in smart contracts. To address this gap, we propose \textsc{SmartIntentNN} (Smart Contract Intent Neural Network), a deep learning model designed to automatically detect development intent in smart contracts. \textsc{SmartIntentNN} utilizes a pre-trained sentence encoder to generate contextual representations of smart contract code, a K-means clustering model to identify and highlight prominent intent features, and a bidirectional LSTM-based deep neural network for multi-label classification. We trained and evaluated \textsc{SmartIntentNN} on a dataset comprising over 40,000 real-world smart contracts, employing self-comparison baselines in our experimental setup. The results demonstrate that \textsc{SmartIntentNN} achieves an F1-score of 0.8633 in identifying intents across 10 distinct categories, outperforming all baselines and filling the gap in smart contract detection by incorporating intent analysis.
Authors: G\'abor Lugosi, Gergely Neu
Abstract: We present a new framework for deriving bounds on the generalization error of statistical learning algorithms from the perspective of online learning. Specifically, we construct an online learning game called the "generalization game", where an online learner is trying to compete with a fixed statistical learning algorithm in predicting the sequence of generalization gaps on a training set of i.i.d. data points. We establish a connection between the online and statistical learning settings by showing that the existence of an online learning algorithm with bounded regret in this game implies a bound on the generalization error of the statistical learning algorithm, up to a martingale concentration term that is independent of the complexity of the statistical learning method. This technique allows us to recover several standard generalization bounds including a range of PAC-Bayesian and information-theoretic guarantees, as well as generalizations thereof.
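Schematically, and stated loosely here (the precise definitions of the game, the regret, and the martingale term are those in the paper), results of this type take the form
$$\text{generalization gap of } A(S_n) \;\lesssim\; \frac{\mathrm{Reg}_n}{n} + \frac{M_n}{n},$$
where $A(S_n)$ is the statistical learner trained on $n$ i.i.d. points, $\mathrm{Reg}_n$ is the online learner's regret in the generalization game, and $M_n$ is a martingale concentration term independent of the complexity of $A$.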
Authors: Leonard Tang, Dan Ley
Abstract: It is well-known that modern computer vision systems often exhibit behaviors misaligned with those of humans: from adversarial attacks to image corruptions, deep learning vision models suffer in a variety of settings that humans capably handle. In light of these phenomena, here we introduce another, orthogonal perspective studying the human-machine vision gap. We revisit the task of recovering images under degradation, first introduced over 30 years ago in the Recognition-by-Components theory of human vision. Specifically, we study the performance and behavior of neural networks on the seemingly simple task of classifying regular polygons at varying orders of degradation along their perimeters. To this end, we implement the Automated Shape Recoverability Test for rapidly generating large-scale datasets of perimeter-degraded regular polygons, modernizing the historically manual creation of image recoverability experiments. We then investigate the capacity of neural networks to recognize and recover such degraded shapes when initialized with different priors. Ultimately, we find that neural networks' behavior on this simple task conflicts with human behavior, raising a fundamental question of the robustness and learning capabilities of modern computer vision models.
Authors: Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, Kfir Aberman
Abstract: Personalization has emerged as a prominent aspect within the field of generative AI, enabling the synthesis of individuals in diverse contexts and styles, while retaining high-fidelity to their identities. However, the process of personalization presents inherent challenges in terms of time and memory requirements. Fine-tuning each personalized model needs considerable GPU time investment, and storing a personalized model per subject can be demanding in terms of storage capacity. To overcome these challenges, we propose HyperDreamBooth - a hypernetwork capable of efficiently generating a small set of personalized weights from a single image of a person. By composing these weights into the diffusion model, coupled with fast finetuning, HyperDreamBooth can generate a person's face in various contexts and styles, with high subject details while also preserving the model's crucial knowledge of diverse styles and semantic modifications. Our method achieves personalization on faces in roughly 20 seconds, 25x faster than DreamBooth and 125x faster than Textual Inversion, using as few as one reference image, with the same quality and style diversity as DreamBooth. Also our method yields a model that is 10,000x smaller than a normal DreamBooth model. Project page: https://hyperdreambooth.github.io
Authors: Riccardo Barbano, Alexander Denker, Hyungjin Chung, Tae Hoon Roh, Simon Arridge, Peter Maass, Bangti Jin, Jong Chul Ye
Abstract: Denoising diffusion models have emerged as the go-to generative framework for solving inverse problems in imaging. A critical concern regarding these models is their performance on out-of-distribution tasks, which remains an under-explored challenge. Using a diffusion model on an out-of-distribution dataset, realistic reconstructions can be generated, but with hallucinating image features that are uniquely present in the training dataset. To address this discrepancy during train-test time and improve reconstruction accuracy, we introduce a novel sampling framework called Steerable Conditional Diffusion. Specifically, this framework adapts the diffusion model, concurrently with image reconstruction, based solely on the information provided by the available measurement. Utilising our proposed method, we achieve substantial enhancements in out-of-distribution performance across diverse imaging modalities, advancing the robust deployment of denoising diffusion models in real-world applications.
Authors: Yixuan Wang, Mingtong Zhang, Zhuoran Li, Tarik Kelestemur, Katherine Driggs-Campbell, Jiajun Wu, Li Fei-Fei, Yunzhu Li
Abstract: Scene representation is a crucial design choice in robotic manipulation systems. An ideal representation is expected to be 3D, dynamic, and semantic to meet the demands of diverse manipulation tasks. However, previous works often lack all three properties simultaneously. In this work, we introduce D$^3$Fields -- dynamic 3D descriptor fields. These fields are implicit 3D representations that take in 3D points and output semantic features and instance masks. They can also capture the dynamics of the underlying 3D environments. Specifically, we project arbitrary 3D points in the workspace onto multi-view 2D visual observations and interpolate features derived from visual foundation models. The resulting fused descriptor fields allow for flexible goal specifications using 2D images with varied contexts, styles, and instances. To evaluate the effectiveness of these descriptor fields, we apply our representation to rearrangement tasks in a zero-shot manner. Through extensive evaluation in real-world and simulated settings, we demonstrate that D$^3$Fields are effective for zero-shot generalizable rearrangement tasks. We also compare D$^3$Fields with state-of-the-art implicit 3D representations and show significant improvements in effectiveness and efficiency.
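A toy sketch of the projection-and-interpolation step for a single view: 3D workspace points are projected with a pinhole model and a 2D feature map is sampled bilinearly; multi-view fusion, instance masks, and the dynamics model are omitted.

import numpy as np

def project_and_sample(points, K, T_cam, feat_map):
    # Project 3D workspace points into one camera view and bilinearly
    # interpolate a per-view 2D feature map at the projected pixels.
    p_cam = T_cam[:3, :3] @ points.T + T_cam[:3, 3:4]        # world -> camera
    uv = (K @ p_cam)[:2] / p_cam[2]                           # pinhole projection
    H, W, C = feat_map.shape
    u = np.clip(uv[0], 0, W - 1 - 1e-6)
    v = np.clip(uv[1], 0, H - 1 - 1e-6)
    u0, v0 = u.astype(int), v.astype(int)
    du, dv = u - u0, v - v0
    f = (feat_map[v0, u0] * ((1 - du) * (1 - dv))[:, None]
         + feat_map[v0, u0 + 1] * (du * (1 - dv))[:, None]
         + feat_map[v0 + 1, u0] * ((1 - du) * dv)[:, None]
         + feat_map[v0 + 1, u0 + 1] * (du * dv)[:, None])
    return f                                                  # (N, C) features

rng = np.random.default_rng(0)
K = np.array([[200.0, 0, 64], [0, 200.0, 64], [0, 0, 1]])
T_cam = np.eye(4)                                             # camera at origin
points = rng.uniform([-0.2, -0.2, 0.8], [0.2, 0.2, 1.2], size=(100, 3))
feat_map = rng.normal(size=(128, 128, 32))                    # per-view features
print(project_and_sample(points, K, T_cam, feat_map).shape)   # (100, 32)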
Authors: Digvijay Wadekar, Tejaswi Venumadhav, Ajit Kumar Mehta, Javier Roulet, Seth Olsen, Jonathan Mushkin, Barak Zackay, Matias Zaldarriaga
Abstract: Searches for gravitational wave events use models, or templates, for the signals of interest. The templates used in current searches in the LIGO-Virgo-Kagra (LVK) data model the dominant quadrupole mode $(\ell,|m|)=(2,2)$ of the signals, and omit sub-dominant higher-order modes (HM) such as $(\ell,|m|)=(3,3)$, $(4,4)$, which are predicted by general relativity. This omission reduces search sensitivity to black hole mergers in interesting parts of parameter space, such as systems with high masses and asymmetric mass-ratios. We develop a new strategy to include HM in template banks: instead of making templates containing a combination of different modes, we separately store normalized templates corresponding to $(2,2)$, $(3,3)$ and $(4,4)$ modes. To model aligned-spin $(3,3)$, $(4,4)$ waveforms corresponding to a given $(2,2)$ waveform, we use a combination of post-Newtonian formulae and machine learning tools. In the matched filtering stage, one can filter each mode separately with the data and collect the timeseries of signal-to-noise ratios (SNR). This leads to a HM template bank whose matched-filtering cost is just $\approx 3\times$ that of a quadrupole-only search (as opposed to $\approx\! 100 \times$ in previously proposed HM search methods). Our method is effectual and generally applicable for template banks constructed with either stochastic or geometric placement techniques. New GW candidate events that we detect using our HM banks and details for combining the different SNR mode timeseries are presented in accompanying papers: Wadekar et al. [1] and [2] respectively. Additionally, we discuss non-linear compression of $(2,2)$-only geometric-placement template banks using machine learning algorithms.
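A toy white-noise sketch of filtering each mode separately and keeping one SNR timeseries per mode; real searches whiten by the detector PSD, normalize the template bank, and combine the per-mode timeseries as described in the companion papers.

import numpy as np

def snr_timeseries(data, template, noise_var=1.0):
    # Matched-filter SNR timeseries of one mode template against the data
    # via FFTs (circular correlation, unit-variance white noise assumed).
    n = len(data)
    corr = np.fft.irfft(np.fft.rfft(data) * np.conj(np.fft.rfft(template)), n)
    return corr / np.sqrt(np.sum(template ** 2) * noise_var)

rng = np.random.default_rng(0)
n = 4096
t = np.arange(n)
modes = {m: np.sin(2 * np.pi * m * 0.01 * t[:256]) * np.hanning(256) for m in (2, 3, 4)}
signal = np.zeros(n)
signal[1000:1256] += 3.0 * modes[2] + 0.5 * modes[3]
data = signal + rng.normal(size=n)

# Filter each mode separately and keep one SNR timeseries per mode;
# downstream logic can then combine the per-mode SNRs.
snr = {m: snr_timeseries(data, np.pad(tmpl, (0, n - 256))) for m, tmpl in modes.items()}
print({m: round(np.abs(s).max(), 2) for m, s in snr.items()})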
Authors: Hanwen Ye, Wenzhuo Zhou, Ruoqing Zhu, Annie Qu
Abstract: Recent advances in dynamic treatment regimes (DTRs) facilitate the search for optimal treatments, which are tailored to individuals' specific needs and able to maximize their expected clinical benefits. However, existing algorithms relying on consistent trajectories, such as inverse probability weighting estimators (IPWEs), could suffer from insufficient sample size under optimal treatments and a growing number of decision-making stages, particularly in the context of chronic diseases. To address these challenges, we propose a novel individualized learning method which estimates the DTR with a focus on prioritizing alignment between the observed treatment trajectory and the one obtained by the optimal regime across decision stages. By relaxing the restriction that the observed trajectory must be fully aligned with the optimal treatments, our approach substantially improves the sample efficiency and stability of IPWE-based methods. In particular, the proposed learning scheme builds a more general framework which includes the popular outcome weighted learning framework as a special case of ours. Moreover, we introduce the notion of stage importance scores along with an attention mechanism to explicitly account for heterogeneity among decision stages. We establish the theoretical properties of the proposed approach, including the Fisher consistency and finite-sample performance bound. Empirically, we evaluate the proposed method in extensive simulated environments and a real case study for the COVID-19 pandemic.
Authors: Haimeng Zhao, Laura Lewis, Ishaan Kannan, Yihui Quek, Hsin-Yuan Huang, Matthias C. Caro
Abstract: While quantum state tomography is notoriously hard, most states hold little interest to practically-minded tomographers. Given that states and unitaries appearing in Nature are of bounded gate complexity, it is natural to ask if efficient learning becomes possible. In this work, we prove that to learn a state generated by a quantum circuit with $G$ two-qubit gates to a small trace distance, a sample complexity scaling linearly in $G$ is necessary and sufficient. We also prove that the optimal query complexity to learn a unitary generated by $G$ gates to a small average-case error scales linearly in $G$. While sample-efficient learning can be achieved, we show that under reasonable cryptographic conjectures, the computational complexity for learning states and unitaries of gate complexity $G$ must scale exponentially in $G$. We illustrate how these results establish fundamental limitations on the expressivity of quantum machine learning models and provide new perspectives on no-free-lunch theorems in unitary learning. Together, our results answer how the complexity of learning quantum states and unitaries relate to the complexity of creating these states and unitaries.
Authors: Che Liu, Cheng Ouyang, Sibo Cheng, Anand Shah, Wenjia Bai, Rossella Arcucci
Abstract: Recently, medical vision-language pre-training (VLP) has made substantial progress in learning global visual representations from medical images and their paired radiology reports. However, medical imaging tasks in the real world usually require finer granularity in visual features. These tasks include visual localization tasks (e.g., semantic segmentation, object detection) and visual grounding tasks. Yet, current medical VLP methods face challenges in learning these fine-grained features, as they primarily focus on brute-force alignment between image patches and individual text tokens for local visual feature learning, which is suboptimal for downstream dense prediction tasks. In this work, we propose a new VLP framework, named \textbf{G}lobal to \textbf{D}ense level representation learning (G2D) that achieves significantly improved granularity and more accurate grounding for the learned features, compared to existing medical VLP approaches. In particular, G2D learns dense and semantically-grounded image representations via a pseudo segmentation task parallel with the global vision-language alignment. Notably, generating pseudo segmentation targets does not incur extra trainable parameters: they are obtained on the fly during VLP with a parameter-free processor. G2D achieves superior performance across 6 medical imaging tasks and 25 diseases, particularly in semantic segmentation, which necessitates fine-grained, semantically-grounded image features. In this task, G2D surpasses peer models even when fine-tuned with just 1\% of the training data, compared to the 100\% used by these models. The code will be released upon acceptance.
Authors: Hanwen Ye, Tatiana Moreno, Adrianne Alpern, Louis Ehwerhemuepha, Annie Qu
Abstract: Mental health diseases affect children's lives and well-being, and have received increased attention since the COVID-19 pandemic. Analyzing psychiatric clinical notes with topic models is critical to evaluating children's mental status over time. However, few topic models are built for longitudinal settings, and most existing approaches fail to capture temporal trajectories for each document. To address these challenges, we develop a dynamic topic model with consistent topics and individualized temporal dependencies on the evolving document metadata. Our model preserves the semantic meaning of discovered topics over time and incorporates heterogeneity among documents. In particular, when documents can be categorized, we propose a classifier-free approach to maximize topic heterogeneity across different document groups. We also present an efficient variational optimization procedure adapted for the multistage longitudinal setting. In this case study, we apply our method to the psychiatric clinical notes from a large tertiary pediatric hospital in Southern California and achieve a 38% increase in the overall coherence of extracted topics. Our real data analysis reveals that children tend to express more negative emotions during state shutdowns and more positive emotions when schools reopen. Furthermore, it suggests that sexual and gender minority (SGM) children display more pronounced reactions to major COVID-19 events and a greater sensitivity to vaccine-related news than non-SGM children. This study examines children's mental health progression during the pandemic and offers clinicians valuable insights to recognize disparities in children's mental health related to their sexual and gender identities.
Authors: Akihide Hayashi, So Takamoto, Ju Li, Yuta Tsuboi, Daisuke Okanohara
Abstract: Mapping the chemical reaction pathways and their corresponding activation barriers is a significant challenge in molecular simulation. Given the inherent complexities of 3D atomic geometries, even generating an initial guess of these paths can be difficult for humans. This paper presents an innovative approach that utilizes neural networks to generate initial guesses for reaction pathways based on the initial state and learning from a database of low-energy transition paths. The proposed method is initiated by inputting the coordinates of the initial state, followed by progressive alterations to its structure. This iterative process culminates in the generation of the guess reaction path and the coordinates of the final state. The method does not require on-the-fly computation of the actual potential energy surface, and is therefore fast-acting. The application of this geometry-based method extends to complex reaction pathways illustrated by organic reactions. Training was executed on the Transition1x dataset of organic reaction pathways. The results revealed the generation of reactions that bore substantial similarities with the test set of chemical reaction paths. The method's flexibility allows for reactions to be generated either to conform to predetermined conditions or in a randomized manner.
Authors: Hao Wang, Shangwei Guo, Jialing He, Hangcheng Liu, Tianwei Zhang, Tao Xiang
Abstract: Pre-trained models (PTMs) are widely adopted across various downstream tasks in the machine learning supply chain. Adopting untrustworthy PTMs introduces significant security risks, where adversaries can poison the model supply chain by embedding hidden malicious behaviors (backdoors) into PTMs. However, existing backdoor attacks on PTMs are only partially task-agnostic, and the embedded backdoors are easily erased during the fine-tuning process. This makes it challenging for the backdoors to persist and propagate through the supply chain. In this paper, we propose a novel and more severe backdoor attack, TransTroj, which enables the backdoors embedded in PTMs to transfer efficiently through the model supply chain. In particular, we first formalize this attack as an indistinguishability problem between poisoned and clean samples in the embedding space. We decompose embedding indistinguishability into pre- and post-indistinguishability, representing the similarity of the poisoned and reference embeddings before and after the attack. Then, we propose a two-stage optimization that separately optimizes triggers and victim PTMs to achieve embedding indistinguishability. We evaluate TransTroj on four PTMs and six downstream tasks. Experimental results show that our method significantly outperforms SOTA task-agnostic backdoor attacks -- achieving nearly 100\% attack success rate on most downstream tasks -- and demonstrates robustness under various system settings. Our findings underscore the urgent need to secure the model supply chain against such transferable backdoor attacks. The code is available at https://github.com/haowang-cqu/TransTroj .
Authors: A. Emin Orhan, Wentao Wang, Alex N. Wang, Mengye Ren, Brenden M. Lake
Abstract: Children learn powerful internal models of the world around them from a few years of egocentric visual experience. Can such internal models be learned from a child's visual experience with highly generic learning algorithms or do they require strong inductive biases? Recent advances in collecting large-scale, longitudinal, developmentally realistic video datasets and generic self-supervised learning (SSL) algorithms are allowing us to begin to tackle this nature vs. nurture question. However, existing work typically focuses on image-based SSL algorithms and visual capabilities that can be learned from static images (e.g. object recognition), thus ignoring temporal aspects of the world. To close this gap, here we train self-supervised video models on longitudinal, egocentric headcam recordings collected from a child over a two year period in their early development (6-31 months). The resulting models are highly effective at facilitating the learning of action concepts from a small number of labeled examples; they have favorable data size scaling properties; and they display emergent video interpolation capabilities. Video models also learn more accurate and more robust object representations than image-based models trained with the exact same data. These results suggest that important temporal aspects of a child's internal model of the world may be learnable from their visual experience using highly generic learning algorithms and without strong inductive biases.
Authors: Sungee Hong, Zhengling Qi, Raymond K. W. Wong
Abstract: We study distributional off-policy evaluation (OPE), whose goal is to learn the distribution of the return for a target policy using offline data generated by a different policy. The theoretical foundation of much existing work relies on supremum-extended statistical distances such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification of their validity for learning the return distribution. Based on this attractive property, we propose a new method called Energy Bellman Residual Minimizer (EBRM) for distributional OPE. We provide corresponding in-depth theoretical analyses. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension which improves the error bound for non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.
Authors: Ziqiao Shang, Bin Liu, Fengmao Lv, Fei Teng, Tianrui Li
Abstract: Facial action unit (AU) detection has long encountered the challenge of detecting subtle feature differences when AUs activate. Existing methods often rely on encoding pixel-level information of AUs, which not only encodes additional redundant information but also leads to increased model complexity and limited generalizability. Additionally, the accuracy of AU detection is negatively impacted by the class imbalance of each AU type and by the presence of noisy and false AU labels. In this paper, we introduce a novel contrastive learning framework aimed at AU detection that incorporates both self-supervised and supervised signals, thereby enhancing the learning of discriminative features for accurate AU detection. To tackle the class imbalance issue, we employ a negative sample re-weighting strategy that adjusts the step size for updating parameters on minority and majority class samples. Moreover, to address the challenges posed by noisy and false AU labels, we employ a sampling technique that encompasses three distinct types of positive sample pairs. This enables us to inject self-supervised signals into the supervised signal, effectively mitigating the adverse effects of noisy labels. Our experimental assessments, conducted on four widely-utilized benchmark datasets (BP4D, DISFA, GFT and Aff-Wild2), underscore the superior performance of our approach compared to state-of-the-art AU detection methods. Our code is available at \url{https://github.com/Ziqiao-Shang/AUNCE}.
Authors: James Oldfield, Markos Georgopoulos, Grigorios G. Chrysos, Christos Tzelepis, Yannis Panagakis, Mihalis A. Nicolaou, Jiankang Deng, Ioannis Patras
Abstract: The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations often more amenable to human interpretation, debugging, and editability. However, a major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models. $\mu$MoE layers enable scalable expert specialization by performing an implicit computation on prohibitively large weight tensors entirely in factorized form. Consequently, $\mu$MoEs (1) avoid the restrictively high inference-time costs of dense MoEs, yet (2) do not inherit the training issues of the popular sparse MoEs' discrete (non-differentiable) expert routing. We present both qualitative and quantitative evidence that scaling $\mu$MoE layers when fine-tuning foundation models for vision tasks leads to more specialized experts at the class-level, further enabling manual bias correction in CelebA attribute classification. Finally, we show qualitative results demonstrating the expert specialism achieved when pre-training large GPT2 and MLP-Mixer models with parameter-matched $\mu$MoE blocks at every layer, maintaining comparable accuracy. Our code is available at: https://github.com/james-oldfield/muMoE.
Authors: Kangle Deng, Timothy Omernick, Alexander Weiss, Deva Ramanan, Jun-Yan Zhu, Tinghui Zhou, Maneesh Agrawala
Abstract: Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. We introduce LightControlNet, a new text-to-image model based on the ControlNet architecture, which allows the specification of the desired lighting as a conditioning image to the model. Our text-to-texture pipeline then constructs the texture in two stages. The first stage produces a sparse set of visually consistent reference views of the mesh using LightControlNet. The second stage applies a texture optimization based on Score Distillation Sampling (SDS) that works with LightControlNet to increase the texture quality while disentangling surface material from lighting. Our algorithm is significantly faster than previous text-to-texture methods, while producing high-quality and relightable textures.
Authors: Shirin Panahi, Ling-Wei Kong, Mohammadamin Moradi, Zheng-Meng Zhai, Bryan Glaz, Mulugeta Haile, Ying-Cheng Lai
Abstract: Anticipating a tipping point, a transition from one stable steady state to another, is a problem of broad relevance due to the ubiquity of the phenomenon in diverse fields. The steady-state nature of the dynamics about a tipping point makes its prediction significantly more challenging than predicting other types of critical transitions from oscillatory or chaotic dynamics. Exploiting the benefits of noise, we develop a general data-driven and machine-learning approach to predicting potential future tipping in nonautonomous dynamical systems and validate the framework using examples from different fields. As an application, we address the problem of predicting the potential collapse of the Atlantic Meridional Overturning Circulation (AMOC), possibly driven by climate-induced changes in the freshwater input to the North Atlantic. Our predictions based on synthetic and currently available empirical data place a potential collapse window spanning from 2040 to 2065, consistent with results in the current literature.
Authors: Christina Giannoula, Peiming Yang, Ivan Fernandez Vega, Jiacheng Yang, Sankeerth Durvasula, Yu Xin Li, Mohammad Sadrosadati, Juan Gomez Luna, Onur Mutlu, Gennady Pekhimenko
Abstract: Graph Neural Networks (GNNs) are emerging ML models for analyzing graph-structured data. GNN execution involves both compute-intensive and memory-intensive kernels; the latter dominate the total time, as they are significantly bottlenecked by data movement between memory and processors. Processing-In-Memory (PIM) systems can alleviate this data movement bottleneck by placing simple processors near or inside memory arrays. In this work, we introduce PyGim, an efficient ML library that accelerates GNNs on real PIM systems. We propose intelligent parallelization techniques for memory-intensive kernels of GNNs tailored for real PIM systems, and develop a handy Python API for them. We provide hybrid GNN execution, in which the compute-intensive and memory-intensive kernels are executed in processor-centric and memory-centric computing systems, respectively. We extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using emerging GNN models, and demonstrate that it outperforms its state-of-the-art CPU counterpart on Intel Xeon by 3.04x on average, and achieves higher resource utilization than CPU and GPU systems. Our work provides useful recommendations for software, system and hardware designers. PyGim is publicly available at https://github.com/CMU-SAFARI/PyGim.
Authors: Rikab Gambhir, Athis Osathapan, Jesse Thaler
Abstract: Many machine learning applications involve learning a latent representation of data, which is often high-dimensional and difficult to directly interpret. In this work, we propose "Moment Pooling", a natural extension of Deep Sets networks which drastically decreases the latent space dimensionality of these networks while maintaining or even improving performance. Moment Pooling generalizes the summation in Deep Sets to arbitrary multivariate moments, which enables the model to achieve a much higher effective latent dimensionality for a fixed latent dimension. We demonstrate Moment Pooling on the collider physics task of quark/gluon jet classification by extending Energy Flow Networks (EFNs) to Moment EFNs. We find that Moment EFNs with latent dimensions as small as 1 perform similarly to ordinary EFNs with higher latent dimension. This small latent dimension allows for the internal representation to be directly visualized and interpreted, which in turn enables the learned internal jet representation to be extracted in closed form.
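To make the pooling generalization concrete, the following is a minimal sketch contrasting the Deep Sets sum with an order-2 moment pooling over per-element features; the feature array, the chosen moment order, and the function names are illustrative assumptions rather than the Moment EFN implementation.

```python
import numpy as np

def deep_sets_pool(phi_x):
    """Standard Deep Sets pooling: sum the per-element features (first moments only)."""
    return phi_x.sum(axis=0)

def moment_pool(phi_x, order=2):
    """Illustrative Moment Pooling: concatenate multivariate moments up to `order`.

    phi_x: array of shape (n_elements, latent_dim) of per-element features.
    Returns the first moments (sums) plus the unique second-order products
    (upper triangle of the summed outer-product matrix) when order >= 2.
    """
    n, d = phi_x.shape
    pooled = [phi_x.sum(axis=0)]                      # first moments
    if order >= 2:
        second = phi_x.T @ phi_x                      # sum_i phi_i phi_i^T
        iu = np.triu_indices(d)
        pooled.append(second[iu])                     # unique second moments
    return np.concatenate(pooled)

# Example: 5 set elements, latent dimension 2 -> effective pooled dimension 2 + 3 = 5
phi_x = np.random.rand(5, 2)
print(deep_sets_pool(phi_x).shape)   # (2,)
print(moment_pool(phi_x).shape)      # (5,)
```

The higher effective dimensionality comes purely from reusing the same latent features in products, which is why a very small latent dimension can still be expressive.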
Authors: Muhammad Asif Ali, Zhengping Li, Shu Yang, Keyuan Cheng, Yang Cao, Tianhao Huang, Guimin Hu, Weimin Lyu, Lijie Hu, Lu Yu, Di Wang
Abstract: Large Language Models (LLMs) have shown exceptional abilities for multiple different natural language processing tasks. While prompting is a crucial tool for LLM inference, we observe that there is a significant cost associated with exceedingly lengthy prompts. Existing attempts to compress lengthy prompts lead to substandard results in terms of readability/interpretability of the compressed prompt, with a detrimental impact on prompt utility. To address this, we propose Prompt-SAW (Prompt compresSion via Relation AWare graphs), an effective strategy for prompt compression over task-agnostic and task-aware prompts. Prompt-SAW uses the prompt's textual information to build a graph and then extracts key information elements from the graph to form the compressed prompt. We also propose GSM8K-aug, i.e., an extended version of the existing GSM8K benchmark for task-agnostic prompts, in order to provide a comprehensive evaluation platform. Experimental evaluation using benchmark datasets shows that prompts compressed by Prompt-SAW are not only better in terms of readability, but they also outperform the best-performing baseline models by up to 10.1 and 77.1, respectively, for task-agnostic and task-aware settings, while compressing the original prompt text by 34.9 and 56.7.
Authors: Sijun Tan, Xiuyu Li, Shishir Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E. Gonzalez, Raluca Ada Popa
Abstract: Processing long contexts remains a challenge for large language models (LLMs) due to the quadratic computational and memory overhead of the self-attention mechanism and the substantial KV cache sizes during generation. We propose LLoCO, a novel approach to address this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA. Our method enables an LLM to create a concise representation of the original context and efficiently retrieve relevant information to answer questions accurately. Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning while using $30\times$ fewer tokens during inference. LLoCO achieves up to $7.62\times$ speed-up during inference and $11.52\times$ higher throughput during finetuning, substantially reducing the cost of long-document question answering. This makes it a promising solution for efficient long context processing. Our code is publicly available at https://github.com/jeffreysijuntan/lloco.
Authors: Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
Abstract: We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with the remaining layers of the model. Our proposed self-speculative decoding approach has a smaller memory footprint than other speculative decoding approaches and benefits from shared compute and activations of the draft and verification stages. We run experiments on different Llama model sizes and different types of training: pretraining from scratch, continual pretraining, finetuning on a specific data domain, and finetuning on a specific task. We implement our inference solution and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x on coding, and 2.0x on the TOPv2 semantic parsing task. We open source our code and checkpoints at https://github.com/facebookresearch/LayerSkip.
Authors: Trevor Campbell
Abstract: Bayesian coresets speed up posterior inference in the large-scale data regime by approximating the full-data log-likelihood function with a surrogate log-likelihood based on a small, weighted subset of the data. But while Bayesian coresets and methods for their construction are applicable in a wide range of models, existing theoretical analyses of the posterior inferential error incurred by coreset approximations apply only in restrictive settings -- i.e., exponential family models, or models with strong log-concavity and smoothness assumptions. This work presents general upper and lower bounds on the Kullback-Leibler (KL) divergence of coreset approximations that reflect the full range of applicability of Bayesian coresets. The lower bounds require only mild model assumptions typical of Bayesian asymptotic analyses, while the upper bounds require the log-likelihood functions to satisfy a generalized subexponentiality criterion that is weaker than conditions used in earlier work. The lower bounds are applied to obtain fundamental limitations on the quality of coreset approximations, and to provide a theoretical explanation for the previously-observed poor empirical performance of importance sampling-based construction methods. The upper bounds are used to analyze the performance of recent subsample-optimize methods. The flexibility of the theory is demonstrated in validation experiments involving multimodal, unidentifiable, heavy-tailed Bayesian posterior distributions.
Authors: Amit Kadan, Kevin Ryczko, Erika Lloyd, Adrian Roitberg, Takeshi Yamazaki
Abstract: Generative AI has the potential to revolutionize drug discovery. Yet, despite recent advances in deep learning, existing models cannot generate molecules that satisfy all desired physicochemical properties. Herein, we describe IDOLpro, a generative chemistry AI combining diffusion with multi-objective optimization for structure-based drug design. Differentiable scoring functions guide the latent variables of the diffusion model to explore uncharted chemical space and generate novel ligands in silico, optimizing a plurality of target physicochemical properties. We demonstrate our platform's effectiveness by generating ligands with optimized binding affinity and synthetic accessibility on two benchmark sets. IDOLpro produces ligands with binding affinities over 10%-20% better than the next best state-of-the-art method on each test set, producing more drug-like molecules with generally better synthetic accessibility scores than other methods. We do a head-to-head comparison of IDOLpro against a classic virtual screen of a large database of drug-like molecules. We show that IDOLpro can generate molecules for a range of important disease-related targets with better binding affinity and synthetic accessibility than any molecule found in the virtual screen while being over 100x faster and less expensive to run. On a test set of experimental complexes, IDOLpro is the first to produce molecules with better binding affinities than experimentally observed ligands. IDOLpro can accommodate other scoring functions (e.g. ADME-Tox) to accelerate hit-finding, hit-to-lead, and lead optimization for drug discovery.
Authors: James Requeima, John Bronskill, Dami Choi, Richard E. Turner, David Duvenaud
Abstract: Machine learning practitioners often face significant challenges in formally integrating their prior knowledge and beliefs into predictive models, limiting the potential for nuanced and context-aware analyses. Moreover, the expertise needed to integrate this prior knowledge into probabilistic modeling typically limits the application of these models to specialists. Our goal is to build a regression model that can process numerical data and make probabilistic predictions at arbitrary locations, guided by natural language text which describes a user's prior knowledge. Large Language Models (LLMs) provide a useful starting point for designing such a tool since they 1) provide an interface where users can incorporate expert insights in natural language and 2) provide an opportunity for leveraging latent problem-relevant knowledge encoded in LLMs that users may not have themselves. We start by exploring strategies for eliciting explicit, coherent numerical predictive distributions from LLMs. We examine these joint predictive distributions, which we call LLM Processes, over arbitrarily-many quantities in settings such as forecasting, multi-dimensional regression, black-box optimization, and image modeling. We investigate the practical details of prompting to elicit coherent predictive distributions, and demonstrate their effectiveness at regression. Finally, we demonstrate the ability to usefully incorporate text into numerical predictions, improving predictive performance and giving quantitative structure that reflects qualitative descriptions. This lets us begin to explore the rich, grounded hypothesis space that LLMs implicitly encode.
Authors: Meenatchi Sundaram Muthu Selva Annamalai, Emiliano De Cristofaro
Abstract: This paper presents an auditing procedure for the Differentially Private Stochastic Gradient Descent (DP-SGD) algorithm in the black-box threat model that is substantially tighter than prior work. The main intuition is to craft worst-case initial model parameters, as DP-SGD's privacy analysis is agnostic to the choice of the initial model parameters. For models trained on MNIST and CIFAR-10 at theoretical $\varepsilon=10.0$, our auditing procedure yields empirical estimates of $\varepsilon_{emp} = 7.21$ and $6.95$, respectively, on a 1,000-record sample and $\varepsilon_{emp}= 6.48$ and $4.96$ on the full datasets. By contrast, previous audits were only (relatively) tight in stronger white-box models, where the adversary can access the model's inner parameters and insert arbitrary gradients. Overall, our auditing procedure can offer valuable insight into how the privacy analysis of DP-SGD could be improved and detect bugs and DP violations in real-world implementations. The source code needed to reproduce our experiments is available at https://github.com/spalabucr/bb-audit-dpsgd.
Authors: Federico Bassetti, Marco Gherardi, Alessandro Ingrosso, Mauro Pastore, Pietro Rotondo
Abstract: Deep linear networks have been extensively studied, as they provide simplified models of deep learning. However, little is known in the case of finite-width architectures with multiple outputs and convolutional layers. In this manuscript, we provide rigorous results for the statistics of functions implemented by the aforementioned class of networks, thus moving closer to a complete characterization of feature learning in the Bayesian setting. Our results include: (i) an exact and elementary non-asymptotic integral representation for the joint prior distribution over the outputs, given in terms of a mixture of Gaussians; (ii) an analytical formula for the posterior distribution in the case of squared error loss function (Gaussian likelihood); (iii) a quantitative description of the feature learning infinite-width regime, using large deviation theory. From a physical perspective, deep architectures with multiple outputs or convolutional layers represent different manifestations of kernel shape renormalization, and our work provides a dictionary that translates this physics intuition and terminology into rigorous Bayesian statistics.
Authors: Ryan Zhou, Jaume Bacardit, Alexander Brownlee, Stefano Cagnoni, Martin Fyvie, Giovanni Iacca, John McCall, Niki van Stein, David Walker, Ting Hu
Abstract: Artificial intelligence methods are being increasingly applied across various domains, but their often opaque nature has raised concerns about accountability and trust. In response, the field of explainable AI (XAI) has emerged to address the need for human-understandable AI systems. Evolutionary computation (EC), a family of powerful optimization and learning algorithms, offers significant potential to contribute to XAI, and vice versa. This paper provides an introduction to XAI and reviews current techniques for explaining machine learning models. We then explore how EC can be leveraged in XAI and examine existing XAI approaches that incorporate EC techniques. Furthermore, we discuss the application of XAI principles within EC itself, investigating how these principles can illuminate the behavior and outcomes of EC algorithms, their (automatic) configuration, and the underlying problem landscapes they optimize. Finally, we discuss open challenges in XAI and highlight opportunities for future research at the intersection of XAI and EC. Our goal is to demonstrate EC's suitability for addressing current explainability challenges and to encourage further exploration of these methods, ultimately contributing to the development of more understandable and trustworthy ML models and EC algorithms.
Authors: Defu Cao, Angela Zhou
Abstract: Offline reinforcement learning is important in many settings with available observational data but the inability to deploy new policies online due to safety, cost, and other concerns. Many recent advances in causal inference and machine learning target estimation of causal contrast functions such as CATE, which is sufficient for optimizing decisions and can adapt to potentially smoother structure. We develop a dynamic generalization of the R-learner (Nie and Wager 2021, Lewis and Syrgkanis 2021) for estimating and optimizing the difference of $Q^\pi$-functions, $Q^\pi(s,1)-Q^\pi(s,0)$ (which can be used to optimize multiple-valued actions). We leverage orthogonal estimation to improve convergence rates in the presence of slower nuisance estimation rates and prove consistency of policy optimization under a margin condition. The method can leverage black-box nuisance estimators of the $Q$-function and behavior policy to target estimation of a more structured $Q$-function contrast.
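For readers unfamiliar with the R-learner, the following is a hedged restatement of the single-stage objective from Nie and Wager (2021), written in our own notation with $\hat{m}$ the outcome regression and $\hat{e}$ the propensity (behavior-policy) estimate; the paper's dynamic generalization targets the $Q^\pi$-function contrast $Q^\pi(s,1)-Q^\pi(s,0)$ rather than this static CATE, so the display is context only, not the proposed estimator.

```latex
% Static R-learner objective (Nie and Wager, 2021), shown as the single-stage analogue
% of the dynamic Q-contrast estimator described in the abstract.
\hat{\tau} \in \arg\min_{\tau}\;\frac{1}{n}\sum_{i=1}^{n}
  \Big[\big(Y_i - \hat{m}(X_i)\big) - \big(A_i - \hat{e}(X_i)\big)\,\tau(X_i)\Big]^2
```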
Authors: Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Tak\'a\v{c}, Pascal Fua, Karthik Nandakumar, Ivan Laptev
Abstract: Vision-Language Models (VLMs) are becoming increasingly vulnerable to adversarial attacks as various novel attack strategies are being proposed against these models. While existing defenses excel in unimodal contexts, they currently fall short in safeguarding VLMs against adversarial threats. To mitigate this vulnerability, we propose a novel, yet elegantly simple approach for detecting adversarial samples in VLMs. Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs. Subsequently, we calculate the similarities of the embeddings of both input and generated images in the feature space to identify adversarial samples. Empirical evaluations conducted on different datasets validate the efficacy of our approach, outperforming baseline methods adapted from image classification domains. Furthermore, we extend our methodology to classification tasks, showcasing its adaptability and model-agnostic nature. Theoretical analyses and empirical findings also show the resilience of our approach against adaptive attacks, positioning it as an excellent defense mechanism for real-world deployment against adversarial threats.
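A minimal sketch of the detection recipe described above, assuming hypothetical stand-ins `caption_with_vlm`, `generate_with_t2i`, and `embed_image` for the target VLM, the Text-to-Image model, and an image encoder; the similarity threshold is illustrative and not taken from the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def is_adversarial(image, caption_with_vlm, generate_with_t2i, embed_image, threshold=0.7):
    """Flag `image` as adversarial if it disagrees with an image regenerated from its own caption."""
    caption = caption_with_vlm(image)          # caption produced by the (possibly attacked) VLM
    regenerated = generate_with_t2i(caption)   # image synthesized from that caption
    sim = cosine(embed_image(image), embed_image(regenerated))
    return sim < threshold                     # low similarity suggests the caption was manipulated
```

The appeal of this design is that it needs no access to the attack itself: an adversarial input tends to produce a caption whose regenerated image no longer resembles the input.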
Authors: Sophie Ostmeier, Brian Axelrod, Michael E. Moseley, Akshay Chaudhari, Curtis Langlotz
Abstract: While Rotary Position Embeddings (RoPE) for large language models have become widely adopted, their application to other modalities has been slower. Here, we introduce Lie group Relative position Encodings (LieRE), which go beyond RoPE by supporting n-dimensional inputs. We evaluate the performance of LieRE on 2D and 3D image classification tasks and observe that LieRE leads to marked relative improvements in performance (up to 9.7% for 2D and up to 25.5% for 3D), training efficiency (3.5x reduction), and data efficiency (30%) compared to the baselines of DeiT III, RoPE-Mixed and Vision-Llama. Code is available at https://github.com/Stanford-AIMI/LieRE.
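As a rough illustration of the Lie-group idea suggested by the name, the sketch below builds a position-dependent rotation as the matrix exponential of a position-weighted, learned skew-symmetric generator and applies it to query and key vectors; this is an assumption about the general mechanism, not the released LieRE code.

```python
import torch

def lie_rotation(pos, generators):
    """Rotation matrix exp(sum_i pos_i * skew(G_i)) for an n-dimensional position.

    pos: tensor of shape [n_dims]; generators: learnable tensor of shape [n_dims, d, d].
    Skew-symmetrizing the generators makes the matrix exponential orthogonal.
    """
    skew = generators - generators.transpose(-1, -2)
    return torch.matrix_exp(torch.einsum("i,ijk->jk", pos, skew))

d, n_dims = 8, 2
G = 0.1 * torch.randn(n_dims, d, d)            # shared learnable generators (illustrative scale)
q, k = torch.randn(d), torch.randn(d)
pos_q, pos_k = torch.tensor([3.0, 5.0]), torch.tensor([1.0, 2.0])
# Positions modulate the attention logit through the rotations applied to q and k.
logit = torch.dot(lie_rotation(pos_q, G) @ q, lie_rotation(pos_k, G) @ k)
print(logit)
```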
Authors: Ling Li, Yu Ye, Bingchuan Jiang, Wei Zeng
Abstract: This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM - existing street-view datasets often contain numerous low-quality images lacking visual clues and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify how locatable street-view images are, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% on country-level and 38% on city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
Authors: Ismail Alkhouri, Cedric Le Denmat, Yingjie Li, Cunxi Yu, Jia Liu, Rongrong Wang, Alvaro Velasquez
Abstract: Combinatorial Optimization (CO) addresses many important problems, including the challenging Maximum Independent Set (MIS) problem. Alongside exact and heuristic solvers, differentiable approaches have emerged, often using continuous relaxations of ReLU-based or quadratic objectives. Noting that an MIS in a graph is a Maximum Clique (MC) in its complement, we propose a new quadratic formulation for MIS that incorporates an MC term, improving convergence and exploration. We show that every maximal independent set corresponds to a local minimizer, derive conditions for the MIS size, and characterize stationary points. To solve our non-convex objective, we propose solving multiple initializations in parallel using momentum-based gradient descent, complemented by an efficient MIS checking criterion derived from our theory. We therefore dub our method parallelized Clique-Informed Quadratic Optimization for MIS (pCQO-MIS). Our experimental results demonstrate the effectiveness of the proposed method compared to exact, heuristic, sampling, and data-centric approaches. Notably, our method avoids the out-of-distribution tuning and reliance on (un)labeled data required by data-centric methods, while achieving superior MIS sizes and competitive runtime relative to their inference time. Additionally, a key advantage of pCQO-MIS is that, unlike exact and heuristic solvers, the runtime scales only with the number of nodes in the graph, not the number of edges.
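The following is a heavily hedged sketch of the clique-informed quadratic relaxation idea: momentum gradient descent from several random starts over a box-constrained quadratic whose adjacency penalty encourages independence and whose complement-graph (clique) term encourages exploration. The exact objective, the coefficients `gamma`/`gamma_c`, and all hyperparameters are illustrative assumptions, not the paper's tuned pCQO-MIS formulation.

```python
import numpy as np

def clique_informed_mis(adj, n_starts=32, steps=500, lr=0.05, momentum=0.9,
                        gamma=2.0, gamma_c=0.5, seed=0):
    """Sketch: minimize f(x) = -sum(x) + gamma/2 * x^T A x - gamma_c/2 * x^T A_bar x over [0,1]^n."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    adj_c = 1 - adj - np.eye(n)          # adjacency of the complement graph (clique term)
    best = set()
    for _ in range(n_starts):            # the paper runs initializations in parallel; a loop suffices here
        x = rng.uniform(0, 1, n)
        v = np.zeros(n)
        for _ in range(steps):
            grad = -1.0 + gamma * adj @ x - gamma_c * adj_c @ x
            v = momentum * v - lr * grad
            x = np.clip(x + v, 0.0, 1.0)
        cand = {i for i in range(n) if x[i] > 0.5}
        # accept the thresholded set only if it is independent; maximality post-processing is omitted
        if all(adj[i, j] == 0 for i in cand for j in cand if i < j) and len(cand) > len(best):
            best = cand
    return best

# Tiny example: a 4-cycle has a maximum independent set of size 2.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
print(clique_informed_mis(A))
```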
Authors: Lijun Bo, Yijie Huang, Xiang Yu, Tingting Zhang
Abstract: This paper studies the continuous-time reinforcement learning in jump-diffusion models by featuring the q-learning (the continuous-time counterpart of Q-learning) under Tsallis entropy regularization. Contrary to the Shannon entropy, the general form of Tsallis entropy renders the optimal policy not necessary a Gibbs measure, where the Lagrange and KKT multipliers naturally arise from some constraints to ensure the learnt policy to be a probability density function. As a consequence, the characterization of the optimal policy using the q-function also involves a Lagrange multiplier. In response, we establish the martingale characterization of the q-function under Tsallis entropy and devise two q-learning algorithms depending on whether the Lagrange multiplier can be derived explicitly or not. In the latter case, we need to consider different parameterizations of the optimal q-function and the optimal policy and update them alternatively in an Actor-Critic manner. We also study two financial applications, namely, an optimal portfolio liquidation problem and a non-LQ control problem. It is interesting to see therein that the optimal policies under the Tsallis entropy regularization can be characterized explicitly, which are distributions concentrated on some compact support. The satisfactory performance of our q-learning algorithms is illustrated in each example.
Authors: Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky
Abstract: Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts. However, recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities to understand long contexts directly. We conduct a comprehensive comparison between RAG and long-context (LC) LLMs, aiming to leverage the strengths of both. We benchmark RAG and LC across various public datasets using three latest LLMs. Results reveal that when resourced sufficiently, LC consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage. Based on this observation, we propose Self-Route, a simple yet effective method that routes queries to RAG or LC based on model self-reflection. Self-Route significantly reduces the computation cost while maintaining a comparable performance to LC. Our findings provide a guideline for long-context applications of LLMs using RAG and LC.
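A minimal sketch of the routing idea described above, where `llm` is a hypothetical chat-completion callable and the prompts are illustrative rather than the paper's: the model first attempts the cheap RAG path and declares the query unanswerable when the retrieved passages do not suffice, in which case the query falls back to the expensive long-context path.

```python
def self_route_answer(question, retrieved_chunks, full_document, llm):
    """Route a query to RAG first, and to long-context answering only when RAG declines."""
    rag_prompt = (
        "Answer the question using only the provided passages. "
        "If they are insufficient, reply exactly 'UNANSWERABLE'.\n\n"
        f"Passages:\n{chr(10).join(retrieved_chunks)}\n\nQuestion: {question}"
    )
    rag_answer = llm(rag_prompt)
    if "UNANSWERABLE" not in rag_answer:
        return rag_answer                       # cheap RAG path handled the query
    lc_prompt = f"Document:\n{full_document}\n\nQuestion: {question}"
    return llm(lc_prompt)                       # long-context fallback
```

Because most queries are resolved on the first pass, the average cost stays close to RAG while difficult queries still receive the full context.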
Authors: Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, Tianyu Gao
Abstract: Literature search questions, such as "Where can I find research on the evaluation of consistency in generated summaries?" pose significant challenges for modern search engines and retrieval systems. These questions often require a deep understanding of research concepts and the ability to reason across entire articles. In this work, we introduce LitSearch, a retrieval benchmark comprising 597 realistic literature search queries about recent ML and NLP papers. LitSearch is constructed using a combination of (1) questions generated by GPT-4 based on paragraphs containing inline citations from research papers and (2) questions manually written by authors about their recently published papers. All LitSearch questions were manually examined or edited by experts to ensure high quality. We extensively benchmark state-of-the-art retrieval models and also evaluate two LLM-based reranking pipelines. We find a significant performance gap between BM25 and state-of-the-art dense retrievers, with a 24.8% absolute difference in recall@5. The LLM-based reranking strategies further improve the best-performing dense retriever by 4.4%. Additionally, commercial search engines and research tools like Google Search perform poorly on LitSearch, lagging behind the best dense retriever by up to 32 recall points. Taken together, these results show that LitSearch is an informative new testbed for retrieval systems while catering to a real-world use case.
Authors: Praveen Ravirathinam, Ankush Khandelwal, Rahul Ghosh, Vipin Kumar
Abstract: In recent years, there has been an increased interest in foundation models for geoscience due to the vast amount of Earth observing satellite imagery. Existing remote sensing foundation models make use of the various sources of spectral imagery to create large models pretrained on the task of masked reconstruction. In this paper, we present a foundation model framework, where the pretraining task captures the causal relationship between multiple modalities. Our framework leverages the knowledge guided principles that the spectral imagery captures the impact of the physical drivers on the environmental system, and that the relationship between them is governed by the characteristics of the system. Specifically, our method, called MultiModal Variable Step Forecasting (MM-VSF), uses forecasting of satellite imagery as a pretraining task and is able to capture the causal relationship between spectral imagery and weather. In our evaluation we show that the forecasting of satellite imagery using weather can be used as an effective pretraining task for foundation models. We further show the effectiveness of the embeddings produced by MM-VSF on the downstream tasks of pixel wise crop mapping and missing image prediction of spectral imagery, when compared with embeddings created by models trained in alternative pretraining settings including the traditional single modality input masked reconstruction.
Authors: Dillon Bowen, Brendan Murphy, Will Cai, David Khachaturov, Adam Gleave, Kellin Pelrine
Abstract: Recent work shows that LLMs are vulnerable to data poisoning, in which they are trained on partially corrupted or harmful data. Poisoned data is hard to detect, breaks guardrails, and leads to undesirable and harmful behavior. Given the intense efforts by leading labs to train and deploy increasingly larger and more capable LLMs, it is critical to ask if the risk of data poisoning will be naturally mitigated by scale, or if it is an increasing threat. We consider three threat models by which data poisoning can occur: malicious fine-tuning, imperfect data curation, and intentional data contamination. Our experiments evaluate the effects of data poisoning on 23 frontier LLMs ranging from 1.5-72 billion parameters, on three datasets which speak to each of our threat models. We find that larger LLMs are increasingly vulnerable, learning harmful behavior significantly quicker than smaller LLMs with even minimal data poisoning. Additionally, we demonstrate that even frontier GPT models, despite additional moderation systems, remain susceptible to data poisoning. These results underscore the need for robust safeguards against data poisoning in larger LLMs.
Authors: Eli Sennesh, Hao Wu, Tommaso Salvatori
Abstract: Unexpected stimuli induce "error" or "surprise" signals in the brain. The theory of predictive coding promises to explain these observations in terms of Bayesian inference by suggesting that the cortex implements variational inference in a probabilistic graphical model. However, when applied to machine learning tasks, this family of algorithms has yet to perform on par with other variational approaches in high-dimensional, structured inference problems. To address this, we introduce a novel predictive coding algorithm for structured generative models, that we call divide-and-conquer predictive coding (DCPC). DCPC differs from other formulations of predictive coding, as it respects the correlation structure of the generative model and provably performs maximum-likelihood updates of model parameters, all without sacrificing biological plausibility. Empirically, DCPC achieves better numerical performance than competing algorithms and provides accurate inference in a number of problems not previously addressed with predictive coding. We provide an open implementation of DCPC in Pyro on Github.
Authors: Songhua Liu, Weihao Yu, Zhenxiong Tan, Xinchao Wang
Abstract: Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we aim at a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, Gated Linear Attention, etc, and identify two key features--attention normalization and non-causal inference--that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation, accommodating ultra-resolution images like 16K on a single GPU. Moreover, it is highly compatible with pre-trained SD components and pipelines, such as ControlNet, IP-Adapter, DemoFusion, DistriFusion, etc, requiring no adaptation efforts. Codes are available at https://github.com/Huage001/LinFusion.
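To illustrate the two features highlighted above, the following is a minimal non-causal linear attention with per-query normalization written in PyTorch; it is a generic sketch of this family of mechanisms, not the LinFusion module or its distillation setup.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention with normalization; inputs have shape [batch, tokens, dim].

    Cost is linear in the number of tokens because keys and values are aggregated
    once into a [dim, dim] summary instead of forming a [tokens, tokens] attention map.
    """
    q = torch.nn.functional.elu(q) + 1                       # positive feature map phi(.)
    k = torch.nn.functional.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)                   # sum_n phi(k_n) v_n^T
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps     # per-query normalizer
    return torch.einsum("bnd,bde->bne", q, kv) / z.unsqueeze(-1)

q = k = v = torch.randn(2, 1024, 64)
print(linear_attention(q, k, v).shape)   # torch.Size([2, 1024, 64])
```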
Authors: Ashutosh Bajpai, Aaryan Goyal, Atif Anwer, Tanmoy Chakraborty
Abstract: The prolific use of Large Language Models (LLMs) as an alternate knowledge base requires them to be factually consistent, necessitating both correctness and consistency traits for paraphrased queries. Recently, significant attempts have been made to benchmark datasets and metrics to evaluate LLMs for these traits. However, structural simplicity (subject-relation-object) and contemporary association in their query formulation limit the broader definition of factuality and consistency. In this study, we introduce TeCFaP, a novel Temporally Consistent Factuality Probe task to extend consistent factuality probing to the temporal dimension. To this end, we propose TEMP-COFAC, a high-quality dataset of prefix-style English query paraphrases. Subsequently, we extend the definitions of existing metrics to represent consistent factuality across the temporal dimension. We experiment with a diverse set of LLMs and find that most of them perform poorly on TeCFaP. Next, we propose a novel solution CoTSeLF (Consistent-Time-Sensitive Learning Framework) combining multi-task instruction tuning (MT-IT) with consistent-time-sensitive reinforcement learning (CTSRL) to improve temporally consistent factuality in LLMs. Our experiments demonstrate the efficacy of CoTSeLF over several baselines.
Authors: Gholamali Aminian, Amir R. Asadi, Tian Li, Ahmad Beirami, Gesine Reinert, Samuel N. Cohen
Abstract: The generalization error (risk) of a supervised statistical learning algorithm quantifies its prediction ability on previously unseen data. Inspired by exponential tilting, Li et al. (2021) proposed the tilted empirical risk as a non-linear risk metric for machine learning applications such as classification and regression problems. In this work, we examine the generalization error of the tilted empirical risk. In particular, we provide uniform and information-theoretic bounds on the tilted generalization error, defined as the difference between the population risk and the tilted empirical risk, with a convergence rate of $O(1/\sqrt{n})$ where $n$ is the number of training samples. Furthermore, we study the solution to the KL-regularized expected tilted empirical risk minimization problem and derive an upper bound on the expected tilted generalization error with a convergence rate of $O(1/n)$.
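For reference, the tilted empirical risk of Li et al. (2021) and the tilted generalization error as defined in the abstract can be written as follows (our notation; $t \neq 0$ is the tilt parameter and $\ell(\theta; z_i)$ the per-sample loss):

```latex
\hat{R}_t(\theta) \;=\; \frac{1}{t}\,\log\!\Big(\frac{1}{n}\sum_{i=1}^{n} e^{\,t\,\ell(\theta; z_i)}\Big),
\qquad
\mathrm{gen}_t(\theta) \;=\; R(\theta) - \hat{R}_t(\theta),
\quad R(\theta) \;=\; \mathbb{E}\big[\ell(\theta; Z)\big].
```

As $t \to 0$ the tilted risk recovers the ordinary empirical mean of the losses, while larger $|t|$ up- or down-weights the hardest samples.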
Authors: Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, Kannan Ramchandran
Abstract: With hundreds of thousands of language models available on Huggingface today, efficiently evaluating and utilizing these models across various downstream tasks has become increasingly critical. Many existing methods repeatedly learn task-specific representations of Large Language Models (LLMs), which leads to inefficiencies in both time and computational resources. To address this, we propose EmbedLLM, a framework designed to learn compact vector representations of LLMs that facilitate downstream applications involving many models, such as model routing. We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness. Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency. Additionally, we demonstrate that our method can forecast a model's performance on multiple benchmarks, without incurring additional inference cost. Extensive probing experiments validate that the learned embeddings capture key model characteristics, e.g., whether the model is specialized for coding tasks, even without being explicitly trained on them. We open source our dataset, code and embedder to facilitate further research and application.
Authors: Hanzaleh Akbari Nodehi, Viveck R. Cadambe, Mohammad Ali Maddah-Ali
Abstract: Coding theory plays a crucial role in ensuring data integrity and reliability across various domains, from communication to computation and storage systems. However, its reliance on trust assumptions for data recovery poses significant challenges, particularly in emerging decentralized systems where trust is scarce. To address this, the game of coding framework was introduced, offering insights into strategies for data recovery within incentive-oriented environments. The focus of the earliest version of the game of coding was limited to scenarios involving only two nodes. This paper investigates the implications of increasing the number of nodes in the game of coding framework, particularly focusing on scenarios with one honest node and multiple adversarial nodes. We demonstrate that despite the increased flexibility for the adversary with an increasing number of adversarial nodes, having more power is not beneficial for the adversary and is not detrimental to the data collector, making this scheme sybil-resistant. Furthermore, we outline optimal strategies for the data collector in terms of accepting or rejecting the inputs, and characterize the optimal noise distribution for the adversary.
Authors: Yulong Feng, Jing Xu, Liujun Hu, Guanghui Yu, Xiangyang Duan
Abstract: Semantic communication technology emerges as a pivotal bridge connecting AI with classical communication. Current semantic communication systems are generally modeled as an Auto-Encoder (AE). The AE lacks a deep integration of AI principles with communication strategies due to its inability to effectively capture channel dynamics. This gap makes it difficult to justify the need for joint source-channel coding (JSCC) and to explain why performance improves. This paper begins by exploring lossless and lossy communication, highlighting that the inclusion of data distortion distinguishes semantic communication from classical communication. It breaks the conditions for the separation theorem to hold and explains why the amount of data transferred by semantic communication is smaller. Therefore, employing JSCC becomes imperative for achieving optimal semantic communication. Moreover, a Variational Source-Channel Coding (VSCC) method is proposed for constructing semantic communication systems based on data distortion theory, integrating variational inference and channel characteristics. Using a deep learning network, we develop a semantic communication system employing the VSCC method and demonstrate its capability for semantic transmission. We also establish semantic communication systems of equivalent complexity employing the AE method and the VAE method. Experimental results reveal that the VSCC model offers superior interpretability compared to the AE model, as it clearly captures the semantic features of the transmitted data, represented as the variance of latent variables in our experiments. In addition, the VSCC model exhibits superior semantic transmission capabilities compared to the VAE model. At the same level of data distortion evaluated by PSNR, the VSCC model exhibits stronger human interpretability, which can be partially assessed by SSIM.
Authors: Akihiro Fujii, Yoshitaka Ushiku, Koji Shimizu, Anh Khoa Augustin Lu, Satoshi Watanabe
Abstract: In materials science, finding crystal structures that have targeted properties is crucial. While recent methodologies such as Bayesian optimization and deep generative models have made some advances on this issue, these methods often face difficulties in adaptively incorporating various constraints, such as electrical neutrality and targeted properties optimization, while keeping the desired specific crystal structure. To address these challenges, we have developed the Simultaneous Multi-property Optimization using Adaptive Crystal Synthesizer (SMOACS), which utilizes state-of-the-art property prediction models and their gradients to directly optimize input crystal structures for targeted properties simultaneously. SMOACS enables the integration of adaptive constraints into the optimization process without necessitating model retraining. Thanks to this feature, SMOACS has succeeded in simultaneously optimizing targeted properties while maintaining perovskite structures, even with models trained on diverse crystal types. We have demonstrated the band gap optimization while meeting a challenging constraint, that is, maintaining electrical neutrality in large atomic configurations up to 135 atom sites, where the verification of the electrical neutrality is challenging. The properties of the most promising materials have been confirmed by density functional theory calculations.
Authors: Klaudia Thellmann, Bernhard Stadler, Michael Fromm, Jasper Schulze Buschhoff, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim K\"ohler, Ren\'e J\"akel, Mehdi Ali
Abstract: The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of language-parallel multilingual benchmarks. We introduce a multilingual evaluation approach tailored for European languages. We employ translated versions of five widely-used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.
Authors: Pengfei Hu, Yuhang Qian, Tianyue Zheng, Ang Li, Zhe Chen, Yue Gao, Xiuzhen Cheng, Jun Luo
Abstract: Given the wide adoption of multimodal sensors (e.g., camera, lidar, radar) by autonomous vehicles (AVs), deep analytics to fuse their outputs for a robust perception become imperative. However, existing fusion methods often make two assumptions rarely holding in practice: i) similar data distributions for all inputs and ii) constant availability for all sensors. Because, for example, lidars have varying resolutions and radars may fail, such variability often results in significant performance degradation in fusion. To this end, we present t-READi, an adaptive inference system that accommodates the variability of multimodal sensory data and thus enables robust and efficient perception. t-READi identifies variation-sensitive yet structure-specific model parameters; it then adapts only these parameters while keeping the rest intact. t-READi also leverages a cross-modality contrastive learning method to compensate for the loss from missing modalities. Both functions are implemented to maintain compatibility with existing multimodal deep fusion methods. The extensive experiments evidently demonstrate that compared with the status quo approaches, t-READi not only improves the average inference accuracy by more than 6% but also reduces the inference latency by almost 15x, at the cost of only 5% extra memory overhead in the worst case under realistic data and modal variations.
Authors: Th\'eodore Fougereux, C\'edric Josz, Xiaopeng Li
Abstract: We propose a tensor-based criterion for benign landscape in phase retrieval and establish boundedness of gradient trajectories. This implies that gradient descent will converge to a global minimum for almost every initial point.
Authors: Vivin Vinod, Peter Zaspel
Abstract: The development of machine learning (ML) methods has made quantum chemistry (QC) calculations more accessible by reducing the compute cost incurred in conventional QC methods. This has since shifted the overhead to the cost of generating training data. Increased work on reducing the cost of generating training data has resulted in the development of $\Delta$-ML and multifidelity machine learning methods, which use data at more than one QC level of accuracy, or fidelity. This work compares the data costs associated with $\Delta$-ML, multifidelity machine learning (MFML), and optimized MFML (o-MFML) in contrast with a newly introduced Multifidelity $\Delta$-Machine Learning (MF$\Delta$ML) method for the prediction of ground state energies over the multifidelity benchmark dataset QeMFi. This assessment is made on the basis of the training data generation cost associated with each model and is compared with the single-fidelity kernel ridge regression (KRR) case. The results indicate that the use of multifidelity methods surpasses standard $\Delta$-ML approaches when a large number of predictions is required. For cases where the $\Delta$-ML method might be favored, such as small test set regimes, the MF$\Delta$ML method is shown to be more efficient than conventional $\Delta$-ML.
Authors: Renhang Liu, Abhinaba Roy, Dorien Herremans
Abstract: In this work, we present a novel method for music emotion recognition that leverages Large Language Model (LLM) embeddings for label alignment across multiple datasets and zero-shot prediction on novel categories. First, we compute LLM embeddings for emotion labels and apply non-parametric clustering to group similar labels across multiple datasets containing disjoint labels. We use these cluster centers to map music features (MERT) to the LLM embedding space. To further enhance the model, we introduce an alignment regularization that enables dissociation of MERT embeddings from different clusters. This further enhances the model's ability to adapt to unseen datasets. We demonstrate the effectiveness of our approach by performing zero-shot inference on a new dataset, showcasing its ability to generalize to unseen labels without additional training.
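A minimal sketch of the zero-shot step described above, assuming hypothetical stand-ins `embed_label` (an LLM text-embedding call) and `project_audio` (the trained mapping from MERT features into the LLM embedding space); the nearest-label rule and cosine scoring are illustrative rather than the paper's exact inference procedure.

```python
import numpy as np

def zero_shot_emotion(audio_features, candidate_labels, embed_label, project_audio):
    """Assign the unseen emotion label whose LLM embedding is closest to the projected audio."""
    label_vecs = np.stack([embed_label(lbl) for lbl in candidate_labels])
    label_vecs = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    a = project_audio(audio_features)
    a = a / np.linalg.norm(a)
    scores = label_vecs @ a                       # cosine similarity to each candidate label
    return candidate_labels[int(np.argmax(scores))]
```

Because the label space lives in the LLM embedding space, new emotion categories can be scored simply by embedding their names, with no retraining.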
Authors: Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoshchuk, Artur Janicki
Abstract: With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements. ClickAgent outperforms other prompt-based autonomous agents (CogAgent, AppAgent) on the AITW benchmark. Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.
Authors: Shashank Sonkar, Xinghe Chen, Naiming Liu, Richard G. Baraniuk, Mrinmaya Sachan
Abstract: Accurately modeling student cognition is crucial for developing effective AI-driven educational technologies. A key challenge is creating realistic student models that satisfy two essential properties: (1) accurately replicating specific misconceptions, and (2) correctly solving problems where these misconceptions are not applicable. This dual requirement reflects the complex nature of student understanding, where misconceptions coexist with correct knowledge. This paper investigates whether Large Language Models (LLMs) can be instruction-tuned to meet this dual requirement and effectively simulate student thinking in algebra. We introduce MalAlgoPy, a novel Python library that generates datasets reflecting authentic student solution patterns through a graph-based representation of algebraic problem-solving. Utilizing MalAlgoPy, we define and examine Cognitive Student Models (CSMs) - LLMs instruction-tuned to faithfully emulate realistic student behavior. Our findings reveal that LLMs trained on misconception examples can efficiently learn to replicate errors. However, this training diminishes the model's ability to solve problems correctly, particularly for problem types where the misconceptions are not applicable, thus failing to satisfy the second property of CSMs. We demonstrate that by carefully calibrating the ratio of correct to misconception examples in the training data - sometimes as low as 0.25 - it is possible to develop CSMs that satisfy both properties. Our insights enhance our understanding of AI-based student models and pave the way for effective adaptive learning systems.
Authors: Mohamad Abdi, Gerardo Hermosillo Valadez, Halid Ziya Yerebakan
Abstract: Anatomical landmarks are vital in medical imaging for navigation and anomaly detection. Modern large language models (LLMs), like Llama-2, offer promise for automating the mapping of these landmarks in free-text radiology reports to corresponding positions in image data. Recent studies propose LLMs may develop coherent representations of generative processes. Motivated by these insights, we investigated whether LLMs accurately represent the spatial positions of anatomical landmarks. Through experiments with Llama-2 models, we found that they can linearly represent anatomical landmarks in space with considerable robustness to different prompts. These results underscore the potential of LLMs to enhance the efficiency and accuracy of medical imaging workflows.
Authors: Xinming Wang, Simon Mak, John Miller, Jianguo Wu
Abstract: A critical bottleneck for scientific progress is the costly nature of computer simulations for complex systems. Surrogate models provide an appealing solution: such models are trained on simulator evaluations, then used to emulate and quantify uncertainty on the expensive simulator at unexplored inputs. In many applications, one often has available data on related systems. For example, in designing a new jet turbine, there may be existing studies on turbines with similar configurations. A key question is how information from such "source" systems can be transferred for effective surrogate training on the "target" system of interest. We thus propose a new LOcal transfer Learning Gaussian Process (LOL-GP) model, which leverages a carefully-designed Gaussian process to transfer such information for surrogate modeling. The key novelty of the LOL-GP is a latent regularization model, which identifies regions where transfer should be performed and regions where it should be avoided. This "local transfer" property is desirable in scientific systems: at certain parameters, such systems may behave similarly and thus transfer is beneficial; at other parameters, they may behave differently and thus transfer is detrimental. By accounting for local transfer, the LOL-GP can rectify a critical limitation of "negative transfer" in existing transfer learning models, where the transfer of information worsens predictive performance. We derive a Gibbs sampling algorithm for efficient posterior predictive sampling on the LOL-GP, for both the multi-source and multi-fidelity transfer settings. We then show, via a suite of numerical experiments and an application for jet turbine design, the improved surrogate performance of the LOL-GP over existing methods.