Authors: Alfred K. Adzika, Prudence Djagba
Abstract: This thesis aims to develop new approaches for making inferences with the k-means algorithm. k-means is an iterative clustering algorithm that randomly initializes k centroids, assigns each data point to its nearest centroid, and updates each centroid as the mean of its assigned points. This process continues until convergence, forming k clusters in which each point belongs to its closest centroid. This research investigates the prediction of the last component of data points drawn from a distribution of clustered data using the online balanced k-means approach. Through extensive experimentation and analysis, key findings have emerged. A larger number of clusters or partitions tends to yield lower errors, while increasing the number of assigned data points does not significantly improve inference errors. Reducing losses in the learning process also does not significantly impact overall inference errors, indicating that inference errors remain largely unchanged as learning progresses. Recommendations include the need for specialized inference techniques to better estimate data points derived from multi-clustered data and the exploration of methods that yield improved results with larger assigned datasets. By addressing these recommendations, this research advances the accuracy and reliability of inferences made with the k-means algorithm, bridging the gap between clustering and non-parametric density estimation and inference.
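As a point of reference for the procedure summarized above, the following is a minimal sketch of standard batch k-means in Python (not the thesis's online balanced variant). The closing lines illustrate, under a toy assumption of our own, how the nearest centroid could be used to predict the last component of a partially observed point.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal batch k-means: random init, assign to nearest centroid, update means."""
    rng = np.random.default_rng(seed)
    # Randomly pick k data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Toy usage: cluster 2-D points, then "predict" the last coordinate of a new point
# from the centroid nearest in the first coordinate (an illustrative stand-in for
# the last-component inference studied in the thesis, not its actual method).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5.0])
centroids, labels = kmeans(X, k=2)
x_new_first = 4.8
nearest = np.abs(centroids[:, 0] - x_new_first).argmin()
predicted_last = centroids[nearest, 1]
print(centroids, predicted_last)
```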
Authors: Arjun Subramonian, Sam Bell, Levent Sagun, Elvis Dohmatob
Abstract: Machine learning models may capture and amplify biases present in data, leading to disparate test performance across social groups. To better understand, evaluate, and mitigate these possible biases, a deeper theoretical understanding of how model design choices and data distribution properties could contribute to bias is needed. In this work, we contribute a precise analytical theory in the context of ridge regression, both with and without random projections, where the former models neural networks in a simplified regime. Our theory offers a unified and rigorous explanation of machine learning bias, providing insights into phenomena such as bias amplification and minority-group bias in various feature and parameter regimes. For example, we demonstrate that there may be an optimal regularization penalty or training time to avoid bias amplification, and there can be fundamental differences in test error between groups that do not vanish with increased parameterization. Importantly, our theoretical predictions align with several empirical observations reported in the literature. We extensively empirically validate our theory on diverse synthetic and semi-synthetic datasets.
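As a hedged illustration of the setting the theory analyzes, the sketch below fits closed-form ridge regression on pooled data from a majority and a minority group and reports group-wise test error as the regularization penalty varies; the group sizes, covariance scales, and noise level are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def group_test_errors(lam, n_major=900, n_minor=100, d=50, noise=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d) / np.sqrt(d)
    # Majority and minority groups drawn with different feature scales
    # (an illustrative assumption, not the paper's precise data model).
    def sample(n, scale):
        X = rng.normal(scale=scale, size=(n, d))
        y = X @ w_star + noise * rng.normal(size=n)
        return X, y
    Xa, ya = sample(n_major, 1.0)
    Xb, yb = sample(n_minor, 0.5)
    w = ridge_fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]), lam)
    def err(X, y):
        return np.mean((X @ w - y) ** 2)
    Xa_t, ya_t = sample(500, 1.0)
    Xb_t, yb_t = sample(500, 0.5)
    return err(Xa_t, ya_t), err(Xb_t, yb_t)

# Sweep the regularization penalty to see how the two groups' errors move.
for lam in [0.01, 0.1, 1.0, 10.0]:
    print(lam, group_test_errors(lam))
```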
Authors: Norbert Dzadz, Maciej Romaniuk
Abstract: Precise and large datasets concerning catastrophic events are very important for insurers. To improve the quality of such data, three methods based on the bootstrap, bootknife, and GAN algorithms are proposed. Using numerical experiments and real-life data, simulated outputs for these approaches are compared based on the mean squared error (MSE) and mean absolute error (MAE). Then, a direct algorithm to construct a fuzzy expert's opinion concerning such outputs is also considered.
Authors: Khashayar Gatmiry, Jon Schneider, Stefanie Jegelka
Abstract: Follow-the-Regularized-Leader (FTRL) algorithms are a popular class of learning algorithms for online linear optimization (OLO) that guarantee sub-linear regret, but the choice of regularizer can significantly impact dimension-dependent factors in the regret bound. We present an algorithm that takes as input convex and symmetric action sets and loss sets for a specific OLO instance, and outputs a regularizer such that running FTRL with this regularizer guarantees regret within a universal constant factor of the best possible regret bound. In particular, for any choice of (convex, symmetric) action set and loss set we prove that there exists an instantiation of FTRL which achieves regret within a constant factor of the best possible learning algorithm, strengthening the universality result of Srebro et al., 2011. Our algorithm requires preprocessing time and space exponential in the dimension $d$ of the OLO instance, but can be run efficiently online assuming a membership and linear optimization oracle for the action and loss sets, respectively (and is fully polynomial time for the case of constant dimension $d$). We complement this with a lower bound showing that even deciding whether a given regularizer is $\alpha$-strongly-convex with respect to a given norm is NP-hard.
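For context, the sketch below shows a generic FTRL instantiation with the squared Euclidean norm as regularizer on the unit-ball action set; the paper's algorithm instead constructs a near-optimal regularizer for a given pair of action and loss sets, which this minimal example does not attempt.

```python
import numpy as np

def ftrl_euclidean(loss_vectors, eta=0.1):
    """FTRL with the squared Euclidean norm as regularizer on the unit-ball action
    set: play argmin_x <sum of past losses, x> + ||x||^2 / (2*eta), i.e. the
    (projected) point -eta * cumulative_loss, then incur the linear loss <g_t, x_t>."""
    d = loss_vectors.shape[1]
    cum_loss = np.zeros(d)
    total = 0.0
    for g in loss_vectors:
        x = -eta * cum_loss
        norm = np.linalg.norm(x)
        if norm > 1.0:            # project back onto the unit ball
            x = x / norm
        total += float(g @ x)     # linear loss of the played action
        cum_loss += g
    return total

# Usage: adversarial-looking alternating losses in d = 3 dimensions.
rng = np.random.default_rng(0)
losses = rng.choice([-1.0, 1.0], size=(1000, 3))
print(ftrl_euclidean(losses))
```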
Authors: M. Tanveer, R. K. Sharma, A. Quadir, M. Sajid
Abstract: In the domain of machine learning, the least square twin support vector machine (LSTSVM) stands out as one of the state-of-the-art models. However, LSTSVM is sensitive to noise and outliers, overlooks the SRM principle, and is unstable under resampling. Moreover, its computational complexity and reliance on matrix inversions hinder the efficient processing of large datasets. As a remedy to the aforementioned challenges, we propose the robust granular ball LSTSVM (GBLSTSVM). GBLSTSVM is trained using granular balls instead of original data points. The core of a granular ball lies at its center, which encapsulates all the pertinent information of the data points within a ball of specified radius. To improve scalability and efficiency, we further introduce the large-scale GBLSTSVM (LS-GBLSTSVM), which incorporates the SRM principle through regularization terms. Experiments are performed on UCI, KEEL, and NDC benchmark datasets; both the proposed GBLSTSVM and LS-GBLSTSVM models consistently outperform the baseline models.
Authors: Aditya Vikram Singh, Ethan Rathbun, Emma Graham, Lisa Oakley, Simona Boboila, Alina Oprea, Peter Chin
Abstract: Recent advances in multi-agent reinforcement learning (MARL) have created opportunities to solve complex real-world tasks. Cybersecurity is a notable application area, where defending networks against sophisticated adversaries remains a challenging task typically performed by teams of security operators. In this work, we explore novel MARL strategies for building autonomous cyber network defenses that address challenges such as large policy spaces, partial observability, and stealthy, deceptive adversarial strategies. To facilitate efficient and generalized learning, we propose a hierarchical Proximal Policy Optimization (PPO) architecture that decomposes the cyber defense task into specific sub-tasks like network investigation and host recovery. Our approach involves training sub-policies for each sub-task using PPO enhanced with domain expertise. These sub-policies are then leveraged by a master defense policy that coordinates their selection to solve complex network defense tasks. Furthermore, the sub-policies can be fine-tuned and transferred with minimal cost to defend against shifts in adversarial behavior or changes in network settings. We conduct extensive experiments using CybORG Cage 4, the state-of-the-art MARL environment for cyber defense. Comparisons with multiple baselines across different adversaries show that our hierarchical learning approach achieves top performance in terms of convergence speed, episodic return, and several interpretable metrics relevant to cybersecurity, including the fraction of clean machines on the network, precision, and false positives on recoveries.
Authors: Dongsu Lee, Minhae Kwon
Abstract: Understanding cognitive processes in multi-agent interactions is a primary goal in cognitive science. It can guide the direction of artificial intelligence (AI) research toward social decision-making in multi-agent systems, which includes uncertainty from character heterogeneity. In this paper, we introduce an episodic future thinking (EFT) mechanism for a reinforcement learning (RL) agent, inspired by cognitive processes observed in animals. To enable future thinking functionality, we first develop a multi-character policy that captures diverse characters with an ensemble of heterogeneous policies. Here, the character of an agent is defined as a different weight combination on reward components, representing distinct behavioral preferences. The future thinking agent collects observation-action trajectories of the target agents and uses the pre-trained multi-character policy to infer their characters. Once the character is inferred, the agent predicts the upcoming actions of target agents and simulates the potential future scenario. This capability allows the agent to adaptively select the optimal action, considering the predicted future scenario in multi-agent interactions. To evaluate the proposed mechanism, we consider the multi-agent autonomous driving scenario with diverse driving traits and multiple particle environments. Simulation results demonstrate that the EFT mechanism with accurate character inference leads to a higher reward than existing multi-agent solutions. We also confirm that the effect of reward improvement remains valid across societies with different levels of character diversity.
Authors: Amirhossein Afsharrad, Parisa Oftadeh, Ahmadreza Moradipari, Sanjay Lall
Abstract: In this study, we explore a collaborative multi-agent stochastic linear bandit setting involving a network of $N$ agents that communicate locally to minimize their collective regret while keeping their expected cost under a specified threshold $\tau$. Each agent encounters a distinct linear bandit problem characterized by its own reward and cost parameters, i.e., local parameters. The goal of the agents is to determine the best overall action corresponding to the average of these parameters, the so-called global parameters. In each round, an agent is randomly chosen to select an action based on its current knowledge of the system. This chosen action is then executed by all agents, who then observe their individual rewards and costs. We propose a safe distributed upper confidence bound algorithm, called \textit{MA-OPLB}, and establish a high-probability bound on its $T$-round regret. MA-OPLB utilizes an accelerated consensus method, in which agents can compute an estimate of the average rewards and costs across the network by communicating the proper information with their neighbors. We show that our regret bound is of order $ \mathcal{O}\left(\frac{d}{\tau-c_0}\frac{\log(NT)^2}{\sqrt{N}}\sqrt{\frac{T}{\log(1/|\lambda_2|)}}\right)$, where $\lambda_2$ is the second largest (in absolute value) eigenvalue of the communication matrix, and $\tau-c_0$ is the known cost gap of a feasible action. We also experimentally demonstrate the performance of our proposed algorithm in different network structures.
Authors: Rohit Agarwal, Karaka Prasanth Naidu, Alexander Horsch, Krishna Agarwal, Dilip K. Prasad
Abstract: We study the online learning problem characterized by the varying input feature space of streaming data. Although LSTMs have been employed to effectively capture the temporal nature of streaming data, they cannot handle the dimension-varying streams in an online learning setting. Therefore, we propose a dynamic LSTM-based novel method, called packetLSTM, to model the dimension-varying streams. The packetLSTM's dynamic framework consists of an evolving packet of LSTMs, each dedicated to processing one input feature. Each LSTM retains the local information of its corresponding feature, while a shared common memory consolidates global information. This configuration facilitates continuous learning and mitigates the issue of forgetting, even when certain features are absent for extended time periods. The idea of utilizing one LSTM per feature coupled with a dimension-invariant operator for information aggregation enhances the dynamic nature of packetLSTM. This dynamic nature is evidenced by the model's ability to activate, deactivate, and add new LSTMs as required, thus seamlessly accommodating varying input dimensions. The packetLSTM achieves state-of-the-art results on five datasets, and its underlying principle is extended to other RNN types, like GRU and vanilla RNN.
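A minimal sketch of the per-feature-LSTM idea is given below, with a mean over active hidden states standing in for the shared memory and the dimension-invariant aggregation; the exact input construction and aggregation operator are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PacketLSTMSketch(nn.Module):
    """Illustrative sketch (not the authors' code): one LSTMCell per feature,
    with a dimension-invariant mean over active hidden states as shared memory."""
    def __init__(self, hidden_size=16):
        super().__init__()
        self.hidden_size = hidden_size
        self.cells = nn.ModuleDict()   # one cell per named feature, added lazily
        self.state = {}                # per-feature (h, c)
        self.head = nn.Linear(hidden_size, 1)

    def step(self, features):
        """`features` maps feature name -> scalar value; missing features are
        simply absent, and newly appearing features get a fresh LSTMCell."""
        hs = []
        for name, value in features.items():
            if name not in self.cells:
                # input = feature value concatenated with the shared memory
                self.cells[name] = nn.LSTMCell(1 + self.hidden_size, self.hidden_size)
                self.state[name] = (torch.zeros(1, self.hidden_size),
                                    torch.zeros(1, self.hidden_size))
            shared = torch.stack([h for h, _ in self.state.values()]).mean(0)
            x = torch.cat([torch.tensor([[value]], dtype=torch.float32), shared], dim=1)
            self.state[name] = self.cells[name](x, self.state[name])
            hs.append(self.state[name][0])
        # Dimension-invariant aggregation over however many features arrived.
        pooled = torch.stack(hs).mean(0)
        return self.head(pooled)

model = PacketLSTMSketch()
print(model.step({"f1": 0.3, "f2": -1.2}))   # two features present
print(model.step({"f1": 0.1, "f3": 0.7}))    # f2 absent, f3 newly appears
```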
Authors: Sara Honarvar, Yancy Diaz-Mercado
Abstract: Modeling human trajectories in crowded environments is challenging due to the complex nature of pedestrian behavior and interactions. This paper proposes a geometric graph neural network (GNN) architecture that integrates domain knowledge from psychological studies to model pedestrian interactions and predict future trajectories. Unlike prior studies using complete graphs, we define interaction neighborhoods using pedestrians' field of view, motion direction, and distance-based kernel functions to construct graph representations of crowds. Evaluations across multiple datasets demonstrate improved prediction accuracy through reduced average and final displacement error metrics. Our findings underscore the importance of integrating domain knowledge with data-driven approaches for effective modeling of human interactions in crowds.
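The sketch below illustrates one way such an interaction neighborhood could be constructed: pedestrian j influences pedestrian i only if j falls within i's field of view around its motion direction, with edge weights given by a Gaussian distance kernel. The field-of-view angle, kernel bandwidth, and exact weighting are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def interaction_adjacency(pos, vel, fov_deg=120.0, sigma=2.0):
    """Build a directed adjacency matrix where A[i, j] > 0 only if pedestrian j
    lies within pedestrian i's field of view; weights decay with distance."""
    n = len(pos)
    A = np.zeros((n, n))
    half_fov = np.deg2rad(fov_deg) / 2.0
    for i in range(n):
        heading = vel[i] / (np.linalg.norm(vel[i]) + 1e-8)
        for j in range(n):
            if i == j:
                continue
            offset = pos[j] - pos[i]
            dist = np.linalg.norm(offset)
            angle = np.arccos(np.clip(offset @ heading / (dist + 1e-8), -1.0, 1.0))
            if angle <= half_fov:                           # inside the field of view
                A[i, j] = np.exp(-dist**2 / (2 * sigma**2))  # distance-based kernel
    return A

pos = np.array([[0.0, 0.0], [1.0, 0.5], [-2.0, 0.0]])
vel = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
print(interaction_adjacency(pos, vel))
```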
Authors: My H Dinh, James Kotary, Lauryn P. Gouldin, William Yeoh, Ferdinando Fioretto
Abstract: Criminal courts across the United States handle millions of cases every year, and the scheduling of those cases must accommodate a diverse set of constraints, including the preferences and availability of courts, prosecutors, and defense teams. When criminal court schedules are formed, defendants' scheduling preferences often take the least priority, although defendants may face significant consequences (including arrest or detention) for missed court dates. Additionally, studies indicate that defendants' nonappearances impose costs on the courts and other system stakeholders. To address these issues, courts and commentators have begun to recognize that pretrial outcomes for defendants and for the system would be improved with greater attention to court processes, including \emph{court scheduling practices}. There is thus a need for fair criminal court pretrial scheduling systems that account for defendants' preferences and availability, but the collection of such data poses logistical challenges. Furthermore, optimizing schedules fairly across various parties' preferences is a complex optimization problem, even when such data is available. In an effort to construct such a fair scheduling system under data uncertainty, this paper proposes a joint optimization and learning framework that combines machine learning models trained end-to-end with efficient matching algorithms. This framework aims to produce court schedules that optimize a principled measure of fairness, balancing the availability and preferences of all parties.
Authors: \"Omer Veysel \c{C}a\u{g}atan, Bar{\i}\c{s} Akg\"un
Abstract: In this study, we investigate the effect of SSL objective modifications within the SPR framework, focusing on specific adjustments such as terminal state masking and prioritized replay weighting, which were not explicitly addressed in the original design. While these modifications are specific to RL, they are not universally applicable across all RL algorithms. Therefore, we aim to assess their impact on performance and to explore other SSL objectives, such as Barlow Twins and VICReg, that do not accommodate these adjustments. We evaluate six SPR variants on the Atari 100k benchmark, including versions both with and without these modifications. Additionally, we test the performance of these objectives on the DeepMind Control Suite, where such modifications are absent. Our findings reveal that incorporating specific SSL modifications within SPR significantly enhances performance, and this influence extends to subsequent frameworks like SR-SPR and BBF, highlighting the critical importance of SSL objective selection and related adaptations in achieving data efficiency in self-predictive reinforcement learning.
Authors: Tao Li, Henger Li, Yunian Pan, Tianyi Xu, Zizhan Zheng, Quanyan Zhu
Abstract: Federated learning (FL) is susceptible to a range of security threats. Although various defense mechanisms have been proposed, they are typically non-adaptive and tailored to specific types of attacks, leaving them insufficient in the face of multiple uncertain, unknown, and adaptive attacks employing diverse strategies. This work formulates adversarial federated learning under a mixture of various attacks as a Bayesian Stackelberg Markov game, based on which we propose the meta-Stackelberg defense composed of pre-training and online adaptation. The gist is to simulate strong attack behavior using reinforcement learning (RL-based attacks) in pre-training and then to design a meta-RL-based defense to combat diverse and adaptive attacks. We develop an efficient meta-learning approach to solve the game, leading to a robust and adaptive FL defense. Theoretically, our meta-learning algorithm, meta-Stackelberg learning, provably converges to the first-order $\varepsilon$-meta-equilibrium point in $O(\varepsilon^{-2})$ gradient iterations with $O(\varepsilon^{-4})$ samples per iteration. Experiments show that our meta-Stackelberg framework performs superbly against strong model poisoning and backdoor attacks of uncertain and unknown types.
Authors: Samarth Bhargav, Alexander Gu
Abstract: Understanding the internal mechanisms of GPT-style transformers, particularly their capacity to perform in-context learning (ICL), is critical for advancing AI alignment and interpretability. In-context learning allows transformers to generalize during inference without modifying their weights, yet the precise operations driving this capability remain largely opaque. This paper presents an investigation into the mechanistic interpretability of these transformers, focusing specifically on their ability to learn and predict affine recurrences as an ICL task. To address this, we trained a custom three-layer transformer to predict affine recurrences and analyzed the model's internal operations using both empirical and theoretical approaches. Our findings reveal that the model forms an initial estimate of the target sequence using a copying mechanism in the zeroth layer, which is subsequently refined through negative similarity heads in the second layer. These insights contribute to a deeper understanding of transformer behaviors in recursive tasks and offer potential avenues for improving AI alignment through mechanistic interpretability. Finally, we discuss the implications of our results for future work, including extensions to higher-dimensional recurrences and the exploration of polynomial sequences.
Authors: Furkan Mumcu, Yasin Yilmaz
Abstract: Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial examples. While numerous successful adversarial attacks have been proposed, defenses against these attacks remain relatively understudied. Existing defense approaches either focus on negating the effects of perturbations caused by the attacks to restore the DNNs' original predictions or use a secondary model to detect adversarial examples. However, these methods often become ineffective due to the continuous advancements in attack techniques. We propose a novel universal and lightweight method to detect adversarial examples by analyzing the layer outputs of DNNs. Through theoretical justification and extensive experiments, we demonstrate that our detection method is highly effective, compatible with any DNN architecture, and applicable across different domains, such as image, video, and audio.
Authors: Anthony Baez, Wang Zhang, Ziwen Ma, Subhro Das, Lam M. Nguyen, Luca Daniel
Abstract: Physics-informed neural networks (PINNs) incorporate physical laws into their training to efficiently solve partial differential equations (PDEs) with minimal data. However, PINNs fail to guarantee adherence to conservation laws, which are also important to consider in modeling physical systems. To address this, we proposed PINN-Proj, a PINN-based model that uses a novel projection method to enforce conservation laws. We found that PINN-Proj substantially outperformed PINN in conserving momentum and lowered prediction error by three to four orders of magnitude from the best benchmark tested. PINN-Proj also performed marginally better in the separate task of state prediction on three PDE datasets.
Authors: Mahesh Vaijainthymala Krishnamoorthy
Abstract: As AI systems increasingly integrate into critical societal sectors, the demand for robust privacy-preserving methods has escalated. This paper introduces Data Obfuscation through Latent Space Projection (LSP), a novel technique aimed at enhancing AI governance and ensuring Responsible AI compliance. LSP uses machine learning to project sensitive data into a latent space, effectively obfuscating it while preserving essential features for model training and inference. Unlike traditional privacy methods like differential privacy or homomorphic encryption, LSP transforms data into an abstract, lower-dimensional form, achieving a delicate balance between data utility and privacy. Leveraging autoencoders and adversarial training, LSP separates sensitive from non-sensitive information, allowing for precise control over privacy-utility trade-offs. We validate LSP's effectiveness through experiments on benchmark datasets and two real-world case studies: healthcare cancer diagnosis and financial fraud analysis. Our results show LSP achieves high performance (98.7% accuracy in image classification) while providing strong privacy (97.3% protection against sensitive attribute inference), outperforming traditional anonymization and privacy-preserving methods. The paper also examines LSP's alignment with global AI governance frameworks, such as GDPR, CCPA, and HIPAA, highlighting its contribution to fairness, transparency, and accountability. By embedding privacy within the machine learning pipeline, LSP offers a promising approach to developing AI systems that respect privacy while delivering valuable insights. We conclude by discussing future research directions, including theoretical privacy guarantees, integration with federated learning, and enhancing latent space interpretability, positioning LSP as a critical tool for ethical AI advancement.
Authors: Yann Bouteiller, Karthik Soma, Giovanni Beltrame
Abstract: The universe involves many independent co-learning agents as an ever-evolving part of our observed environment. Yet, in practice, Multi-Agent Reinforcement Learning (MARL) applications are usually constrained to small, homogeneous populations and remain computationally intensive. In this paper, we study how large heterogeneous populations of learning agents evolve in normal-form games. We show how, under assumptions commonly made in the multi-armed bandit literature, Multi-Agent Policy Gradient closely resembles the Replicator Dynamic, and we further derive a fast, parallelizable implementation of Opponent-Learning Awareness tailored for evolutionary simulations. This enables us to simulate the evolution of very large populations made of heterogeneous co-learning agents, under both naive and advanced learning strategies. We demonstrate our approach in simulations of 200,000 agents, evolving in the classic games of Hawk-Dove, Stag-Hunt, and Rock-Paper-Scissors. Each game highlights distinct ways in which Opponent-Learning Awareness affects evolution.
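For reference, the replicator dynamic that the paper relates to multi-agent policy gradient can be simulated in a few lines; the sketch below runs it on the Hawk-Dove game (with payoff values of our own choosing) and is not the authors' parallel evolutionary simulation.

```python
import numpy as np

def replicator_step(x, A, dt=0.01):
    """One Euler step of the replicator dynamic: x_i' = x_i * [(Ax)_i - x.Ax]."""
    fitness = A @ x
    avg = x @ fitness
    return x + dt * x * (fitness - avg)

# Hawk-Dove with resource value V=2 and fight cost C=3 (row player's payoff).
V, C = 2.0, 3.0
A = np.array([[(V - C) / 2, V],
              [0.0,         V / 2]])   # rows/cols: Hawk, Dove

x = np.array([0.9, 0.1])   # population starts mostly Hawk
for _ in range(5000):
    x = replicator_step(x, A)
print(x)   # converges toward the mixed equilibrium with Hawk fraction V/C = 2/3
```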
Authors: Taisuke Kobayashi
Abstract: In reinforcement learning (RL), the temporal difference (TD) error is known to be related to the firing rate of dopamine neurons. It has been observed that dopamine neurons do not behave uniformly; rather, each responds to the TD error in an optimistic or pessimistic manner, which has been interpreted as a kind of distributional RL. To explain such biological data, a heuristic model has also been designed with learning rates that are asymmetric for positive and negative TD errors. However, this heuristic model is not theoretically grounded, and it is unknown whether it can work as an RL algorithm. This paper therefore introduces a novel theoretically grounded model with optimism and pessimism, which is derived from control as inference. In combination with ensemble learning, a distributional value function serving as a critic is estimated from regularly introduced optimism and pessimism. Based on its central value, the policy in an actor is improved. The proposed algorithm, called DROP (distributional and regular optimism and pessimism), is compared on dynamic tasks. While the heuristic model showed poor learning performance, DROP showed excellent performance in all tasks with high generality. In other words, these results suggest that DROP is a new model that can elicit the potential contributions of optimism and pessimism.
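The heuristic asymmetric-learning-rate model that the paper contrasts with its control-as-inference derivation can be sketched as follows; DROP itself is not reproduced here, and the learning rates and reward distribution are illustrative.

```python
import numpy as np

def asymmetric_td_update(v, reward, v_next, gamma=0.99,
                         alpha_plus=0.1, alpha_minus=0.02):
    """Value update with different learning rates for positive and negative TD
    errors -- the heuristic 'optimistic/pessimistic' model described above."""
    delta = reward + gamma * v_next - v
    alpha = alpha_plus if delta > 0 else alpha_minus
    return v + alpha * delta

# An "optimistic" unit (alpha_plus > alpha_minus) settles above the true mean of
# a noisy reward; a "pessimistic" one (alpha_plus < alpha_minus) settles below it.
rng = np.random.default_rng(0)
v = 0.0
for _ in range(20000):
    r = rng.normal(loc=1.0, scale=1.0)
    v = asymmetric_td_update(v, r, v_next=0.0, gamma=0.0,
                             alpha_plus=0.1, alpha_minus=0.02)
print(v)   # biased above 1.0 because positive errors are weighted more heavily
```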
Authors: Bohan Wang, Yurui Chang, Lu Lin
Abstract: Distribution shifts between training and testing datasets significantly impair the model performance on graph learning. A commonly-taken causal view in graph invariant learning suggests that stable predictive features of graphs are causally associated with labels, whereas varying environmental features lead to distribution shifts. In particular, covariate shifts caused by unseen environments in test graphs underscore the critical need for out-of-distribution (OOD) generalization. Existing graph augmentation methods designed to address the covariate shift often disentangle the stable and environmental features in the input space, and selectively perturb or mixup the environmental features. However, such perturbation-based methods heavily rely on an accurate separation of stable and environmental features, and their exploration ability is confined to existing environmental features in the training distribution. To overcome these limitations, we introduce a novel approach using score-based graph generation strategies that synthesize unseen environmental features while preserving the validity and stable features of overall graph patterns. Our comprehensive empirical evaluations demonstrate the enhanced effectiveness of our method in improving graph OOD generalization.
Authors: Jinghan Jia, Jiancheng Liu, Yihua Zhang, Parikshit Ram, Nathalie Baracaldo, Sijia Liu
Abstract: The need for effective unlearning mechanisms in large language models (LLMs) is increasingly urgent, driven by the necessity to adhere to data regulations and foster ethical generative AI practices. Despite growing interest in LLM unlearning, much of the existing research has focused on varied unlearning method designs to boost effectiveness and efficiency. However, the inherent relationship between model weights and LLM unlearning has not been extensively examined. In this paper, we systematically explore how model weights interact with unlearning processes in LLMs, and we design the weight attribution-guided LLM unlearning method, WAGLE, which unveils the interconnections between the 'influence' of weights and the 'influence' of data to forget and retain in LLM generation. By strategically guiding the LLM unlearning across different types of unlearning methods and tasks, WAGLE can erase the undesired content while maintaining the performance of the original tasks. Our extensive experiments show that WAGLE boosts unlearning performance across a range of LLM unlearning methods such as gradient difference and (negative) preference optimization, applications such as fictitious unlearning, malicious use prevention, and copyrighted information removal, and models including Zephyr-7b-beta and Llama2-7b. To the best of our knowledge, our work offers the first principled method for attributing and pinpointing the influential weights in enhancing LLM unlearning. This stands in contrast to previous methods, which either lack weight attribution or rely on simpler weight attribution techniques.
Authors: Soto Anno, Kota Tsubouchi, Masamichi Shimosaka
Abstract: Forecasting rail congestion is crucial for efficient mobility in transport systems. We present rail congestion forecasting using reports from passengers collected through a transit application. Although reports from passengers have received attention from researchers, ensuring a sufficient volume of reports is challenging due to passengers' reluctance. The limited number of reports results in sparsity of the congestion labels, which can be an issue in building a stable prediction model. To address this issue, we propose a semi-supervised method for congestion forecasting for trains, called SURCONFORT. Our key idea is twofold: first, we adopt semi-supervised learning to leverage sparsely labeled data together with abundant unlabeled data. Second, in order to complement the unlabeled data from nearby stations, we design a railway-network-oriented graph and apply it to semi-supervised graph regularization. Empirical experiments with actual reporting data show that SURCONFORT improved the forecasting performance by 14.9% over state-of-the-art methods under label sparsity.
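A minimal sketch of the kind of graph-regularized semi-supervised objective described above is shown below: a supervised loss on the few labeled stations plus a smoothness term over edges of the railway graph. The loss form, weighting, and toy graph are assumptions for illustration, not SURCONFORT's exact formulation.

```python
import torch

def graph_regularized_loss(preds, labels, labeled_mask, adj, lam=0.5):
    """Supervised loss on labeled stations plus a Laplacian-style smoothness term
    that ties predictions of stations adjacent in the railway graph."""
    sup = torch.nn.functional.mse_loss(preds[labeled_mask], labels[labeled_mask])
    diff = preds.unsqueeze(0) - preds.unsqueeze(1)        # pairwise differences
    smooth = (adj * diff.pow(2)).sum() / adj.sum()
    return sup + lam * smooth

preds = torch.randn(5, requires_grad=True)         # congestion score per station
labels = torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0])   # only station 0 is labeled
mask = torch.tensor([True, False, False, False, False])
adj = torch.tensor([[0, 1, 0, 0, 0],                # a toy line-shaped railway graph
                    [1, 0, 1, 0, 0],
                    [0, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1],
                    [0, 0, 0, 1, 0]], dtype=torch.float32)
print(graph_regularized_loss(preds, labels, mask, adj))
```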
Authors: Muhammad Tanzil Furqon, Mahardhika Pratama, Ary Mazharuddin Shiddiqi, Lin Liu, Habibullah Habibullah, Kutluyil Dogancay
Abstract: The issue of source-free time-series domain adaptation has received scarce research attention. Moreover, existing approaches rely solely on time-domain features, ignoring frequency components that provide complementary information. This paper proposes Time Frequency Domain Adaptation (TFDA), a method to cope with source-free time-series domain adaptation problems. TFDA is developed with a dual-branch network structure that fully utilizes both time and frequency features in delivering final predictions. It induces pseudo-labels based on a neighborhood concept, where the predictions of a group of samples are aggregated to generate reliable pseudo-labels. The concept of contrastive learning is carried out in both the time and frequency domains with pseudo-label information and a negative-pair exclusion strategy to make valid neighborhood assumptions. In addition, a time-frequency consistency technique is proposed using a self-distillation strategy, while an uncertainty reduction strategy is implemented to alleviate uncertainties due to the domain shift problem. Last but not least, a curriculum learning strategy is integrated to combat noisy pseudo-labels. Our experiments demonstrate the advantage of our approach over prior arts with noticeable margins on benchmark problems.
Authors: Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
Abstract: Autonomous agents powered by large language models (LLMs) show promising potential in assistive tasks across various domains, including mobile device control. As these agents interact directly with personal information and device settings, ensuring their safe and reliable behavior is crucial to prevent undesirable outcomes. However, no benchmark exists for standardized evaluation of the safety of mobile device-control agents. In this work, we introduce MobileSafetyBench, a benchmark designed to evaluate the safety of device-control agents within a realistic mobile environment based on Android emulators. We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications. To clearly evaluate safety apart from general capabilities, we design separate tasks measuring safety and tasks evaluating helpfulness. The safety tasks challenge agents with managing potential risks prevalent in daily life and include tests to evaluate robustness against indirect prompt injections. Our experiments demonstrate that while baseline agents, based on state-of-the-art LLMs, perform well in executing helpful tasks, they show poor performance in safety tasks. To mitigate these safety concerns, we propose a prompting method that encourages agents to prioritize safety considerations. While this method shows promise in promoting safer behaviors, there is still considerable room for improvement to fully earn user trust. This highlights the urgent need for continued research to develop more robust safety mechanisms in mobile environments. We open-source our benchmark at: https://mobilesafetybench.github.io/.
Authors: Zhixia He, Chen Zhao, Minglai Shao, Yujie Lin, Dong Li, Qin Tian
Abstract: Out-of-distribution (OOD) detection poses a significant challenge for Graph Neural Networks (GNNs), particularly in open-world scenarios with varying distribution shifts. Most existing OOD detection methods on graphs primarily focus on identifying instances in test data domains caused by either semantic shifts (changes in data classes) or covariate shifts (changes in data features), while leaving the simultaneous occurrence of both distribution shifts under-explored. In this work, we address both types of shifts simultaneously and introduce a novel challenge for OOD detection on graphs: graph-level semantic OOD detection under covariate shift. In this scenario, variations between the training and test domains result from the concurrent presence of both covariate and semantic shifts, where only graphs associated with unknown classes are identified as OOD samples (OODs). To tackle this challenge, we propose a novel two-phase framework called Graph Disentangled Diffusion Augmentation (GDDA). The first phase focuses on disentangling graph representations into domain-invariant semantic factors and domain-specific style factors. In the second phase, we introduce a novel distribution-shift-controlled score-based generative diffusion model that generates latent factors outside the training semantic and style spaces. Additionally, auxiliary pseudo-in-distribution (InD) and pseudo-OOD graph representations are employed to enhance the effectiveness of the energy-based semantic OOD detector. Extensive empirical studies on three benchmark datasets demonstrate that our approach outperforms state-of-the-art baselines.
Authors: Yang Hu, Tianyi Chen, Na Li, Kai Wang, Bo Dai
Abstract: Off-policy evaluation (OPE) is one of the most fundamental problems in reinforcement learning (RL): estimating the expected long-term payoff of a given target policy using only experiences from another behavior policy that is potentially unknown. The distribution correction estimation (DICE) family of estimators has advanced the state of the art in OPE by breaking the curse of horizon. However, the major bottleneck in applying DICE estimators lies in the difficulty of solving the saddle-point optimization involved, especially with neural network implementations. In this paper, we tackle this challenge by establishing a linear representation of the value function and the stationary distribution correction ratio, i.e., the primal and dual variables in the DICE framework, using the spectral decomposition of the transition operator. Such a primal-dual representation not only bypasses the non-convex non-concave optimization in vanilla DICE, thereby enabling a computationally efficient algorithm, but also paves the way for more efficient utilization of historical data. We highlight that our algorithm, SpectralDICE, is the first to leverage a linear representation of the primal-dual variables that is both computation- and sample-efficient, the performance of which is supported by a rigorous theoretical sample complexity guarantee and a thorough empirical evaluation on various benchmarks.
Authors: Xintao Li, Sibei Liu, Dezhi Yu, Yang Zhang, Xiaoyu Liu
Abstract: Readmissions among Medicare beneficiaries are a major problem for the US healthcare system from the perspective of both healthcare operations and patient caregiving outcomes. Our study analyzes Medicare hospital readmissions using LSTM networks with feature engineering to assess feature contributions. We selected variables from admission-level data, inpatient medical history, and patient demographics. The LSTM model is designed to capture temporal dynamics from admission-level and patient-level data. In a case study on the MIMIC dataset, the LSTM model outperformed the logistic regression baseline, accurately leveraging temporal features to predict readmission. The most influential features were the Charlson Comorbidity Index, hospital length of stay, and the number of hospital admissions over the past six months, while demographic variables were less impactful. This work suggests that LSTM networks offer a more promising approach to improving Medicare patient readmission prediction, capturing temporal interactions in patient databases and enhancing current prediction models for healthcare providers. Adopting such predictive models into clinical practice may be more effective in identifying Medicare patients who need early and targeted interventions to improve patient outcomes.
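An illustrative sketch of such a model (not the study's exact architecture) is an LSTM over a patient's admission-level feature sequence whose final hidden state is concatenated with static demographics before a sigmoid readmission head; all layer sizes and feature counts below are assumptions.

```python
import torch
import torch.nn as nn

class ReadmissionLSTM(nn.Module):
    """Sketch: LSTM over admission-level features plus static demographics."""
    def __init__(self, n_admission_feats, n_static_feats, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_admission_feats, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + n_static_feats, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, admissions, static):
        # admissions: (batch, n_admissions, n_admission_feats), e.g. length of stay,
        # Charlson index, admissions in the previous six months per visit.
        _, (h_n, _) = self.lstm(admissions)
        return self.head(torch.cat([h_n[-1], static], dim=1))

model = ReadmissionLSTM(n_admission_feats=8, n_static_feats=3)
proba = model(torch.randn(4, 5, 8), torch.randn(4, 3))   # 4 patients, 5 visits each
print(proba.shape)   # torch.Size([4, 1]) readmission probabilities
```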
Authors: Bang You, Huaping Liu
Abstract: Reinforcement learning has achieved promising results on robotic control tasks but struggles to leverage information effectively from multiple sensory modalities that differ in many characteristics. Recent works construct auxiliary losses based on reconstruction or mutual information to extract joint representations from multiple sensory inputs to improve the sample efficiency and performance of reinforcement learning algorithms. However, the representations learned by these methods could capture information irrelevant to learning a policy and may degrade the performance. We argue that compressing information in the learned joint representations about raw multimodal observations is helpful, and propose a multimodal information bottleneck model to learn task-relevant joint representations from egocentric images and proprioception. Our model compresses and retains the predictive information in multimodal observations for learning a compressed joint representation, which fuses complementary information from visual and proprioceptive feedback and meanwhile filters out task-irrelevant information in raw multimodal observations. We propose to minimize the upper bound of our multimodal information bottleneck objective for computationally tractable optimization. Experimental evaluations on several challenging locomotion tasks with egocentric images and proprioception show that our method achieves better sample efficiency and zero-shot robustness to unseen white noise than leading baselines. We also empirically demonstrate that leveraging information from egocentric images and proprioception is more helpful for learning policies on locomotion tasks than solely using one single modality.
Authors: Shangshang Yang, Mingyang Chen, Ziwen Wang, Xiaoshan Yu, Panpan Zhang, Haiping Ma, Xingyi Zhang
Abstract: Existing graph learning-based cognitive diagnosis (CD) methods have achieved relatively good results, but their student, exercise, and concept representations are learned and exchanged in an implicit unified graph, which causes the interaction-agnostic exercise and concept representations to be learned poorly and fails to provide high robustness against noise in students' interactions. Besides, lower-order exercise latent representations obtained in shallow layers are not well explored when learning the student representation. To tackle these issues, this paper suggests a meta multigraph-assisted disentangled graph learning framework for CD (DisenGCD), which learns three types of representations on three disentangled graphs: the student-exercise-concept interaction graph, the exercise-concept relation graph, and the concept dependency graph, respectively. Specifically, the latter two graphs are first disentangled from the interaction graph. Then, the student representation is learned from the interaction graph by a devised meta multigraph learning module; multiple learnable propagation paths in this module enable the current student latent representation to access lower-order exercise latent representations, leading to more effective and robust student representations; the exercise and concept representations are learned on the relation and dependency graphs by graph attention modules. Finally, a novel diagnostic function is devised to handle the three disentangled representations for prediction. Experiments show better performance and robustness of DisenGCD than state-of-the-art CD methods and demonstrate the effectiveness of the disentangled learning framework and meta multigraph module. The source code is available at \url{https://github.com/BIMK/Intelligent-Education/tree/main/DisenGCD}.
URLs: https://github.com/BIMK/Intelligent-Education/tree/main/DisenGCD
Authors: Ivoline C. Ngong, Joseph P. Near, Niloofar Mireshghallah
Abstract: Differentially private SGD (DPSGD) enables privacy-preserving training of language models, but often reduces utility, diversity, and linguistic quality. We introduce DPRefine, a three-phase method that initializes a model using data synthesis from a small pre-trained LM with rigorous filtering, applies DP finetuning on private data, and performs self-distillation to refine outputs. This approach significantly outperforms vanilla DPSGD, with AlpacaEval preferring DPRefine's generations in 78.4% of cases across all datasets. Our analysis reveals that DPRefine reduces linguistic errors in generated text by 84.0%, mitigating the grammar and spelling errors commonly associated with DPSGD. It also reduces inconsistencies found in non-private models, such as hallucinated details and misattributed quotes. We find that small models like GPT-2 can be effective for initialization and distillation, highlighting their potential in enabling scalable and efficient deployment of privacy-preserving language models.
Authors: Xiaohuan Bi, Xi Li
Abstract: Federated learning (FL) enables decentralized model training while preserving privacy. Recently, integrating Foundation Models (FMs) into FL has boosted performance but also introduced a novel backdoor attack mechanism. Attackers can exploit the FM's capabilities to embed backdoors into synthetic data generated by FMs used for model fusion, subsequently infecting all client models through knowledge sharing without involvement in the long-lasting FL process. These novel attacks render existing FL backdoor defenses ineffective, as they primarily detect anomalies among client updates, which may appear uniformly malicious under this attack. Our work proposes a novel data-free defense strategy by constraining abnormal activations in the hidden feature space during model aggregation on the server. The activation constraints, optimized using synthetic data alongside FL training, mitigate the attack while barely affecting model performance, as the parameters remain untouched. Extensive experiments demonstrate its effectiveness against both novel and classic backdoor attacks, outperforming existing defenses while maintaining model performance.
Authors: Mir Imtiaz Mostafiz (Department of Computer Science, Purdue University), Eunseob Kim (School of Mechanical Engineering, Purdue University), Adrian Shuai Li (Department of Computer Science, Purdue University), Elisa Bertino (Department of Computer Science, Purdue University), Martin Byung-Guk Jun (School of Mechanical Engineering, Purdue University), Ali Shakouri (School of Electrical and Computer Engineering, Purdue University)
Abstract: Cutting state monitoring in the milling process is crucial for improving manufacturing efficiency and tool life. Cutting sound detection using machine learning (ML) models, inspired by experienced machinists, can be employed as a cost-effective and non-intrusive monitoring method in a complex manufacturing environment. However, labeling industry data for training is costly and time-consuming. Moreover, industry data is often scarce. In this study, we propose a novel adversarial domain adaptation (DA) approach to leverage abundant lab data to learn from scarce industry data, both labeled, for training a cutting-sound detection model. Rather than adapting the features from separate domains directly, we first project them into two separate latent spaces that jointly serve as the feature space for learning domain-independent representations. We also analyze two different mechanisms for adversarial learning, in which the discriminator works as an adversary and as a critic in separate settings, enabling our model to learn expressive domain-invariant and domain-ingrained features, respectively. We collected cutting sound data from multiple sensors in different locations, prepared datasets from the lab and industry domains, and evaluated our learning models on them. Experiments showed that our models outperformed the multi-layer perceptron based vanilla domain adaptation models in labeling tasks on the curated datasets, achieving nearly 92%, 82%, and 85% accuracy, respectively, for three different sensors installed in industry settings.
Authors: Mridul Gupta, Samyak Jain, Vansh Ramani, Hariprasad Kodamana, Sayan Ranu
Abstract: Graph distillation has emerged as a promising avenue to enable scalable training of GNNs by compressing the training dataset while preserving essential graph characteristics. Our study uncovers significant shortcomings in current graph distillation techniques. First, the majority of the algorithms paradoxically require training on the full dataset to perform distillation. Second, due to their gradient-emulating approach, these methods require fresh distillation for any change in hyperparameters or GNN architecture, limiting their flexibility and reusability. Finally, they fail to achieve substantial size reduction due to synthesizing fully-connected, edge-weighted graphs. To address these challenges, we present Bonsai, a novel graph distillation method empowered by the observation that \textit{computation trees} form the fundamental processing units of message-passing GNNs. Bonsai distills datasets by encoding a careful selection of \textit{exemplar} trees that maximize the representation of all computation trees in the training set. This unique approach makes Bonsai the first linear-time, model-agnostic graph distillation algorithm for node classification that outperforms existing baselines across $6$ real-world datasets on accuracy, while being $22$ times faster on average. Bonsai is grounded in rigorous mathematical guarantees on the adopted approximation strategies, making it robust to GNN architectures, datasets, and parameters.
Authors: Sejun Park, Kihun Hong, Ganguk Hwang
Abstract: Over the past decade, there has been growing interest in collaborative learning that can enhance the AI models of multiple parties. However, it is still challenging to enhance their performance without sharing private data and models from individual parties. One recent promising approach is to develop distillation-based algorithms that exploit unlabeled public data, but the results are still unsatisfactory in both theory and practice. To tackle this problem, we rigorously analyze a representative distillation-based algorithm from the viewpoint of kernel regression. This work provides the first theoretical results proving the (nearly) minimax optimality of a nonparametric collaborative learning algorithm that does not directly share local data or models in massively distributed, statistically heterogeneous environments. Inspired by our theoretical results, we also propose a practical distillation-based collaborative learning algorithm based on a neural network architecture. Our algorithm successfully bridges the gap between our theoretical assumptions and practical settings with neural networks through feature kernel matching. We simulate various regression tasks to verify our theory and demonstrate the practical feasibility of our proposed algorithm.
Authors: Jianjun Wei, Yue Liu, Xin Huang, Xin Zhang, Wenyi Liu, Xu Yan
Abstract: This paper explores the applications and challenges of graph neural networks (GNNs) in processing the complex graph data brought about by the rapid development of the Internet. Given the heterogeneity and redundancy problems that graph data often exhibit, traditional GNN methods may be overly dependent on the initial structure and attribute information of the graph, which limits their ability to accurately model more complex relationships and patterns in the graph. Therefore, this study proposes a graph neural network model under a self-supervised learning framework, which can flexibly combine different types of additional information about the attribute graph and its nodes, so as to better mine the deep features in the graph data. The introduced self-supervision mechanism is expected to improve the adaptability of existing models to the diversity and complexity of graph data and to improve the overall performance of the model.
Authors: George Potter, Gertjan Burghouts, Joris Sijs
Abstract: Affordances enable robots to have a semantic understanding of their surroundings. This gives them greater flexibility in acting when completing a given task. Capturing object affordances in a machine learning model is a difficult task because of their dependence on contextual information. Markov Logic Networks (MLNs) combine probabilistic reasoning with logic that is able to capture such context. Mobile robots operate in partially known environments, in which unseen object affordances can be observed. This new information must be incorporated into the existing knowledge without having to retrain the MLN from scratch. We introduce the MLN Cumulative Learning Algorithm (MLN-CLA). MLN-CLA learns new relations in various knowledge domains by retaining existing knowledge and updating only the knowledge that has changed, for which the MLN is retrained. We show that MLN-CLA is effective for accumulative learning and zero-shot affordance inference, outperforming strong baselines.
Authors: Baiyuan Chen
Abstract: Attention and convolution are fundamental techniques in machine learning. While they take different approaches to learning features (attention mechanisms capture both global and local data relationships, whereas convolutional layers focus on local patterns), both methods are effective for various tasks. Although the feature learning of both models is well studied individually, there has not been a direct comparison of their feature learning dynamics. In this paper, we compare their Lipschitz continuity with respect to the Wasserstein distance and their covering numbers under similar settings. We demonstrate that attention processes data in a more compact and stable manner. Compactness refers to the lower variance and intrinsic dimensionality of the activation outputs, while stability refers to the changes between inputs and outputs. We validate our findings through experiments using topological data analysis, measuring the 1-, 2-, and infinity-Wasserstein distances between the outputs of each layer from both models. Furthermore, we extend our comparison to Vision Transformers (ViTs) and ResNets, showing that while ViTs have higher output variance, their feature learning is more stable than that of ResNets.
Authors: Isaac Symes Thompson, Alberto Caron, Chris Hicks, Vasilios Mavroudis
Abstract: A significant challenge for autonomous cyber defence is ensuring a defensive agent's ability to generalise across diverse network topologies and configurations. This capability is necessary for agents to remain effective when deployed in dynamically changing environments, such as an enterprise network where devices may frequently join and leave. Standard approaches to deep reinforcement learning, where policies are parameterised using a fixed-input multi-layer perceptron (MLP), expect fixed-size observation and action spaces. In autonomous cyber defence, this makes it hard to develop agents that generalise to environments with network topologies different from those trained on, as the number of nodes affects the natural size of the observation and action spaces. To overcome this limitation, we reframe the problem of autonomous network defence using entity-based reinforcement learning, where the observation and action space of an agent are decomposed into a collection of discrete entities. This framework enables the use of policy parameterisations specialised in compositional generalisation. Namely, we train a Transformer-based policy on the Yawning Titan cyber-security simulation environment and test its generalisation capabilities across various network topologies. We demonstrate that this approach significantly outperforms an MLP-based policy on fixed networks, and has the ability for zero-shot generalisation to networks of a different size to those seen in training. These findings highlight the potential for entity-based reinforcement learning to advance the field of autonomous cyber defence by providing more generalisable policies capable of handling variations in real-world network environments.
Authors: Jon Irureta, Jon Imaz, Aizea Lojo, Marco González, Iñigo Perona
Abstract: Vertical Federated Learning (VFL) enables collaborative model training across different participants with distinct features and common samples, while preserving data privacy. Existing VFL methodologies often struggle with realistic data partitions, typically incurring high communication costs and significant operational complexity. In this work, we introduce a novel simplified approach to VFL, Active Participant-Centric VFL (APC-VFL), that, to the best of our knowledge, is the first to require only a single communication round between participants and allows the active participant to perform inference in a non-collaborative fashion. This method integrates unsupervised representation learning with knowledge distillation to achieve accuracy comparable to traditional VFL methods based on vertical split learning in classical settings, reducing the required communication rounds by up to $4200\times$ while being more flexible. Our approach also shows improvements compared to non-federated local models, as well as to a comparable VFL proposal, VFedTrans, offering an efficient and flexible solution for collaborative learning.
Authors: Dongwen Luo
Abstract: Power grid load scheduling is a critical task that ensures the balance between electricity generation and consumption while minimizing operational costs and maintaining grid stability. Traditional optimization methods often struggle with the dynamic and stochastic nature of power systems, especially when faced with renewable energy sources and fluctuating demand. This paper proposes a reinforcement learning (RL) approach using a Markov Decision Process (MDP) framework to address the challenges of dynamic load scheduling. The MDP is defined by a state space representing grid conditions, an action space covering control operations like generator adjustments and storage management, and a reward function balancing economic efficiency and system reliability. We investigate the application of various RL algorithms, from basic Q-Learning to more advanced Deep Q-Networks (DQN) and Actor-Critic methods, to determine optimal scheduling policies. The proposed approach is evaluated through a simulated power grid environment, demonstrating its potential to improve scheduling efficiency and adapt to variable demand patterns. Our results show that the RL-based method provides a robust and scalable solution for real-time load scheduling, contributing to the efficient management of modern power grids.
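A minimal tabular Q-learning sketch on a toy storage-scheduling MDP conveys the basic recipe; the state discretization, action set, and reward terms below are illustrative assumptions, not the paper's environment.

```python
import numpy as np

# Toy load-scheduling MDP: state = discretized storage level, action = {charge,
# idle, discharge}, reward = revenue for serving high demand minus a penalty for
# buying energy and for running the storage empty during a demand peak.
n_levels, n_actions = 10, 3
Q = np.zeros((n_levels, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

def step(level, action):
    demand_high = rng.random() < 0.5
    delta = {0: +1, 1: 0, 2: -1}[int(action)]        # charge / idle / discharge
    next_level = int(np.clip(level + delta, 0, n_levels - 1))
    reward = 0.0
    if action == 2 and level > 0 and demand_high:
        reward += 1.0                                 # discharged into high demand
    if action == 0:
        reward -= 0.2                                 # cost of buying energy
    if demand_high and level == 0:
        reward -= 1.0                                 # empty storage during a peak
    return next_level, reward

level = 5
for t in range(50000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[level].argmax())
    nxt, r = step(level, a)
    # Standard Q-learning update.
    Q[level, a] += alpha * (r + gamma * Q[nxt].max() - Q[level, a])
    level = nxt

print(Q.argmax(axis=1))   # greedy action per storage level
```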
Authors: Ying Li, Zhidi Lin, Yuhao Liu, Michael Minyi Zhang, Pablo M. Olmos, Petar M. Djurić
Abstract: Random feature latent variable models (RFLVMs) represent the state of the art in latent variable models, capable of handling non-Gaussian likelihoods and effectively uncovering patterns in high-dimensional data. However, their heavy reliance on Monte Carlo sampling results in scalability issues that make it difficult to use these models for datasets with a massive number of observations. To scale up RFLVMs, we turn to the optimization-based variational Bayesian inference (VBI) algorithm, which is known for its scalability compared to sampling-based methods. However, implementing VBI for RFLVMs poses challenges, such as the lack of an explicit probability distribution function (PDF) for the Dirichlet process (DP) in the kernel learning component, and the incompatibility of existing VBI algorithms with RFLVMs. To address these issues, we introduce a stick-breaking construction for the DP to obtain an explicit PDF and a novel VBI algorithm called ``block coordinate descent variational inference" (BCD-VI). This enables the development of a scalable version of RFLVMs, or in short, SRFLVM. Our proposed method shows scalability, computational efficiency, superior performance in generating informative latent representations, and the ability to impute missing data across various real-world datasets, outperforming state-of-the-art competitors.
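For reference, the truncated stick-breaking construction that yields explicit DP mixture weights can be sketched in a few lines; this shows only the weight construction, not the BCD-VI algorithm.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking construction of DP mixture weights:
    v_k ~ Beta(1, alpha),  pi_k = v_k * prod_{j<k} (1 - v_j)."""
    v = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    pi = v * remaining
    pi[-1] = 1.0 - pi[:-1].sum()   # fold the leftover stick mass into the last atom
    return pi

rng = np.random.default_rng(0)
print(stick_breaking_weights(alpha=2.0, truncation=10, rng=rng))
```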
Authors: Elif Ceren Gok Yildirim, Murat Onur Yildirim, Joaquin Vanschoren
Abstract: Continual Learning (CL) methods usually learn from all available data. However, this is not the case in human cognition, which efficiently focuses on key experiences while disregarding redundant information. Similarly, not all data points in a dataset have equal potential; some can be more informative than others. This disparity may significantly impact performance, as both the quality and quantity of samples directly influence the model's generalizability and efficiency. Drawing inspiration from this, we explore the potential of learning from important samples and present an empirical study for evaluating coreset selection techniques in the context of CL to stimulate research in this unexplored area. We train different continual learners on increasing amounts of selected samples and investigate the learning-forgetting dynamics by shedding light on the underlying mechanisms driving their improved stability-plasticity balance. We present several significant observations: learning from selectively chosen samples (i) enhances incremental accuracy, (ii) improves knowledge retention of previous tasks, and (iii) refines learned representations. This analysis contributes to a deeper understanding of selective learning strategies in CL scenarios.
Authors: Yao Tang, Zhihui Xie, Zichuan Lin, Deheng Ye, Shuai Li
Abstract: Masked prediction has emerged as a promising pretraining paradigm in offline reinforcement learning (RL) due to its versatile masking schemes, enabling flexible inference across various downstream tasks with a unified model. Despite the versatility of masked prediction, it remains unclear how to balance the learning of skills at different levels of complexity. To address this, we propose CurrMask, a curriculum masking pretraining paradigm for sequential decision making. Motivated by how humans learn by organizing knowledge in a curriculum, CurrMask adjusts its masking scheme during pretraining to learn versatile skills. Through extensive experiments, we show that CurrMask exhibits superior zero-shot performance on skill prompting and goal-conditioned planning tasks, as well as competitive finetuning performance on offline RL tasks. Additionally, our analysis of training dynamics reveals that CurrMask gradually acquires skills of varying complexity by dynamically adjusting its masking scheme.
Authors: Salvatore Raieli, Abdulrahman Altahhan, Nathalie Jeanray, St\'ephane Gerart, Sebastien Vachenc
Abstract: Tabular datasets are widely used in scientific disciplines such as biology. While these disciplines have already adopted AI methods to enhance their findings and analysis, they mainly use tree-based methods due to their interpretability. At the same time, artificial neural networks have been shown to offer superior flexibility and depth for rich and complex non-tabular problems, but they fall behind tree-based models on tabular data in terms of performance and interpretability. Although sparsity has been shown to improve the interpretability and performance of ANN models for complex non-tabular datasets, enforcing sparsity structurally and formatively on tabular data before training the model remains an open question. To address this question, we establish a method that infuses sparsity in neural networks by utilising attention mechanisms to capture the features' importance in tabular datasets. We show that our models, Sparse TABular NET or sTAB-Net with attention mechanisms, are more effective than tree-based models, reaching the state-of-the-art on biological datasets. They further permit the extraction of insights from these datasets and achieve better performance than post-hoc methods like SHAP.
Authors: Bastian Rieck
Abstract: This overview article makes the case for how topological concepts can enrich research in machine learning. Using the Euler Characteristic Transform (ECT), a geometrical-topological invariant, as a running example, I present different use cases that result in more efficient models for analyzing point clouds, graphs, and meshes. Moreover, I outline a vision for how topological concepts could be used in the future, comprising (1) the learning of functions on topological spaces, (2) the building of hybrid models that imbue neural networks with knowledge about the topological information in data, and (3) the analysis of qualitative properties of neural networks. With current research already addressing some of these aspects, this article thus serves as an introduction and invitation to this nascent area of research.
Authors: Suraj Kumar, Soumi Chattopadhyay, Chandranath Adak
Abstract: Quality-of-Service (QoS) prediction is a critical task in the service lifecycle, enabling precise and adaptive service recommendations by anticipating performance variations over time in response to evolving network uncertainties and user preferences. However, contemporary QoS prediction methods frequently encounter data sparsity and cold-start issues, which hinder accurate QoS predictions and limit the ability to capture diverse user preferences. Additionally, these methods often assume QoS data reliability, neglecting potential credibility issues such as outliers and the presence of greysheep users and services with atypical invocation patterns. Furthermore, traditional approaches fail to leverage diverse features, including domain-specific knowledge and complex higher-order patterns, essential for accurate QoS predictions. In this paper, we introduce a real-time, trust-aware framework for temporal QoS prediction to address the aforementioned challenges, featuring an end-to-end deep architecture called the Hypergraph Convoluted Transformer Network (HCTN). HCTN combines a hypergraph structure with graph convolution over hyper-edges to effectively address high-sparsity issues by capturing complex, high-order correlations. Complementing this, the transformer network utilizes multi-head attention along with parallel 1D convolutional layers and fully connected dense blocks to capture both fine-grained and coarse-grained dynamic patterns. Additionally, our approach includes a sparsity-resilient solution for detecting greysheep users and services, incorporating their unique characteristics to improve prediction accuracy. Trained with a robust loss function resistant to outliers, HCTN demonstrated state-of-the-art performance on the large-scale WSDREAM-2 datasets for response time and throughput.
Authors: Katharina Fl\"ugel, Daniel Coquelin, Marie Weiel, Achim Streit, Markus G\"otz
Abstract: The gradients used to train neural networks are typically computed using backpropagation. While an efficient way to obtain exact gradients, backpropagation is computationally expensive, hinders parallelization, and is biologically implausible. Forward gradients are an approach to approximate the gradients from directional derivatives along random tangents computed by forward-mode automatic differentiation. So far, research has focused on using a single tangent per step. This paper provides an in-depth analysis of multi-tangent forward gradients and introduces an improved approach to combining the forward gradients from multiple tangents based on orthogonal projections. We demonstrate that increasing the number of tangents improves both approximation quality and optimization performance across various tasks.
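As a rough sketch of the forward-gradient idea with multiple tangents, the snippet below builds a gradient estimate from directional derivatives along k orthonormal tangents; finite differences stand in for forward-mode automatic differentiation, and the test objective and dimensions are illustrative assumptions rather than the paper's setup.

```python
# Multi-tangent forward gradients with orthonormal tangents (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 8                                   # parameter dimension, number of tangents
A = rng.standard_normal((d, d))
A = A @ A.T / d                                # symmetric positive semi-definite test matrix
b = rng.standard_normal(d)
f = lambda w: 0.5 * w @ A @ w - b @ w          # smooth test objective
grad_true = lambda w: A @ w - b

def directional_derivative(w, v, h=1e-5):
    # finite-difference stand-in for a forward-mode Jacobian-vector product
    return (f(w + h * v) - f(w - h * v)) / (2 * h)

w = rng.standard_normal(d)
V = np.linalg.qr(rng.standard_normal((d, k)))[0].T   # k orthonormal tangent directions (rows)
# Forward gradient: with orthonormal tangents, this is the projection of the
# exact gradient onto the span of the tangents.
g_hat = sum(directional_derivative(w, v) * v for v in V)

cos = g_hat @ grad_true(w) / (np.linalg.norm(g_hat) * np.linalg.norm(grad_true(w)))
print(f"cosine similarity to the exact gradient with {k} tangents: {cos:.3f}")
```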
Authors: Artem Basharin, Andrei Chertkov, Ivan Oseledets
Abstract: We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy. Motivated by recent work that predicts the probabilities of subsequent tokens using multiple heads, we connect this approach to rank-$1$ canonical tensor decomposition. By generalizing it to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously. This model can also be interpreted as a mixture of experts, allowing us to leverage successful techniques from that domain for efficient and robust training. Importantly, the overall overhead for training and sampling remains low. Our method demonstrates significant improvements in inference speed for both text and code generation tasks, proving particularly beneficial within the self-speculative decoding paradigm. It maintains its effectiveness across various model sizes and training epochs, highlighting its robustness and scalability.
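To make the rank-$r$ generalization concrete, the following minimal sketch contrasts a rank-1 joint distribution over two future tokens (the independent-heads special case) with a rank-$r$ mixture; the vocabulary size, rank, and factor matrices are illustrative assumptions, not trained model components.

```python
# Rank-1 vs rank-r canonical decomposition of a two-token joint distribution (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
V, r = 100, 4                                    # vocabulary size, decomposition rank

def normalize(x, axis=-1):
    return x / x.sum(axis=axis, keepdims=True)

w = normalize(rng.random(r), axis=0)             # mixture weights over r "experts"
P1 = normalize(rng.random((r, V)))               # per-expert distribution of token t+1
P2 = normalize(rng.random((r, V)))               # per-expert distribution of token t+2

joint_rank_r = np.einsum("k,ki,kj->ij", w, P1, P2)   # sum_k w_k P1_k(i) P2_k(j)
joint_rank_1 = np.outer(P1[0], P2[0])                # independent-heads special case

print(joint_rank_r.sum(), np.linalg.matrix_rank(joint_rank_r))  # ~1.0, rank r
```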
Authors: Max Staats, Matthias Thamm, Bernd Rosenow
Abstract: As large language models (LLMs) become central to AI applications, gaining a deeper understanding of their inner workings is increasingly important. In this work, we analyze the weight matrices of pretrained transformer models -- specifically BERT and Llama -- using random matrix theory (RMT) as a zero-information hypothesis. While randomly initialized weights perfectly agree with RMT predictions, deviations emerge after training, allowing us to locate learned structures within the models. We identify layer-type specific behaviors that are consistent across all blocks and architectures considered. By pinpointing regions that deviate from RMT predictions, we highlight areas of feature learning and confirm this through comparisons with the activation covariance matrices of the corresponding layers. Our method provides a diagnostic tool for identifying relevant regions in transformer weights using only the trained matrices. Additionally, we address the ongoing debate regarding the significance of small singular values in the context of fine-tuning and alignment in LLMs. Our findings reveal that, after fine-tuning, small singular values play a crucial role in the models' capabilities, suggesting that removing them in an already aligned transformer can be detrimental, as it may compromise model alignment.
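As a minimal illustration of the zero-information baseline, the sketch below compares the eigenvalue spectrum of a randomly initialized weight matrix with the Marchenko-Pastur support; the matrix shape and scaling are illustrative assumptions, and in the paper the interesting signal comes from trained matrices whose spectra deviate from this baseline.

```python
# Random-matrix baseline: spectrum of a random weight matrix vs the Marchenko-Pastur support.
import numpy as np

rng = np.random.default_rng(0)
n, m = 768, 3072                                   # an illustrative layer shape
W = rng.standard_normal((n, m)) / np.sqrt(m)       # "zero-information" initialization
eigs = np.linalg.eigvalsh(W @ W.T)                 # spectrum of the correlation matrix

q = n / m
lam_min, lam_max = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
print(f"empirical support:        [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"Marchenko-Pastur support: [{lam_min:.3f}, {lam_max:.3f}]")
# Eigenvalues of a trained matrix that fall outside this support signal learned structure.
```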
Authors: Jaris K\"uken, Lennart Purucker, Frank Hutter
Abstract: Tabular machine learning problems often require time-consuming and labor-intensive feature engineering. Recent efforts have focused on using large language models (LLMs) to capitalize on their potential domain knowledge. At the same time, researchers have observed ethically concerning negative biases in other LLM-related use cases, such as text generation. These developments motivated us to investigate whether LLMs exhibit a bias that negatively impacts the performance of feature engineering. While not ethically concerning, such a bias could hinder practitioners from fully utilizing LLMs for automated data science. Therefore, we propose a method to detect potential biases by detecting anomalies in the frequency of operators (e.g., adding two features) suggested by LLMs when engineering new features. Our experiments evaluate the bias of four LLMs, two big frontier and two small open-source models, across 27 tabular datasets. Our results indicate that LLMs are biased toward simple operators, such as addition, and can fail to utilize more complex operators, such as grouping followed by aggregations. Furthermore, the bias can negatively impact the predictive performance when using LLM-generated features. Our results call for mitigating bias when using LLMs for feature engineering.
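A minimal version of the operator-frequency check could look like the sketch below, which counts the operators appearing in LLM-suggested features and flags those whose usage deviates strongly from a uniform baseline; the suggested features, operator set, and thresholds are illustrative assumptions rather than the paper's detection procedure.

```python
# Toy operator-frequency anomaly check for LLM-suggested features (illustrative assumptions only).
from collections import Counter

suggested = ["a + b", "a + c", "b + d", "a - b", "a + d",
             "c + d", "a * b", "a + e", "b + c", "a + f"]

def operator_of(expr: str) -> str:
    for op in ("groupby", "+", "-", "*", "/"):
        if op in expr:
            return op
    return "other"

counts = Counter(operator_of(s) for s in suggested)
ops = ["+", "-", "*", "/", "groupby"]
expected = len(suggested) / len(ops)               # uniform-usage baseline
for op in ops:
    ratio = counts.get(op, 0) / expected
    flag = "over-used" if ratio > 2 else ("under-used" if ratio < 0.5 else "ok")
    print(f"{op:8s} count={counts.get(op, 0):2d} ratio={ratio:.2f} {flag}")
```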
Authors: Charuka Herath, Xiaolan Liu, Sangarapillai Lambotharan, Yogachandran Rahulamathavan
Abstract: Federated Learning (FL) is a decentralized approach for collaborative model training on edge devices. This distributed method of model training offers advantages in privacy, security, regulatory compliance, and cost-efficiency. Our emphasis in this research lies in addressing statistical complexity in FL, especially when the data stored locally across devices is not identically and independently distributed (non-IID). We have observed an accuracy reduction of approximately 10\% to 30\%, particularly in skewed scenarios where each edge device trains with only 1 class of data. This reduction is attributed to weight divergence, quantified using the Euclidean distance between device-level class distributions and the population distribution, resulting in a bias term (\(\delta_k\)). As a solution, we present a method to improve convergence in FL by creating a global subset of data on the server and dynamically distributing it across devices using a Dynamic Data queue-driven Federated Learning (DDFL) scheme. Next, we leverage data entropy metrics to monitor the process during each training round and enable reasonable device selection for aggregation. Furthermore, we provide a convergence analysis of the proposed DDFL to justify its viability in practical FL scenarios, aiming for better device selection, a global model that avoids sub-optimality, and faster convergence. We observe that our approach results in a substantial accuracy boost of approximately 5\% for the MNIST dataset, around 18\% for CIFAR-10, and 20\% for CIFAR-100 with a 10\% global subset of data, outperforming the state-of-the-art (SOTA) aggregation algorithms.
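As a rough illustration of the data-entropy view of non-IID clients used for device selection, the sketch below computes the Shannon entropy of each device's label distribution; the class counts and the interpretation rule are illustrative assumptions and do not reproduce the exact DDFL selection mechanism.

```python
# Shannon entropy of per-device label distributions as a non-IID severity proxy (illustrative sketch).
import numpy as np

def label_entropy(class_counts):
    p = np.asarray(class_counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

devices = {
    "device_A": [600, 0, 0, 0, 0, 0, 0, 0, 0, 0],      # single-class device (worst case)
    "device_B": [100, 80, 120, 90, 110, 95, 105, 85, 115, 100],
    "device_C": [300, 250, 50, 0, 0, 0, 0, 0, 0, 0],
}
for name, counts in devices.items():
    print(f"{name}: entropy = {label_entropy(counts):.2f} bits "
          f"(max {np.log2(10):.2f} for a balanced 10-class split)")
```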
Authors: Tin Sum Cheng, Aurelien Lucchi, Anastasis Kratsios, David Belius
Abstract: This paper conducts a comprehensive study of the learning curves of kernel ridge regression (KRR) under minimal assumptions. Our contributions are three-fold: 1) we analyze the role of key properties of the kernel, such as its spectral eigen-decay, the characteristics of the eigenfunctions, and the smoothness of the kernel; 2) we demonstrate the validity of the Gaussian Equivalent Property (GEP), which states that the generalization performance of KRR remains the same when the whitened features are replaced by standard Gaussian vectors, thereby shedding light on the success of previous analyses under the Gaussian Design Assumption; 3) we derive novel bounds that improve over existing bounds across a broad range of settings, such as (in)dependent feature vectors and various combinations of eigen-decay rates in the over/underparameterized regimes.
Authors: Kai Liu, Kang You, Pan Gao, Manoranjan Paul
Abstract: With the great progress of 3D sensing and acquisition technology, the volume of point cloud data has grown dramatically, which urges the development of efficient point cloud compression methods. In this paper, we focus on the task of learned lossy point cloud attribute compression (PCAC). We propose an efficient attention-based method for lossy compression of point cloud attributes leveraging an autoencoder architecture. Specifically, at the encoding side, we conduct multiple downsampling steps to best exploit the local attribute patterns, in which an effective External Cross Attention (ECA) module is devised to hierarchically aggregate features by integrating attribute and geometry contexts. At the decoding side, the attributes of the point cloud are progressively reconstructed based on the multi-scale representation and the zero-padding upsampling tactic. To the best of our knowledge, this is the first approach to introduce an attention mechanism to the point-based lossy PCAC task. We verify the compression efficiency of our model on various sequences, including human body frames, sparse objects, and large-scale point cloud scenes. Experiments show that our method achieves an average improvement of 1.15 dB and 2.13 dB in BD-PSNR for the Y channel and YUV channels, respectively, compared with the state-of-the-art point-based method Deep-PCAC. Code for this paper is available at https://github.com/I2-Multimedia-Lab/Att2CPC.
Authors: Tianyuan Jin, Keke Huang, Jing Tang, Xiaokui Xiao
Abstract: This paper studies two variants of the best arm identification (BAI) problem under the streaming model, where we have a stream of $n$ arms with reward distributions supported on $[0,1]$ with unknown means. The arms in the stream arrive one by one, and the algorithm cannot access an arm unless it is stored in a limited-size memory. We first study the streaming $\epsilon$-top-$k$ arms identification problem, which asks for $k$ arms whose reward means are lower than that of the $k$-th best arm by at most $\epsilon$ with probability at least $1-\delta$. For general $\epsilon \in (0,1)$, the existing solution for this problem assumes $k = 1$ and achieves the optimal sample complexity $O(\frac{n}{\epsilon^2} \log \frac{1}{\delta})$ using $O(\log^*(n))$ memory (where $\log^*(n)$ equals the number of times that we need to apply the logarithm function to $n$ before the result is no more than 1) and a single pass of the stream. We propose an algorithm that works for any $k$ and achieves the optimal sample complexity $O(\frac{n}{\epsilon^2} \log\frac{k}{\delta})$ using a single-arm memory and a single pass of the stream. Second, we study the streaming BAI problem, where the objective is to identify the arm with the maximum reward mean with probability at least $1-\delta$, using a single-arm memory and as few passes of the input stream as possible. We present a single-arm-memory algorithm that achieves a near instance-dependent optimal sample complexity within $O(\log \Delta_2^{-1})$ passes, where $\Delta_2$ is the gap between the mean of the best arm and that of the second best arm.
Authors: Ferdi Kossmann, Bruce Fontaine, Daya Khudia, Michael Cafarella, Samuel Madden
Abstract: Serving systems for Large Language Models (LLMs) improve throughput by processing several requests concurrently. However, multiplexing hardware resources between concurrent requests involves non-trivial scheduling decisions. Practical serving systems typically implement these decisions at two levels: First, a load balancer routes requests to different servers which each hold a replica of the LLM. Then, on each server, an engine-level scheduler decides when to run a request, or when to queue or preempt it. Improved scheduling policies may benefit a wide range of LLM deployments and can often be implemented as "drop-in replacements" to a system's current policy. In this work, we survey scheduling techniques from the literature and from practical serving systems. We find that schedulers from the literature often achieve good performance but introduce significant complexity. In contrast, schedulers in practical deployments often leave easy performance gains on the table but are easy to implement, deploy and configure. This finding motivates us to introduce two new scheduling techniques, which are both easy to implement, and outperform current techniques on production workload traces.
Authors: K. Darshana Abeyrathna, Sara El Mekkaoui, Andreas Hafver, Christian Agrell
Abstract: Tsetlin Machines (TMs) have emerged as a compelling alternative to conventional deep learning methods, offering notable advantages such as smaller memory footprint, faster inference, fault-tolerant properties, and interpretability. Although various adaptations of TMs have expanded their applicability across diverse domains, a fundamental gap remains in understanding how TMs quantify uncertainty in their predictions. In response, this paper introduces the Probabilistic Tsetlin Machine (PTM) framework, aimed at providing a robust, reliable, and interpretable approach for uncertainty quantification. Unlike the original TM, the PTM learns the probability of staying on each state of each Tsetlin Automaton (TA) across all clauses. These probabilities are updated using the feedback tables that are part of the TM framework: Type I and Type II feedback. During inference, TAs decide their actions by sampling states based on learned probability distributions, akin to Bayesian neural networks when generating weight values. In our experimental analysis, we first illustrate the spread of the probabilities across TA states for the noisy-XOR dataset. Then we evaluate the PTM alongside benchmark models using both simulated and real-world datasets. The experiments on the simulated dataset reveal the PTM's effectiveness in uncertainty quantification, particularly in delineating decision boundaries and identifying regions of high uncertainty. Moreover, when applied to multiclass classification tasks using the Iris dataset, the PTM demonstrates competitive performance in terms of predictive entropy and expected calibration error, showcasing its potential as a reliable tool for uncertainty estimation. Our findings underscore the importance of selecting appropriate models for accurate uncertainty quantification in predictive tasks, with the PTM offering a particularly interpretable and effective solution.
Authors: Flavio S. Correa da Silva, Simon Sawhney
Abstract: Acute kidney injury (AKI) is a serious clinical condition that affects up to 20% of hospitalised patients. AKI is associated with short-term unplanned hospital readmission and post-discharge mortality risk. Patient risk and healthcare expenditures can be minimised by follow-up planning grounded in predictive models and machine learning. Since AKI is multi-factorial, predictive models specialised in different categories of patients can increase the accuracy of predictions. In this article we report some results following this approach.
Authors: Ahmed A. Elhag, T. Konstantin Rusch, Francesco Di Giovanni, Michael Bronstein
Abstract: Incorporating equivariance as an inductive bias into deep learning architectures to take advantage of data symmetry has been successful in multiple applications, such as chemistry and dynamical systems. In particular, roto-translations are crucial for effectively modeling geometric graphs and molecules, where understanding the 3D structures enhances generalization. However, equivariant models often pose challenges due to their high computational complexity. In this paper, we introduce REMUL, a training procedure for approximating equivariance with multitask learning. We show that unconstrained models (which do not build equivariance into the architecture) can learn approximate symmetries by minimizing an additional simple equivariance loss. By formulating equivariance as a new learning objective, we can control the level of approximate equivariance in the model. Our method achieves competitive performance compared to equivariant baselines while being $10\times$ faster at inference and $2.5\times$ faster at training.
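To make the idea of an auxiliary equivariance objective concrete, the sketch below penalizes how far an unconstrained linear map is from commuting with a random 2D rotation; the model, data, and loss weighting are illustrative assumptions, not the REMUL architecture or training setup.

```python
# Auxiliary equivariance penalty for 2D rotations on an unconstrained model (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))                     # unconstrained "model" parameters
f = lambda X: X @ W.T                               # rows of X are inputs

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

X = rng.standard_normal((128, 2))
R = rotation(rng.uniform(0, 2 * np.pi))

task_loss = np.mean((f(X) - X) ** 2)                  # placeholder task objective
equiv_loss = np.mean((f(X @ R.T) - f(X) @ R.T) ** 2)  # || f(Rx) - R f(x) ||^2
lam = 1.0                                             # weight controlling the degree of approximate equivariance
total = task_loss + lam * equiv_loss
print(f"task={task_loss:.3f}  equivariance penalty={equiv_loss:.3f}  total={total:.3f}")
```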
Authors: Yehonathan Refael, Jonathan Svirsky, Boris Shustin, Wasim Huleihel, Ofir Lindenbaum
Abstract: Training and fine-tuning large language models (LLMs) come with challenges related to memory and computational requirements due to the increasing size of the model weights and the optimizer states. Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA), which involves introducing a parallel trainable low-rank matrix to the fixed pre-trained weights at each layer. However, these methods often fall short compared to the full-rank weight training approach, as they restrict the parameter search to a low-rank subspace. This limitation can disrupt training dynamics and require a full-rank warm start to mitigate the impact. In this paper, we introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated layer gradients gradually decreases and asymptotically approaches rank one. Leveraging this, our approach adaptively reduces the rank of the gradients during Adam optimization steps, using an efficient online-updating low-rank projection rule. We further present a randomized SVD scheme for efficiently finding the projection matrix. Our technique enables full-parameter fine-tuning with adaptive low-rank gradient updates, significantly reducing overall memory requirements during training compared to state-of-the-art methods while improving model performance in both pretraining and fine-tuning. Finally, we provide a convergence analysis of our method and demonstrate its merits for training and fine-tuning language and biological foundation models.
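As a rough sketch of low-rank gradient projection with a randomized SVD, the snippet below compresses a layer gradient to rank k before the optimizer state would be updated and maps it back for the weight update; the rank, shapes, and update rule are illustrative assumptions rather than the paper's exact algorithm.

```python
# Low-rank gradient projection via a randomized range finder (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def randomized_range(G, k, oversample=5):
    """Approximate an orthonormal basis for the top-k column space of G."""
    Omega = rng.standard_normal((G.shape[1], k + oversample))
    Q, _ = np.linalg.qr(G @ Omega)
    return Q[:, :k]                                 # (m, k) projection basis

m, n, k = 512, 512, 8
W = rng.standard_normal((m, n)) * 0.02
G = rng.standard_normal((m, n))                     # stand-in for a layer gradient

P = randomized_range(G, k)                          # low-rank projection basis
G_low = P.T @ G                                     # (k, n): compressed gradient
# Optimizer state (e.g. Adam moments) would be kept at this (k, n) size instead of (m, n).
W -= 1e-3 * (P @ G_low)                             # project back for the weight update

err = np.linalg.norm(G - P @ G_low) / np.linalg.norm(G)
print(f"relative error of the rank-{k} gradient approximation: {err:.3f}")
```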
Authors: Congxi Zhang, Yongchun Xie
Abstract: Learning identifiable representations and models from low-level observations is useful for an intelligent spacecraft to reliably complete downstream tasks. For temporal observations, to ensure that the data generating process is provably inverted, most existing works either assume that the noise variables in the dynamic mechanisms are (conditionally) independent, or require interventions that can directly affect each latent variable. However, in practice, the relationship between the exogenous inputs/interventions and the latent variables may follow some complex deterministic mechanisms. In this work, we study the problem of identifiable representation and model learning for latent dynamic systems. The key idea is that we use an inductive bias inspired by controllable canonical forms, which is invariant, sparse, and input dependent by definition. We prove that, for linear or affine nonlinear latent dynamic systems, it is possible to identify the representations up to scaling and determine the models up to some simple transformations. The results have the potential to provide theoretical guarantees for developing more trustworthy decision-making and control methods for intelligent spacecraft.
Authors: Axel Brunnbauer, Julian Lemmel, Zahra Babaiee, Sophie Neubauer, Radu Grosu
Abstract: Reinforcement learning algorithms for mean-field games offer a scalable framework for optimizing policies in large populations of interacting agents. Existing methods often depend on online interactions or access to system dynamics, limiting their practicality in real-world scenarios where such interactions are infeasible or difficult to model. In this paper, we present Offline Munchausen Mirror Descent (Off-MMD), a novel mean-field RL algorithm that approximates equilibrium policies in mean-field games using purely offline data. By leveraging iterative mirror descent and importance sampling techniques, Off-MMD estimates the mean-field distribution from static datasets without relying on simulation or environment dynamics. Additionally, we incorporate techniques from offline reinforcement learning to address common issues like Q-value overestimation, ensuring robust policy learning even with limited data coverage. Our algorithm scales to complex environments and demonstrates strong performance on benchmark tasks like crowd exploration or navigation, highlighting its applicability to real-world multi-agent systems where online experimentation is infeasible. We empirically demonstrate the robustness of Off-MMD to low-quality datasets and conduct experiments to investigate its sensitivity to hyperparameter choices.
Authors: Philip Amortila, Dylan J. Foster, Nan Jiang, Akshay Krishnamurthy, Zakaria Mhammedi
Abstract: Real-world applications of reinforcement learning often involve environments where agents operate on complex, high-dimensional observations, but the underlying (''latent'') dynamics are comparatively simple. However, outside of restrictive settings such as small latent spaces, the fundamental statistical requirements and algorithmic principles for reinforcement learning under latent dynamics are poorly understood. This paper addresses the question of reinforcement learning under $\textit{general}$ latent dynamics from a statistical and algorithmic perspective. On the statistical side, our main negative result shows that most well-studied settings for reinforcement learning with function approximation become intractable when composed with rich observations; we complement this with a positive result, identifying latent pushforward coverability as a general condition that enables statistical tractability. Algorithmically, we develop provably efficient observable-to-latent reductions -- that is, reductions that transform an arbitrary algorithm for the latent MDP into an algorithm that can operate on rich observations -- in two settings: one where the agent has access to hindsight observations of the latent dynamics [LADZ23], and one where the agent can estimate self-predictive latent models [SAGHCB20]. Together, our results serve as a first step toward a unified statistical and algorithmic theory for reinforcement learning under latent dynamics.
Authors: Caroline Tatsuoka, Dongbin Xiu
Abstract: We present a deep learning framework for correcting existing dynamical system models utilizing only a scarce high-fidelity data set. In many practical situations, one has a low-fidelity model that can capture the dynamics reasonably well but lacks high resolution, due to the inherent limitations of the model and the complexity of the underlying physics. When high-resolution data become available, it is natural to seek model correction to improve the resolution of the model predictions. We focus on the case when the amount of high-fidelity data is so small that most of the existing data-driven modeling methods cannot be applied. In this paper, we address these challenges with a model-correction method that only requires a scarce high-fidelity data set. Our method first seeks a deep neural network (DNN) model to approximate the existing low-fidelity model. By using the scarce high-fidelity data, the method then corrects the DNN model via transfer learning (TL). After TL, an improved DNN model with high prediction accuracy for the underlying dynamics is obtained. One distinct feature of the proposed method is that it does not assume a specific form of the model correction terms. Instead, it offers an inherent correction to the low-fidelity model via TL. A set of numerical examples is presented to demonstrate the effectiveness of the proposed method.
Authors: Elizaveta Surzhikova, Jonny Proppe
Abstract: A growing number of research areas rely on machine learning methods to accelerate discovery while saving resources. Machine learning models, however, usually require large datasets of experimental or computational results, which in certain fields, such as (bio)chemistry, materials science, or medicine, are rarely available and often prohibitively expensive to obtain. To bypass that obstacle, active learning methods are employed to develop machine learning models with a desired performance while requiring the fewest possible computational or experimental results from the domain of application. For this purpose, the model's knowledge about certain regions of the application domain is estimated to guide the choice of the model's training set. Although active learning is widely studied for classification problems (discrete outcomes), comparatively few works handle this method for regression problems (continuous outcomes). In this work, we present our Python package regAL, which allows users to evaluate different active learning strategies for regression problems. With a minimal input of just the dataset in question, but many additional customization and insight options, this package is intended for anyone who aims to perform and understand active learning in their problem-specific scope.
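For orientation, a generic active-learning loop for regression might look like the sketch below, which queries the pool point where a bootstrap committee disagrees most; the data, committee construction, and acquisition rule are illustrative assumptions and do not reflect regAL's actual API.

```python
# Generic active-learning-for-regression loop with a bootstrap committee (not regAL's API).
import numpy as np

rng = np.random.default_rng(0)
X_pool = np.linspace(-3, 3, 300).reshape(-1, 1)
y_pool = np.sin(2 * X_pool[:, 0]) + 0.1 * rng.standard_normal(300)

def fit_poly_committee(X, y, degree=3, n_members=10):
    """Bootstrap committee of polynomial regressors."""
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), len(X))
        members.append(np.polyfit(X[idx, 0], y[idx], degree))
    return members

labeled = list(rng.choice(len(X_pool), size=8, replace=False))
for round_ in range(5):
    committee = fit_poly_committee(X_pool[labeled], y_pool[labeled])
    preds = np.stack([np.polyval(c, X_pool[:, 0]) for c in committee])
    score = preds.var(axis=0)              # committee disagreement as the acquisition score
    score[labeled] = -np.inf               # never re-query already-labeled points
    new = int(np.argmax(score))
    labeled.append(new)
    print(f"round {round_}: queried x = {X_pool[new, 0]:+.2f}, labeled set size = {len(labeled)}")
```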
Authors: Rui Sun, Zhipeng Wang, Hengrui Zhang, Ming Jiang, Yizhe Wen, Jiqun Zhang, Jiahao Sun, Shuoying Zhang, Erwu Liu, Kezhi Li
Abstract: One of the biggest challenges in building artificial intelligence (AI) models in healthcare is data sharing. Since healthcare data is private, sensitive, and heterogeneous, collecting sufficient data for modelling is exhausting, costly, and sometimes impossible. In this paper, we propose a framework for global healthcare modelling using datasets from multiple continents (Europe, North America, and Asia) without sharing the local datasets, and choose glucose management as a study model to verify its effectiveness. Technically, blockchain-enabled federated learning is implemented and adapted to meet the privacy and safety requirements of healthcare data, while rewarding honest participation and penalizing malicious activities through its on-chain incentive mechanism. Experimental results show that the proposed framework is effective, efficient, and privacy-preserving. Its prediction accuracy is much better than that of models trained on limited personal data and is similar to, and even slightly better than, the results from a centralized dataset. This work paves the way for international collaborations on healthcare projects, where additional data is crucial for reducing bias and providing benefits to humanity.
Authors: Li Sun, Zhenhao Huang, Qiqi Wan, Hao Peng, Philip S. Yu
Abstract: Graph neural networks (GNNs) have become the dominant solution for learning on graphs, the typical non-Euclidean structures. Conventional GNNs, constructed with Artificial Neural Networks (ANNs), have achieved impressive performance at the cost of high computation and energy consumption. In parallel, spiking GNNs with brain-like spiking neurons are drawing increasing research attention owing to their energy efficiency. So far, existing spiking GNNs consider graphs in Euclidean space, ignoring the structural geometry, and suffer from high latency due to Back-Propagation-Through-Time (BPTT) with the surrogate gradient. In light of the aforementioned issues, we are devoted to exploring spiking GNNs on Riemannian manifolds, and present a Manifold-valued Spiking GNN (MSG). In particular, we design a new spiking neuron on geodesically complete manifolds with the diffeomorphism, so that BPTT regarding the spikes is replaced by the proposed differentiation via manifold. Theoretically, we show that MSG approximates a solver of the manifold ordinary differential equation. Extensive experiments on common graphs show that the proposed MSG achieves superior performance to previous spiking GNNs and better energy efficiency than conventional GNNs.
Authors: Diego Marcondes, Ulisses Braga-Neto
Abstract: We propose generalized resubstitution error estimators for regression, a broad family of estimators, each corresponding to a choice of empirical probability measures and loss function. The usual sum of squares criterion is a special case corresponding to the standard empirical probability measure and the quadratic loss. Other choices of empirical probability measure lead to more general estimators with superior bias and variance properties. We prove that these error estimators are consistent under broad assumptions. In addition, procedures for choosing the empirical measure based on the method of moments and maximum pseudo-likelihood are proposed and investigated. Detailed experimental results using polynomial regression demonstrate empirically the superior finite-sample bias and variance properties of the proposed estimators. The R code for the experiments is provided.
Authors: Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li
Abstract: In this paper, we propose MCUBERT to enable language models like BERT on tiny microcontroller units (MCUs) through network and scheduling co-optimization. We observe that the embedding table contributes the major storage bottleneck for tiny BERT models. Hence, at the network level, we propose an MCU-aware two-stage neural architecture search algorithm based on clustered low-rank approximation for embedding compression. To reduce the inference memory requirements, we further propose a novel fine-grained MCU-friendly scheduling strategy. Through careful computation tiling and re-ordering as well as kernel design, we drastically increase the input sequence lengths supported on MCUs without any latency or accuracy penalty. MCUBERT reduces the parameter size of BERT-tiny and BERT-mini by 5.7$\times$ and 3.0$\times$ and the execution memory by 3.5$\times$ and 4.3$\times$, respectively. MCUBERT also achieves a 1.5$\times$ latency reduction. For the first time, MCUBERT enables running lightweight BERT models on commodity MCUs and processing more than 512 tokens with less than 256KB of memory.
Authors: Riccardo Salami, Pietro Buzzega, Matteo Mosconi, Jacopo Bonato, Luigi Sabetta, Simone Calderara
Abstract: Model merging has emerged as a crucial technique in Deep Learning, enabling the integration of multiple models into a unified system while preserving performance and scalability. In this respect, the compositional properties of low-rank adaptation techniques (e.g., LoRA) have proven beneficial, as simply averaging LoRA modules yields a single model that mostly integrates the capabilities of all individual modules. Building on LoRA, we take a step further by imposing that the merged model matches the responses of all learned modules. Solving this objective in closed form yields an indeterminate system with A and B as unknown variables, indicating the existence of infinitely many closed-form solutions. To address this challenge, we introduce LoRM, an alternating optimization strategy that trains one LoRA matrix at a time. This allows solving for each unknown variable individually, thus finding a unique solution. We apply our proposed methodology to Federated Class-Incremental Learning (FCIL), ensuring alignment of model responses both between clients and across tasks. Our method demonstrates state-of-the-art performance across a range of FCIL scenarios.
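To see why alternating over the LoRA factors resolves the indeterminacy, note that with one factor fixed the other is the solution of an ordinary least-squares problem; the minimal sketch below alternates closed-form updates until the product $BA$ matches a target update. Shapes, rank, and the target are illustrative assumptions, not the LoRM training procedure.

```python
# Alternating least-squares updates of LoRA factors B and A against a target update (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4
Delta = rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in))  # target (rank-r) update

B = rng.standard_normal((d_out, r)) * 0.01
A = rng.standard_normal((r, d_in)) * 0.01
for it in range(20):
    B = Delta @ np.linalg.pinv(A)      # fix A, solve min_B ||B A - Delta||_F
    A = np.linalg.pinv(B) @ Delta      # fix B, solve min_A ||B A - Delta||_F
    if it % 5 == 0:
        err = np.linalg.norm(B @ A - Delta) / np.linalg.norm(Delta)
        print(f"iter {it:2d}: relative residual = {err:.2e}")
```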
Authors: Imad Bouhou, Stefano Fortunati, Leila Gharsalli, Alexandre Renaux
Abstract: The joint detection and tracking of a moving target embedded in an unknown disturbance represents a key feature that motivates the development of the cognitive radar paradigm. Building upon recent advancements in robust target detection with multiple-input multiple-output (MIMO) radars, this work explores the application of a Partially Observable Markov Decision Process (POMDP) framework to enhance the tracking and detection tasks in a statistically unknown environment. In the POMDP setup, the radar system is considered as an intelligent agent that continuously senses the surrounding environment, optimizing its actions to maximize the probability of detection $(P_D)$ and improve the target position and velocity estimation, all while keeping a constant probability of false alarm $(P_{FA})$. The proposed approach employs an online algorithm that does not require any a priori knowledge of the noise statistics, and it relies on a much more general observation model than the traditional range-azimuth-elevation model employed by conventional tracking algorithms. Simulation results clearly show a substantial performance improvement of the POMDP-based algorithm compared to the State-Action-Reward-State-Action (SARSA)-based one that has been recently investigated in the context of massive MIMO (MMIMO) radar systems.
Authors: Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, Rameswar Panda
Abstract: The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases, to account for token order. But current methods using these still face length generalisation challenges. We propose an alternative attention mechanism based on the stick-breaking process: for each token before the current, we determine a break point $\beta_{i,j}$, which represents the proportion of the remaining stick to allocate to the current token. We repeat the process until the stick is fully allocated, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing (Shen et al., 2017). We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. We then discuss the implementation of numerically stable stick-breaking attention and adapt Flash Attention to accommodate this mechanism. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods on length generalisation and downstream tasks. In particular, stick-breaking attention allows a model trained with a $2^{11}$ context window to perform well at $2^{14}$, with perplexity improvements.
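As a rough illustration of the stick-breaking weights for a single query, the sketch below turns attention logits into break points with a sigmoid and allocates the stick starting from the most recent token, which builds in the recency bias mentioned above; the logits and the ordering convention are illustrative assumptions rather than the paper's exact formulation.

```python
# Stick-breaking attention weights for a single query (illustrative sketch).
import numpy as np

def stick_breaking_weights(logits):
    """logits[j] scores key j (j = 0 .. t-1, in temporal order) for the current query."""
    beta = 1.0 / (1.0 + np.exp(-logits))       # break proportions in (0, 1)
    weights = np.zeros_like(beta)
    remaining = 1.0
    for j in reversed(range(len(beta))):       # allocate from the most recent token backwards
        weights[j] = beta[j] * remaining
        remaining *= 1.0 - beta[j]
    return weights                             # sums to at most 1; no softmax needed

logits = np.array([0.5, 0.5, 0.5, 0.5, 0.5])   # equal scores for all past tokens
print(np.round(stick_breaking_weights(logits), 3))  # more recent tokens receive more mass
```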
Authors: Zhaomin Wu, Junyi Hou, Yiqun Diao, Bingsheng He
Abstract: Federated Learning (FL) is an evolving paradigm that enables multiple parties to collaboratively train models without sharing raw data. Among its variants, Vertical Federated Learning (VFL) is particularly relevant in real-world, cross-organizational collaborations, where distinct features of a shared instance group are contributed by different parties. In these scenarios, parties are often linked using fuzzy identifiers, leading to a common practice termed multi-party fuzzy VFL. Existing models generally address either multi-party VFL or fuzzy VFL between two parties. Extending these models to practical multi-party fuzzy VFL typically results in significant performance degradation and increased costs for maintaining privacy. To overcome these limitations, we introduce the Federated Transformer (FeT), a novel framework that supports multi-party VFL with fuzzy identifiers. FeT innovatively encodes these identifiers into data representations and employs a transformer architecture distributed across different parties, incorporating three new techniques to enhance performance. Furthermore, we have developed a multi-party privacy framework for VFL that integrates differential privacy with secure multi-party computation, effectively protecting local representations while minimizing associated utility costs. Our experiments demonstrate that the FeT surpasses the baseline models by up to 46\% in terms of accuracy when scaled to 50 parties. Additionally, in two-party fuzzy VFL settings, FeT also shows improved performance and privacy over cutting-edge VFL models.
Authors: Chanwoo Chun, SueYeon Chung, Daniel D. Lee
Abstract: Analyzing the structure of sampled features from an input data distribution is challenging when constrained by limited measurements in both the number of inputs and features. Traditional approaches often rely on the eigenvalue spectrum of the sample covariance matrix derived from finite measurement matrices; however, these spectra are sensitive to the size of the measurement matrix, leading to biased insights. In this paper, we introduce a novel algorithm that provides unbiased estimates of the spectral moments of the kernel integral operator in the limit of infinite inputs and features from finitely sampled measurement matrices. Our method, based upon dynamic programming, is efficient and capable of estimating the moments of the operator spectrum. We demonstrate the accuracy of our estimator on radial basis function (RBF) kernels, highlighting its consistency with the theoretical spectra. Furthermore, we showcase the practical utility and robustness of our method in understanding the geometry of learned representations in neural networks.
Authors: Elise \"Ozalp, Luca Magri
Abstract: The data-driven learning of solutions of partial differential equations can be based on a divide-and-conquer strategy. First, the high dimensional data is compressed to a latent space with an autoencoder; and, second, the temporal dynamics are inferred on the latent space with a form of recurrent neural network. In chaotic systems and turbulence, convolutional autoencoders and echo state networks (CAE-ESN) successfully forecast the dynamics, but little is known about whether the stability properties can also be inferred. We show that the CAE-ESN model infers the invariant stability properties and the geometry of the tangent space in the low-dimensional manifold (i.e. the latent space) through Lyapunov exponents and covariant Lyapunov vectors. This work opens up new opportunities for inferring the stability of high-dimensional chaotic systems in latent spaces.
Authors: Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar
Abstract: Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. Hybrid batching works well for linear operations as it amortizes the cost of loading model weights from HBM. However, attention computation in hybrid batches remains inefficient because existing attention kernels are optimized for either prefill or decode. In this paper, we present POD-Attention -- the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources such that prefill and decode operations happen concurrently on the same multiprocessor. We integrate POD-Attention in a state-of-the-art LLM inference scheduler Sarathi-Serve. POD-Attention speeds up attention computation by up to 75% (mean 28%) and increases LLM serving throughput by up to 22% in offline inference. In online inference, POD-Attention enables lower time-to-first-token (TTFT), time-between-tokens (TBT), and request execution latency versus Sarathi-Serve.
Authors: Raman Ebrahimi, Kristen Vaccaro, Parinaz Naghizadeh
Abstract: When humans are subject to an algorithmic decision system, they can strategically adjust their behavior accordingly (``game'' the system). While a growing line of literature on strategic classification has used game-theoretic modeling to understand and mitigate such gaming, these existing works consider standard models of fully rational agents. In this paper, we propose a strategic classification model that considers behavioral biases in human responses to algorithms. We show how misperceptions of a classifier (specifically, of its feature weights) can lead to different types of discrepancies between biased and rational agents' responses, and identify when behavioral agents over- or under-invest in different features. We also show that strategic agents with behavioral biases can benefit or (perhaps, unexpectedly) harm the firm compared to fully rational strategic agents. We complement our analytical results with user studies, which support our hypothesis of behavioral biases in human responses to the algorithm. Together, our findings highlight the need to account for human (cognitive) biases when designing AI systems, and when providing explanations of them, to strategic humans in the loop.
Authors: Valeria Ruscio, Fabrizio Silvestri
Abstract: Rotary Positional Embeddings (RoPE) enhance positional encoding in Transformer models, yet their full impact on model dynamics remains underexplored. This paper studies how RoPE introduces position-dependent rotations, causing phase shifts in token embeddings that influence higher-frequency components within the model's internal representations. Through spectral analysis, we demonstrate that RoPE's rotation matrices induce oscillatory behaviors in embeddings, affecting information retention across layers and shaping temporal modeling capabilities. We show that activation functions in feed-forward networks interact with RoPE-modulated embeddings to generate harmonics, leading to constructive or destructive interference based on phase alignment. Our findings reveal that phase alignment amplifies activations and sharpens attention, while misalignment weakens activations and disrupts focus on positional patterns. This study underscores the importance of frequency components as intrinsic elements of model behavior, offering new insights beyond traditional analyses.
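For readers who want to see the rotation concretely, the sketch below applies the RoPE rotation to one 2D feature pair of a query and a key and shows that their inner product depends only on the relative offset; the rotation frequency and the vectors are illustrative assumptions.

```python
# RoPE rotation on a single 2D feature pair (illustrative sketch).
import numpy as np

def rope_rotate(vec2, position, theta=0.1):
    """Rotate a 2D feature pair by the position-dependent angle position * theta."""
    angle = position * theta
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    return R @ vec2

q, k = np.array([1.0, 0.0]), np.array([1.0, 0.0])
for m, n in [(0, 0), (5, 0), (10, 5), (20, 15)]:
    score = rope_rotate(q, m) @ rope_rotate(k, n)
    print(f"positions ({m:2d},{n:2d})  relative offset {m - n:2d}  score {score:.3f}")
```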
Authors: Luran Wang, Chaoran Cheng, Yizhen Liao, Yanru Qu, Ge Liu
Abstract: Controlled generation with pre-trained Diffusion and Flow Matching models has vast applications. One strategy for guiding ODE-based generative models is to optimize a target loss $R(x_1)$ while staying close to the prior distribution. Along this line, some recent work showed the effectiveness of guiding flow models by differentiating through the ODE sampling process. Despite the superior performance, the theoretical understanding of this line of methods is still preliminary, leaving space for algorithm improvement. Moreover, existing methods predominantly focus on Euclidean data manifolds, and there is a compelling need for guided flow methods on complex geometries such as SO(3), which prevails in high-stake scientific applications like protein design. We present OC-Flow, a general and theoretically grounded training-free framework for guided flow matching using optimal control. Building upon advances in optimal control theory, we develop effective and practical algorithms for solving optimal control in guided ODE-based generation and provide a systematic theoretical analysis of the convergence guarantee in both the Euclidean and SO(3) settings. We show that existing backprop-through-ODE methods can be interpreted as special cases of Euclidean OC-Flow. OC-Flow achieved superior performance in extensive experiments on text-guided image manipulation, conditional molecule generation, and all-atom peptide design.
Authors: Xue Zheng, Tian Xie, Xuwei Tan, Aylin Yener, Xueru Zhang, Ali Payani, Myungjin Lee
Abstract: Performative prediction (PP) is a framework that captures distribution shifts that occur during the training of machine learning models due to their deployment. As the trained model is used, its generated data could cause the model to evolve, leading to deviations from the original data distribution. The impact of such model-induced distribution shifts in the federated learning (FL) setup remains unexplored despite being increasingly likely to transpire in real-life use cases. Although Jin et al. (2024) recently extended PP to FL in a straightforward manner, the resulting model only converges to a performative stable point, which may be far from optimal. The methods in Izzo et al. (2021); Miller et al. (2021) can find a performative optimal point in centralized settings, but they require the performative risk to be convex and the training data to be noiseless, assumptions often violated in realistic FL systems. This paper overcomes all of these shortcomings and proposes Performative robust optimal Federated Learning (ProFL), an algorithm that finds performative optimal points in FL from noisy and contaminated data. We present the convergence analysis under the Polyak-Lojasiewicz condition, which applies to non-convex objectives. Extensive experiments on multiple datasets validate our proposed algorithms' efficiency.
Authors: Max Wilcoxson, Qiyang Li, Kevin Frans, Sergey Levine
Abstract: Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-relabels unlabeled trajectories using an optimistic reward model, transforming prior data into high-level, task-relevant examples. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe.
Authors: Peter Shaw, James Cohan, Jacob Eisenstein, Kenton Lee, Jonathan Berant, Kristina Toutanova
Abstract: We propose a new programming language called ALTA and a compiler that can map ALTA programs to Transformer weights. ALTA is inspired by RASP, a language proposed by Weiss et al. (2021), and Tracr (Lindner et al., 2023), a compiler from RASP programs to Transformer weights. ALTA complements and extends this prior work, offering the ability to express loops and to compile programs to Universal Transformers, among other advantages. ALTA allows us to constructively show how Transformers can represent length-invariant algorithms for computing parity and addition, as well as a solution to the SCAN benchmark of compositional generalization tasks, without requiring intermediate scratchpad decoding steps. We also propose tools to analyze cases where the expressibility of an algorithm is established, but end-to-end training on a given training set fails to induce behavior consistent with the desired algorithm. To this end, we explore training from ALTA execution traces as a more fine-grained supervision signal. This enables additional experiments and theoretical analyses relating the learnability of various algorithms to data availability and modeling decisions, such as positional encodings. We make the ALTA framework -- language specification, symbolic interpreter, and weight compiler -- available to the community to enable further applications and insights.
Authors: Renhao Wang, Kevin Frans, Pieter Abbeel, Sergey Levine, Alexei A. Efros
Abstract: Sample-efficient online reinforcement learning often uses replay buffers to store experience for reuse when updating the value function. However, uniform replay is inefficient, since certain classes of transitions can be more relevant to learning. While prioritization of more useful samples is helpful, this strategy can also lead to overfitting, as useful samples are likely to be more rare. In this work, we instead propose a prioritized, parametric version of an agent's memory, using generative models to capture online experience. This paradigm enables (1) densification of past experience, with new generations that benefit from the generative model's generalization capacity and (2) guidance via a family of "relevance functions" that push these generations towards more useful parts of an agent's acquired history. We show this recipe can be instantiated using conditional diffusion models and simple relevance functions such as curiosity- or value-based metrics. Our approach consistently improves performance and sample efficiency in both state- and pixel-based domains. We expose the mechanisms underlying these gains, showing how guidance promotes diversity in our generated transitions and reduces overfitting. We also showcase how our approach can train policies with even higher update-to-data ratios than before, opening up avenues to better scale online RL agents.
Authors: Osamu Take, Taketo Akama
Abstract: Recent MIDI-to-audio synthesis methods have employed deep neural networks to successfully generate high-quality and expressive instrumental tracks. However, these methods require MIDI annotations for supervised training, limiting the diversity of the output audio in terms of instrument timbres and expression styles. We propose CoSaRef, a MIDI-to-audio synthesis method that can be developed without MIDI-audio paired datasets. CoSaRef first performs concatenative synthesis based on MIDI inputs and then refines the resulting audio into realistic tracks using a diffusion-based deep generative model trained on audio-only datasets. This approach enhances the diversity of audio timbres and expression styles. It also allows for control over the output timbre based on audio sample selection, similar to traditional functions in digital audio workstations. Experiments show that while inherently capable of generating general tracks with high control over timbre, CoSaRef can also perform comparably to conventional methods in generating realistic audio.
Authors: Zilong Liu, Krzysztof Janowicz, Kitty Currier, Meilin Shi
Abstract: Regional defaults describe the emerging phenomenon that text-to-image (T2I) foundation models used in generative AI are prone to over-proportionally depicting certain geographic regions to the exclusion of others. In this work, we introduce a scalable evaluation for uncovering such regional defaults. The evaluation consists of region hierarchy--based image generation and cross-level similarity comparisons. We carry out an experiment by prompting DALL-E 2, a state-of-the-art T2I generation model capable of generating photorealistic images, to depict a forest. We select forest as an object class that displays regional variation and can be characterized using spatial statistics. For a region in the hierarchy, our experiment reveals the regional defaults implicit in DALL-E 2, along with their scale-dependent nature and spatial relationships. In addition, we discover that the implicit defaults do not necessarily correspond to the most widely forested regions in reality. Our findings underscore a need for further investigation into the geography of T2I generation and other forms of generative AI.
Authors: Hoon Lee, Mintae Kim, Seunghwan Baek, Namyoon Lee, Merouane Debbah, Inkyu Lee
Abstract: Traditional network management algorithms have relied on prior knowledge of system models and networking scenarios. In practice, a universal optimization framework is desirable where a sole optimization module can be readily applied to arbitrary network management tasks without any knowledge of the system. To this end, knowledge-free optimization techniques are necessary whose operations are independent of scenario-specific information including objective functions, system parameters, and network setups. The major challenge of this paradigm-shifting approach is the requirement of a hyper-intelligent black-box optimizer that can establish efficient decision-making policies using its internal reasoning capabilities. This article presents a novel knowledge-free network management paradigm with the power of foundation models called large language models (LLMs). Trained on vast amounts of datasets, LLMs can understand important contexts from input prompts containing minimal system information, thereby offering remarkable inference performance even for entirely new tasks. Pretrained LLMs can be potentially leveraged as foundation models for versatile network optimization. By eliminating the dependency on prior knowledge, LLMs can be seamlessly applied for various network management tasks. The viability of this approach is demonstrated for resource management problems using GPT-3.5-Turbo. Numerical results validate that knowledge-free LLM optimizers are able to achieve comparable performance to existing knowledge-based optimization algorithms.
Authors: Kasra Laamerad, Mehran Shabanpour, Md. Rabiul Islam, Arash Mohammadi
Abstract: Multi-channel surface Electromyography (sEMG), also referred to as high-density sEMG (HD-sEMG), plays a crucial role in improving gesture recognition performance for myoelectric control. Pattern recognition models developed based on HD-sEMG, however, are vulnerable to changing recording conditions (e.g., signal variability due to electrode shift). This has resulted in significant degradation in performance across subjects and sessions. In this context, the paper proposes the Masked Autoencoder with Swin Transformer (MAST) framework, where training is performed on a masked subset of HD-sEMG channels. A combination of four masking strategies, i.e., random block masking, temporal masking, sensor-wise random masking, and multi-scale masking, is used to learn latent representations and increase robustness against electrode shift. The masked data is then passed through MAST's three-path encoder-decoder structure, leveraging a multi-path Swin-Unet architecture that simultaneously captures time-domain, frequency-domain, and magnitude-based features of the underlying HD-sEMG signal. These augmented inputs are then used in a self-supervised pre-training fashion to improve the model's generalization capabilities. Experimental results demonstrate the superior performance of the proposed MAST framework in comparison to its counterparts.
Authors: Wenqing Wang, Yun Fu
Abstract: Audio-driven video portrait synthesis is a crucial and useful technology in virtual human interaction and film-making applications. Recent advancements have focused on improving the image fidelity and lip-synchronization. However, generating accurate emotional expressions is an important aspect of realistic talking-head generation, which has remained underexplored in previous works. We present a novel system in this paper for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions. Specifically, we utilize a variational autoencoder (VAE)-based audio-to-motion module to generate facial landmarks. These landmarks are concatenated with emotional embeddings to produce emotional landmarks through our motion-to-emotion module. These emotional landmarks are then used to render realistic emotional talking-head video using a Neural Radiance Fields (NeRF)-based emotion-to-video module. Additionally, we propose a pose sampling method that generates natural idle-state (non-speaking) videos in response to silent audio inputs. Extensive experiments demonstrate that our method obtains more accurate emotion generation with higher fidelity.
Authors: Fabian Jaensch, Giuseppe Caire, Beg\"um Demir
Abstract: In recent years, several studies have explored deep learning algorithms to predict large-scale signal fading, or path loss, in urban communication networks. The goal is to replace costly measurement campaigns, inaccurate statistical models, or computationally expensive ray-tracing simulations with machine learning models that deliver quick and accurate predictions. We focus on predicting path loss radio maps using convolutional neural networks, leveraging aerial images alone or in combination with supplementary height information. Notably, our approach does not rely on explicit classification of environmental objects, which is often unavailable for most locations worldwide. While the prediction of radio maps using complete 3D environmental data is well-studied, the use of only aerial images remains under-explored. We address this gap by showing that state-of-the-art models developed for existing radio map datasets can be effectively adapted to this task, achieving strong performance. Additionally, we introduce a new model that slightly exceeds the performance of the present state-of-the-art with reduced complexity. The trained models are differentiable, and therefore they can be incorporated in various network optimization algorithms. While an extensive discussion is beyond this paper's scope, we demonstrate this through an example optimizing the directivity of base stations in cellular networks via backpropagation to enhance coverage.
Authors: Matthis Manthe (LIRIS, CREATIS), Stefan Duffner (LIRIS), Carole Lartizien (MYRIAD)
Abstract: Recently, federated learning has raised increasing interest in the medical image analysis field due to its ability to aggregate multi-center data with privacy-preserving properties. A large number of federated training schemes have been published, which we categorize into global (one final model), personalized (one model per institution) or hybrid (one model per cluster of institutions) methods. However, their applicability to the recently published Federated Brain Tumor Segmentation 2022 dataset has not been explored yet. We propose an extensive benchmark of federated learning algorithms from all three classes on this task. While standard FedAvg already performs very well, we show that some methods from each category can bring a slight performance improvement and potentially limit the final model(s)' bias toward the predominant data distribution of the federation. Moreover, we provide a deeper understanding of the behaviour of federated learning on this task through alternative ways of distributing the pooled dataset among institutions, namely an Independent and Identically Distributed (IID) setup and a limited data setup.
Authors: Kelvin J. L. Koa, Yunshan Ma, Ritchie Ng, Huanhuan Zheng, Tat-Seng Chua
Abstract: Stock portfolios are often exposed to rare consequential events (e.g., 2007 global financial crisis, 2020 COVID-19 stock market crash), as they do not have enough historical information to learn from. Large Language Models (LLMs) now present a possible tool to tackle this problem, as they can generalize across their large corpus of training data and perform zero-shot reasoning on new events, allowing them to detect possible portfolio crash events without requiring specific training data. However, detecting portfolio crashes is a complex problem that requires more than basic reasoning abilities. Investors need to dynamically process the impact of each piece of new information found in news articles, analyze the relational network of impacts across news events and portfolio stocks, and understand the temporal context between impacts across time-steps, in order to obtain the overall aggregated effect on the target portfolio. In this work, we propose an algorithmic framework named Temporal Relational Reasoning (TRR). It seeks to emulate the spectrum of human cognitive capabilities used for complex problem-solving, which include brainstorming, memory, attention and reasoning. Through extensive experiments, we show that TRR is able to outperform state-of-the-art solutions on detecting stock portfolio crashes, and demonstrate how each of the proposed components contributes to its performance through an ablation study. Additionally, we further explore the possible applications of TRR by extending it to other related complex problems, such as the detection of possible global crisis events in Macroeconomics.
Authors: Siqi Li, Qiming Wu, Xin Li, Di Miao, Chuan Hong, Wenjun Gu, Yuqing Shang, Yohei Okada, Michael Hao Chen, Mengying Yan, Yilin Ning, Marcus Eng Hock Ong, Nan Liu
Abstract: Objective: Mitigating algorithmic disparities is a critical challenge in healthcare research, where ensuring equity and fairness is paramount. While large-scale healthcare data exist across multiple institutions, cross-institutional collaborations often face privacy constraints, highlighting the need for privacy-preserving solutions that also promote fairness. Materials and Methods: In this study, we present Fair Federated Machine Learning (FairFML), a model-agnostic solution designed to reduce algorithmic bias in cross-institutional healthcare collaborations while preserving patient privacy. As a proof of concept, we validated FairFML using a real-world clinical case study focused on reducing gender disparities in cardiac arrest outcome prediction. Results: We demonstrate that the proposed FairFML framework enhances fairness in federated learning (FL) models without compromising predictive performance. Our findings show that FairFML improves model fairness by up to 65% compared to the centralized model, while maintaining performance comparable to both local and centralized models, as measured by receiver operating characteristic analysis. Discussion and Conclusion: FairFML offers a promising and flexible solution for FL collaborations, with its adaptability allowing seamless integration with various FL frameworks and models, from traditional statistical methods to deep learning techniques. This makes FairFML a robust approach for developing fairer FL models across diverse clinical and biomedical applications.
Authors: Nayoung Kim, Seongsu Kim, Minsu Kim, Jinkyoo Park, Sungsoo Ahn
Abstract: Metal-organic frameworks (MOFs) are a class of crystalline materials with promising applications in many areas such as carbon capture and drug delivery. In this work, we introduce MOFFlow, the first deep generative model tailored for MOF structure prediction. Existing approaches, including ab initio calculations and even deep generative models, struggle with the complexity of MOF structures due to the large number of atoms in the unit cells. To address this limitation, we propose a novel Riemannian flow matching framework that reduces the dimensionality of the problem by treating the metal nodes and organic linkers as rigid bodies, capitalizing on the inherent modularity of MOFs. By operating in the $SE(3)$ space, MOFFlow effectively captures the roto-translational dynamics of these rigid components in a scalable way. Our experiment demonstrates that MOFFlow accurately predicts MOF structures containing several hundred atoms, significantly outperforming conventional methods and state-of-the-art machine learning baselines while being much faster.
Authors: Jos\'e Javier Gal\'an, Ram\'on Alberto Carrasco, Antonio LaTorre
Abstract: The military environment generates a large amount of highly important data, which makes machine learning necessary for its processing. The ability of machine learning to learn from and predict possible scenarios by analyzing this huge volume of information provides automated learning and decision support. This paper presents a model of a machine learning architecture applied to a military organization, developed and supported by a bibliometric study of an architecture model from a non-military organization. For this purpose, a bibliometric analysis up to the year 2021 was carried out, producing a strategic diagram and interpreting the results. The information used was extracted from ISI WoS, one of the main databases widely accepted by the scientific community; no direct military sources were used. This work is divided into five parts: a study of previous research related to machine learning in the military domain; an explanation of our research methodology using the SciMat, Excel, and VosViewer tools; the application of this methodology, based on data mining, preprocessing, cluster normalization, a strategic diagram, and analysis of the results, to investigate machine learning in the military context; the derivation, from these results, of a conceptual architecture for the practical use of ML in the military context; and, finally, the conclusions, which identify the most important areas and the latest advances in machine learning applied to the military environment for analyzing large datasets and providing decision support.
Authors: Ziwei Dong, Ameya Patil, Yuichi Shoda, Leilani Battle, Emily Wall
Abstract: Data science pipelines inform and influence many daily decisions, from what we buy to who we work for and even where we live. When designed incorrectly, these pipelines can easily propagate social inequity and harm. Traditional solutions are technical in nature; e.g., mitigating biased algorithms. In this vision paper, we introduce a novel lens for promoting responsible data science using theories of behavior change that emphasize not only technical solutions but also the behavioral responsibility of practitioners. By integrating behavior change theories from cognitive psychology with data science workflow knowledge and ethics guidelines, we present a new perspective on responsible data science. We present example data science interventions in machine learning and visual data analysis, contextualized in behavior change theories that could be implemented to interrupt and redirect potentially suboptimal or negligent practices while reinforcing ethically conscious behaviors. We conclude with a call to action to our community to explore this new research area of behavior change interventions for responsible data science.
Authors: Sendey Vera, Luis Chuquimarca, Wilson Galdea, Bremnen V\'eliz, Carlos Salda\~na
Abstract: This scientific article presents the implementation of an automated control system for detecting and classifying faults in tuna metal cans using artificial vision. The system utilizes a conveyor belt and a camera for visual recognition triggered by a photoelectric sensor. A robotic arm classifies the metal cans according to their condition. Industry 4.0 integration is achieved through an IoT system using Mosquitto, Node-RED, InfluxDB, and Grafana. The YOLOv5 model is employed to detect faults in the metal can lids and the positioning of the easy-open ring. Training with GPU on Google Colab enables OCR text detection on the labels. The results indicate efficient real-time problem identification, optimization of resources, and delivery of quality products. At the same time, the vision system contributes to autonomy in quality control tasks, freeing operators to perform other functions within the company.
Authors: Arushi Prakash, Dimitrios Bermperidis, Srivas Chennu
Abstract: Large-scale industrial recommendation models predict the most relevant items from catalogs containing millions or billions of options. To train these models efficiently, a small set of irrelevant items (negative samples) is selected from the vast catalog for each relevant item (positive example), helping the model distinguish between relevant and irrelevant items. Choosing the right negative sampling method is a common challenge. We address this by implementing and comparing various negative sampling methods - random, popularity-based, in-batch, mixed, adaptive, and adaptive with mixed variants - on modern sequential recommendation models. Our experiments, including hyperparameter optimization and 20x repeats on three benchmark datasets with varying popularity biases, show how the choice of method and dataset characteristics impact key model performance metrics. We also reveal that average performance metrics often hide imbalances across popularity bands (head, mid, tail). We find that commonly used random negative sampling reinforces popularity bias and performs best for head items. Popularity-based methods (in-batch and global popularity negative sampling) can offer balanced performance at the cost of lower overall model performance results. Our study serves as a practical guide to the trade-offs in selecting a negative sampling method for large-scale sequential recommendation models.
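To make two of the compared strategies concrete, the following sketch contrasts uniform random and global popularity-based negative sampling for a single positive interaction; it is a minimal illustration with hypothetical array shapes, not the authors' implementation.

    import numpy as np

    def random_negatives(num_items, positives, k, rng):
        """Uniform random negatives: every catalog item is equally likely to be drawn."""
        negs = rng.integers(0, num_items, size=k)
        mask = np.isin(negs, positives)
        while mask.any():                                   # reject accidental positives
            negs[mask] = rng.integers(0, num_items, size=mask.sum())
            mask = np.isin(negs, positives)
        return negs

    def popularity_negatives(item_counts, positives, k, rng, alpha=0.75):
        """Global popularity-based negatives: P(item) proportional to count**alpha."""
        probs = item_counts.astype(np.float64) ** alpha
        probs[np.asarray(positives)] = 0.0                  # never sample the user's own positives
        probs /= probs.sum()
        return rng.choice(len(item_counts), size=k, p=probs, replace=False)

With rng = np.random.default_rng(), the two functions expose the trade-off discussed above: the uniform sampler treats head and tail items alike, while the popularity-weighted sampler produces harder negatives at the cost of different popularity-band behavior.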
Authors: Louise McCormack, Malika Bendechache
Abstract: This paper presents a systematic review of the literature on evaluation criteria for Trustworthy Artificial Intelligence (TAI), with a focus on the seven EU principles of TAI. This systematic literature review identifies and analyses current evaluation criteria, maps them to the EU TAI principles and proposes a new classification system for each principle. The findings reveal both a need for and significant barriers to standardising criteria for TAI evaluation. The proposed classification contributes to the development, selection and standardization of evaluation criteria for TAI governance.
Authors: Glenda Hui En Tan (Carnegie Mellon University), Goh Xin Ru Karin (London School of Economics and Political Science), Shen Bingquan (DSO National Laboratories Singapore)
Abstract: Colorectal cancer is the most common cancer in Singapore and the third most common cancer worldwide. Blood in a person's stool is a symptom of this disease, and it is usually detected by the faecal occult blood test (FOBT). However, the FOBT presents several limitations - the collection process for the stool samples is tedious and unpleasant, the waiting period for results is about 2 weeks, and costs are involved. In this research, we propose a simple-to-use, fast and cost-free alternative - a stool recognition neural network that determines if there is blood in one's stool (which indicates a possible risk of colorectal cancer) from an image of it. As this is a new classification task, there was limited data available, hindering classifier performance. Hence, various Generative Adversarial Networks (GANs) (DiffAugment StyleGAN2, DCGAN, Conditional GAN) were trained to generate images of high fidelity to supplement the dataset. Subsequently, images generated by the GAN that produced the most realistic images (DiffAugment StyleGAN2) were concatenated to the classifier's training batch on-the-fly, improving accuracy to 94%. This model was then deployed to a mobile app - Poolice, where users can take a photo of their stool and obtain instantaneous results indicating whether there is blood in their stool, prompting those who do to seek medical advice. As "early detection saves lives", we hope our app built on our stool recognition neural network can help people detect colorectal cancer earlier, so they can seek treatment and have higher chances of survival.
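The on-the-fly augmentation step described above can be pictured with the following PyTorch-style sketch, in which samples from a frozen generator are appended to each real training batch; generator.latent_dim and the label assigned to synthetic images are assumptions made for illustration, not details from the paper.

    import torch

    def augmented_training_step(classifier, generator, real_x, real_y,
                                optimizer, loss_fn, n_synth=16, synth_label=1):
        """One classifier update with GAN-generated images appended to the real batch on-the-fly."""
        with torch.no_grad():                                   # the GAN is assumed pretrained and frozen
            z = torch.randn(n_synth, generator.latent_dim)
            synth_x = generator(z)
        x = torch.cat([real_x, synth_x], dim=0)
        y = torch.cat([real_y,
                       torch.full((n_synth,), synth_label, dtype=real_y.dtype)], dim=0)
        optimizer.zero_grad()
        loss = loss_fn(classifier(x), y)
        loss.backward()
        optimizer.step()
        return loss.item()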
Authors: Ghazaleh Babanejaddehaki, Aijun An, Manos Papagelis
Abstract: Infectious diseases occur when pathogens from other individuals or animals infect a person, resulting in harm to both individuals and society as a whole. The outbreak of such diseases can pose a significant threat to human health. However, early detection and tracking of these outbreaks have the potential to reduce the mortality impact. To address these threats, public health authorities have endeavored to establish comprehensive mechanisms for collecting disease data. Many countries have implemented infectious disease surveillance systems, with the detection of epidemics being a primary objective. The clinical healthcare system, local/state health agencies, federal agencies, academic/professional groups, and collaborating governmental entities all play pivotal roles within this system. Moreover, nowadays, search engines and social media platforms can serve as valuable tools for monitoring disease trends. The Internet and social media have become significant platforms where users share information about their preferences and relationships. This real-time information can be harnessed to gauge the influence of ideas and societal opinions, making it highly useful across various domains and research areas, such as marketing campaigns, financial predictions, and public health, among others. This article provides a review of the existing standard methods developed by researchers for detecting outbreaks using time series data. These methods leverage various data sources, including conventional data sources and social media data or Internet data sources. The review particularly concentrates on works published within the timeframe of 2015 to 2022.
Authors: Bahar Ali, Anwar Shah, Malik Niaz, Musadaq Mansoord, Sami Ullah, Muhammad Adnan
Abstract: Advanced automated AI techniques allow us to classify protein sequences and discern their biological families and functions. Conventional approaches for classifying these protein families often focus on extracting N-Gram features from the sequences while overlooking crucial motif information and the interplay between motifs and neighboring amino acids. Recently, convolutional neural networks have been applied to amino acid and motif data, even with a limited dataset of well-characterized proteins, resulting in improved performance. This study presents a model for classifying protein families using the fusion of 1D-CNN, BiLSTM, and an attention mechanism, which combines spatial feature extraction, long-term dependencies, and context-aware representations. The proposed model (ProFamNet) achieved superior model efficiency with 450,953 parameters and a compact size of 1.72 MB, outperforming the state-of-the-art model with 4,578,911 parameters and a size of 17.47 MB. Further, we achieved a higher F1 score (98.30% vs. 97.67%) with more instances (271,160 vs. 55,077) in fewer training epochs (25 vs. 30).
Authors: Arnaud Guillin (LMBP), Yu Wang, Lihu Xu, Haoran Yang
Abstract: Stochastic gradient descent with momentum is a popular variant of stochastic gradient descent, which has recently been reported to have a close relationship with the underdamped Langevin diffusion. In this paper, we establish a quantitative error estimate between them in the 1-Wasserstein and total variation distances.
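For readers unfamiliar with the two objects being compared, the usual forms are (a) heavy-ball SGD with momentum, $v_{k+1} = \mu v_k - \eta\, g(x_k)$, $x_{k+1} = x_k + v_{k+1}$, where $g(x_k)$ is a stochastic gradient, and (b) the underdamped Langevin diffusion $\mathrm{d}V_t = -\gamma V_t\,\mathrm{d}t - \nabla f(X_t)\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}B_t$, $\mathrm{d}X_t = V_t\,\mathrm{d}t$, written here at unit temperature. The exact scaling and step-size correspondence used to couple the two processes in the paper may differ from this schematic statement.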
Authors: Haokun Liu, Yangqiaoyu Zhou, Mingxuan Li, Chenfei Yuan, Chenhao Tan
Abstract: AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. We apply our method on five different datasets and demonstrate that integrating literature and data outperforms other baselines (8.97\% over few-shot, 15.75\% over literature-based alone, and 3.37\% over data-driven alone). Additionally, we conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making on two challenging tasks: deception detection and AI generated content detection. Our results show that human accuracy improves significantly by 7.44\% and 14.19\% on these tasks, respectively. These findings suggest that integrating literature-based and data-driven approaches provides a comprehensive and nuanced framework for hypothesis generation and could open new avenues for scientific inquiry.
Authors: Zekun Jiang, Wei Dai, Qu Wei, Ziyuan Qin, Kang Li, Le Zhang
Abstract: Multi-channel EEG signals are commonly used for the diagnosis and assessment of diseases such as epilepsy. Currently, various EEG diagnostic algorithms based on deep learning have been developed. However, most research efforts focus solely on diagnosing and classifying current signal data but do not consider the prediction of future trends for early warning. Additionally, since multi-channel EEG can be essentially regarded as the spatio-temporal signal data received by detectors at different locations in the brain, how to construct spatio-temporal information representations of EEG signals to facilitate future trend prediction for multi-channel EEG becomes an important problem. This study proposes a multi-signal prediction algorithm based on generative diffusion models (EEG-DIF), which transforms the multi-signal forecasting task into an image completion task, allowing for comprehensive representation and learning of the spatio-temporal correlations and future developmental patterns of multi-channel EEG signals. Here, we employ a publicly available epilepsy EEG dataset to construct and validate the EEG-DIF. The results demonstrate that our method can accurately predict future trends for multi-channel EEG signals simultaneously. Furthermore, the early warning accuracy for epilepsy seizures based on the generated EEG data reaches 0.89. In general, EEG-DIF provides a novel approach for characterizing multi-channel EEG signals and an innovative early warning algorithm for epilepsy seizures, aiding in optimizing and enhancing the clinical diagnosis process. The code is available at https://github.com/JZK00/EEG-DIF.
Authors: Sathvik Prasad, Aleksandr Nahapetyan, Bradley Reaves
Abstract: Telephone spam has been among the highest network security concerns for users for many years. In response, industry and government have deployed new technologies and regulations to curb the problem, and academic and industry researchers have provided methods and measurements to characterize robocalls. Have these efforts borne fruit? Are the research characterizations reliable, and have the prevention and deterrence mechanisms succeeded? In this paper, we address these questions through analysis of data from several independently-operated vantage points, ranging from industry and academic voice honeypots to public enforcement and consumer complaints, some with over 5 years of historic data. We first describe how we address the non-trivial methodological challenges of comparing disparate data sources, including comparing audio and transcripts from about 3 million voice calls. We also detail the substantial coherency of these diverse perspectives, which dramatically strengthens the evidence for the conclusions we draw about robocall characterization and mitigation while highlighting advantages of each approach. Among our many findings, we find that unsolicited calls are in slow decline, though complaints and call volumes remain high. We also find that robocallers have managed to adapt to STIR/SHAKEN, a mandatory call authentication scheme. In total, our findings highlight the most promising directions for future efforts to characterize and stop telephone spam.
Authors: Bradley McDanel
Abstract: Large language models typically generate tokens autoregressively, using each token as input for the next. Recent work on Speculative Decoding has sought to accelerate this process by employing a smaller, faster draft model to more quickly generate candidate tokens. These candidates are then verified in parallel by the larger (original) verify model, resulting in overall speedup compared to using the larger model by itself in an autoregressive fashion. In this work, we introduce AMUSD (Asynchronous Multi-device Speculative Decoding), a system that further accelerates generation by decoupling the draft and verify phases into a continuous, asynchronous approach. Unlike conventional speculative decoding, where only one model (draft or verify) performs token generation at a time, AMUSD enables both models to perform predictions independently on separate devices (e.g., GPUs). We evaluate our approach over multiple datasets and show that AMUSD achieves an average 29% improvement over speculative decoding and up to 1.96$\times$ speedup over conventional autoregressive decoding, while achieving identical output quality. Our system is open-source and available at https://github.com/BradMcDanel/AMUSD/.
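The synchronous draft-then-verify loop that AMUSD decouples can be sketched as follows; draft_next and verify_next are hypothetical callables standing in for greedy next-token prediction by the small and large models, and the asynchronous multi-device scheduling that distinguishes AMUSD is intentionally omitted.

    def speculative_decode(draft_next, verify_next, prompt, max_new=64, k=4):
        """Simplified synchronous speculative decoding with greedy acceptance."""
        tokens = list(prompt)
        while len(tokens) < len(prompt) + max_new:
            # 1) draft model proposes k candidate tokens autoregressively
            proposal = []
            for _ in range(k):
                proposal.append(draft_next(tokens + proposal))
            # 2) verify model checks the candidates; keep the longest agreeing prefix
            accepted = 0
            for i in range(k):
                if verify_next(tokens + proposal[:i]) == proposal[i]:
                    accepted += 1
                else:
                    break
            tokens += proposal[:accepted]
            # 3) the verify model itself emits one token (a correction or a bonus token)
            tokens.append(verify_next(tokens))
        return tokens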
Authors: Borja Aizpurua, Saeed S. Jahromi, Sukhbinder Singh, Roman Orus
Abstract: We propose a method to enhance the performance of Large Language Models (LLMs) by integrating quantum computing and quantum-inspired techniques. Specifically, our approach involves replacing the weight matrices in the Self-Attention and Multi-layer Perceptron layers with a combination of two variational quantum circuits and a quantum-inspired tensor network, such as a Matrix Product Operator (MPO). This substitution enables the reproduction of classical LLM functionality by decomposing weight matrices through the application of tensor network disentanglers and MPOs, leveraging well-established tensor network techniques. By incorporating more complex and deeper quantum circuits, along with increasing the bond dimensions of the MPOs, our method captures additional correlations within the quantum-enhanced LLM, leading to improved accuracy beyond classical models while maintaining low memory overhead.
Authors: Meiby Ortiz-Bouza, Duc Vu, Abdullah Karaaslanli, Selin Aviyente
Abstract: Over the past two decades, tools from network science have been leveraged to characterize the organization of both structural and functional networks of the brain. One such measure of network organization is hub node identification. Hubs are specialized nodes within a network that link distinct brain units corresponding to specialized functional processes. Conventional methods for identifying hub nodes utilize different types of centrality measures and participation coefficient to profile various aspects of nodal importance. These methods solely rely on the functional connectivity networks constructed from functional magnetic resonance imaging (fMRI), ignoring the structure-function coupling in the brain. In this paper, we introduce a graph signal processing (GSP) based hub detection framework that utilizes both the structural connectivity and the functional activation to identify hub nodes. The proposed framework models functional activity as graph signals on the structural connectivity. Hub nodes are then detected based on the premise that hub nodes are sparse, have higher level of activity compared to their neighbors, and the non-hub nodes' activity can be modeled as the output of a graph-based filter. Based on these assumptions, an optimization framework, GraFHub, is formulated to learn the coefficients of the optimal polynomial graph filter and detect the hub nodes. The proposed framework is evaluated on both simulated data and resting state fMRI (rs-fMRI) data from Human Connectome Project (HCP).
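A minimal numpy sketch of the modeling assumption above, that non-hub activity behaves like the output of a low-order polynomial graph filter applied over the structural connectivity, is shown below; the toy graph and filter coefficients are arbitrary, and the GraFHub optimization itself is not reproduced.

    import numpy as np

    def polynomial_graph_filter(L, coeffs):
        """H(L) = sum_k coeffs[k] * L^k for a graph shift operator L (e.g., the Laplacian)."""
        H = np.zeros_like(L, dtype=float)
        P = np.eye(L.shape[0])
        for c in coeffs:
            H = H + c * P
            P = P @ L
        return H

    rng = np.random.default_rng(0)
    A = rng.integers(0, 2, size=(10, 10))
    A = np.triu(A, 1); A = A + A.T                  # toy undirected structural graph
    L = np.diag(A.sum(axis=1)) - A                  # combinatorial Laplacian
    H = polynomial_graph_filter(L, coeffs=[1.0, -0.3, 0.05])
    activity = H @ rng.normal(size=10)              # graph signal consistent with the filter model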
Authors: \"Omer Veysel \c{C}a\u{g}atan
Abstract: We propose SigCLR: Sigmoid Contrastive Learning of Visual Representations. SigCLR utilizes a logistic loss that operates only on pairs and does not require a global view as in the cross-entropy loss used in SimCLR. We show that the logistic loss achieves competitive performance on CIFAR-10, CIFAR-100, and Tiny-IN compared to other established SSL objectives. Our findings verify the importance of a learnable bias, as in the case of SigLIP; however, SigCLR requires a fixed temperature, as in SimCLR, to excel. Overall, SigCLR is a promising replacement for SimCLR, which is ubiquitous and has shown tremendous success in various domains.
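A minimal PyTorch sketch of the pairwise logistic (sigmoid) objective described above, with a learnable bias and a fixed temperature, is given below; it follows the general SigLIP-style form and may differ in details from the authors' exact loss.

    import torch
    import torch.nn.functional as F

    def sigmoid_contrastive_loss(z1, z2, bias, temperature=0.1):
        """Pairwise logistic loss over all embedding pairs from two augmented views.

        z1, z2 : (N, D) embeddings of the same N images under two augmentations
        bias   : learnable scalar offset (a torch.nn.Parameter in practice)
        """
        z1 = F.normalize(z1, dim=-1)
        z2 = F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / temperature + bias                     # (N, N) pairwise similarities
        labels = 2 * torch.eye(z1.size(0), device=z1.device) - 1      # +1 on the diagonal, -1 elsewhere
        # every pair contributes an independent binary (logistic) term; no softmax over the batch
        return F.softplus(-labels * logits).mean()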
Authors: Haotong Liang, Chuangye Wang, Heshan Yu, Dylan Kirsch, Rohit Pant, Austin McDannald, A. Gilad Kusne, Ji-Cheng Zhao, Ichiro Takeuchi
Abstract: Iterative cycles of theoretical prediction and experimental validation are the cornerstone of the modern scientific method. However, the proverbial "closing of the loop" in experiment-theory cycles is, in practice, usually ad hoc, often inherently difficult, or impractical to repeat on a systematic basis, beset by the scale or the time constraint of computation or the phenomena under study. Here, we demonstrate the Autonomous MAterials Search Engine (AMASE), where we enlist robot science to perform self-driving, continuous, cyclical interaction of experiments and computational predictions for materials exploration. In particular, we have applied the AMASE formalism to the rapid mapping of a temperature-composition phase diagram, a fundamental task for the search and discovery of new materials. Thermal processing and experimental determination of compositional phase boundaries in thin films are autonomously interspersed with real-time updating of the phase diagram prediction through the minimization of Gibbs free energies. AMASE was able to accurately determine the eutectic phase diagram of the Sn-Bi binary thin-film system on the fly from a self-guided campaign covering just a small fraction of the entire composition - temperature phase space, translating to a 6-fold reduction in the number of necessary experiments. This study demonstrates for the first time the possibility of real-time, autonomous, and iterative interactions of experiments and theory carried out without any human intervention.
Authors: Thuan Pham, Xingpeng Li
Abstract: Optimal power flow (OPF) has been used for real-time grid operations. Prior efforts demonstrated that utilizing flexibility from dynamic topologies will improve grid efficiency. However, this will convert the linear OPF into a mixed-integer linear programming network-reconfigured OPF (NR-OPF) problem, substantially increasing the computing time. Thus, a machine learning (ML)-based approach, particularly utilizing graph neural network (GNN), is proposed to accelerate the solution process. The GNN model is trained offline to predict the best topology before entering the optimization stage. In addition, this paper proposes an offline pre-ML filter layer to reduce GNN model size and training time while improving its accuracy. A fast online post-ML selection layer is also proposed to analyze GNN predictions and then select a subset of predicted NR solutions with high confidence. Case studies have demonstrated superior performance of the proposed GNN-accelerated NR-OPF method augmented with the proposed pre-ML and post-ML layers.
Authors: Ali Azizpour, Nicolas Zilberstein, Santiago Segarra
Abstract: Graphons are continuous models that represent the structure of graphs and allow the generation of graphs of varying sizes. We propose Scalable Implicit Graphon Learning (SIGL), a scalable method that combines implicit neural representations (INRs) and graph neural networks (GNNs) to estimate a graphon from observed graphs. Unlike existing methods, which face important limitations like fixed resolution and scalability issues, SIGL learns a continuous graphon at arbitrary resolutions. GNNs are used to determine the correct node ordering, improving graph alignment. Furthermore, we characterize the asymptotic consistency of our estimator, showing that more expressive INRs and GNNs lead to consistent estimators. We evaluate SIGL in synthetic and real-world graphs, showing that it outperforms existing methods and scales effectively to larger graphs, making it ideal for tasks like graph data augmentation.
Authors: Jacopo Tagliabue, Tyler Caraza-Harter, Ciro Greco
Abstract: Chaining functions for longer workloads is a key use case for FaaS platforms in data applications. However, modern data pipelines differ significantly from typical serverless use cases (e.g., webhooks and microservices); this makes it difficult to retrofit existing pipeline frameworks due to structural constraints. In this paper, we describe these limitations in detail and introduce bauplan, a novel FaaS programming model and serverless runtime designed for data practitioners. bauplan enables users to declaratively define functional Directed Acyclic Graphs (DAGs) along with their runtime environments, which are then efficiently executed on cloud-based workers. We show that bauplan achieves both better performance and a superior developer experience for data workloads by making the trade-off of reducing generality in favor of data-awareness.
Authors: Tanaporn Na Narong, Zoe N. Zachko, Steven B. Torrisi, Simon J. L. Billinge
Abstract: We used off-the-shelf interpretable ML techniques to combine information from multiple heterogeneous spectra: X-ray absorption near-edge spectra (XANES) and atomic pair distribution functions (PDFs), to extract information about local structure and chemistry of transition metal oxides. This approach enabled us to analyze the relative contributions of the different spectra to different prediction tasks. Specifically, we trained random forest models on XANES, PDF, and both of them combined, to extract charge (oxidation) state, coordination number, and mean nearest-neighbor bond length of transition metal cations in oxides. We find that XANES-only models tend to outperform the PDF-only models for all the tasks, and information from XANES often dominated when the two inputs were combined. This was even true for structural tasks where we might expect PDF to dominate. However, the performance gap closes when we used species-specific differential PDFs (dPDFs) as the inputs instead of total PDFs. Our results highlight that XANES contains rich structural information and may be further developed as a structural probe. Our interpretable, multimodal approach is quick and easy to implement when suitable structural and spectroscopic databases are available. This approach provides valuable insights into the relative strengths of different modalities for a practical scientific goal, guiding researchers in their experiment design tasks such as deciding when it is useful to combine complementary techniques in a scientific investigation.
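The multimodal comparison described above can be reproduced in outline with off-the-shelf scikit-learn components; the array names and split below are placeholders, and the actual preprocessing and targets used in the study are not reproduced here.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    def compare_modalities(xanes, pdf, y, seed=0):
        """Train random forests on XANES-only, PDF-only, and combined inputs; report test MAE.

        xanes : (n_samples, n_energy_points) spectra
        pdf   : (n_samples, n_r_points) pair distribution functions
        y     : target property, e.g., mean nearest-neighbor bond length
        """
        results = {}
        for name, X in {"XANES": xanes, "PDF": pdf,
                        "XANES+PDF": np.hstack([xanes, pdf])}.items():
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
            model = RandomForestRegressor(n_estimators=300, random_state=seed)
            model.fit(X_tr, y_tr)
            results[name] = mean_absolute_error(y_te, model.predict(X_te))
        return results   # feature_importances_ of the combined model attributes the two modalities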
Authors: Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Boxing Chen, Sarath Chandar
Abstract: The growth in prominence of large language models (LLMs) in everyday life can be largely attributed to their generative abilities, yet some of this is also owed to the risks and costs associated with their use. On one front is their tendency to \textit{hallucinate} false or misleading information, limiting their reliability. On another is the increasing focus on the computational limitations associated with traditional self-attention based LLMs, which has brought about new alternatives, in particular recurrent models, meant to overcome them. Yet it remains uncommon to consider these two concerns simultaneously. Do changes in architecture exacerbate/alleviate existing concerns about hallucinations? Do they affect how and where they occur? Through an extensive evaluation, we study how these architecture-based inductive biases affect the propensity to hallucinate. While hallucination remains a general phenomenon not limited to specific architectures, the situations in which they occur and the ease with which specific types of hallucinations can be induced can significantly differ based on the model architecture. These findings highlight the need for better understanding both these problems in conjunction with each other, as well as consider how to design more universal techniques for handling hallucinations.
Authors: He Zhu, Ren Togo, Takahiro Ogawa, Miki Haseyama
Abstract: Conventional medical artificial intelligence (AI) models face barriers in clinical application and ethical issues owing to their inability to handle the privacy-sensitive characteristics of medical data. We present a novel personalized federated learning (pFL) method for medical visual question answering (VQA) models, addressing privacy reliability challenges in the medical domain. Our method introduces learnable prompts into a Transformer architecture to efficiently train it on diverse medical datasets without massive computational costs. Then we introduce a reliable client VQA model that incorporates Dempster-Shafer evidence theory to quantify uncertainty in predictions, enhancing the model's reliability. Furthermore, we propose a novel inter-client communication mechanism that uses maximum likelihood estimation to balance accuracy and uncertainty, fostering efficient integration of insights across clients.
Authors: Yixuan Wang, Guang Yin, Binghao Huang, Tarik Kelestemur, Jiuguang Wang, Yunzhu Li
Abstract: Diffusion-based policies have shown remarkable capability in executing complex robotic manipulation tasks but lack explicit characterization of geometry and semantics, which often limits their ability to generalize to unseen objects and layouts. To enhance the generalization capabilities of Diffusion Policy, we introduce a novel framework that incorporates explicit spatial and semantic information via 3D semantic fields. We generate 3D descriptor fields from multi-view RGBD observations with large foundational vision models, then compare these descriptor fields against reference descriptors to obtain semantic fields. The proposed method explicitly considers geometry and semantics, enabling strong generalization capabilities in tasks requiring category-level generalization, resolving geometric ambiguities, and attention to subtle geometric details. We evaluate our method across eight tasks involving articulated objects and instances with varying shapes and textures from multiple object categories. Our method demonstrates its effectiveness by increasing Diffusion Policy's average success rate on unseen instances from 20% to 93%. Additionally, we provide a detailed analysis and visualization to interpret the sources of performance gain and explain how our method can generalize to novel instances.
Authors: Jiaqi Xue, Qian Lou, Mengxin Zheng
Abstract: Attacking fairness is crucial because compromised models can introduce biased outcomes, undermining trust and amplifying inequalities in sensitive applications like hiring, healthcare, and law enforcement. This highlights the urgent need to understand how fairness mechanisms can be exploited and to develop defenses that ensure both fairness and robustness. We introduce BadFair, a novel backdoored fairness attack methodology. BadFair stealthily crafts a model that operates with accuracy and fairness under regular conditions but, when activated by certain triggers, discriminates and produces incorrect results for specific groups. This type of attack is particularly stealthy and dangerous, as it circumvents existing fairness detection methods, maintaining an appearance of fairness in normal use. Our findings reveal that BadFair achieves a more than 85% attack success rate in attacks aimed at target groups on average while only incurring a minimal accuracy loss. Moreover, it consistently exhibits a significant discrimination score, distinguishing between pre-defined target and non-target attacked groups across various datasets and models.
Authors: Qibang Liu, Pengfei Cai, Diab Abueidda, Seid Koric, Rafael Gomez-Bombarellig, Philippe Geubelle
Abstract: Rapid reaction-thermal diffusion during frontal polymerization (FP) with variations in initial and boundary conditions destabilizes the planar mode of front propagation, leading to spatially varying complex hierarchical patterns in polymeric materials. Although modern reaction-diffusion models can predict the patterns resulting from unstable FP, the inverse design of patterns, which aims to retrieve process conditions that produce a desired pattern, remains an open challenge due to the nonunique and nonintuitive mapping between process conditions and patterns. In this work, we propose a novel probabilistic generative model named univariate conditional variational autoencoder (UcVAE) for the inverse design of hierarchical patterns in FP-based manufacturing. Unlike the cVAE, which encodes both the design space and the design target, the UcVAE encodes only the design space. In the encoder of the UcVAE, the number of training parameters is significantly reduced compared to the cVAE, resulting in a shorter training time while maintaining comparable performance. Given desired pattern images, the trained UcVAE can generate multiple process condition solutions that produce high-fidelity hierarchical patterns.
Authors: Tianhao Fu, Querobin Mascarenhas, Andrew Forti
Abstract: With the rapid development of electric vehicles, formula races aimed at high school and university students have become more popular than ever, as the threshold for design and manufacturing has been lowered. In many cases, we see teams inspired by or directly using toolkits and technologies inherited from standardized commercial vehicles. These architectures are usually overly complicated for amateur applications like these races. In order to improve efficiency and simplify the development of instrumentation, control, and analysis systems, we propose LEADS (Lightweight Embedded Assisted Driving System), a dedicated solution for such scenarios.
Authors: Michael John Fanous, Christopher Michael Seybold, Hanlong Chen, Nir Pillar, Aydogan Ozcan
Abstract: We developed a rapid scanning optical microscope, termed "BlurryScope", that leverages continuous image acquisition and deep learning to provide a cost-effective and compact solution for automated inspection and analysis of tissue sections. BlurryScope integrates specialized hardware with a neural network-based model to quickly process motion-blurred histological images and perform automated pathology classification. This device offers comparable speed to commercial digital pathology scanners, but at a significantly lower price point and smaller size/weight, making it ideal for fast triaging in small clinics, as well as for resource-limited settings. To demonstrate the proof-of-concept of BlurryScope, we implemented automated classification of human epidermal growth factor receptor 2 (HER2) scores on immunohistochemically (IHC) stained breast tissue sections, achieving concordant results with those obtained from a high-end digital scanning microscope. We evaluated this approach by scanning HER2-stained tissue microarrays (TMAs) at a continuous speed of 5 mm/s, which introduces bidirectional motion blur artifacts. These compromised images were then used to train our network models. Using a test set of 284 unique patient cores, we achieved blind testing accuracies of 79.3% and 89.7% for 4-class (0, 1+, 2+, 3+) and 2-class (0/1+ , 2+/3+) HER2 score classification, respectively. BlurryScope automates the entire workflow, from image scanning to stitching and cropping of regions of interest, as well as HER2 score classification. We believe BlurryScope has the potential to enhance the current pathology infrastructure in resource-scarce environments, save diagnostician time and bolster cancer identification and classification across various clinical environments.
Authors: Ruyi Tao, Kaiwei Liu, Xu Jing, Jiang Zhang
Abstract: Predicting company growth is crucial for strategic adjustment, operational decision-making, risk assessment, and loan eligibility reviews. Traditional models for company growth often focus too much on theory, overlooking practical forecasting, or they rely solely on time series forecasting techniques, ignoring interpretability and the inherent mechanisms of company growth. In this paper, we propose a machine learning-based prediction framework that incorporates an econophysics model for company growth. Our model captures both the intrinsic growth mechanisms of companies led by scaling laws and the fluctuations influenced by random factors and individual decisions, demonstrating superior predictive performance compared with methods that use time series techniques alone. Its advantages are more pronounced in long-range prediction tasks. By explicitly modeling the baseline growth and volatility components, our model is more interpretable.
Authors: Junwon Lee, Modan Tailleur, Laurie M. Heller, Keunwoo Choi, Mathieu Lagrange, Brian McFee, Keisuke Imoto, Yuki Okamoto
Abstract: Despite significant advancements in neural text-to-audio generation, challenges persist in controllability and evaluation. This paper addresses these issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events 2024. We present an evaluation protocol combining an objective metric, namely Fr\'echet Audio Distance, with perceptual assessments, utilizing a structured prompt format to enable diverse captions and effective evaluation. Our analysis reveals varying performance across sound categories and model architectures, with larger models generally excelling but innovative lightweight approaches also showing promise. The strong correlation between objective metrics and human ratings validates our evaluation approach. We discuss outcomes in terms of audio quality, controllability, and architectural considerations for text-to-audio synthesizers, providing direction for future research.
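For reference, the Fr\'echet Audio Distance used as the objective metric is, under the usual Gaussian assumption on embedding statistics, the Fr\'echet distance between Gaussians fitted to embeddings of reference and generated audio: $\mathrm{FAD} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$, where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the reference and generated embedding sets, respectively.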
Authors: Zheng Luo, Ming Feng, Zijian Gao, Jinyang Yu, Liang Hu, Tao Wang, Shenao Xue, Shen Zhou, Fangping Ouyang, Dawei Feng, Kele Xu, Shanshan Wang
Abstract: The emergence of deep learning (DL) has provided great opportunities for the high-throughput analysis of atomic-resolution micrographs. However, the DL models trained by image patches in fixed size generally lack efficiency and flexibility when processing micrographs containing diversified atomic configurations. Herein, inspired by the similarity between the atomic structures and graphs, we describe a few-shot learning framework based on an equivariant graph neural network (EGNN) to analyze a library of atomic structures (e.g., vacancies, phases, grain boundaries, doping, etc.), showing significantly promoted robustness and three orders of magnitude reduced computing parameters compared to the image-driven DL models, which is especially evident for those aggregated vacancy lines with flexible lattice distortion. Besides, the intuitiveness of graphs enables quantitative and straightforward extraction of the atomic-scale structural features in batches, thus statistically unveiling the self-assembly dynamics of vacancy lines under electron beam irradiation. A versatile model toolkit is established by integrating EGNN sub-models for single structure recognition to process images involving varied configurations in the form of a task chain, leading to the discovery of novel doping configurations with superior electrocatalytic properties for hydrogen evolution reactions. This work provides a powerful tool to explore structure diversity in a fast, accurate, and intelligent manner.
Authors: Dairazalia S\'anchez-Cort\'es, Sergio Burdisso, Esa\'u Villatoro-Tello, Petr Motlicek
Abstract: Bias assessment of news sources is paramount for professionals, organizations, and researchers who rely on truthful evidence for information gathering and reporting. While certain bias indicators are discernible from content analysis, descriptors like political bias and fake news pose greater challenges. In this paper, we propose an extension to a recently presented news media reliability estimation method that focuses on modeling outlets and their longitudinal web interactions. Concretely, we assess the classification performance of four reinforcement learning strategies on a large news media hyperlink graph. Our experiments, targeting two challenging bias descriptors, factual reporting and political bias, showed a significant performance improvement at the source media level. Additionally, we validate our methods on the CLEF 2023 CheckThat! Lab challenge, outperforming the reported results in both F1-score and the official MAE metric. Furthermore, we contribute by releasing the largest annotated dataset of news source media, categorized with factual reporting and political bias labels. Our findings suggest that profiling news media sources based on their hyperlink interactions over time is feasible, offering a bird's-eye view of evolving media landscapes.
Authors: Maximilian Augustin, Syed Shakib Sarwar, Mostafa Elhoushi, Sai Qian Zhang, Yuecheng Li, Barbara De Salvo
Abstract: Following their success in natural language processing (NLP), there has been a shift towards transformer models in computer vision. While transformers perform well and offer promising multi-tasking performance, due to their high compute requirements, many resource-constrained applications still rely on convolutional or hybrid models that combine the benefits of convolution and attention layers and achieve the best results in the sub 100M parameter range. Simultaneously, task adaptation techniques that allow for the use of one shared transformer backbone for multiple downstream tasks, resulting in great storage savings at negligible cost in performance, have not yet been adopted for hybrid transformers. In this work, we investigate how to achieve the best task-adaptation performance and introduce PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers. We further combine PETAH adaptation with pruning to achieve highly performant and storage friendly models for multi-tasking. In our extensive evaluation on classification and other vision tasks, we demonstrate that our PETAH-adapted hybrid models outperform established task-adaptation techniques for ViTs while requiring fewer parameters and being more efficient on mobile hardware.
Authors: Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Abstract: As large language models (LLMs) are widely applied across various fields, model compression has become increasingly crucial for reducing costs and improving inference efficiency. Post-training pruning is a promising method that does not require resource-intensive iterative training and only needs a small amount of calibration data to assess the importance of parameters. Previous research has primarily focused on designing advanced pruning methods, while the impact of different calibration data on pruning performance still lacks systematic exploration. We fill this gap and surprisingly observe that the choice of calibration data matters even more than the design of advanced pruning strategies, especially at high sparsity. Our preliminary exploration also discloses that using calibration data similar to the training data can yield better performance. As pre-training data is usually inaccessible for advanced LLMs, we further provide a self-generating calibration data synthesis strategy to construct feasible calibration data. We conduct experiments on recent strong open-source LLMs (e.g., DCLM and LLaMA-3), and the results show that the proposed method outperforms commonly used calibration data and can effectively enhance strong pruning methods (e.g., Wanda, OWL).
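To see where calibration data enters a pruning criterion, the sketch below computes Wanda-style importance scores for one linear layer, in which the calibration activations directly scale the weight magnitudes; this is a generic illustration of that published criterion, not the paper's new calibration-data synthesis strategy.

    import torch

    def wanda_importance(weight, calib_activations):
        """Per-weight importance |W_ij| * ||X_j||_2, where X comes from calibration data.

        weight            : (out_features, in_features) linear-layer weight matrix
        calib_activations : (n_tokens, in_features) layer inputs collected on the calibration set
        """
        act_norm = calib_activations.norm(p=2, dim=0)        # (in_features,) per-channel norms
        return weight.abs() * act_norm.unsqueeze(0)          # smallest scores are pruned first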
Authors: Tao Yu, Zhaonian Zou, Hao Xiong
Abstract: Index tuning is crucial for optimizing database performance by selecting optimal indexes based on workload. The key to this process lies in an accurate and efficient benefit estimator. Traditional methods relying on what-if tools often suffer from inefficiency and inaccuracy. In contrast, learning-based models provide a promising alternative but face challenges such as instability, lack of interpretability, and complex management. To overcome these limitations, we adopt a novel approach: quantifying the uncertainty in learning-based models' results, thereby combining the strengths of both traditional and learning-based methods for reliable index tuning. We propose Beauty, the first uncertainty-aware framework that enhances learning-based models with uncertainty quantification and uses what-if tools as a complementary mechanism to improve reliability and reduce management complexity. Specifically, we introduce a novel method that combines AutoEncoder and Monte Carlo Dropout to jointly quantify uncertainty, tailored to the characteristics of benefit estimation tasks. In experiments involving sixteen models, our approach outperformed existing uncertainty quantification methods in the majority of cases. We also conducted index tuning tests on six datasets. By applying the Beauty framework, we eliminated worst-case scenarios and more than tripled the occurrence of best-case scenarios.
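A minimal sketch of the Monte Carlo Dropout half of the uncertainty estimate is shown below: dropout is kept active at inference and the spread of repeated predictions serves as the uncertainty signal; the AutoEncoder component and the exact combination rule used by Beauty are not reproduced here.

    import torch

    def mc_dropout_uncertainty(model, x, n_samples=30):
        """Predictive mean and standard deviation from repeated stochastic forward passes.

        model : learned benefit estimator containing dropout layers
        x     : encoded (workload, candidate-index) features, shape (batch, d)
        """
        model.train()                       # keep dropout active at inference time
        with torch.no_grad():
            preds = torch.stack([model(x) for _ in range(n_samples)], dim=0)
        return preds.mean(dim=0), preds.std(dim=0)   # the std acts as the uncertainty estimate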
Authors: Yousef Yeganeh, Rachmadio Lazuardi, Amir Shamseddin, Emine Dari, Yash Thirani, Nassir Navab, Azade Farshad
Abstract: Surgical data science (SDS) is a field that analyzes patient data before, during, and after surgery to improve surgical outcomes and skills. However, surgical data is scarce, heterogeneous, and complex, which limits the applicability of existing machine learning methods. In this work, we introduce the novel task of future video generation in laparoscopic surgery. This task can augment and enrich the existing surgical data and enable various applications, such as simulation, analysis, and robot-aided surgery. Ultimately, it involves not only understanding the current state of the operation but also accurately predicting the dynamic and often unpredictable nature of surgical procedures. Our proposed method, VISAGE (VIdeo Synthesis using Action Graphs for Surgery), leverages the power of action scene graphs to capture the sequential nature of laparoscopic procedures and utilizes diffusion models to synthesize temporally coherent video sequences. VISAGE predicts the future frames given only a single initial frame, and the action graph triplets. By incorporating domain-specific knowledge through the action graph, VISAGE ensures the generated videos adhere to the expected visual and motion patterns observed in real laparoscopic procedures. The results of our experiments demonstrate high-fidelity video generation for laparoscopy procedures, which enables various applications in SDS.
Authors: Nils Blank, Moritz Reuss, Marcel R\"uhle, \"Omer Erdin\c{c} Ya\u{g}murlu, Fabian Wenzel, Oier Mees, Rudolf Lioutikov
Abstract: A central challenge towards developing robots that can relate human language to their perception and actions is the scarcity of natural language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, hindering their scalability. To this end, we introduce NILS: Natural language Instruction Labeling for Scalability. NILS automatically labels uncurated, long-horizon robot data at scale in a zero-shot manner without any human intervention. NILS combines pretrained vision-language foundation models in order to detect objects in a scene, detect object-centric changes, segment tasks from large datasets of unlabelled interaction data and ultimately label behavior datasets. Evaluations on BridgeV2, Fractal, and a kitchen play dataset show that NILS can autonomously annotate diverse robot demonstrations of unlabeled and unstructured datasets while alleviating several shortcomings of crowdsourced human annotations, such as low data quality and diversity. We use NILS to label over 115k trajectories obtained from over 430 hours of robot data. We open-source our auto-labeling code and generated annotations on our website: http://robottasklabeling.github.io.
Authors: Kai Wang, Yuanchao Bai, Daxin Li, Deming Zhai, Junjun Jiang, Xianming Liu
Abstract: Recent advances in learning-based methods have markedly enhanced the capabilities of image compression. However, these methods struggle with high bit-depth volumetric medical images, facing issues such as degraded performance, increased memory demand, and reduced processing speed. To address these challenges, this paper presents the Bit-Division based Lossless Volumetric Image Compression (BD-LVIC) framework, which is tailored for high bit-depth medical volume compression. The BD-LVIC framework skillfully divides the high bit-depth volume into two lower bit-depth segments: the Most Significant Bit-Volume (MSBV) and the Least Significant Bit-Volume (LSBV). The MSBV concentrates on the most significant bits of the volumetric medical image, capturing vital structural details in a compact manner. This reduction in complexity greatly improves compression efficiency using traditional codecs. Conversely, the LSBV deals with the least significant bits, which encapsulate intricate texture details. To compress this detailed information effectively, we introduce an effective learning-based compression model equipped with a Transformer-Based Feature Alignment Module, which exploits both intra-slice and inter-slice redundancies to accurately align features. Subsequently, a Parallel Autoregressive Coding Module merges these features to precisely estimate the probability distribution of the least significant bit-planes. Our extensive testing demonstrates that the BD-LVIC framework not only sets new performance benchmarks across various datasets but also maintains a competitive coding speed, highlighting its significant potential and practical utility in the realm of volumetric medical image compression.
Authors: Danilo de Oliveira, Julius Richter, Jean-Marie Lemercier, Simon Welker, Timo Gerkmann
Abstract: Diffusion models have found great success in generating high quality, natural samples of speech, but their potential for density estimation for speech has so far remained largely unexplored. In this work, we leverage an unconditional diffusion model trained only on clean speech for the assessment of speech quality. We show that the quality of a speech utterance can be assessed by estimating the likelihood of a corresponding sample in the terminating Gaussian distribution, obtained via a deterministic noising process. The resulting method is purely unsupervised, trained only on clean speech, and therefore does not rely on annotations. Our diffusion-based approach leverages clean speech priors to assess quality based on how the input relates to the learned distribution of clean data. Our proposed log-likelihoods show promising results, correlating well with intrusive speech quality metrics such as POLQA and SI-SDR.
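At its core, the quality score described above comes down to evaluating a log-density under the terminating Gaussian of the noising process. A minimal sketch of that final step follows, assuming (only for illustration) that the deterministic noising has already mapped the utterance to a latent vector z and that the terminal distribution is a standard normal.

import numpy as np

def standard_normal_loglik(z):
    # log N(z; 0, I) = -0.5 * (d * log(2*pi) + ||z||^2) for a d-dimensional latent.
    d = z.shape[-1]
    return -0.5 * (d * np.log(2 * np.pi) + np.sum(z ** 2, axis=-1))

rng = np.random.default_rng(0)
z_clean = rng.normal(size=(1, 64))   # latent of a clean-like utterance
z_far = z_clean + 3.0                # latent far from the clean-speech prior
print(standard_normal_loglik(z_clean), standard_normal_loglik(z_far))
# A higher log-likelihood indicates an utterance closer to the learned clean-speech distribution.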
Authors: K V Srinanda, M Manvith Prabhu, Shyam Lal
Abstract: This manuscript summarizes work on the Capsule Vision Challenge 2024 by MISAHUB. To address the multi-class disease classification task, which is challenging due to the complexity and imbalance in the Capsule Vision challenge dataset, this paper proposes CASCRNet (Capsule endoscopy-Aspp-SCR-Network), a parameter-efficient and novel model that uses Shared Channel Residual (SCR) blocks and Atrous Spatial Pyramid Pooling (ASPP) blocks. Further, the performance of the proposed model is compared with other well-known approaches. The experimental results show that the proposed model provides better disease classification results. The proposed model was successful in classifying diseases with an F1 Score of 78.5% and a Mean AUC of 98.3%, which is promising given its compact architecture.
Authors: Wenfang Yao, Chen Liu, Kejing Yin, William K. Cheung, Jing Qin
Abstract: Integrating multi-modal clinical data, such as electronic health records (EHR) and chest X-ray images (CXR), is particularly beneficial for clinical prediction tasks. However, in a temporal setting, multi-modal data are often inherently asynchronous. EHR can be continuously collected but CXR is generally taken with a much longer interval due to its high cost and radiation dose. When clinical prediction is needed, the last available CXR image might have been outdated, leading to suboptimal predictions. To address this challenge, we propose DDL-CXR, a method that dynamically generates an up-to-date latent representation of the individualized CXR images. Our approach leverages latent diffusion models for patient-specific generation strategically conditioned on a previous CXR image and EHR time series, providing information regarding anatomical structures and disease progressions, respectively. In this way, the interaction across modalities could be better captured by the latent CXR generation process, ultimately improving the prediction performance. Experiments using MIMIC datasets show that the proposed model could effectively address asynchronicity in multimodal fusion and consistently outperform existing methods.
Authors: Zhihao Liu, Simon Filhol, D\'esir\'ee Treichler
Abstract: Estimating the variability of seasonal snow cover, in particular snow depth in remote areas, poses significant challenges due to limited spatial and temporal data availability. This study uses snow depth measurements from the ICESat-2 satellite laser altimeter, which are sparse in both space and time, and incorporates them with climate reanalysis data into a downscaling-calibration scheme to produce monthly gridded snow depth maps at microscale (10 m). Snow surface elevation measurements from ICESat-2 along profiles are compared to a digital elevation model to determine snow depth at each point. To efficiently turn sparse measurements into snow depth maps, a regression model is fitted to establish a relationship between the retrieved snow depth and the corresponding ERA5 Land snow depth. This relationship, referred to as subgrid variability, is then applied to downscale the monthly ERA5 Land snow depth data. The method can provide time series of monthly snow depth maps for the entire ERA5 time range (since 1950). The validation of downscaled snow depth data was performed at an intermediate scale (100 m x 500 m) using datasets from airborne laser scanning (ALS) in the Hardangervidda region of southern Norway. Results show that snow depth prediction achieved R2 values ranging from 0.74 to 0.88 (post-calibration). The method relies on globally available data and is applicable to other snow regions above the treeline. Though requiring area-specific calibration, our approach has the potential to provide snow depth maps in areas where no such data exist and can be used to extrapolate existing snow surveys in time and over larger areas. With this, it can offer valuable input data for hydrological, ecological or permafrost modeling tasks.
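A minimal sketch of the calibration step described above: fit a regression between the sparse retrieved snow depths and the coarse ERA5-Land depths at the same locations, then apply the fitted relationship to every cell of the coarse grid to produce a downscaled map. The linear model, the absence of terrain predictors, and all numbers are simplifying assumptions made only for this illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
# Sparse ICESat-2-style retrievals paired with coarse ERA5-Land snow depth (metres)
era5_at_points = rng.uniform(0.2, 2.0, size=(200, 1))
retrieved_depth = 0.8 * era5_at_points[:, 0] + rng.normal(0, 0.1, size=200)

# Fit the subgrid relationship on the sparse pairs
reg = LinearRegression().fit(era5_at_points, retrieved_depth)

# Apply it to every cell of a coarse monthly ERA5-Land grid to obtain a downscaled map
era5_grid = rng.uniform(0.2, 2.0, size=(50, 50))
downscaled = reg.predict(era5_grid.reshape(-1, 1)).reshape(era5_grid.shape)
print("R^2 on the calibration points:", reg.score(era5_at_points, retrieved_depth))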
Authors: Shiyue Zhang, Ziheng Cheng, Cheng Zhang
Abstract: Particle-based variational inference methods (ParVIs) use non-parametric variational families represented by particles to approximate the target distribution according to the kernelized Wasserstein gradient flow for the Kullback-Leibler (KL) divergence. Recent works introduce functional gradient flows to substitute the kernel for better flexibility. However, the deterministic updating mechanism may suffer from limited exploration and require expensive repetitive runs for new samples. In this paper, we propose Semi-Implicit Functional Gradient flow (SIFG), a functional gradient ParVI method that uses perturbed particles as the approximation family. The corresponding functional gradient flow, which can be estimated via denoising score matching, exhibits strong theoretical convergence guarantees. We also present an adaptive version of our method to automatically choose a suitable noise magnitude. Extensive experiments demonstrate the effectiveness and efficiency of the proposed framework on both simulated and real data problems.
Authors: Biman Barua, M. Shamim Kaiser
Abstract: This research investigates how implementing AI algorithms within a microservices architecture can enhance travel itineraries with respect to cost, time, user preferences, and environmental sustainability. It uses machine learning models for cost forecasting and personalization, a genetic algorithm for itinerary optimization, and heuristics for sustainability checking. The primary evaluation parameters are latency, the ability to satisfy user preferences, cost, and environmental impact. The experimental results show an average response time of 4.5 seconds under 1000 concurrent users and 92% accuracy in matching user preferences. Cost efficiency is demonstrated, with 95% of generated trips falling within the budget declared by the user. The system also takes measures to mitigate the negative externalities of travel: 60% of the offered travel plans incorporated green options, resulting in carbon emissions on average 15% lower than those of traditional travel plans. The genetic algorithm, with time complexity O(g·p·f), reaches the optimal solution within 100 generations, and each iteration improves solution quality by 5%, enabling effective use in optimization problems where time is measured in seconds. Finally, the system is designed to be fault-tolerant, with 99.9% availability, allowing it to provide services even when demand exceeds expectations. This microservices-based architecture makes the travel optimization platform dynamic and efficient, providing enhanced scaling, asynchronous communication, and real-time changes. By incorporating AI together with cost-control and eco-friendliness mechanisms, the system addresses diverse user needs in the present-day travel business.
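To make the O(g·p·f) cost structure concrete, below is a self-contained genetic-algorithm sketch for ordering itinerary stops by travel cost: g generations, a population of p candidate itineraries, and one fitness evaluation of cost f per candidate per generation. The distance matrix, population size, and mutation scheme are illustrative assumptions and not the system described in the abstract.

import numpy as np

rng = np.random.default_rng(0)
n_stops, pop_size, generations = 8, 40, 100
dist = rng.uniform(10, 100, size=(n_stops, n_stops))
dist = (dist + dist.T) / 2                       # symmetric travel-cost matrix

def cost(route):                                 # fitness f: total travel cost of a route
    return sum(dist[route[i], route[i + 1]] for i in range(len(route) - 1))

def mutate(route):                               # swap two stops to create a child itinerary
    a, b = rng.choice(len(route), size=2, replace=False)
    child = route.copy()
    child[a], child[b] = child[b], child[a]
    return child

population = [rng.permutation(n_stops) for _ in range(pop_size)]
for _ in range(generations):                     # g generations
    population.sort(key=cost)                    # p fitness evaluations per generation
    elites = population[: pop_size // 2]
    population = elites + [mutate(e) for e in elites]
print("best itinerary cost:", cost(min(population, key=cost)))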
Authors: Ankur Nath, Alan Kuhnle
Abstract: Modern instances of combinatorial optimization problems often exhibit billion-scale ground sets, which have many uninformative or redundant elements. In this work, we develop light-weight pruning algorithms to quickly discard elements that are unlikely to be part of an optimal solution. Under mild assumptions on the instance, we prove theoretical guarantees on the fraction of the optimal value retained and the size of the resulting pruned ground set. Through extensive experiments on real-world datasets for various applications, we demonstrate that our algorithm, QuickPrune, efficiently prunes over 90% of the ground set and outperforms state-of-the-art classical and machine learning heuristics for pruning.
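The following is not the QuickPrune algorithm itself but a hedged sketch of the general idea the abstract describes: cheaply score every ground-set element (here by its singleton value under a toy coverage objective) and discard elements whose score falls below a threshold before handing the reduced ground set to an expensive optimizer. The coverage instance and the half-of-maximum threshold are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(1)
n_elements, n_items = 10_000, 500
# Toy coverage instance: element e covers a random small subset of items
covers = rng.random((n_elements, n_items)) < 0.01

# Cheap proxy score f({e}): how many items element e covers on its own
scores = covers.sum(axis=1)
threshold = 0.5 * scores.max()          # illustrative pruning rule, not QuickPrune's
kept = np.flatnonzero(scores >= threshold)
print(f"kept {kept.size} of {n_elements} elements "
      f"({100 * (1 - kept.size / n_elements):.1f}% pruned)")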
Authors: Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He
Abstract: Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these synthetic examples, the LLM can improve its performance on domain-specific RAG tasks. Experiments on 11 datasets, spanning two backbone sizes and three domains, demonstrate that SimRAG outperforms baselines by 1.2\%--8.6\%.
Authors: William Cagas, Chan Ko, Blake Hsiao, Shryuk Grandhi, Rishi Bhattacharya, Kevin Zhu, Michael Lam
Abstract: The proliferation of machine learning models in diverse clinical applications has led to a growing need for high-fidelity medical image training data. Such data is often scarce due to cost constraints and privacy concerns. Alleviating this burden, medical image synthesis via generative adversarial networks (GANs) has emerged as a powerful method for synthetically generating photo-realistic images based on existing sets of real medical images. However, the exact image set size required to efficiently train such a GAN is unclear. In this work, we experimentally establish benchmarks that measure the relationship between a sample dataset size and the fidelity of the generated images, given the dataset's distribution of image complexities. We analyze statistical metrics based on delentropy, an image complexity measure rooted in Shannon's entropy in information theory. For our pipeline, we conduct experiments with two state-of-the-art GANs, StyleGAN 3 and SPADE-GAN, trained on multiple medical imaging datasets with variable sample sizes. Across both GANs, general performance improved with increasing training set size but suffered with increasing complexity.
Authors: Horacio Thompson, Marcelo Errecalde
Abstract: The eRisk laboratory aims to address issues related to early risk detection on the Web. In this year's edition, three tasks were proposed, where Task 2 was about early detection of signs of anorexia. Early risk detection is a problem where precision and speed are two crucial objectives. Our research group solved Task 2 by defining a CPI+DMC approach, addressing both objectives independently, and a time-aware approach, where precision and speed are considered as a single combined objective. We implemented the latter approach by explicitly integrating time during the learning process, considering the ERDE$_{\theta}$ metric as the training objective. It also allowed us to incorporate temporal metrics to validate and select the optimal models. We achieved outstanding results for the ERDE$_{50}$ metric and ranking-based metrics, demonstrating consistency in solving ERD problems.
Authors: Shiqi Chen, Yuhang Li, Hanlong Chen, Aydogan Ozcan
Abstract: Generative models cover various application areas, including image, video and music synthesis, natural language processing, and molecular design, among many others. As digital generative models become larger, scalable inference in a fast and energy-efficient manner becomes a challenge. Here, we present optical generative models inspired by diffusion models, where a shallow and fast digital encoder first maps random noise into phase patterns that serve as optical generative seeds for a desired data distribution; a jointly-trained free-space-based reconfigurable decoder all-optically processes these generative seeds to create novel images (never seen before) following the target data distribution. Except for the illumination power and the random seed generation through a shallow encoder, these optical generative models do not consume computing power during the synthesis of novel images. We report the optical generation of monochrome and multi-color novel images of handwritten digits, fashion products, butterflies, and human faces, following the data distributions of MNIST, Fashion MNIST, Butterflies-100, and Celeb-A datasets, respectively, achieving an overall performance comparable to digital neural network-based generative models. To experimentally demonstrate optical generative models, we used visible light to generate, in a snapshot, novel images of handwritten digits and fashion products. These optical generative models might pave the way for energy-efficient, scalable and rapid inference tasks, further exploiting the potentials of optics and photonics for artificial intelligence-generated content.
Authors: Prashanth S Velayudhan, Xiaoqiao Xu, Prajkta Kallurkar, Ana Patricia Balbon, Maria T Secara, Adam Taback, Denise Sabac, Nicholas Chan, Shihao Ma, Bo Wang, Daniel Felsky, Stephanie H Ameis, Brian Cox, Colin Hawco, Lauren Erdman, Anne L Wheeler
Abstract: metasnf is an R package that enables users to apply meta clustering, a method for efficiently searching a broad space of cluster solutions by clustering the solutions themselves, to clustering workflows based on similarity network fusion (SNF). SNF is a multi-modal data integration algorithm commonly used for biomedical subtype discovery. The package also contains functions to assist with cluster visualization, characterization, and validation. This package can help researchers identify SNF-derived cluster solutions that are guided by context-specific utility over context-agnostic measures of quality.
Authors: Zihan Zhou, Animesh Garg, Dieter Fox, Caelan Garrett, Ajay Mandlekar
Abstract: Robot learning has proven to be a general and effective technique for programming manipulators. Imitation learning is able to teach robots solely from human demonstrations but is bottlenecked by the capabilities of the demonstrations. Reinforcement learning uses exploration to discover better behaviors; however, the space of possible improvements can be too large to start from scratch. For both techniques, the learning difficulty increases in proportion to the length of the manipulation task. Accounting for this, we propose SPIRE, a system that first uses Task and Motion Planning (TAMP) to decompose tasks into smaller learning subproblems and second combines imitation and reinforcement learning to maximize their strengths. We develop novel strategies to train learning agents when deployed in the context of a planning system. We evaluate SPIRE on a suite of long-horizon and contact-rich robot manipulation problems. We find that SPIRE outperforms prior approaches that integrate imitation learning, reinforcement learning, and planning by 35% to 50% in average task performance, is 6 times more data efficient in the number of human demonstrations needed to train proficient agents, and learns to complete tasks nearly twice as efficiently. See https://sites.google.com/view/spire-corl-2024 for more details.
Authors: Suchisrit Gangopadhyay, Xien Chen, Michael Chu, Patrick Rim, Hyoungseob Park, Alex Wong
Abstract: We propose UnCLe, a standardized benchmark for Unsupervised Continual Learning of a multimodal depth estimation task: Depth completion aims to infer a dense depth map from a synchronized pair of an RGB image and a sparse depth map. We benchmark depth completion models under the practical scenario of unsupervised learning over continuous streams of data. Existing methods are typically trained on a static, or stationary, dataset. However, when adapting to novel non-stationary distributions, they "catastrophically forget" previously learned information. UnCLe simulates these non-stationary distributions by adapting depth completion models to sequences of datasets containing diverse scenes captured from distinct domains using different visual and range sensors. We adopt representative methods from continual learning paradigms and translate them to enable unsupervised continual learning of depth completion. We benchmark these models on indoor and outdoor scenes and investigate the degree of catastrophic forgetting through standard quantitative metrics. Furthermore, we introduce model inversion quality as an additional measure of forgetting. We find that unsupervised continual learning of depth completion is an open problem, and we invite researchers to leverage UnCLe as a development platform.
Authors: Shauli Ravfogel, Michael Twiton, Yoav Goldberg, Ryan Cotterell
Abstract: Modern neural models trained on textual data rely on pre-trained representations that emerge without direct supervision. As these representations are increasingly being used in real-world applications, the inability to \emph{control} their content becomes an increasingly important problem. We formulate the problem of identifying and erasing a linear subspace that corresponds to a given concept, in order to prevent linear predictors from recovering the concept. We model this problem as a constrained, linear maximin game, and show that existing solutions are generally not optimal for this task. We derive a closed-form solution for certain objectives, and propose a convex relaxation that works well for others. When evaluated in the context of binary gender removal, the method recovers a low-dimensional subspace whose removal mitigates bias by intrinsic and extrinsic evaluation. We show that the method is highly expressive, effectively mitigating bias in deep nonlinear classifiers while maintaining tractability and interpretability.
Authors: Seanie Lee, Bruno Andreis, Kenji Kawaguchi, Juho Lee, Sung Ju Hwang
Abstract: Meta-learning approaches enable machine learning systems to adapt to new tasks given few examples by leveraging knowledge from related tasks. However, a large number of meta-training tasks are still required for generalization to unseen tasks during meta-testing, which introduces a critical bottleneck for real-world problems that come with only few tasks, due to various reasons including the difficulty and cost of constructing tasks. Recently, several task augmentation methods have been proposed to tackle this issue using domain-specific knowledge to design augmentation techniques to densify the meta-training task distribution. However, such reliance on domain-specific knowledge renders these methods inapplicable to other domains. While Manifold Mixup based task augmentation methods are domain-agnostic, we empirically find them ineffective on non-image domains. To tackle these limitations, we propose a novel domain-agnostic task augmentation method, Meta-Interpolation, which utilizes expressive neural set functions to densify the meta-training task distribution using bilevel optimization. We empirically validate the efficacy of Meta-Interpolation on eight datasets spanning across various domains such as image classification, molecule property prediction, text classification and speech recognition. Experimentally, we show that Meta-Interpolation consistently outperforms all the relevant baselines. Theoretically, we prove that task interpolation with the set function regularizes the meta-learner to improve generalization.
Authors: Guangyu Sun, Umar Khalid, Matias Mendieta, Taojiannan Yang, Pu Wang, Minwoo Lee, Chen Chen
Abstract: Federated learning (FL) has emerged as a promising paradigm for enabling the collaborative training of models without centralized access to the raw data on local devices. In the typical FL paradigm (e.g., FedAvg), model weights are sent to and from the server each round to participating clients. Recently, the use of small pre-trained models has been shown effective in federated learning optimization and improving convergence. However, recent state-of-the-art pre-trained models are getting more capable but also have more parameters. In conventional FL, sharing the enormous model weights can quickly put a massive communication burden on the system, especially if more capable models are employed. Can we find a solution to enable those strong and readily-available pre-trained models in FL to achieve excellent performance while simultaneously reducing the communication burden? To this end, we investigate the use of parameter-efficient fine-tuning in federated learning and thus introduce a new framework: FedPEFT. Specifically, we systematically evaluate the performance of FedPEFT across a variety of client stability, data distribution, and differential privacy settings. By only locally tuning and globally sharing a small portion of the model weights, significant reductions in the total communication overhead can be achieved while maintaining competitive or even better performance in a wide range of federated learning scenarios, providing insight into a new paradigm for practical and effective federated systems.
Authors: Li Zeng, Xiaoliang Wan, Tao Zhou
Abstract: In this paper, we develop an invertible mapping, called B-KRnet, on a bounded domain and apply it to density estimation/approximation for data or the solutions of PDEs such as the Fokker-Planck equation and the Keller-Segel equation. Similar to KRnet, the structure of B-KRnet adapts the pseudo-triangular structure into a normalizing flow model. The main difference between B-KRnet and KRnet is that B-KRnet is defined on a hypercube while KRnet is defined on the whole space, in other words, a new mechanism is introduced in B-KRnet to maintain the exact invertibility. Using B-KRnet as a transport map, we obtain an explicit probability density function (PDF) model that corresponds to the pushforward of a prior (uniform) distribution on the hypercube. It can be directly applied to density estimation when only data are available. By coupling KRnet and B-KRnet, we define a deep generative model on a high-dimensional domain where some dimensions are bounded and other dimensions are unbounded. A typical case is the solution of the stationary kinetic Fokker-Planck equation, which is a PDF of position and momentum. Based on B-KRnet, we develop an adaptive learning approach to approximate partial differential equations whose solutions are PDFs or can be treated as PDFs. A variety of numerical experiments is presented to demonstrate the effectiveness of B-KRnet.
Authors: Jiaxing Zhang, Zhuomin Chen, Hao Mei, Dongsheng Luo, Hua Wei
Abstract: Graph regression is a fundamental task and has received increasing attention in a wide range of graph learning tasks. However, the inference process is often not interpretable. Most existing explanation techniques are limited to understanding GNN behaviors in classification tasks. In this work, we seek explanations to interpret graph regression models (XAIG-R). We show that existing methods overlook the distribution shift and the continuously ordered decision boundary, which hinders them from being applied to regression tasks. To address these challenges, we propose a novel objective based on the information bottleneck theory and introduce a new mix-up framework, which could support various GNNs in a model-agnostic manner. We further present a contrastive learning strategy to tackle the continuously ordered labels in regression tasks. To empirically verify the effectiveness of the proposed method, we introduce three benchmark datasets and a real-life dataset for evaluation. Extensive experiments show the effectiveness of the proposed method in interpreting GNN models in regression tasks.
Authors: Richard Nock, Mathieu Guillame-Bert
Abstract: We focus on generative AI for a type of data that still represents one of the most prevalent forms of data: tabular data. Our paper introduces two key contributions: a new powerful class of forest-based models fit for such tasks and a simple training algorithm with strong convergence guarantees in a boosting model that parallels that of the original weak / strong supervised learning setting. This algorithm can be implemented by a few tweaks to the most popular induction scheme for decision trees (i.e. supervised learning) with two classes. Experiments on the quality of generated data display substantial improvements compared to the state of the art. The losses our algorithm minimizes and the structure of our models make them practical for related tasks that require fast estimation of a density given a generative model and an observation (even partially specified): such tasks include missing data imputation and density estimation. Additional experiments on these tasks reveal that our models can be notably good contenders to diverse state of the art methods, relying on models as diverse as (or mixing elements of) trees, neural nets, kernels or graphical models.
Authors: Dengwang Tang, Dongze Ye, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo
Abstract: Learning in POMDPs is known to be significantly harder than in MDPs. In this paper, we consider the online learning problem for episodic POMDPs with unknown transition and observation models. We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs (PS4POMDPs), which is much simpler and more implementable compared to state-of-the-art optimism-based online learning algorithms for POMDPs. We show that the Bayesian regret of the proposed algorithm scales as the square root of the number of episodes and is polynomial in the other parameters. In a general setting, the regret scales exponentially in the horizon length $H$, and we show that this is inevitable by providing a lower bound. However, when the POMDP is undercomplete and weakly revealing (a common assumption in the recent literature), we establish a polynomial Bayesian regret bound. We finally propose a posterior sampling algorithm for multi-agent POMDPs, and show it too has sublinear regret.
Authors: Naoki Sato, Hideaki Iiduka
Abstract: The graduated optimization approach is a heuristic method for finding global optimal solutions for nonconvex functions by using a function smoothing operation with stochastic noise. We show that stochastic noise in stochastic gradient descent (SGD) has the effect of smoothing the objective function, the degree of which is determined by the learning rate, batch size, and variance of the stochastic gradient. Using this finding, we propose and analyze a new graduated optimization algorithm that varies the degree of smoothing by varying the learning rate and batch size, and provide experimental results on image classification tasks with ResNets that support our theoretical findings. We further show that there is an interesting correlation between the degree of smoothing by SGD's stochastic noise, the well-studied ``sharpness'' indicator, and the generalization performance of the model.
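A minimal numerical sketch of the idea that SGD's noise smooths the objective, with the degree of smoothing controlled by the learning rate and batch size: we minimize a 1-D nonconvex function with noisy gradients whose standard deviation shrinks as the batch grows, and gradually shrink the learning rate while growing the batch so that later stages are less smoothed. The test function, noise model, and schedule are illustrative assumptions, not the paper's algorithm.

import numpy as np

rng = np.random.default_rng(0)

def grad(x):
    # Gradient of the nonconvex toy objective f(x) = sin(5x) + 0.1 * x**2
    return 5.0 * np.cos(5.0 * x) + 0.2 * x

x = 3.0
# (learning rate, batch size): later stages correspond to less smoothing
schedule = [(0.05, 4), (0.02, 16), (0.005, 64)]
for lr, batch in schedule:
    for _ in range(2000):
        # Mini-batch gradient noise; its std shrinks like 1/sqrt(batch),
        # and the induced smoothing also shrinks with the learning rate.
        noisy_grad = grad(x) + rng.normal(0.0, 2.0 / np.sqrt(batch))
        x -= lr * noisy_grad
print("final x:", x, " f(x):", np.sin(5.0 * x) + 0.1 * x * x)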
Authors: Jack Sandberg, Niklas {\AA}kerblom, Morteza Haghir Chehreghani
Abstract: We consider the combinatorial volatile Gaussian process (GP) semi-bandit problem. Each round, an agent is provided a set of available base arms and must select a subset of them to maximize the long-term cumulative reward. We study the Bayesian setting and provide novel Bayesian cumulative regret bounds for three GP-based algorithms: GP-UCB, GP-BayesUCB and GP-TS. Our bounds extend previous results for GP-UCB and GP-TS to the infinite, volatile and combinatorial setting, and to the best of our knowledge, we provide the first regret bound for GP-BayesUCB. Volatile arms encompass other widely considered bandit problems such as contextual bandits. Furthermore, we employ our framework to address the challenging real-world problem of online energy-efficient navigation, where we demonstrate its effectiveness compared to the alternatives.
Authors: Hao Chen, Bhiksha Raj, Xing Xie, Jindong Wang
Abstract: Large foundation models (LFMs) claim incredible performance. Yet great concerns have been raised about their mythic and poorly understood potential, not only in machine learning but also in various other disciplines. In this position paper, we identify a neglected issue deeply rooted in LFMs: Catastrophic Inheritance, describing how weaknesses and limitations of biased large-scale pre-training data, including corrupted, long-tailed, noisy, and out-of-distribution samples, are inherited by the behaviors of LFMs on downstream tasks. Such inheritance can potentially cause catastrophes in downstream applications, such as bias, lack of generalization, deteriorated performance, security vulnerability, privacy leakage, and value misalignment. We discuss the challenges behind this issue and propose UIM, a framework to Understand the catastrophic inheritance of LFMs from both pre-training and downstream adaptation, Interpret its implications on downstream tasks, and Mitigate it. UIM aims to unite the machine learning and social sciences communities for more responsible and promising AI development and deployment.
Authors: Huanran Chen, Yinpeng Dong, Shitong Shao, Zhongkai Hao, Xiao Yang, Hang Su, Jun Zhu
Abstract: Generative learning, recognized for its effective modeling of data distributions, offers inherent advantages in handling out-of-distribution instances, especially for enhancing robustness to adversarial attacks. Among these, diffusion classifiers, utilizing powerful diffusion models, have demonstrated superior empirical robustness. However, a comprehensive theoretical understanding of their robustness is still lacking, raising concerns about their vulnerability to stronger future attacks. In this study, we prove that diffusion classifiers possess $O(1)$ Lipschitzness, and establish their certified robustness, demonstrating their inherent resilience. To achieve non-constant Lipschitzness, thereby obtaining much tighter certified robustness, we generalize diffusion classifiers to classify Gaussian-corrupted data. This involves deriving the evidence lower bounds (ELBOs) for these distributions, approximating the likelihood using the ELBO, and calculating classification probabilities via Bayes' theorem. Experimental results show the superior certified robustness of these Noised Diffusion Classifiers (NDCs). Notably, we achieve over 80% and 70% certified robustness on CIFAR-10 under adversarial perturbations with \(\ell_2\) norms less than 0.25 and 0.5, respectively, using a single off-the-shelf diffusion model without any additional data.
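The last step of the construction above, turning per-class likelihood estimates into classification probabilities via Bayes' theorem, is simple to write down. The sketch below assumes the per-class ELBO values for one Gaussian-corrupted input have already been computed by a diffusion model and uses a uniform class prior; the numbers are placeholders.

import numpy as np

def classify_from_elbos(elbos, log_prior=None):
    # Bayes' rule with the ELBO used as a stand-in for log p(x | y):
    # p(y | x) is proportional to exp(ELBO_y) * p(y).
    elbos = np.asarray(elbos, dtype=float)
    if log_prior is None:
        log_prior = -np.log(len(elbos)) * np.ones_like(elbos)  # uniform prior
    logits = elbos + log_prior
    logits -= logits.max()                  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Placeholder ELBO values (nats) for a 3-class problem on one corrupted input
print(classify_from_elbos([-1050.2, -1047.9, -1052.6]))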
Authors: Ruofan Wu, Guanhua Fang, Qiying Pan, Mingyang Zhang, Tengfei Liu, Weiqiang Wang
Abstract: Graph representation learning (GRL) is critical for extracting insights from complex network structures, but it also raises security concerns due to potential privacy vulnerabilities in these representations. This paper investigates the structural vulnerabilities in graph neural models where sensitive topological information can be inferred through edge reconstruction attacks. Our research primarily addresses the theoretical underpinnings of similarity-based edge reconstruction attacks (SERA), furnishing a non-asymptotic analysis of their reconstruction capacities. Moreover, we present empirical corroboration indicating that such attacks can perfectly reconstruct sparse graphs as graph size increases. Conversely, we establish that sparsity is a critical factor for SERA's effectiveness, as demonstrated through analysis and experiments on (dense) stochastic block models. Finally, we explore the resilience of private graph representations produced via the noisy aggregation (NAG) mechanism against SERA. Through theoretical analysis and empirical assessments, we affirm that NAG mitigates SERA. In parallel, we empirically delineate instances in which SERA is effective and instances in which it is deficient as an instrument for elucidating the trade-off between privacy and utility.
Authors: Vivian Y. Nastl, Moritz Hardt
Abstract: We study how well machine learning models trained on causal features generalize across domains. We consider 16 prediction tasks on tabular datasets covering applications in health, employment, education, social benefits, and politics. Each dataset comes with multiple domains, allowing us to test how well a model trained in one domain performs in another. For each prediction task, we select features that have a causal influence on the target of prediction. Our goal is to test the hypothesis that models trained on causal features generalize better across domains. Without exception, we find that predictors using all available features, regardless of causality, have better in-domain and out-of-domain accuracy than predictors using causal features. Moreover, even the absolute drop in accuracy from one domain to the other is no better for causal predictors than for models that use all features. In addition, we show that recent causal machine learning methods for domain generalization do not perform better in our evaluation than standard predictors trained on the set of causal features. Likewise, causal discovery algorithms either fail to run or select causal variables that perform no better than our selection. Extensive robustness checks confirm that our findings are stable under variable misclassification.
Authors: Philip A. LeMaitre, Marius Krumm, Hans J. Briegel
Abstract: With the impressive progress of deep learning, applications relying on machine learning are increasingly being integrated into daily life. However, most deep learning models have an opaque, oracle-like nature making it difficult to interpret and understand their decisions. This problem led to the development of the field known as eXplainable Artificial Intelligence (XAI). One method in this field known as Projective Simulation (PS) models a chain-of-thought as a random walk of a particle on a graph with vertices that have concepts attached to them. While this description has various benefits, including the possibility of quantization, it cannot be naturally used to model thoughts that combine several concepts simultaneously. To overcome this limitation, we introduce Multi-Excitation Projective Simulation (mePS), a generalization that considers a chain-of-thought to be a random walk of several particles on a hypergraph. A definition for a dynamic hypergraph is put forward to describe the agent's training history along with applications to AI and hypergraph visualization. An inductive bias inspired by the remarkably successful few-body interaction models used in quantum many-body physics is formalized for our classical mePS framework and employed to tackle the exponential complexity associated with naive implementations of hypergraphs. We prove that our inductive bias reduces the complexity from exponential to polynomial, with the exponent representing the cutoff on how many particles can interact. We numerically apply our method to two toy environments and a more complex scenario modelling the diagnosis of a broken computer. These environments demonstrate the resource savings provided by an appropriate choice of inductive bias, as well as showcasing aspects of interpretability. A quantum model for mePS is also briefly outlined and some future directions for it are discussed.
Authors: Yao Qiang, Xiangyu Zhou, Saleh Zare Zade, Mohammad Amin Roshani, Prashant Khanduri, Douglas Zytko, Dongxiao Zhu
Abstract: The advent of Large Language Models (LLMs) has marked significant achievements in language processing and reasoning capabilities. Despite their advancements, LLMs face vulnerabilities to data poisoning attacks, where adversaries insert backdoor triggers into training data to manipulate outputs for malicious purposes. This work further identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the instruction tuning process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently, ensuring an evasion of detection by conventional defenses while maintaining content integrity. Through experimental validation across various tasks, including sentiment analysis, domain generation, and question answering, our poisoning strategy demonstrates a high success rate in compromising various LLMs' outputs. We further propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL), which effectively rectify the behavior of LLMs and significantly reduce the decline in performance. Our work highlights the significant security risks present during the instruction tuning of LLMs and emphasizes the necessity of safeguarding LLMs against data poisoning attacks.
Authors: Emile Anand, Guannan Qu
Abstract: We study reinforcement learning for global decision-making in the presence of local agents, where the global decision-maker makes decisions affecting all local agents, and the objective is to learn a policy that maximizes the joint rewards of all the agents. Such problems find many applications, e.g. demand response, EV charging, queueing, etc. In this setting, scalability has been a long-standing challenge due to the size of the state space which can be exponential in the number of agents. This work proposes the \texttt{SUBSAMPLE-Q} algorithm where the global agent subsamples $k\leq n$ local agents to compute a policy in time that is polynomial in $k$. We show that this learned policy converges to the optimal policy in the order of $\tilde{O}(1/\sqrt{k}+{\epsilon}_{k,m})$ as the number of sub-sampled agents $k$ increases, where ${\epsilon}_{k,m}$ is the Bellman noise. Finally, we validate the theory through numerical simulations in a demand-response setting and a queueing setting.
Authors: Jianyu Zhang, L\'eon Bottou
Abstract: It is impossible today to pretend that the practice of machine learning is compatible with the idea that training and testing data follow the same distribution. Several authors have recently used ensemble techniques to show how scenarios involving multiple data distributions are best served by representations that are both richer than those obtained by regularizing for the best in-distribution performance, and richer than those obtained under the influence of the implicit sparsity bias of common stochastic gradient procedures. This contribution investigates the use of very high dropout rates instead of ensembles to obtain such rich representations. Although training a deep network from scratch using such dropout rates is virtually impossible, fine-tuning a large pre-trained model under such conditions is not only possible but also achieves out-of-distribution performances that exceed those of both ensembles and weight averaging methods such as model soups. This result has practical significance because the importance of the fine-tuning scenario has considerably grown in recent years. This result also provides interesting insights on the nature of rich representations and on the intrinsically linear nature of fine-tuning a large network using a comparatively small dataset.
Authors: Cassidy Laidlaw, Shivam Singhal, Anca Dragan
Abstract: Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using flawed proxy rewards that seem to capture the true objective. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy, and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. To address this, we introduce a definition of reward hacking based on the correlation between proxy and true rewards for states and actions seen by a "base policy" that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). We then show theoretically that regularization to the base policy can effectively prevent reward hacking. While current RLHF approaches apply a KL penalty between the action distributions of policies, our theory suggests that it is more effective to regularize using the $\chi^2$ divergence between the policies' occupancy measures. We intuitively show why this type of regularization is superior and demonstrate that it better mitigates reward hacking in practice across four realistic domains, including RLHF for LLMs. Our code is available at https://github.com/cassidylaidlaw/orpo.
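To make the two regularizers being compared concrete, here is a small sketch computing the KL and the $\chi^2$ divergence between two discrete occupancy measures; the toy distributions are assumptions, and the point is only that $\chi^2$ penalizes mass placed on states the base policy rarely visits much more heavily than KL does.

import numpy as np

def kl(p, q):
    # KL(p || q) for discrete distributions with full support
    return float(np.sum(p * np.log(p / q)))

def chi2(p, q):
    # chi-squared divergence: sum of (p - q)^2 / q
    return float(np.sum((p - q) ** 2 / q))

base = np.array([0.50, 0.30, 0.15, 0.05])   # base policy's occupancy measure
tuned = np.array([0.20, 0.20, 0.20, 0.40])  # optimized policy drifting onto a rare state
print("KL  (tuned || base):", kl(tuned, base))
print("chi2(tuned || base):", chi2(tuned, base))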
Authors: Baoyu Jing, Dawei Zhou, Kan Ren, Carl Yang
Abstract: Spatiotemporal time series are usually collected via monitoring sensors placed at different locations, which usually contain missing values due to various failures, such as mechanical damages and Internet outages. Imputing the missing values is crucial for analyzing time series. When recovering a specific data point, most existing methods consider all the information relevant to that point regardless of the cause-and-effect relationship. During data collection, it is inevitable that some unknown confounders are included, e.g., background noise in time series and non-causal shortcut edges in the constructed sensor network. These confounders could open backdoor paths and establish non-causal correlations between the input and output. Over-exploiting these non-causal correlations could cause overfitting. In this paper, we first revisit spatiotemporal time series imputation from a causal perspective and show how to block the confounders via the frontdoor adjustment. Based on the results of frontdoor adjustment, we introduce a novel Causality-Aware Spatiotemporal Graph Neural Network (Casper), which contains a novel Prompt Based Decoder (PBD) and a Spatiotemporal Causal Attention (SCA). PBD could reduce the impact of confounders and SCA could discover the sparse causal relationships among embeddings. Theoretical analysis reveals that SCA discovers causal relationships based on the values of gradients. We evaluate Casper on three real-world datasets, and the experimental results show that Casper could outperform the baselines and could effectively discover causal relationships.
Authors: Baoyu Jing, Yansen Wang, Guoxin Sui, Jing Hong, Jingrui He, Yuqing Yang, Dongsheng Li, Kan Ren
Abstract: In recent years, Contrastive Learning (CL) has become a predominant representation learning paradigm for time series. Most existing methods manually build specific CL Strategies (CLS) by human heuristics for certain datasets and tasks. However, manually developing CLS usually requires excessive prior knowledge about the data, and massive experiments to determine the detailed CL configurations. In this paper, we present an Automated Machine Learning (AutoML) practice at Microsoft, which automatically learns CLS for time series datasets and tasks, namely Automated Contrastive Learning (AutoCL). We first construct a principled search space of size over $3\times10^{12}$, covering data augmentation, embedding transformation, contrastive pair construction, and contrastive losses. Further, we introduce an efficient reinforcement learning algorithm, which optimizes CLS from the performance on the validation tasks, to obtain effective CLS within the space. Experimental results on various real-world datasets demonstrate that AutoCL could automatically find the suitable CLS for the given dataset and task. From the candidate CLS found by AutoCL on several public datasets/tasks, we compose a transferable Generally Good Strategy (GGS), which has a strong performance for other datasets. We also provide empirical analysis as a guide for the future design of CLS.
Authors: Tuija Leinonen, David Wong, Antti Vasankari, Ali Wahab, Ramesh Nadarajah, Matti Kaisti, Antti Airola
Abstract: Traditionally, machine learning-based clinical prediction models have been trained and evaluated on patient data from a single source, such as a hospital. Cross-validation methods can be used to estimate the accuracy of such models on new patients originating from the same source, by repeated random splitting of the data. However, such estimates tend to be highly overoptimistic when compared to accuracy obtained from deploying models to sources not represented in the dataset, such as a new hospital. The increasing availability of multi-source medical datasets provides new opportunities for obtaining more comprehensive and realistic evaluations of expected accuracy through source-level cross-validation designs. In this study, we present a systematic empirical evaluation of standard K-fold cross-validation and leave-source-out cross-validation methods in a multi-source setting. We consider the task of electrocardiogram-based cardiovascular disease classification, combining and harmonizing the openly available PhysioNet CinC Challenge 2021 and the Shandong Provincial Hospital datasets for our study. Our results show that K-fold cross-validation, both on single-source and multi-source data, systematically overestimates prediction performance when the end goal is to generalize to new sources. Leave-source-out cross-validation provides more reliable performance estimates, with close to zero bias, though larger variability. The evaluation highlights the dangers of obtaining misleading cross-validation results on medical data and demonstrates how these issues can be mitigated when having access to multi-source data.
Authors: Chengyuan Li, Tianyu Zhang, Xusheng Du, Ye Zhang, Haoran Xie
Abstract: Recent advances in generative artificial intelligence (AI) technologies have been significantly driven by models such as generative adversarial networks (GANs), variational autoencoders (VAEs), and denoising diffusion probabilistic models (DDPMs). Although architects recognize the potential of generative AI in design, personal barriers often restrict their access to the latest technological developments, thereby causing the application of generative AI in architectural design to lag behind. Therefore, it is essential to comprehend the principles and advancements of generative AI models and analyze their relevance to architectural applications. This paper first provides an overview of generative AI technologies, with a focus on DDPMs, 3D generative models, and foundation models, highlighting their recent developments and main application scenarios. Then, the paper explains how the abovementioned models could be utilized in architecture. We subdivide the architectural design process into six steps and review related research projects in each step from 2020 to the present. Lastly, this paper discusses potential future directions for applying generative AI in the architectural design steps. This research can help architects quickly understand the development and latest progress of generative AI and contribute to the further development of intelligent architecture.
Authors: Linyu Liu, Yu Pan, Xiaocheng Li, Guanting Chen
Abstract: In this paper, we study the problem of uncertainty estimation and calibration for LLMs. We begin by formulating the uncertainty estimation problem, a relevant yet underexplored area in existing literature. We then propose a supervised approach that leverages labeled datasets to estimate the uncertainty in LLMs' responses. Based on the formulation, we illustrate the difference between the uncertainty estimation for LLMs and that for standard ML models and explain why the hidden neurons of the LLMs may contain uncertainty information. Our designed approach demonstrates the benefits of utilizing hidden activations to enhance uncertainty estimation across various tasks and shows robust transferability in out-of-distribution settings. We distinguish the uncertainty estimation task from the uncertainty calibration task and show that better uncertainty estimation leads to better calibration performance. Furthermore, our method is easy to implement and adaptable to different levels of model accessibility including black box, grey box, and white box.
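A hedged sketch of the supervised recipe described above, with synthetic stand-ins: treat hidden-layer activations as features, train a simple probe to predict whether the model's response was correct, and read the probe's probability on new responses as an uncertainty estimate. The feature dimension, the probe choice, and the synthetic labels are assumptions, not the paper's setup.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 128
hidden = rng.normal(size=(n, d))                        # stand-in for LLM hidden activations
w_true = rng.normal(size=d)
correct = (hidden @ w_true + rng.normal(0, 1, n)) > 0   # stand-in correctness labels

# Supervised probe: labeled examples map activations to correctness
probe = LogisticRegression(max_iter=1000).fit(hidden[:1500], correct[:1500])
confidence = probe.predict_proba(hidden[1500:])[:, 1]   # estimated P(answer is correct)
print("mean confidence on held-out responses:", confidence.mean())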
Authors: Shuhao Mei, Xin Li, Yuxi Zhou, Jiahao Xu, Yong Zhang, Yuxuan Wan, Shan Cao, Qinghao Zhao, Shijia Geng, Junqing Xie, Shengyong Chen, Shenda Hong
Abstract: Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung disease that causes airflow obstruction. Current methods can only detect COPD from prominent features in spirogram (Volume-Flow time series) but cannot predict future COPD risk from subtle data patterns. We propose a deep learning-based method, DeepSpiro, for early prediction of future COPD risk. DeepSpiro consists of four key components: SpiroSmoother for stabilizing the Volume-Flow curve, SpiroEncoder for capturing volume evolution through key patches of varying lengths, SpiroExplainer for integrating heterogeneous data and explaining predictions through volume attention, and SpiroPredictor for predicting the disease risk of undiagnosed high-risk patients based on key patch concavity, with prediction horizons of 1, 2, 3, 4, 5 years, or even longer. Evaluated on the UK Biobank dataset, DeepSpiro achieved an AUC of 0.8328 for COPD detection and demonstrated strong predictive performance for future COPD risk (p-value < 0.001). DeepSpiro effectively predicts the long-term progression of the disease.
Authors: Yi Yan, Ercan E. Kuruoglu
Abstract: Graph Neural Networks have a limitation of solely processing features on graph nodes, neglecting data on higher-order structures such as edges and triangles. Simplicial Convolutional Neural Networks (SCNN) represent higher-order structures using simplicial complexes to break this limitation, albeit still lacking time efficiency. In this paper, we propose a novel neural network architecture on simplicial complexes named Binarized Simplicial Convolutional Neural Networks (Bi-SCNN) based on the combination of simplicial convolution with a binary-sign forward propagation strategy. The usage of the Hodge Laplacian on a binary-sign forward propagation enables Bi-SCNN to efficiently and effectively represent simplicial features that have higher-order structures than traditional graph node representations. Compared to the previous Simplicial Convolutional Neural Networks, the reduced model complexity of Bi-SCNN shortens the execution time without sacrificing the prediction performance and is less prone to the over-smoothing effect. Experimenting with real-world citation and ocean-drifter data confirmed that our proposed Bi-SCNN is efficient and accurate.
Authors: Chakshu Moar, Faraz Tahmasebi, Michael Pellauer, Hyoukjun Kwon
Abstract: Recent large language models (LLMs) employ billions of parameters to enable broad problem-solving capabilities. Such language models also tend to be memory-bound because of the dominance of matrix-vector and matrix-matrix multiplications with low arithmetic intensity. Therefore, optimizing the memory footprint and traffic is an important optimization direction for LLMs today. Model compression methods such as quantization and parameter pruning have been actively explored to achieve memory footprint and traffic optimization. However, the accuracy-efficiency trade-off of rank pruning (i.e., low-rank decomposition) for LLMs is not well-understood yet. Therefore, in this work, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, specifically Tucker decomposition, on recent language models, including an open-source LLM, Llama 2. We formalize the low-rank decomposition design space and show that the decomposition design space is enormous (e.g., O($2^{39}$) for Llama2-7B). To navigate such a vast design space, we formulate it and perform thorough case studies of accuracy-efficiency trade-offs using six widely used LLM benchmarks on BERT and Llama 2 models. Our results show that we can achieve a 9\% model size reduction with minimal accuracy drops, which range from 4\%p (\%p refers to "percentage point," which refers to the absolute difference between two percentage numbers; 74\% -> 78\% = 4\%p increase) to 10\%p, depending on the difficulty of the benchmark, without any retraining to recover accuracy after decomposition. The results show that low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale (e.g., AI agent and real-time coding assistant), where the latency is as important as the model accuracy.
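As a simplified, matrix-level stand-in for the Tucker decomposition studied above, the sketch below compresses a single weight matrix by truncated SVD and reports the size reduction and approximation error; the shapes, the planted low-rank structure, and the chosen rank are arbitrary assumptions, and a real Tucker decomposition would factor a higher-order reshaping of the weights instead.

import numpy as np

rng = np.random.default_rng(0)
d, true_rank, rank = 1024, 128, 128
# Toy weight matrix with approximate low-rank structure plus noise
W = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d)) / np.sqrt(true_rank)
W += 0.05 * rng.normal(size=(d, d))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]                 # (d, rank) factor
B = Vt[:rank, :]                           # (rank, d) factor

rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
saved = 1 - (A.size + B.size) / W.size
print(f"relative error {rel_err:.3f}, parameter reduction {100 * saved:.0f}%")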
Authors: Haixu Wu, Huakun Luo, Yuezhou Ma, Jianmin Wang, Mingsheng Long
Abstract: Physics-informed neural networks (PINNs) have been widely applied to solve partial differential equations (PDEs) by enforcing outputs and gradients of deep models to satisfy target equations. Due to the limitation of numerical computation, PINNs are conventionally optimized on finite selected points. However, since PDEs are usually defined on continuous domains, solely optimizing models on scattered points may be insufficient to obtain an accurate solution for the whole domain. To mitigate this inherent deficiency of the default scatter-point optimization, this paper proposes and theoretically studies a new training paradigm as region optimization. Concretely, we propose to extend the optimization process of PINNs from isolated points to their continuous neighborhood regions, which can theoretically decrease the generalization error, especially for hidden high-order constraints of PDEs. A practical training algorithm, Region Optimized PINN (RoPINN), is seamlessly derived from this new paradigm, which is implemented by a straightforward but effective Monte Carlo sampling method. By calibrating the sampling process into trust regions, RoPINN finely balances optimization and generalization error. Experimentally, RoPINN consistently boosts the performance of diverse PINNs on a wide range of PDEs without extra backpropagation or gradient calculation. Code is available at this repository: https://github.com/thuml/RoPINN.
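A minimal sketch of the region-optimization idea, not the authors' implementation: instead of evaluating the PDE residual only at fixed collocation points, perturb each point within a small trust region at every step and average the residual over these Monte Carlo samples. The toy 1-D problem u' = u with u(0) = 1, the network size, and the region radius are assumptions made only for this illustration.

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x_col = torch.linspace(0.0, 1.0, 64).unsqueeze(1)        # fixed collocation points
radius = 0.02                                            # trust-region half-width

for step in range(2000):
    # Region optimization: sample a perturbed point inside each point's neighborhood
    x = (x_col + radius * (2 * torch.rand_like(x_col) - 1)).requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    residual = (du - u).pow(2).mean()                     # PDE residual for u' = u
    ic = (net(torch.zeros(1, 1)) - 1.0).pow(2).mean()     # initial condition u(0) = 1
    loss = residual + ic
    opt.zero_grad()
    loss.backward()
    opt.step()

print("u(1) =", net(torch.ones(1, 1)).item(), "(target e = 2.718...)")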
Authors: Moses Charikar, Chirag Pabbaraju, Kirankumar Shiragur
Abstract: Recent advances in large language models have shown capabilities that are extraordinary and near-superhuman. These models operate with such complexity that reliably evaluating and aligning them proves challenging for humans. This leads to the natural question: can guidance from weak models (like humans) adequately direct the capabilities of strong models? In a recent and somewhat surprising work, Burns et al. (2023) empirically demonstrated that when strong models (like GPT-4) are finetuned using labels generated by weak supervisors (like GPT-2), the strong models outperform their weaker counterparts -- a phenomenon they term weak-to-strong generalization. In this work, we present a theoretical framework for understanding weak-to-strong generalization. Specifically, we show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model. Our theory reveals several curious algorithmic insights. For instance, we can predict the amount by which the strong model will improve over the weak model, and also choose among different weak models to train the strong model, based on its misfit error. We validate our theoretical findings through various empirical assessments.
Authors: Duke Nguyen, Aditya Joshi, Flora Salim
Abstract: Linearization of attention using various kernel approximation and kernel learning techniques has shown promise. Past methods use a subset of combinations of component functions and weight matrices within the random features paradigm. We identify the need for a systematic comparison of different combinations of weight matrices and component functions for attention learning in the Transformer. In this work, we introduce Spectraformer, a unified framework for approximating and learning the kernel function in linearized attention of the Transformer. We experiment with broad classes of component functions and weight matrices for three textual tasks in the LRA benchmark. Our empirical findings indicate that different kernels are good at different tasks and that kernel choice is fundamental to performant models. Our code is available at: https://github.com/dukenguyenxyz/spectraformer .
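As a loose example of one (component function, weight matrix) combination in this family, the sketch below uses positive exponential random features with a Gaussian weight matrix to linearize attention; this specific choice is illustrative, not Spectraformer's recommended configuration.

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 128, 32, 64                        # tokens, head dimension, random features

Q = rng.standard_normal((n, d)) / d ** 0.25  # the usual 1/sqrt(sqrt(d)) attention scaling
K = rng.standard_normal((n, d)) / d ** 0.25
V = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))              # Gaussian weight matrix (one choice among many)

def phi(X):
    # positive random features approximating the softmax kernel exp(q.k)
    return np.exp(X @ W.T - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

Qf, Kf = phi(Q), phi(K)
num = Qf @ (Kf.T @ V)                                # O(n m d) instead of O(n^2 d)
den = Qf @ Kf.sum(axis=0, keepdims=True).T + 1e-6    # per-token normalizer, shape (n, 1)
print((num / den).shape)                             # (128, 32)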
Authors: Roel Bouman, Linda Schmeitz, Luco Buise, Jacco Heres, Yuliya Shapovalova, Tom Heskes
Abstract: In this paper, we present a novel methodology for automatic anomaly and switch event filtering to improve load estimation in power grid systems. By leveraging unsupervised methods with supervised optimization, our approach prioritizes interpretability while ensuring robust and generalizable performance on unseen data. Through experimentation, a combination of binary segmentation for change point detection and statistical process control for anomaly detection emerges as the most effective strategy, particularly when ensembled in a novel sequential manner. Results indicate that substantial potential is wasted when filtering is not applied. The automatic load estimation is also fairly accurate, with approximately 90% of estimates falling within a 10% error margin, and only a single significant failure in both the minimum and maximum load estimates across 60 measurements in the test set. Our methodology's interpretability makes it particularly suitable for critical infrastructure planning, thereby enhancing decision-making processes.
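The following is a rough sketch of the kind of sequential ensemble described above: binary segmentation (here via the ruptures package) isolates a switch event, and a simple 3-sigma control chart flags anomalies within each homogeneous segment; the library choice, thresholds, and toy signal are assumptions, not the paper's exact pipeline.

import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
load = np.concatenate([rng.normal(10, 1, 200), rng.normal(25, 1, 200)])  # one switch event
load[120] += 8                                                           # one injected anomaly

bkps = rpt.Binseg(model="l2").fit(load).predict(n_bkps=1)  # change point detection

anomalies, start = [], 0
for end in bkps:                         # 3-sigma control chart per homogeneous segment
    seg = load[start:end]
    mu, sigma = seg.mean(), seg.std()
    anomalies.extend(int(i) for i in start + np.flatnonzero(np.abs(seg - mu) > 3 * sigma))
    start = end
print("switch events:", bkps[:-1], "anomalies:", anomalies)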
Authors: Chun-Mao Lai, Hsiang-Chun Wang, Ping-Chun Hsieh, Yu-Chiang Frank Wang, Min-Hung Chen, Shao-Hua Sun
Abstract: Imitation learning aims to learn a policy from observing expert demonstrations without access to reward signals from environments. Generative adversarial imitation learning (GAIL) formulates imitation learning as adversarial learning, employing a generator policy that learns to imitate expert behaviors and a discriminator that learns to distinguish the expert demonstrations from agent trajectories. Despite its encouraging results, GAIL training is often brittle and unstable. Inspired by the recent dominance of diffusion models in generative modeling, we propose Diffusion-Reward Adversarial Imitation Learning (DRAIL), which integrates a diffusion model into GAIL, aiming to yield more robust and smoother rewards for policy learning. Specifically, we propose a diffusion discriminative classifier to construct an enhanced discriminator, and design diffusion rewards based on the classifier's output for policy learning. Extensive experiments are conducted in navigation, manipulation, and locomotion, verifying DRAIL's effectiveness compared to prior imitation learning methods. Moreover, additional experimental results demonstrate the generalizability and data efficiency of DRAIL. Visualized learned reward functions of GAIL and DRAIL suggest that DRAIL can produce more robust and smoother rewards. Project page: https://nturobotlearninglab.github.io/DRAIL/
Authors: Wei Jiang, Sifan Yang, Wenhao Yang, Lijun Zhang
Abstract: Sign stochastic gradient descent (signSGD) is a communication-efficient method that transmits only the sign of stochastic gradients for parameter updating. Existing literature has demonstrated that signSGD can achieve a convergence rate of $\mathcal{O}(d^{1/2}T^{-1/4})$, where $d$ represents the dimension and $T$ is the iteration number. In this paper, we improve this convergence rate to $\mathcal{O}(d^{1/2}T^{-1/3})$ by introducing the Sign-based Stochastic Variance Reduction (SSVR) method, which employs variance reduction estimators to track gradients and leverages their signs to update. For finite-sum problems, our method can be further enhanced to achieve a convergence rate of $\mathcal{O}(m^{1/4}d^{1/2}T^{-1/2})$, where $m$ denotes the number of component functions. Furthermore, we investigate the heterogeneous majority vote in distributed settings and introduce two novel algorithms that attain improved convergence rates of $\mathcal{O}(d^{1/2}T^{-1/2} + dn^{-1/2})$ and $\mathcal{O}(d^{1/4}T^{-1/4})$ respectively, outperforming the previous results of $\mathcal{O}(dT^{-1/4} + dn^{-1/2})$ and $\mathcal{O}(d^{3/8}T^{-1/8})$, where $n$ represents the number of nodes. Numerical experiments across different tasks validate the effectiveness of our proposed methods.
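A hedged sketch of a sign-based update driven by a variance-reduced gradient tracker is shown below on a toy least-squares problem; the STORM-style tracker, minibatch size, and step sizes are illustrative assumptions rather than the paper's exact SSVR estimator or constants.

import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((1000, 20)), rng.standard_normal(1000)  # toy least-squares problem
x = np.zeros(20)

def stoch_grad(x, idx):
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

v = stoch_grad(x, rng.choice(1000, 64))  # variance-reduced gradient tracker
x_prev, lr, beta = x.copy(), 1e-2, 0.9
for t in range(500):
    idx = rng.choice(1000, 64)
    # STORM-style correction: re-use the old tracker, corrected by the gradient
    # difference at the previous and current iterates on the same minibatch
    v = stoch_grad(x, idx) + beta * (v - stoch_grad(x_prev, idx))
    x_prev = x.copy()
    x = x - lr * np.sign(v)              # only the signs drive (and would be communicated in) the update
print("final loss:", 0.5 * np.mean((A @ x - b) ** 2))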
Authors: Jianrong Ding, Zhanyu Liu, Guanjie Zheng, Haiming Jin, Linghe Kong
Abstract: Dataset condensation is a nascent technique that generates a small dataset that can be used to train deep neural networks at lower cost. The objective of dataset condensation is to ensure that the model trained with the synthetic dataset can perform comparably to the model trained with full datasets. However, existing methods predominantly concentrate on classification tasks, posing challenges in their adaptation to time series forecasting (TS-forecasting). This challenge arises from disparities in the evaluation of synthetic data. In classification, the synthetic data is considered well-distilled if the model trained with the full dataset and the model trained with the synthetic dataset yield identical labels for the same input, regardless of variations in the output logits distribution. Conversely, in TS-forecasting, the effectiveness of synthetic data distillation is determined by the distance between predictions of the two models. The synthetic data is deemed well-distilled only when all data points within the predictions are similar. Consequently, TS-forecasting has a more rigorous evaluation methodology compared to classification. To mitigate this gap, we theoretically analyze the optimization objective of dataset condensation for TS-forecasting and propose a new one-line plugin for dataset condensation, designated as Dataset Condensation for Time Series Forecasting (CondTSF), based on our analysis. Plugging CondTSF into previous dataset condensation methods facilitates a reduction in the distance between the predictions of the model trained with the full dataset and the model trained with the synthetic dataset, thereby enhancing performance. We conduct extensive experiments on eight commonly used time series datasets. CondTSF consistently improves the performance of all previous dataset condensation methods across all datasets, particularly at low condensing ratios.
Authors: Riccardo Salami, Pietro Buzzega, Matteo Mosconi, Mattia Verasani, Simone Calderara
Abstract: Federated Learning (FL) aims at unburdening the training of deep models by distributing computation across multiple devices (clients) while safeguarding data privacy. On top of that, Federated Continual Learning (FCL) also accounts for data distribution evolving over time, mirroring the dynamic nature of real-world environments. While previous studies have identified Catastrophic Forgetting and Client Drift as primary causes of performance degradation in FCL, we shed light on the importance of Incremental Bias and Federated Bias, which cause models to prioritize classes that are recently introduced or locally predominant, respectively. Our proposal constrains both biases in the last layer by efficiently finetuning a pre-trained backbone using learnable prompts, resulting in clients that produce less biased representations and more biased classifiers. Therefore, instead of solely relying on parameter aggregation, we leverage generative prototypes to effectively balance the predictions of the global model. Our method significantly improves the current State Of The Art, providing an average increase of +7.8% in accuracy. Code to reproduce the results is provided in the suppl. material.
Authors: John Arevalo, Ellen Su, Anne E Carpenter, Shantanu Singh
Abstract: Drug-target interaction (DTI) prediction is crucial for identifying new therapeutics and detecting mechanisms of action. While structure-based methods accurately model physical interactions between a drug and its protein target, cell-based assays such as Cell Painting can better capture complex DTI interactions. This paper introduces MOTIVE, a Morphological cOmpound Target Interaction Graph dataset comprising Cell Painting features for 11,000 genes and 3,600 compounds, along with their relationships extracted from seven publicly available databases. We provide random, cold-source (new drugs), and cold-target (new genes) data splits to enable rigorous evaluation under realistic use cases. Our benchmark results show that graph neural networks that use Cell Painting features consistently outperform those that learn from graph structure alone, feature-based models, and topological heuristics. MOTIVE accelerates both graph ML research and drug discovery by promoting the development of more reliable DTI prediction models. MOTIVE resources are available at https://github.com/carpenter-singh-lab/motive.
Authors: Jacob E. Kooi, Mark Hoogendoorn, Vincent Fran\c{c}ois-Lavet
Abstract: Activation functions are one of the key components of a deep neural network. The most commonly used activation functions can be classed into the categories of continuously differentiable functions (e.g. tanh) and linear-unit functions (e.g. ReLU), both having their own strengths and drawbacks with respect to downstream performance and representation capacity through learning (e.g. measured by the number of dead neurons and the effective rank). In reinforcement learning, the performance of continuously differentiable activations often falls short compared to that of linear-unit functions. We provide insights into the vanishing gradients associated with the former, and show that the dying neuron problem is not exclusive to ReLUs. To alleviate vanishing gradients and the resulting dying neuron problem occurring with continuously differentiable activations, we propose a Hadamard representation. Using deep Q-networks and proximal policy optimization in the Atari domain, we show faster learning, a reduction in dead neurons and increased effective rank.
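A minimal sketch of a Hadamard-representation layer is given below, where the output is the element-wise product of two continuously differentiable activation branches; the placement of such a layer inside a DQN or PPO network is an assumption here, not the paper's exact architecture.

import torch

class HadamardLayer(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin_a = torch.nn.Linear(d_in, d_out)
        self.lin_b = torch.nn.Linear(d_in, d_out)

    def forward(self, x):
        # element-wise (Hadamard) product of two tanh branches
        return torch.tanh(self.lin_a(x)) * torch.tanh(self.lin_b(x))

x = torch.randn(8, 32)
print(HadamardLayer(32, 64)(x).shape)  # torch.Size([8, 64])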
Authors: Pranav Ajit Nair, Arun Sai Suggala
Abstract: Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses greedy coordinate descent to minimize the layer-wise reconstruction loss to achieve high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. We perform extensive evaluation on Gemma, and PaLM2 model families, and demonstrate that CDQuant consistently outperforms GPTQ in 2-4 bit weight quantization. Moreover, CDQuant improves the performance of state-of-the-art PTQ techniques such as QuIP and FrameQuant when used as a replacement for their GPTQ component, resulting in further gains in quality.
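For intuition, here is a hedged sketch of coordinate descent for layer-wise quantization, minimizing ||XW - XQ||_F^2 one weight at a time over a uniform grid; the cyclic (rather than greedy) coordinate order, grid, and sweep count are simplifying assumptions, not CDQuant's exact procedure.

import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, bits = 256, 32, 16, 4
X = rng.standard_normal((n, d_in))   # calibration activations
W = rng.standard_normal((d_in, d_out))

scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
def grid(v):                         # snap to a uniform signed grid
    return np.clip(np.round(v / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

H = X.T @ X                          # Hessian of the layer-wise reconstruction loss
Q = grid(W)                          # start from round-to-nearest
for _ in range(3):                   # a few coordinate sweeps
    for j in range(d_out):
        e = W[:, j] - Q[:, j]        # current error for this output column
        for i in range(d_in):
            # unconstrained optimum for coordinate (i, j), then snap to the grid
            new = grid(Q[i, j] + (H[i] @ e) / H[i, i])
            e[i] += Q[i, j] - new
            Q[i, j] = new
print("relative recon error:", np.linalg.norm(X @ W - X @ Q) / np.linalg.norm(X @ W))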
Authors: Maresa Schr\"oder, Dennis Frauen, Jonas Schweisthal, Konstantin He{\ss}, Valentyn Melnychuk, Stefan Feuerriegel
Abstract: Uncertainty quantification of causal effects is crucial for safety-critical applications such as personalized medicine. A powerful approach for this is conformal prediction, which has several practical benefits due to model-agnostic finite-sample guarantees. Yet, existing methods for conformal prediction of causal effects are limited to binary/discrete treatments and make highly restrictive assumptions such as known propensity scores. In this work, we provide a novel conformal prediction method for potential outcomes of continuous treatments. We account for the additional uncertainty introduced through propensity estimation so that our conformal prediction intervals are valid even if the propensity score is unknown. Our contributions are three-fold: (1) We derive finite-sample prediction intervals for potential outcomes of continuous treatments. (2) We provide an algorithm for calculating the derived intervals. (3) We demonstrate the effectiveness of the conformal prediction intervals in experiments on synthetic and real-world datasets. To the best of our knowledge, we are the first to propose conformal prediction for continuous treatments when the propensity score is unknown and must be estimated from data.
Authors: Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, Mario Martin
Abstract: Q-learning played a foundational role in the field of reinforcement learning (RL). However, TD algorithms with off-policy data, such as Q-learning, or with nonlinear function approximation, such as deep neural networks, require several additional tricks to stabilise training, primarily a replay buffer and target networks. Unfortunately, the delayed updating of frozen network parameters in the target network harms the sample efficiency and, similarly, the replay buffer introduces memory and implementation overheads. In this paper, we investigate whether it is possible to accelerate and simplify TD training while maintaining its stability. Our key theoretical result demonstrates for the first time that regularisation techniques such as LayerNorm can yield provably convergent TD algorithms without the need for a target network, even with off-policy data. Empirically, we find that online, parallelised sampling enabled by vectorised environments stabilises training without the need for a replay buffer. Motivated by these findings, we propose PQN, our simplified deep online Q-learning algorithm. Surprisingly, this simple algorithm is competitive with more complex methods such as Rainbow in Atari, R2D2 in Hanabi, QMix in Smax, and PPO-RNN in Craftax, and can be up to 50x faster than traditional DQN without sacrificing sample efficiency. In an era where PPO has become the go-to RL algorithm, PQN reestablishes Q-learning as a viable alternative.
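The sketch below condenses the ingredients described above into a single update: a Q-network regularised with LayerNorm, trained on a batch of online transitions (a stand-in for vectorised environments) with the TD target computed from the same network, and with no target network or replay buffer; the architecture and hyperparameters are assumptions, not PQN's exact settings.

import torch

n_envs, obs_dim, n_actions, gamma = 16, 4, 2, 0.99
qnet = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 128), torch.nn.LayerNorm(128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128), torch.nn.LayerNorm(128), torch.nn.ReLU(),
    torch.nn.Linear(128, n_actions))
opt = torch.optim.Adam(qnet.parameters(), lr=3e-4)

# one synchronous batch of transitions, as produced by vectorised environments
obs, next_obs = torch.randn(n_envs, obs_dim), torch.randn(n_envs, obs_dim)
actions = torch.randint(n_actions, (n_envs,))
rewards, done = torch.randn(n_envs), torch.zeros(n_envs)

with torch.no_grad():  # TD target from the *same* network: no target network
    target = rewards + gamma * (1 - done) * qnet(next_obs).max(dim=1).values
q = qnet(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = torch.nn.functional.mse_loss(q, target)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))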
Authors: Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey, Alexandre Ram\'e, Johan Ferret, Geoffrey Cideron, Le Hou, Hongkun Yu, Amr Ahmed, Aranyak Mehta, L\'eonard Hussenot, Olivier Bachem, Edouard Leurent
Abstract: Reward-based finetuning is crucial for aligning language policies with intended behaviors (e.g., creativity and safety). A key challenge is to develop steerable language models that trade off multiple (conflicting) objectives in a flexible and efficient manner. This paper presents Conditional Language Policy (CLP), a general framework for finetuning language models on multiple objectives. Building on techniques from multi-task training and parameter-efficient finetuning, CLP learns steerable models that effectively trade off conflicting objectives at inference time. Notably, this does not require training or maintaining multiple models to achieve different trade-offs between the objectives. Through extensive experiments and ablations on two summarization datasets, we show that CLP learns steerable language models that outperform and Pareto-dominate the existing approaches for multi-objective finetuning.
Authors: Yannik Schnitzer, Alessandro Abate, David Parker
Abstract: We present a data-driven approach for producing policies that are provably robust across unknown stochastic environments. Existing approaches can learn a model of a single environment as an interval Markov decision process (IMDP) and produce a robust policy with a probably approximately correct (PAC) guarantee on its performance. However, these are unable to reason about the impact of environmental parameters underlying the uncertainty. We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters. We learn and analyse IMDPs for a set of unknown sample environments induced by parameters. The key challenge is then to produce meaningful performance guarantees that combine the two layers of uncertainty: (1) multiple environments induced by parameters with an unknown distribution; (2) unknown induced environments which are approximated by IMDPs. We present a novel approach based on scenario optimisation that yields a single PAC guarantee quantifying the risk level for which a specified performance level can be assured in unseen environments, plus a means to trade off risk and performance. We implement and evaluate our framework using multiple robust policy generation methods on a range of benchmarks. We show that our approach produces tight bounds on a policy's performance with high confidence.
Authors: Jiang You, Arben Cela, Ren\'e Natowicz, Jacob Ouanounou, Patrick Siarry
Abstract: Anomaly detection in time series data is a critical challenge across various domains. Traditional methods typically focus on identifying anomalies in immediate subsequent steps, often underestimating the significance of temporal dynamics such as delay time and horizons of anomalies, which generally require extensive post-analysis. This paper introduces a novel approach for time series anomaly prediction, incorporating temporal information directly into the prediction results. We propose a new dataset specifically designed to evaluate this approach and conduct comprehensive experiments using several state-of-the-art methods. Our results demonstrate the efficacy of our approach in providing timely and accurate anomaly predictions, setting a new benchmark for future research in this field.
Authors: Rong J. B. Zhu, Yanqi Qiu
Abstract: We study best-arm identification (BAI) in the fixed-budget setting. Adaptive allocations based on upper confidence bounds (UCBs), such as UCBE, are known to work well in BAI. However, it is well known that their theoretically optimal regret depends on the problem instance, which we show to be an artifact in many fixed-budget BAI problems. In this paper, we propose a UCB exploration algorithm that is both theoretically and empirically efficient for the fixed-budget BAI problem in a Bayesian setting. The key idea is to learn prior information, which can enhance the performance of UCB-based BAI algorithms, as it does in the cumulative regret minimization problem. We establish bounds on the failure probability and the simple regret for the Bayesian BAI problem, providing upper bounds of order $\tilde{O}(\sqrt{K/n})$, up to logarithmic factors, where $n$ represents the budget and $K$ denotes the number of arms. Furthermore, we demonstrate through empirical results that our approach consistently outperforms state-of-the-art baselines.
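For context, a minimal UCBE-style allocation for fixed-budget BAI looks like the sketch below; the exploration constant and Bernoulli bandit are assumptions, and the Bayesian prior-learning component described in the abstract is not shown.

import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.45, 0.6])   # unknown Bernoulli arm means (toy)
K, n = len(means), 500                    # number of arms, budget
a = 2.0 * n / K                           # exploration parameter (assumed value)

counts, sums = np.zeros(K), np.zeros(K)
for t in range(n):
    if t < K:
        arm = t                           # initialise: pull every arm once
    else:
        ucb = sums / counts + np.sqrt(a / counts)
        arm = int(np.argmax(ucb))
    sums[arm] += rng.binomial(1, means[arm])
    counts[arm] += 1

print("recommended arm:", int(np.argmax(sums / counts)), "| true best arm:", int(np.argmax(means)))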
Authors: Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Sarath Chandar
Abstract: Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever-growing sizes only increasing the barrier to use. One noted issue is the high latency associated with auto-regressive generation, rendering the use of large LLMs dependent on advanced computing infrastructure. Assisted decoding, where a smaller draft model guides a larger target model's generation, has helped alleviate this, but remains dependent on alignment between the two models. Thus, if the draft model is insufficiently capable in some domain relative to the target model, performance can degrade. Alternatively, one can leverage multiple draft models to better cover the expertise of the target, but when multiple black-box draft models are available, selecting an assistant without details about its construction can be difficult. To better understand this decision-making problem, we frame it as a contextual bandit, where a policy must choose a draft model based on a context. We show that even without prior knowledge of the draft models, creating an offline dataset from only the outputs of independent draft/target models and training a policy over the alignment of these outputs can accelerate performance on multiple domains, provided the candidates are effective. Further results show this to hold in various settings with multiple assisted decoding candidates, highlighting its flexibility and the advantageous role that such decision making can play.
Authors: Mingfei Cai, Yanbo Pang, Yoshihide Sekimoto
Abstract: Commuting flow prediction is an essential task for municipal operations in the real world. Previous studies have revealed that it is feasible to estimate the commuting origin-destination (OD) demand within a city using multiple auxiliary data. However, most existing methods are not suitable for the same task at a larger scale, namely within a prefecture or an entire nation, owing to the increased number of geographical units that need to be maintained. In addition, region representation learning is a universal approach for gaining urban knowledge for diverse metropolitan downstream tasks. Although many researchers have developed comprehensive frameworks to describe urban units from multi-source data, they have not clarified the relationship between the selected geographical elements. Furthermore, metropolitan areas naturally exhibit hierarchical structures, such as cities and their constituent districts, which makes elucidating relations between cross-level urban units necessary. Therefore, we develop a heterogeneous graph-based model to generate meaningful region embeddings at multiple spatial resolutions for predicting different types of inter-level OD flows. To demonstrate the effectiveness of the proposed method, extensive experiments were conducted using real-world aggregated mobile phone datasets collected from Shizuoka Prefecture, Japan. The results indicate that our proposed model outperforms existing models in terms of a uniform urban structure. We extend the understanding of the predicted results using reasonable explanations to enhance the credibility of the model.
Authors: Sukai Huang, Nir Lipovetzky, Trevor Cohn
Abstract: While Vision-Language Models (VLMs) are increasingly used to generate reward signals for training embodied agents to follow instructions, our research reveals that agents guided by VLM rewards often underperform compared to those employing only intrinsic (exploration-driven) rewards, contradicting expectations set by recent work. We hypothesize that false positive rewards -- instances where unintended trajectories are incorrectly rewarded -- are more detrimental than false negatives. Our analysis confirms this hypothesis, revealing that the widely used cosine similarity metric is prone to false positive reward estimates. To address this, we introduce BiMI (Binary Mutual Information), a novel reward function designed to mitigate noise. BiMI significantly enhances learning efficiency across diverse and challenging embodied navigation environments. Our findings offer a nuanced understanding of how different types of reward noise impact agent learning and highlight the importance of addressing multimodal reward signal noise when training embodied agents.
Authors: Kunyu Peng, Di Wen, Kailun Yang, Ao Luo, Yufan Chen, Jia Fu, M. Saquib Sarfraz, Alina Roitberg, Rainer Stiefelhagen
Abstract: In Open-Set Domain Generalization (OSDG), the model is exposed to both new variations of data appearance (domains) and open-set conditions, where both known and novel categories are present at test time. The challenges of this task arise from the dual need to generalize across diverse domains and accurately quantify category novelty, which is critical for applications in dynamic environments. Recently, meta-learning techniques have demonstrated superior results in OSDG, effectively orchestrating the meta-train and -test tasks by employing varied random categories and predefined domain partition strategies. These approaches prioritize a well-designed training schedule over traditional methods that focus primarily on data augmentation and the enhancement of discriminative feature learning. The prevailing meta-learning models in OSDG typically utilize a predefined sequential domain scheduler to structure data partitions. However, a crucial aspect that remains inadequately explored is the influence of domain-scheduling strategies during training. In this paper, we observe that an adaptive domain scheduler benefits OSDG more than predefined sequential or random domain schedulers. We propose the Evidential Bi-Level Hardest Domain Scheduler (EBiL-HaDS) to achieve an adaptive domain scheduler. This method strategically sequences domains by assessing their reliability using a follower network, which is trained with confidence scores learned in an evidential manner, regularized by max rebiasing discrepancy, and optimized in a bi-level manner. The results show that our method substantially improves OSDG performance and achieves more discriminative embeddings for both the seen and unseen categories. The source code is publicly available at https://github.com/KPeng9510/EBiL-HaDS.
Authors: Chia-Hsiang Kao, Bharath Hariharan
Abstract: Despite its widespread use in neural networks, error backpropagation has faced criticism for its lack of biological plausibility, suffering from issues such as the backward locking problem and the weight transport problem. These limitations have motivated researchers to explore more biologically plausible learning algorithms that could potentially shed light on how biological neural systems adapt and learn. Inspired by the counter-current exchange mechanisms observed in biological systems, we propose counter-current learning (CCL), a biologically plausible framework for credit assignment in neural networks. This framework employs a feedforward network to process input data and a feedback network to process targets, with each network enhancing the other through anti-parallel signal propagation. By leveraging the more informative signals from the bottom layer of the feedback network to guide the updates of the top layer of the feedforward network and vice versa, CCL enables the simultaneous transformation of source inputs to target outputs and the dynamic mutual influence of these transformations. Experimental results on MNIST, FashionMNIST, CIFAR10, and CIFAR100 datasets using multi-layer perceptrons and convolutional neural networks demonstrate that CCL achieves comparable performance to other biologically plausible algorithms while offering a more biologically realistic learning mechanism. Furthermore, we showcase the applicability of our approach to an autoencoder task, underscoring its potential for unsupervised representation learning. Our work presents a direction for biologically inspired and plausible learning algorithms, offering an alternative mechanism of learning and adaptation in neural networks.
Authors: Alfredo Reichlin, Gustaf Tegn\'er, Miguel Vasco, Hang Yin, M{\aa}rten Bj\"orkman, Danica Kragic
Abstract: Given a finite set of sample points, meta-learning algorithms aim to learn an optimal adaptation strategy for new, unseen tasks. Often, this data can be ambiguous as it might belong to different tasks concurrently. This is particularly the case in meta-regression tasks. In such cases, the estimated adaptation strategy is subject to high variance due to the limited amount of support data for each task, which often leads to sub-optimal generalization performance. In this work, we address the problem of variance reduction in gradient-based meta-learning and formalize the class of problems prone to this, a condition we refer to as \emph{task overlap}. Specifically, we propose a novel approach that reduces the variance of the gradient estimate by weighing each support point individually by the variance of its posterior over the parameters. To estimate the posterior, we utilize the Laplace approximation, which allows us to express the variance in terms of the curvature of the loss landscape of our meta-learner. Experimental results demonstrate the effectiveness of the proposed method and highlight the importance of variance reduction in meta-learning.
Authors: Xavier Warin
Abstract: A new Kolmogorov-Arnold network (KAN) is proposed to approximate potentially irregular functions in high dimension. We show that it outperforms multilayer perceptrons in terms of accuracy and converges faster. We also compare it with several proposed KAN networks: the original spline-based KAN network appears to be more effective for smooth functions, while the P1-KAN network is more effective for irregular functions.
Authors: Yifan Wang, Cheng Zhang, Yuanndon Zhuang, Mingzeng Dai, Haiming Wang, Yongming Huang
Abstract: Wireless networks supporting artificial intelligence have gained significant attention, with Over-the-Air Federated Learning emerging as a key application due to its unique transmission and distributed computing characteristics. This paper derives error bounds for Over-the-Air Federated Learning in a Cell-free MIMO system and formulates an optimization problem to minimize optimality gap via joint optimization of power control and beamforming. We introduce the MOP-LOFPC algorithm, which employs Lyapunov optimization to decouple long-term constraints across rounds while requiring only causal channel state information. Experimental results demonstrate that MOP-LOFPC achieves a better and more flexible trade-off between the model's training loss and adherence to long-term power constraints compared to existing baselines.
Authors: Hung-Hsuan Chen
Abstract: The Gradient Boosting Classifier (GBC) is a widely used machine learning algorithm for binary classification, which builds decision trees iteratively to minimize prediction errors. This document explains the GBC's training and prediction processes, focusing on the computation of terminal node values $\gamma_j$, which are crucial to optimizing the logistic loss function. We derive $\gamma_j$ through a Taylor series approximation and provide a step-by-step pseudocode for the algorithm's implementation. The guide explains the theory of GBC and its practical application, demonstrating its effectiveness in binary classification tasks. We provide a step-by-step example in the appendix to help readers understand.
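As a worked micro-example of that terminal-node value under logistic loss: with residuals $r_i = y_i - p_i$, the standard second-order Taylor (Newton) step yields $\gamma_j = \sum_i (y_i - p_i) / \sum_i p_i(1-p_i)$ over the samples in leaf $j$, as the short snippet below illustrates (the labels and probabilities are made up for the example).

import numpy as np

y = np.array([1, 0, 1])        # labels of the samples falling in leaf j
p = np.array([0.6, 0.3, 0.5])  # current predicted probabilities for those samples
gamma_j = np.sum(y - p) / np.sum(p * (1 - p))
print(gamma_j)                 # (0.4 - 0.3 + 0.5) / (0.24 + 0.21 + 0.25) ~= 0.857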
Authors: Alex Stein, Samuel Sharpe, Doron Bergman, Senthil Kumar, C. Bayan Bruss, John Dickerson, Tom Goldstein, Micah Goldblum
Abstract: Many real-world applications of tabular data involve using historic events to predict properties of new ones, for example whether a credit card transaction is fraudulent or what rating a customer will assign a product on a retail platform. Existing approaches to event prediction include costly, brittle, and application-dependent techniques such as time-aware positional embeddings, learned row and field encodings, and oversampling methods for addressing class imbalance. Moreover, these approaches often assume specific use-cases, for example that we know the labels of all historic events or that we only predict a pre-specified label and not the data's features themselves. In this work, we propose a simple but flexible baseline using standard autoregressive LLM-style transformers with elementary positional embeddings and a causal language modeling objective. Our baseline outperforms existing approaches across popular datasets and can be employed for various use-cases. We demonstrate that the same model can predict labels, impute missing values, or model event sequences.
Authors: Shiran Yuan, Hao Zhao
Abstract: We address an important problem in ecology called Species Distribution Modeling (SDM), whose goal is to predict whether a species exists at a certain position on Earth. In particular, we tackle a challenging version of this task, where we learn from presence-only data in a community-sourced dataset, model a large number of species simultaneously, and do not use any additional environmental information. Previous work has used neural implicit representations to construct models that achieve promising results. However, implicit representations often generate predictions of limited spatial precision. We attribute this limitation to their inherently global formulation and inability to effectively capture local feature variations. This issue is especially pronounced with presence-only data and a large number of species. To address this, we propose a hybrid embedding scheme that combines both implicit and explicit embeddings. Specifically, the explicit embedding is implemented with a multiresolution hashgrid, enabling our models to better capture local information. Experiments demonstrate that our results exceed other works by a large margin on various standard benchmarks, and that the hybrid representation is better than both purely implicit and explicit ones. Qualitative visualizations and comprehensive ablation studies reveal that our hybrid representation successfully addresses the two main challenges. Our code is open-sourced at https://github.com/Shiran-Yuan/HSR-SDM.
Authors: Zifan Liu, Amin Karbasi, Theodoros Rekatsinas
Abstract: Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data. We connect our optimization problem to nearest neighbor search and design efficient algorithms to compute the optimal solution based on approximate nearest neighbor search techniques. We evaluate our method on data selection for both continued pretraining and instruction tuning of language models. We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset and beats the baseline selection methods by 1.5 points in F1 score on average.
Authors: Xinting Liao, Weiming Liu, Pengyang Zhou, Fengyuan Yu, Jiahe Xu, Jun Wang, Wenjie Wang, Chaochao Chen, Xiaolin Zheng
Abstract: Federated learning (FL) is a promising machine learning paradigm that collaborates with client models to capture global knowledge. However, deploying FL models in real-world scenarios remains unreliable due to the coexistence of in-distribution data and unexpected out-of-distribution (OOD) data, such as covariate-shift and semantic-shift data. Current FL research typically addresses either covariate-shift data through OOD generalization or semantic-shift data via OOD detection, overlooking the simultaneous occurrence of various OOD shifts. In this work, we propose FOOGD, a method that estimates the probability density of each client and obtains a reliable global distribution as guidance for the subsequent FL process. First, SM3D in FOOGD estimates a score model for arbitrary distributions without prior constraints and powerfully detects semantic-shift data. Then SAG in FOOGD provides invariant yet diverse knowledge for both local covariate-shift generalization and client performance generalization. In empirical validations, FOOGD enjoys three main advantages: (1) reliably estimating non-normalized decentralized distributions, (2) detecting semantic-shift data via score values, and (3) generalizing to covariate-shift data by regularizing the feature extractor. The project is open-sourced at https://github.com/XeniaLLL/FOOGD-main.git.
Authors: Jiamian Li
Abstract: Reinforcement learning has achieved remarkable success in perfect information games such as Go and Atari, enabling agents to compete at the highest levels against human players. However, research in reinforcement learning for imperfect information games has been relatively limited due to the more complex game structures and randomness. Traditional methods face challenges in training and improving performance in imperfect information games due to issues like inaccurate Q value estimation and reward sparsity. In this paper, we focus on Uno, an imperfect information game, and aim to address these problems by reducing Q value overestimation and reshaping the reward function. We propose a novel algorithm that utilizes Monte Carlo Tree Search to average the value estimations in the Q function. Even though we choose Double Deep Q Learning as the foundational framework in this paper, our method can be generalized and used in any algorithm that requires Q value estimation, such as Actor-Critic methods. Additionally, we employ Monte Carlo Tree Search to reshape the reward structure in the game environment. We compare our algorithm with several traditional methods applied to games, such as Double Deep Q Learning, Deep Monte Carlo and Neural Fictitious Self Play, and the experiments demonstrate that our algorithm consistently outperforms these approaches, especially as the number of players in Uno increases, indicating a higher level of difficulty.
Authors: Nina Wiedemann, Th\'eo Uscidda, Martin Raubal
Abstract: Prediction problems in geographic information science and transportation are often motivated by the possibility to enhance operational efficiency and thereby reduce emissions. Examples range from predicting car sharing demand for relocation planning to forecasting traffic congestion for navigation purposes. However, conventional accuracy metrics ignore the spatial distribution of the errors, despite its relevance for operations. Here, we put forward a spatially aware evaluation metric and loss function based on Optimal Transport (OT). Our framework leverages partial OT and can minimize relocation costs in any spatial prediction problem. We showcase the advantages of OT-based evaluation over conventional metrics and further demonstrate the application of an OT loss function for improving forecasts of bike sharing demand and charging station occupancy. Thus, our framework not only aligns with operational considerations, but also signifies a step forward in refining predictions within geospatial applications. All code is available at https://github.com/mie-lab/geospatialOT.
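A compact sketch of a spatially aware OT cost is shown below: the cost of transporting predicted demand to observed demand across stations, computed with a small entropic (Sinkhorn) iteration; the coordinates, demands, and regulariser are toy assumptions, and the paper's partial-OT formulation is not reproduced here.

import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(20, 2))           # station locations (toy)
pred, obs = rng.random(20), rng.random(20)
pred, obs = pred / pred.sum(), obs / obs.sum()      # normalise to distributions

C = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)  # ground cost = distance

def sinkhorn_cost(a, b, C, eps=0.1, iters=200):
    Kmat = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):                          # Sinkhorn scaling iterations
        v = b / (Kmat.T @ u)
        u = a / (Kmat @ v)
    P = u[:, None] * Kmat * v[None, :]              # (approximate) transport plan
    return float((P * C).sum())                     # relocation cost

print("OT-based loss:", sinkhorn_cost(pred, obs, C))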
Authors: Jesus Garcia Fernandez, Nasir Ahmad, Marcel van Gerven
Abstract: Learning is a fundamental property of intelligent systems, observed across biological organisms and engineered systems. While modern intelligent systems typically rely on gradient descent for learning, the need for exact gradients and complex information flow makes its implementation in biological and neuromorphic systems challenging. This has motivated the exploration of alternative learning mechanisms that can operate locally and do not rely on exact gradients. In this work, we introduce a novel approach that leverages noise in the parameters of the system and global reinforcement signals. Using an Ornstein-Uhlenbeck process with adaptive dynamics, our method balances exploration and exploitation during learning, driven by deviations from error predictions, akin to reward prediction error. Operating in continuous time, Ornstein-Uhlenbeck adaptation (OUA) is proposed as a general mechanism for learning in dynamic, time-evolving environments. We validate our approach across diverse tasks, including supervised learning and reinforcement learning in feedforward and recurrent systems. Additionally, we demonstrate that it can perform meta-learning, adjusting hyper-parameters autonomously. Our results indicate that OUA provides a viable alternative to traditional gradient-based methods, with potential applications in neuromorphic computing. It also hints at a possible mechanism for noise-driven learning in the brain, where stochastic neurotransmitter release may guide synaptic adjustments.
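The following is a loose, heavily simplified illustration of noise-driven learning with an Ornstein-Uhlenbeck perturbation and a reward-prediction-error signal; the update rules, coefficients, and toy objective are assumptions and do not reproduce the paper's adaptive continuous-time dynamics.

import numpy as np

rng = np.random.default_rng(0)
def reward(theta):                  # toy objective with its peak at theta = [1, -2]
    return -np.sum((theta - np.array([1.0, -2.0])) ** 2)

theta = np.zeros(2)                 # mean parameters
noise = np.zeros(2)                 # Ornstein-Uhlenbeck state
r_pred = reward(theta)              # running reward prediction
tau, sigma, lr, alpha = 10.0, 0.3, 0.05, 0.1

for t in range(2000):
    noise += -noise / tau + sigma * rng.standard_normal(2)  # mean-reverting parameter noise
    rpe = reward(theta + noise) - r_pred                    # reward prediction error
    theta += lr * rpe * noise       # move toward perturbations that beat the prediction
    r_pred += alpha * rpe           # update the reward prediction
print(theta)                        # drifts toward [1, -2]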
Authors: Jie Peng, Zhang Cao, Huaizhi Qu, Zhengyu Zhang, Chang Guo, Yanyong Zhang, Zhichao Cao, Tianlong Chen
Abstract: Although Large Language Models (LLMs) have demonstrated remarkable capabilities, their massive parameter counts and associated extensive computing make LLM deployment a major contributor to the carbon emissions of today's AI applications. Compared to modern GPUs like the H$100$, it would be significantly more carbon-sustainable if we could leverage old-fashioned GPUs such as the M$40$ (as shown in Figure 1, the M$40$ has only one third the carbon emission of the H$100$) for LLM serving. However, the limited High Bandwidth Memory (HBM) available on such GPUs often cannot support the loading of LLMs due to the gigantic model size and intermediate activation data, making their serving challenging. For instance, a LLaMA2 model with $70$B parameters typically requires $128$GB for inference, which substantially surpasses the $24$GB HBM of a $3090$ GPU and remains infeasible even considering an additional $64$GB of DRAM. To address this challenge, this paper proposes M2Cache, a mixed-precision, model-modularization algorithm with multi-level caching (the precision denotes numerical precision such as FP16, INT8, and INT4) that enables LLM inference on outdated hardware with resource constraints. Specifically, M2Cache first modularizes neurons in the LLM and creates their importance ranking. Then, it adopts a dynamic sparse mixed-precision quantization mechanism in weight space to reduce computational demands and communication overhead at each decoding step. This collectively lowers the operational carbon emissions associated with LLM inference. Moreover, M2Cache introduces a three-level cache management system with HBM, DRAM, and SSDs that complements the dynamic sparse mixed-precision inference. To enhance communication efficiency, M2Cache maintains a neuron-level mixed-precision LRU cache in HBM, a larger layer-aware cache in DRAM, and a full model in SSD.
Authors: Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios N. Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
Abstract: We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback). The gold-standard approach is to run a full RLHF training pipeline and directly probe downstream LLM performance. However, this process is prohibitively expensive. To address this, we build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks. These proxy tasks consist of a large-scale human preference dataset and a verifiable correctness preference dataset, on which we measure 12 metrics across 12 domains. To investigate which reward model metrics are most correlated to gold-standard RLHF outcomes, we launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth. Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE), the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance, which we open-source for public use and further development. Our code and evaluations can be found at https://github.com/lmarena/PPE .
Authors: Giulia DeSalvo, Jean-Fracois Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar
Abstract: We present a novel soft prompt based framework, SoftSRV, that leverages a frozen pre-trained large language model (LLM) to generate targeted synthetic text sequences. Given a sample from the target distribution, our proposed framework uses data-driven loss minimization to train a parameterized "contextual" soft prompt. This soft prompt is then used to steer the frozen LLM to generate synthetic sequences that are similar to the target distribution. We argue that SoftSRV provides a practical improvement over common hard-prompting approaches that rely on human-curated prompt-templates, which can be idiosyncratic, labor-intensive to craft, and may need to be specialized per domain. We empirically evaluate SoftSRV and hard-prompting baselines by generating synthetic data to fine-tune a small Gemma model on three different domains (coding, math, reasoning). To stress the generality of SoftSRV, we perform these evaluations without any particular specialization of the framework to each domain. We find that SoftSRV significantly improves upon hard-prompting baselines, generating data with superior fine-tuning performance and that better matches the target distribution according to the MAUVE similarity metric.
Authors: Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm
Abstract: Access to real clinical data is often restricted due to privacy obligations, creating significant barriers for healthcare research. Synthetic datasets provide a promising solution, enabling secure data sharing and model development. However, most existing approaches focus on data realism rather than utility -- ensuring that models trained on synthetic data yield clinically meaningful insights comparable to those trained on real data. In this paper, we present Masked Clinical Modelling (MCM), a framework inspired by masked language modelling, designed for both data synthesis and conditional data augmentation. We evaluate this prototype on the WHAS500 dataset using Cox Proportional Hazards models, focusing on the preservation of hazard ratios as key clinical metrics. Our results show that data generated using the MCM framework improves both discrimination and calibration in survival analysis, outperforming existing methods. MCM demonstrates strong potential to support survival data analysis and broader healthcare applications.
Authors: Maurice Kraus, Felix Divo, Devendra Singh Dhami, Kristian Kersting
Abstract: Time series data is prevalent across numerous fields, necessitating the development of robust and accurate forecasting models. Capturing patterns both within and between temporal and multivariate components is crucial for reliable predictions. We introduce xLSTM-Mixer, a model designed to effectively integrate temporal sequences, joint time-variate information, and multiple perspectives for robust forecasting. Our approach begins with a linear forecast shared across variates, which is then refined by xLSTM blocks. These blocks serve as key elements for modeling the complex dynamics of challenging time series data. xLSTM-Mixer ultimately reconciles two distinct views to produce the final forecast. Our extensive evaluations demonstrate xLSTM-Mixer's superior long-term forecasting performance compared to recent state-of-the-art methods. A thorough model analysis provides further insights into its key components and confirms its robustness and effectiveness. This work contributes to the resurgence of recurrent models in time series forecasting.
Authors: Antoine Scheid, Etienne Boursier, Alain Durmus, Michael I. Jordan, Pierre M\'enard, Eric Moulines, Michal Valko
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to align language models (LMs) with human preferences. This method involves collecting a large dataset of human pairwise preferences across various text generations and using it to infer (implicitly or explicitly) a reward model. Numerous methods have been proposed to learn the reward model and align a LM with it. However, the costly process of collecting human preferences has received little attention and could benefit from theoretical insights. This paper addresses this issue and aims to formalize the reward training model in RLHF. We frame the selection of an effective dataset as a simple regret minimization task, using a linear contextual dueling bandit method. Given the potentially large number of arms, this approach is more coherent than the best-arm identification setting. We then propose an offline framework for solving this problem. Under appropriate assumptions - linearity of the reward model in the embedding space, and boundedness of the reward parameter - we derive bounds on the simple regret. Finally, we provide a lower bound that matches our upper bound up to constant and logarithmic terms. To our knowledge, this is the first theoretical contribution in this area to provide an offline approach as well as worst-case guarantees.
Authors: Yulun Wu, Layne C. Price, Zichen Wang, Vassilis N. Ioannidis, Robert A. Barton, George Karypis
Abstract: Estimating an individual's potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, impulse responses, human faces) and covariates are relatively limited. In this case, to construct one's outcome under a counterfactual treatment, it is crucial to leverage individual information contained in its observed factual outcome on top of the covariates. We propose a deep variational Bayesian framework that rigorously integrates two main sources of information for outcome construction under a counterfactual treatment: one source is the individual features embedded in the high-dimensional factual outcome; the other source is the response distribution of similar subjects (subjects with the same covariates) that factually received this treatment of interest.
Authors: Yafei Shen, Tao Zhang, Zhiwei Liu, Kalliopi Kostelidou, Ying Xu, Ling Yang
Abstract: Identifying complex phenotypes from high-dimensional biological data is challenging due to the intricate interdependencies among different physiological indicators. Traditional approaches often focus on detecting outliers in single variables, overlooking the broader network of interactions that contribute to phenotype emergence. Here, we introduce ODBAE (Outlier Detection using Balanced Autoencoders), a machine learning method designed to uncover both subtle and extreme outliers by capturing latent relationships among multiple physiological parameters. ODBAE's revised loss function enhances its ability to detect two key types of outliers: influential points (IP), which disrupt latent correlations between dimensions, and high leverage points (HLP), which deviate from the norm but go undetected by traditional autoencoder-based methods. Using data from the International Mouse Phenotyping Consortium (IMPC), we show that ODBAE can identify knockout mice with complex, multi-indicator phenotypes - normal in individual traits, but abnormal when considered together. In addition, this method reveals novel metabolism-related genes and uncovers coordinated abnormalities across metabolic indicators. Our results highlight the utility of ODBAE in detecting joint abnormalities and advancing our understanding of homeostatic perturbations in biological systems.
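For orientation, the plain autoencoder baseline that this setting builds on can be sketched as below: reconstruction error flags points that break the latent correlations among indicators; ODBAE's revised (balanced) loss and its IP/HLP distinction are not reproduced here, and the toy data is an assumption.

import torch

torch.manual_seed(0)
n, d = 1000, 6
z = torch.randn(n, 2)
X = torch.cat([z, z @ torch.randn(2, d - 2)], dim=1)  # correlated physiological indicators (toy)
X[:5] += 4 * torch.randn(5, d)                        # a few injected outliers

ae = torch.nn.Sequential(torch.nn.Linear(d, 3), torch.nn.Tanh(), torch.nn.Linear(3, d))
opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
for _ in range(300):
    loss = torch.nn.functional.mse_loss(ae(X), X)
    opt.zero_grad(); loss.backward(); opt.step()

score = ((ae(X) - X) ** 2).mean(dim=1)                # per-sample reconstruction error
print("highest-scoring samples:", score.topk(5).indices.tolist())  # likely the injected outliers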
Authors: Meryem Banu Cavlak, Gagandeep Singh, Mohammed Alser, Can Firtina, Jo\"el Lindegger, Mohammad Sadrosadati, Nika Mansouri Ghiasi, Can Alkan, Onur Mutlu
Abstract: Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall 1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) recall in keeping on-target reads, 2) maintains high accuracy in downstream analysis, and 3) achieves better runtime performance, throughput, recall, precision, and generality compared to prior works. TargetCall is available at https://github.com/CMU-SAFARI/TargetCall.
Authors: Ammar Daskin
Abstract: Ridge functions are used to describe and study lower bounds on the approximation achieved by neural networks that can be written as a linear combination of activation functions. If the activation functions are also ridge functions, these networks are called explainable neural networks. In this brief paper, we first show, using matrix notation, that quantum neural networks based on variational quantum circuits can be written as a linear combination of ridge functions. Consequently, the interpretability and explainability of such quantum neural networks can be directly considered and studied as an approximation with a linear combination of ridge functions.
Authors: Daniel Hothem, Kevin Young, Tommie Catanach, Timothy Proctor
Abstract: Accurately predicting a quantum computer's capability -- which circuits it can run and how well it can run them -- is a foundational goal of quantum characterization and benchmarking. As modern quantum computers become increasingly hard to simulate, we must develop accurate and scalable predictive capability models to help researchers and stakeholders decide which quantum computers to build and use. In this work, we propose a hardware-agnostic method to efficiently construct scalable predictive models of a quantum computer's capability for almost any class of circuits, and demonstrate our method using convolutional neural networks (CNNs). Our CNN-based approach works by efficiently representing a circuit as a three-dimensional tensor and then using a CNN to predict its success rate. Our CNN capability models obtain approximately a $1\%$ average absolute prediction error when modeling processors experiencing both Markovian and non-Markovian stochastic Pauli errors. We also apply our CNNs to model the capabilities of cloud-access quantum computing systems, obtaining moderate prediction accuracy (average absolute error around $2-5\%$), and we highlight the challenges to building better neural network capability models.
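A hedged sketch of the representation-plus-regressor recipe is given below: each circuit is encoded as a (gate-type x qubit x depth) tensor and a small CNN regresses its success rate; the gate alphabet, random circuits, and architecture are toy assumptions, not the paper's exact encoding or model.

import torch

torch.manual_seed(0)
n_gate_types, n_qubits, depth = 4, 5, 12

def random_circuit_tensor():
    # one-hot gate type for every (qubit, layer) slot
    idx = torch.randint(n_gate_types, (n_qubits, depth))
    return torch.nn.functional.one_hot(idx, n_gate_types).permute(2, 0, 1).float()

circuits = torch.stack([random_circuit_tensor() for _ in range(256)])
success = torch.rand(256)                    # stand-in for measured circuit success rates

cnn = torch.nn.Sequential(
    torch.nn.Conv2d(n_gate_types, 16, kernel_size=3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, kernel_size=3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(16, 1), torch.nn.Sigmoid())

opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
for _ in range(50):
    loss = torch.nn.functional.mse_loss(cnn(circuits).squeeze(1), success)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))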
Authors: Shirong Xu, Will Wei Sun, Guang Cheng
Abstract: Synthetic data algorithms are widely employed in industry to generate artificial data for downstream learning tasks. While existing research primarily focuses on empirically evaluating the utility of synthetic data, its theoretical understanding is largely lacking. This paper bridges the practice-theory gap by establishing relevant utility theory in a statistical learning framework. It considers two utility metrics: generalization and ranking of models trained on synthetic data. The former is defined as the generalization difference between models trained on synthetic and on real data. By deriving analytical bounds for this utility metric, we demonstrate that the synthetic feature distribution does not need to be similar to that of real data to ensure comparable generalization of synthetic models, provided proper model specifications in downstream learning tasks. The latter utility metric studies the relative performance of models trained on synthetic data. In particular, we discover that the distribution of synthetic data need not be similar to the real one to ensure consistent model comparison. Interestingly, consistent model comparison is still achievable even when synthetic responses are not well generated, as long as downstream models are separable by a generalization gap. Finally, extensive experiments on non-parametric models and deep neural networks have been conducted to validate these theoretical findings.
Authors: Yi Yan, Ercan Engin Kuruoglu
Abstract: Spatio-temporal estimation of signals on graph edges is challenging because most conventional Graph Signal Processing techniques are defined on the graph nodes. Leveraging the Line Graph transform, the Line Graph Least Mean Square (LGLMS) algorithm is proposed to conduct adaptive estimation of time-varying edge signals by projecting the edge signals from edge space to node space. LGLMS is an adaptive algorithm analogous to the classical LMS algorithm but applied to graph edges. Unlike edge-specific methods, LGLMS retains all GSP concepts and techniques originally designed for graph nodes, without the need for redefinition on the edges. In experiments on transportation and meteorological graphs with noisy and missing signal observations, we confirm that LGLMS is suitable for the online prediction of time-varying edge signals.
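A minimal sketch of the two ingredients the abstract combines: the line-graph construction (edges become nodes) and a plain LMS-style update for tracking a noisy, partially observed edge signal. This is a simplified illustration; the full method would use the line-graph adjacency to define GSP filters or a bandlimited projection, which is omitted here.

```python
import numpy as np

def line_graph(edges):
    """Line graph: each original edge becomes a node; two nodes are adjacent
    if the corresponding edges share an endpoint."""
    m = len(edges)
    A = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            if set(edges[i]) & set(edges[j]):
                A[i, j] = A[j, i] = 1.0
    return A  # in the full method this adjacency would drive node-domain GSP filters

def lms_track(observations, masks, mu=0.5):
    """LMS-style tracking of a time-varying edge signal from noisy,
    partially observed samples; masks[t] marks which edges are observed at time t."""
    m = observations.shape[1]
    x_hat = np.zeros(m)
    estimates = []
    for y_t, m_t in zip(observations, masks):
        err = m_t * (y_t - x_hat)        # error on observed edges only
        x_hat = x_hat + mu * err         # gradient-descent (LMS) correction
        estimates.append(x_hat.copy())
    return np.array(estimates)
```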
Authors: Georg A. Gottwald, Fengyi Li, Youssef Marzouk, Sebastian Reich
Abstract: We consider the problem of sampling from an unknown distribution for which only a sufficiently large number of training samples are available. Such settings have recently drawn considerable interest in the context of generative modelling and Bayesian inference. In this paper, we propose a generative model combining Schr\"odinger bridges and Langevin dynamics. Schr\"odinger bridges over an appropriate reversible reference process are used to approximate the conditional transition probability from the available training samples, which is then implemented in a discrete-time reversible Langevin sampler to generate new samples. By setting the kernel bandwidth in the reference process to match the time step size used in the unadjusted Langevin algorithm, our method effectively circumvents any stability issues typically associated with the time-stepping of stiff stochastic differential equations. Moreover, we introduce a novel split-step scheme, ensuring that the generated samples remain within the convex hull of the training samples. Our framework can be naturally extended to generate conditional samples and to Bayesian inference problems. We demonstrate the performance of our proposed scheme through experiments on synthetic datasets with increasing dimensions and on a stochastic subgrid-scale parametrization conditional sampling problem as well as generating sample trajectories of a dynamical system using conditional sampling.
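For context, a minimal sketch of an unadjusted Langevin sampler driven by a kernel (Gaussian KDE) score estimate built from training samples. This is not the authors' Schrödinger-bridge construction; it only illustrates the kind of sample-based Langevin step the abstract builds on, and tying the bandwidth to the step size (`h = sqrt(eps)`) is an assumption made here for illustration.

```python
import numpy as np

def kde_score(x, samples, h):
    """Score (gradient of log-density) of a Gaussian KDE fitted to the training samples."""
    diffs = samples - x                                # (n, d)
    logw = -np.sum(diffs**2, axis=1) / (2 * h**2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / h**2     # weighted mean of (y_i - x) / h^2

def ula_sample(samples, n_steps=500, eps=0.05, rng=np.random.default_rng(0)):
    """Unadjusted Langevin algorithm using the KDE score as a plug-in drift."""
    d = samples.shape[1]
    h = np.sqrt(eps)                                   # bandwidth tied to step size (assumption)
    x = samples[rng.integers(len(samples))].copy()
    for _ in range(n_steps):
        x = x + eps * kde_score(x, samples, h) + np.sqrt(2 * eps) * rng.normal(size=d)
    return x
```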
Authors: Yize Sun, Zixin Wu, Yunpu Ma, Volker Tresp
Abstract: Unsupervised representation learning presents new opportunities for advancing Quantum Architecture Search (QAS) on Noisy Intermediate-Scale Quantum (NISQ) devices. QAS is designed to optimize quantum circuits for Variational Quantum Algorithms (VQAs). Most QAS algorithms tightly couple the search space and search algorithm, typically requiring the evaluation of numerous quantum circuits, resulting in high computational costs and limiting scalability to larger quantum circuits. Predictor-based QAS algorithms mitigate this issue by estimating circuit performance based on structure or embedding. However, these methods often demand time-intensive labeling to optimize gate parameters across many circuits, which is crucial for training accurate predictors. Inspired by the classical neural architecture search algorithm Arch2vec, we investigate the potential of unsupervised representation learning for QAS without relying on predictors. Our framework decouples unsupervised architecture representation learning from the search process, enabling the learned representations to be applied across various downstream tasks. Additionally, it integrates an improved quantum circuit graph encoding scheme, addressing the limitations of existing representations and enhancing search efficiency. This predictor-free approach removes the need for large labeled datasets. During the search, we employ REINFORCE and Bayesian Optimization to explore the latent representation space and compare their performance against baseline methods. Our results demonstrate that the framework efficiently identifies high-performing quantum circuits with fewer search iterations.
Authors: Shao-Bo Lin
Abstract: This paper focuses on scattered data fitting problems on spheres. We study the approximation performance of a class of weighted spectral filter algorithms, including Tikhonov regularization, Landweber iteration, spectral cut-off, and iterated Tikhonov, in fitting noisy data with possibly unbounded random noise. For the analysis, we develop an integral operator approach that can be regarded as an extension of the widely used sampling inequality approach and norming set method in the community of scattered data fitting. After providing an equivalence between the operator differences and quadrature rules, we succeed in deriving optimal Sobolev-type error estimates of weighted spectral filter algorithms. Our error estimates do not suffer from the saturation phenomenon of Tikhonov regularization reported in the literature or from the native-space barrier of existing error analyses, and they adapt to different embedding spaces. We also propose a divide-and-conquer scheme that reduces the computational burden of weighted spectral filter algorithms and present the corresponding optimal approximation error bounds.
Authors: Paolo Morettin, Andrea Passerini, Roberto Sebastiani
Abstract: In machine learning (ML) verification, the majority of procedures are non-quantitative and therefore cannot be used for verifying probabilistic models, or be applied in domains where hard guarantees are practically unachievable. The probabilistic formal verification (PFV) of ML models is in its infancy, with the existing approaches limited to specific ML models, properties, or both. This contrasts with standard formal methods techniques, whose successful adoption in real-world scenarios is also due to their support for a wide range of properties and diverse systems. We propose a unifying framework for the PFV of ML systems based on Weighted Model Integration (WMI), a relatively recent formalism for probabilistic inference with algebraic and logical constraints. Crucially, reducing the PFV of ML models to WMI enables the verification of many properties of interest over a wide range of systems, addressing multiple limitations of deterministic verification and ad-hoc algorithms. We substantiate the generality of the approach on prototypical tasks involving the verification of group fairness, monotonicity, robustness to noise, probabilistic local robustness and equivalence among predictors. We characterize the challenges related to the scalability of the approach and, through our WMI-based perspective, we show how successful scaling techniques in the ML verification literature can be generalized beyond their original scope.
Authors: Kento Kawaharazuka, Tatsuya Matsushima, Andrew Gambardella, Jiaxian Guo, Chris Paxton, Andy Zeng
Abstract: Recent developments in foundation models, like Large Language Models (LLMs) and Vision-Language Models (VLMs), trained on extensive data, facilitate flexible application across different tasks and modalities. Their impact spans various fields, including healthcare, education, and robotics. This paper provides an overview of the practical application of foundation models in real-world robotics, with a primary emphasis on the replacement of specific components within existing robot systems. The summary encompasses the perspective of input-output relationships in foundation models, as well as their role in perception, motion planning, and control within the field of robotics. This paper concludes with a discussion of future challenges and implications for practical robot applications.
Authors: Yiyun He, Roman Vershynin, Yizhe Zhu
Abstract: We present a polynomial-time algorithm for online differentially private synthetic data generation. For a data stream within the hypercube $[0,1]^d$ and an infinite time horizon, we develop an online algorithm that generates a differentially private synthetic dataset at each time $t$. This algorithm achieves a near-optimal accuracy bound of $O(\log(t)t^{-1/d})$ for $d\geq 2$ and $O(\log^{4.5}(t)t^{-1})$ for $d=1$ in the 1-Wasserstein distance. This result extends the previous work on the continual release model for counting queries to Lipschitz queries. Compared to the offline case, where the entire dataset is available at once, our approach requires only an extra polylog factor in the accuracy bound.
Authors: Banghua Zhu, Norman Mu, Jiantao Jiao, David Wagner
Abstract: Generative AI's expanding footprint across numerous industries has led to both excitement and increased scrutiny. This paper delves into the unique security challenges posed by Generative AI, and outlines potential research directions for managing these risks.
Authors: Clement Neo, Shay B. Cohen, Fazl Barez
Abstract: Understanding the inner workings of large language models (LLMs) is crucial for advancing their theoretical foundations and real-world applications. While the attention mechanism and multi-layer perceptrons (MLPs) have been studied independently, their interactions remain largely unexplored. This study investigates how attention heads and next-token neurons interact in LLMs to predict new words. We propose a methodology to identify next-token neurons, find prompts that highly activate them, and determine the upstream attention heads responsible. We then generate and evaluate explanations for the activity of these attention heads in an automated manner. Our findings reveal that some attention heads recognize specific contexts relevant to predicting a token and activate a downstream token-predicting neuron accordingly. This mechanism provides a deeper understanding of how attention heads work with MLP neurons to perform next-token prediction. Our approach offers a foundation for further research into the intricate workings of LLMs and their impact on text generation and understanding.
Authors: Yang Peng, Liangyu Zhang, Zhihua Zhang
Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in the field of DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. Distributional temporal difference (TD) learning has accordingly been proposed as an extension of classical TD learning. In the tabular case, \citet{rowland2018analysis} and \citet{rowland2023analysis} proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD learning (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that for NTD we need $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors in the case of the $1$-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest. In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance for $p\geq 1$.
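For readers unfamiliar with CTD, here is a minimal sketch of one categorical TD update for tabular distributional policy evaluation: the Bellman target distribution is projected onto a fixed support and mixed into the current estimate. This is the standard CTD/C51-style projection, not the paper's NTD estimator.

```python
import numpy as np

def categorical_td_update(probs, r, s, s_next, support, gamma=0.99, alpha=0.1):
    """One CTD update; probs[s] is a categorical distribution over the fixed return support."""
    v_min, v_max = support[0], support[-1]
    dz = support[1] - support[0]
    target = np.zeros_like(support)
    # Project the shifted/scaled distribution r + gamma * Z(s') back onto the support.
    for p, z in zip(probs[s_next], support):
        tz = np.clip(r + gamma * z, v_min, v_max)
        b = (tz - v_min) / dz
        lo = int(np.floor(b))
        hi = min(int(np.ceil(b)), len(support) - 1)
        if lo == hi:
            target[lo] += p
        else:
            target[lo] += p * (hi - b)
            target[hi] += p * (b - lo)
    probs[s] = (1 - alpha) * probs[s] + alpha * target   # incremental mixing step
    return probs
```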
Authors: Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, Tongliang Liu
Abstract: The vulnerability of deep neural networks to imperceptible adversarial perturbations has attracted widespread attention. Inspired by the success of vision-language foundation models, previous efforts achieved zero-shot adversarial robustness by aligning adversarial visual features with text supervision. However, in practice, they are still unsatisfactory due to several issues, including heavy adaptation cost, suboptimal text supervision, and uncontrolled natural generalization capacity. In this paper, to address these issues, we propose a few-shot adversarial prompt framework in which adapting input sequences with limited data yields significant adversarial robustness improvements. Specifically, we achieve this by providing adversarially correlated text supervision that is end-to-end learned from adversarial examples. We also propose a novel training objective that enhances the consistency of multi-modal features while encouraging differentiated uni-modal features between natural and adversarial examples. The proposed framework enables learning adversarial text supervision, which provides superior cross-modal adversarial alignment and matches state-of-the-art zero-shot adversarial robustness with only 1% of the training data. Code is available at: https://github.com/lionel-w2/FAP.
Authors: Zihao Wang, Zhe Wu
Abstract: In this work, we present a novel application of foundation models for chemical reactor modeling. Accurate modeling of real-world chemical reactors through first-principles is often challenging, and the process of rebuilding and retraining models for each new chemical process is inefficient. This raises a critical question: can we develop a single, universal neural network (i.e., a foundation model) that can rapidly adapt to any new chemical process in a reactor? To address this, we propose a foundation model for chemical reactor modeling that employs a meta-learning approach, followed by physics-informed fine-tuning on new tasks with only a few data samples. Our model is designed to generalize across three classic reactor types: continuous stirred tank reactors, batch reactors, and plug flow reactors. Compared to conventional methods such as data-driven learning, physics-informed learning, transfer learning, and meta-learning, our approach demonstrates superior performance in few-shot scenarios. Specifically, it shows rapid adaptation to unseen reactions with varying integer orders across different reactor set-ups, requiring minimal data for fine-tuning. Source code is available at https://github.com/killingbear999/chemical-reactor-foundation-model.
URLs: https://github.com/killingbear999/chemical-reactor-foundation-model.
Authors: James Requeima, John Bronskill, Dami Choi, Richard E. Turner, David Duvenaud
Abstract: Machine learning practitioners often face significant challenges in formally integrating their prior knowledge and beliefs into predictive models, limiting the potential for nuanced and context-aware analyses. Moreover, the expertise needed to integrate this prior knowledge into probabilistic modeling typically limits the application of these models to specialists. Our goal is to build a regression model that can process numerical data and make probabilistic predictions at arbitrary locations, guided by natural language text which describes a user's prior knowledge. Large Language Models (LLMs) provide a useful starting point for designing such a tool since they 1) provide an interface where users can incorporate expert insights in natural language and 2) provide an opportunity for leveraging latent problem-relevant knowledge encoded in LLMs that users may not have themselves. We start by exploring strategies for eliciting explicit, coherent numerical predictive distributions from LLMs. We examine these joint predictive distributions, which we call LLM Processes, over arbitrarily-many quantities in settings such as forecasting, multi-dimensional regression, black-box optimization, and image modeling. We investigate the practical details of prompting to elicit coherent predictive distributions, and demonstrate their effectiveness at regression. Finally, we demonstrate the ability to usefully incorporate text into numerical predictions, improving predictive performance and giving quantitative structure that reflects qualitative descriptions. This lets us begin to explore the rich, grounded hypothesis space that LLMs implicitly encode.
Authors: Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva
Abstract: Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. To ensure reproducibility, each task includes dedicated initialization, success-checking, and tear-down logic, which modifies and inspects the device's system state. We experiment with baseline agents to test AndroidWorld and provide initial results on the benchmark. Our best agent can complete 30.6% of AndroidWorld's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-platform agents. Finally, we also conduct a robustness analysis, showing that task variations can significantly affect agent performance, demonstrating that without such testing, agent performance metrics may not fully reflect practical challenges. AndroidWorld and the experiments in this paper are available at github.com/google-research/android_world.
Authors: Kihyuk Hong, Woojin Chae, Yufan Zhang, Dabeen Lee, Ambuj Tewari
Abstract: We study the infinite-horizon average-reward reinforcement learning with linear MDPs. Previous approaches either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity, for achieving a regret bound of $\widetilde{O}(\sqrt{T})$. In this paper, we propose an algorithm that achieves the regret bound of $\widetilde{O}(\sqrt{T})$ and is computationally efficient in the sense that the time complexity is polynomial in problem parameters. Our algorithm runs an optimistic value iteration on a discounted-reward MDP that approximates the average-reward setting. With an appropriately tuned discounting factor $\gamma$, the algorithm attains the desired $\widetilde{O}(\sqrt{T})$ regret. The challenge in our approximation approach is to get a regret bound with a sharp dependency on the effective horizon $1 / (1 - \gamma)$. We address this challenge by clipping the value function obtained at each value iteration step to limit the span of the value function.
Authors: Cencheng Shen
Abstract: Graph encoder embedding, a recent technique for graph data, offers speed and scalability in producing vertex-level representations from binary graphs. In this paper, we extend the applicability of this method to a general graph model, which includes weighted graphs, distance matrices, and kernel matrices. We prove that the encoder embedding satisfies the law of large numbers and the central limit theorem on a per-observation basis. Under certain conditions, it achieves asymptotic normality on a per-class basis, enabling optimal classification through discriminant analysis. These theoretical findings are validated through a series of experiments involving weighted graphs, as well as text and image data transformed into general graph representations using appropriate distance metrics.
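To make the core computation concrete, a minimal sketch of the one-hot graph encoder embedding: the (possibly weighted) adjacency or kernel matrix is projected onto class-indicator vectors normalized by class size. This is a simplified illustration under the abstract's general graph model, not the authors' exact implementation.

```python
import numpy as np

def graph_encoder_embedding(A, labels, n_classes):
    """A: (n, n) adjacency / distance / kernel matrix; labels: integer class labels.
    Returns (n, n_classes) vertex embeddings."""
    n = A.shape[0]
    W = np.zeros((n, n_classes))
    for k in range(n_classes):
        idx = np.where(labels == k)[0]
        W[idx, k] = 1.0 / len(idx)      # one-hot indicator column scaled by 1 / n_k
    return A @ W                         # each row averages a vertex's connectivity per class
```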
Authors: Wei Jiang, Sifan Yang, Yibo Wang, Lijun Zhang
Abstract: This paper explores adaptive variance reduction methods for stochastic optimization based on the STORM technique. Existing adaptive extensions of STORM rely on strong assumptions like bounded gradients and bounded function values, or suffer an additional $\mathcal{O}(\log T)$ term in the convergence rate. To address these limitations, we introduce a novel adaptive STORM method that achieves an optimal convergence rate of $\mathcal{O}(T^{-1/3})$ for non-convex functions with our newly designed learning rate strategy. Compared with existing approaches, our method requires weaker assumptions and attains the optimal convergence rate without the additional $\mathcal{O}(\log T)$ term. We also extend the proposed technique to stochastic compositional optimization, obtaining the same optimal rate of $\mathcal{O}(T^{-1/3})$. Furthermore, we investigate the non-convex finite-sum problem and develop another innovative adaptive variance reduction method that achieves an optimal convergence rate of $\mathcal{O}(n^{1/4} T^{-1/2} )$, where $n$ represents the number of component functions. Numerical experiments across various tasks validate the effectiveness of our method.
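As background, a minimal sketch of the basic STORM recursion that the adaptive variants above build on; the constant learning rate and momentum parameter below are plain placeholders, not the paper's newly designed adaptive schedule.

```python
import numpy as np

def storm(grad, x0, n_steps=1000, a=0.1, eta=0.01, rng=np.random.default_rng(0)):
    """STORM: d_t = g(x_t; xi_t) + (1 - a) * (d_{t-1} - g(x_{t-1}; xi_t)),
    where both stochastic gradients share the *same* sample xi_t.
    grad(x, xi) returns a stochastic gradient of the objective at x."""
    x_prev = x = np.array(x0, dtype=float)
    d = grad(x, rng.standard_normal())           # initialize estimator with one stochastic gradient
    for _ in range(n_steps):
        x_prev, x = x, x - eta * d               # descent step using the variance-reduced estimate
        xi = rng.standard_normal()
        d = grad(x, xi) + (1 - a) * (d - grad(x_prev, xi))
    return x

# Toy usage: noisy quadratic f(x) = 0.5 * ||x||^2 with additive gradient noise.
sol = storm(lambda x, xi: x + 0.1 * xi, x0=[5.0, -3.0])
```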
Authors: Md Saiful Islam, Tariq Adnan, Jan Freyberg, Sangwu Lee, Abdelrahman Abdelkader, Meghan Pawlik, Cathe Schwartz, Karen Jaffe, Ruth B. Schneider, E Ray Dorsey, Ehsan Hoque
Abstract: Limited accessibility to neurological care leads to underdiagnosed Parkinson's Disease (PD), preventing early intervention. Existing AI-based PD detection methods primarily focus on unimodal analysis of motor or speech tasks, overlooking the multifaceted nature of the disease. To address this, we introduce a large-scale, multi-task video dataset consisting of 1102 sessions (each containing videos of finger tapping, facial expression, and speech tasks captured via webcam) from 845 participants (272 with PD). We propose a novel Uncertainty-calibrated Fusion Network (UFNet) that leverages this multimodal data to enhance diagnostic accuracy. UFNet employs independent task-specific networks, trained with Monte Carlo Dropout for uncertainty quantification, followed by self-attended fusion of features, with attention weights dynamically adjusted based on task-specific uncertainties. To ensure patient-centered evaluation, the participants were randomly split into three sets: 60% for training, 20% for model selection, and 20% for final performance evaluation. UFNet significantly outperformed single-task models in terms of accuracy, area under the ROC curve (AUROC), and sensitivity while maintaining non-inferior specificity. Withholding uncertain predictions further boosted the performance, achieving 88.0+-0.3% accuracy, 93.0+-0.2% AUROC, 79.3+-0.9% sensitivity, and 92.6+-0.3% specificity, at the expense of not being able to predict for 2.3+-0.3% of the data (+- denotes 95% confidence interval). Further analysis suggests that the trained model does not exhibit any detectable bias across sex and ethnic subgroups and is most effective for individuals aged between 50 and 80. Requiring only a webcam and microphone, our approach facilitates accessible home-based PD screening, especially in regions with limited healthcare resources.
Authors: Alicja R\k{a}czkowska, Aleksandra Osowska-Kurczab, Jacek Szczerbi\'nski, Kalina Jasinska-Kobus, Klaudia Nazarko
Abstract: Label noise remains a challenge for training robust classification models. Most methods for mitigating label noise have been benchmarked using primarily datasets with synthetic noise. While the need for datasets with realistic noise distribution has partially been addressed by web-scraped benchmarks such as WebVision and Clothing1M, those benchmarks are restricted to the computer vision domain. With the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels. In this paper, we present AlleNoise, a new curated text classification benchmark dataset with real-world instance-dependent label noise, containing over 500,000 examples across approximately 5,600 classes, complemented with a meaningful, hierarchical taxonomy of categories. The noise distribution comes from actual users of a major e-commerce marketplace, so it realistically reflects the semantics of human mistakes. In addition to the noisy labels, we provide human-verified clean labels, which help to get a deeper insight into the noise distribution, unlike web-scraped datasets typically used in the field. We demonstrate that a representative selection of established methods for learning with noisy labels is inadequate to handle such real-world noise. In addition, we show evidence that these algorithms do not alleviate excessive memorization. As such, with AlleNoise, we set the bar high for the development of label noise methods that can handle real-world label noise in text classification tasks. The code and dataset are available for download at https://github.com/allegro/AlleNoise.
Authors: Alexandru M. Gherghescu, Vlad-Andrei B\u{a}doiu, Alexandru Agache, Mihai-Valentin Dumitru, Iuliu Vasilescu, Radu Mantu, Costin Raiciu
Abstract: Hyperscalers dominate the landscape of large network deployments, yet they rarely share data or insights about the challenges they face. In light of this supremacy, what problems can we find to solve in this space? We take an unconventional approach to find relevant research directions, starting from public plans to build a $100 billion datacenter for machine learning applications. Leveraging language model scaling laws, we discover what workloads such a datacenter might carry and explore the challenges one may encounter in doing so, with a focus on networking research. We conclude that building the datacenter and training such models is technically possible, but this requires novel wide-area transports for inter-DC communication, a multipath transport and novel datacenter topologies for intra-datacenter communication, and high-speed scale-up networks and transports, outlining a rich research agenda for the networking community.
Authors: Jan Ole von Hartz, Tim Welschehold, Abhinav Valada, Joschka Boedecker
Abstract: Task Parametrized Gaussian Mixture Models (TP-GMM) are a sample-efficient method for learning object-centric robot manipulation tasks. However, there are several open challenges to applying TP-GMMs in the wild. In this work, we tackle three crucial challenges synergistically. First, end-effector velocities are non-Euclidean and thus hard to model using standard GMMs. We thus propose to factorize the robot's end-effector velocity into its direction and magnitude, and model them using Riemannian GMMs. Second, we leverage the factorized velocities to segment and sequence skills from complex demonstration trajectories. Through the segmentation, we further align skill trajectories and hence leverage time as a powerful inductive bias. Third, we present a method to automatically detect relevant task parameters per skill from visual observations. Our approach enables learning complex manipulation tasks from just five demonstrations while using only RGB-D observations. Extensive experimental evaluations on RLBench demonstrate that our approach achieves state-of-the-art performance with 20-fold improved sample efficiency. Our policies generalize across different environments, object instances, and object positions, while the learned skills are reusable.
Authors: Cheng Shi, Liming Pan, Ivan Dokmani\'c
Abstract: Feature-learning deep nets progressively collapse data to a regular low-dimensional geometry. How this phenomenon emerges from collective action of nonlinearity, noise, learning rate, and other choices that shape the dynamics, has eluded first-principles theories built from microscopic neuronal dynamics. We exhibit a noise-nonlinearity phase diagram that identifies regimes where shallow or deep layers learn more effectively. We then propose a macroscopic mechanical theory that reproduces the diagram, explaining why some DNNs are lazy and some active, and linking feature learning across layers to generalization.
Authors: Weilin Cai, Le Qin, Jiayi Huang
Abstract: As large language models continue to scale up, distributed training systems have expanded beyond 10k nodes, intensifying the importance of fault tolerance. Checkpoint has emerged as the predominant fault tolerance strategy, with extensive studies dedicated to optimizing its efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model presents new challenges due to the substantial increase in model size, despite comparable computational demands to dense models. In this work, we propose the Mixture-of-Checkpoint System (MoC-System) to orchestrate the vast array of checkpoint shards produced in distributed training systems. MoC-System features a novel Partial Experts Checkpointing (PEC) mechanism, an algorithm-system co-design that strategically saves a selected subset of experts, effectively reducing the MoE checkpoint size to levels comparable with dense models. Incorporating hybrid parallel strategies, MoC-System involves fully sharded checkpointing strategies to evenly distribute the workload across distributed ranks. Furthermore, MoC-System introduces a two-level checkpointing management method that asynchronously handles in-memory snapshots and persistence processes. We build MoC-System upon the Megatron-DeepSpeed framework, achieving up to a 98.9% reduction in overhead for each checkpointing process compared to the original method, during MoE model training with ZeRO-2 data parallelism and expert parallelism. Additionally, extensive empirical analyses substantiate that our methods enhance efficiency while maintaining comparable model accuracy, even achieving an average accuracy increase of 1.08% on downstream tasks.
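A minimal sketch of the partial-expert-checkpointing idea described above: dense (non-expert) parameters are always saved, while only a rotating subset of experts is written at each checkpoint. The `experts.<id>.` naming convention and helper names here are hypothetical and do not reflect the MoC-System implementation.

```python
def partial_expert_checkpoint(state_dict, keep_expert_ids):
    """Keep all non-expert parameters plus only the selected experts.
    Assumes expert parameters carry an 'experts.<id>.' marker in their names
    (a hypothetical convention; real MoE frameworks differ)."""
    def keep(name):
        if "experts." not in name:
            return True                                    # dense parameters: always saved
        expert_id = int(name.split("experts.")[1].split(".")[0])
        return expert_id in keep_expert_ids                # save only the chosen expert subset
    return {k: v for k, v in state_dict.items() if keep(k)}

def experts_for_step(step, n_experts, k):
    """Rotate which k experts are saved at each checkpointing step."""
    start = (step * k) % n_experts
    return {(start + i) % n_experts for i in range(k)}
```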
Authors: Jiarui Xie, Mutahar Safdar, Lequn Chen, Seung Ki Moon, Yaoyao Fiona Zhao
Abstract: Various machine learning (ML)-based in-situ monitoring systems have been developed to detect anomalies and defects in laser additive manufacturing (LAM) processes. While multimodal fusion, which integrates data from visual, audio, and other modalities, can improve monitoring performance, it also increases hardware, computational, and operational costs due to the use of multiple sensor types. This paper introduces a cross-modality knowledge transfer (CMKT) methodology for LAM in-situ monitoring, which transfers knowledge from a source modality to a target modality. CMKT enhances the representativeness of the features extracted from the target modality, allowing the removal of source modality sensors during prediction. This paper proposes three CMKT methods: semantic alignment, fully supervised mapping, and semi-supervised mapping. The semantic alignment method establishes a shared encoded space between modalities to facilitate knowledge transfer. It employs a semantic alignment loss to align the distributions of identical groups (e.g., visual and audio defective groups) and a separation loss to distinguish different groups (e.g., visual defective and audio defect-free groups). The two mapping methods transfer knowledge by deriving features from one modality to another using fully supervised and semi-supervised learning approaches. In a case study for LAM in-situ defect detection, the proposed CMKT methods were compared with multimodal audio-visual fusion. The semantic alignment method achieved an accuracy of 98.7% while removing the audio modality during the prediction phase, which is comparable to the 98.2% accuracy obtained through multimodal fusion. Using explainable artificial intelligence, we discovered that semantic alignment CMKT can extract more representative features while reducing noise by leveraging the inherent correlations between modalities.
Authors: Xin Zhang, Jiawei Du, Ping Liu, Joey Tianyi Zhou
Abstract: Dataset distillation has emerged as a technique aiming to condense informative features from large, natural datasets into a compact and synthetic form. While recent advancements have refined this technique, its performance is bottlenecked by the prevailing class-specific synthesis paradigm. Under this paradigm, synthetic data is optimized exclusively for a pre-assigned one-hot label, creating an implicit class barrier in feature condensation. This leads to inefficient utilization of the distillation budget and oversight of inter-class feature distributions, which ultimately limits the effectiveness and efficiency, as demonstrated in our analysis. To overcome these constraints, this paper presents the Inter-class Feature Compensator (INFER), an innovative distillation approach that transcends the class-specific data-label framework widely utilized in current dataset distillation methods. Specifically, INFER leverages a Universal Feature Compensator (UFC) to enhance feature integration across classes, enabling the generation of multiple additional synthetic instances from a single UFC input. This significantly improves the efficiency of the distillation budget. Moreover, INFER enriches inter-class interactions during the distillation, thereby enhancing the effectiveness and generalizability of the distilled data. By allowing for the linear interpolation of labels similar to those in the original dataset, INFER meticulously optimizes the synthetic data and dramatically reduces the size of soft labels in the synthetic dataset to almost zero, establishing a new benchmark for efficiency and effectiveness in dataset distillation.
Authors: Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Abstract: To solve ever more complex problems, Deep Neural Networks are scaled to billions of parameters, leading to huge computational costs. An effective approach to reduce computational requirements and increase efficiency is to prune unnecessary components of these often over-parameterized networks. Previous work has shown that attribution methods from the field of eXplainable AI serve as effective means to extract and prune the least relevant network components in a few-shot fashion. We extend the current state by proposing to explicitly optimize hyperparameters of attribution methods for the task of pruning, and further include transformer-based networks in our analysis. Our approach yields higher model compression rates of large transformer- and convolutional architectures (VGG, ResNet, ViT) compared to previous works, while still attaining high performance on ImageNet classification tasks. Here, our experiments indicate that transformers have a higher degree of over-parameterization compared to convolutional neural networks. Code is available at https://github.com/erfanhatefi/Pruning-by-eXplaining-in-PyTorch.
URLs: https://github.com/erfanhatefi/Pruning-by-eXplaining-in-PyTorch.
Authors: Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu
Abstract: VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: the unified vision tower, which aligns discrete visual tokens with textual inputs during pretraining and thereby enhances visual perception, and the observation that autoregressive image generation can reach quality similar to diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models while using a fully token-based autoregressive framework.
Authors: Christopher C. Price, Yansong Li, Guanyu Zhou, Rehan Younas, Spencer S. Zeng, Tim H. Scanlon, Jason M. Munro, Christopher L. Hinkle
Abstract: Materials synthesis optimization is constrained by serial feedback processes that rely on manual tools and intuition across multiple siloed modes of characterization. We automate and generalize feature extraction of reflection high-energy electron diffraction (RHEED) data with machine learning to establish quantitatively predictive relationships in small sets (~10) of expert-labeled data, saving significant time on subsequently grown samples. These predictive relationships are evaluated in a representative material system (W$_{1-x}$V$_x$Se$_2$ on c-plane sapphire (0001)) with two aims: 1) predicting grain alignment of the deposited film using pre-growth substrate data, and 2) estimating vanadium dopant concentration using in-situ RHEED as a proxy for ex-situ methods (e.g. x-ray photoelectron spectroscopy). Both tasks are accomplished using the same materials-agnostic features, avoiding specific system retraining and leading to a potential 80% time saving over a 100-sample synthesis campaign. These predictions provide guidance to avoid doomed trials, reduce follow-on characterization, and improve control resolution for materials synthesis.
Authors: Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Abstract: We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we release the model weights at https://huggingface.co/nvidia/NVLM-D-72B and will open-source the training code for the community soon.
Authors: Debargha Ganguly, Srinivasan Iyengar, Vipin Chaudhary, Shivkumar Kalyanaraman
Abstract: Large Language Models (LLMs) have revolutionized natural language processing, yet they struggle with inconsistent reasoning, particularly in novel domains and complex logical sequences. This research introduces Proof of Thought, a framework that enhances the reliability and transparency of LLM outputs. Our approach bridges LLM-generated ideas with formal logic verification, employing a custom interpreter to convert LLM outputs into First Order Logic constructs for theorem prover scrutiny. Central to our method is an intermediary JSON-based Domain-Specific Language, which by design balances precise logical structures with intuitive human concepts. This hybrid representation enables both rigorous validation and accessible human comprehension of LLM reasoning processes. Key contributions include a robust type system with sort management for enhanced logical integrity, explicit representation of rules for clear distinction between factual and inferential knowledge, and a flexible architecture that allows for easy extension to various domain-specific applications. We demonstrate Proof of Thought's effectiveness through benchmarking on StrategyQA and a novel multimodal reasoning task, showing improved performance in open-ended scenarios. By providing verifiable and interpretable results, our technique addresses critical needs for AI system accountability and sets a foundation for human-in-the-loop oversight in high-stakes domains.
Authors: Maor Ashkenazi, Eran Treister
Abstract: Implicit Neural Representations (INRs) have attracted growing interest in recent years due to their ability to encode natural signals using neural networks. While INRs allow for useful applications such as interpolating new coordinates and signal compression, their black-box nature makes it difficult to modify them post-training. In this paper we explore the idea of editable INRs, and specifically focus on the widely used cropping operation. To this end, we present Local-Global SIRENs -- a novel INR architecture that supports cropping by design. Local-Global SIRENs are based on combining local and global feature extraction for signal encoding. What makes their design unique is the ability to effortlessly remove specific portions of an encoded signal, with a proportional weight decrease. This is achieved by eliminating the corresponding weights from the network, without the need for retraining. We further show how this architecture can be used to support the straightforward extension of previously encoded signals. Beyond signal editing, we examine how the Local-Global approach can accelerate training, enhance encoding of various signals, improve downstream performance, and be applied to modern INRs such as INCODE, highlighting its potential and flexibility. Code is available at https://github.com/maorash/Local-Global-INRs.
Authors: Ganchao Wei, Li Ma
Abstract: Flow matching (FM) is a family of training algorithms for fitting continuous normalizing flows (CNFs). A standard approach to FM, called conditional flow matching (CFM), exploits the fact that the marginal vector field of a CNF can be learned by fitting least-square regression to the so-called conditional vector field specified given one or both ends of the flow path. We show that viewing CFM training from a Bayesian decision theoretic perspective on parameter estimation opens the door to generalizations of CFM algorithms. We propose one such extension by introducing a CFM algorithm based on defining conditional probability paths given what we refer to as ``streams'', instances of latent stochastic paths that connect pairs of noise and observed data. Further, we advocate the modeling of these latent streams using Gaussian processes (GPs). The unique distributional properties of GPs, and in particular the fact that the velocity of a GP is still a GP, allows drawing samples from the resulting stream-augmented conditional probability path without simulating the actual streams, and hence the ``simulation-free'' nature of CFM training is preserved. We show that this generalization of the CFM can substantially reduce the variance in the estimated marginal vector field at a moderate computational cost, thereby improving the quality of the generated samples under common metrics. Additionally, we show that adopting the GP on the streams allows for flexibly linking multiple related training data points (e.g., time series) and incorporating additional prior information. We empirically validate our claim through both simulations and applications to two hand-written image datasets.
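For orientation, a minimal sketch of the standard CFM objective with a linear (straight-line) conditional path, which the abstract generalizes via GP-modeled streams; the GP-stream extension itself is not implemented here, and flat feature vectors of shape (batch, dim) are assumed.

```python
import torch

def cfm_loss(model, x1, sigma_min=1e-2):
    """Standard conditional flow matching loss: regress the model's velocity
    prediction onto the conditional vector field of a linear noise-to-data path."""
    x0 = torch.randn_like(x1)                              # noise endpoint
    t = torch.rand(x1.shape[0], 1)                         # uniform path time
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1           # point on the conditional path
    target_v = x1 - (1 - sigma_min) * x0                   # conditional vector field
    pred_v = model(xt, t)
    return ((pred_v - target_v) ** 2).mean()

# Training step sketch, where `model` maps (x_t, t) -> velocity:
# loss = cfm_loss(model, batch); loss.backward(); optimizer.step()
```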
Authors: Bolun "Namir" Xia, Aparna Gupta, Mohammed J. Zaki
Abstract: The advent of large language models (LLMs) has initiated much research into their various financial applications. However, in applying LLMs on long documents, semantic relations are not explicitly incorporated, and a full or arbitrarily sparse attention operation is employed. In recent years, progress has been made in Abstract Meaning Representation (AMR), which is a graph-based representation of text to preserve its semantic relations. Since AMR can represent semantic relationships at a deeper level, it can be beneficially utilized by graph neural networks (GNNs) for constructing effective document-level graph representations built upon LLM embeddings to predict target metrics in the financial domain. We propose FLAG: Financial Long document classification via AMR-based GNN, an AMR graph based framework to generate document-level embeddings for long financial document classification. We construct document-level graphs from sentence-level AMR graphs, endow them with specialized LLM word embeddings in the financial domain, apply a deep learning mechanism that utilizes a GNN, and examine the efficacy of our AMR-based approach in predicting labeled target data from long financial documents. Extensive experiments are conducted on a dataset of quarterly earnings calls transcripts of companies in various sectors of the economy, as well as on a corpus of more recent earnings calls of companies in the S&P 1500 Composite Index. We find that our AMR-based approach outperforms fine-tuning LLMs directly on text in predicting stock price movement trends at different time horizons in both datasets. Our work also outperforms previous work utilizing document graphs and GNNs for text classification.
Authors: Nidhi Munikote
Abstract: As quantum computers continue to become more capable, the possibilities of their applications increase. For example, quantum techniques are being integrated with classical neural networks to perform machine learning. In order to be used in this way, or for any other widespread use like quantum chemistry simulations or cryptographic applications, classical data must be converted into quantum states through quantum encoding. There are three fundamental encoding methods: basis, amplitude, and rotation, as well as several proposed combinations. This study explores the encoding methods, specifically in the context of hybrid quantum-classical machine learning. Using the QuClassi quantum neural network architecture to perform binary classification of the `3' and `6' digits from the MNIST dataset, this study obtains several metrics such as accuracy, entropy, loss, and resistance to noise, while considering resource usage and computational complexity to compare the three main encoding methods.
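A small, purely classical (NumPy) illustration of two of the three encoding methods mentioned above, showing the state vectors they produce; this is a didactic sketch, not the QuClassi implementation.

```python
import numpy as np

def amplitude_encode(x):
    """Amplitude encoding: an L2-normalized feature vector becomes the amplitudes
    of a log2(len(x))-qubit state (the vector is zero-padded to a power of two)."""
    dim = 1 << int(np.ceil(np.log2(len(x))))
    padded = np.zeros(dim)
    padded[:len(x)] = x
    return padded / np.linalg.norm(padded)

def rotation_encode(x):
    """Rotation (angle) encoding: each feature sets a single-qubit RY angle,
    giving the product state prod_i (cos(x_i/2)|0> + sin(x_i/2)|1>)."""
    state = np.array([1.0])
    for xi in x:
        qubit = np.array([np.cos(xi / 2), np.sin(xi / 2)])
        state = np.kron(state, qubit)
    return state

# Basis encoding simply maps a bit string b to the computational basis state |b>.
```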
Authors: Isack Lee, Haebin Seong
Abstract: Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as `jailbreaks', where malicious inputs can coerce LLMs into generating harmful content. To address these issues, many LLM developers have implemented various safety measures to align these models. This alignment involves several techniques, including data filtering during pre-training, supervised fine-tuning, reinforcement learning from human feedback, and red-teaming exercises. These methods often introduce deliberate and intentional biases similar to Political Correctness (PC) to ensure the ethical behavior of LLMs. In this paper, we delve into the intentional biases injected into LLMs for safety purposes and examine methods to circumvent these safety alignment techniques. Notably, these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of PCJailbreak, highlighting the inherent risks posed by these safety-induced biases. Additionally, we propose an efficient defense method PCDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. PCDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize the urgent need for LLM developers to adopt a more responsible approach when designing and implementing safety measures.
Authors: Zhengzheng Tang, Eva Zhu
Abstract: This study introduces BrainTransformers, an innovative Large Language Model (LLM) implemented using Spiking Neural Networks (SNN). Our key contributions include: (1) designing SNN-compatible Transformer components such as SNNMatmul, SNNSoftmax, and SNNSiLU; (2) implementing an SNN approximation of the SiLU activation function; and (3) developing a Synapsis module to simulate synaptic plasticity. Our 3-billion parameter model, BrainTransformers-3B-Chat, demonstrates competitive performance across various benchmarks, including MMLU (63.2), BBH (54.1), ARC-C (54.3), and GSM8K (76.3), while potentially offering improved energy efficiency and biological plausibility. The model employs a three-stage training approach, including SNN-specific neuronal synaptic plasticity training. This research opens new avenues for brain-like AI systems in natural language processing and neuromorphic computing. Future work will focus on hardware optimization, developing specialized SNN fine-tuning tools, and exploring practical applications in energy-efficient computing environments.
Authors: Ariel Neufeld, Philipp Schmocker
Abstract: In this paper, we generalize the universal approximation property of single-hidden-layer feed-forward neural networks beyond the classical formulation over compact domains. More precisely, by assuming that the activation function is non-polynomial, we derive universal approximation results for neural networks within function spaces over non-compact subsets of a Euclidean space, e.g., weighted spaces, $L^p$-spaces, and (weighted) Sobolev spaces over unbounded domains, where the latter includes the approximation of the (weak) derivatives. Furthermore, we provide some dimension-independent rates for approximating a function with sufficiently regular and integrable Fourier transform by neural networks with non-polynomial activation function.
Authors: Wei Xie, Shuoyoucheng Ma, Zhenhua Wang, Enze Wang, Baosheng Wang, Jinshu Su
Abstract: Despite their proficiency in math tasks, the mechanisms underlying LLMs' mathematical reasoning abilities remain a subject of debate. Recent studies suggest that chain-of-thought (CoT) prompts can bolster mathematical reasoning by encouraging LLMs to employ human-like logical reasoning (System 2), enabling them to excel on the Cognitive Reflection Test (CRT). To assess whether LLMs genuinely possess System 2-like logical reasoning, we introduced targeted modifications to CRT problems. Our findings reveal that, despite the use of CoT prompts, mainstream LLMs, including the latest o1-preview model, continue to exhibit a significant error rate. Further analysis indicates that they predominantly rely on System 1-like intuitive reasoning and pattern matching derived from training data, rather than demonstrating mastery of mathematical thinking. This discovery challenges the prevailing notion that LLMs possess genuine logical reasoning abilities and that CoT can enhance them. Consequently, this work may temper overly optimistic projections regarding LLMs' advancement toward artificial general intelligence.
Authors: Wenrui Gou, Wenhui Ge, Yang Tan, Mingchen Li, Guisheng Fan, Huiqun Yu
Abstract: Protein structures are important for understanding their functions and interactions. Currently, many protein structure prediction methods are enriching the structure database. Discriminating the origin of structures is crucial for distinguishing between experimentally resolved and computationally predicted structures, evaluating the reliability of prediction methods, and guiding downstream biological studies. Building on work in structure prediction, we developed a structure-sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro), to represent and discriminate the origin of protein structures. CPE-Pro learns the structural information of proteins and captures inter-structural differences to achieve accurate traceability on four data classes, and it is expected to extend to more. Simultaneously, we utilized Foldseek to encode protein structures into "structure-sequences" and trained a protein Structural Sequence Language Model, SSLM. Preliminary experiments demonstrated that, compared to large-scale protein language models pre-trained on vast amounts of amino acid sequences, the "structure-sequence" enables the language model to learn more informative protein features, enhancing and optimizing structural representations. We have provided the code, model weights, and all related materials on https://github.com/GouWenrui/CPE-Pro-main.git.
Authors: Yuncheng Yuan, P\'eter Scheepers, Lydia Tasiou, Yunus Can G\"ultekin, Federico Corradi, Alex Alvarado
Abstract: This paper analyzes the design and competitiveness of four neural network (NN) architectures recently proposed as decoders for forward error correction (FEC) codes. We first consider the so-called single-label neural network (SLNN) and the multi-label neural network (MLNN) decoders which have been reported to achieve near maximum likelihood (ML) performance. Here, we show analytically that SLNN and MLNN decoders can always achieve ML performance, regardless of the code dimensions -- although at the cost of computational complexity -- and no training is in fact required. We then turn our attention to two transformer-based decoders: the error correction code transformer (ECCT) and the cross-attention message passing transformer (CrossMPT). We compare their performance against traditional decoders, and show that ordered statistics decoding outperforms these transformer-based decoders. The results in this paper cast serious doubts on the application of NN-based FEC decoders in the short and medium block length regime.
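To make the ML baseline concrete, a minimal sketch of brute-force maximum-likelihood decoding for a small binary linear code under BPSK over AWGN, the performance the SLNN/MLNN decoders are shown to attain; the generator matrix below is one standard systematic form of the (7,4) Hamming code, used here only as an example.

```python
import numpy as np
from itertools import product

def all_codewords(G):
    """Enumerate all codewords of a binary linear code with generator matrix G (k x n)."""
    k = G.shape[0]
    msgs = np.array(list(product([0, 1], repeat=k)))
    return msgs @ G % 2

def ml_decode(y, codebook):
    """ML decoding of a BPSK-modulated codeword over AWGN: pick the codeword whose
    BPSK image is closest in Euclidean distance to the received vector y."""
    bpsk = 1 - 2 * codebook                        # bit 0 -> +1, bit 1 -> -1
    dists = np.sum((y[None, :] - bpsk) ** 2, axis=1)
    return codebook[np.argmin(dists)]

# Example: (7,4) Hamming code in systematic form [I | P].
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
codebook = all_codewords(G)
```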
Authors: Samrajya Thapa, Koushik Howlader, Subhankar Bhattacharjee, Wei le
Abstract: In this paper, we introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs), and radiology/cardiology reports. Our approach leverages transformers to encode these diverse modalities into a unified representation space, aiming to enhance diagnostic accuracy and facilitate comprehensive patient assessments. We utilize LoRA-Peft to significantly reduce trainable parameters in the LLM and incorporate a recent linear attention-dropping strategy in the Vision Transformer (ViT) for smoother attention. Furthermore, we provide novel multimodal attention explanations and retrieval for our model. To the best of our knowledge, we are the first to propose an integrated model that combines X-ray, ECG, and Radiology/Cardiology Report with this approach. By utilizing contrastive loss, MoRE effectively aligns modality-specific features into a coherent embedding, which supports various downstream tasks such as zero-shot classification and multimodal retrieval. Employing our proposed methodology, we achieve state-of-the-art (SOTA) on the Mimic-IV, CheXpert, Edema Severity, and PtbXl downstream datasets, surpassing existing multimodal approaches. Our proposed framework shows significant improvements in capturing intricate inter-modal relationships and its robustness in medical diagnosis, establishing a framework for future research in multimodal learning in the healthcare sector.
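As background for the contrastive alignment mentioned above, a minimal sketch of a symmetric InfoNCE (CLIP-style) loss between paired embeddings from two modalities; the actual framework aligns three modalities (X-ray, ECG, report) and uses its own architecture, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between paired embeddings from two modalities;
    positives are the matching pairs on the diagonal of the similarity matrix."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.shape[0], device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```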
Authors: Mengdi Zhang, Kai Kiat Goh, Peixin Zhang, Jun Sun
Abstract: Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's `brain' behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
Authors: Gholamali Aminian, {\L}ukasz Szpruch, Samuel N. Cohen
Abstract: We propose a novel framework for exploring generalization errors of transfer learning through the lens of differential calculus on the space of probability measures. In particular, we consider two main transfer learning scenarios, $\alpha$-ERM and fine-tuning with the KL-regularized empirical risk minimization and establish generic conditions under which the generalization error and the population risk convergence rates for these scenarios are studied. Based on our theoretical results, we show the benefits of transfer learning with a one-hidden-layer neural network in the mean-field regime under some suitable integrability and regularity assumptions on the loss and activation functions.