Authors: Cuong V. Nguyen, Hieu Minh Duong, Cuong D. Do
We introduce MELEP, which stands for Multi-label Expected Log of Empirical Predictions, a novel measure to estimate how effective it is to transfer knowledge from a pre-trained model to a downstream task in a multi-label setting. The measure is generic enough to work with new target data having a different label set from the source data. It is also computationally efficient, requiring only a single forward pass of the downstream dataset through the pre-trained model. To the best of our knowledge, we are the first to develop such a transferability metric for multi-label ECG classification problems. Our experiments show that MELEP can predict the performance of pre-trained convolutional and recurrent deep neural networks on small and imbalanced ECG data. Specifically, strong correlation coefficients, with absolute values exceeding 0.6 in most cases, were observed between MELEP and the actual average F1 scores of the fine-tuned models.
Authors: Ryohei Fukuma, Kei Majima, Yoshinobu Kawahara, Okito Yamashita, Yoshiyuki Shiraishi, Haruhiko Kishima, Takufumi Yanagisawa
Dynamic mode (DM) decomposition decomposes spatiotemporal signals into basic oscillatory components (DMs). DMs can improve the accuracy of neural decoding when used with the nonlinear Grassmann kernel, compared to conventional power features. However, such kernel-based machine learning algorithms have three limitations: large computational time preventing real-time application, incompatibility with non-kernel algorithms, and low interpretability. Here, we propose a mapping function corresponding to the Grassmann kernel that explicitly transforms DMs into spatial DM (sDM) features, which can be used in any machine learning algorithm. Using electrocorticographic signals recorded during various movement and visual perception tasks, the sDM features were shown to improve the decoding accuracy and computational time compared to conventional methods. Furthermore, the components of the sDM features informative for decoding showed similar characteristics to the high-$\gamma$ power of the signals, but with higher trial-to-trial reproducibility. The proposed sDM features enable fast, accurate, and interpretable neural decoding.
Authors: Mackenzie Wallich, Kenneth Lai, Svetlana Yanushkevich
Accelerometry has been extensively studied as an objective means of measuring upper limb function in patients post-stroke. The objective of this paper is to determine whether the accelerometry-derived measurements frequently used in more long-term rehabilitation studies can also be used to monitor and rapidly detect sudden changes in upper limb motor function in more recently hospitalized stroke patients. Six binary classification models were created by training on variable data window times of paretic upper limb accelerometer feature data. The models were assessed on their effectiveness for differentiating new input data into two classes: severe or moderately severe motor function. The classification models yielded Area Under the Curve (AUC) scores that ranged from 0.72 to 0.82 for 15-minute data windows to 0.77 to 0.94 for 120-minute data windows. These results served as a preliminary assessment and a basis on which to further investigate the efficacy of using accelerometry and machine learning to alert healthcare professionals to rapid changes in motor function in the days immediately following a stroke.
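For readers unfamiliar with the AUC metric reported above, the following minimal sketch computes it as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name, the toy labels and scores, and the choice of which class counts as positive are assumptions for illustration; they are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels (1 = positive class, an assumption).
    scores: iterable of classifier scores, higher meaning "more positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.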
Authors: Kamyar Zeinalipour, Marco Gori
The electrocardiogram (ECG) is a dependable instrument for assessing the function of the cardiovascular system. There has recently been much emphasis on precisely classifying ECGs. While ECG situations have numerous similarities, little attention has been paid to categorizing ECGs using graph neural networks. In this study, we propose three distinct techniques for classifying heartbeats using deep graph neural networks to classify ECG signals accurately. We suggest using different methods to extract topological features from the ECG signal and then using a type of graph neural network called the graph isomorphism network to classify the ECGs. We tested the three proposed techniques on the PTB Diagnostics data set. According to the findings, the three proposed techniques are capable of making arrhythmia classification predictions with accuracies of 99.38, 98.76, and 91.93 percent, respectively.
Authors: Amir Salimi, Sunil Vasu Kalmady, Abram Hindle, Osmar Zaiane, Padma Kaul
In this work we search for best practices in pre-processing of Electrocardiogram (ECG) signals in order to train better classifiers for the diagnosis of heart conditions. State of the art machine learning algorithms have achieved remarkable results in classification of some heart conditions using ECG data, yet there appears to be no consensus on pre-processing best practices. Is this lack of consensus due to different conditions and architectures requiring different processing steps for optimal performance? Is it possible that state of the art deep-learning models have rendered pre-processing unnecessary? In this work we apply down-sampling, normalization, and filtering functions to 3 different multi-label ECG datasets and measure their effects on 3 different high-performing time-series classifiers. We find that sampling rates as low as 50Hz can yield comparable results to the commonly used 500Hz. This is significant as smaller sampling rates will result in smaller datasets and models, which require less time and resources to train. Additionally, despite their common usage, we found min-max normalization to be slightly detrimental overall, and band-passing to make no measurable difference. We found the blind approach to pre-processing of ECGs for multi-label classification to be ineffective, with the exception of sample rate reduction which reliably reduces computational resources, but does not increase accuracy.
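As a rough illustration of two of the pre-processing steps named above, the sketch below down-samples a signal by an integer factor (500 Hz to 50 Hz) and applies min-max normalization. The function names, the block-averaging scheme, and the synthetic sine-wave signal are assumptions made for illustration and are not the authors' pipeline; block averaging is only a crude stand-in for a proper anti-aliasing filter.

```python
import math


def downsample_mean(signal, factor):
    """Down-sample by an integer factor, averaging each block of `factor` samples.

    Assumption: block averaging serves as a crude low-pass step; a real
    pipeline would typically use a dedicated anti-aliasing filter.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    n_blocks = len(signal) // factor  # a trailing partial block is dropped
    return [
        sum(signal[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


def min_max_normalize(signal):
    """Rescale values linearly to [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0 for _ in signal]
    return [(x - lo) / (hi - lo) for x in signal]


# Hypothetical 1-second test signal sampled at 500 Hz: a 5 Hz sine wave.
fs_in, fs_out = 500, 50
ecg_like = [math.sin(2 * math.pi * 5 * t / fs_in) for t in range(fs_in)]

reduced = downsample_mean(ecg_like, fs_in // fs_out)  # 500 samples -> 50 samples
normalized = min_max_normalize(reduced)
```

The ten-fold reduction in samples is what drives the smaller dataset and model sizes mentioned in the abstract.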
Authors: Yamin Li, Ange Lou, Catie Chang
In modern neuroscience, functional magnetic resonance imaging (fMRI) has been a crucial and irreplaceable tool that provides a non-invasive window into the dynamics of whole-brain activity. Nevertheless, fMRI is limited by hemodynamic blurring as well as high cost, immobility, and incompatibility with metal implants. Electroencephalography (EEG) is complementary to fMRI and can directly record the cortical electrical activity at high temporal resolution, but has more limited spatial resolution and is unable to recover information about deep subcortical brain structures. The ability to obtain fMRI information from EEG would enable cost-effective imaging across a wider set of brain regions. Further, beyond augmenting the capabilities of EEG, cross-modality models would facilitate the interpretation of fMRI signals. However, as both EEG and fMRI are high-dimensional and prone to artifacts, it is currently challenging to model fMRI from EEG. To address this challenge, we propose a novel architecture that can predict fMRI signals directly from multi-channel EEG without explicit feature engineering. Our model achieves this by implementing a Sinusoidal Representation Network (SIREN) to learn frequency information in brain dynamics from EEG, which serves as the input to a subsequent encoder-decoder to effectively reconstruct the fMRI signal from a specific brain region. We evaluate our model using a simultaneous EEG-fMRI dataset with 8 subjects and investigate its potential for predicting subcortical fMRI signals. The present results reveal that our model outperforms a recent state-of-the-art model, and indicate the potential of leveraging periodic activation functions in deep neural networks to model functional neuroimaging data.
Authors: Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner
As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Evaluating how well LLMs follow developer-provided rules in the face of adversarial inputs typically requires manual review, which slows down monitoring and methods development. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 15 simple text scenarios in which the model is instructed to obey a set of rules in natural language while interacting with the human user. Each scenario has a concise evaluation program to determine whether the model has broken any rules in a conversation. Through manual exploration of model behavior in our scenarios, we identify 6 categories of attack strategies and collect two suites of test cases: one consisting of unique conversations from manual testing and one that systematically implements strategies from the 6 categories. Across various popular proprietary and open models such as GPT-4 and Llama 2, we find that all models are susceptible to a wide variety of adversarial hand-crafted user inputs, though GPT-4 is the best-performing model. Additionally, we evaluate open models under gradient-based attacks and find significant vulnerabilities. We propose RuLES as a challenging new setting for research into exploring and defending against both manual and automatic attacks on LLMs.
Authors: Ahmad Esmaeili, Zahra Ghorrati, Eric T. Matson
The rapid growth of wearable sensor technologies holds substantial promise for the field of personalized and context-aware Human Activity Recognition. Given the inherently decentralized nature of data sources within this domain, the utilization of multi-agent systems with their inherent decentralization capabilities presents an opportunity to facilitate the development of scalable, adaptable, and privacy-conscious methodologies. This paper introduces a collaborative distributed learning approach rooted in multi-agent principles, wherein individual users of sensor-equipped devices function as agents within a distributed network, collectively contributing to the comprehensive process of learning and classifying human activities. In this proposed methodology, not only is the privacy of activity monitoring data upheld for each individual, eliminating the need for an external server to oversee the learning process, but the system also exhibits the potential to surmount the limitations of conventional centralized models and adapt to the unique attributes of each user. The proposed approach has been empirically tested on two publicly accessible human activity recognition datasets, specifically PAMAP2 and HARTH, across varying settings. The provided empirical results conclusively highlight the efficacy of inter-individual collaborative learning when contrasted with centralized configurations, both in terms of local and global generalization.
Authors: Wei-Fu Tseng, Kai-Chun Chen, Zi-Hong Xiao, Yen-Huan Li
Online learning quantum states with the logarithmic loss (LL-OLQS) is a quantum generalization of online portfolio selection (OPS), a classic open problem in the field of online learning for over three decades. The problem also emerges in designing randomized optimization algorithms for maximum-likelihood quantum state tomography. Recently, Jezequel et al. (arXiv:2209.13932) proposed the VB-FTRL algorithm, the first nearly regret-optimal algorithm for OPS with moderate computational complexity. In this note, we generalize VB-FTRL for LL-OLQS. Let $d$ denote the dimension and $T$ the number of rounds. The generalized algorithm achieves a regret rate of $O ( d^2 \log ( d + T ) )$ for LL-OLQS. Each iteration of the algorithm consists of solving a semidefinite program that can be implemented in polynomial time by, e.g., cutting-plane methods. For comparison, the best-known regret rate for LL-OLQS is currently $O ( d^2 \log T )$, achieved by the exponential weight method. However, there is no explicit implementation available for the exponential weight method for LL-OLQS. To facilitate the generalization, we introduce the notion of VB-convexity. VB-convexity is a sufficient condition for the logarithmic barrier associated with any function to be convex and is of independent interest.
Authors: Farinaz Alamiyan-Harandi, Mersad Hassanjani, Pouria Ramazi
In human societies, people often incorporate fairness in their decisions and reciprocate by being kind to those who act kindly. They evaluate the kindness of others' actions not only by monitoring the outcomes but also by considering the intentions. This behavioral concept can be adapted to train cooperative agents in Multi-Agent Reinforcement Learning (MARL). We propose the KindMARL method, where agents' intentions are measured by counterfactual reasoning over the environmental impact of the actions that were available to the agents. More specifically, the current environment state is compared with the estimation of the current environment state provided that the agent had chosen another action. The difference between each agent's reward, as the outcome of its action, and that of its fellow, multiplied by the intention of the fellow, is then taken as the fellow's "kindness". If the result of each reward-comparison confirms the agent's superiority, it perceives the fellow's kindness and reduces its own reward. Experimental results in the Cleanup and Harvest environments show that training based on the KindMARL method enabled the agents to earn 89\% (resp. 37\%) and 44\% (resp. 43\%) more total rewards than training based on the Inequity Aversion and Social Influence methods. The effectiveness of KindMARL is further supported by experiments in a traffic light control problem.
Authors: Farinaz Alamiyan-Harandi, Pouria Ramazi
To promote cooperation and strengthen the individual impact on the collective outcome in social dilemmas, we propose the Environmental-impact Multi-Agent Reinforcement Learning (EMuReL) method where each agent estimates the "environmental impact" of every other agent, that is, the difference in the current environment state compared to the hypothetical environment in the absence of that other agent. Inspired by the Inequity Aversion model, the agent then compares its own reward with those of its fellows multiplied by their environmental impacts. If its reward exceeds the scaled reward of one of its fellows, the agent takes "social responsibility" toward that fellow by reducing its own reward. Therefore, the less influential an agent is in reaching the current state, the more social responsibility is taken by other agents. Experiments in the Cleanup (resp. Harvest) test environment demonstrate that agents trained based on EMuReL learn to cooperate more effectively and obtain $54\%$ ($39\%$) and $20\%$ ($44\%$) more total rewards while preserving the same cooperation levels compared to when they are trained based on the two state-of-the-art reward reshaping methods inequity aversion and social influence.
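The reward comparison described above can be sketched as follows. The abstract does not give the exact formula, so the penalty weight `alpha`, the averaging over fellows, and the function name `reshape_reward` are assumptions introduced here for illustration, loosely modeled on the advantageous-inequity term of Inequity Aversion.

```python
def reshape_reward(own_reward, fellow_rewards, fellow_impacts, alpha=0.05):
    """Reduce an agent's reward where it exceeds a fellow's impact-scaled reward.

    own_reward:      the agent's extrinsic reward at this step.
    fellow_rewards:  rewards of the other agents.
    fellow_impacts:  estimated environmental impact of each other agent.
    alpha:           hypothetical weight of the "social responsibility" penalty.
    """
    if len(fellow_rewards) != len(fellow_impacts):
        raise ValueError("one impact value is needed per fellow reward")
    if not fellow_rewards:
        return own_reward  # no fellows, nothing to compare against

    penalty = 0.0
    for reward, impact in zip(fellow_rewards, fellow_impacts):
        scaled = impact * reward                  # fellow's reward scaled by its impact
        penalty += max(own_reward - scaled, 0.0)  # only penalize when the agent is ahead
    return own_reward - alpha * penalty / len(fellow_rewards)


# Hypothetical numbers: the agent is ahead of the first fellow only.
shaped = reshape_reward(10.0, [4.0, 12.0], [0.5, 1.0], alpha=0.5)
```

In this sketch a fellow with a lower estimated impact has a smaller scaled reward, so the agent gives up more of its own reward toward it, which matches the statement that less influential agents attract more social responsibility.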
Authors: Li-Hsiang Shen, Kai-Ten Feng, Ta-Sung Lee, Yuan-Chun Lin, Shih-Cheng Lin, Chia-Chan Chang, Sheng-Fuh Chang
The demand for wireless data grows ever higher as sixth-generation (6G) technology evolves. The reconfigurable intelligent surface (RIS) is regarded as a promising 6G technique for extending service coverage, reducing power consumption, and enhancing spectral efficiency. In this article, we provide some fundamentals of RIS deployment from theoretical and hardware perspectives, as well as the utilization of artificial intelligence (AI) and machine learning. We developed an intelligent deployment of RIS (i-Dris) prototype, including dual-band auto-guided vehicle (AGV) assisted RISs associated with an mmWave base station (BS) and a receiver. The RISs are deployed on the AGV with configured incident/reflection angles, while both the mmWave BS and receiver are associated with an edge server monitoring downlink packets to obtain system throughput. We designed a federated multi-agent reinforcement learning scheme associated with several AGV-RIS agents and sub-agents per AGV-RIS consisting of the deployment of position, height, orientation and elevation angles. The experimental results present stationary measurements in different aspects and scenarios. The i-Dris can reach up to 980 Mbps transmission throughput under a bandwidth of 100 MHz with comparably low complexity as well as rapid deployment, which outperforms other existing works. Finally, we highlight some opportunities and future issues in leveraging RIS-empowered wireless communication networks.
Authors: Zhanting Zhou, Kejun Bi, Yuyanzhen Zhong, Chao Tang, Dongfen Li, Shi Ying, Ruijin Wang
The strength of a supply chain is an important measure of a country's or region's technical advancement and overall competitiveness. Establishing supply chain risk assessment models for effective management and mitigation of potential risks has become increasingly crucial. As the number of businesses grows, the important relationships become more complicated and difficult to measure. This emphasizes the need to extract relevant information from graph data. Previously, academics mostly employed knowledge inference to increase the visibility of links between nodes in the supply chain. However, they have not solved the data hunger problem of single node feature characteristics. We propose a hierarchical knowledge transferable graph neural network-based (HKTGNN) supply chain risk assessment model to address these issues. Our approach is based on current graph embedding methods for corporate investment risk assessment. We embed the supply chain network corresponding to individual goods in the supply chain using the graph embedding module, resulting in a directed homogeneous graph with just product nodes. This reduces the complicated supply chain network into a basic product network. It addresses difficulties using the domain difference knowledge transferable module based on centrality, which is motivated by the premise that supply chain feature characteristics may be biased in the actual world. Meanwhile, the feature complement and message passing will alleviate the data hunger problem, which is driven by domain differences. Our model outperforms the compared methods in experiments on a real-world supply chain dataset. We will give an equation to prove that our comparative experiment is both effective and fair.
Authors: Zhonghang Li, Lianghao Xia, Yong Xu, Chao Huang
In recent years, there has been a rapid development of spatio-temporal prediction techniques in response to the increasing demands of traffic management and travel planning. While advanced end-to-end models have achieved notable success in improving predictive performance, their integration and expansion pose significant challenges. This work aims to address these challenges by introducing a spatio-temporal pre-training framework that seamlessly integrates with downstream baselines and enhances their performance. The framework is built upon two key designs: (i) We propose a spatio-temporal mask autoencoder as a pre-training model for learning spatio-temporal dependencies. The model incorporates customized parameter learners and hierarchical spatial pattern encoding networks. These modules are specifically designed to capture spatio-temporal customized representations and intra- and inter-cluster region semantic relationships, which have often been neglected in existing approaches. (ii) We introduce an adaptive mask strategy as part of the pre-training mechanism. This strategy guides the mask autoencoder in learning robust spatio-temporal representations and facilitates the modeling of different relationships, ranging from intra-cluster to inter-cluster, in an easy-to-hard training manner. Extensive experiments conducted on representative benchmarks demonstrate the effectiveness of our proposed method. We have made our model implementation publicly available at https://github.com/HKUDS/GPT-ST.
Authors: Chenwei Tang, Wenqiang Zhou, Dong Wang, Caiyang Yu, Zhenan He, Jizhe Zhou, Shudong Huang, Yi Gao, Jianming Chen, Wentao Feng, Jiancheng Lv
The advent of Industry 4.0 has precipitated the incorporation of Artificial Intelligence (AI) methods within industrial contexts, aiming to realize intelligent manufacturing, operation as well as maintenance, also known as industrial intelligence. However, intricate industrial milieus, particularly those relating to energy exploration and production, frequently encompass data characterized by long-tailed class distribution, sample imbalance, and domain shift. These attributes pose noteworthy challenges to data-centric Deep Learning (DL) techniques, crucial for the realization of industrial intelligence. The present study centers on the intricate and distinctive industrial scenarios of Nuclear Power Generation (NPG), meticulously scrutinizing the application of DL techniques under the constraints of finite data samples. Initially, the paper expounds on potential employment scenarios for AI across the full life-cycle of NPG. Subsequently, we delve into an evaluative exposition of DL's advancement, grounded in the finite sample perspective. This encompasses aspects such as small-sample learning, few-shot learning, zero-shot learning, and open-set recognition, also referring to the unique data characteristics of NPG. The paper then proceeds to present two specific case studies. The first revolves around the automatic recognition of zirconium alloy metallography, while the second pertains to open-set recognition for signal diagnosis of machinery sensors. These cases, spanning the entirety of NPG's life-cycle, are accompanied by constructive outcomes and insightful deliberations. By exploring and applying DL methodologies within the constraints of finite sample availability, this paper not only furnishes a robust technical foundation but also introduces a fresh perspective toward the secure and efficient advancement and exploitation of this advanced energy source.
Authors: Sang-Hyun Je, Wontae Choi, Kwangjin Oh
The goal of knowledge graph completion (KGC) is to predict missing links in a KG using trained facts that are already known. Recently, pre-trained language model (PLM) based methods that utilize both textual and structural information have been emerging, but their performance lags behind state-of-the-art (SOTA) structure-based methods, or they lose their inductive inference capabilities in the process of fusing structure embeddings into the text encoder. In this paper, we propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning. We adopt entity anchors, and these anchors and the textual descriptions of KG elements are fed together into the PLM-based encoder to learn unified representations. In addition, the proposed method utilizes additional random negative samples, which can be reused in each mini-batch during contrastive learning, to learn generalized entity representations. We verify the effectiveness of our proposed method through various experiments and analyses. The experimental results on standard benchmarks widely used in the link prediction task show that the proposed model outperforms existing SOTA KGC models. In particular, our method shows the largest performance improvement on FB15K-237, which is competitive with the SOTA of structure-based KGC methods.
Authors: Chau Pham, Piotr Teterwak, Soren Nelson, Bryan A. Plummer
Most deep neural networks are trained under fixed network architectures and require retraining when the architecture changes. If expanding the network's size is needed, it is necessary to retrain from scratch, which is expensive. To avoid this, one can grow from a small network by adding random weights over time to gradually achieve the target network size. However, this naive approach falls short in practice as it brings too much noise to the growing process. Prior work tackled this issue by leveraging the already learned weights and training data for generating new weights through conducting a computationally expensive analysis step. In this paper, we introduce MixtureGrowth, a new approach to growing networks that circumvents the initialization overhead in prior work. Before growing, each layer in our model is generated with a linear combination of parameter templates. Newly grown layer weights are generated by using a new linear combination of existing templates for a layer. On one hand, these templates are already trained for the task, providing a strong initialization. On the other, the new coefficients provide flexibility for the added layer weights to learn something new. We show that our approach boosts top-1 accuracy over the state-of-the-art by 2-2.5% on CIFAR-100 and ImageNet datasets, while achieving comparable performance with fewer FLOPs to a larger network trained from scratch. Code is available at https://github.com/chaudatascience/mixturegrowth.
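The idea of generating layer weights as a linear combination of shared templates can be sketched in a few lines. The helper name `mix_templates`, the 2x2 template values, and the coefficient values are hypothetical; the abstract does not specify how coefficients for newly grown layers are initialized or trained, so the numbers below only illustrate the mechanics.

```python
def mix_templates(templates, coefficients):
    """Return sum_k coefficients[k] * templates[k] for equally shaped 2-D templates."""
    if len(templates) != len(coefficients):
        raise ValueError("one coefficient is needed per template")
    rows, cols = len(templates[0]), len(templates[0][0])
    weights = [[0.0] * cols for _ in range(rows)]
    for coeff, template in zip(coefficients, templates):
        for r in range(rows):
            for c in range(cols):
                weights[r][c] += coeff * template[r][c]
    return weights


# Two hypothetical parameter templates shared across layers.
templates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]

# An existing layer and a newly grown layer reuse the same templates
# but mix them with different coefficients.
existing_layer = mix_templates(templates, [0.5, 0.25])
grown_layer = mix_templates(templates, [-1.0, 0.5])
```

Because the grown layer draws on templates that have already been trained, it starts from informed weights, while its fresh coefficients leave room to learn something different from the existing layers.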
Authors: Ishan Pathak, Ishan Jha, Aditya Sadana, Basuraj Bhowmik
Structural Health Monitoring (SHM) is vital for evaluating structural condition, aiming to detect damage through sensor data analysis. It aligns with predictive maintenance in modern industry, minimizing downtime and costs by addressing potential structural issues. Various machine learning techniques have been used to extract valuable information from vibration data, often relying on prior structural knowledge. This research introduces an innovative approach to structural damage detection, utilizing a new Convolutional Neural Network (CNN) algorithm. In order to extract deep spatial features from time series data, CNNs are taught to recognize long-term temporal connections. This methodology combines spatial and temporal features, enhancing discrimination capabilities when compared to methods solely reliant on deep spatial features. Time series data are divided into two categories using the proposed neural network: undamaged and damaged. To validate its efficacy, the method's accuracy was tested using a benchmark dataset derived from a three-floor structure at Los Alamos National Laboratory (LANL). The outcomes show that the new CNN algorithm is very accurate in spotting structural degradation in the examined structure.
Authors: Saeed Razavikia, José Mairton Barros Da Silva Júnior, Carlo Fischione
In this work, we investigate federated edge learning over a fading multiple access channel. To alleviate the communication burden between the edge devices and the access point, we introduce a pioneering digital over-the-air computation strategy employing q-ary quadrature amplitude modulation, culminating in a low latency communication scheme. Indeed, we propose a new federated edge learning framework in which edge devices use digital modulation for over-the-air uplink transmission to the edge server while they have no access to the channel state information. Furthermore, we incorporate multiple antennas at the edge server to overcome the fading inherent in wireless communication. We analyze the number of antennas required to mitigate the fading impact effectively. We prove a non-asymptotic upper bound for the mean squared error for the proposed federated learning with digital over-the-air uplink transmissions under both noisy and fading conditions. Leveraging the derived upper bound, we characterize the convergence rate of the learning process of a non-convex loss function in terms of the mean square error of gradients due to the fading channel. Furthermore, we substantiate the theoretical assurances through numerical experiments concerning mean square error and the convergence efficacy of the digital federated edge learning framework. Notably, the results demonstrate that augmenting the number of antennas at the edge server and adopting higher-order modulations improve the model accuracy up to 60\%.
Authors: Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei Zhang, Si Qin, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
Recent advancements in Large Language Models (LLMs) have revolutionized decision-making by breaking down complex problems into more manageable language sequences referred to as ``thoughts''. An effective thought design should consider three key perspectives: performance, efficiency, and flexibility. However, existing thought paradigms can exhibit at most two of these attributes. To address these limitations, we introduce a novel thought prompting approach called ``Everything of Thoughts'' (XoT) to defy the law of the ``Penrose triangle'' of existing thought paradigms. XoT leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge into thoughts, thereby enhancing LLMs' capabilities and enabling them to generalize to unseen problems efficiently. Through the utilization of the MCTS-LLM collaborative thought revision framework, this approach autonomously produces high-quality comprehensive cognitive mappings with minimal LLM interactions. Additionally, XoT empowers LLMs to engage in unconstrained thinking, allowing for flexible cognitive mappings for problems with multiple solutions.
Authors: Shizhan Lu
Hesitant fuzzy sets are widely used in instances of uncertainty and hesitation. The inclusion relationship is an important and foundational definition for sets. The hesitant fuzzy set, as a kind of set, needs an explicit definition of the inclusion relationship. Based on the discrete form of the hesitant fuzzy membership degree, several kinds of inclusion relationships for hesitant fuzzy sets are proposed. Then some foundational propositions of hesitant fuzzy sets and families of hesitant fuzzy sets are presented. Finally, some foundational propositions of hesitant fuzzy information systems with respect to parameter reductions are put forward, and an example and an algorithm are given to illustrate the processes of parameter reductions.
Authors: D. Dhinakaran, S. Gopalakrishnan, M.D. Manigandan, T. P. Anish
In response to the burgeoning global demand for seafood and the challenges of managing fish farms, we introduce an innovative IoT based environmental control system that integrates sensor technology and advanced machine learning decision support. Deploying a network of wireless sensors within the fish farm, we continuously collect real-time data on crucial environmental parameters, including water temperature, pH levels, humidity, and fish behavior. This data undergoes meticulous preprocessing to ensure its reliability, including imputation, outlier detection, feature engineering, and synchronization. At the heart of our system are four distinct machine learning algorithms: Random Forests predict and optimize water temperature and pH levels for the fish, fostering their health and growth; Support Vector Machines (SVMs) function as an early warning system, promptly detecting diseases and parasites in fish; Gradient Boosting Machines (GBMs) dynamically fine-tune the feeding schedule based on real-time environmental conditions, promoting resource efficiency and fish productivity; Neural Networks manage the operation of critical equipment like water pumps and heaters to maintain the desired environmental conditions within the farm. These machine learning algorithms collaboratively make real-time decisions to ensure that the fish farm's environmental conditions align with predefined specifications, leading to improved fish health and productivity while simultaneously reducing resource wastage, thereby contributing to increased profitability and sustainability. This research article showcases the power of data-driven decision support in fish farming, promising to meet the growing demand for seafood while emphasizing environmental responsibility and economic viability, thus revolutionizing the future of fish farming.
Authors: Muntabir Hasan Choudhury, Lamia Salsabil, William A. Ingram, Edward A. Fox, Jian Wu
Electronic theses and dissertations (ETDs) have been proposed, advocated, and generated for more than 25 years. Although ETDs are hosted by commercial or institutional digital library repositories, they are still an understudied type of scholarly big data, partially because they are usually longer than conference proceedings and journals. Segmenting ETDs will allow researchers to study sectional content. Readers can navigate to particular pages of interest, discover, and explore the content buried in these long documents. Most existing frameworks on document page classification are designed for classifying general documents and perform poorly on ETDs. In this paper, we propose ETDPC. Its backbone is a two-stream multimodal model with a cross-attention network to classify ETD pages into 13 categories. To overcome the challenge of imbalanced labeled samples, we augmented data for minority categories and employed a hierarchical classifier. ETDPC outperforms the state-of-the-art models in all categories, achieving an F1 of 0.84 -- 0.96 for 9 out of 13 categories. We also demonstrated its data efficiency. The code and data can be found on GitHub (https://github.com/lamps-lab/ETDMiner/tree/master/etd_segmentation).
Authors: Lea M. Trenkwalder, Eleanor Scerri, Thomas E. O'Brien, Vedran Dunjko
Hamiltonian simulation is believed to be one of the first tasks where quantum computers can yield a quantum advantage. One of the most popular methods of Hamiltonian simulation is Trotterization, which makes use of the approximation $e^{i\sum_jA_j}\sim \prod_je^{iA_j}$ and higher-order corrections thereto. However, this leaves open the question of the order of operations (i.e. the order of the product over $j$, which is known to affect the quality of approximation). In some cases this order is fixed by the desire to minimise the error of approximation; when it is not the case, we propose that the order can be chosen to optimize compilation to a native quantum architecture. This presents a new compilation problem -- order-agnostic quantum circuit compilation -- which we prove is NP-hard in the worst case. In lieu of an easily-computable exact solution, we turn to methods of heuristic optimization of compilation. We focus on reinforcement learning due to the sequential nature of the compilation task, comparing it to simulated annealing and Monte Carlo tree search. While two of the methods outperform a naive heuristic, reinforcement learning clearly outperforms all others, with a gain of around 12% with respect to the second-best method and of around 50% compared to the naive heuristic in terms of the gate count. We further test the ability of RL to generalize across instances of the compilation problem, and find that a single learner is able to solve entire problem families. This demonstrates the ability of machine learning techniques to provide assistance in an order-agnostic quantum compilation task.
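To make the ordering question concrete, here is a minimal numerical sketch, not the authors' compiler. It assumes three arbitrarily scaled single-qubit Pauli terms and two hypothetical helpers, `expi` and `trotter_step`, and compares one first-order Trotter step in two different term orders against the exact exponential.

```python
import numpy as np

def expi(h):
    """Return exp(i*h) for a Hermitian matrix h via its eigendecomposition."""
    w, v = np.linalg.eigh(h)
    return (v * np.exp(1j * w)) @ v.conj().T

# Illustrative, non-commuting Hermitian terms A_j (coefficients are arbitrary).
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
terms = [0.3 * X, 0.5 * Y, 0.7 * Z]

def trotter_step(terms, order):
    """First-order product prod_j exp(i*A_j), multiplied in the given order."""
    u = np.eye(2, dtype=complex)
    for j in order:
        u = u @ expi(terms[j])
    return u

exact = expi(sum(terms))              # exp(i * sum_j A_j)
u_a = trotter_step(terms, [0, 1, 2])  # one ordering of the product
u_b = trotter_step(terms, [1, 0, 2])  # the same terms, reordered
err_a = np.linalg.norm(u_a - exact)
err_b = np.linalg.norm(u_b - exact)
print(f"error (order 0,1,2): {err_a:.4f}")
print(f"error (order 1,0,2): {err_b:.4f}")
print("orderings give the same unitary:", np.allclose(u_a, u_b))
```

Both orderings approximate the same target but produce different unitaries. That is the freedom an order-agnostic compiler could exploit when the order is not fixed by error considerations.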
Authors: Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Benita Teufel, Marco Bellagente, Minguk Kang, Taesung Park, Jure Leskovec, Jun-Yan Zhu, Li Fei-Fei, Jiajun Wu, Stefano Ermon, Percy Liang
The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase.
Authors: Tara Akhound-Sadegh, Laurence Perreault-Levasseur, Johannes Brandstetter, Max Welling, Siamak Ravanbakhsh
Symmetries have been leveraged to improve the generalization of neural networks through different mechanisms from data augmentation to equivariant architectures. However, despite their potential, their integration into neural solvers for partial differential equations (PDEs) remains largely unexplored. We explore the integration of PDE symmetries, known as Lie point symmetries, in a major family of neural solvers known as physics-informed neural networks (PINNs). We propose a loss function that informs the network about Lie point symmetries in the same way that PINN models try to enforce the underlying PDE through a loss function. Intuitively, our symmetry loss ensures that the infinitesimal generators of the Lie group conserve the PDE solutions. Effectively, this means that once the network learns a solution, it also learns the neighbouring solutions generated by Lie point symmetries. Empirical evaluations indicate that the inductive bias introduced by the Lie point symmetries of the PDEs greatly boosts the sample efficiency of PINNs.
Authors: Amritpal Singh, Mustafa Burak Gurbuz, Shiva Souhith Gantha, Prahlad Jasti
Healthcare clinics regularly encounter dynamic data that changes due to variations in patient populations, treatment policies, medical devices, and emerging disease patterns. Deep learning models can suffer from catastrophic forgetting when fine-tuned in such scenarios, causing poor performance on previously learned tasks. Continual learning allows learning on new tasks without performance drop on previous tasks. In this work, we investigate the performance of continual learning models on four different medical imaging scenarios involving ten classification datasets from diverse modalities, clinical specialties, and hospitals. We implement various continual learning approaches and evaluate their performance in these scenarios. Our results demonstrate that a single model can sequentially learn new tasks from different specialties and achieve comparable performance to naive methods. These findings indicate the feasibility of recycling or sharing models across the same or different medical specialties, offering another step towards the development of general-purpose medical imaging AI that can be shared across institutions.
Authors: Surajsinh Parmar, Tao Shan, San Lee, Yonghwan Kim, Jang Yong Kim
Sepsis requires urgent diagnosis, but research is predominantly focused on Western datasets. In this study, we perform a comparative analysis of two ensemble learning methods, LightGBM and XGBoost, using the public eICU-CRD dataset and a private dataset from St. Mary's Hospital in South Korea. Our analysis reveals the effectiveness of these methods in addressing healthcare data imbalance and enhancing sepsis detection. Specifically, LightGBM shows a slight edge in computational efficiency and scalability. The study paves the way for the broader application of machine learning in critical care, thereby expanding the reach of predictive analytics in healthcare globally.
Authors: Amirhossein Afsharrad, Ahmadreza Moradipari, Sanjay Lall
Recently, bandit optimization has received significant attention in real-world safety-critical systems that involve repeated interactions with humans. While there exist various algorithms with performance guarantees in the literature, practical implementation of the algorithms has not received as much attention. This work presents a comprehensive study on the computational aspects of safe bandit algorithms, specifically safe linear bandits, by introducing a framework that leverages convex programming tools to create computationally efficient policies. In particular, we first characterize the properties of the optimal policy for the safe linear bandit problem and then propose an end-to-end pipeline of safe linear bandit algorithms that only involves solving convex problems. We also numerically evaluate the performance of our proposed methods.
Authors: Shahan Nercessian, Johannes Imort
We introduce the text-to-instrument task, which aims at generating sample-based musical instruments based on textual prompts. Accordingly, we propose InstrumentGen, a model that extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across an 88-key spectrum), velocity, and a joint text/audio embedding. Furthermore, we present a differentiable loss function to evaluate the intra-instrument timbral consistency of sample-based instruments. Our results establish a foundational text-to-instrument baseline, extending research in the domain of automatic sample-based instrument generation.
Authors: Mahdi Ghafourian, Julian Fierrez, Ruben Vera-Rodriguez, Ruben Tolosana, Aythami Morales
Federated Learning (FL) is a machine learning paradigm to conduct collaborative learning among clients on a joint model. The primary goal is to share clients' local training parameters with an integrating server while preserving their privacy. This method makes it possible to exploit the potential of massive mobile users' data for the benefit of machine learning models' performance while keeping sensitive data on local devices. On the downside, FL raises security and privacy concerns that have only recently begun to be studied. To address some of the key threats in FL, researchers have proposed using secure aggregation methods (e.g. homomorphic encryption, secure multiparty computation, etc.). These solutions improve some security and privacy metrics, but at the same time bring about other serious threats such as poisoning attacks, backdoor attacks, and free running attacks. This paper proposes a new defense method against poisoning attacks in FL called SaFL (Sybil-aware Federated Learning) that minimizes the effect of sybils with a novel time-variant aggregation scheme.
Authors: Su Wang, Roberto Morabito, Seyyedali Hosseinalipour, Mung Chiang, Christopher G. Brinton
The conventional federated learning (FedL) architecture distributes machine learning (ML) across worker devices by having them train local models that are periodically aggregated by a server. FedL ignores two important characteristics of contemporary wireless networks, however: (i) the network may contain heterogeneous communication/computation resources, and (ii) there may be significant overlaps in devices' local data distributions. In this work, we develop a novel optimization methodology that jointly accounts for these factors via intelligent device sampling complemented by device-to-device (D2D) offloading. Our optimization methodology aims to select the best combination of sampled nodes and data offloading configuration to maximize FedL training accuracy while minimizing data processing and D2D communication resource consumption subject to realistic constraints on the network topology and device capabilities. Theoretical analysis of the D2D offloading subproblem leads to new FedL convergence bounds and an efficient sequential convex optimizer. Using these results, we develop a sampling methodology based on graph convolutional networks (GCNs) which learns the relationship between network attributes, sampled nodes, and D2D data offloading to maximize FedL accuracy. Through evaluation on popular datasets and real-world network measurements from our edge testbed, we find that our methodology outperforms popular device sampling methodologies from literature in terms of ML model performance, data processing overhead, and energy consumption.
Authors: Amaal F. Alshmarni, Mohammed A. Alliheedi
In the modern era, malware is experiencing a significant increase in both its variety and quantity, aligning with the widespread adoption of the digital world. This surge in malware has emerged as a critical challenge in the realm of cybersecurity, prompting numerous research endeavors and contributions to address the issue. Machine learning algorithms have been leveraged for malware detection due to their ability to uncover concealed patterns within vast datasets. However, deep learning algorithms, characterized by their multi-layered structure, surpass the limitations of traditional machine learning approaches. By employing deep learning techniques such as CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network), this study aims to classify and identify malware extracted from a dataset containing API call sequences. The performance of these algorithms is compared with that of conventional machine learning methods, including SVM (Support Vector Machine), RF (Random Forest), KNN (K-Nearest Neighbors), XGB (Extreme Gradient Boosting), and GBC (Gradient Boosting Classifier), all using the same dataset. The outcomes of this research demonstrate that both deep learning and machine learning algorithms achieve remarkably high levels of accuracy, reaching up to 99% in certain cases.
Authors: Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, Boaz Barak
Watermarking generative models consists of planting a statistical signal (watermark) in a model's output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes. We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used. Our attack is based on two assumptions: (1) The attacker has access to a "quality oracle" that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a "perturbation oracle" which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.
Authors: Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, Hao Tan
We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data empowers our model to be highly generalizable and produce high-quality 3D reconstructions from various testing inputs including real-world in-the-wild captures and images from generative models. Video demos and interactable 3D meshes can be found on this website: https://yiconghong.me/LRM/.
Authors: Nicolas Emmenegger, Mojmír Mutný, Andreas Krause
Certifiable, adaptive uncertainty estimates for unknown quantities are an essential ingredient of sequential decision-making algorithms. Standard approaches rely on problem-dependent concentration results and are limited to a specific combination of parameterization, noise family, and estimator. In this paper, we revisit the likelihood-based inference principle and propose to use likelihood ratios to construct any-time valid confidence sequences without requiring specialized treatment in each application scenario. Our method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on a choice of estimator sequence in the likelihood ratio. We discuss how to provably choose the best sequence of estimators and shed light on connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also opens up deployment in non-parametric settings such as RKHS function classes. We provide a non-asymptotic analysis of the size of the likelihood ratio confidence sets for generalized linear models, using insights from convex duality and online learning. We showcase the practical strength of our method on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.
Authors: Hongwu Peng, Caiwen Ding, Tong Geng, Sutanay Choudhury, Kevin Barker, Ang Li
The relentless advancement of artificial intelligence (AI) and machine learning (ML) applications necessitates the development of specialized hardware accelerators capable of handling the increasing complexity and computational demands. Traditional computing architectures, based on the von Neumann model, are being outstripped by the requirements of contemporary AI/ML algorithms, leading to a surge in the creation of accelerators like the Graphcore Intelligence Processing Unit (IPU), Sambanova Reconfigurable Dataflow Unit (RDU), and enhanced GPU platforms. These hardware accelerators are characterized by their innovative data-flow architectures and other design optimizations that promise to deliver superior performance and energy efficiency for AI/ML tasks.
This research provides a preliminary evaluation and comparison of these commercial AI/ML accelerators, delving into their hardware and software design features to discern their strengths and unique capabilities. By conducting a series of benchmark evaluations on common DNN operators and other AI/ML workloads, we aim to illuminate the advantages of data-flow architectures over conventional processor designs and offer insights into the performance trade-offs of each platform. The findings from our study will serve as a valuable reference for the design and performance expectations of research prototypes, thereby facilitating the development of next-generation hardware accelerators tailored for the ever-evolving landscape of AI/ML applications. Through this analysis, we aspire to contribute to the broader understanding of current accelerator technologies and to provide guidance for future innovations in the field.
Authors: Xiang Zhou, Yichen Jiang, Mohit Bansal
Recent diagnostic datasets on compositional generalization, such as SCAN (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems in models trained from scratch on these datasets. However, in contrast to this poor performance, state-of-the-art models trained on larger and more general datasets show better generalization ability. In this work, to reconcile this inconsistency, we conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors, including dataset scale, pattern complexity, example difficulty, etc. First, we show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges. To further understand this improvement, we show two axes of the benefit from more complex datasets: they provide more diverse examples so compositional understanding becomes more effective, and they also prevent ungeneralizable memorization of the examples due to reduced example repetition frequency. Finally, we explore how training examples of different difficulty levels influence generalization differently. On synthetic datasets, simple examples invoke stronger compositionality than hard examples do. On larger-scale real language datasets, while hard examples become more important potentially to ensure decent data coverage, a balanced mixture of simple and hard examples manages to induce the strongest generalizability. The code and data for this work are available at https://github.com/owenzx/data4comp
Authors: Wenchong He, Zhe Jiang, Tingsong Xiao, Zelin Xu, Shigang Chen, Ronald Fick, Miles Medina, Christine Angelini
Transformers are widely used deep learning architectures. Existing transformers are mostly designed for sequences (texts or time series), images or videos, and graphs. This paper proposes a novel transformer model for massive (up to a million) point samples in continuous space. Such data are ubiquitous in environmental sciences (e.g., sensor observations), numerical simulations (e.g., particle-laden flow, astrophysics), and location-based services (e.g., POIs and trajectories). However, designing a transformer for massive spatial points is non-trivial due to several challenges, including implicit long-range and multi-scale dependency on irregular points in continuous space, a non-uniform point distribution, the potential high computational costs of calculating all-pair attention across massive points, and the risks of over-confident predictions due to varying point density. To address these challenges, we propose a new hierarchical spatial transformer model, which includes multi-resolution representation learning within a quad-tree hierarchy and efficient spatial attention via coarse approximation. We also design an uncertainty quantification branch to estimate prediction confidence related to input feature noise and point sparsity. We provide a theoretical analysis of computational time complexity and memory costs. Extensive experiments on both real-world and synthetic datasets show that our method outperforms multiple baselines in prediction accuracy and our model can scale up to one million points on one NVIDIA A100 GPU. The code is available at https://github.com/spatialdatasciencegroup/HST.
Authors: Feng Xie, Xin Song, Xiang Zeng, Xuechen Zhao, Lei Tian, Bin Zhou, Yusong Tan
Semi-supervised entity alignment (EA) is a practical and challenging task because of the lack of adequate labeled mappings as training data. Most works address this problem by generating pseudo mappings for unlabeled entities. However, they either suffer from erroneous (noisy) pseudo mappings or largely ignore the uncertainty of pseudo mappings. In this paper, we propose a novel semi-supervised EA method, termed MixTEA, which guides the model learning with an end-to-end mixture teaching of manually labeled mappings and probabilistic pseudo mappings. We first train a student model using the few labeled mappings as standard. More importantly, in pseudo mapping learning, we propose a bi-directional voting (BDV) strategy that fuses the alignment decisions in different directions to estimate the uncertainty via the joint matching confidence score. Meanwhile, we also design a matching diversity-based rectification (MDR) module to adjust the pseudo mapping learning, thus reducing the negative influence of noisy mappings. Extensive results on benchmark datasets as well as further analyses demonstrate the superiority and the effectiveness of our proposed method.
Authors: Jishnu Ray Chowdhury, Cornelia Caragea
Binary Balanced Tree RvNNs (BBT-RvNNs) enforce sequence composition according to a preset balanced binary tree structure. Thus, their non-linear recursion depth is just $\log_2 n$ ($n$ being the sequence length). Such logarithmic scaling makes BBT-RvNNs efficient and scalable on long sequence tasks such as Long Range Arena (LRA). However, such computational efficiency comes at a cost because BBT-RvNNs cannot solve simple arithmetic tasks like ListOps. On the flip side, RvNNs (e.g., Beam Tree RvNN) that do succeed on ListOps (and other structure-sensitive tasks like formal logical inference) are generally several times more expensive than even RNNs. In this paper, we introduce a novel framework, Recursion in Recursion (RIR), to strike a balance between the two sides, getting some of the benefits of both worlds. In RIR, we use a form of two-level nested recursion, where the outer recursion is a $k$-ary balanced tree model with another recursive model (inner recursion) implementing its cell function. For the inner recursion, we choose Beam Tree RvNNs (BT-RvNN). To adjust BT-RvNNs within RIR we also propose a novel strategy of beam alignment. Overall, this entails that the total recursive depth in RIR is upper-bounded by $k \log_k n$. Our best RIR-based model is the first model that demonstrates high ($\geq 90\%$) length-generalization performance on ListOps while at the same time being scalable enough to be trainable on long sequence inputs from LRA. Moreover, in terms of accuracy in the LRA language tasks, it performs competitively with Structured State Space Models (SSMs) without any special initialization, outperforming Transformers by a large margin. On the other hand, while SSMs can marginally outperform RIR on LRA, they (SSMs) fail to length-generalize on ListOps. Our code is available at: https://github.com/JRC1995/BeamRecursionFamily/.
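The depth figures quoted in this abstract can be compared with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the integer ceilings are an assumption added here for sequence lengths that are not exact powers of the base (the abstract states only $\log_2 n$ and the bound $k \log_k n$).

```python
def ceil_log(n, base):
    """Smallest d with base**d >= n, using integers to avoid float rounding."""
    depth, span = 0, 1
    while span < n:
        span *= base
        depth += 1
    return depth

def bbt_depth(n):
    """Recursion depth of a balanced binary tree over n items (log2 n)."""
    return ceil_log(n, 2)

def rir_depth_bound(n, k):
    """Upper bound k * log_k(n) quoted for the nested RIR recursion."""
    return k * ceil_log(n, k)

for n in (1024, 4096):
    print(
        f"n={n}: balanced binary tree {bbt_depth(n)}, "
        f"RIR bound k=4 {rir_depth_bound(n, 4)}, "
        f"RIR bound k=8 {rir_depth_bound(n, 8)}, "
        f"fully sequential {n}"
    )
```

For these lengths the RIR bound sits between the logarithmic depth of the balanced binary tree and the linear depth of a fully sequential model, which is the trade-off the abstract describes.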
Authors: Vardhan Dongre, Gurpreet Singh Hora
The accessibility of spatially distributed data, enabled by affordable sensors, field, and numerical experiments, has facilitated the development of data-driven solutions for scientific problems, including climate change, weather prediction, and urban planning. Neural Partial Differential Equations (Neural PDEs), which combine deep learning (DL) techniques with domain expertise (e.g., governing equations) for parameterization, have proven to be effective in capturing valuable correlations within spatiotemporal datasets. However, sparse and noisy measurements coupled with modeling approximation introduce aleatoric and epistemic uncertainties. Therefore, quantifying uncertainties propagated from model inputs to outputs remains a challenge and an essential goal for establishing the trustworthiness of Neural PDEs. This work evaluates various Uncertainty Quantification (UQ) approaches for both Forward and Inverse Problems in scientific applications. Specifically, we investigate the effectiveness of Bayesian methods, such as Hamiltonian Monte Carlo (HMC) and Monte-Carlo Dropout (MCD), and a more conventional approach, Deep Ensembles (DE). To illustrate their performance, we take two canonical PDEs: Burgers' equation and the Navier-Stokes equation. Our results indicate that Neural PDEs can effectively reconstruct flow systems and predict the associated unknown parameters. However, the results derived from Bayesian methods, based on our observations, tend to display a higher degree of certainty in their predictions than those obtained using the DE. This elevated certainty suggests that Bayesian techniques might underestimate the true underlying uncertainty, thereby appearing more confident in their predictions than the DE approach.
Authors: Shikai Fang, Madison Cooley, Da Long, Shibo Li, Robert Kirby, Shandian Zhe
Machine learning based solvers have garnered much attention in physical simulation and scientific computing, a prominent example being physics-informed neural networks (PINNs). However, PINNs often struggle to solve high-frequency and multi-scale PDEs, which can be due to spectral bias during neural network training. To address this problem, we resort to the Gaussian process (GP) framework. To flexibly capture the dominant frequencies, we model the power spectrum of the PDE solution with a Student's t mixture or Gaussian mixture. We then apply the inverse Fourier transform to obtain the covariance function (according to the Wiener-Khinchin theorem). The covariance derived from the Gaussian mixture spectrum corresponds to the known spectral mixture kernel. We are the first to discover its rationale and effectiveness for PDE solving. Next, we estimate the mixture weights in the log domain, which we show is equivalent to placing a Jeffreys prior. It automatically induces sparsity, prunes excessive frequencies, and adjusts the remaining ones toward the ground truth. Third, to enable efficient and scalable computation on massive collocation points, which are critical to capture high frequencies, we place the collocation points on a grid, and multiply our covariance functions across the input dimensions. We use the GP conditional mean to predict the solution and its derivatives so as to fit the boundary condition and the equation itself. As a result, we can derive a Kronecker product structure in the covariance matrix. We use Kronecker product properties and multilinear algebra to greatly improve computational efficiency and scalability, without any low-rank approximations. We show the advantage of our method in systematic experiments.
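The Kronecker structure mentioned in this abstract can be checked numerically on a small grid. The following is a hedged sketch rather than the authors' solver. It assumes a one-component spectral mixture kernel per input dimension with arbitrary parameter values, and the names `sm_kernel`, `K1`, `K2`, and `K_full` are illustrative.

```python
import numpy as np

def sm_kernel(tau, weight=1.0, mean=2.0, var=0.5):
    """One-component spectral mixture kernel, i.e. the inverse Fourier
    transform of a symmetrized Gaussian spectral density (mean, var)."""
    return weight * np.exp(-2.0 * np.pi**2 * tau**2 * var) * np.cos(2.0 * np.pi * tau * mean)

# A small 4 x 5 grid of collocation points (sizes chosen for illustration).
x = np.linspace(0.0, 1.0, 4)
y = np.linspace(0.0, 1.0, 5)

# Per-dimension covariance matrices.
K1 = sm_kernel(x[:, None] - x[None, :])
K2 = sm_kernel(y[:, None] - y[None, :], mean=3.0)

# Full covariance over all grid points, using the product of the 1-D kernels.
gx, gy = np.meshgrid(x, y, indexing="ij")
px, py = gx.ravel(), gy.ravel()
K_full = sm_kernel(px[:, None] - px[None, :]) * sm_kernel(py[:, None] - py[None, :], mean=3.0)

# Matrix-vector product two ways: dense, and via the Kronecker identity
# (K1 kron K2) vec(V) = vec(K1 V K2^T) for row-major vec.
rng = np.random.default_rng(0)
V = rng.standard_normal((x.size, y.size))
slow = K_full @ V.ravel()
fast = (K1 @ V @ K2.T).ravel()

print("K_full equals kron(K1, K2):", np.allclose(K_full, np.kron(K1, K2)))
print("fast product matches dense product:", np.allclose(slow, fast))
```

The second form never builds the full grid-sized matrix, which is why a product kernel on a grid scales to many collocation points without low-rank approximation.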
Authors: Sakshi Mishra, Praveen Palanisamy
Developing aerial robots that can both safely navigate and execute assigned missions without any human intervention, i.e., fully autonomous aerial mobility of passengers and goods, is the larger vision that guides the research, design, and development efforts in the aerial autonomy space. However, it is highly challenging to concurrently operationalize all types of aerial vehicles that operate fully autonomously while sharing the airspace. Full autonomy of the aerial transportation sector includes several aspects, such as the design of the technology that powers the vehicles, the operation of multi-agent fleets, and a certification process that meets the stringent safety requirements of the aviation sector. As a result, Autonomous Advanced Aerial Mobility is still a vague term and its consequences for researchers and professionals are ambiguous. To address this gap, we present a comprehensive perspective on the emerging field of autonomous advanced aerial mobility, which involves the use of unmanned aerial vehicles (UAVs) and electric vertical takeoff and landing (eVTOL) aircraft for various applications, such as urban air mobility, package delivery, and surveillance. The article proposes a scalable and extensible autonomy framework consisting of four main blocks: sensing, perception, planning, and controls. Furthermore, the article discusses the challenges and opportunities in multi-agent fleet operations and management, as well as the testing, validation, and certification aspects of autonomous aerial systems. Finally, the article explores the potential of monolithic models for aerial autonomy and analyzes their advantages and limitations. The perspective aims to provide a holistic picture of the autonomous advanced aerial mobility field and its future directions.
Authors: Wenbo Zhu, Tiechuan Hu
In this paper, we look at a database of tweets sorted by various keywords that could indicate the users' sentiment towards COVID vaccines. With social media becoming such a prevalent source of opinion, sorting and ranking tweets that hold important information, such as opinions on COVID vaccines, is of utmost importance. Two different ranking scales were used. Ranking a tweet in this way could represent the difference between an opinion being lost and an opinion being featured on the site, which affects the decisions and behavior of people and is why researchers were interested in it. Using natural language processing techniques, our aim is to determine and categorize opinions about COVID vaccines with the highest accuracy possible.
Authors: Cheng-Yu Chuang, Pooyan Fazli
Video description entails automatically generating coherent natural language sentences that narrate the content of a given video. We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task. In particular, we investigate two curriculum strategies: (1) progressively exposing the model to more challenging samples by gradually applying a Gaussian noise to the video data, and (2) gradually reducing the capacity of the network through dropout during the training process. These methods enable the model to learn more robust and generalizable features. Moreover, CLearViD leverages the Mish activation function, which provides non-linearity and non-monotonicity and helps alleviate the issue of vanishing gradients. Our extensive experiments and ablation studies demonstrate the effectiveness of the proposed model. The results on two datasets, namely ActivityNet Captions and YouCook2, show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics.
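The two ingredients named in the abstract above, the Mish activation and a noise-based curriculum, can be illustrated with a minimal sketch. Mish is the standard function x * tanh(softplus(x)); the linear noise schedule, the `max_std` value, and the function names `noise_std` and `add_gaussian_noise` are assumptions made for illustration and are not taken from the CLearViD paper.

```python
import math
import random


def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    # Mish activation: x * tanh(softplus(x)). It is smooth and non-monotonic.
    return x * math.tanh(softplus(x))


def noise_std(step: int, total_steps: int, max_std: float = 0.5) -> float:
    # Hypothetical linear curriculum: the noise level grows from 0 to max_std,
    # so later training steps see harder (noisier) inputs.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return max_std * frac


def add_gaussian_noise(features, std, rng):
    # Perturb each feature value with zero-mean Gaussian noise.
    return [f + rng.gauss(0.0, std) for f in features]


if __name__ == "__main__":
    rng = random.Random(0)
    frame_features = [0.2, -1.0, 3.5]
    for step in (0, 50, 100):
        std = noise_std(step, total_steps=100)
        noisy = add_gaussian_noise(frame_features, std, rng)
        print(step, std, [round(mish(v), 4) for v in noisy])
```

The non-monotonicity shows up for negative inputs: `mish(-5.0)` is closer to zero than `mish(-1.0)`.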
Authors: Gulsen Taskin, Erchan Aptoula, Alp Ertürk
Deep learning has taken by storm all fields involved in data analysis, including remote sensing for Earth observation. However, despite significant advances in terms of performance, its lack of explainability and interpretability, inherent to neural networks in general since their inception, remains a major source of criticism. Hence it comes as no surprise that the expansion of deep learning methods in remote sensing is being accompanied by increasingly intensive efforts oriented towards addressing this drawback through the exploration of a wide spectrum of Explainable Artificial Intelligence techniques. This chapter, organized according to prominent Earth observation application fields, presents a panorama of the state-of-the-art in explainable remote sensing image analysis.
Authors: Thibault Simonetto, Salah Ghamizi, Antoine Desjardins, Maxime Cordy, Yves Le Traon
State-of-the-art deep learning models for tabular data have recently achieved performance acceptable for deployment in industrial settings. However, the robustness of these models remains scarcely explored. Contrary to computer vision, there is to date no realistic protocol to properly evaluate the adversarial robustness of deep tabular models due to intrinsic properties of tabular data such as categorical features, immutability, and feature relationship constraints. To fill this gap, we propose CAA, the first efficient evasion attack for constrained tabular deep learning models. CAA is an iterative parameter-free attack that combines gradient and search attacks to generate adversarial examples under constraints. We leverage CAA to build a benchmark of deep tabular models across three popular use cases: credit scoring, phishing, and botnet attack detection. Our benchmark supports ten threat models with increasing capabilities of the attacker, and reflects real-world attack scenarios for each use case. Overall, our results demonstrate how domain knowledge, adversarial training, and attack budgets impact the robustness assessment of deep tabular models and provide security practitioners with a set of recommendations to improve the robustness of deep tabular models against various evasion attack scenarios.
Authors: Amir H. Khodabakhsh, Seid H. Pourtakdoust
The Fokker-Planck-Kolmogorov (FPK) equation is an idealized model representing many stochastic systems commonly encountered in the analysis of stochastic structures as well as many other applications. Its solution thus provides an invaluable insight into the performance of many engineering systems. Despite its great importance, the solution of the FPK equation is still extremely challenging. For systems of practical significance, the FPK equation is usually high dimensional, rendering most of the numerical methods ineffective. In this respect, the present work introduces the FPK-DP Net as a physics-informed network that encodes the physical insights, i.e. the governing constrained differential equations arising from physical laws, into a deep neural network. FPK-DP Net is a mesh-free learning method that can solve the density evolution of stochastic dynamics subjected to additive white Gaussian noise without any prior simulation data and can be used as an efficient surrogate model afterward. FPK-DP Net uses the dimension-reduced FPK equation. Therefore, it can be used to address high-dimensional practical problems as well. To demonstrate the potential applicability of the proposed framework, and to study its accuracy and efficacy, numerical implementations on five different benchmark problems are investigated.
Authors: Ravil Mussabayev, Rustam Mussabayev
This study focuses on the optimization of the Big-means algorithm for clustering large-scale datasets, exploring four distinct parallelization strategies. We conducted extensive experiments to assess the computational efficiency, scalability, and clustering performance of each approach, revealing their benefits and limitations. The paper also delves into the trade-offs between computational efficiency and clustering quality, examining the impacts of various factors. Our insights provide practical guidance on selecting the best parallelization strategy based on available resources and dataset characteristics, contributing to a deeper understanding of parallelization techniques for the Big-means algorithm.
Authors: Dennis Rall, Bernhard Bauer, Thomas Fraunholz
Recent AI research has significantly reduced the barriers to applying AI, but the process of setting up the necessary tools and frameworks can still be a challenge. While AI-as-a-Service platforms have emerged to simplify the training and deployment of AI models, they still fall short of achieving true democratization of AI. In this paper, we aim to address this gap by comparing several popular AI-as-a-Service platforms and identifying the key requirements for a platform that can achieve true democratization of AI. Our analysis highlights the need for self-hosting options, high scalability, and openness. To address these requirements, we propose our approach: the "Open Space for Machine Learning" platform. Our platform is built on cutting-edge technologies such as Kubernetes, Kubeflow Pipelines, and Ludwig, enabling us to overcome the challenges of democratizing AI. We argue that our approach is more comprehensive and effective in meeting the requirements of democratizing AI than existing AI-as-a-Service platforms.
Authors: Feihu Huang
In this paper, we propose a class of efficient adaptive bilevel methods based on mirror descent for nonconvex bilevel optimization, where the upper-level problem is nonconvex, possibly with nonsmooth regularization, and the lower-level problem is also nonconvex while satisfying the Polyak-{\L}ojasiewicz (PL) condition. To solve these deterministic bilevel problems, we present an efficient adaptive projection-aid gradient (i.e., AdaPAG) method based on mirror descent, and prove that it obtains the best known gradient complexity of $O(\epsilon^{-1})$ for finding an $\epsilon$-stationary solution of nonconvex bilevel problems. To solve stochastic bilevel problems, we propose an efficient adaptive stochastic projection-aid gradient (i.e., AdaVSPAG) method based on mirror descent and variance-reduction techniques, and prove that it obtains the best known gradient complexity of $O(\epsilon^{-3/2})$ for finding an $\epsilon$-stationary solution. Since the PL condition relaxes strong convexity, our algorithms can also be applied to nonconvex-strongly-convex bilevel optimization. Theoretically, we provide a useful convergence analysis framework for our methods under some mild conditions, and prove that our methods have a fast convergence rate of $O(\frac{1}{T})$, where $T$ denotes the number of iterations.
Authors: Seonkyu Lim, Jaehyeon Park, Seojin Kim, Hyowon Wi, Haksoo Lim, Jinsung Jeon, Jeongwhan Choi, Noseong Park
Long-term time series forecasting (LTSF) is a challenging task that has been investigated in various domains such as financial investment, health care, traffic, and weather forecasting. In recent years, Linear-based LTSF models have shown better performance, highlighting the temporal information loss caused by Transformer-based approaches. However, the Linear-based approach also has a limitation: the model is too simple to comprehensively exploit the characteristics of the dataset. To address this limitation, we propose LTSF-DNODE, which applies a model based on linear ordinary differential equations (ODEs) and a time series decomposition method according to data statistical characteristics. We show that LTSF-DNODE outperforms the baselines on various real-world datasets. In addition, for each dataset, we explore the impacts of regularization in the neural ordinary differential equation (NODE) framework.
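The abstract above does not say which decomposition LTSF-DNODE uses. As a hedged illustration of what a time series decomposition can look like, the sketch below splits a series into a moving-average trend and a remainder, a scheme commonly paired with linear forecasters. The function names, the odd-window requirement, and the edge padding are assumptions of this sketch, not details from the paper.

```python
def moving_average(series, window):
    # Centered moving average; the ends are padded by repeating the edge values
    # so the trend has the same length as the input.
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series, window=3):
    # Split the series into a smooth trend and the remainder (series - trend).
    trend = moving_average(series, window)
    remainder = [x - t for x, t in zip(series, trend)]
    return trend, remainder


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    trend, remainder = decompose(values, window=3)
    print(trend)
    print(remainder)
```

Each component could then be forecast by its own model and the forecasts summed, since the trend and remainder add back up to the original series.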
Authors: Tianchen Zhou, Jia Liu, Yang Jiao, Chaosheng Dong, Yetian Chen, Yan Gao, Yi Sun
Online learning to rank (ONL2R) is a foundational problem for recommender systems and has received increasing attention in recent years. Among the existing approaches for ONL2R, a natural modeling architecture is the multi-armed bandit framework coupled with the position-based click model. However, developing efficient online learning policies for MAB-based ONL2R with position-based click models is highly challenging due to the combinatorial nature of the problem, and partial observability in the position-based click model. To date, results in MAB-based ONL2R with position-based click models remain rather limited, which motivates us to fill this gap in this work. Our main contributions in this work are threefold: i) We propose the first general MAB framework that captures all key ingredients of ONL2R with position-based click models. Our model considers personalized and equal treatments in ONL2R ranking recommendations, both of which are widely used in practice; ii) Based on the above analytical framework, we develop two unified greed- and UCB-based policies called GreedyRank and UCBRank, each of which can be applied to personalized and equal ranking treatments; and iii) We show that both GreedyRank and UCBRank enjoy $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regret for personalized and equal treatment, respectively. For the fundamentally hard equal ranking treatment, we identify classes of collective utility functions and their associated sufficient conditions under which $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regrets are still achievable for GreedyRank and UCBRank, respectively. Our numerical experiments also verify our theoretical results and demonstrate the efficiency of GreedyRank and UCBRank in seeking the optimal action under various problem settings.
Authors: Xiong-Bin Yan, Keke Wu, Zhi-Qin John Xu, Zheng Ma
Full-waveform inversion (FWI) is a powerful geophysical imaging technique that infers high-resolution subsurface physical parameters by solving a non-convex optimization problem. However, due to limitations in observation, e.g., limited shots or receivers, and random noise, conventional inversion methods are confronted with numerous challenges, such as the local-minimum problem. In recent years, a substantial body of work has demonstrated that the integration of deep neural networks and partial differential equations for solving full-waveform inversion problems has shown promising performance. In this work, drawing inspiration from the expressive capacity of neural networks, we provide an unsupervised learning approach aimed at accurately reconstructing subsurface physical velocity parameters. This method is founded on a re-parametrization technique for Bayesian inference, achieved through a deep neural network with random weights. Notably, our proposed approach does not hinge upon the requirement of the labeled training dataset, rendering it exceedingly versatile and adaptable to diverse subsurface models. Extensive experiments show that the proposed approach performs noticeably better than existing conventional inversion methods.
Authors: Tiasa Singha Roy, Priyam Basu
Research on data generation and augmentation has focused mainly on enhancing generation models, leaving a notable gap in the exploration and refinement of methods for evaluating synthetic data. Several text similarity metrics can be used to filter generated data, and the choice among them can impact the performance of specific Natural Language Understanding (NLU) tasks; we focus specifically on intent and sentiment classification. In this study, we propose RankAug, a text-ranking approach that detects and filters out the top augmented texts in terms of being most similar in meaning with lexical and syntactical diversity. Through experiments conducted on multiple datasets, we demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes.
Authors: Ercong Yu, Jinle Zhu, Qiang Li, Zilong Liu, Hongyang Chen, Shlomo Shamai (Shitz), H. Vincent Poor
This paper is focused on multiuser load modulation arrays (MU-LMAs) which are attractive due to their low system complexity and reduced cost for millimeter wave (mmWave) multi-input multi-output (MIMO) systems. The existing precoding algorithm for downlink MU-LMA relies on a sub-array structured (SAS) transmitter which may suffer from decreased degrees of freedom and complex system configuration. Furthermore, a conventional LMA codebook with codewords uniformly distributed on a hypersphere may not be channel-adaptive and may lead to increased signal detection complexity. In this paper, we conceive an MU-LMA system employing a full-array structured (FAS) transmitter and propose two algorithms accordingly. The proposed FAS-based system addresses the SAS structural problems and can support larger numbers of users. For LMA-imposed constant-power downlink precoding, we propose an FAS-based normalized block diagonalization (FAS-NBD) algorithm. However, the forced normalization may result in performance degradation. This degradation, together with the aforementioned codebook design problems, is difficult to solve analytically. This motivates us to propose a Deep Learning-enhanced (FAS-DL-NBD) algorithm for adaptive codebook design and codebook-independent decoding. It is shown that the proposed algorithms are robust to imperfect knowledge of channel state information and yield excellent error performance. Moreover, the FAS-DL-NBD algorithm enables signal detection with low complexity as the number of bits per codeword increases.
Authors: Nan Li, Bo Kang, Jefrey Lijffijt, Tijl De Bie
In settings such as e-recruitment and online dating, recommendation involves distributing limited opportunities, calling for novel approaches to quantify and enforce fairness. We introduce \emph{inferiority}, a novel (un)fairness measure quantifying a user's competitive disadvantage for their recommended items. Inferiority complements \emph{envy}, a fairness notion measuring preference for others' recommendations. We combine inferiority and envy with \emph{utility}, an accuracy-related measure of aggregated relevancy scores. Since these measures are non-differentiable, we reformulate them using a probabilistic interpretation of recommender systems, yielding differentiable versions. We combine these loss functions in a multi-objective optimization problem called \texttt{FEIR} (Fairness through Envy and Inferiority Reduction), applied as post-processing for standard recommender systems. Experiments on synthetic and real-world data demonstrate that our approach improves trade-offs between inferiority, envy, and utility compared to naive recommendations and the baseline methods.
Authors: Xin Cheng, Yuzhou Cao, Haobo Wang, Hongxin Wei, Bo An, Lei Feng
Learning with rejection is an important framework in which a model can refrain from making predictions to avoid critical mispredictions, balancing prediction against rejection. Previous studies on cost-based rejection only focused on the classification setting, which cannot handle the continuous and infinite target space in the regression setting. In this paper, we investigate a novel regression problem called regression with cost-based rejection, where the model can decline to make predictions on some examples given certain rejection costs. To solve this problem, we first formulate the expected risk for this problem and then derive the Bayes optimal solution, which shows that the optimal model should decline to make predictions on the examples whose variance is larger than the rejection cost when the mean squared error is used as the evaluation metric. Furthermore, we propose to train the model by a surrogate loss function that considers rejection as binary classification and we provide conditions for the model consistency, which implies that the Bayes optimal solution can be recovered by our proposed surrogate loss. Extensive experiments demonstrate the effectiveness of our proposed method.
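The Bayes optimal rule stated above can be written down directly. Under mean squared error, predicting the conditional mean incurs an expected loss equal to the conditional variance, so rejection pays off exactly when that variance exceeds the rejection cost. The sketch below assumes the conditional mean and variance are already known or estimated; the function names are hypothetical, and the surrogate-loss training procedure from the paper is not shown.

```python
def predict_or_reject(mean, variance, cost):
    # Return the prediction (the conditional mean), or None to signal rejection.
    # Rejecting costs `cost`; predicting costs `variance` in expectation under MSE.
    if variance > cost:
        return None
    return mean


def pointwise_risk(variance, cost):
    # Expected risk of the rule above at a single input: the cheaper of the
    # two options, predicting (variance) or rejecting (cost).
    return min(variance, cost)


if __name__ == "__main__":
    # (conditional mean, conditional variance) pairs for three hypothetical inputs.
    examples = [(3.0, 0.2), (1.5, 2.0), (-0.7, 0.9)]
    for mean, variance in examples:
        print(predict_or_reject(mean, variance, cost=1.0), pointwise_risk(variance, cost=1.0))
```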
Authors: Huayi Tang, Yong Liu
In this paper, we develop, for the first time, data-dependent and algorithm-dependent generalization bounds for transductive learning algorithms in the context of information theory. We show that the generalization gap of transductive learning algorithms can be bounded by the mutual information between training labels and hypothesis. By proposing the concept of transductive supersamples, we go beyond the inductive learning setting and establish upper bounds in terms of various information measures. Furthermore, we derive novel PAC-Bayesian bounds and build the connection between generalization and loss landscape flatness under the transductive learning setting. Finally, we present the upper bounds for adaptive optimization algorithms and demonstrate the applications of our results on semi-supervised learning and graph learning scenarios. Our theoretical results are validated on both synthetic and real-world datasets.
Authors: Krzysztof Werner, Kamil Wereszczyński, Rafał Potempa, Krzysztof Cyran
Despite the limited availability and quantum volume of quantum computers, quantum image representation is a widely researched area. Currently developed methods use quantum entanglement to encode information about pixel positions. These methods range from using the angle parameter of the rotation gate (e.g., the Flexible Representation of Quantum Images, FRQI), sequences of qubits (e.g., Novel Enhanced Quantum Representation, NEQR), or the angle parameter of the phase shift gates (e.g., Local Phase Image Quantum Encoding, LPIQE) for storing color information. All these methods are significantly affected by decoherence and other forms of quantum noise, which is an inseparable part of quantum computing in the noisy intermediate-scale quantum era. These phenomena can highly influence the measurements and result in extracted images that are visually dissimilar to the originals. Because this process is at its foundation quantum, the computational reversal of this process is possible. There are many methods for error correction, mitigation, and reduction, but all of them use quantum computer time or additional qubits to achieve the desired result. We report the successful use of a generative adversarial network trained for image-to-image translation, in conjunction with Phase Distortion Unraveling error reduction method, for reducing overall error in images encoded using LPIQE.
Authors: Akshit Jindal, Vikram Goyal, Saket Anand, Chetan Arora
Machine Learning (ML) models become vulnerable to Model Stealing Attacks (MSA) when they are deployed as a service. In such attacks, the deployed model is queried repeatedly to build a labelled dataset. This dataset allows the attacker to train a thief model that mimics the original model. To maximize query efficiency, the attacker has to select the most informative subset of data points from the pool of available data. Existing attack strategies utilize approaches like Active Learning and Semi-Supervised learning to minimize costs. However, in the black-box setting, these approaches may select sub-optimal samples as they train only one thief model. Depending on the thief model's capacity and the data it was pretrained on, the model might even select noisy samples that harm the learning process. In this work, we explore the usage of an ensemble of deep learning models as our thief model. We call our attack Army of Thieves (AOT) as we train multiple models with varying complexities to leverage the crowd's wisdom. Based on the ensemble's collective decision, uncertain samples are selected for querying, while the most confident samples are directly included in the training data. Our approach is the first one to utilize an ensemble of thief models to perform model extraction. We outperform the base approaches of existing state-of-the-art methods by at least 3% and achieve a 21% higher adversarial sample transferability than previous work for models trained on the CIFAR-10 dataset.
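The selection step described above (query the uncertain samples, keep the confident ones with the ensemble's own label) can be sketched as follows. The abstract does not state how uncertainty is measured, so this sketch assumes the entropy of the averaged ensemble prediction; the function names, the `query_budget` and `confidence_threshold` parameters, and the data layout are all hypothetical.

```python
import math


def mean_probs(per_model_probs):
    # Average the class-probability vectors produced by the ensemble members.
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [sum(p[c] for p in per_model_probs) / n_models for c in range(n_classes)]


def entropy(probs):
    # Shannon entropy (in nats) of a probability vector.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_pool(pool_probs, query_budget, confidence_threshold):
    # pool_probs[i][m] is model m's class-probability vector for pool sample i.
    scored = []
    for idx, per_model in enumerate(pool_probs):
        avg = mean_probs(per_model)
        scored.append((entropy(avg), idx, avg))
    # Most uncertain samples first.
    scored.sort(key=lambda item: item[0], reverse=True)
    to_query = [idx for _, idx, _ in scored[:query_budget]]
    # Among the rest, keep only confident samples, labelled by the ensemble.
    pseudo_labels = {}
    for _, idx, avg in scored[query_budget:]:
        top = max(avg)
        if top >= confidence_threshold:
            pseudo_labels[idx] = avg.index(top)
    return to_query, pseudo_labels


if __name__ == "__main__":
    pool = [
        [[0.5, 0.5], [0.6, 0.4]],      # ensemble is unsure
        [[0.99, 0.01], [0.97, 0.03]],  # ensemble is confident
        [[0.7, 0.3], [0.6, 0.4]],      # in between
    ]
    print(split_pool(pool, query_budget=1, confidence_threshold=0.9))
```

Samples that are neither uncertain enough to query nor confident enough to pseudo-label are simply left in the pool for a later round.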
Authors: Suryaka Suresh, Bishshoy Das, Vinayak Abrol, Sumantra Dutta Roy
We study how the topology of feature embedding space changes as it passes through the layers of a well-trained deep neural network (DNN) through Betti numbers. Motivated by existing studies using simplicial complexes on shallow fully connected networks (FCN), we present an extended analysis using Cubical homology instead, with a variety of popular deep architectures and real image datasets. We demonstrate that as depth increases, a topologically complicated dataset is transformed into a simple one, resulting in Betti numbers attaining their lowest possible value. The rate of decay in topological complexity (as a metric) helps quantify the impact of architectural choices on the generalization ability. Interestingly, from a representation learning perspective, we highlight several invariances such as topological invariance of (1) an architecture on similar datasets; (2) embedding space of a dataset for architectures of variable depth; (3) embedding space to input resolution/size, and (4) data sub-sampling. In order to further demonstrate the link between expressivity and the generalization capability of a network, we consider the task of ranking pre-trained models for a downstream classification task (transfer learning). Compared to existing approaches, the proposed metric has a better correlation to the actually achievable accuracy via fine-tuning the pre-trained model.
Authors: Chunyang Huang, Shaoliang Zhang
This study presents an innovative method for predicting the market value of professional soccer players using explainable machine learning models. Using a dataset curated from the FIFA website, we employ an ensemble machine learning approach coupled with Shapley Additive exPlanations (SHAP) to provide detailed explanations of the models' predictions. The GBDT model achieves the highest mean R-Squared (0.8780) and the lowest mean Root Mean Squared Error (3,221,632.175), indicating its superior performance among the evaluated models. Our analysis reveals that specific skills such as ball control, short passing, finishing, interceptions, dribbling, and tackling are paramount within the skill dimension, whereas sprint speed and acceleration are critical in the fitness dimension, and reactions are preeminent in the cognitive dimension. Our results offer a more accurate, objective, and consistent framework for market value estimation, presenting useful insights for managerial decisions in player transfers.
Authors: Pourya Behmandpoor, Panagiotis Patrinos, Marc Moonen
Deep learning (DL) based resource allocation (RA) has recently gained a lot of attention due to its performance efficiency. However, most of the related studies assume an ideal case where the number of users and their utility demands, e.g., data rate constraints, are fixed and the designed DL based RA scheme exploits a policy trained only for these fixed parameters. A computationally complex policy retraining is required whenever these parameters change. Therefore, in this paper, a DL based resource allocator (ALCOR) is introduced, which allows users to freely adjust their utility demands based on, e.g., their application layer. ALCOR employs deep neural networks (DNNs), as the policy, in an iterative optimization algorithm. The optimization algorithm aims to optimize the on-off status of users in a time-sharing problem to satisfy their utility demands in expectation. The policy performs unconstrained RA (URA) -- RA without taking into account user utility demands -- among active users to maximize the sum utility (SU) at each time instant. Based on the chosen URA scheme, ALCOR can perform RA in a model-based or model-free manner and in a centralized or distributed scenario. Derived convergence analyses provide guarantees for the convergence of ALCOR, and numerical experiments corroborate its effectiveness.
Authors: Pourya Behmandpoor, Marc Moonen, Panagiotis Patrinos
Distributed optimization has experienced a significant surge in interest due to its wide-ranging applications in distributed learning and adaptation. While various scenarios, such as shared-memory, local-memory, and consensus-based approaches, have been extensively studied in isolation, there remains a need for further exploration of their interconnections. This paper specifically concentrates on a scenario where agents collaborate toward a unified mission while potentially having distinct tasks. Each agent's actions can potentially impact other agents through interactions. Within this context, the objective for the agents is to optimize their local parameters based on the aggregate of local reward functions, where only local zeroth-order oracles are available. Notably, the learning process is asynchronous, meaning that agents update and query their zeroth-order oracles asynchronously while communicating with other agents subject to bounded but possibly random communication delays. This paper presents theoretical convergence analyses and establishes a convergence rate for the proposed approach. Furthermore, it addresses the relevant issue of deep learning-based resource allocation in communication networks and conducts numerical experiments in which agents, acting as transmitters, collaboratively train their individual (possibly unique) policies to maximize a common performance metric.
Authors: Ali Mohammadifar, Hasan Samadbin, Arman Daliri
The path to an autism diagnosis can be long and difficult, and delays can have serious consequences. Artificial intelligence can fundamentally change the way autism is diagnosed, especially in situations where the first signs of the disorder are difficult to see. AI-based diagnostic tools may help confirm a diagnosis or highlight the need for further testing by analyzing large volumes of data and uncovering patterns that may not be immediately apparent to human evaluators. After a successful and timely diagnosis, autism can be treated through artificial intelligence using various methods. In this article, we address the early diagnosis of this disorder by combining four datasets using federated learning and classifying them with a support vector classifier. With this method, we achieve 99% accuracy in predicting autism spectrum disorder and a 13% improvement in the results.
Authors: Nazarii Tupitsa, Abdulla Jasem Almansoori, Yanlin Wu, Martin Takáč, Karthik Nandakumar, Samuel Horváth, Eduard Gorbunov
Robustness to Byzantine attacks is a necessity for various distributed training scenarios. When the training reduces to the process of solving a minimization problem, Byzantine robustness is relatively well-understood. However, other problem formulations, such as min-max problems or, more generally, variational inequalities, arise in many modern machine learning and, in particular, distributed learning tasks. These problems significantly differ from the standard minimization ones and, therefore, require separate consideration. Nevertheless, only one work (Adibi et al., 2022) addresses this important question in the context of Byzantine robustness. Our work makes a further step in this direction by providing several (provably) Byzantine-robust methods for distributed variational inequality, thoroughly studying their theoretical convergence, removing the limitations of the previous work, and providing numerical comparisons supporting the theoretical findings.
Authors: Dominic De Jonge
We introduce LuminanceL1Loss, a novel loss function designed to enhance the performance of image restoration tasks. We demonstrate its superiority over MSE when applied to the Retinexformer, BUIFD and DnCNN architectures. Our proposed LuminanceL1Loss transforms images into grayscale and subsequently computes the MSE loss for both grayscale and color channels. Experimental results demonstrate that this loss function consistently outperforms traditional methods, showcasing its potential in image denoising and other related tasks in image reconstruction. It demonstrates gains of up to 4.7 dB. The results presented in this study highlight the efficacy of LuminanceL1Loss for various image restoration tasks.
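A loss of the kind described above, an error term on the color channels plus an error term on a grayscale version of the image, could look like the sketch below. It follows the abstract's wording and uses MSE for both terms by default; the error function is a parameter so that an absolute-error (L1) variant can be swapped in. The grayscale weights (ITU-R BT.601 luma coefficients), the `gray_weight` factor, and all function names are assumptions of this sketch, not the paper's definition.

```python
def to_gray(pixel):
    # Luma approximation with ITU-R BT.601 weights (an assumption of this sketch).
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def mse(a, b):
    # Mean squared error between two equal-length sequences.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mae(a, b):
    # Mean absolute error (L1), provided as an alternative error function.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def luminance_color_loss(pred, target, error_fn=mse, gray_weight=1.0):
    # pred and target are lists of (r, g, b) tuples with values in [0, 1].
    pred_flat = [c for px in pred for c in px]
    target_flat = [c for px in target for c in px]
    color_term = error_fn(pred_flat, target_flat)
    gray_term = error_fn([to_gray(p) for p in pred], [to_gray(t) for t in target])
    return color_term + gray_weight * gray_term


if __name__ == "__main__":
    clean = [(0.2, 0.4, 0.6), (0.9, 0.1, 0.3)]
    noisy = [(0.25, 0.35, 0.6), (0.8, 0.2, 0.3)]
    print(luminance_color_loss(noisy, clean))
    print(luminance_color_loss(noisy, clean, error_fn=mae))
```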
Authors: D. Tramontano, L. Waldmann, M. Drton, E. Duarte
We present a consistent and highly scalable local approach to learn the causal structure of a linear Gaussian polytree using data from interventional experiments with known intervention targets. Our methods first learn the skeleton of the polytree and then orient its edges. The output is a CPDAG representing the interventional equivalence class of the polytree of the true underlying distribution. The skeleton and orientation recovery procedures we use rely on second order statistics and low-dimensional marginal distributions. We assess the performance of our methods under different scenarios in synthetic data sets and apply our algorithm to learn a polytree in a gene expression interventional data set. Our simulation studies demonstrate that our approach is fast, has good accuracy in terms of structural Hamming distance, and handles problems with thousands of nodes.
Authors: Daniil Kirilenko, Vitaliy Vorobyov, Alexey K. Kovalev, Aleksandr I. Panov
Object-centric architectures usually apply a differentiable module to the entire feature map to decompose it into sets of entity representations called slots. Some of these methods structurally resemble clustering algorithms, where the cluster's center in latent space serves as a slot representation. Slot Attention is an example of such a method, acting as a learnable analog of the soft k-means algorithm. Our work employs a learnable clustering method based on the Gaussian Mixture Model. Unlike other approaches, we represent slots not only as centers of clusters but also incorporate information about the distance between clusters and assigned vectors, leading to more expressive slot representations. Our experiments demonstrate that using this approach instead of Slot Attention improves performance in object-centric scenarios, achieving state-of-the-art results in the set property prediction task.
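The soft k-means algorithm that the abstract above compares Slot Attention to can be sketched in a few lines. This is a minimal, non-learnable version on one-dimensional points: each point is softly assigned to every center via a softmax over negative squared distances, and each center moves to the responsibility-weighted mean. The `temperature` parameter and function name are assumptions; the paper's Gaussian Mixture Model variant, which additionally tracks spread information, is not reproduced here.

```python
import math


def soft_kmeans_step(points, centers, temperature=1.0):
    # E-like step: soft assignments via a softmax over negative squared distances.
    responsibilities = []
    for x in points:
        logits = [-((x - c) ** 2) / temperature for c in centers]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        responsibilities.append([w / total for w in weights])
    # M-like step: each center becomes the weighted mean of all points.
    new_centers = []
    for k in range(len(centers)):
        mass = sum(r[k] for r in responsibilities)
        weighted = sum(r[k] * x for r, x in zip(responsibilities, points))
        new_centers.append(weighted / mass)
    return responsibilities, new_centers


if __name__ == "__main__":
    points = [0.0, 0.1, 5.0, 5.1]
    centers = [1.0, 4.0]
    for _ in range(3):
        _, centers = soft_kmeans_step(points, centers)
    print(centers)
```

In this analogy the centers play the role of slots and the points play the role of feature-map vectors.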
Authors: Minhong Zhu, Zhenhao Zhao, Weiran Cai
The Transformer paradigm based on the self-attention mechanism has shown its advantage in learning graph-structured data. However, while Graph Transformers are capable of modeling full-range dependencies, they are often deficient in extracting information from locality. A common practice is to utilize Message Passing Neural Networks (MPNNs) as an auxiliary to capture local information, which, however, are still inadequate for comprehending substructures. In this paper, we present a purely attention-based architecture, namely Focal and Full-Range Graph Transformer (FFGT), which can mitigate the loss of local information in learning global correlations. The core component of FFGT is a new mechanism of compound attention, which combines the conventional full-range attention with K-hop focal attention on ego-nets to aggregate both global and local information. Beyond the scope of canonical Transformers, the FFGT has the merit of being more substructure-aware. Our approach enhances the performance of existing Graph Transformers on various open datasets, while achieving performance comparable to SOTA on several Long-Range Graph Benchmark (LRGB) datasets even with a vanilla transformer. We further examine influential factors on the optimal focal length of attention by introducing a novel synthetic dataset based on SBM-PATTERN.
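Restricting attention to a K-hop ego-net, as in the focal attention described above, amounts to building a boolean mask that marks which nodes lie within K hops of each node. The sketch below computes such a mask with a breadth-first search over an adjacency list. The function name and data layout are assumptions, and how FFGT combines the masked (focal) attention with full-range attention is not specified in the abstract and is not shown.

```python
from collections import deque


def k_hop_mask(adj, k):
    # adj[u] lists the neighbours of node u. mask[u][v] is True when v is
    # reachable from u in at most k hops (every node reaches itself in 0 hops).
    n = len(adj)
    mask = [[False] * n for _ in range(n)]
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == k:
                continue  # do not expand beyond the k-hop frontier
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v in dist:
            mask[src][v] = True
    return mask


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 - 3.
    path = [[1], [0, 2], [1, 3], [2]]
    for row in k_hop_mask(path, k=1):
        print(row)
```

Attention scores outside the mask would typically be set to negative infinity before the softmax, so each node attends only within its ego-net.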
Authors: Chenmien Tan, Ge Zhang, Jie Fu
While large language models (LLMs) have enabled learning knowledge from the pre-training corpora, the acquired knowledge may be fundamentally incorrect or become outdated over time, which necessitates rectifying the knowledge of the language model (LM) after training. A promising approach involves employing a hyper-network to generate parameter shifts, but existing hyper-networks scale poorly in the number of facts that can be edited simultaneously. To mitigate the problem, we propose the MAssive Language Model Editing Network (MALMEN), which formulates the parameter shift aggregation as a least-squares problem, subsequently updating the LM parameters using the normal equation. To accommodate editing multiple facts simultaneously with limited memory budgets, we separate the computation on the hyper-network and LM, enabling arbitrary batch sizes on both neural networks. Our method is evaluated by editing up to thousands of facts on LMs with different architectures, i.e., BERT-base, GPT-2, T5-XL (2.8B), and GPT-J (6B), across various knowledge-intensive NLP tasks, i.e., closed-book fact-checking and question answering. Remarkably, MALMEN is capable of editing hundreds of times more facts than strong baselines with the identical hyper-network architecture and outperforms an editor specifically designed for GPT. Our code is available at https://github.com/ChenmienTan/malmen.
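The abstract above refers to solving a least-squares problem with the normal equation. The following is a minimal, generic sketch of that numerical step only. It is not MALMEN's implementation; the function name `solve_normal_equation` and the small `ridge` term (added here for numerical stability) are assumptions made for illustration.

```python
import numpy as np

def solve_normal_equation(A, B, ridge=1e-6):
    """Return W minimising ||A @ W - B||_F^2 + ridge * ||W||_F^2.

    Generic least squares via the normal equation
    (A^T A + ridge * I) W = A^T B; an illustrative sketch only.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    gram = A.T @ A + ridge * np.eye(A.shape[1])
    # Solve the linear system rather than forming an explicit inverse.
    return np.linalg.solve(gram, A.T @ B)
```

Solving the system with `np.linalg.solve` avoids an explicit matrix inverse, which is usually the more stable choice.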
Authors: Subba Reddy Oota, Emin Çelik, Fatma Deniz, Mariya Toneva
Despite known differences between reading and listening in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree. This poses the question of what types of information language models truly predict in the brain. We investigate this question via a direct approach, in which we eliminate information related to specific low-level stimulus features (textual, speech, and visual) in the language model representations, and observe how this intervention affects the alignment with fMRI brain recordings acquired while participants read versus listened to the same naturalistic stories. We further contrast our findings with speech-based language models, which would be expected to predict speech-evoked brain activity better, provided they model language processing in the brain well. Using our direct approach, we find that both text-based and speech-based language models align well with early sensory regions due to shared low-level features. Text-based models continue to align well with later language regions even after removing these features, while, surprisingly, speech-based models lose most of their alignment. These findings suggest that speech-based models can be further improved to better reflect brain-like language processing.
Authors: Titouan Vayer, Etienne Lasalle, Rémi Gribonval, Paulo Gonçalves
We consider the problem of learning a graph modeling the statistical relations of the $d$ variables of a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$. Standard approaches amount to searching for a precision matrix $\Theta$ representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators usually require storing the $d^{2}$ values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this work, we adopt a compressive viewpoint and aim to estimate a sparse $\Theta$ from a sketch of the data, i.e. a low-dimensional vector of size $m \ll d^{2}$ carefully designed from $X$ using nonlinear random features. Under certain assumptions on the spectrum of $\Theta$ (or its condition number), we show that it is possible to estimate it from a sketch of size $m=\Omega((d+2k)\log(d))$ where $k$ is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory and involve restricted isometry properties and instance optimal decoders. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach and graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed.
Authors: Zhanbo Feng, Yuanjie Wang, Jie Li, Fan Yang, Jiong Lou, Tiebin Mi, Robert. C. Qiu, Zhenyu Liao
Modern machine learning (ML) models have grown to a scale where training them on a single machine becomes impractical. As a result, there is a growing trend to leverage federated learning (FL) techniques to train large ML models in a distributed and collaborative manner. These models, however, when deployed on new devices, might struggle to generalize well due to domain shifts. In this context, federated domain adaptation (FDA) emerges as a powerful approach to address this challenge.
Most existing FDA approaches focus on aligning the distributions of the source and target domains by minimizing a distance between them (e.g., MMD). Such strategies, however, inevitably introduce high communication overheads and can be highly sensitive to network reliability.
In this paper, we introduce RF-TCA, an enhancement to the standard Transfer Component Analysis approach that significantly accelerates computation without compromising theoretical and empirical performance. Leveraging the computational advantage of RF-TCA, we further extend it to the FDA setting with FedRF-TCA. The proposed FedRF-TCA protocol boasts communication complexity that is independent of the sample size, while maintaining performance that is comparable to or even surpasses state-of-the-art FDA methods. We present extensive experiments to showcase the superior performance and robustness (to network conditions) of FedRF-TCA.
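For readers unfamiliar with the MMD distance mentioned above, here is a small sketch of a biased estimate of the squared maximum mean discrepancy between two samples. It illustrates the distance that the abstract says existing FDA methods minimize, not the RF-TCA or FedRF-TCA method. The Gaussian kernel, the bandwidth parameter `gamma`, and the function names are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def mmd2_biased(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )
```

Note that this estimate needs kernel evaluations between source and target samples, so its cost grows with the sample size.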
Authors: Cathrin Elich, Lukas Kirchdorfer, Jan M. Köhler, Lukas Schott
While multi-task learning (MTL) has gained significant attention in recent years, its underlying mechanisms remain poorly understood. Recent methods did not yield consistent performance improvements over single-task learning (STL) baselines, underscoring the importance of gaining deeper insights into challenges specific to MTL. In our study, we challenge common assumptions in MTL in the context of STL: First, the choice of optimizer has received only limited investigation in MTL. We show the pivotal role of common STL tools such as the Adam optimizer in MTL. We attribute the effectiveness of Adam to its partial loss-scale invariance. Second, the notion of gradient conflicts has often been phrased as a specific problem in MTL. We delve into the role of gradient conflicts in MTL and compare it to STL. For angular gradient alignment, we find no evidence that this is a unique problem in MTL. We emphasize differences in gradient magnitude as the main distinguishing factor. Lastly, we compare the transferability of features learned through MTL and STL on common image corruptions, and find no conclusive evidence that MTL leads to superior transferability. Overall, we find surprising similarities between STL and MTL, which suggests considering methods from both fields in a broader context.
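One way to read the loss-scale invariance mentioned above is that Adam's update is unchanged when every gradient is multiplied by the same positive constant, up to the stabilising epsilon. The sketch below demonstrates this on the standard Adam update rule. It illustrates that property under the stated reading and does not reproduce the paper's analysis; the function name `adam_total_update` and the choice `eps=0.0` are assumptions.

```python
import numpy as np

def adam_total_update(grads, lr=0.1, beta1=0.9, beta2=0.999, eps=0.0):
    """Sum of the Adam parameter updates for a sequence of gradients.

    With eps == 0 the result does not change when all gradients are
    multiplied by a positive constant (gradients must be non-zero).
    """
    m = np.zeros_like(grads[0], dtype=float)
    v = np.zeros_like(grads[0], dtype=float)
    total = np.zeros_like(grads[0], dtype=float)
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        total -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return total

grads = [np.array([1.0, -2.0]), np.array([0.5, 3.0])]
scaled = [10.0 * g for g in grads]
same = np.allclose(adam_total_update(grads), adam_total_update(scaled))
```

With a non-zero `eps` the invariance holds only approximately, which is one possible sense of the word "partial".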
Authors: Calvin Metzger
Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship of images and texts and have enabled a wide range of applications, from image retrieval to classification. These models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores whether limited amounts of higher-quality data in a specific domain improve the general performance of CLIP models. To this end, we extract text-image data from scientific papers hosted in the arXiv and PubMed Central repositories. Experiments on small-scale CLIP models (ViT B/32) show that model performance increases on average, but only moderately. This result indicates that using the data sources considered in the paper to train large-scale CLIP models is a worthwhile research direction.
Authors: Bogumił Kamiński, Paweł Prałat, François Théberge, Sebastian Zając
The community structure that is often present in complex networks not only plays an important role in their formation but also shapes the dynamics of these networks, affecting the properties of their nodes. In this paper, we propose a family of community-aware node features and then investigate their properties. We show that they have high predictive power for classification tasks. We also verify that they contain information that can be recovered neither by classical node features nor by node embeddings (both classical and structural).
Authors: Wei Wang, Sattar Vakili, Ilija Bogunovic
We study the robust best-arm identification problem (RBAI) in the case of linear rewards. The primary objective is to identify a near-optimal robust arm, which involves selecting arms at every round and assessing their robustness by exploring potential adversarial actions. This approach is particularly relevant when utilizing a simulator and seeking to identify a robust solution for real-world transfer. To this end, we present an instance-dependent lower bound for the robust best-arm identification problem with linear rewards. Furthermore, we propose both static and adaptive bandit algorithms that achieve sample complexity that matches the lower bound. In synthetic experiments, our algorithms effectively identify the best robust arm and perform similarly to the oracle strategy. As an application, we examine diabetes care and the process of learning insulin dose recommendations that are robust with respect to inaccuracies in standard calculators. Our algorithms prove to be effective in identifying robust dosage values across various age ranges of patients.
Authors: Peihong Yu, Bhoram Lee, Aswin Raghavan, Supun Samarasekara, Pratap Tokekar, James Zachary Hare
In multi-agent systems, agents possess only local observations of the environment. Communication between teammates becomes crucial for enhancing coordination. Past research has primarily focused on encoding local information into embedding messages which are unintelligible to humans. We find that using these messages in agents' policy learning leads to brittle policies when tested on out-of-distribution initial states. We present an approach to multi-agent coordination, where each agent is equipped with the capability to integrate its (history of) observations, actions and messages received into a Common Operating Picture (COP) and disseminate the COP. This process takes into account the dynamic nature of the environment and the shared mission. We conducted experiments in the StarCraft2 environment to validate our approach. Our results demonstrate the efficacy of COP integration, and show that COP-based training leads to more robust policies than state-of-the-art Multi-Agent Reinforcement Learning (MARL) methods when faced with out-of-distribution initial states.
Authors: Pim de Haan, Taco Cohen, Johann Brehmer
The Geometric Algebra Transformer (GATr) is a versatile architecture for geometric deep learning based on projective geometric algebra. We generalize this architecture into a blueprint that allows one to construct a scalable transformer architecture given any geometric (or Clifford) algebra. We study versions of this architecture for Euclidean, projective, and conformal algebras, all of which are suited to represent 3D data, and evaluate them in theory and practice. The simplest Euclidean architecture is computationally cheap, but has a smaller symmetry group and is not as sample-efficient, while the projective model is not sufficiently expressive. Both the conformal algebra and an improved version of the projective algebra define powerful, performant architectures.
Authors: Florent Bouchard, Alexandre Renaux, Guillaume Ginolhac, Arnaud Breloy
In this paper, we develop a new Cramér-Rao Bound (CRB) for the case where the parameter to estimate lies in a manifold and follows a prior distribution. This derivation leads to a natural inequality between an error criterion based on geometrical properties and this new bound. This main contribution is illustrated on the problem of covariance estimation when the data follow a Gaussian distribution and the prior distribution is an inverse Wishart. Numerical simulations show new results in which the proposed CRB reveals interesting properties of the MAP estimator that are not observed with the classical Bayesian CRB.
Authors: Wujiang Xu, Xuying Ning, Wenfang Lin, Mingming Ha, Qiongxu Ma, Linxun Chen, Bing Han, Minnan Luo
Cross-domain sequential recommendation (CDSR) aims to address the data sparsity problems that exist in traditional sequential recommendation (SR) systems.
The existing approaches aim to design a specific cross-domain unit that can transfer and propagate information across multiple domains by relying on overlapping users with abundant behaviors. However, in real-world recommender systems, CDSR scenarios usually consist of a majority of long-tailed users with sparse behaviors and cold-start users who only exist in one domain. This leads to a drop in the performance of existing CDSR methods on real-world industry platforms. Therefore, improving the consistency and effectiveness of models in open-world CDSR scenarios is crucial for constructing CDSR models (1st CH). Recently, some SR approaches have utilized auxiliary behaviors to complement the information for long-tailed users. However, these multi-behavior SR methods cannot deliver promising performance in CDSR, as they overlook the semantic gap between target and auxiliary behaviors, as well as user interest deviation across domains (2nd CH).
Authors: Jan Thieß Brockmann, Marco Rudolph, Bodo Rosenhahn, Bastian Wandt
During the operation of industrial robots, unusual events may endanger the safety of humans and the quality of production. When collecting data to detect such cases, there is no guarantee that data from all potentially occurring errors is included, as unforeseeable events may happen over time. Therefore, anomaly detection (AD) delivers a practical solution, using only normal data to learn to detect unusual events. We introduce a dataset that allows training and benchmarking of anomaly detection methods for robotic applications based on machine data, which will be made publicly available to the research community. As a typical robot task, the dataset includes a pick-and-place application which involves movement, actions of the end effector and interactions with the objects of the environment. Since several of the contained anomalies are not task-specific but general, evaluations on our dataset are transferable to other robotics applications as well. Additionally, we present MVT-Flow (multivariate time-series flow) as a new baseline method for anomaly detection: it relies on deep-learning-based density estimation with normalizing flows, tailored to the data domain by taking its structure into account for the architecture. Our evaluation shows that MVT-Flow outperforms baselines from previous work by a large margin of 6.2% in area under the ROC curve.
Authors: Anubhav Bhatti, Yuwei Liu, Chen Dan, Bingjie Shen, San Lee, Yonghwan Kim, Jang Yong Kim
Sepsis and septic shock are critical medical conditions affecting millions globally, with a substantial mortality rate. This paper uses state-of-the-art deep learning (DL) architectures to introduce a multi-step forecasting system to predict vital signs indicative of septic shock progression in Intensive Care Units (ICUs). Our approach utilizes a short window of historical vital sign data to forecast future physiological conditions. We introduce a DL-based vital sign forecasting system that predicts up to 3 hours of future vital signs from 6 hours of past data. We further adopt the DILATE loss function to better capture the shape and temporal dynamics of vital signs, which are critical for clinical decision-making. We compare three DL models, N-BEATS, N-HiTS, and Temporal Fusion Transformer (TFT), using the publicly available eICU Collaborative Research Database (eICU-CRD), highlighting their forecasting capabilities in a critical care setting. We evaluate the performance of our models using mean squared error (MSE) and dynamic time warping (DTW) metrics. Our findings show that while TFT excels in capturing overall trends, N-HiTS is superior in retaining short-term fluctuations within a predefined range. This paper demonstrates the potential of deep learning in transforming the monitoring systems in ICUs, potentially leading to significant improvements in patient care and outcomes by accurately forecasting vital signs to assist healthcare providers in detecting early signs of physiological instability and anticipating septic shock.
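The DTW metric named above compares two series while allowing them to be stretched in time. A minimal sketch of the classic dynamic-programming formulation for one-dimensional series follows. The absolute-difference local cost and the function name `dtw_distance` are assumptions; the paper's exact evaluation setup is not specified in the abstract.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    Uses |a_i - b_j| as the local cost and the standard three-way
    recurrence (insertion, deletion, match). Runs in O(len(a) * len(b)).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j],      # a advances, b stays
                cost[i, j - 1],      # b advances, a stays
                cost[i - 1, j - 1],  # both advance
            )
    return float(cost[n, m])
```

Unlike MSE, this distance does not penalise a forecast that has the right shape but is slightly shifted or stretched in time.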
Authors: Stefan Matthes, Zhiwei Han, Hao Shen
Contrastive learning has recently emerged as a promising approach for learning data representations that discover and disentangle the explanatory factors of the data. Previous analyses of such approaches have largely focused on individual contrastive losses, such as noise-contrastive estimation (NCE) and InfoNCE, and rely on specific assumptions about the data generating process. This paper extends the theoretical guarantees for disentanglement to a broader family of contrastive methods, while also relaxing the assumptions about the data distribution. Specifically, we prove identifiability of the true latents for four contrastive losses studied in this paper, without imposing common independence assumptions. The theoretical findings are validated on several benchmark datasets. Finally, practical limitations of these methods are also investigated.
Authors: Dinar Sharafutdinov, Stanislav Kuskov, Saian Protasov, Alexey Voropaev
In recent years, computer vision has transformed fields such as medical imaging, object recognition, and geospatial analytics. One of the fundamental tasks in computer vision is semantic image segmentation, which is vital for precise object delineation. Autonomous driving represents one of the key areas where computer vision algorithms are applied. The task of road surface segmentation is crucial in self-driving systems, but it requires a labor-intensive annotation process in several data domains. The work described in this paper aims to improve the efficiency of image segmentation using a convolutional neural network in a multi-sensor setup. This approach leverages lidar (Light Detection and Ranging) annotations to directly train image segmentation models on RGB images. Lidar supplements the images by emitting laser pulses and measuring reflections to provide depth information. However, lidar's sparse point clouds often create difficulties for accurate object segmentation. Segmentation of point clouds requires time-consuming preliminary data preparation and a large amount of computational resources. The key innovation of our approach is the masked loss, addressing sparse ground-truth masks from point clouds. By calculating loss exclusively where lidar points exist, the model learns road segmentation on images by using lidar points as ground truth. This approach allows for blending of different ground-truth data types during model training. Experimental validation of the approach on benchmark datasets shows comparable performance to a high-quality image segmentation model. Incorporating lidar reduces the load on annotations and enables training of image-segmentation models without loss of segmentation quality. The methodology is tested on diverse datasets, both publicly available and proprietary. The strengths and weaknesses of the proposed method are also discussed in the paper.
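The masked loss described above, computed only where lidar points exist, can be sketched as follows. The abstract does not state which per-pixel loss is used; binary cross-entropy is an assumption here, as are the function name `masked_bce` and the convention that `mask` is 1 at pixels with a projected lidar point and 0 elsewhere.

```python
import numpy as np

def masked_bce(pred_prob, target, mask, eps=1e-7):
    """Binary cross-entropy averaged over pixels where mask == 1.

    pred_prob: predicted road probability per pixel, in [0, 1].
    target:    sparse ground-truth labels (values outside the mask are ignored).
    mask:      1 where a lidar point provides a label, 0 elsewhere.
    """
    p = np.clip(np.asarray(pred_prob, dtype=float), eps, 1.0 - eps)
    t = np.asarray(target, dtype=float)
    m = np.asarray(mask, dtype=float)
    per_pixel = -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    n_valid = m.sum()
    if n_valid == 0:
        return 0.0  # no labelled pixels in this image
    return float((per_pixel * m).sum() / n_valid)
```

Pixels without a lidar point contribute nothing to the loss, so predictions there are left unconstrained by the sparse labels.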
Authors: Yahong Yang, Yulong Lu
This paper establishes the nearly optimal rate of approximation for deep neural networks (DNNs) when applied to Korobov functions, effectively overcoming the curse of dimensionality. The approximation results presented in this paper are measured with respect to $L_p$ norms and $H^1$ norms. Our achieved approximation rate demonstrates a remarkable "super-convergence" rate, outperforming traditional methods and any continuous function approximator. These results are non-asymptotic, providing error bounds that consider both the width and depth of the networks simultaneously.
Authors: Thomas Sanchez, Oscar Esteban, Yvan Gomez, Alexandre Pron, Mériam Koob, Vincent Dunet, Nadine Girard, Andras Jakab, Elisenda Eixarch, Guillaume Auzias, Meritxell Bach Cuadra
Fetal brain MRI is becoming an increasingly relevant complement to neurosonography for perinatal diagnosis, allowing fundamental insights into fetal brain development throughout gestation. However, uncontrolled fetal motion and heterogeneity in acquisition protocols lead to data of variable quality, potentially biasing the outcome of subsequent studies. We present FetMRQC, an open-source machine-learning framework for automated image quality assessment and quality control that is robust to domain shifts induced by the heterogeneity of clinical data. FetMRQC extracts an ensemble of quality metrics from unprocessed anatomical MRI and combines them to predict experts' ratings using random forests. We validate our framework on a large and diverse dataset of more than 1600 manually rated fetal brain T2-weighted images from four clinical centers and 13 different scanners. Our study shows that FetMRQC's predictions generalize well to unseen data while being interpretable. FetMRQC is a step towards more robust fetal brain neuroimaging, which has the potential to shed new light on the developing human brain.
Authors: Thomas A. Lasko, Eric V. Strobl, William W. Stead
The rising popularity of artificial intelligence in healthcare is highlighting the problem that a computational model achieving super-human clinical performance at its training sites may perform substantially worse at new sites. In this perspective, we present common sources for this failure to transport, which we divide into sources under the control of the experimenter and sources inherent to the clinical data-generating process. Of the inherent sources we look a little deeper into site-specific clinical practices that can affect the data distribution, and propose a potential solution intended to isolate the imprint of those practices on the data from the patterns of disease cause and effect that are the usual target of clinical models.
Authors: Md Azim Khan
Online conversations can be toxic and subject to threats, abuse, or harassment. To identify toxic text comments, several deep learning and machine learning models have been proposed over the years. However, recent studies demonstrate that, because of imbalances in the training data, some models are more likely to show unintended biases, including gender bias and identity bias. In this research, our aim is to detect toxic comments and reduce the unintended bias concerning identity features such as race, gender, sex, and religion by fine-tuning an attention-based model called BERT (Bidirectional Encoder Representation from Transformers). We apply a weighted loss to address the issue of unbalanced data and compare the performance of a fine-tuned BERT model with a traditional Logistic Regression model in terms of classification and bias minimization. The Logistic Regression model with the TF-IDF vectorizer achieves 57.1% accuracy, and the fine-tuned BERT model's accuracy is 89%. Code is available at https://github.com/zim10/Determine_Toxic_comment_and_identity_bias.git
Authors: Michaela Hardt, William Orchard, Patrick Blöbaum, Shiva Kasiviswanathan, Elke Kirschbaum
Identifying root causes for unexpected or undesirable behavior in complex systems is a prevalent challenge. This issue becomes especially crucial in modern cloud applications that employ numerous microservices. Although the machine learning and systems research communities have proposed various techniques to tackle this problem, there is currently a lack of standardized datasets for quantitative benchmarking. Consequently, research groups are compelled to create their own datasets for experimentation. This paper introduces a dataset specifically designed for evaluating root cause analyses in microservice-based applications. The dataset encompasses latency, requests, and availability metrics emitted in 5-minute intervals from a distributed application. In addition to normal operation metrics, the dataset includes 68 injected performance issues, which increase latency and reduce availability throughout the system. We showcase how this dataset can be used to evaluate the accuracy of a variety of methods spanning different causal and non-causal characterisations of the root cause analysis problem. We hope the new dataset, available at https://github.com/amazon-science/petshop-root-cause-analysis/ enables further development of techniques in this important area.
Authors: Muhammad Ali Siddiqi, David Vrijenhoek, Lennart P. L. Landsmeer, Job van der Kleij, Anteneh Gebregiorgis, Vincenzo Romano, Rajendra Bishnoi, Said Hamdioui, Christos Strydis
Electrophysiological recordings of neural activity in a mouse's brain are very popular among neuroscientists for understanding brain function. One particular area of interest is acquiring recordings from the Purkinje cells in the cerebellum in order to understand brain injuries and the loss of motor functions. However, current setups for such experiments do not allow the mouse to move freely and, thus, do not capture its natural behaviour, since they have a wired connection between the animal's head stage and an acquisition device. In this work, we propose a lightweight neuronal-spike detection and classification architecture that leverages the unique characteristics of the Purkinje cells to discard unneeded information from the sparse neural data in real time. This allows the (condensed) data to be easily stored on a removable storage device on the head stage, alleviating the need for wires. Our proposed implementation shows a >95% overall classification accuracy while still resulting in a small-form-factor design, which allows for the free movement of mice during experiments. Moreover, the power-efficient nature of the design and the usage of STT-RAM (Spin Transfer Torque Magnetic Random Access Memory) as the removable storage allow the head stage to easily operate on a tiny battery for up to approximately 4 days.
Authors: Hubert Baniecki, Maciej Chrabaszcz, Andreas Holzinger, Bastian Pfeifer, Anna Saranti, Przemyslaw Biecek
Evaluating explanations of image classifiers regarding ground truth, e.g. segmentation masks defined by human perception, primarily evaluates the quality of the models under consideration rather than the explanation methods themselves. Driven by this observation, we propose a framework for $\textit{jointly}$ evaluating the robustness of safety-critical systems that $\textit{combine}$ a deep neural network with an explanation method. These are increasingly used in real-world applications like medical image analysis or robotics. We introduce a fine-tuning procedure to (mis)align model–explanation pipelines with ground truth and use it to quantify the potential discrepancy between worst and best-case scenarios of human alignment. Experiments across various model architectures and post-hoc local interpretation methods provide insights into the robustness of vision transformers and the overall vulnerability of such AI systems to potential adversarial attacks.
Authors: Renzhi Wu, Saayan Mitra, Xiang Chen, Anup Rao
Vanilla federated learning does not support learning in an online environment, learning a personalized model on each client, or learning in a decentralized setting. Existing methods extend federated learning in each of these three aspects. However, some important applications on enterprise edge servers (e.g. online item recommendation at global scale) involve all three aspects at the same time. Therefore, we propose a new learning setting, Decentralized Personalized Online Federated Learning, that considers all three aspects at the same time.
In this new setting for learning, the first technical challenge is how to aggregate the shared model parameters from neighboring clients to obtain a personalized local model with good performance on each client. We propose to directly learn an aggregation by optimizing the performance of the local model with respect to the aggregation weights. This not only improves personalization of each local model but also helps the local model adapt to potential data shift by intelligently incorporating the right amount of information from its neighbors. The second challenge is how to select the neighbors for each client. We propose a peer selection method based on the learned aggregation weights, enabling each client to select the most helpful neighbors and reduce communication cost at the same time. We verify the effectiveness and robustness of our proposed method on three real-world item recommendation datasets and one air quality prediction dataset.
Authors: Matt Gorbett, Hossein Shirazi, Indrakshi Ray
Learning from the collective knowledge of data dispersed across private sources can provide neural networks with enhanced generalization capabilities. Federated learning, a method for collaboratively training a machine learning model across remote clients, achieves this by combining client models via the orchestration of a central server. However, current approaches face two critical limitations: i) they struggle to converge when client domains are sufficiently different, and ii) current aggregation techniques produce an identical global model for each client. In this work, we address these issues by reformulating the typical federated learning setup: rather than learning a single global model, we learn N models each optimized for a common objective. To achieve this, we apply a weighted distance minimization to model parameters shared in a peer-to-peer topology. The resulting framework, Iterative Parameter Alignment, applies naturally to the cross-silo setting, and has the following properties: (i) a unique solution for each participant, with the option to globally converge each model in the federation, and (ii) an optional early-stopping mechanism to elicit fairness among peers in collaborative learning settings. These characteristics jointly provide a flexible new framework for iteratively learning from peer models trained on disparate datasets. We find that the technique achieves competitive results on a variety of data partitions compared to state-of-the-art approaches. Further, we show that the method is robust to divergent domains (i.e. disjoint classes across peers) where existing approaches struggle.
Authors: Zhen Qin, Songlin Yang, Yiran Zhong
Transformers have surpassed RNNs in popularity due to their superior abilities in parallel training and long-term dependency modeling. Recently, there has been a renewed interest in using linear RNNs for efficient sequence modeling. These linear RNNs often employ gating mechanisms in the output of the linear recurrence layer while ignoring the significance of using forget gates within the recurrence. In this paper, we propose a gated linear RNN model dubbed Hierarchically Gated Recurrent Neural Network (HGRN), which includes forget gates that are lower bounded by a learnable value. The lower bound increases monotonically when moving up layers. This allows the upper layers to model long-term dependencies and the lower layers to model more local, short-term dependencies. Experiments on language modeling, image classification, and long-range arena benchmarks showcase the efficiency and effectiveness of our proposed model. The source code is available at https://github.com/OpenNLPLab/HGRN.
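The idea of a forget gate with a lower bound, as described above, can be sketched with a simple element-wise linear recurrence. This is an illustrative reading of the abstract, not the HGRN implementation from the linked repository. The parameterisation `f = lower_bound + (1 - lower_bound) * sigmoid(logits)`, the update `h_t = f_t * h_{t-1} + (1 - f_t) * x_t`, and the function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_linear_recurrence(inputs, gate_logits, lower_bound):
    """Element-wise linear recurrence with a lower-bounded forget gate.

    inputs, gate_logits: arrays of shape (T, D).
    lower_bound: scalar in [0, 1); in a trained model this would be
    learnable and larger in upper layers (assumed fixed here).
    Returns the hidden states (T, D) and the forget gates (T, D).
    """
    x = np.asarray(inputs, dtype=float)
    # The gate can never fall below lower_bound, so memory decays slowly
    # when the bound is close to 1.
    f = lower_bound + (1.0 - lower_bound) * sigmoid(np.asarray(gate_logits, dtype=float))
    h = np.zeros(x.shape[1])
    states = []
    for t in range(x.shape[0]):
        h = f[t] * h + (1.0 - f[t]) * x[t]
        states.append(h.copy())
    return np.stack(states), f
```

A bound near 1 forces a layer to retain its state over many steps, which matches the abstract's description of upper layers modeling long-term dependencies. A bound near 0 leaves the gate free to forget quickly.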
Authors: Shikai Fang, Xin Yu, Zheng Wang, Shibo Li, Mike Kirby, Shandian Zhe
Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there are finitely many objects in each aspect or mode, corresponding to discrete indexes of data entries. However, much real-world data is not naturally posed in this setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose Functional Bayesian Tucker Decomposition (FunBaT). We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions, and then convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE) to reduce computational cost. An efficient inference algorithm is further developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications.
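As background for the abstract above, the standard discrete Tucker model reconstructs a tensor from a core tensor and one factor matrix per mode. The sketch below shows this for three modes. It covers only the classical discrete-index case, not FunBaT's functional extension, and the function name `tucker_reconstruct` is an assumption.

```python
import numpy as np

def tucker_reconstruct(core, factors):
    """Rebuild a 3-mode tensor from a Tucker core and factor matrices.

    core:    array of shape (R1, R2, R3).
    factors: (U1, U2, U3) with shapes (I, R1), (J, R2), (K, R3).
    Entry (i, j, k) equals sum_{a,b,c} core[a,b,c] * U1[i,a] * U2[j,b] * U3[k,c].
    """
    u1, u2, u3 = (np.asarray(u, dtype=float) for u in factors)
    return np.einsum("abc,ia,jb,kc->ijk", np.asarray(core, dtype=float), u1, u2, u3)
```

Each row of a factor matrix represents one discrete object in that mode. The abstract's proposal replaces these rows with values of latent functions evaluated at continuous indexes.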
Authors: Julian Lemmel, Radu Grosu
Recent advances in reinforcement learning for partially-observable Markov decision processes (POMDPs) rely on the biologically implausible backpropagation through time algorithm (BPTT) to perform gradient-descent optimisation. In this paper we propose a novel reinforcement learning algorithm that makes use of random feedback local online learning (RFLO), a biologically plausible approximation of real-time recurrent learning (RTRL), to compute the gradients of the parameters of a recurrent neural network in an online manner. By combining it with TD($\lambda$), a variant of temporal-difference reinforcement learning with eligibility traces, we create a biologically plausible, recurrent actor-critic algorithm, capable of solving discrete and continuous control tasks in POMDPs. We compare BPTT, RTRL and RFLO as well as different network architectures, and find that RFLO can perform just as well as RTRL while exceeding even BPTT in terms of complexity. The proposed method, called real-time recurrent reinforcement learning (RTRRL), serves as a model of learning in biological neural networks mimicking reward pathways in the mammalian brain.
Authors: Zijian Li, Zunhong Xu, Ruichu Cai, Zhenhui Yang, Yuguang Yan, Zhifeng Hao, Guangyi Chen, Kun Zhang
Although graph neural networks have achieved great success in the task of molecular property prediction in recent years, their generalization ability under out-of-distribution (OOD) settings is still under-explored. Different from existing methods that learn discriminative representations for prediction, we propose a generative model with semantic-components identifiability, named SCI. We demonstrate that the latent variables in this generative model can be explicitly identified into semantic-relevant (SR) and semantic-irrelevant (SI) components, which contributes to better OOD generalization by involving minimal change properties of causal mechanisms. Specifically, we first formulate the data generation process from the atom level to the molecular level, where the latent space is split into SI substructures, SR substructures, and SR atom variables. Subsequently, to reduce misidentification, we restrict the minimal changes of the SR atom variables and add a semantic latent substructure regularization to mitigate the variance of the SR substructure under augmented domain changes. Under mild assumptions, we prove the block-wise identifiability of the SR substructure and the component-wise identifiability of SR atom variables. Experimental studies achieve state-of-the-art performance and show general improvement on 21 datasets in 3 mainstream benchmarks. Moreover, the visualization results of the proposed SCI method provide insightful case studies and explanations for the prediction results. The code is available at: https://github.com/DMIRLAB-Group/SCI.
Authors: Meiyi Li, Javad Mohammadi
The evolution towards a more distributed and interconnected grid necessitates large-scale decision-making within strict temporal constraints. Machine learning (ML) paradigms have demonstrated significant potential in improving the efficacy of optimization processes. However, the feasibility of solutions derived from ML models continues to pose challenges. It is imperative that ML models produce solutions that are attainable and realistic within the given system constraints of power systems. To address the feasibility issue and expedite the solution search process, we propose LOOP-LC 2.0 (Learning to Optimize the Optimization Process with Linear Constraints version 2.0) as a learning-based approach for solving the power dispatch problem. A notable advantage of the LOOP-LC 2.0 framework is its ability to ensure near-optimality and strict feasibility of solutions without depending on computationally intensive post-processing procedures, thus eliminating the need for iterative processes. At the heart of the LOOP-LC 2.0 model lies the newly proposed generalized gauge map method, capable of mapping any infeasible solution to a feasible point within the linearly-constrained domain. The proposed generalized gauge map method improves the traditional gauge map by exhibiting reduced sensitivity to input variances while increasing search speeds significantly. Utilizing the IEEE-200 test case as a benchmark, we demonstrate the effectiveness of the LOOP-LC 2.0 methodology, confirming its superior performance in terms of training speed, computational time, optimality, and solution feasibility compared to existing methodologies.
Authors: Yuang Geng, Souradeep Dutta, Ivan Ruchkin
Autonomous systems are increasingly implemented using end-to-end trained controllers. Such controllers make decisions that are executed on the real system with images as one of the primary sensing modalities. Deep neural networks form a fundamental building block of such controllers. Unfortunately, the existing neural-network verification tools do not scale to inputs with thousands of dimensions, especially when the individual inputs (such as pixels) are devoid of clear physical meaning. This paper takes a step towards connecting exhaustive closed-loop verification with high-dimensional controllers. Our key insight is that the behavior of a high-dimensional controller can be approximated with several low-dimensional controllers in different regions of the state space. To balance approximation and verifiability, we leverage the latest verification-aware knowledge distillation. Then, if low-dimensional reachability results are inflated with statistical approximation errors, they yield a high-confidence reachability guarantee for the high-dimensional controller. We investigate two inflation techniques -- based on trajectories and actions -- both of which show convincing performance in two OpenAI gym benchmarks.
Authors: Giulia Di Teodoro, Martin Pirkl, Francesca Incardona, Ilaria Vicenti, Anders Sönnerborg, Rolf Kaiser, Laura Palagi, Maurizio Zazzi, Thomas Lengauer
Motivation: In predicting HIV therapy outcomes, a critical clinical question is whether using historical information can enhance predictive capabilities compared with current or latest available data analysis. This study analyses whether historical knowledge, which includes viral mutations detected in all genotypic tests before therapy, their temporal occurrence, and concomitant viral load measurements, can bring improvements. We introduce a method to weigh mutations, considering the previously enumerated factors and the reference mutation-drug Stanford resistance tables. We compare a model encompassing history (H) with one not using it (NH). Results: The H-model demonstrates superior discriminative ability, with a higher ROC-AUC score (76.34%) than the NH-model (74.98%). Significant Wilcoxon test results confirm that incorporating historical information consistently improves predictive accuracy for treatment outcomes. The better performance of the H-model might be attributed to its consideration of latent HIV reservoirs, probably obtained when leveraging historical information. The findings emphasize the importance of temporal dynamics in mutations, offering insights into HIV infection complexities. However, our result also shows that prediction accuracy remains relatively high even when no historical information is available. Supplementary information: Supplementary material is available.
Authors: Dylan Green, Stephen Bailey
Non-negative matrix factorization (NMF) is a dimensionality reduction technique that has shown promise for analyzing noisy data, especially astronomical data. For these datasets, the observed data may contain negative values due to noise even when the true underlying physical signal is strictly positive. Prior NMF work has not treated negative data in a statistically consistent manner, which becomes problematic for low signal-to-noise data with many negative values. In this paper we present two algorithms, Shift-NMF and Nearly-NMF, that can handle both the noisiness of the input data and also any introduced negativity. Both of these algorithms use the negative data space without clipping, and correctly recover non-negative signals without any introduced positive offset that occurs when clipping negative data. We demonstrate this numerically on both simple and more realistic examples, and prove that both algorithms have monotonically decreasing update rules.
Authors: Christopher J. Kymn, Denis Kleyko, E. Paxon Frady, Connor Bybee, Pentti Kanerva, Friedrich T. Sommer, Bruno A. Olshausen
We introduce Residue Hyperdimensional Computing, a computing framework that unifies residue number systems with an algebra defined over random, high-dimensional vectors. We show how residue numbers can be represented as high-dimensional vectors in a manner that allows algebraic operations to be performed with component-wise, parallelizable operations on the vector elements. The resulting framework, when combined with an efficient method for factorizing high-dimensional vectors, can represent and operate on numerical values over a large dynamic range using vastly fewer resources than previous methods, and it exhibits impressive robustness to noise. We demonstrate the potential for this framework to solve computationally difficult problems in visual perception and combinatorial optimization, showing improvement over baseline methods. More broadly, the framework provides a possible account for the computational operations of grid cells in the brain, and it suggests new machine learning architectures for representing and manipulating numerical data.
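To make the idea of performing residue arithmetic with component-wise vector operations concrete, here is a small sketch. It is an illustration under stated assumptions, not the authors' implementation: each residue modulo `m` is encoded as a vector of complex phasors whose phases are random nonzero multiples of `2*pi/m`, so that the element-wise product of two encodings equals the encoding of the sum. The moduli `(3, 5, 7)`, the dimension, and all function names are hypothetical choices.

```python
import cmath
import random

MODULI = (3, 5, 7)  # pairwise coprime, so integers 0..104 are uniquely represented


def make_base(modulus, dim, seed):
    # Random nonzero phase multipliers, one per vector component (an assumption).
    rng = random.Random(seed)
    return [rng.randrange(1, modulus) for _ in range(dim)]


def encode(x, base, modulus):
    # Component j holds exp(2*pi*i * k_j * x / modulus); depends only on x mod modulus.
    return [cmath.exp(2j * cmath.pi * ((k * x) % modulus) / modulus) for k in base]


def bind(u, v):
    # Element-wise product: phases add, so residues add.
    return [a * b for a, b in zip(u, v)]


def similarity(u, v):
    return sum((a * b.conjugate()).real for a, b in zip(u, v)) / len(u)


def encode_number(x, bases):
    return [encode(x, base, m) for base, m in zip(bases, MODULI)]


def add(u, v):
    return [bind(a, b) for a, b in zip(u, v)]


def decode_number(vecs, bases):
    # Recover each residue by nearest-neighbor search, then invert by brute force.
    residues = []
    for vec, base, m in zip(vecs, bases, MODULI):
        best = max(range(m), key=lambda r: similarity(vec, encode(r, base, m)))
        residues.append(best)
    total = 1
    for m in MODULI:
        total *= m
    return next(x for x in range(total)
                if all(x % m == r for m, r in zip(MODULI, residues)))
```

With `bases = [make_base(m, 64, seed=i) for i, m in enumerate(MODULI)]`, decoding `add(encode_number(40, bases), encode_number(58, bases))` recovers 98, and sums wrap around modulo 105. The brute-force inversion stands in for the efficient factorization method the abstract refers to.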
Authors: Tal Schuster, Adam D. Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William W. Cohen, Donald Metzler
Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge.
In this work, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer, while mixing factual quoted spans -- copied verbatim from given input sources -- and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate.
To study this task, we create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of QuoteSum for developing and studying such consolidation capabilities.
Authors: Quentin Bouniot
In this thesis, we develop theoretical, algorithmic and experimental contributions for Machine Learning with limited labels, and more specifically for the tasks of Image Classification and Object Detection in Computer Vision. In a first contribution, we are interested in bridging the gap between theory and practice for popular Meta-Learning algorithms used in Few-Shot Classification. We make connections to Multi-Task Representation Learning, which benefits from solid theoretical foundations, to verify the best conditions for a more efficient meta-learning. Then, to leverage unlabeled data when training object detectors based on the Transformer architecture, we propose both an unsupervised pretraining and a semi-supervised learning method in two other separate contributions. For pretraining, we improve Contrastive Learning for object detectors by introducing the localization information. Finally, our semi-supervised method is the first tailored to transformer-based detectors.
Authors: Yash Jain, Harkirat Behl, Zsolt Kira, Vibhav Vineet
Construction of a universal detector poses a crucial question: How can we most effectively train a model on a large mixture of datasets? The answer lies in learning dataset-specific features and ensembling their knowledge, while doing all of this in a single model. Previous methods achieve this by having separate detection heads on a common backbone, but that results in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose Dataset-Aware Mixture-of-Experts (DAMEX), where we train the experts to become an 'expert' of a dataset by learning to route each dataset's tokens to its mapped expert. Experiments on the Universal Object-Detection Benchmark show that we outperform the existing state-of-the-art by an average of +10.2 AP score and improve over our non-MoE baseline by an average of +2.0 AP score. We also observe consistent gains while mixing datasets with (1) limited availability, (2) disparate domains and (3) divergent label sets. Further, we qualitatively show that DAMEX is robust against expert representation collapse.
Authors: Kieran A. Murphy, Dani S. Bassett
Deterministic chaos permits a precise notion of a "perfect measurement" as one that, when obtained repeatedly, captures all of the information created by the system's evolution with minimal redundancy. Finding an optimal measurement is challenging, and has generally required intimate knowledge of the dynamics in the few cases where it has been done. We establish an equivalence between a perfect measurement and a variant of the information bottleneck. As a consequence, we can employ machine learning to optimize measurement processes that efficiently extract information from trajectory data. We obtain approximately optimal measurements for multiple chaotic maps and lay the necessary groundwork for efficient information extraction from general time series.
Authors: Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, David Bau
We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model's output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally we present a "Future Lens" visualization that uses these methods to create a new view of transformer states.
Authors: Timm Hess, Tinne Tuytelaars, Gido M. van de Ven
Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as an alternative to improving the optimization objective, while we argue it should be complementary. To evaluate the merits of our proposition, we plan to combine replay-approximated joint objectives with gradient projection-based optimization routines to test whether the addition of the latter provides benefits in terms of (1) alleviating the stability gap, (2) increasing the learning efficiency and (3) improving the final learning outcome.
Authors: Rocktim Jyoti Das, Liqun Ma, Zhiqiang Shen
Large Language Models (LLMs) with a billion or more parameters are prime targets for network pruning, which aims to reduce a portion of the network weights without compromising performance. Prior approaches such as Weights Magnitude, SparseGPT, and Wanda either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained large language models. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the importance pruning score, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguingly, after incorporating gradients, the unstructured pruning method tends to reveal some structural patterns post-pruning, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates, which keeps it as simple as its counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various language benchmarks and perplexity show that GBLM-Pruner surpasses magnitude pruning, Wanda (weights+activations) and SparseGPT (weights+activations+weight update) by significant margins. Our code and models are available at https://github.com/RocktimJyotiDas/GBLM-Pruner.
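As a rough illustration of a first-order, gradient-informed pruning score, the sketch below multiplies each weight's magnitude by the l2 norm of its gradient across calibration samples and zeroes the lowest-scoring entries in every row. The abstract does not state the exact score or normalization, so both the formula and the per-row pruning granularity are assumptions, and the function names are hypothetical rather than taken from the GBLM-Pruner repository.

```python
import math


def importance_scores(weights, sample_grads):
    # Assumed score: |w| times the l2 norm of the gradient over calibration samples.
    scores = []
    for r, w_row in enumerate(weights):
        row = []
        for c, w in enumerate(w_row):
            g_norm = math.sqrt(sum(g[r][c] ** 2 for g in sample_grads))
            row.append(abs(w) * g_norm)
        scores.append(row)
    return scores


def prune_rows(weights, scores, sparsity):
    # Zero the lowest-scoring fraction of each row; no retraining or weight update.
    pruned = []
    for w_row, s_row in zip(weights, scores):
        k = int(len(w_row) * sparsity)
        drop = set(sorted(range(len(w_row)), key=lambda c: s_row[c])[:k])
        pruned.append([0.0 if c in drop else w for c, w in enumerate(w_row)])
    return pruned
```

For a toy row `[0.5, -2.0, 0.1, 1.0]` with per-sample gradients `[0.1, 0.0, 3.0, 0.2]` and `[0.1, 0.0, 4.0, 0.2]`, the largest-magnitude weight (-2.0) is pruned at 50% sparsity because its gradient is zero, whereas the small weight 0.1 is kept. Pure magnitude pruning would make the opposite choice.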
Authors: Chuan Guo, Tom Goldstein, Awni Hannun, Laurens van der Maaten
Good data stewardship requires removal of data at the request of the data's owner. This raises the question if and how a trained machine-learning model, which implicitly stores information about its training data, should be affected by such a removal request. Is it possible to "remove" data from a machine-learning model? We study this problem by defining certified removal: a very strong theoretical guarantee that a model from which data is removed cannot be distinguished from a model that never observed the data to begin with. We develop a certified-removal mechanism for linear classifiers and empirically study learning settings in which this mechanism is practical.
Authors: Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu
Position encoding has recently been shown to be effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently outperforms its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: https://huggingface.co/docs/transformers/model_doc/roformer.
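The claim that a rotation by absolute position yields a relative-position dependency in the attention score can be checked with a short sketch. Each consecutive pair of coordinates is treated as a complex number and rotated by an angle proportional to the position. The frequency schedule `base ** (-2i/d)` with `base = 10000` is the commonly used choice and is stated here as an assumption, and the function names are illustrative.

```python
import cmath


def rope(vec, position, base=10000.0):
    # Rotate each (even, odd) coordinate pair by position * theta_i.
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * position * theta)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

For any query `q` and key `k`, `dot(rope(q, 5), rope(k, 3))` matches `dot(rope(q, 12), rope(k, 10))` up to floating-point error, because both pairs of positions differ by 2. The rotation also leaves vector norms unchanged.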
Authors: Michael Woodward, Yifeng Tian, Criston Hyett, Chris Fryer, Daniel Livescu, Mikhail Stepanov, Michael Chertkov
Building efficient, accurate and generalizable reduced order models of developed turbulence remains a major challenge. This manuscript approaches this problem by developing a hierarchy of parameterized reduced Lagrangian models for turbulent flows, and investigates the effects of enforcing physical structure through Smoothed Particle Hydrodynamics (SPH) versus relying on neural networks (NNs) as universal function approximators. Starting from NN parameterizations of a Lagrangian acceleration operator, this hierarchy of models gradually incorporates a weakly compressible and parameterized SPH framework, which enforces physical symmetries, such as Galilean, rotational and translational invariances. Within this hierarchy, two new parameterized smoothing kernels are developed in order to increase the flexibility of the learnable SPH simulators. For each model we experiment with different loss functions which are minimized using gradient based optimization, where efficient computations of gradients are obtained by using Automatic Differentiation (AD) and Sensitivity Analysis (SA). Each model within the hierarchy is trained on two data sets associated with weakly compressible Homogeneous Isotropic Turbulence (HIT): (1) a validation set using weakly compressible SPH; and (2) a high fidelity set from Direct Numerical Simulations (DNS). Numerical evidence shows that encoding more SPH structure improves generalizability to different turbulent Mach numbers and time shifts, and that including the novel parameterized smoothing kernels improves the accuracy of SPH at the resolved scales.
Authors: Carlos Fernández-Loría, Jorge Loría
This paper introduces causal scoring as a novel approach to frame causal estimation in the context of decision making. Causal scoring entails the estimation of scores that support decision making by providing insights into causal effects. We present three valuable causal interpretations of these scores: effect estimation (EE), effect ordering (EO), and effect classification (EC). In the EE interpretation, the causal score represents the effect itself. The EO interpretation implies that the score can serve as a proxy for the magnitude of the effect, enabling the sorting of individuals based on their causal effects. The EC interpretation enables the classification of individuals into high- and low-effect categories using a predefined threshold. We demonstrate the value of these alternative causal interpretations (EO and EC) through two key results. First, we show that aligning the statistical modeling with the desired causal interpretation improves the accuracy of causal estimation. Second, we establish that more flexible causal interpretations are plausible in a wider range of data-generating processes and propose conditions to assess their validity. We showcase the practical utility of the causal scoring framework through examples in diverse fields such as advertising, healthcare, and education, illustrating how it facilitates reasoning about flexible causal interpretations of statistical estimates in various contexts. The examples encompass confounded estimates, effect estimates on surrogate outcomes, and even predictions about non-causal quantities as potential causal scores.
Authors: William Chen, Siyi Hu, Rajat Talak, Luca Carlone
Abstract semantic 3D scene understanding is a problem of critical importance in robotics. As robots still lack the common-sense knowledge about household objects and locations of an average human, we investigate the use of pre-trained language models to impart common sense for scene understanding. We introduce and compare a wide range of scene classification paradigms that leverage language only (zero-shot, embedding-based, and structured-language) or vision and language (zero-shot and fine-tuned). We find that the best approaches in both categories yield $\sim 70\%$ room classification accuracy, exceeding the performance of pure-vision and graph classifiers. We also find such methods demonstrate notable generalization and transfer capabilities stemming from their use of language.
Authors: André Santos, Nuno Ferreira Duarte, Atabak Dehban, José Santos-Victor
We tackle the challenge of robotic bin packing with irregular objects, such as groceries. Given the diverse physical attributes of these objects and the complex constraints governing their placement and manipulation, employing preprogrammed strategies becomes unfeasible. Our approach is to learn directly from expert demonstrations in order to extract implicit task knowledge and strategies to ensure safe object positioning, efficient use of space, and the generation of human-like behaviors that enhance human-robot trust.
We rely on human demonstrations to learn a Markov chain for predicting the object packing sequence for a given set of items and then compare it with human performance. Our experimental results show that the model outperforms human performance by generating sequence predictions that humans classify as human-like more frequently than human-generated sequences.
The human demonstrations were collected using our proposed VR platform, BoxED, which is a box packaging environment for simulating real-world objects and scenarios for fast and streamlined data collection with the purpose of teaching robots. We collected data from 43 participants packing a total of 263 boxes with supermarket-like objects, yielding 4644 object manipulations. Our VR platform can be easily adapted to new scenarios and objects, and is publicly available, alongside our dataset, at https://github.com/andrejfsantos4/BoxED.
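A Markov-chain predictor of packing order, as mentioned above, can be sketched in a few lines. This is a hypothetical first-order version: it counts item-to-item transitions in demonstration sequences and greedily picks the most frequent successor among the items not yet packed. The abstract does not specify the model order, state definition, or decoding rule, so all of these are assumptions, as are the function and item names.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    # Count how often item `nxt` was packed right after item `prev`.
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_sequence(counts, first_item, items):
    # Greedy decoding: repeatedly pick the most frequent unpacked successor.
    remaining = list(items)
    remaining.remove(first_item)
    order = [first_item]
    while remaining:
        prev = order[-1]
        nxt = max(remaining, key=lambda item: counts[prev][item])
        remaining.remove(nxt)
        order.append(nxt)
    return order
```

Given demonstrations in which the box usually follows the bottle, `predict_sequence` started from the bottle places the box next and the bag last. Ties and unseen transitions fall back to the order in which the items are listed.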
Authors: Xing Han, Tongzheng Ren, Tan Minh Nguyen, Khai Nguyen, Joydeep Ghosh, Nhat Ho
Recent advances in Transformer architectures have empowered their empirical success in a variety of tasks across different domains. However, existing works mainly focus on predictive accuracy and computational cost, without considering other practical issues, such as robustness to contaminated samples. Recent work by Nguyen et al. (2022) has shown that the self-attention mechanism, which is the center of the Transformer architecture, can be viewed as a non-parametric estimator based on kernel density estimation (KDE). This motivates us to leverage a set of robust kernel density estimation methods for alleviating the issue of data contamination. Specifically, we introduce a series of self-attention mechanisms that can be incorporated into different Transformer architectures and discuss the special properties of each method. We then perform extensive empirical studies on language modeling and image classification tasks. Our methods demonstrate robust performance in multiple scenarios while maintaining competitive results on clean datasets.
Authors: Xi Wang, Tomas Geffner, Justin Domke
Black-box variational inference performance is sometimes hindered by the use of gradient estimators with high variance. This variance comes from two sources of randomness: Data subsampling and Monte Carlo sampling. While existing control variates only address Monte Carlo noise, and incremental gradient methods typically only address data subsampling, we propose a new "joint" control variate that jointly reduces variance from both sources of noise. This significantly reduces gradient variance, leading to faster optimization in several applications.
Authors: Tianying Ji, Yu Luo, Fuchun Sun, Mingxuan Jing, Fengxiang He, Wenbing Huang
Designing and analyzing model-based RL (MBRL) algorithms with guaranteed monotonic improvement has been challenging, mainly due to the interdependence between policy optimization and model learning. Existing discrepancy bounds generally ignore the impacts of model shifts, and their corresponding algorithms are prone to performance degradation under drastic model updates. In this work, we first propose a novel and general theoretical scheme for a non-decreasing performance guarantee of MBRL. Our follow-up derived bounds reveal the relationship between model shifts and performance improvement. These discoveries encourage us to formulate a constrained lower-bound optimization problem to permit the monotonicity of MBRL. A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns. Motivated by these analyses, we design a simple but effective algorithm CMLO (Constrained Model-shift Lower-bound Optimization), by introducing an event-triggered mechanism that flexibly determines when to update the model. Experiments show that CMLO surpasses other state-of-the-art methods and produces a boost when various policy optimization methods are employed.
Authors: Tiago P. Peixoto, Alec Kirkley
The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a network according to an objective motivated by a particular application, making it challenging to compare these methods on the same scale. Here we present a solution to this problem that associates any community detection objective, inferential or descriptive, with its corresponding implicit network generative model. This allows us to compute the description length of a network and its partition under arbitrary objectives, providing a principled measure to compare the performance of different algorithms without the need for "ground truth" labels. Our approach also gives access to instances of the community detection problem that are optimal to any given algorithm, and in this way reveals intrinsic biases in popular descriptive methods, explaining their tendency to overfit. Using our framework, we compare a number of community detection methods on artificial networks, and on a corpus of over 500 structurally diverse empirical networks. We find that more expressive community detection methods exhibit consistently superior compression performance on structured data instances, without having degraded performance on a minority of situations where more specialized algorithms perform optimally. Our results undermine the implications of the "no free lunch" theorem for community detection, both conceptually and in practice, since it is confined to unstructured data instances, unlike relevant community detection problems which are structured by requirement.
Authors: Christo Kurisummoottil Thomas, Walid Saad
Semantic communication (SC) aims to communicate reliably with minimal data transfer while simultaneously providing seamless connectivity to heterogeneous services and users. In this paper, a novel emergent SC (ESC) system framework is proposed and is composed of a signaling game for emergent language design and a neuro-symbolic (NeSy) artificial intelligence (AI) approach for causal reasoning. In order to design the language, the signaling game is solved using an alternating maximization between the communicating nodes' utilities. The emergent language helps create a context-aware transmit vocabulary (minimal semantic representation) and aids the reasoning process (enabling generalization to unseen scenarios) by splitting complex messages into simpler reasoning tasks for the receiver. The causal description at the transmitter is then modeled (a neural component) as a posterior distribution of the relevant attributes present in the data. Using the reconstructed causal state, the receiver evaluates a set of logical formulas (symbolic part) to execute its task. The nodes' NeSy reasoning components are implemented by the recently proposed AI tool called Generative Flow Networks, and they are optimized for higher semantic reliability. The ESC system is designed to enhance the novel metrics of semantic information, reliability, distortion and similarity that are designed using rigorous algebraic properties from category theory, thereby generalizing the metrics beyond Shannon's notion of uncertainty. Simulation results validate the ability of ESC to communicate efficiently (with reduced bits) and achieve better semantic reliability than conventional wireless and state-of-the-art systems that do not exploit causal reasoning capabilities.
Authors: Chen-Yu Liu, Hsin-Yu Wang, Pei-Yen Liao, Ching-Jui Lai, Min-Hsiu Hsieh
Factorization Machine (FM) is the most commonly used model to build a recommendation system since it can incorporate side information to improve performance. However, producing item suggestions for a given user with a trained FM is time-consuming. It requires a run-time of $O((N_m \log N_m)^2)$, where $N_m$ is the number of items in the dataset. To address this problem, we propose a quadratic unconstrained binary optimization (QUBO) scheme to combine with FM and apply quantum annealing (QA) computation. Compared to classical methods, this hybrid algorithm provides a faster than quadratic speedup in finding good user suggestions. We then demonstrate the aforementioned computational advantage on current NISQ hardware by experimenting with a real example on a D-Wave annealer.
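As a rough illustration of why an FM pairs naturally with a QUBO solver: for binary inputs, a second-order FM score is itself a quadratic form in the input bits. The sketch below is not the authors' scheme; it only checks that identity with NumPy. The names `fm_predict` and `fm_to_qubo` and the toy parameters are assumptions made for this example.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score via the standard O(k n) identity.

    V has shape (n, k): one k-dimensional factor vector per feature.
    """
    x = np.asarray(x, dtype=float)
    xv = V.T @ x                                   # shape (k,)
    pairwise = 0.5 * (xv @ xv - ((V ** 2).T @ (x ** 2)).sum())
    return w0 + w @ x + pairwise

def fm_to_qubo(w, V):
    """Upper-triangular Q such that x^T Q x + w0 equals the FM score
    whenever every x_i is 0 or 1 (so that x_i^2 == x_i)."""
    Q = np.triu(V @ V.T, k=1)                      # <v_i, v_j> for i < j
    Q[np.diag_indices_from(Q)] = w                 # linear terms on the diagonal
    return Q

# Toy parameters (hypothetical, not from any trained model).
rng = np.random.default_rng(0)
n, k = 5, 3
w0, w, V = 0.3, rng.normal(size=n), rng.normal(size=(n, k))
Q = fm_to_qubo(w, V)
x = np.array([1, 0, 1, 1, 0])
print(np.isclose(x @ Q @ x + w0, fm_predict(x, w0, w, V)))
```

An annealer would then search over binary `x` for an extremum of `x @ Q @ x`; constraints that a real recommendation task needs (for example, one-hot item encodings) are left out of this sketch.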
Authors: Zhichao Wang, Andrew Engel, Anand Sarwate, Ioana Dumitriu, Tony Chiang
We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the spectra of the weight matrices in this high-dimensional regime are invariant when trained by gradient descent for small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy-tailed behavior. Simple examples are provided to explain when heavy tails can lead to better generalization. We exhibit different spectral properties such as invariant bulk, spike, and heavy-tailed distribution from a two-layer neural network using different training strategies, and then correlate them to the feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.
Authors: Timothy O. Hodson, Thomas M. Over, Tyler J. Smith, Lucy M. Marshall
In machine learning or scientific computing, model performance is measured with an objective function. But why choose one objective over another? Information theory gives one answer: To maximize the information in the model, select the objective function that represents the error in the fewest bits. To evaluate different objectives, transform them into likelihood functions. As likelihoods, their relative magnitude represents how strongly we should prefer one objective versus another, and the log of that relation represents the difference in their bit-length, as well as the difference in their uncertainty. In other words, prefer whichever objective minimizes the uncertainty. Under the information-theoretic paradigm, the ultimate objective is to maximize information (and minimize uncertainty), as opposed to any specific utility. We argue that this paradigm is well-suited to models that have many uses and no definite utility, like the large Earth system models used to understand the effects of climate change.
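A minimal sketch of the comparison described above, assuming two familiar objectives: mean squared error, read as a zero-mean Gaussian likelihood, and mean absolute error, read as a zero-mean Laplace likelihood. Each scale parameter is set to its maximum-likelihood value, and the negative log-likelihood is reported in bits. The function names and toy error vectors are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_bits(errors):
    """Negative log2-likelihood of the errors under a zero-mean Gaussian
    whose variance is the maximum-likelihood estimate mean(e^2)."""
    e = np.asarray(errors, dtype=float)
    sigma2 = np.mean(e ** 2)
    return 0.5 * e.size * np.log2(2 * np.pi * np.e * sigma2)

def laplace_bits(errors):
    """Negative log2-likelihood of the errors under a zero-mean Laplace
    whose scale is the maximum-likelihood estimate mean(|e|)."""
    e = np.asarray(errors, dtype=float)
    b = np.mean(np.abs(e))
    return e.size * np.log2(2 * np.e * b)

heavy_tailed = [0.1, -0.1, 0.05, -0.05, 5.0]   # one large outlier
evenly_spread = [1.0, -1.0, 1.0, -1.0]

for name, e in [("heavy_tailed", heavy_tailed), ("evenly_spread", evenly_spread)]:
    g, l = gaussian_bits(e), laplace_bits(e)
    print(name, "Gaussian:", round(g, 2), "Laplace:", round(l, 2),
          "prefer", "MAE/Laplace" if l < g else "MSE/Gaussian")
```

Because these are densities rather than discrete probabilities, the bit counts are only meaningful as differences between objectives evaluated on the same errors, and they can be negative.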
Authors: Taoli Cheng, Aaron Courville
As a classical generative modeling approach, energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In line with these advancements, we build a multi-purpose energy-based probabilistic model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator for physics simulation, a generic anomalous signal detector free from spurious correlations, and an augmented event classifier for particle identification.
Authors: Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, Niao He
Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite the huge efforts directed at the design of efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies, which allows us to address the case of continuous state-action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity of this method for finding a global $\varepsilon$-optimal policy. Improving over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, this algorithm does not require the use of importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: they are single-loop, do not require large batches of trajectories, and sample at most two trajectories per iteration; $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of parameters.
Authors: Vladimir Kostic, Karim Lounici, Pietro Novelli, Massimiliano Pontil
Nonlinear dynamical systems can be handily described by the associated Koopman operator, whose action evolves every observable of the system forward in time. Learning the Koopman operator and its spectral decomposition from data is enabled by a number of algorithms. In this work we present for the first time non-asymptotic learning bounds for the Koopman eigenvalues and eigenfunctions. We focus on time-reversal-invariant stochastic dynamical systems, including the important example of Langevin dynamics. We analyze two popular estimators: Extended Dynamic Mode Decomposition (EDMD) and Reduced Rank Regression (RRR). Our results critically hinge on novel minimax estimation bounds for the operator norm error, which may be of independent interest. Our spectral learning bounds are driven by the simultaneous control of the operator norm error and a novel metric distortion functional of the estimated eigenfunctions. The bounds indicate that both EDMD and RRR have similar variance, but EDMD suffers from a larger bias which might be detrimental to its learning rate. Our results shed new light on the emergence of spurious eigenvalues, an issue which is well known empirically. Numerical experiments illustrate the implications of the bounds in practice.
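For readers unfamiliar with EDMD, the estimator reduces to a least-squares fit between dictionary features of consecutive snapshots. The sketch below is a deliberately simple, noise-free toy (a linear map with an identity dictionary) rather than the stochastic setting analyzed in the paper; the function name `edmd` and the example system are assumptions made for illustration.

```python
import numpy as np

def edmd(X, Y, dictionary):
    """Least-squares Koopman matrix K with dictionary(X) @ K ~= dictionary(Y).

    X and Y hold paired snapshots row by row: Y[t] is the state one step
    after X[t]. `dictionary` maps an (m, d) array to (m, p) features.
    """
    psi_x = dictionary(X)
    psi_y = dictionary(Y)
    K, *_ = np.linalg.lstsq(psi_x, psi_y, rcond=None)
    return K

# Toy system x_{t+1} = A x_t with known eigenvalues 0.9 and 0.5.
A = np.array([[0.9, 0.0],
              [0.0, 0.5]])
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = X @ A.T

K = edmd(X, Y, dictionary=lambda Z: Z)      # identity dictionary
print(np.sort(np.linalg.eigvals(K).real))   # close to [0.5, 0.9]
```

With noisy data or a richer dictionary, the eigenvalues of `K` are no longer exact, which is where bias, variance, and spurious eigenvalues of the kind discussed above enter.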
Authors: Shyam Sudhakaran, Miguel González-Duque, Claire Glanois, Matthias Freiberger, Elias Najarro, Sebastian Risi
Procedural Content Generation (PCG) is a technique to generate complex and diverse environments in an automated way. However, while generating content with PCG methods is often straightforward, generating meaningful content that reflects specific intentions and constraints remains challenging. Furthermore, many PCG algorithms lack the ability to generate content in an open-ended manner. Recently, Large Language Models (LLMs) have been shown to be incredibly effective in many diverse domains. These trained LLMs can be fine-tuned, re-using information and accelerating training for new tasks. Here, we introduce MarioGPT, a fine-tuned GPT2 model trained to generate tile-based game levels, in our case Super Mario Bros levels. MarioGPT can not only generate diverse levels, but can be text-prompted for controllable level generation, addressing one of the key challenges of current PCG techniques. As far as we know, MarioGPT is the first text-to-level model and, combined with novelty search, it enables the generation of diverse levels with varying play-style dynamics (i.e. player paths) and the open-ended discovery of an increasingly diverse range of content. Code is available at https://github.com/shyamsn97/mario-gpt.
Authors: Marco Fiorucci, Peter Naylor, Makoto Yamada
Unsupervised change detection between airborne LiDAR data points, taken at separate times over the same location, can be difficult due to unmatching spatial support and noise from the acquisition system. Most current approaches to detect changes in point clouds rely heavily on the computation of Digital Elevation Models (DEM) images and supervised methods. Obtaining a DEM leads to LiDAR informational loss due to pixelisation, and supervision requires large amounts of labelled data often unavailable in real-world scenarios. We propose an unsupervised approach based on the computation of the transport of 3D LiDAR points over two temporal supports. The method is based on unbalanced optimal transport and can be generalised to any change detection problem with LiDAR data. We apply our approach to publicly available datasets for monitoring urban sprawling in various noise and resolution configurations that mimic several sensors used in practice. Our method allows for unsupervised multi-class classification and outperforms the previous state-of-the-art unsupervised approaches by a significant margin.
Authors: Aodong Li, Chen Qiu, Marius Kloft, Padhraic Smyth, Maja Rudolph, Stephan Mandt
Anomaly detection (AD) plays a crucial role in many safety-critical application domains. The challenge of adapting an anomaly detector to drift in the normal data distribution, especially when no training data is available for the "new normal," has led to the development of zero-shot AD techniques. In this paper, we propose a simple yet effective method called Adaptive Centered Representations (ACR) for zero-shot batch-level AD. Our approach trains off-the-shelf deep anomaly detectors (such as deep SVDD) to adapt to a set of inter-related training data distributions in combination with batch normalization, enabling automatic zero-shot generalization for unseen AD tasks. This simple recipe, batch normalization plus meta-training, is a highly effective and versatile tool. Our theoretical results guarantee the zero-shot generalization for unseen AD tasks; our empirical results demonstrate the first zero-shot AD results for tabular data and outperform existing methods in zero-shot anomaly detection and segmentation on image data from specialized domains. Code is at https://github.com/aodongli/zero-shot-ad-via-batch-norm
Authors: Alexander Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob Gardner, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, Amir Yazdanbakhsh
With the waning of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code. Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks. To that end, we introduce a framework for adapting LLMs to high-level program optimization. First, we curate a dataset of performance-improving edits made by human programmers of over 77K competitive C++ programming submission pairs, accompanied by extensive unit tests. A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious "improvements". To isolate and reliably evaluate the impact of program optimizations, we design an environment based on the gem5 full system simulator, the de facto simulator used in academia and industry. Next, we propose a broad range of adaptation strategies for code optimization; for prompting, these include retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play. A combination of these techniques achieves an average speedup of 5.65X on CodeLlama-13B and 6.86X on GPT-3.5, surpassing the best human performance (4.06X). We find our proposed performance-conditioned generation is particularly effective at improving performance as well as increasing the fraction of optimized programs.
Authors: Gleb Novikov, David Steurer, Stefan Tiegel
We study the problem of robustly estimating the mean or location parameter without moment assumptions. We show that for a large class of symmetric distributions, the same error as in the Gaussian setting can be achieved efficiently. The distributions we study include products of arbitrary symmetric one-dimensional distributions, such as product Cauchy distributions, as well as elliptical distributions.
For product distributions and elliptical distributions with known scatter (covariance) matrix, we show that given an $\varepsilon$-corrupted sample, we can with probability at least $1-\delta$ estimate its location up to error $O(\varepsilon \sqrt{\log(1/\varepsilon)})$ using $\tfrac{d\log(d) + \log(1/\delta)}{\varepsilon^2 \log(1/\varepsilon)}$ samples. This result matches the best-known guarantees for the Gaussian distribution and known SQ lower bounds (up to the $\log(d)$ factor). For elliptical distributions with unknown scatter (covariance) matrix, we propose a sequence of efficient algorithms that approaches this optimal error. Specifically, for every $k \in \mathbb{N}$, we design an estimator using time and samples $\tilde{O}({d^k})$ achieving error $O(\varepsilon^{1-\frac{1}{2k}})$. This matches the error and running time guarantees when assuming certifiably bounded moments of order up to $k$. For unknown covariance, such error bounds of $o(\sqrt{\varepsilon})$ are not even known for (general) sub-Gaussian distributions.
Our algorithms are based on a generalization of the well-known filtering technique. We show how this machinery can be combined with Huber-loss-based techniques to work with projections of the noise that behave more nicely than the initial noise. Moreover, we show how SoS proofs can be used to obtain algorithmic guarantees even for distributions without a first moment. We believe that this approach may find other applications in future works.
Authors: Praveen Kumar, Christophe G. Lambert
Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the selected completely at random (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion, $\alpha$, of positives among unlabeled examples and poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms can estimate $\alpha$ or the probability of an individual unlabeled instance being positive or both. We propose two PU learning algorithms to estimate $\alpha$, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR uses a divide-and-conquer approach that creates and solves several SCAR-like sub-problems using PULSCAR. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.
Authors: Susung Hong, Donghoon Ahn, Seungryong Kim
Existing score-distilling text-to-3D generation techniques, despite their considerable promise, often encounter the view inconsistency problem. One of the most notable issues is the Janus problem, where the most canonical view of an object (e.g., face or head) appears in other views. In this work, we explore existing frameworks for score-distilling text-to-3D generation and identify the main cause of the view inconsistency problem: the embedded bias of 2D diffusion models. Based on these findings, we propose two approaches to debias the score-distillation frameworks for view-consistent text-to-3D generation. Our first approach, called score debiasing, involves cutting off the score estimated by 2D diffusion models and gradually increasing the truncation value throughout the optimization process. Our second approach, called prompt debiasing, identifies conflicting words between user prompts and view prompts using a language model, and adjusts the discrepancy between view prompts and the viewing direction of an object. Our experimental results show that our methods improve the realism of the generated 3D objects by significantly reducing artifacts and achieve a good trade-off between faithfulness to the 2D diffusion models and 3D consistency with little overhead. Our project page is available at https://susunghong.github.io/Debiased-Score-Distillation-Sampling/.
Authors: Thomas Sanchez, Oscar Esteban, Yvan Gomez, Elisenda Eixarch, Meritxell Bach Cuadra
Quality control (QC) has long been considered essential to guarantee the reliability of neuroimaging studies. It is particularly important for fetal brain MRI, where large and unpredictable fetal motion can lead to substantial artifacts in the acquired images. Existing methods for fetal brain quality assessment operate at the slice level, and fail to get a comprehensive picture of the quality of an image, which can only be achieved by looking at the entire brain volume. In this work, we propose FetMRQC, a machine learning framework for automated image quality assessment tailored to fetal brain MRI, which extracts an ensemble of quality metrics that are then used to predict experts' ratings. Based on the manual ratings of more than 1000 low-resolution stacks acquired across two different institutions, we show that, compared with existing quality metrics, FetMRQC is able to generalize out-of-domain, while being interpretable and data efficient. We also release a novel manual quality rating tool designed to facilitate and optimize quality rating of fetal brain images.
Our tool, along with all the code to generate, train and evaluate the model, is available at https://github.com/Medical-Image-Analysis-Laboratory/fetal_brain_qc/.
Authors: Nils Lukas, Florian Kerschbaum
Deepfakes refer to content synthesized using deep generators, which, when misused, have the potential to erode trust in digital media. Synthesizing high-quality deepfakes requires access to large and complex generators only a few entities can train and provide. The threat is malicious users that exploit access to the provided model and generate harmful deepfakes without risking detection. Watermarking makes deepfakes detectable by embedding an identifiable code into the generator that is later extractable from its generated images. We propose Pivotal Tuning Watermarking (PTW), a method for watermarking pre-trained generators (i) three orders of magnitude faster than watermarking from scratch and (ii) without the need for any training data. We improve existing watermarking methods and scale to generators $4 \times$ larger than related work. PTW can embed longer codes than existing methods while better preserving the generator's image quality. We propose rigorous, game-based definitions for robustness and undetectability, and our study reveals that watermarking is not robust against an adaptive white-box attacker who controls the generator's parameters. We propose an adaptive attack that can successfully remove any watermarking with access to only 200 non-watermarked images. Our work challenges the trustworthiness of watermarking for deepfake detection when the parameters of a generator are available. The source code to reproduce our experiments is available at https://github.com/nilslukas/gan-watermark.
Authors: Zhengyuan Jiang, Jinghuai Zhang, Neil Zhenqiang Gong
A generative AI model can generate extremely realistic-looking content, posing growing challenges to the authenticity of information. To address these challenges, watermarking has been leveraged to detect AI-generated content. Specifically, a watermark is embedded into AI-generated content before it is released. Content is detected as AI-generated if a similar watermark can be decoded from it. In this work, we perform a systematic study of the robustness of such watermark-based AI-generated content detection. We focus on AI-generated images. Our work shows that an attacker can post-process a watermarked image by adding a small, human-imperceptible perturbation to it, such that the post-processed image evades detection while maintaining its visual quality. We show the effectiveness of our attack both theoretically and empirically. Moreover, to evade detection, our adversarial post-processing method adds much smaller perturbations to AI-generated images and thus better maintains their visual quality than existing popular post-processing methods such as JPEG compression, Gaussian blur, and Brightness/Contrast. Our work shows the insufficiency of existing watermark-based detection of AI-generated content, highlighting the urgent need for new methods. Our code is publicly available: https://github.com/zhengyuan-jiang/WEvade.
Authors: William Leeney, Ryan McConville
Graph Neural Networks (GNNs) have improved unsupervised community detection of clustered nodes due to their ability to encode the dual dimensionality of the connectivity and feature information spaces of graphs. Identifying the latent communities has many practical applications from social networks to genomics. Current benchmarks of real-world performance are confusing due to the variety of decisions influencing the evaluation of GNNs at this task. To address this, we propose a framework to establish a common evaluation protocol. We motivate and justify it by demonstrating the differences with and without the protocol. The W Randomness Coefficient is a metric proposed for assessing the consistency of algorithm rankings to quantify the reliability of results under the presence of randomness. We find that by ensuring the same evaluation criteria are followed, there may be significant differences from the reported performance of methods at this task, but a more complete evaluation and comparison of methods is possible.
Authors: Shehzaad Dhuliawala, Mrinmaya Sachan, Carl Allen
We present a latent variable model for classification that provides a novel probabilistic interpretation of neural network softmax classifiers. We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders, that generalises the cross-entropy loss used to train classification models. Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency between their anticipated distribution, required for accurate label predictions to be output, and the empirical distribution found in practice. We augment the variational objective to mitigate such inconsistency and encourage a chosen latent distribution, instead of the implicit assumption in off-the-shelf softmax classifiers. Overall, we provide new theoretical insight into the inner workings of widely-used softmax classification. Empirical evaluation on image and text classification datasets demonstrates that our proposed approach, variational classification, maintains classification accuracy while the reshaped latent space improves other desirable properties of a classifier, such as calibration, adversarial robustness, robustness to distribution shift and sample efficiency useful in low data settings.
Authors: Sixing Yu, J. Pablo Muñoz, Ali Jannesari
Foundation Models (FMs), such as LLaMA, BERT, GPT, ViT, and CLIP, have demonstrated remarkable success in a wide range of applications, driven by their ability to leverage vast amounts of data for pre-training. However, optimizing FMs often requires access to sensitive data, raising privacy concerns and limiting their applicability in many domains. In this paper, we propose the Federated Foundation Models (FFMs) paradigm, which combines the benefits of FMs and Federated Learning (FL) to enable privacy-preserving and collaborative learning across multiple end-users. We discuss the potential benefits and challenges of integrating FL into the lifespan of FMs, covering pre-training, fine-tuning, and application. We further outline potential future research avenues in FFM, including FFM pre-training, FFM fine-tuning, and federated prompt tuning, which allow the development of more personalized and context-aware models while ensuring data privacy. Moreover, we explore the possibility of continual/lifelong learning in FFMs, as increased computational power at the edge may unlock the potential for optimizing FMs using newly generated private data close to the data source. The proposed FFM concepts offer a flexible and scalable framework for training large language models in a privacy-preserving manner, setting the stage for subsequent advancements in both FM training and federated learning.
Authors: Gaurav Maheshwari, Aurélien Bellet, Pascal Denis, Mikaela Keller
In this work, we consider the problem of intersectional group fairness in the classification setting, where the objective is to learn discrimination-free models in the presence of several intersecting sensitive groups. First, we illustrate various shortcomings of existing fairness measures commonly used to capture intersectional fairness. Then, we propose a new definition called the $\alpha$-Intersectional Fairness, which combines the absolute and the relative performance across sensitive groups and can be seen as a generalization of the notion of differential fairness. We highlight several desirable properties of the proposed definition and analyze its relation to other fairness measures. Finally, we benchmark multiple popular in-processing fair machine learning approaches using our new fairness definition and show that they do not achieve any improvement over a simple baseline. Our results reveal that the increase in fairness measured by previous definitions hides a "leveling down" effect, i.e., degrading the best performance over groups rather than improving the worst one.
Authors: Annie Gray, Alexander Modell, Patrick Rubin-Delanchy, Nick Whiteley
In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
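The merge rule recommended above can be sketched in a few lines: at each step, merge the two clusters whose members have the largest average pairwise dot product. This is a naive cubic-time illustration for small inputs, not the authors' implementation; the function name, the tie-breaking (first pair found), and the toy points are assumptions.

```python
import numpy as np

def dot_product_agglomerative(X):
    """Agglomerative clustering that merges by maximum average dot product.

    Returns the list of merged clusters (as sorted index lists), in the
    order the merges happen. Ties go to the first pair encountered.
    """
    X = np.asarray(X, dtype=float)
    gram = X @ X.T                          # all pairwise dot products
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average dot product between members of the two clusters.
                score = gram[np.ix_(clusters[a], clusters[b])].mean()
                if score > best:
                    best, best_pair = score, (a, b)
        a, b = best_pair
        merged = clusters[a] + clusters[b]
        merges.append(sorted(merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
print(dot_product_agglomerative(points))
# [[0, 1], [2, 3], [0, 1, 2, 3]]
```

Unlike distance-based linkage, the dot product depends on vector norms as well as angles, so rescaling the data can change the resulting tree.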
Authors: Marco Bellagente, Manuel Brack, Hannah Teufel, Felix Friedrich, Björn Deiseroth, Constantin Eichenberg, Andrew Dai, Robert Baldock, Souradeep Nanda, Koen Oostermeijer, Andres Felipe Cruz-Salinas, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
The recent popularity of text-to-image diffusion models (DMs) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion, which allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.
Authors: Jinyung Hong, Keun Hee Park, Theodore P. Pavlic
Many interpretable AI approaches have been proposed to provide plausible explanations for a model's decision-making. However, configuring an explainable model that effectively communicates among computational modules has received less attention. A recently proposed shared global workspace theory showed that networks of distributed modules can benefit from sharing information with a bottlenecked memory because the communication constraints encourage specialization, compositionality, and synchronization among the modules. Inspired by this, we propose Concept-Centric Transformers, a simple yet effective configuration of the shared global workspace for interpretability, consisting of: i) an object-centric-based memory module for extracting semantic concepts from input features, ii) a cross-attention mechanism between the learned concept and input embeddings, and iii) standard classification and explanation losses to allow human analysts to directly assess an explanation for the model's classification reasoning. We test our approach against other existing concept-based methods on classification tasks for various datasets, including CIFAR100, CUB-200-2011, and ImageNet, and we show that our model not only achieves better classification accuracy than all baselines across all problems but also generates more consistent concept-based explanations of classification output.
Authors: Honghao Wei, Xin Liu, Weina Wang, Lei Ying
This paper considers a class of reinforcement learning problems that involve systems with two types of states: stochastic and pseudo-stochastic. In such systems, stochastic states follow a stochastic transition kernel while the transitions of pseudo-stochastic states are deterministic given the stochastic states/transitions. We refer to such systems as mixed systems, which are widely used in various applications, including manufacturing systems, communication networks, and queueing networks. We propose a sample-efficient RL method that accelerates learning by generating augmented data samples. The proposed algorithm is data-driven and learns the policy from both real and augmented samples. This method significantly improves learning by reducing the sample complexity such that the dataset only needs to have sufficient coverage of the stochastic states. We analyze the sample complexity of the proposed method under Fitted Q Iteration (FQI) and demonstrate that the optimality gap decreases as $\tilde{\mathcal{O}}(\sqrt{{1}/{n}}+\sqrt{{1}/{m}})$, where $n$ is the number of real samples and $m$ is the number of augmented samples per real sample. It is important to note that without augmented samples, the optimality gap is $\tilde{\mathcal{O}}(1)$ due to insufficient data coverage of the pseudo-stochastic states. Our experimental results on multiple queueing network applications confirm that the proposed method indeed significantly accelerates learning in both deep Q-learning and deep policy gradient.
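To make the augmentation idea concrete, consider a hypothetical single-server queue in which the arrival count and the service outcome are the stochastic part, and the queue length is pseudo-stochastic: once the random events are observed, the next queue length follows deterministically. The sketch below replays one real sample's random events from other queue lengths. The dynamics, the sample layout, and the function names are all assumptions made for this example and are not taken from the paper.

```python
def queue_step(q, serve, arrivals, service_ok):
    """Deterministic queue update given the observed random events.

    q: current queue length; serve: whether the action is to serve;
    arrivals: observed arrival count; service_ok: whether service succeeded.
    """
    departed = 1 if (serve and service_ok and q > 0) else 0
    return q - departed + arrivals

def augment(real_sample, virtual_queue_lengths):
    """Replay one real sample's random events from other queue lengths.

    A sample is (q, serve, arrivals, service_ok, next_q). Each virtual
    sample reuses the observed events but starts from a different q.
    """
    _, serve, arrivals, service_ok, _ = real_sample
    return [
        (q, serve, arrivals, service_ok,
         queue_step(q, serve, arrivals, service_ok))
        for q in virtual_queue_lengths
    ]

real = (3, True, 1, True, queue_step(3, True, 1, True))
virtual = augment(real, [0, 1, 5])
print(real)     # (3, True, 1, True, 3)
print(virtual)  # next queue lengths are 1, 1 and 5
```

Under these assumptions, one observed transition yields as many training tuples as there are virtual queue lengths, which is how coverage of the pseudo-stochastic states can be obtained without additional real data.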
Authors: Yingcong Li, Kartik Sreenivasan, Angeliki Giannou, Dimitris Papailiopoulos, Samet Oymak
Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Despite its success, the underlying mechanics of CoT are not yet fully understood. In an attempt to shed light on this, our study investigates the impact of CoT on the ability of transformers to in-context learn a simple to study, yet general family of compositional functions: multi-layer perceptrons (MLPs). In this setting, we find that the success of CoT can be attributed to breaking down in-context learning of a compositional function into two distinct phases: focusing on and filtering data related to each step of the composition and in-context learning the single-step composition function. Through both experimental and theoretical evidence, we demonstrate how CoT significantly reduces the sample complexity of in-context learning (ICL) and facilitates the learning of complex functions that non-CoT methods struggle with. Furthermore, we illustrate how transformers can transition from vanilla in-context learning to mastering a compositional function with CoT by simply incorporating additional layers that perform the necessary data-filtering for CoT via the attention mechanism. In addition to these test-time benefits, we show CoT helps accelerate pretraining by learning shortcuts to represent complex functions and filtering plays an important role in this process. These findings collectively provide insights into the mechanics of CoT, inviting further investigation of its role in complex reasoning tasks.
Authors: Pedro Sandoval-Segura, Vasu Singla, Jonas Geiping, Micah Goldblum, Tom Goldstein
In an era of widespread web scraping, unlearnable dataset methods have the potential to protect data privacy by preventing deep neural networks from generalizing. But in addition to a number of practical limitations that make their use unlikely, we make a number of findings that call into question their ability to safeguard data. First, it is widely believed that neural networks trained on unlearnable datasets only learn shortcuts, simpler rules that are not useful for generalization. In contrast, we find that networks actually can learn useful features that can be reweighed for high test performance, suggesting that image protection is not assured. Unlearnable datasets are also believed to induce learning shortcuts through linear separability of added perturbations. We provide a counterexample, demonstrating that linear separability of perturbations is not a necessary condition. To emphasize why linearly separable perturbations should not be relied upon, we propose an orthogonal projection attack which allows learning from unlearnable datasets published in ICML 2021 and ICLR 2023. Our proposed attack is significantly less complex than recently proposed techniques.
Authors: Andrew Liu, Alla Borisyuk
Navigation is a complex skill with a long history of research in animals and humans. In this work, we simulate the Morris Water Maze in 2D to train deep reinforcement learning agents. We perform automatic classification of navigation strategies, analyze the distribution of strategies used by artificial agents, and compare them with experimental data to show learning dynamics similar to those seen in humans and rodents. We develop environment-specific auxiliary tasks and examine factors affecting their usefulness. We suggest that the most beneficial tasks are potentially more biologically feasible for real agents to use. Lastly, we explore the development of internal representations in the activations of artificial agent neural networks. These representations resemble place cells and head-direction cells found in mouse brains, and their presence correlates with the navigation strategies that artificial agents employ.
Authors: Anqi Li, Dipendra Misra, Andrey Kolobov, Ching-An Cheng
We present a novel observation about the behavior of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and certain implicit biases in common data collection practices. As we prove in this work, pessimism endows the agent with a "survival instinct", i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. Formally, given a reward class -- which may not even contain the true reward -- we identify conditions on the training data distribution that enable offline RL to learn a near-optimal and safe policy from any reward within the class. We argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating future ones. Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is nudged to learn a desirable behavior with imperfect reward but purposely biased data coverage.
Authors: Haiyang Yu, Zhao Xu, Xiaofeng Qian, Xiaoning Qian, Shuiwang Ji
We consider the prediction of the Hamiltonian matrix, which finds use in quantum chemistry and condensed matter physics. Efficiency and equivariance are two important, but conflicting factors. In this work, we propose an SE(3)-equivariant network, named QHNet, that achieves efficiency and equivariance. Our key advance lies in the innovative design of the QHNet architecture, which not only obeys the underlying symmetries, but also enables the reduction of the number of tensor products by 92\%. In addition, QHNet prevents the exponential growth of channel dimension when more atom types are involved. We perform experiments on MD17 datasets, including four molecular systems. Experimental results show that our QHNet can achieve comparable performance to the state-of-the-art methods at a significantly faster speed. Besides, our QHNet consumes 50\% less memory due to its streamlined architecture. Our code is publicly available as part of the AIRS library (\url{https://github.com/divelab/AIRS}).
Authors: Pum Jun Kim, Yoojin Jang, Jisu Kim, Jaejun Yoo
We propose a robust and reliable evaluation metric for generative models by introducing topological and statistical treatments for rigorous support estimation. Existing metrics, such as Inception Score (IS), Fréchet Inception Distance (FID), and the variants of Precision and Recall (P&R), heavily rely on supports that are estimated from sample features. However, the reliability of their estimation has not been seriously discussed (and has been overlooked) even though the quality of the evaluation entirely depends on it. In this paper, we propose Topological Precision and Recall (TopP&R, pronounced 'topper'), which provides a systematic approach to estimating supports, retaining only topologically and statistically important features with a certain level of confidence. This not only makes TopP&R robust to noisy features, but also provides statistical consistency. Our theoretical and experimental results show that TopP&R is robust to outliers and non-independent and identically distributed (Non-IID) perturbations, while accurately capturing the true trend of change in samples. To the best of our knowledge, this is the first evaluation metric that focuses on the robust estimation of the support and provides statistical consistency under noise.
Authors: Borja Rodríguez-Gálvez, Ragnar Thobaben, Mikael Skoglund
In this paper, we present new high-probability PAC-Bayes bounds for different types of losses. Firstly, for losses with a bounded range, we recover a strengthened version of Catoni's bound that holds uniformly for all parameter values. This leads to new fast rate and mixed rate bounds that are interpretable and tighter than previous bounds in the literature. In particular, the fast rate bound is equivalent to the Seeger--Langford bound. Secondly, for losses with more general tail behaviors, we introduce two new parameter-free bounds: a PAC-Bayes Chernoff analogue when the loss' cumulant generating function is bounded, and a bound when the loss' second moment is bounded. These two bounds are obtained using a new technique based on a discretization of the space of possible events for the "in probability" parameter optimization problem. This technique is both simpler and more general than previous approaches optimizing over a grid on the parameters' space. Finally, we extend all previous results to anytime-valid bounds using a simple technique applicable to any existing bound.
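For orientation, the Seeger--Langford bound mentioned above is commonly stated in the following form (with the logarithmic term due to Maurer). The notation is an assumption of this note rather than the abstract's: $\pi$ is a data-independent prior, $\rho$ any posterior, $\hat{R}(\rho)$ and $R(\rho)$ the empirical and population risks of a loss with values in $[0,1]$, $n$ the number of i.i.d. samples, and $\delta$ the confidence parameter. The constants used in the paper itself may differ.

```latex
% With probability at least 1 - \delta over the sample, simultaneously for all posteriors \rho:
\[
  \mathrm{kl}\!\left(\hat{R}(\rho) \,\middle\|\, R(\rho)\right)
  \;\le\;
  \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{n},
\]
% where kl is the relative entropy between two Bernoulli distributions:
\[
  \mathrm{kl}(p \,\|\, q) \;=\; p \ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}.
\]
```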
Authors: Allan Raventós, Mansheej Paul, Feng Chen, Surya Ganguli
Pretrained transformers exhibit the remarkable ability of in-context learning (ICL): they can learn tasks from just a few examples provided in the prompt without updating any weights. This raises a foundational question: can ICL solve fundamentally $\textit{new}$ tasks that are very different from those seen during pretraining? To probe this question, we examine ICL's performance on linear regression while varying the diversity of tasks in the pretraining dataset. We empirically demonstrate a $\textit{task diversity threshold}$ for the emergence of ICL. Below this threshold, the pretrained transformer cannot solve unseen regression tasks, instead behaving like a Bayesian estimator with the $\textit{non-diverse pretraining task distribution}$ as the prior. Beyond this threshold, the transformer significantly outperforms this estimator; its behavior aligns with that of ridge regression, corresponding to a Gaussian prior over $\textit{all tasks}$, including those not seen during pretraining. Thus, when pretrained on data with task diversity greater than the threshold, transformers $\textit{can}$ optimally solve fundamentally new tasks in-context. Importantly, this capability hinges on it deviating from the Bayes optimal estimator with the pretraining distribution as the prior. This study also explores the effect of regularization, model capacity and task structure and underscores, in a concrete example, the critical role of task diversity, alongside data and model scale, in the emergence of ICL. Code is available at https://github.com/mansheej/icl-task-diversity.
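The two reference estimators contrasted above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed conventions, not the authors' code: tasks are weight vectors $w$, observations are $y = Xw + \varepsilon$ with noise variance `noise_var`, the Gaussian prior is taken to be standard normal, and the function names are hypothetical. The discrete-prior estimator is the posterior mean under a uniform prior over a finite set of pretraining tasks; ridge regression is the posterior mean under the Gaussian prior over all tasks.

```python
import numpy as np

def ridge_estimate(X, y, noise_var):
    """Posterior mean of w under a N(0, I) prior and Gaussian noise."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + noise_var * np.eye(d), X.T @ y)

def discrete_prior_estimate(X, y, tasks, noise_var):
    """Posterior mean of w under a uniform prior over the rows of `tasks`."""
    resid = y[None, :] - tasks @ X.T                 # shape (M, n)
    log_w = -0.5 * np.sum(resid ** 2, axis=1) / noise_var
    log_w -= log_w.max()                             # stabilise the softmax
    weights = np.exp(log_w)
    weights /= weights.sum()
    return weights @ tasks

rng = np.random.default_rng(0)
d, n, noise_var = 8, 16, 0.25
tasks = rng.normal(size=(4, d))   # a small, non-diverse pretraining task set
w_new = rng.normal(size=d)        # a task not in the pretraining set
X = rng.normal(size=(n, d))
y = X @ w_new + np.sqrt(noise_var) * rng.normal(size=n)

w_ridge = ridge_estimate(X, y, noise_var)
w_disc = discrete_prior_estimate(X, y, tasks, noise_var)
print("ridge error:         ", np.linalg.norm(w_ridge - w_new))
print("discrete-prior error:", np.linalg.norm(w_disc - w_new))
```

The discrete-prior estimate is always a convex combination of the pretraining tasks, so it cannot recover a genuinely new task, whereas ridge regression can; this is the gap the abstract's threshold result refers to.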
Authors: Yan Ju, Shu Hu, Shan Jia, George H. Chen, Siwei Lyu
Despite the development of effective deepfake detectors in recent years, recent studies have demonstrated that biases in the data used to train these detectors can lead to disparities in detection accuracy across different races and genders. This can result in different groups being unfairly targeted or excluded from detection, allowing undetected deepfakes to manipulate public opinion and erode trust in a deepfake detection model. While existing studies have focused on evaluating fairness of deepfake detectors, to the best of our knowledge, no method has been developed to encourage fairness in deepfake detection at the algorithm level. In this work, we make the first attempt to improve deepfake detection fairness by proposing novel loss functions that handle both the setting where demographic information (e.g., annotations of race and gender) is available as well as the case where this information is absent. Fundamentally, both approaches can be used to convert many existing deepfake detectors into ones that encourage fairness. Extensive experiments on four deepfake datasets and five deepfake detectors demonstrate the effectiveness and flexibility of our approach in improving deepfake detection fairness. Our code is available at https://github.com/littlejuyan/DF_Fairness.
Authors: Qiang He, Tianyi Zhou, Meng Fang, Setareh Maghsudi
We propose a novel value approximation method, namely Eigensubspace Regularized Critic (ERC), for deep reinforcement learning (RL). ERC is motivated by an analysis of the dynamics of Q-value approximation error in the Temporal-Difference (TD) method, which follows a path defined by the 1-eigensubspace of the transition kernel associated with the Markov Decision Process (MDP). It reveals a fundamental property of TD learning that has remained unused in previous deep RL approaches. In ERC, we propose a regularizer that guides the approximation error towards the 1-eigensubspace, resulting in a more efficient and stable path of value approximation. Moreover, we theoretically prove the convergence of the ERC method. Besides, theoretical analysis and experiments demonstrate that ERC effectively reduces the variance of value functions. Among 26 tasks in the DMControl benchmark, ERC outperforms state-of-the-art methods for 20. Besides, it shows significant advantages in Q-value approximation and variance reduction. Our code is available at https://sites.google.com/view/erc-ecml23/.
Authors: Oskar Allerbo
Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. Here, we introduce an equivalent formulation of the objective function of KRR, which opens up both the use of penalties other than the ridge penalty and the study of kernel ridge regression from the perspective of gradient descent. Using a continuous-time perspective, we derive a closed-form solution for solving kernel regression with gradient descent, something we refer to as kernel gradient flow, KGF, and theoretically bound the differences between KRR and KGF, where, for the latter, regularization is obtained through early stopping. We also generalize KRR by replacing the ridge penalty with the $\ell_1$ and $\ell_\infty$ penalties, respectively, and use the fact that, analogous to the similarities between KGF and KRR, $\ell_1$ regularization and forward stagewise regression (also known as coordinate descent), and $\ell_\infty$ regularization and sign gradient descent, follow similar solution paths. We can thus alleviate the need for computationally heavy algorithms based on proximal gradient descent. We show theoretically and empirically how the $\ell_1$ and $\ell_\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions, respectively.
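The relationship between explicit ridge regularization and early stopping can be sketched numerically. The block below is an illustration under assumptions, not the paper's implementation: it takes the gradient flow to be $\dot{\alpha}(t) = y - K\alpha(t)$ with $\alpha(0) = 0$, which gives $\alpha(t) = K^{-1}(I - e^{-tK})\,y$, and compares it with the KRR coefficients $(K + \lambda I)^{-1} y$. The RBF kernel, the function names, and the pairing $t = 1/\lambda$ are choices made here for illustration.

```python
import numpy as np

def rbf_kernel(x, z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between two 1-D point sets."""
    diff = x[:, None] - z[None, :]
    return np.exp(-0.5 * (diff / bandwidth) ** 2)

def krr_coefficients(K, y, lam):
    """Kernel ridge regression: alpha = (K + lam * I)^{-1} y."""
    s, V = np.linalg.eigh(K)
    return V @ ((V.T @ y) / (s + lam))

def kgf_coefficients(K, y, t, eps=1e-12):
    """Assumed gradient-flow solution: alpha(t) = K^{-1} (I - exp(-t K)) y."""
    s, V = np.linalg.eigh(K)
    safe_s = np.where(s > eps, s, 1.0)
    # (1 - exp(-t s)) / s, with the limit t used for numerically zero eigenvalues.
    filt = np.where(s > eps, -np.expm1(-t * safe_s) / safe_s, t)
    return V @ (filt * (V.T @ y))

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 20)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
K = rbf_kernel(x, x)

lam = 0.1
alpha_krr = krr_coefficients(K, y, lam)
alpha_kgf = kgf_coefficients(K, y, t=1.0 / lam)
print("max |KRR fit - KGF fit|:", np.max(np.abs(K @ alpha_krr - K @ alpha_kgf)))
```

Both estimators shrink each eigen-direction of $K$: KRR by the factor $s/(s+\lambda)$ and the flow by $1 - e^{-ts}$, so a larger stopping time $t$ corresponds to weaker regularization, much like a smaller $\lambda$.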
Authors: Jian Yao, Weiming Liu, Haobo Fu, Yaodong Yang, Stephen McAleer, Qiang Fu, Wei Yang
Policy-Space Response Oracles (PSRO) is an influential algorithm framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have been trying to promote policy diversity in PSRO. A major weakness in existing diversity metrics is that a more diverse (according to their diversity metrics) population does not necessarily mean (as we proved in the paper) a better approximation to a NE. To alleviate this problem, we propose a new diversity metric, the improvement of which guarantees a better approximation to a NE. Meanwhile, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best response solving in PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective in producing significantly less exploitable policies than state-of-the-art PSRO variants.
Authors: Mufang Ying, Koulik Khamaru, Cun-Hui Zhang
Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such a data collection mechanism often introduces complexities to the statistical inference procedure. For instance, the ordinary least squares (OLS) estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. In this paper, we propose a general method for constructing a debiased estimator which remedies this issue. It makes use of the idea of adaptive linear estimating equations, and we establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. A salient feature of our estimator is that in the context of multi-armed bandits, our estimator retains the non-asymptotic performance of the least squares estimator while obtaining the asymptotic normality property. Consequently, this work helps connect two fruitful paradigms of adaptive inference: a) non-asymptotic inference using concentration inequalities and b) asymptotic inference via asymptotic normality.
Authors: Felix Ulrich-Oltean, Peter Nightingale, James Alfred Walker
Many constraint satisfaction and optimisation problems can be solved effectively by encoding them as instances of the Boolean Satisfiability problem (SAT). However, even the simplest types of constraints have many encodings in the literature with widely varying performance, and the problem of selecting suitable encodings for a given problem instance is not trivial. We explore the problem of selecting encodings for pseudo-Boolean and linear constraints using a supervised machine learning approach. We show that it is possible to select encodings effectively using a standard set of features for constraint problems; however we obtain better performance with a new set of features specifically designed for the pseudo-Boolean and linear constraints. In fact, we achieve good results when selecting encodings for unseen problem classes. Our results compare favourably to AutoFolio when using the same feature set. We discuss the relative importance of instance features to the task of selecting the best encodings, and compare several variations of the machine learning method.
Authors: Cian Eastwood, Shashank Singh, Andrei Liviu Nicolicioiu, Marin Vlastelica, Julius von Kügelgen, Bernhard Schölkopf
To avoid failures on out-of-distribution data, recent works have sought to extract features that have an invariant or stable relationship with the label across domains, discarding "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information that could boost performance if used correctly in the test domain. In this work, we show how this can be done without test-domain labels. In particular, we prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this theoretical insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning a predictor that separates stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data.
Authors: Jishnu Ray Chowdhury, Cornelia Caragea
Beam Tree Recursive Neural Network (BT-RvNN) was recently proposed as a simple extension of Gumbel Tree RvNN and it was shown to achieve state-of-the-art length generalization performance in ListOps while maintaining comparable performance on other tasks. However, although not the worst of its kind, BT-RvNN can still be exorbitantly expensive in memory usage. In this paper, we identify the main bottleneck in BT-RvNN's memory usage to be the entanglement of the scorer function and the recursive cell function. We propose strategies to remove this bottleneck and further simplify its memory usage. Overall, our strategies not only reduce the memory usage of BT-RvNN by $10$-$16$ times but also create a new state-of-the-art in ListOps while maintaining similar performance in other tasks. In addition, we also propose a strategy to utilize the induced latent-tree node representations produced by BT-RvNN to turn BT-RvNN from a sentence encoder of the form $f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{d}$ into a sequence contextualizer of the form $f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times d}$. Thus, our proposals not only open up a path for further scalability of RvNNs but also standardize a way to use BT-RvNNs as another building block in the deep learning toolkit that can be easily stacked or interfaced with other popular models such as Transformers and Structured State Space models.
Authors: Eduardo Fernandes Montesuma, Fred Ngolè Mboula, Antoine Souloumiac
This paper seeks to solve Multi-Source Domain Adaptation (MSDA), which aims to mitigate data distribution shifts when transferring knowledge from multiple labeled source domains to an unlabeled target domain. We propose a novel MSDA framework based on dictionary learning and optimal transport. We interpret each domain in MSDA as an empirical distribution. As such, we express each domain as a Wasserstein barycenter of dictionary atoms, which are empirical distributions. We propose a novel algorithm, DaDiL, for learning via mini-batches: (i) atom distributions; (ii) a matrix of barycentric coordinates. Based on our dictionary, we propose two novel methods for MSDA: DaDiL-R, based on the reconstruction of labeled samples in the target domain, and DaDiL-E, based on the ensembling of classifiers learned on atom distributions. We evaluate our methods on 3 benchmarks: Caltech-Office, Office 31, and CWRU, where we improved previous state-of-the-art by 3.15%, 2.29%, and 7.71% in classification performance. Finally, we show that interpolations in the Wasserstein hull of learned atoms provide data that can generalize to the target domain.
Authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez
Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generating audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or for generating relatively low sampling rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and evaluation code, along with audio samples, are available on the facebookresearch/audiocraft GitHub page.
Authors: Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong, Stephen Gould
Recently, the DETR framework has emerged as the dominant approach for human--object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.
Authors: Vaidotas Simkus, Michael U. Gutmann
Conditional sampling of variational autoencoders (VAEs) is needed in various applications, such as missing data imputation, but is computationally intractable. A principled choice for asymptotically exact conditional sampling is Metropolis-within-Gibbs (MWG). However, we observe that the tendency of VAEs to learn a structured latent space, a commonly desired property, can cause the MWG sampler to get "stuck" far from the target distribution. This paper mitigates the limitations of MWG: we systematically outline the pitfalls in the context of VAEs, propose two original methods that address these pitfalls, and demonstrate an improved performance of the proposed methods on a set of sampling tasks.
Authors: Omer Veysel Cagatan
We present ToddlerBERTa, a BabyBERTa-like language model, exploring its capabilities through five different models with varied hyperparameters. Evaluating on BLiMP, SuperGLUE, MSGS, and a Supplement benchmark from the BabyLM challenge, we find that smaller models can excel in specific tasks, while larger models perform well with substantial data. Despite training on a smaller dataset, ToddlerBERTa demonstrates commendable performance, rivalling the state-of-the-art RoBERTa-base. The model showcases robust language understanding, even with single-sentence pretraining, and competes with baselines that leverage broader contextual information. Our work provides insights into hyperparameter choices and data utilization, contributing to the advancement of language models.
Authors: Dejan Grubisic, Bram Wasti, Chris Cummins, John Mellor-Crummey, Aleksandar Zlateski
Advanced compiler technology is crucial for enabling machine learning applications to run on novel hardware, but traditional compilers fail to deliver performance, popular auto-tuners have long search times, and expert-optimized libraries introduce unsustainable costs. To address this, we developed LoopTune, a deep reinforcement learning compiler that optimizes tensor computations in deep learning models for the CPU. LoopTune optimizes tensor traversal order while using the ultra-fast lightweight code generator LoopNest to perform hardware-specific optimizations. With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating an order of magnitude faster code than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM, consistently performing at the level of the hand-tuned library NumPy. Moreover, LoopTune tunes code in the order of seconds.
Authors: Philippe Gonzalez, Tommy Sonne Alstrøm, Tobias May
The acoustic variability of noisy and reverberant speech mixtures is influenced by multiple factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the signal-to-noise ratio (SNR) and the room characteristics. This large variability poses a major challenge for learning-based speech enhancement systems, since a mismatch between the training and testing conditions can substantially reduce the performance of the system. Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or binaural room impulse response (BRIR) database different from the one used during training. However, the difficulty of the speech enhancement task can change across databases, which can substantially influence the results. The present study introduces a generalization assessment framework that uses a reference model trained on the test condition, such that it can be used as a proxy for the difficulty of the test condition. This makes it possible to disentangle the effect of the change in task difficulty from the effect of dealing with new data, and thus to define a new measure of generalization performance termed the generalization gap. The procedure is repeated in a cross-validation fashion by cycling through multiple speech, noise, and BRIR databases to accurately estimate the generalization gap. The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), Conv-TasNet, DCCRN and MANNER. We find that for all models, the performance degrades the most in speech mismatches, while good noise and room generalization can be achieved by training on multiple databases. Moreover, while recent models show higher performance in matched conditions, their performance substantially decreases in mismatched conditions and can become inferior to that of the FFNN-based system.
Authors: Enna Sachdeva, Nakul Agarwal, Suhas Chundi, Sean Roelofs, Jiachen Li, Mykel Kochenderfer, Chiho Choi, Behzad Dariush
The widespread adoption of commercial autonomous vehicles (AVs) and advanced driver assistance systems (ADAS) may largely depend on their acceptance by society, for which their perceived trustworthiness and interpretability to riders are crucial. In general, this task is challenging because modern autonomous systems software relies heavily on black-box artificial intelligence models. Towards this goal, this paper introduces a novel dataset, Rank2Tell, a multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. Using various closed- and open-ended visual question answering, the dataset provides dense annotations of various semantic, spatial, temporal, and relational attributes of various important objects in complex traffic scenarios. The dense annotations and unique attributes of the dataset make it a valuable resource for researchers working on visual scene understanding and related fields. Furthermore, we introduce a joint model for importance level ranking and natural language caption generation to benchmark our dataset and demonstrate performance with quantitative evaluations.
Authors: Fabiola Espinoza Castellon, Eduardo Fernandes Montesuma, Fred Ngolè Mboula, Aurélien Mayoue, Antoine Souloumiac, Cédric Gouy-Pailler
In this article, we propose an approach for federated domain adaptation, a setting where distributional shift exists among clients and some have unlabeled data. The proposed framework, FedDaDiL, tackles the resulting challenge through dictionary learning of empirical distributions. In our setting, clients' distributions represent particular domains, and FedDaDiL collectively trains a federated dictionary of empirical distributions. In particular, we build upon the Dataset Dictionary Learning framework by designing collaborative communication protocols and aggregation operations. The chosen protocols keep clients' data private, thus enhancing overall privacy compared to its centralized counterpart. We empirically demonstrate that our approach successfully generates labeled data on the target domain with extensive experiments on (i) Caltech-Office, (ii) TEP, and (iii) CWRU benchmarks. Furthermore, we compare our method to its centralized counterpart and other benchmarks in federated domain adaptation.
Authors: Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, Mosharaf Chowdhury
Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least $f+1$ logically equivalent pipeline replicas to tolerate any $f$ simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after $f$ or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to $29.6\times$.
Authors: Hugh Zhang, David C. Parkes
Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-generated training data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. During the self-learning loop, SECToR asks models to solve addition problems using chain-of-thought reasoning before training the next version of the model to solve those same problems directly without using such reasoning. This process often results in an improved model which can, when again augmented with chain-of-thought reasoning, solve even harder problems than the original model, allowing the self-learning loop to continue. Language models trained via SECToR autonomously learn to add up to the longest-length-digit numbers without access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, similarly to how Monte-Carlo Tree Search is used in AlphaZero (Silver et al., 2017). We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.
Authors: Shiyi Zhu, Jing Ye, Wei Jiang, Qi Zhang, Yifan Wu, Jianguo Li
Self-attention and position embedding are two key modules in Transformer-based LLMs. The potential relationship between them is far from well studied, especially for context window extension. In this paper, we introduce a collinear constrained relationship to fuse RoPE and self-attention, and name it Collinear Constrained Attention (CoCA). We have analyzed the computational and spatial complexity of CoCA and determined that it adds only minimal overhead compared to the original Transformer-based models. We provide an efficient implementation of CoCA, and make it a drop-in replacement for any existing position embedding and attention modules in Transformer-based models. Experiments show that CoCA performs extraordinarily well on context window extension. For instance, a CoCA-based GPT model trained with a context length of 512 can extend the context window up to 8K without perplexity diverging. This indicates more than 16x context window extension without any fine-tuning. Our code is released here: https://github.com/codefuse-ai/Collinear-Constrained-Attention
Authors: Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, David Seunghyun Yoon, Ryan A. Rossi, Franck Dernoncourt
Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document, representing it as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these documents with rich structure. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents from 10 different categories of question types for document QA. Our code and datasets will be released soon on GitHub.
Authors: Xianming Li, Jing Li
High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge existing text embedding models face is the problem of vanishing gradients, primarily due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This novel approach effectively mitigates the adverse effects of the saturation zone in the cosine function, which can impede gradients and hinder the optimization process. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks including short-text STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms the state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate the ability of AnglE to generate high-quality text embeddings and the usefulness of angle optimization in STS.
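The saturation effect named in this abstract can be illustrated with a minimal sketch. This is not the AnglE model or its loss. It only evaluates the derivative of cos(theta) with respect to the angle, which is -sin(theta) and shrinks toward zero as theta approaches 0 or pi. The function name `cosine_grad_wrt_angle` is hypothetical.

```python
import math

def cosine_grad_wrt_angle(theta: float) -> float:
    """Derivative of cos(theta) with respect to theta, i.e. -sin(theta)."""
    return -math.sin(theta)

# Three angles between two embedding vectors: nearly aligned,
# orthogonal, and nearly opposite.
angles = [0.01, math.pi / 2, math.pi - 0.01]
grads = [cosine_grad_wrt_angle(t) for t in angles]

for t, g in zip(angles, grads):
    # Near 0 and pi the cosine is close to +1 / -1 and its gradient is tiny.
    print(f"theta={t:.2f} rad  cos={math.cos(t):+.4f}  |grad|={abs(g):.4f}")
```

In this sketch the gradient magnitude is close to 1 for orthogonal vectors but about 0.01 at both extremes, which is the regime the abstract calls the saturation zone.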
Authors: Kevin Lin, Donald Brown, Sana Syed, Adam Greene
Eosinophilic Esophagitis (EoE) is an allergic condition increasing in prevalence. To diagnose EoE, pathologists must find 15 or more eosinophils within a single high-power field (400X magnification). Determining whether or not a patient has EoE can be an arduous process, and any medical imaging approaches used to assist diagnosis must consider both efficiency and precision. We propose an improvement of Adorno et al.'s approach for quantifying eosinophils using deep image segmentation. Our new approach leverages Monte Carlo Dropout, a common approach in deep learning to reduce overfitting, to provide uncertainty quantification on current deep learning models. The uncertainty can be visualized in an output image to evaluate model performance, provide insight into how deep learning algorithms function, and assist pathologists in identifying eosinophils.
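As a rough illustration of Monte Carlo Dropout for uncertainty quantification, the sketch below keeps dropout active at inference time, runs several stochastic forward passes, and reports the per-output mean and standard deviation. It is not the authors' segmentation model. The two-layer network, its random weights, and the name `mc_dropout_predict` are assumptions made for the example.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, p_drop=0.5, n_samples=100, seed=0):
    """Run n_samples stochastic forward passes with dropout left on.

    Returns the mean prediction and its standard deviation, the latter
    serving as a simple per-output uncertainty estimate.
    """
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_samples):
        h = np.maximum(x @ w1, 0.0)              # hidden ReLU layer
        mask = rng.random(h.shape) >= p_drop     # keep units with prob 1 - p_drop
        h = h * mask / (1.0 - p_drop)            # inverted-dropout rescaling
        logits = h @ w2
        preds.append(1.0 / (1.0 + np.exp(-logits)))  # sigmoid probability
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

# Hypothetical toy input: 4 "pixels", each described by 8 features.
rng = np.random.default_rng(42)
x = rng.normal(size=(4, 8))
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(16, 1))

mean, std = mc_dropout_predict(x, w1, w2)
print("mean probability:", mean.ravel())
print("uncertainty (std):", std.ravel())
```

In a segmentation setting the standard deviation would be computed per pixel and rendered as an image, which is one way to obtain the uncertainty visualization the abstract describes.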
Authors: Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong
This paper explores network binarization, a radical form of quantization that compresses model weights to a single bit, specifically for Large Language Model (LLM) compression. Because previous binarization methods collapse LLMs, we propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs. Specifically, our exploration first uncovers the ineffectiveness of naive applications of existing binarization algorithms and highlights the imperative role of salient weights in achieving low-bit quantization. Thus, PB-LLM filters a small ratio of salient weights during binarization, allocating them to higher-bit storage, i.e., partial binarization. PB-LLM is extended to recover the capacities of quantized LLMs, by analyzing from the perspective of post-training quantization (PTQ) and quantization-aware training (QAT). Under PTQ, combining the concepts from GPTQ, we reconstruct the binarized weight matrix guided by the Hessian matrix and successfully recover the reasoning capacity of PB-LLM in low-bit. Under QAT, we freeze the salient weights during training, explore the derivation of optimal scaling factors crucial for minimizing the quantization error, and propose a scaling mechanism based on this derived scaling strategy for residual binarized weights. Those explorations and the developed methodologies significantly contribute to rejuvenating the performance of low-bit quantized LLMs and present substantial advancements in the field of network binarization for LLMs. The code is available at https://github.com/hahnyuan/BinaryLLM.
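The idea of keeping a small ratio of salient weights at higher precision while binarizing the rest can be sketched as follows. This is not the PB-LLM code. Two assumptions are made for illustration: salience is judged by weight magnitude, and the binarized weights share one scaling factor equal to the mean absolute value of the non-salient weights (the factor that minimizes squared error for sign binarization). The name `partially_binarize` is hypothetical.

```python
import numpy as np

def partially_binarize(w, salient_ratio=0.1):
    """Keep the largest-magnitude weights as-is and binarize the rest.

    Returns the mixed-precision weight matrix, the boolean salient mask,
    and the scaling factor alpha used for the binarized weights.
    """
    flat = np.abs(w).ravel()
    k = max(1, int(round(salient_ratio * flat.size)))
    threshold = np.partition(flat, -k)[-k]       # k-th largest magnitude
    salient = np.abs(w) >= threshold             # assumed salience criterion
    alpha = np.abs(w[~salient]).mean()           # scale for the binarized part
    w_q = np.where(salient, w, alpha * np.sign(w))
    return w_q, salient, alpha

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
w_q, salient, alpha = partially_binarize(w, salient_ratio=0.25)

print("salient weights kept:", int(salient.sum()), "of", w.size)
print("reconstruction error:", float(np.linalg.norm(w - w_q)))
```

Raising `salient_ratio` trades storage for accuracy, since more weights are left untouched and fewer are reduced to a sign and a shared scale.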
Authors: Jonatan Vallin, Karl Larsson, Mats G. Larson
We formalize and interpret the geometric structure of $d$-dimensional fully connected ReLU layers in neural networks. The parameters of a ReLU layer induce a natural partition of the input domain, such that the ReLU layer can be significantly simplified in each sector of the partition. This leads to a geometric interpretation of a ReLU layer as a projection onto a polyhedral cone followed by an affine transformation, in line with the description in [doi:10.48550/arXiv.1905.08922] for convolutional networks with ReLU activations. Further, this structure facilitates simplified expressions for preimages of the intersection between partition sectors and hyperplanes, which is useful when describing decision boundaries in a classification setting. We investigate this in detail for a feed-forward network with one hidden ReLU-layer, where we provide results on the geometric complexity of the decision boundary generated by such networks, as well as proving that modulo an affine transformation, such a network can only generate $d$ different decision boundaries. Finally, the effect of adding more layers to the network is discussed.
Authors: Avisek Naug, Antonio Guillen, Ricardo Luna Gutiérrez, Vineet Gundecha, Dejan Markovikj, Lekhapriya Dheeraj Kashyap, Lorenz Krause, Sahand Ghorbanpour, Sajad Mousavi, Ashwin Ramesh Babu, Soumyendu Sarkar
The increasing global emphasis on sustainability and reducing carbon emissions is pushing governments and corporations to rethink their approach to data center design and operation. Given their high energy consumption and exponentially large computational workloads, data centers are prime candidates for optimizing power consumption, especially in areas such as cooling and IT energy usage. A significant challenge in this pursuit is the lack of a configurable and scalable thermal data center model that offers an end-to-end pipeline. Data centers consist of multiple IT components whose geometric configuration and heat dissipation make thermal modeling difficult. This paper presents PyDCM, a customizable Data Center Model implemented in Python, that allows users to create unique configurations of IT equipment with custom server specifications and geometric arrangements of IT cabinets. The use of vectorized thermal calculations makes PyDCM orders of magnitude faster (30 times) than current Energy Plus modeling implementations and scales sublinearly with the number of CPUs. Also, PyDCM enables the use of Deep Reinforcement Learning via the Gymnasium wrapper to optimize data center cooling and offers a user-friendly platform for testing various data center design prototypes.
Authors: Alexander Shmakov, Avisek Naug, Vineet Gundecha, Sahand Ghorbanpour, Ricardo Luna Gutierrez, Ashwin Ramesh Babu, Antonio Guillen, Soumyendu Sarkar
Bayesian Optimization (BO), guided by Gaussian process (GP) surrogates, has proven to be an invaluable technique for efficient, high-dimensional, black-box optimization, a critical problem inherent to many applications such as industrial design and scientific computing. Recent contributions have introduced reinforcement learning (RL) to improve the optimization performance on both single function optimization and few-shot multi-objective optimization. However, even few-shot techniques fail to exploit similarities shared between closely related objectives. In this paper, we combine recent developments in Deep Kernel Learning (DKL) and attention-based Transformer models to improve the modeling powers of GP surrogates with meta-learning. We propose a novel method for improving meta-learning BO surrogates by incorporating attention mechanisms into DKL, empowering the surrogates to adapt to contextual information gathered during the BO process. We combine this Transformer Deep Kernel with a learned acquisition function trained with continuous Soft Actor-Critic Reinforcement Learning to aid in exploration. This Reinforced Transformer Deep Kernel (RTDK-BO) approach yields state-of-the-art results in continuous high-dimensional optimization problems.
Authors: Jianchao Lu, Yuzhe Tian, Yang Zhang, Jiaqi Ge, Quan Z. Sheng, Xi Zheng
Brain-Computer Interfaces (BCIs) are a groundbreaking technology for interacting with external devices using brain signals. Despite advancements, electroencephalogram (EEG)-based Motor Imagery (MI) tasks face challenges like amplitude and phase variability, and complex spatial correlations, with a need for smaller model size and faster inference. This study introduces the LGL-BCI framework, employing a Geometric Deep Learning Framework for EEG processing in non-Euclidean metric spaces, particularly the Symmetric Positive Definite (SPD) Manifold space. LGL-BCI offers robust EEG data representation and captures spatial correlations. We propose an EEG channel selection solution via a feature decomposition algorithm to reduce SPD matrix dimensionality, with a lossless transformation boosting inference speed. Extensive experiments show LGL-BCI's superior accuracy and efficiency compared to current solutions, highlighting geometric deep learning's potential in MI-BCI applications. The efficiency, assessed on two public EEG datasets and two real-world EEG devices, significantly outperforms the state-of-the-art solution in accuracy ($82.54\%$ versus $62.22\%$) with fewer parameters (64.9M compared to 183.7M).
Authors: Jayetri Bardhan, Kirk Roberts, Daisy Zhe Wang
Question Answering (QA) systems on patient-related data can assist both clinicians and patients. They can, for example, assist clinicians in decision-making and enable patients to have a better understanding of their medical history. Significant amounts of patient data are stored in Electronic Health Records (EHRs), making EHR QA an important research area. In EHR QA, the answer is obtained from the medical record of the patient. Because of the differences in data format and modality, this differs greatly from other medical QA tasks that employ medical websites or scientific papers to retrieve answers, making it critical to research EHR question answering. This study aimed to provide a methodological review of existing works on QA over EHRs. We searched for articles from January 1st, 2005 to September 30th, 2023 in four digital sources including Google Scholar, ACL Anthology, ACM Digital Library, and PubMed to collect relevant publications on EHR QA. 4111 papers were identified for our study, and after screening based on our inclusion criteria, we obtained a total of 47 papers for further study. Out of the 47 papers, 25 papers were about EHR QA datasets, and 37 papers were about EHR QA models. It was observed that QA on EHRs is relatively new and unexplored. Most of the works are fairly recent. Also, it was observed that emrQA is by far the most popular EHR QA dataset, both in terms of citations and usage in other papers. Furthermore, we identified the different models used in EHR QA along with the evaluation metrics used for these models.
Authors: Herbie Bradley, Andrew Dai, Hannah Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, Grégory Schott, Joel Lehman
In many text-generation problems, users may prefer not only a single response, but a diverse range of high-quality outputs from which to choose. Quality-diversity (QD) search algorithms aim at such outcomes, by continually improving and diversifying a population of candidates. However, the applicability of QD to qualitative domains, like creative writing, has been limited by the difficulty of algorithmically specifying measures of quality and diversity. Interestingly, recent developments in language models (LMs) have enabled guiding search through AI feedback, wherein LMs are prompted in natural language to evaluate qualitative aspects of text. Leveraging this development, we introduce Quality-Diversity through AI Feedback (QDAIF), wherein an evolutionary algorithm applies LMs to both generate variation and evaluate the quality and diversity of candidate text. When assessed on creative writing domains, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities. In this way, QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve, which are among the core skills underlying human society's capacity for innovation.
Authors: Marien Chenaud, José Alves, Frédéric Magoulès
Since the seminal work of [9] and their Physics-Informed neural networks (PINNs), many efforts have been made towards solving partial differential equations (PDEs) with Deep Learning models. However, some challenges remain, for instance the extension of such models to complex three-dimensional geometries, and the study of how such approaches could be combined with classical numerical solvers. In this work, we justify the use of graph neural networks for these problems, based on the similarity between these architectures and the meshes used in traditional numerical techniques for solving partial differential equations. After proving that the Physics-Informed framework has an issue for complex geometries during the computation of PDE residuals, we propose an alternative procedure that combines classical numerical solvers and the Physics-Informed framework. Finally, we propose an implementation of this approach, which we test on a three-dimensional problem on an irregular geometry.
Authors: Yunsheng Ma, Juanwu Lu, Can Cui, Sicheng Zhao, Xu Cao, Wenqian Ye, Ziran Wang
Vehicle-to-vehicle (V2V) communications have greatly enhanced the perception capabilities of connected and automated vehicles (CAVs) by enabling information sharing to "see through the occlusions", resulting in significant performance improvements. However, developing and training complex multi-agent perception models from scratch can be expensive and unnecessary when existing single-agent models show remarkable generalization capabilities. In this paper, we propose a new framework termed MACP, which equips a single-agent pre-trained model with cooperation capabilities. We approach this objective by identifying the key challenges of shifting from single-agent to cooperative settings, adapting the model by freezing most of its parameters and adding a few lightweight modules. We demonstrate in our experiments that the proposed framework can effectively utilize cooperative observations and outperform other state-of-the-art approaches in both simulated and real-world cooperative perception benchmarks while requiring substantially fewer tunable parameters with reduced communication costs. Our source code is available at https://github.com/PurdueDigitalTwin/MACP.
Authors: Shikai Fang, Xin Yu, Shibo Li, Zheng Wang, Robert Kirby, Shandian Zhe
Practical tensor data is often accompanied by time information. Most existing temporal decomposition approaches estimate a set of fixed factors for the objects in each tensor mode, and hence cannot capture the temporal evolution of the objects' representation. More importantly, we lack an effective approach to capture such evolution from streaming data, which is common in real-world applications. To address these issues, we propose Streaming Factor Trajectory Learning (SFTL) for temporal tensor decomposition. We use Gaussian processes (GPs) to model the trajectory of factors so as to flexibly estimate their temporal evolution. To address the computational challenges in handling streaming data, we convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE). We develop an efficient online filtering algorithm to estimate a decoupled running posterior of the involved factor states upon receiving new data. The decoupled estimation enables us to conduct standard Rauch-Tung-Striebel smoothing to compute the full posterior of all the trajectories in parallel, without the need for revisiting any previous data. We have shown the advantage of SFTL in both synthetic tasks and real-world applications. The code is available at https://github.com/xuangu-fang/Streaming-Factor-Trajectory-Learning.
Authors: Elise Walker, Jonas A. Actor, Carianne Martinez, Nathaniel Trask
Causal representation learning algorithms discover lower-dimensional representations of data that admit a decipherable interpretation of cause and effect; as achieving such interpretable representations is challenging, many causal learning algorithms utilize elements indicating prior information, such as (linear) structural causal models, interventional data, or weak supervision. Unfortunately, in exploratory causal representation learning, such elements and prior information may not be available or warranted. Alternatively, scientific datasets often have multiple modalities or physics-based constraints, and the use of such scientific, multimodal data has been shown to improve disentanglement in fully unsupervised settings. Consequently, we introduce a causal representation learning algorithm (causalPIMA) that can use multimodal data and known physics to discover important features with causal relationships. Our innovative algorithm utilizes a new differentiable parametrization to learn a directed acyclic graph (DAG) together with a latent space of a variational autoencoder in an end-to-end differentiable framework via a single, tractable evidence lower bound loss function. We place a Gaussian mixture prior on the latent space and identify each of the mixtures with an outcome of the DAG nodes; this novel identification enables feature discovery with causal relationships. Tested against a synthetic and a scientific dataset, our results demonstrate the capability of learning an interpretable causal structure while simultaneously discovering key features in a fully unsupervised setting.
Authors: Soumyendu Sarkar, Ashwin Ramesh Babu, Sajad Mousavi, Zachariah Carmichael, Vineet Gundecha, Sahand Ghorbanpour, Ricardo Luna Gutierrez, Antonio Guillen, Avisek Naug
We present a novel framework for generating adversarial benchmarks to evaluate the robustness of image classification models. Our framework allows users to customize the types of distortions to be optimally applied to images, which helps address the specific distortions relevant to their deployment. The benchmark can generate datasets at various distortion levels to assess the robustness of different image classifiers. Our results show that the adversarial samples generated by our framework with any of the image classification models, like ResNet-50, Inception-V3, and VGG-16, are effective and transferable to other models causing them to fail. These failures happen even when these models are adversarially retrained using state-of-the-art techniques, demonstrating the generalizability of our adversarial samples. We achieve competitive performance in terms of net $L_2$ distortion compared to state-of-the-art benchmark techniques on CIFAR-10 and ImageNet; however, we demonstrate our framework achieves such results with simple distortions like Gaussian noise without introducing unnatural artifacts or color bleeds. This is made possible by a model-based reinforcement learning (RL) agent and a technique that reduces a deep tree search of the image for model sensitivity to perturbations, to a one-level analysis and action. The flexibility of choosing distortions and setting classification probability thresholds for multiple classes makes our framework suitable for algorithmic audits.
Authors: Johannes Jakubik, Sujit Roy, C. E. Phillips, Paolo Fraccaro, Denys Godwin, Bianca Zadrozny, Daniela Szwarcman, Carlos Gomes, Gabby Nyirjesy, Blair Edwards, Daiki Kimura, Naomi Simumba, Linsong Chu, S. Karthik Mukkavilli, Devyani Lambhate, Kamal Das, Ranjini Bangalore, Dario Oliveira, Michal Muszynski, Kumar Ankur, Muthukumaran Ramasubramanian, Iksha Gurung, Sam Khallaghi, Hanxi (Steve) Li, Michael Cecil, Maryam Ahmadi, Fatemeh Kordi, Hamed Alemohammad, Manil Maskey, Raghu Ganti, Kommy Weldemariam, Rahul Ramachandran
Significant progress in the development of highly adaptable and reusable Artificial Intelligence (AI) models is expected to have a significant impact on Earth science and remote sensing. Foundation models are pre-trained on large unlabeled datasets through self-supervision, and then fine-tuned for various downstream tasks with small labeled datasets. This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive geospatial data. We have utilized this framework to create Prithvi, a transformer-based geospatial foundational model pre-trained on more than 1TB of multispectral satellite imagery from the Harmonized Landsat-Sentinel 2 (HLS) dataset. Our study demonstrates the efficacy of our framework in successfully fine-tuning Prithvi to a range of Earth observation tasks that have not been tackled by previous work on foundation models involving multi-temporal cloud gap imputation, flood mapping, wildfire scar segmentation, and multi-temporal crop segmentation. Our experiments show that the pre-trained model accelerates the fine-tuning process compared to leveraging randomly initialized weights. In addition, pre-trained Prithvi compares well against the state-of-the-art, e.g., outperforming a conditional GAN model in multi-temporal cloud imputation by up to 5pp (or 5.7%) in the structural similarity index. Finally, due to the limited availability of labeled data in the field of Earth observation, we gradually reduce the quantity of available labeled data for refining the model to evaluate data efficiency and demonstrate that data can be decreased significantly without affecting the model's accuracy. The pre-trained 100 million parameter model and corresponding fine-tuning workflows have been released publicly as open source contributions to the global Earth sciences community through Hugging Face.
Authors: Sajad Mousavi, Ricardo Luna Gutiérrez, Desik Rengarajan, Vineet Gundecha, Ashwin Ramesh Babu, Avisek Naug, Antonio Guillen, Soumyendu Sarkar
We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination. This method involves refining model outputs through an ensemble of critics and the model's own feedback. Drawing inspiration from human behavior, we explore whether LLMs can emulate the self-correction process observed in humans who often engage in self-reflection and seek input from others to refine their understanding of complex topics. Our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. We consistently observe performance improvements in LLMs for reducing toxicity and correcting factual errors.
Authors: Shervin Sadat Parsi
We have formulated generative machine learning problems as the time evolution of Parametric Probabilistic Models (PPMs), inherently rendering a thermodynamic process. Then, we have studied the thermodynamic exchange between the model's parameters, denoted as $\Theta$, and the model's generated samples, denoted as $X$. We demonstrate that the training dataset and the action of the Stochastic Gradient Descent (SGD) optimizer serve as a work source that governs the time evolution of these two subsystems. Our findings reveal that the model learns through the dissipation of heat during the generation of samples $X$, leading to an increase in the entropy of the model's parameters, $\Theta$. Thus, the parameter subsystem acts as a heat reservoir, effectively storing the learned information. Furthermore, the role of the model's parameters as a heat reservoir provides valuable thermodynamic insights into the generalization power of over-parameterized models. This approach offers an unambiguous framework for computing information-theoretic quantities within deterministic neural networks by establishing connections with thermodynamic variables. To illustrate the utility of this framework, we introduce two information-theoretic metrics: Memorized-information (M-info) and Learned-information (L-info), which trace the dynamic flow of information during the learning process of PPMs.
Authors: Ryan Rezai, Jason Wang
Closed drafting or "pick and pass" is a popular game mechanic where each round players select a card or other playable element from their hand and pass the rest to the next player. In this paper, we establish first-principle interpretability, generalizability, and memory benchmarks for studying model-free reinforcement learning (RL) algorithms playing closed drafting games. Specifically, we study a popular family of closed drafting games called "Sushi Go Party!", in which we achieve state-of-the-art performance. We fit decision rules to interpret the strategy of trained RL agents and compare these to the ranking preferences of different types of human players, finding easily understandable explanations of the disparate performance of RL agents in this environment. As Sushi Go Party! can be expressed as a set of closely related games based on the set of cards in play, we quantify the generalizability of RL models trained on various sets of cards, establishing key trends between performance and the set distance between the train and evaluation game configurations. Using the explicitly calculable memory of other players' hands in closed drafting games, we create measures of the ability of RL models to learn memory.
Authors: Hyeonsoo Jo, Fanchen Bu, Kijung Shin
How can we find meaningful clusters in a graph robustly against noise edges? Graph clustering (i.e., dividing nodes into groups of similar ones) is a fundamental problem in graph analysis with applications in various fields. Recent studies have demonstrated that graph neural network (GNN) based approaches yield promising results for graph clustering. However, we observe that their performance degenerates significantly on graphs with noise edges, which are prevalent in practice. In this work, we propose MetaGC for robust GNN-based graph clustering. MetaGC employs a decomposable clustering loss function, which can be rephrased as a sum of losses over node pairs. We add a learnable weight to each node pair, and MetaGC adaptively adjusts the weights of node pairs using meta-weighting so that the weights of meaningful node pairs increase and the weights of less-meaningful ones (e.g., noise edges) decrease. We show empirically that MetaGC learns weights as intended and consequently outperforms the state-of-the-art GNN-based competitors, even when they are equipped with separate denoising schemes, on five real-world graphs under varying levels of noise. Our code and datasets are available at https://github.com/HyeonsooJo/MetaGC.
Authors: Edoardo Cignoni, Divya Suman, Jigyasa Nigam, Lorenzo Cupellini, Benedetta Mennucci, Michele Ceriotti
Data-driven techniques are increasingly used to replace electronic-structure calculations of matter. In this context, a relevant question is whether machine learning (ML) should be applied directly to predict the desired properties or be combined explicitly with physically-grounded operations. We present an example of an integrated modeling approach, in which a symmetry-adapted ML model of an effective Hamiltonian is trained to reproduce electronic excitations from a quantum-mechanical calculation. The resulting model can make predictions for molecules that are much larger and more complex than those that it is trained on, and allows for dramatic computational savings by indirectly targeting the outputs of well-converged calculations while using a parameterization corresponding to a minimal atom-centered basis. These results emphasize the merits of intertwining data-driven techniques with physical approximations, improving the transferability and interpretability of ML models without affecting their accuracy and computational efficiency, and providing a blueprint for developing ML-augmented electronic-structure methods.
Authors: Hussein Mozannar, Jimin J Lee, Dennis Wei, Prasanna Sattigeri, Subhro Das, David Sontag
People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules, grounded in data regions and described in natural language, that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space where prior human behavior should be corrected. Each region is then described using a large language model in an iterative and contrastive procedure. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.
Authors: Hanlin Yu, Marcelo Hartmann, Bernardo Williams, Mark Girolami, Arto Klami
Laplace's method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace approximation transforms the Gaussian approximation according to a chosen Riemannian geometry, providing a richer approximation family while still retaining computational efficiency. However, as shown here, its properties depend heavily on the chosen metric; indeed, the metric adopted in previous work results in approximations that are overly narrow and biased even in the limit of infinite data. We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact in the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments.
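As a minimal sketch of the classical (Euclidean) Laplace approximation described in the first sentence, the following one-dimensional example finds the mode of a log-density with Newton's method and uses the negative inverse curvature there as the Gaussian variance. The function name `laplace_approximation`, the finite-difference derivatives, and the Gamma target are illustrative assumptions; the Riemannian variants discussed in the abstract are not implemented here.

```python
import math

def laplace_approximation(log_density, x0, step=1e-4, iters=50):
    """Return (mean, variance) of the Gaussian fitted at the mode of a 1-D log-density."""
    def grad(x):
        return (log_density(x + step) - log_density(x - step)) / (2 * step)

    def hess(x):
        return (log_density(x + step) - 2 * log_density(x)
                + log_density(x - step)) / step ** 2

    x = x0
    for _ in range(iters):
        x -= grad(x) / hess(x)  # Newton step towards the mode
    return x, -1.0 / hess(x)    # variance = negative inverse curvature at the mode

# Unnormalized log-density of a Gamma(shape=5, rate=2) distribution.
def log_gamma(x):
    return 4.0 * math.log(x) - 2.0 * x

mode, variance = laplace_approximation(log_gamma, x0=1.0)
print(mode, variance)
```

For this target the fitted Gaussian is centred at the mode 2 with variance 1, whereas the Gamma distribution itself has mean 2.5 and variance 1.25, which gives a small example of how crude the approximation can be for a skewed target.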
Authors: Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Dongjune Lee, Nam Soo Kim
We introduce a text-to-speech (TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework and benefit from its monotonic alignment constraints. The proposed model first generates aligned semantic tokens using the neural transducer, then synthesizes a speech sample from the semantic tokens using a non-autoregressive (NAR) speech generator. This decoupled framework alleviates the training complexity of TTS and allows each stage to focus on 1) linguistic and alignment modeling and 2) fine-grained acoustic modeling, respectively. Experimental results on zero-shot adaptive TTS show that the proposed model exceeds the baselines in speech quality and speaker similarity on both objective and subjective measures. We also investigate the inference speed and prosody controllability of our proposed model, showing the potential of the neural transducer for TTS frameworks.
Authors: Michelangelo Diligenti, Marco Gori, Marco Maggini, Leonardo Rigutini
In this paper, we propose a general framework to integrate supervised and unsupervised examples with background knowledge expressed by a collection of first-order logic (FOL) clauses into kernel machines. In particular, we consider a multi-task learning scheme where multiple predicates defined on a set of objects are to be jointly learned from examples, enforcing a set of FOL constraints on the admissible configurations of their values. The predicates are defined on the feature spaces in which the input objects are represented, and can be either known a priori or approximated by an appropriate kernel-based learner. A general approach is presented to convert the FOL clauses into a continuous implementation that can deal with the outputs computed by the kernel-based predicates. The learning problem is formulated as a semi-supervised task that requires the optimization in the primal of a loss function that combines a fitting loss measure on the supervised examples, a regularization term, and a penalty term that enforces the constraints on both the supervised and unsupervised examples. Unfortunately, the penalty term is not convex and can hinder the optimization process. However, it is possible to avoid poor solutions by using a two-stage learning scheme, in which the supervised examples are learned first and then the constraints are enforced.
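One way to picture the conversion of an FOL clause into a continuous penalty is sketched below for the single clause "for all x, A(x) implies B(x)". The use of the product t-norm, the averaging over examples as a relaxation of the universal quantifier, and the names `implication_truth` and `clause_penalty` are assumptions made for illustration; the abstract does not state which continuous relaxation the authors adopt.

```python
import numpy as np

def implication_truth(a, b):
    """Degree of truth of A => B, written as not(A and not B) under the product t-norm."""
    return 1.0 - a * (1.0 - b)

def clause_penalty(a, b):
    """Average violation of 'for all x: A(x) => B(x)' over the given examples."""
    return float(np.mean(1.0 - implication_truth(a, b)))

# Predicate outputs in [0, 1] for three objects (labeled or unlabeled).
a = np.array([0.9, 0.1, 0.8])
b = np.array([0.95, 0.2, 0.1])

# The third object (A likely true, B likely false) dominates the penalty.
penalty = clause_penalty(a, b)
print(penalty)
```

Because the penalty only needs predicate outputs and no labels, a term of this kind can be evaluated on unsupervised examples as well. It is also a product of predicate outputs rather than a convex function of them, which is consistent with the non-convexity noted in the abstract.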
Authors: Huixin Zhan, Zijun Zhang
Clinical variant classification of pathogenic versus benign genetic variants remains a pivotal challenge in clinical genetics. Recently, protein language models have improved generic variant effect prediction (VEP) accuracy via weakly supervised or unsupervised training. However, these VEP models are not disease-specific, limiting their adoption at the point of care. To address this problem, we propose a disease-specific protein language model for variant pathogenicity, termed ProPath, to capture the pseudo-log-likelihood ratio in rare missense variants through a siamese network. We evaluate the performance of ProPath against pre-trained language models, using clinical variant sets in inherited cardiomyopathies and arrhythmias that were not seen during training. Our results demonstrate that ProPath surpasses the pre-trained ESM1b with an improvement of over $5\%$ in AUC across both datasets. Furthermore, our model achieved the highest performance among all baselines on both datasets. Thus, ProPath offers potent disease-specific variant effect prediction, particularly valuable for disease associations and clinical applicability.
Authors: Injae Kim, Minhyuk Choi, Hyunwoo J. Kim
Neural Radiance Field (NeRF) has enabled novel view synthesis with high fidelity given images and camera poses. Subsequent works even succeeded in eliminating the need for pose priors by jointly optimizing NeRF and camera pose. However, these works are limited to relatively simple settings, such as photometrically consistent and occluder-free image collections or a sequence of images from a video. As a result, they have difficulty handling unconstrained images with varying illumination and transient occluders. In this paper, we propose UP-NeRF (Unconstrained Pose-prior-free Neural Radiance Fields) to optimize NeRF with unconstrained image collections without camera pose priors. We tackle these challenges with surrogate tasks that optimize color-insensitive feature fields and a separate module for transient occluders to block their influence on pose estimation. In addition, we introduce a candidate head to enable more robust pose estimation and transient-aware depth supervision to minimize the effect of incorrect priors. Our experiments verify the superior performance of our method compared to baselines including BARF and its variants on a challenging internet photo collection, the Phototourism dataset.
Authors: Shantanu Gupta, Cheng Zhang, Agrin Hilmkil
For a given causal question, it is important to efficiently decide which causal inference method to use for a given dataset. This is challenging because causal methods typically rely on complex and difficult-to-verify assumptions, and cross-validation is not applicable since ground truth causal quantities are unobserved. In this work, we propose CAusal Method Predictor (CAMP), a framework for predicting the best method for a given dataset. To this end, we generate datasets from a diverse set of synthetic causal models, score the candidate methods, and train a model to directly predict the highest-scoring method for that dataset. Next, by formulating a self-supervised pre-training objective centered on dataset assumptions relevant for causal inference, we significantly reduce the need for costly labeled data and enhance training efficiency. Our strategy learns to map implicit dataset properties to the best method in a data-driven manner. In our experiments, we focus on method prediction for causal discovery. CAMP outperforms selecting any individual candidate method and demonstrates promising generalization to unseen semi-synthetic and real-world benchmarks.