Interpretability is not Explainability: New Quantitative XAI Approach with a focus on Recommender Systems in Education. (arXiv:2311.02078v1 [cs.IR])

Authors: Riccardo Porcedda

The field of eXplainable Artificial Intelligence faces challenges due to the absence of a widely accepted taxonomy that facilitates the quantitative evaluation of explainability in Machine Learning algorithms. In this paper, we propose a novel taxonomy that addresses the current gap in the literature by providing a clear and unambiguous understanding of the key concepts and relationships in XAI. Our approach is rooted in a systematic analysis of existing definitions and frameworks, with a focus on transparency, interpretability, completeness, complexity and understandability as essential dimensions of explainability. This comprehensive taxonomy aims to establish a shared vocabulary for future research. To demonstrate the utility of our proposed taxonomy, we examine a case study of a Recommender System designed to curate and recommend the most suitable online resources from MERLOT. By employing the SHAP package, we quantify and enhance the explainability of the RS within the context of our newly developed taxonomy.

Design Of Rubble Analyzer Probe Using ML For Earthquake. (arXiv:2311.02087v1 [cs.SD])

Authors: Abhishek Sebastian, R Pragna, K Vishal Vythianathan, Dasaraju Sohan Sai, U Shiva Sri Hari Al, R Anirudh, Apurv Choudhary

The earthquake rubble analyzer uses machine learning to detect human presence via ambient sounds, achieving 97.45% accuracy. It also provides real-time environmental data, aiding in assessing survival prospects for trapped individuals, which is crucial for post-earthquake rescue efforts.

Combining Deep Learning on Order Books with Reinforcement Learning for Profitable Trading. (arXiv:2311.02088v1 [q-fin.CP])

Authors: Koti S. Jaddu, Paul A. Bilokon

High-frequency trading is prevalent: automated decisions must be made quickly to take advantage of price imbalances and patterns in price action that forecast near-future movements. While many algorithms have been explored and tested, analytical methods fail to harness the whole nature of the market environment by focusing on a limited domain. With the ever-growing machine learning field, many large-scale end-to-end studies on raw data have been successfully employed to increase the domain scope for profitable trading but are very difficult to replicate. Combining deep learning on the order books with reinforcement learning is one way of breaking down large-scale end-to-end learning into more manageable and lightweight components for reproducibility, suitable for retail trading.

The following work focuses on forecasting returns across multiple horizons using order flow imbalance and training three temporal-difference learning models for five financial instruments to provide trading signals. The instruments used are two foreign exchange pairs (GBPUSD and EURUSD), two indices (DE40 and FTSE100), and one commodity (XAUUSD). The performances of these 15 agents are evaluated through backtesting simulation, and successful models proceed to forward testing on a retail trading platform. The results show potential, but minor further modifications are required to fully handle retail trading costs, slippage, and spread fluctuation before trading is consistently profitable.

Variational Autoencoders for Noise Reduction in Industrial LLRF Systems. (arXiv:2311.02096v1 [physics.acc-ph])

Authors: J. P. Edelen, M. J. Henderson, J. Einstein-Curtis, C. C. Hall, J. A. Diaz Cruz, A. L. Edelen

Industrial particle accelerators inherently operate in much dirtier environments than typical research accelerators. This leads to an increase in noise both in the RF system and in other electronic systems. Combined with the fact that industrial accelerators are mass-produced, less attention is given to optimizing the performance of an individual system. As a result, industrial systems tend to underperform relative to their hardware capabilities. With the growing demand for accelerators for medical sterilization, food irradiation, cancer treatment, and imaging, improving the signal processing of these machines will increase the margin for the deployment of these systems. Our work focuses on using machine learning techniques to reduce the noise of RF signals used for pulse-to-pulse feedback in industrial accelerators. We will review our algorithms, simulation results, and results working with measured data. We will then discuss next steps for deployment and testing on an industrial system.

A Comprehensive Study on Model Initialization Techniques Ensuring Efficient Federated Learning. (arXiv:2311.02100v1 [cs.LG])

Authors: Ishmeet Kaur, Adwaita Janardhan Jadhav

Advancement in the field of machine learning is unavoidable, but a major concern is preserving the privacy of the users whose data is used to train these machine learning algorithms. Federated learning (FL) has emerged as a promising paradigm for training machine learning models in a distributed and privacy-preserving manner, enabling collaboration on a global model without sharing local data. But starting this learning process on each device in the right way, called ``model initialization'', is critical. The choice of initialization method plays a crucial role in the performance, convergence speed, communication efficiency, and privacy guarantees of federated learning systems. In this survey, we provide a comprehensive study of model initialization techniques in FL. Unlike other studies, our research meticulously compares, categorizes, and delineates the merits and demerits of each technique, examining their applicability across diverse FL scenarios. We highlight how factors like client variability, data non-IIDness, model caliber, security considerations, and network restrictions influence FL model outcomes, and we propose how strategic initialization can address and potentially rectify many such challenges. The motivation behind this survey is to highlight that the right start can help overcome challenges like varying data quality, security issues, and network problems. Our insights provide a foundational base for experts looking to fully utilize FL while understanding the complexities of model initialization.

Solving MaxSAT with Matrix Multiplication. (arXiv:2311.02101v1 [cs.AI])

Authors: David Warde-Farley, Vinod Nair, Yujia Li, Ivan Lobov, Felix Gimeno, Simon Osindero

We propose an incomplete algorithm for Maximum Satisfiability (MaxSAT) specifically designed to run on neural network accelerators such as GPUs and TPUs. Given a MaxSAT problem instance in conjunctive normal form, our procedure constructs a Restricted Boltzmann Machine (RBM) with an equilibrium distribution wherein the probability of a Boolean assignment is exponential in the number of clauses it satisfies. Block Gibbs sampling is used to stochastically search the space of assignments with parallel Markov chains. Since matrix multiplication is the main computational primitive for block Gibbs sampling in an RBM, our approach leads to an elegantly simple algorithm (40 lines of JAX) well-suited for neural network accelerators. Theoretical results about RBMs guarantee that the required number of visible and hidden units of the RBM scale only linearly with the number of variables and constant-sized clauses in the MaxSAT instance, ensuring that the computational cost of a Gibbs step scales reasonably with the instance size. Search throughput can be increased by batching parallel chains within a single accelerator as well as by distributing them across multiple accelerators. As a further enhancement, a heuristic based on unit propagation running on CPU is periodically applied to the sampled assignments. Our approach, which we term RbmSAT, is a new design point in the algorithm-hardware co-design space for MaxSAT. We present timed results on a subset of problem instances from the annual MaxSAT Evaluation's Incomplete Unweighted Track for the years 2018 to 2021. When allotted the same running time and CPU compute budget (but no TPUs), RbmSAT outperforms other participating solvers on problems drawn from three out of the four years' competitions. Given the same running time on a TPU cluster for which RbmSAT is uniquely designed, it outperforms all solvers on problems drawn from all four years.
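
To make the matrix-multiplication core concrete, here is a minimal NumPy sketch of block Gibbs sampling in an RBM with batched parallel chains. The weights below are random placeholders; in RbmSAT they would be constructed from the CNF clauses so that an assignment's probability is exponential in the number of satisfied clauses (the paper's construction, omitted here), and the paper's actual implementation is in JAX.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder RBM parameters; RbmSAT derives these from the CNF clauses.
n_visible, n_hidden, n_chains = 64, 128, 1024
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)
b_h = np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Block Gibbs sampling: each step is two dense matrix multiplications,
# run for many parallel chains at once (the batching the abstract mentions).
v = rng.integers(0, 2, size=(n_chains, n_visible)).astype(float)
for _ in range(100):
    h = (rng.random((n_chains, n_hidden)) < sigmoid(v @ W + b_h)).astype(float)
    v = (rng.random((n_chains, n_visible)) < sigmoid(h @ W.T + b_v)).astype(float)

# Each row of v is a candidate Boolean assignment; RbmSAT would score rows
# by the number of clauses they satisfy and keep the best one.
```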

Relax: Composable Abstractions for End-to-End Dynamic Machine Learning. (arXiv:2311.02103v1 [cs.LG])

Authors: Ruihang Lai, Junru Shao, Siyuan Feng, Steven S. Lyubomirsky, Bohan Hou, Wuwei Lin, Zihao Ye, Hongyi Jin, Yuchen Jin, Jiawei Liu, Lesheng Jin, Yaxing Cai, Ziheng Jiang, Yong Wu, Sunghyun Park, Prakalp Srivastava, Jared G. Roesch, Todd C. Mowry, Tianqi Chen

Dynamic shape computations have become critical in modern machine learning workloads, especially in emerging large language models. The success of these models has driven demand for deploying them to a diverse set of backend environments. In this paper, we present Relax, a compiler abstraction for optimizing end-to-end dynamic machine learning workloads. Relax introduces first-class symbolic shape annotations to track dynamic shape computations globally across the program. It also introduces a cross-level abstraction that encapsulates computational graphs, loop-level tensor programs, and library calls in a single representation to enable cross-level optimizations. We build an end-to-end compilation framework using the proposed approach to optimize dynamic shape models. Experimental results on large language models show that Relax delivers performance competitive with state-of-the-art hand-optimized systems across platforms and enables deployment of emerging dynamic models to a broader set of environments, including mobile phones, embedded devices, and web browsers.

Efficient Symbolic Policy Learning with Differentiable Symbolic Expression. (arXiv:2311.02104v1 [cs.LG])

Authors: Jiaming Guo, Rui Zhang, Shaohui Peng, Qi Yi, Xing Hu, Ruizhi Chen, Zidong Du, Xishan Zhang, Ling Li, Qi Guo, Yunji Chen

Deep reinforcement learning (DRL) has led to a wide range of advances in sequential decision-making tasks. However, the complexity of neural network policies makes them difficult to understand and to deploy with limited computational resources. Currently, employing compact symbolic expressions as symbolic policies is a promising strategy for obtaining simple and interpretable policies. Previous symbolic policy methods usually involve complex training processes and pre-trained neural network policies, which are inefficient and limit the application of symbolic policies. In this paper, we propose an efficient gradient-based learning method, named Efficient Symbolic Policy Learning (ESPL), that learns the symbolic policy from scratch in an end-to-end way. We introduce a symbolic network as the search space and employ a path selector to find the compact symbolic policy. In this way, we represent the policy with a differentiable symbolic expression and train it in an off-policy manner, which further improves efficiency. In addition, in contrast to previous symbolic policies, which only work in single-task RL because of their complexity, we extend ESPL to meta-RL to generate symbolic policies for unseen tasks. Experimentally, we show that our approach generates symbolic policies with higher performance and greatly improves data efficiency for single-task RL. In meta-RL, we demonstrate that, compared with neural network policies, the proposed symbolic policy achieves higher performance and efficiency and shows the potential to be interpretable.

Making Harmful Behaviors Unlearnable for Large Language Models. (arXiv:2311.02105v1 [cs.LG])

Authors: Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, Xuanjing Huang

Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains. To meet the requirements of different applications, LLMs are often customized by further fine-tuning. However, the powerful learning ability of LLMs not only enables them to acquire new tasks but also makes them susceptible to learning undesired behaviors. For example, even safety-aligned LLMs can be easily fine-tuned into harmful assistants, as fine-tuning data often contains implicit or explicit harmful content. Can we train LLMs on harmful data without learning harmful behaviors? This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process. Specifically, we introduce ``security vectors'', a few new parameters that can be separated from the LLM, to ensure that the LLM's responses are consistent with the harmful behavior. Security vectors are activated during fine-tuning; this consistent behavior makes the LLM believe that such behavior has already been learned, so there is no need to optimize further on the harmful data. During inference, we can deactivate the security vectors to restore the LLM's normal behavior. Experimental results show that security vectors generated from 100 harmful samples are enough to prevent the LLM from learning 1000 harmful samples, while preserving the ability to learn other useful information.
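
The abstract does not spell out how security vectors are parameterized, but the activate/deactivate mechanism can be sketched as a small, separable set of parameters added to a hidden state and toggled between fine-tuning and inference. The PyTorch module below is a toy illustration under that assumption, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SecurityVectorLayer(nn.Module):
    """Toy sketch: a separable additive vector on a hidden state.

    Only illustrates the activate-during-fine-tuning /
    deactivate-at-inference mechanism; the paper's parameterization
    may differ.
    """
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.security_vector = nn.Parameter(torch.zeros(hidden_dim))
        self.active = True

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.active:
            return hidden_states + self.security_vector
        return hidden_states  # normal behavior restored at inference

layer = SecurityVectorLayer(hidden_dim=768)
h = torch.randn(1, 16, 768)
out_finetune = layer(h)    # security vector active during fine-tuning
layer.active = False
out_inference = layer(h)   # deactivated: responses revert to normal
```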

Efficient Machine Learning Ensemble Methods for Detecting Gravitational Wave Glitches in LIGO Time Series. (arXiv:2311.02106v1 [cs.LG])

Authors: Elena-Simona Apostol, Ciprian-Octavian Truică

Gravitational Wave (GW) analysis has grown in popularity as technology has advanced and the process of observing gravitational waves has become more precise. Although the sensitivity and the frequency of observation of GW signals are constantly improving, the possibility of noise in the collected GW data remains. In this paper, we propose two new Machine and Deep Learning ensemble approaches (i.e., ShallowWaves and DeepWaves Ensembles) for detecting different types of noise and patterns in datasets from GW observatories. Our research also investigates various Machine and Deep Learning techniques for multi-class classification and provides a comprehensive benchmark, emphasizing the best results in terms of three commonly used performance metrics (i.e., accuracy, precision, and recall). We train and test our models on a dataset consisting of annotated time series from real-world data collected by the Advanced Laser Interferometer GW Observatory (LIGO). We empirically show that the best overall accuracy is obtained by the proposed DeepWaves Ensemble, followed closely by the ShallowWaves Ensemble.
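
The abstract does not list the ensemble members or the combination rule; as a rough illustration of a shallow-learner ensemble for multi-class classification, here is a generic soft-voting ensemble in scikit-learn on stand-in data (the members and data here are assumptions, not the paper's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; the paper uses annotated LIGO time series instead.
X, y = make_classification(n_samples=2000, n_features=40, n_classes=3,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Soft-voting ensemble of shallow learners, in the spirit of a
# ShallowWaves-style ensemble (the actual members are not listed
# in the abstract).
ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000)),
                ("knn", KNeighborsClassifier())],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)
print("accuracy:", ensemble.score(X_te, y_te))
```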

Generative Artificial Intelligence in Healthcare: Ethical Considerations and Assessment Checklist. (arXiv:2311.02107v1 [cs.LG])

Authors: Yilin Ning, Salinelat Teixayavong, Yuqing Shang, Julian Savulescu, Vaishaanth Nagaraj, Di Miao, Mayli Mertens, Daniel Shu Wei Ting, Jasmine Chiat Ling Ong, Mingxuan Liu, Jiuwen Cao, Michael Dunn, Roger Vaughan, Marcus Eng Hock Ong, Joseph Jao-Yiu Sung, Eric J Topol, Nan Liu

The widespread use of ChatGPT and other emerging technology powered by generative artificial intelligence (AI) has drawn much attention to potential ethical issues, especially in high-stakes applications such as healthcare. However, less clear is how to resolve such issues beyond following guidelines and regulations that are still under discussion and development. On the other hand, other types of generative AI have been used to synthesize images and other types of data for research and practical purposes, which have resolved some ethical issues and exposed other ethical issues, but such technology is less often the focus of ongoing ethical discussions. Here we highlight gaps in current ethical discussions of generative AI via a systematic scoping review of relevant existing research in healthcare, and reduce the gaps by proposing an ethics checklist for comprehensive assessment and transparent documentation of ethical discussions in generative AI development. While the checklist can be readily integrated into the current peer review and publication system to enhance generative AI research, it may also be used in broader settings to disclose ethics-related considerations in generative AI-powered products (or real-life applications of such products) to help users establish reasonable trust in their capabilities.

Feature Attribution Explanations for Spiking Neural Networks. (arXiv:2311.02110v1 [cs.NE])

Authors: Elisa Nguyen, Meike Nauta, Gwenn Englebienne, Christin Seifert

Third-generation artificial neural networks, Spiking Neural Networks (SNNs), can be efficiently implemented on hardware. Their implementation on neuromorphic chips opens a broad range of applications, such as machine learning-based autonomous control and intelligent biomedical devices. In critical applications, however, insight into the reasoning of SNNs is important, thus SNNs need to be equipped with the ability to explain how decisions are reached. We present \textit{Temporal Spike Attribution} (TSA), a local explanation method for SNNs. To compute the explanation, we aggregate all information available in model-internal variables: spike times and model weights. We evaluate TSA on artificial and real-world time series data and measure explanation quality w.r.t. multiple quantitative criteria. We find that TSA correctly identifies a small subset of input features relevant to the decision (i.e., is output-complete and compact) and generates similar explanations for similar inputs (i.e., is continuous). Further, our experiments show that incorporating the notion of \emph{absent} spikes improves explanation quality. Our work can serve as a starting point for explainable SNNs, with future implementations on hardware yielding not only predictions but also explanations in a broad range of application scenarios. Source code is available at https://github.com/ElisaNguyen/tsa-explanations.

Towards objective and systematic evaluation of bias in medical imaging AI. (arXiv:2311.02115v1 [cs.CV])

Authors: Emma A.M. Stanley, Raissa Souza, Anthony Winder, Vedant Gulve, Kimberly Amador, Matthias Wilms, Nils D. Forkert

Artificial intelligence (AI) models trained using medical images for clinical tasks often exhibit bias in the form of disparities in performance between subgroups. Since not all sources of biases in real-world medical imaging data are easily identifiable, it is challenging to comprehensively assess how those biases are encoded in models, and how capable bias mitigation methods are at ameliorating performance disparities. In this article, we introduce a novel analysis framework for systematically and objectively investigating the impact of biases in medical images on AI models. We developed and tested this framework for conducting controlled in silico trials to assess bias in medical imaging AI using a tool for generating synthetic magnetic resonance images with known disease effects and sources of bias. The feasibility is showcased by using three counterfactual bias scenarios to measure the impact of simulated bias effects on a convolutional neural network (CNN) classifier and the efficacy of three bias mitigation strategies. The analysis revealed that the simulated biases resulted in expected subgroup performance disparities when the CNN was trained on the synthetic datasets. Moreover, reweighing was identified as the most successful bias mitigation strategy for this setup, and we demonstrated how explainable AI methods can aid in investigating the manifestation of bias in the model using this framework. Developing fair AI models is a considerable challenge given that many and often unknown sources of biases can be present in medical imaging datasets. In this work, we present a novel methodology to objectively study the impact of biases and mitigation strategies on deep learning pipelines, which can support the development of clinical AI that is robust and responsible.

Resist Label Noise with PGM for Graph Neural Networks. (arXiv:2311.02116v1 [cs.LG])

Authors: Qingqing Ge, Jianxiang Yu, Zeyuan Zhao, Xiang Li

While robust graph neural networks (GNNs) have been widely studied for graph perturbation and attack, those for label noise have received significantly less attention. Most existing methods heavily rely on the label smoothness assumption to correct noisy labels, which adversely affects their performance on heterophilous graphs. Further, they generally perform poorly in high noise-rate scenarios. To address these problems, in this paper, we propose a novel probabilistic graphical model (PGM) based framework LNP. Given a noisy label set and a clean label set, our goal is to maximize the likelihood of labels in the clean set. We first present LNP-v1, which generates clean labels based on graphs only in the Bayesian network. To further leverage the information of clean labels in the noisy label set, we put forward LNP-v2, which incorporates the noisy label set into the Bayesian network to generate clean labels. The generative process can then be used to predict labels for unlabeled nodes. We conduct extensive experiments to show the robustness of LNP on varying noise types and rates, and also on graphs with different heterophilies. In particular, we show that LNP can lead to inspiring performance in high noise-rate situations.

Cooperative Network Learning for Large-Scale and Decentralized Graphs. (arXiv:2311.02117v1 [cs.LG])

Authors: Qiang Wu, Yiming Huang, Yujie Zeng, Yujie Teng, Fang Zhou, Linyuan Lü

Graph research, the systematic study of interconnected data points represented as graphs, plays a vital role in capturing intricate relationships within networked systems. However, in the real world, as graphs scale up, concerns about data security among different data-owning agencies arise, hindering information sharing and, ultimately, the utilization of graph data. Therefore, establishing a mutual trust mechanism among graph agencies is crucial for unlocking the full potential of graphs. Here, we introduce a Cooperative Network Learning (CNL) framework to ensure secure graph computing for various graph tasks. Essentially, this CNL framework unifies the local and global perspectives of GNN computing with distributed data for an agency by virtually connecting all participating agencies as a global graph without a fixed central coordinator. Inter-agency computing is protected by various technologies inherent in our framework, including homomorphic encryption and secure transmission. Moreover, each agency has a fair right to design or employ various graph learning models from its local or global perspective. Thus, CNL can collaboratively train GNN models based on decentralized graphs inferred from local and global graphs. Experiments on contagion dynamics prediction and traditional graph tasks (i.e., node classification and link prediction) demonstrate that our CNL architecture outperforms state-of-the-art GNNs developed at individual sites, revealing that CNL can provide a reliable, fair, secure, privacy-preserving, and global perspective to build effective and personalized models for network applications. We hope this framework will address privacy concerns in graph-related research and integrate decentralized graph data structures to benefit the network research community in cooperation and innovation.

RigLSTM: Recurrent Independent Grid LSTM for Generalizable Sequence Learning. (arXiv:2311.02123v1 [cs.LG])

Authors: Ziyu Wang, Wenhao Jiang, Zixuan Zhang, Wei Tang, Junchi Yan

Sequential processes in the real world often comprise a combination of simple subsystems that interact with each other in certain forms. Learning such a modular structure can often improve robustness against environmental changes. In this paper, we propose the Recurrent Independent Grid LSTM (RigLSTM), composed of a group of independent LSTM cells that cooperate with each other, for exploiting the underlying modular structure of the target task. Our model adopts cell selection, input feature selection, hidden state selection, and soft state updating to achieve better generalization, on the basis of the recent Grid LSTM, for tasks where some factors differ between training and evaluation. Specifically, at each time step, only a fraction of the cells are activated, and the activated cells select relevant inputs and cells to communicate with. At the end of a time step, the hidden states of the activated cells are updated by considering the relevance between the inputs and the hidden states from the last and current time steps. Extensive experiments on diversified sequential modeling tasks show superior generalization ability when there are changes in the testing environment. Source code is available at \url{https://github.com/ziyuwwang/rig-lstm}.
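
As an illustration of the cell-selection idea only (input selection, hidden-state selection, and soft updating are omitted), the following PyTorch sketch activates and updates just the top-k of a group of independent LSTM cells at each step. The relevance scorer is a placeholder assumption, not the paper's mechanism:

```python
import torch
import torch.nn as nn

class IndependentCells(nn.Module):
    """Sketch: a group of independent LSTM cells of which only the top-k
    (by a learned relevance score against the input) are activated and
    updated at each time step."""
    def __init__(self, n_cells=6, k=3, input_dim=16, hidden_dim=32):
        super().__init__()
        self.cells = nn.ModuleList(nn.LSTMCell(input_dim, hidden_dim)
                                   for _ in range(n_cells))
        self.scorer = nn.Linear(input_dim, n_cells)  # placeholder relevance score
        self.k = k

    def forward(self, x, states):
        active = torch.topk(self.scorer(x).mean(0), self.k).indices
        for i in active.tolist():        # only activated cells update
            h, c = states[i]
            states[i] = self.cells[i](x, (h, c))
        return states

model = IndependentCells()
states = [(torch.zeros(4, 32), torch.zeros(4, 32)) for _ in range(6)]
states = model(torch.randn(4, 16), states)
```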

Sliced Denoising: A Physics-Informed Molecular Pre-Training Method. (arXiv:2311.02124v1 [q-bio.BM])

Authors: Yuyan Ni, Shikun Feng, Wei-Ying Ma, Zhi-Ming Ma, Yanyan Lan

While molecular pre-training has shown great potential in enhancing drug discovery, the lack of a solid physical interpretation in current methods raises concerns about whether the learned representation truly captures the underlying explanatory factors in observed data, ultimately resulting in limited generalization and robustness. Although denoising methods offer a physical interpretation, their accuracy is often compromised by ad-hoc noise design, leading to inaccurate learned force fields. To address this limitation, this paper proposes a new method for molecular pre-training, called sliced denoising (SliDe), which is based on the classical mechanical intramolecular potential theory. SliDe utilizes a novel noise strategy that perturbs bond lengths, angles, and torsion angles to achieve better sampling over conformations. Additionally, it introduces a random slicing approach that circumvents the computationally expensive calculation of the Jacobian matrix, which is otherwise essential for estimating the force field. By aligning with physical principles, SliDe shows a 42\% improvement in the accuracy of estimated force fields compared to current state-of-the-art denoising methods, and thus outperforms traditional baselines on various molecular property prediction tasks.

Using General Value Functions to Learn Domain-Backed Inventory Management Policies. (arXiv:2311.02125v1 [cs.LG])

Authors: Durgesh Kalwar, Omkar Shelke, Harshad Khadilkar

We consider the inventory management problem, where the goal is to balance conflicting objectives such as availability and wastage of a large range of products in a store. We propose a reinforcement learning (RL) approach that utilises General Value Functions (GVFs) to derive domain-backed inventory replenishment policies. The inventory replenishment decisions are modelled as a sequential decision making problem, which is challenging due to uncertain demand and the existence of aggregate (cross-product) constraints. In existing literature, GVFs have primarily been used for auxiliary task learning. We use this capability to train GVFs on domain-critical characteristics such as prediction of stock-out probability and wastage quantity. Using this domain expertise for more effective exploration, we train an RL agent to compute the inventory replenishment quantities for a large range of products (up to 6000 in the reported experiments), which share aggregate constraints such as the total weight/volume per delivery. Additionally, we show that the GVF predictions can be used to provide additional domain-backed insights into the decisions proposed by the RL agent. Finally, since the environment dynamics are fully transferred, the trained GVFs can be used for faster adaptation to vastly different business objectives (for example, due to the start of a promotional period or due to deployment in a new customer environment).
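
As a sketch of the GVF ingredient, the snippet below learns a discounted stock-out prediction with linear TD(0); the features, stock-out signal, and discount are toy stand-ins rather than the paper's inventory simulator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal GVF sketch: learn the discounted stock-out measure
# (cumulant = 1 on a stock-out, else 0) with linear TD(0).
n_features, gamma, alpha = 8, 0.9, 0.05
w = np.zeros(n_features)

def features(state):
    return state  # assume the state is already a feature vector

state = rng.random(n_features)
for _ in range(10_000):
    next_state = rng.random(n_features)
    cumulant = float(rng.random() < 0.1)  # toy stand-in: 1 iff a stock-out occurred
    td_error = cumulant + gamma * w @ features(next_state) - w @ features(state)
    w += alpha * td_error * features(state)
    state = next_state

print("predicted discounted stock-out measure:", w @ features(state))
```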

A Systematic Review of Deep Graph Neural Networks: Challenges, Classification, Architectures, Applications & Potential Utility in Bioinformatics. (arXiv:2311.02127v1 [cs.LG])

Authors: Adil Mudasir Malla, Asif Ali Banka

In recent years, tasks of machine learning ranging from image processing & audio/video analysis to natural language understanding have been transformed by deep learning. The data content in all these scenarios are expressed via Euclidean space. However, a considerable amount of application data is structured in non-Euclidean space and is expressed as graphs, e.g. dealing with complicated interactions & object interdependencies. Modelling physical systems, learning molecular signatures, identifying protein interactions and predicting diseases involve utilising a model that can adapt from graph data. Graph neural networks (GNNs), specified as artificial-neural models, employ message transmission between graph nodes to represent graph dependencies and are primarily used in the non-Euclidean domain. Variants of GNN like Graph Recurrent Networks (GRN), Graph Auto Encoder (GAE), Graph Convolution Networks (GCN), Graph Adversarial Methods & Graph Reinforcement learning have exhibited breakthrough productivity on a wide range of tasks, especially in the field of bioinformatics, in recent years as a result of the rapid collection of biological network data. Apart from presenting all existing GNN models, mathematical analysis and comparison of the variants of all types of GNN have been highlighted in this survey. Graph neural networks are investigated for their potential real-world applications in various fields, focusing on Bioinformatics. Furthermore, resources for evaluating graph neural network models and accessing open-source code & benchmark data sets are included. Ultimately, we provide some (seven) proposals for future research in this rapidly evolving domain. GNNs have the potential to be an excellent tool for solving a wide range of biological challenges in bioinformatics research, as they are best represented as connected complex graphs.

Hierarchical Reinforcement Learning for Power Network Topology Control. (arXiv:2311.02129v1 [cs.LG])

Authors: Blazej Manczak, Jan Viebahn, Herke van Hoof

Learning in high-dimensional action spaces is a key challenge in applying reinforcement learning (RL) to real-world systems. In this paper, we study the possibility of controlling power networks using RL methods. Power networks are critical infrastructures that are complex to control. In particular, the combinatorial nature of the action space poses a challenge to both conventional optimizers and learned controllers. Hierarchical reinforcement learning (HRL) represents one approach to address this challenge. More precisely, an HRL framework for power network topology control is proposed. The HRL framework consists of three levels of action abstraction. At the highest level, there is the overall long-term task of power network operation, namely, keeping the power grid state within security constraints at all times, which is decomposed into two temporally extended actions: 'do nothing' versus 'propose a topology change'. At the intermediate level, the action space consists of all controllable substations. Finally, at the lowest level, the action space consists of all configurations of the chosen substation. By employing this HRL framework, several hierarchical power network agents are trained for the IEEE 14-bus network. Whereas at the highest level a purely rule-based policy is still chosen for all agents in this study, at the intermediate level the policy is trained using different state-of-the-art RL algorithms. At the lowest level, either an RL algorithm or a greedy algorithm is used. The performance of the different 3-level agents is compared with standard baseline (RL or greedy) approaches. A key finding is that the 3-level agent that employs RL at both the intermediate and the lowest level outperforms all other agents on the most difficult task. Our code is publicly available.
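
The three-level decomposition can be sketched as follows; the threshold rule, substation chooser, and configuration chooser below are placeholders standing in for the rule-based, RL, and RL-or-greedy components described above:

```python
import random

# Sketch of the three-level action hierarchy. The line-loading threshold,
# grid-state fields, and random choices are illustrative stand-ins for the
# trained policies in the paper.
def high_level(grid_state):
    # rule-based: only act when the grid approaches its security limits
    return "propose_change" if grid_state["max_line_load"] > 0.95 else "do_nothing"

def mid_level(grid_state, substations):
    return random.choice(substations)             # trained RL policy in the paper

def low_level(grid_state, substation):
    return random.choice(substation["configs"])   # RL or greedy in the paper

grid_state = {"max_line_load": 0.97}
substations = [{"id": i, "configs": [f"cfg{j}" for j in range(4)]}
               for i in range(14)]                # IEEE 14-bus style setup
if high_level(grid_state) == "propose_change":
    sub = mid_level(grid_state, substations)
    print("reconfigure substation", sub["id"], "to", low_level(grid_state, sub))
```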

Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning. (arXiv:2311.02130v1 [cs.LG])

Authors: Bibo Wu, Fang Fang, Xianbin Wang, Donghong Cai, Shu Fu, Zhiguo Ding

Hierarchical federated learning (HFL) shows great advantages over conventional two-layer federated learning (FL) in reducing network overhead and interaction latency while still retaining the data privacy of distributed FL clients. However, the communication and energy overhead still pose a bottleneck for HFL performance, especially as the number of clients rises dramatically. To tackle this issue, we propose a non-orthogonal multiple access (NOMA) enabled HFL system under semi-synchronous cloud model aggregation in this paper, aiming to minimize the total cost of time and energy at each HFL global round. Specifically, we first propose a novel fuzzy-logic-based client orchestration policy considering client heterogeneity in multiple aspects, including channel quality, data quantity, and model staleness. Subsequently, given the fuzzy-logic-based client-edge association, a joint edge server scheduling and resource allocation problem is formulated. Utilizing problem decomposition, we first derive the closed-form solution for the edge server scheduling subproblem via the penalty dual decomposition (PDD) method. Next, a deep deterministic policy gradient (DDPG) based algorithm is proposed to tackle the resource allocation subproblem considering time-varying environments. Finally, extensive simulations demonstrate that the proposed scheme outperforms the considered benchmarks regarding HFL performance improvement and total cost reduction.

Sparse Training of Discrete Diffusion Models for Graph Generation. (arXiv:2311.02142v1 [cs.LG])

Authors: Yiming Qin, Clement Vignac, Pascal Frossard

Generative models for graphs often encounter scalability challenges due to the inherent need to predict interactions for every node pair. Despite the sparsity often exhibited by real-world graphs, the unpredictable sparsity patterns of their adjacency matrices, stemming from their unordered nature, lead to quadratic computational complexity. In this work, we introduce SparseDiff, a denoising diffusion model for graph generation that is able to exploit sparsity during its training phase. At the core of SparseDiff is a message-passing neural network tailored to predict only a subset of edges during each forward pass. When combined with a sparsity-preserving noise model, this model can efficiently work with edge-list representations of graphs, paving the way for scalability to much larger structures. During the sampling phase, SparseDiff iteratively populates the adjacency matrix from its prior state, ensuring prediction of the full graph while controlling memory utilization. Experimental results show that SparseDiff matches the state of the art in generation performance on both small and large graphs, highlighting the versatility of our method.

Pairing-based graph neural network for simulating quantum materials. (arXiv:2311.02143v1 [cond-mat.str-el])

Authors: Di Luo, David D. Dai, Liang Fu

We introduce a pairing-based graph neural network, $\textit{GemiNet}$, for simulating quantum many-body systems. Our architecture augments a BCS mean-field wavefunction with a generalized pair amplitude parameterized by a graph neural network. Variational Monte Carlo with GemiNet simultaneously provides an accurate, flexible, and scalable method for simulating many-electron systems. We apply GemiNet to two-dimensional semiconductor electron-hole bilayers and obtain highly accurate results on a variety of interaction-induced phases, including the exciton Bose-Einstein condensate, electron-hole superconductor, and bilayer Wigner crystal. Our study demonstrates the potential of physically-motivated neural network wavefunctions for quantum materials simulations.

Bayesian Optimization of Function Networks with Partial Evaluations. (arXiv:2311.02146v1 [stat.ML])

Authors: Poompol Buathong, Jiayue Wan, Samuel Daulton, Raul Astudillo, Maximilian Balandat, Peter I. Frazier

Bayesian optimization is a framework for optimizing functions that are costly or time-consuming to evaluate. Recent work has considered Bayesian optimization of function networks (BOFN), where the objective function is computed via a network of functions, each taking as input the output of previous nodes in the network and additional parameters. Exploiting this network structure has been shown to yield significant performance improvements. Existing BOFN algorithms for general-purpose networks are required to evaluate the full network at each iteration. However, many real-world applications allow evaluating nodes individually. To take advantage of this opportunity, we propose a novel knowledge gradient acquisition function for BOFN that chooses which node to evaluate as well as the inputs for that node in a cost-aware fashion. This approach can dramatically reduce query costs by allowing the evaluation of part of the network at a lower cost relative to evaluating the entire network. We provide an efficient approach to optimizing our acquisition function and show it outperforms existing BOFN methods and other benchmarks across several synthetic and real-world problems. Our acquisition function is the first to enable cost-aware optimization of a broad class of function networks.
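
To illustrate why partial evaluations help, consider a toy two-node network f = g(h(x)) where querying h alone is much cheaper than evaluating the full network: a cost-aware procedure can screen many candidate inputs with h and spend the expensive g evaluations only on promising ones. The functions, costs, and screening rule below are illustrative stand-ins, not the paper's knowledge-gradient acquisition:

```python
import numpy as np

# A two-node function network f = g(h(x)). Evaluating h costs cost_h;
# pushing its output through g costs cost_g more. All values are toys.
cost_h, cost_g = 1.0, 10.0

def h(x):  # intermediate node
    return np.sin(3 * x) + x

def g(u):  # final node
    return -(u - 1.0) ** 2

budget, spent = 30.0, 0.0
observations = []  # (x, h(x), f(x) or None)
rng = np.random.default_rng(0)
while spent + cost_h <= budget:
    x = rng.random()
    u = h(x); spent += cost_h                  # cheap partial evaluation
    if u > 0.5 and spent + cost_g <= budget:   # only promising outputs reach g
        observations.append((x, u, g(u))); spent += cost_g
    else:
        observations.append((x, u, None))

print(f"spent {spent}, full evaluations: "
      f"{sum(o[2] is not None for o in observations)}")
```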

The Alignment Problem in Context. (arXiv:2311.02147v1 [cs.LG])

Authors: Raphaël Millière

A core challenge in the development of increasingly capable AI systems is to make them safe and reliable by ensuring their behaviour is consistent with human values. This challenge, known as the alignment problem, does not merely apply to hypothetical future AI systems that may pose catastrophic risks; it already applies to current systems, such as large language models, whose potential for harm is rapidly increasing. In this paper, I assess whether we are on track to solve the alignment problem for large language models, and what that means for the safety of future AI systems. I argue that existing strategies for alignment are insufficient, because large language models remain vulnerable to adversarial attacks that can reliably elicit unsafe behaviour. I offer an explanation of this lingering vulnerability on which it is not simply a contingent limitation of current language models, but has deep technical ties to a crucial aspect of what makes these models useful and versatile in the first place -- namely, their remarkable aptitude to learn "in context" directly from user instructions. It follows that the alignment problem is not only unsolved for current AI systems, but may be intrinsically difficult to solve without severely undermining their capabilities. Furthermore, this assessment raises concerns about the prospect of ensuring the safety of future and more capable AI systems.

Emergence of Abstract State Representations in Embodied Sequence Modeling. (arXiv:2311.02171v1 [cs.LG])

Authors: Tian Yun, Zilai Zeng, Kunal Handa, Ashish V Thapliyal, Bo Pang, Ellie Pavlick, Chen Sun

Decision making via sequence modeling aims to mimic the success of language models, where actions taken by an embodied agent are modeled as tokens to predict. Despite their promising performance, it remains unclear if embodied sequence modeling leads to the emergence of internal representations that represent the environmental state information. A model that lacks abstract state representations would be liable to make decisions based on surface statistics which fail to generalize. We take the BabyAI environment, a grid world in which language-conditioned navigation tasks are performed, and build a sequence modeling Transformer, which takes a language instruction, a sequence of actions, and environmental observations as its inputs. In order to investigate the emergence of abstract state representations, we design a "blindfolded" navigation task, where only the initial environmental layout, the language instruction, and the action sequence to complete the task are available for training. Our probing results show that intermediate environmental layouts can be reasonably reconstructed from the internal activations of a trained model, and that language instructions play a role in the reconstruction accuracy. Our results suggest that many key features of state representations can emerge via embodied sequence modeling, supporting an optimistic outlook for applications of sequence modeling objectives to more complex embodied decision-making domains.

Joint Problems in Learning Multiple Dynamical Systems. (arXiv:2311.02181v1 [math.OC])

Authors: Mengjia Niu, Xiaoyu He, Petr Rysavy, Quan Zhou, Jakub Marecek

Clustering of time series is a well-studied problem, with applications ranging from quantitative, personalized models of metabolism obtained from metabolite concentrations to state discrimination in quantum information theory. We consider a variant, where given a set of trajectories and a number of parts, we jointly partition the set of trajectories and learn linear dynamical system (LDS) models for each part, so as to minimize the maximum error across all the models. We present globally convergent methods and EM heuristics, accompanied by promising computational results.
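
A minimal sketch of the EM-style heuristic: alternate between assigning each trajectory to the LDS that fits it best and refitting each part's LDS by least squares, then report the maximum error across models. The synthetic data and the simple x_{t+1} = A x_t model class are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_lds(trajs):
    """Least-squares fit of x_{t+1} = A x_t over a set of trajectories."""
    X = np.vstack([t[:-1] for t in trajs])
    Y = np.vstack([t[1:] for t in trajs])
    M, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return M.T  # dynamics matrix A

def max_error(A, traj):
    return np.max(np.linalg.norm(traj[1:] - traj[:-1] @ A.T, axis=1))

# Toy data: trajectories generated by two different linear systems.
A_true = [np.array([[0.9, 0.1], [0.0, 0.8]]),
          np.array([[0.5, -0.4], [0.4, 0.5]])]
trajs = []
for k in range(2):
    for _ in range(10):
        t = [rng.normal(size=2)]
        for _ in range(30):
            t.append(A_true[k] @ t[-1] + 0.01 * rng.normal(size=2))
        trajs.append(np.array(t))

# Alternate between refitting each part's LDS and reassigning trajectories.
labels = rng.integers(0, 2, size=len(trajs))
for _ in range(20):
    models = []
    for k in range(2):
        members = [t for t, l in zip(trajs, labels) if l == k]
        models.append(fit_lds(members if members else trajs))  # guard empty part
    labels = np.array([np.argmin([max_error(A, t) for A in models]) for t in trajs])

print("max error across all models:",
      max(max_error(models[l], t) for t, l in zip(trajs, labels)))
```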

Automating Governing Knowledge Commons and Contextual Integrity (GKC-CI) Privacy Policy Annotations with Large Language Models. (arXiv:2311.02192v1 [cs.CY])

Authors: Jake Chanenson, Madison Pickering, Noah Apthorpe

Identifying contextual integrity (CI) and governing knowledge commons (GKC) parameters in privacy policy texts can facilitate normative privacy analysis. However, GKC-CI annotation has heretofore required manual or crowdsourced effort. This paper demonstrates that high-accuracy GKC-CI parameter annotation of privacy policies can be performed automatically using large language models. We fine-tune 18 open-source and proprietary models on 21,588 GKC-CI annotations from 16 ground truth privacy policies. Our best-performing model (fine-tuned GPT-3.5 Turbo with prompt engineering) has an accuracy of 86%, exceeding the performance of prior crowdsourcing approaches despite the complexity of privacy policy texts and the nuance of the GKC-CI annotation task. We apply our best-performing model to privacy policies from 164 popular online services, demonstrating the effectiveness of scaling GKC-CI annotation for data exploration. We make all annotated policies as well as the training data and scripts needed to fine-tune our best-performing model publicly available for future research.

AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation. (arXiv:2311.02194v1 [cs.LG])

Authors: Daiki E. Matsunaga, Jongmin Lee, Jaeseok Yoon, Stefanos Leonardos, Pieter Abbeel, Kee-Eung Kim

One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy. This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement, as their presence can lead to substantial performance degradation. This challenge is amplified in the offline Multi-Agent RL (MARL) setting, since the joint action space grows exponentially with the number of agents. To avoid this curse of dimensionality, existing MARL methods adopt either value decomposition methods or fully decentralized training of individual agents. However, even when combined with standard conservatism principles, these methods can still result in the selection of OOD joint actions in offline MARL. To address this, we introduce AlberDICE, an offline MARL algorithm that alternately performs centralized training of individual agents based on stationary distribution optimization. AlberDICE circumvents the exponential complexity of MARL by computing the best response of one agent at a time while effectively avoiding OOD joint action selection. Theoretically, we show that the alternating optimization procedure converges to Nash policies. In the experiments, we demonstrate that AlberDICE significantly outperforms baseline algorithms on a standard suite of MARL benchmarks.

Imitation Bootstrapped Reinforcement Learning. (arXiv:2311.02198v1 [cs.LG])

Authors: Hengyuan Hu, Suvir Mirchandani, Dorsa Sadigh

Despite the considerable potential of reinforcement learning (RL), robotics control tasks predominantly rely on imitation learning (IL) owing to its better sample efficiency. However, given the high cost of collecting extensive demonstrations, RL is still appealing if it can utilize limited imitation data for efficient autonomous self-improvement. Existing RL methods that utilize demonstrations either initialize the replay buffer with demonstrations and oversample them during RL training, which does not benefit from the generalization potential of modern IL methods, or pretrain the RL policy with IL on the demonstrations, which requires additional mechanisms to prevent catastrophic forgetting during RL fine-tuning. We propose imitation bootstrapped reinforcement learning (IBRL), a novel framework that first trains an IL policy on a limited number of demonstrations and then uses it to propose alternative actions for both online exploration and target value bootstrapping. IBRL achieves SoTA performance and sample efficiency on 7 challenging sparse reward continuous control tasks in simulation while learning directly from pixels. As a highlight of our method, IBRL achieves $6.4\times$ higher success rate than RLPD, a strong method that combines the idea of oversampling demonstrations with modern RL improvements, under the budget of 10 demos and 100K interactions in the challenging PickPlaceCan task in the Robomimic benchmark.
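
The core mechanism can be sketched in a few lines: the IL policy proposes an alternative action, and the agent acts with whichever proposal the critic scores higher (the abstract says the same comparison is used for target value bootstrapping). All networks below are toy stand-ins for the trained policies and critic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins; in IBRL these are an imitation policy trained on demos
# and the RL policy/critic being learned online.
def il_policy(obs):
    return np.tanh(obs[:4])                # hypothetical IL proposal

def rl_policy(obs):
    return np.clip(obs[:4] * 0.5, -1, 1)   # hypothetical RL proposal

def q_value(obs, act):
    return -np.sum((act - obs[:4]) ** 2)   # hypothetical critic

def ibrl_action(obs):
    """Act with whichever proposal the critic prefers; the same
    comparison would appear in the bootstrapped critic target."""
    a_il, a_rl = il_policy(obs), rl_policy(obs)
    return a_il if q_value(obs, a_il) >= q_value(obs, a_rl) else a_rl

obs = rng.normal(size=8)
print(ibrl_action(obs))
```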

Joint Composite Latent Space Bayesian Optimization. (arXiv:2311.02213v1 [cs.LG])

Authors: Natalie Maus, Zhiyuan Jerry Lin, Maximilian Balandat, Eytan Bakshy

Bayesian Optimization (BO) is a technique for sample-efficient black-box optimization that employs probabilistic models to identify promising input locations for evaluation. When dealing with composite-structured functions, such as f = g ∘ h, evaluating a specific location x yields observations of both the final outcome f(x) = g(h(x)) and the intermediate output(s) h(x). Previous research has shown that integrating information from these intermediate outputs can enhance BO performance substantially. However, existing methods struggle if the outputs h(x) are high-dimensional. Many relevant problems fall into this setting, including in the context of generative AI, molecular design, and robotics. To effectively tackle these challenges, we introduce Joint Composite Latent Space Bayesian Optimization (JoCo), a novel framework that jointly trains neural network encoders and probabilistic models to adaptively compress high-dimensional input and output spaces into manageable latent representations. This enables viable BO on these compressed representations, allowing JoCo to outperform other state-of-the-art methods in high-dimensional BO on a wide variety of simulated and real-world problems.

Towards model-free RL algorithms that scale well with unstructured data. (arXiv:2311.02215v1 [cs.LG])

Authors: Joseph Modayil, Zaheer Abbas

Conventional reinforcement learning (RL) algorithms exhibit broad generality in their theoretical formulation and high performance on several challenging domains when combined with powerful function approximation. However, developing RL algorithms that perform well across problems with unstructured observations at scale remains challenging because most function approximation methods rely on externally provisioned knowledge about the structure of the input for good performance (e.g. convolutional networks, graph neural networks, tile-coding). A common practice in RL is to evaluate algorithms on a single problem, or on problems with limited variation in the observation scale. RL practitioners lack a systematic way to study how well a single RL algorithm performs when instantiated across a range of problem scales, and they lack function approximation techniques that scale well with unstructured observations.

We address these limitations by providing environments and algorithms to study scaling for unstructured observation vectors and flat action spaces. We introduce a family of combinatorial RL problems with an exponentially large state space and high-dimensional dynamics, but where linear computation is sufficient to learn a (nonlinear) value function estimate for performant control. We provide an algorithm that constructs reward-relevant general value function (GVF) questions to find and exploit predictive structure directly from the experience stream. In an empirical evaluation of the approach on synthetic problems, we observe a sample complexity that scales linearly with the observation size. The proposed algorithm reliably outperforms a conventional deep RL algorithm on these scaling problems and exhibits several desirable auxiliary properties. These results suggest new algorithmic mechanisms by which algorithms can learn at scale from unstructured data.

Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data. (arXiv:2311.02216v1 [cs.CL])

Authors: Mubashara Akhtar, Abhilash Shankarampeta, Vivek Gupta, Arpit Patil, Oana Cocarascu, Elena Simperl

Numbers are crucial for various real-world domains such as finance, economics, and science. Thus, understanding and reasoning with numbers are essential skills for language models to solve different tasks. While various numerical benchmarks have been introduced in recent years, they are mostly limited to specific numerical aspects. In this paper, we propose a hierarchical taxonomy for numerical reasoning skills with more than ten reasoning types across four levels: representation, number sense, manipulation, and complex reasoning. We conduct a comprehensive evaluation of state-of-the-art models to identify reasoning challenges specific to them, and we develop a diverse set of numerical probes employing a semi-automated approach. We focus on the tabular Natural Language Inference (TNLI) task as a case study and measure models' performance shifts. Our results show that no model consistently excels across all numerical reasoning types. Among the probed models, FlanT5 (few-/zero-shot) and GPT-3.5 (few-shot) demonstrate strong overall numerical reasoning skills compared to other models. Label-flipping probes indicate that models often exploit dataset artifacts to predict the correct labels.

Structured Neural Networks for Density Estimation and Causal Inference. (arXiv:2311.02221v1 [cs.LG])

Authors: Asic Q. Chen, Ruian Shi, Xiang Gao, Ricardo Baptista, Rahul G. Krishnan

Injecting structure into neural networks enables learning functions that satisfy invariances with respect to subsets of inputs. For instance, when learning generative models using neural networks, it is advantageous to encode the conditional independence structure of observed variables, often in the form of Bayesian networks. We propose the Structured Neural Network (StrNN), which injects structure through masking pathways in a neural network. The masks are designed via a novel relationship we explore between neural network architectures and binary matrix factorization, to ensure that the desired independencies are respected. We devise and study practical algorithms for this otherwise NP-hard design problem based on novel objectives that control the model architecture. We demonstrate the utility of StrNN in three applications: (1) binary and Gaussian density estimation with StrNN, (2) real-valued density estimation with Structured Autoregressive Flows (StrAFs) and Structured Continuous Normalizing Flows (StrCNF), and (3) interventional and counterfactual analysis with StrAFs for causal inference. Our work opens up new avenues for learning neural networks that enable data-efficient generative modeling and the use of normalizing flows for causal effect estimation.
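
The masking idea can be sketched with a linear layer whose weight matrix is multiplied elementwise by a binary mask, so forbidden input-output pathways are zeroed out. StrNN's contribution is designing the per-layer masks (via binary matrix factorization) so that their product realizes a target independence structure; in this sketch a mask is simply given:

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Linear):
    """Linear layer whose weights are elementwise-masked, so output j can
    only depend on inputs i with mask[j, i] = 1. StrNN designs these masks
    per layer so their product matches a target adjacency; here the mask
    is supplied directly."""
    def __init__(self, in_features, out_features, mask):
        super().__init__(in_features, out_features)
        self.register_buffer("mask", torch.as_tensor(mask, dtype=torch.float32))

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

# Example: a lower-triangular-style mask as used in autoregressive density
# estimators; output 2 is forced to be independent of input 0.
mask = torch.tensor([[1., 0., 0.],
                     [1., 1., 0.],
                     [0., 1., 1.]])
layer = MaskedLinear(3, 3, mask)
print(layer(torch.randn(4, 3)).shape)
```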

Multi-scale Time-stepping of Partial Differential Equations with Transformers. (arXiv:2311.02225v1 [cs.LG])

Authors: AmirPouya Hemmasian, Amir Barati Farimani

Developing fast surrogates for Partial Differential Equations (PDEs) will accelerate design and optimization in almost all scientific and engineering applications. Neural networks have received ever-increasing attention and demonstrated remarkable success in computational modeling of PDEs; however, their prediction accuracy is not at the level of full deployment. In this work, we utilize the transformer architecture, the backbone of numerous state-of-the-art AI models, to learn the dynamics of physical systems as the mixing of spatial patterns learned by a convolutional autoencoder. Moreover, we incorporate the idea of multi-scale hierarchical time-stepping to increase the prediction speed and decrease accumulated error over time. Our model achieves similar or better results in predicting the time evolution of the Navier-Stokes equations compared to the powerful Fourier Neural Operator (FNO) and two transformer-based neural operators, OFormer and Galerkin Transformer.
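
The hierarchical time-stepping idea, independent of the transformer backbone, can be sketched as a coarse surrogate that jumps k steps at once to provide anchors, with a fine surrogate filling in the intermediate states; the dummy linear maps below stand in for the learned models:

```python
import numpy as np

k = 4

def coarse(u):
    return 0.99 * u      # stand-in: advances k time steps at once

def fine(u):
    return 0.9975 * u    # stand-in: advances one step (0.9975**4 ~ 0.99)

def rollout(u0, n_steps):
    states = [u0]
    for _ in range(n_steps // k):
        anchor = coarse(states[-1])   # big jump with the coarse model
        seg = states[-1]
        for _ in range(k - 1):        # fine model fills the gap
            seg = fine(seg)
            states.append(seg)
        states.append(anchor)         # coarse anchor limits error accumulation
    return np.array(states)

print(rollout(np.ones(8), 16).shape)  # (17, 8): initial state + 16 steps
```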

State-wise Safe Reinforcement Learning With Pixel Observations. (arXiv:2311.02227v1 [cs.LG])

Authors: Simon Sinong Zhan, Yixuan Wang, Qingyuan Wu, Ruochen Jiao, Chao Huang, Qi Zhu

Reinforcement Learning (RL) in the context of safe exploration has long grappled with the delicate balance between maximizing rewards and minimizing safety violations, the complexities arising from contact-rich or non-smooth environments, and high-dimensional pixel observations. Furthermore, incorporating state-wise safety constraints in the exploration and learning process, where the agent is prohibited from accessing unsafe regions without prior knowledge, adds an additional layer of complexity. In this paper, we propose a novel pixel-observation safe RL algorithm that efficiently encodes state-wise safety constraints with unknown hazard regions through the introduction of a latent barrier function learning mechanism. As a joint learning framework, our approach first constructs a latent dynamics model with low-dimensional latent spaces derived from pixel observations. Subsequently, we build and learn a latent barrier function on top of the latent dynamics and conduct policy optimization simultaneously, thereby improving both safety and the total expected return. Experimental evaluations on the safety-gym benchmark suite show that our proposed method significantly reduces safety violations throughout the training process and achieves faster safety convergence than existing methods while attaining competitive reward returns.
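
The latent barrier function can be sketched as a small network B over latent states trained with a generic barrier-certificate loss: positive on safe states, negative on unsafe ones, and not decaying too fast under the latent dynamics. This is an assumed generic recipe, not necessarily the paper's exact objective:

```python
import torch
import torch.nn as nn

# Toy barrier network over an 8-dimensional latent state.
barrier = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

def barrier_loss(z_safe, z_unsafe, z_t, z_next, margin=0.1, decay=0.9):
    l_safe = torch.relu(margin - barrier(z_safe)).mean()      # B >= margin on safe states
    l_unsafe = torch.relu(margin + barrier(z_unsafe)).mean()  # B <= -margin on unsafe states
    # B may not shrink faster than the decay rate along latent transitions.
    l_dyn = torch.relu(decay * barrier(z_t) - barrier(z_next)).mean()
    return l_safe + l_unsafe + l_dyn

loss = barrier_loss(torch.randn(32, 8), torch.randn(32, 8),
                    torch.randn(32, 8), torch.randn(32, 8))
loss.backward()  # trained jointly with the latent dynamics and the policy
```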

Robust Fine-Tuning of Vision-Language Models for Domain Generalization. (arXiv:2311.02236v1 [cs.CV])

Authors: Kevin Vogt-Lowell, Noah Lee, Theodoros Tsiligkaridis, Marc Vaillant

Transfer learning enables the sharing of common knowledge among models for a variety of downstream tasks, but traditional methods suffer in limited training data settings and produce narrow models incapable of effectively generalizing under distribution shifts. Foundation models have recently demonstrated impressive zero-shot inference capabilities and robustness under distribution shifts. However, zero-shot evaluation for these models has been predominantly confined to benchmarks with simple distribution shifts, limiting our understanding of their effectiveness under the more realistic shifts found in practice. Moreover, common fine-tuning methods for these models have yet to be evaluated against vision models in few-shot scenarios where training data is limited. To address these gaps, we present a new recipe for few-shot fine-tuning of the popular vision-language foundation model CLIP and evaluate its performance on challenging benchmark datasets with realistic distribution shifts from the WILDS collection. Our experimentation demonstrates that, while zero-shot CLIP fails to match performance of trained vision models on more complex benchmarks, few-shot CLIP fine-tuning outperforms its vision-only counterparts in terms of in-distribution and out-of-distribution accuracy at all levels of training data availability. This provides a strong incentive for adoption of foundation models within few-shot learning applications operating with real-world data. Code is available at https://github.com/mit-ll/robust-vision-language-finetuning

Explainable Authorship Identification in Cultural Heritage Applications: Analysis of a New Perspective. (arXiv:2311.02237v1 [cs.LG])

Authors: Mattia Setzu, Silvia Corbara, Anna Monreale, Alejandro Moreo, Fabrizio Sebastiani

While a substantial amount of work has recently been devoted to enhancing the performance of computational Authorship Identification (AId) systems, little to no attention has been paid to endowing AId systems with the ability to explain the reasons behind their predictions. This lack substantially hinders the practical adoption of AId methodologies, since the predictions returned by such systems are hardly useful unless they are supported by suitable explanations. In this paper, we explore the applicability of existing general-purpose eXplainable Artificial Intelligence (XAI) techniques to AId, with a special focus on explanations addressed to scholars working in cultural heritage. In particular, we assess the relative merits of three different types of XAI techniques (feature ranking, probing, and factual and counterfactual selection) on three different AId tasks (authorship attribution, authorship verification, and same-authorship verification) by running experiments on real AId data. Our analysis shows that, while these techniques make important first steps towards explainable Authorship Identification, more work remains to be done to provide tools that can be profitably integrated into the workflows of scholars.

Using DUCK-Net for Polyp Image Segmentation. (arXiv:2311.02239v1 [cs.CV])

Authors: Razvan-Gabriel Dumitru, Darius Peteleaza, Catalin Craciun

This paper presents a novel supervised convolutional neural network architecture, "DUCK-Net", capable of effectively learning and generalizing from small amounts of medical images to perform accurate segmentation tasks. Our model utilizes an encoder-decoder structure with a residual downsampling mechanism and a custom convolutional block to capture and process image information at multiple resolutions in the encoder segment. We employ data augmentation techniques to enrich the training set, thus increasing our model's performance. While our architecture is versatile and applicable to various segmentation tasks, in this study we demonstrate its capabilities specifically for polyp segmentation in colonoscopy images. We evaluate the performance of our method on several popular benchmark datasets for polyp segmentation (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, and ETIS-LARIBPOLYPDB), showing that it achieves state-of-the-art results in terms of mean Dice coefficient, Jaccard index, Precision, Recall, and Accuracy. Our approach demonstrates strong generalization capabilities, achieving excellent performance even with limited training data. The code is publicly available on GitHub: https://github.com/RazvanDu/DUCK-Net

Equal Opportunity of Coverage in Fair Regression. (arXiv:2311.02243v1 [cs.LG])

Authors: Fangxin Wang, Lu Cheng, Ruocheng Guo, Kay Liu, Philip S. Yu

We study fair machine learning (ML) under predictive uncertainty to enable reliable and trustworthy decision-making. The seminal work of ``equalized coverage'' proposed an uncertainty-aware fairness notion. However, it does not guarantee equal coverage rates across more fine-grained groups (e.g., low-income females) conditioning on the true label and is biased in the assessment of uncertainty. To tackle these limitations, we propose a new uncertainty-aware fairness notion -- Equal Opportunity of Coverage (EOC) -- that aims to achieve two properties: (1) coverage rates for different groups with similar outcomes are close, and (2) the coverage rate for the entire population remains at a predetermined level. Further, the prediction intervals should be narrow to be informative. We propose Binned Fair Quantile Regression (BFQR), a distribution-free post-processing method to improve EOC with reasonable width for any trained ML model. It first performs calibration on a hold-out set to bound the deviation from EOC, then leverages conformal prediction to maintain EOC on a test set while optimizing prediction interval width. Experimental results demonstrate the effectiveness of our method in improving EOC. Our code is publicly available at https://github.com/fangxin-wang/bfqr .
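
For readers unfamiliar with the conformal machinery that BFQR builds on, here is a minimal sketch of split conformal prediction, the mechanism that maintains a predetermined coverage level; BFQR's binning and fairness-specific calibration are omitted.

```python
import numpy as np

def split_conformal_intervals(pred_cal, y_cal, pred_test, alpha=0.1):
    """Return prediction intervals on the test set with ~(1 - alpha)
    marginal coverage, calibrated on held-out (pred_cal, y_cal) pairs."""
    scores = np.abs(y_cal - pred_cal)                     # conformity scores
    n = len(scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level)
    return pred_test - q, pred_test + q

# Toy check of the coverage guarantee.
rng = np.random.default_rng(0)
y_cal = rng.normal(size=500)
pred_cal = y_cal + rng.normal(scale=0.5, size=500)
y_test = rng.normal(size=500)
pred_test = y_test + rng.normal(scale=0.5, size=500)
lo, hi = split_conformal_intervals(pred_cal, y_cal, pred_test)
print("empirical coverage:", np.mean((y_test >= lo) & (y_test <= hi)))
```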

PRISM: Progressive Restoration for Scene Graph-based Image Manipulation. (arXiv:2311.02247v1 [cs.LG])

Authors: Pavel Jahoda, Azade Farshad, Yousef Yeganeh, Ehsan Adeli, Nassir Navab

Scene graphs have emerged as accurate descriptive priors for image generation and manipulation tasks; however, the complexity and diversity of object shapes and relations in data make it challenging to incorporate them into models and to generate high-quality results. To address these challenges, we propose PRISM, a novel progressive multi-head image manipulation approach to improve the accuracy and quality of the manipulated regions in the scene. Our image manipulation framework is trained using an end-to-end denoising masked reconstruction proxy task, where the masked regions are progressively unmasked from the outer regions to the inner part. We take advantage of the outer part of the masked area as it has a direct correlation with the context of the scene. Moreover, our multi-head architecture simultaneously generates detailed object-specific regions in addition to the entire image to produce higher-quality images. Our model outperforms the state-of-the-art methods in the semantic image manipulation task on the CLEVR and Visual Genome datasets. Our results demonstrate the potential of our approach for enhancing the quality and precision of scene graph-based image manipulation.

The Potential of Wearable Sensors for Assessing Patient Acuity in Intensive Care Unit (ICU). (arXiv:2311.02251v1 [cs.LG])

Authors: Jessica Sena, Mohammad Tahsin Mostafiz, Jiaqing Zhang, Andrea Davidson, Sabyasachi Bandyopadhyay, Ren Yuanfang, Tezcan Ozrazgat-Baslanti, Benjamin Shickel, Tyler Loftus, William Robson Schwartz, Azra Bihorac, Parisa Rashidi

Acuity assessments are vital in critical care settings to provide timely interventions and fair resource allocation. Traditional acuity scores rely on manual assessments and documentation of physiological states, which can be time-consuming, intermittent, and difficult for healthcare providers to use. Furthermore, such scores do not incorporate granular information such as patients' mobility level, which can indicate recovery or deterioration in the ICU. We hypothesized that existing acuity scores could be improved by employing Artificial Intelligence (AI) techniques in conjunction with Electronic Health Records (EHR) and wearable sensor data. In this study, we evaluated the impact of integrating mobility data collected from wrist-worn accelerometers with clinical data obtained from EHR for developing an AI-driven acuity assessment score. Accelerometry data were collected from 86 patients wearing accelerometers on their wrists in an academic hospital setting. The data were analyzed using five deep neural network models: VGG, ResNet, MobileNet, SqueezeNet, and a custom Transformer network. These models outperformed a rule-based clinical score (SOFA, the Sequential Organ Failure Assessment) used as a baseline, particularly in terms of precision, sensitivity, and F1 score. The results showed that while a model relying solely on accelerometer data achieved limited performance (AUC 0.50, Precision 0.61, and F1-score 0.68), including demographic information with the accelerometer data led to a notable enhancement in performance (AUC 0.69, Precision 0.75, and F1-score 0.67). This work shows that the combination of mobility and patient information can successfully differentiate between stable and unstable states in critically ill patients.

Comparative Knowledge Distillation. (arXiv:2311.02253v1 [cs.LG])

Authors: Alex Wilf, Alex Tianyi Xu, Paul Pu Liang, Alexander Obolenskiy, Daniel Fried, Louis-Philippe Morency

In the era of large-scale pretrained models, Knowledge Distillation (KD) serves an important role in transferring the wisdom of computationally heavy teacher models to lightweight, efficient student models while preserving performance. Traditional KD paradigms, however, assume readily available access to teacher models for frequent inference -- a notion increasingly at odds with the realities of costly, often proprietary, large-scale models. Addressing this gap, our paper considers how to minimize the dependency on teacher model inferences in KD in a setting we term Few Teacher Inference Knowledge Distillation (FTI KD). We observe that prevalent KD techniques and state-of-the-art data augmentation strategies fall short in this constrained setting. Drawing inspiration from educational principles that emphasize learning through comparison, we propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples. Critically, CKD provides additional learning signals to the student without making additional teacher calls. We also extend the principle of CKD to groups of samples, enabling even more efficient learning from limited teacher calls. Empirical evaluation across varied experimental settings indicates that CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
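
The comparative idea can be illustrated with a short PyTorch sketch: rather than matching teacher outputs sample-by-sample, the student matches differences between the teacher's outputs on pairs of samples, extracting extra signal from the same fixed set of teacher calls. The pairing scheme and MSE loss below are assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def comparative_kd_loss(student_logits, teacher_logits):
    """student_logits, teacher_logits: (B, C) logits for the same batch;
    teacher_logits are precomputed once (few teacher inferences)."""
    # All pairwise differences within the batch: shape (B, B, C).
    s_diff = student_logits.unsqueeze(1) - student_logits.unsqueeze(0)
    t_diff = teacher_logits.unsqueeze(1) - teacher_logits.unsqueeze(0)
    return F.mse_loss(s_diff, t_diff)
```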

Learning Time-Invariant Representations for Individual Neurons from Population Dynamics. (arXiv:2311.02258v1 [q-bio.NC])

Authors: Lu Mi, Trung Le, Tianxing He, Eli Shlizerman, Uygar Sümbül

Neurons can display highly variable dynamics. While such variability presumably supports the wide range of behaviors generated by the organism, their gene expressions are relatively stable in the adult brain. This suggests that neuronal activity is a combination of its time-invariant identity and the inputs the neuron receives from the rest of the circuit. Here, we propose a self-supervised learning based method to assign time-invariant representations to individual neurons based on a permutation- and population-size-invariant summary of population recordings. We fit dynamical models to neuronal activity to learn a representation by considering the activity of both the individual neuron and the neighboring population. Our self-supervised approach and use of implicit representations enable robust inference against imperfections such as partial overlap of neurons across sessions, trial-to-trial variability, and limited availability of molecular (transcriptomic) labels for downstream supervised tasks. We demonstrate our method on a public multimodal dataset of mouse cortical neuronal activity and transcriptomic labels. We report a > 35% improvement in predicting the transcriptomic subclass identity and a > 20% improvement in predicting class identity with respect to the state of the art.

Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs. (arXiv:2311.02262v1 [cs.CL])

Authors: Qingru Zhang, Chandan Singh, Liyuan Liu, Xiaodong Liu, Bin Yu, Jianfeng Gao, Tuo Zhao

In human-written articles, we often leverage the subtleties of text style, such as bold and italics, to guide the attention of readers. These textual emphases are vital for the readers to grasp the conveyed information. When interacting with large language models (LLMs), we have a similar need - steering the model to pay closer attention to user-specified information, e.g., an instruction. Existing methods, however, are constrained to process plain text and do not support such a mechanism. This motivates us to introduce PASTA - Post-hoc Attention STeering Approach, a method that allows LLMs to read text with user-specified emphasis marks. To this end, PASTA identifies a small subset of attention heads and applies precise attention reweighting on them, directing the model attention to user-specified parts. Like prompting, PASTA is applied at inference time and does not require changing any model parameters. Experiments demonstrate that PASTA can substantially enhance an LLM's ability to follow user instructions or integrate new knowledge from user inputs, leading to a significant performance improvement on a variety of tasks, e.g., an average accuracy improvement of 22% for LLAMA-7B. Our code is publicly available at https://github.com/QingruZhang/PASTA .
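
A rough sketch of what attention steering of this kind could look like, purely for illustration: downweight attention to non-emphasized tokens in a chosen subset of heads and renormalize, leaving all model parameters untouched. The head selection and the exact reweighting rule used by PASTA may differ.

```python
import torch

def steer_attention(attn, emphasized, heads, scale=0.01):
    """attn: (n_heads, q_len, k_len) post-softmax attention weights.
    emphasized: boolean (k_len,) mask marking user-specified tokens.
    heads: indices of the heads to steer (assumed given)."""
    attn = attn.clone()
    for h in heads:
        a = attn[h]
        a[:, ~emphasized] *= scale                    # downweight the rest
        attn[h] = a / a.sum(dim=-1, keepdim=True)     # renormalize each row
    return attn
```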

Multi-task Learning for Optical Coherence Tomography Angiography (OCTA) Vessel Segmentation. (arXiv:2311.02266v1 [eess.IV])

Authors: Can Koz, Onat Dalmaz, Mertay Dayanc

Optical Coherence Tomography Angiography (OCTA) is a non-invasive imaging technique that provides high-resolution cross-sectional images of the retina, which are useful for diagnosing and monitoring various retinal diseases. However, manual segmentation of OCTA images is a time-consuming and labor-intensive task, which motivates the development of automated segmentation methods. In this paper, we propose a novel multi-task learning method for OCTA segmentation, called OCTA-MTL, that leverages an image-to-DT (Distance Transform) branch and an adaptive loss combination strategy. The image-to-DT branch predicts the distance from each vessel voxel to the vessel surface, which can provide useful shape priors and boundary information for the segmentation task. The adaptive loss combination strategy dynamically adjusts the loss weights according to the inverse of the average loss values of each task, to balance the learning process and avoid the dominance of one task over the other. We evaluate our method on the ROSE-2 dataset, demonstrating its superiority in terms of segmentation performance over two baseline methods: a single-task segmentation method and a multi-task segmentation method with a fixed loss combination.
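
The adaptive loss combination is simple enough to sketch directly: each task's weight is the inverse of its running average loss (normalized across tasks), so each loss is effectively rescaled by its typical magnitude and neither task dominates. The running-average window below is an illustrative assumption.

```python
import collections
import torch

class InverseAverageLossWeights:
    """Combine task losses with weights proportional to 1 / (average loss)."""
    def __init__(self, n_tasks, window=100):
        self.histories = [collections.deque(maxlen=window) for _ in range(n_tasks)]

    def combine(self, losses):
        """losses: list of scalar task losses (tensors) for the current step."""
        for hist, loss in zip(self.histories, losses):
            hist.append(float(loss))
        inv_avg = torch.tensor([len(h) / (sum(h) + 1e-8) for h in self.histories])
        weights = inv_avg / inv_avg.sum()
        return sum(w * loss for w, loss in zip(weights, losses))
```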

LLMs-augmented Contextual Bandit. (arXiv:2311.02268v1 [cs.LG])

Authors: Ali Baheri, Cecilia O. Alm

Contextual bandits have emerged as a cornerstone in reinforcement learning, enabling systems to make decisions with partial feedback. However, as contexts grow in complexity, traditional bandit algorithms can face challenges in adequately capturing and utilizing such contexts. In this paper, we propose a novel integration of large language models (LLMs) with the contextual bandit framework. By leveraging LLMs as an encoder, we enrich the representation of the context, providing the bandit with a denser and more informative view. Preliminary results on synthetic datasets demonstrate the potential of this approach, showing notable improvements in cumulative rewards and reductions in regret compared to traditional bandit algorithms. This integration not only showcases the capabilities of LLMs in reinforcement learning but also opens the door to a new era of contextually-aware decision systems.
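
As an illustration of the proposed pipeline (not the authors' code): the raw textual context is first encoded into a dense vector by an LLM, and a standard contextual bandit such as LinUCB then operates on that embedding. The sketch below shows the bandit side; the context vector x is assumed to be an LLM embedding produced elsewhere.

```python
import numpy as np

class LinUCB:
    """Standard disjoint LinUCB; x is assumed to be an LLM context embedding."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.A = [np.eye(dim) for _ in range(n_arms)]   # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]
        self.alpha = alpha

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))                    # optimistic choice

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```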

Regularized Linear Regression for Binary Classification. (arXiv:2311.02270v1 [cs.LG])

Authors: Danil Akhtiamov, Reza Ghane, Babak Hassibi

Regularized linear regression is a promising approach for binary classification problems in which the training set has noisy labels, since the regularization term can help to avoid interpolating the mislabeled data points. In this paper we provide a systematic study of the effects of the regularization strength on the performance of linear classifiers that are trained to solve binary classification problems by minimizing a regularized least-squares objective. We consider the over-parametrized regime and assume that the classes are generated from a Gaussian Mixture Model (GMM) where a fraction $c<\frac{1}{2}$ of the training data is mislabeled. Under these assumptions, we rigorously analyze the classification errors resulting from the application of ridge, $\ell_1$, and $\ell_\infty$ regression. In particular, we demonstrate that ridge regression invariably improves the classification error. We prove that $\ell_1$ regularization induces sparsity and observe that in many cases one can sparsify the solution by up to two orders of magnitude without any considerable loss of performance, even though the GMM has no underlying sparsity structure. For $\ell_\infty$ regularization we show that, for large enough regularization strength, the optimal weights concentrate around two values of opposite sign. We observe that in many cases the corresponding "compression" of each weight to a single bit leads to very little loss in performance. These latter observations can have significant practical ramifications.
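
The setting is easy to reproduce in a few lines, which may help make the regimes concrete: an over-parametrized Gaussian mixture with a fraction c of flipped labels, classified by regularized least squares (ridge shown here; dimensions and c are arbitrary demo choices).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d, c = 200, 400, 0.2                       # over-parametrized: d > n
mu = rng.normal(size=d) / np.sqrt(d)          # class mean direction
y = rng.choice([-1.0, 1.0], size=n)
X = np.outer(y, mu) + rng.normal(size=(n, d))
y_noisy = y * np.where(rng.random(n) < c, -1.0, 1.0)   # flip a fraction c

y_test = rng.choice([-1.0, 1.0], size=1000)
X_test = np.outer(y_test, mu) + rng.normal(size=(1000, d))
for lam in [1e-3, 1.0, 100.0]:                # sweep regularization strength
    w = Ridge(alpha=lam, fit_intercept=False).fit(X, y_noisy).coef_
    err = np.mean(np.sign(X_test @ w) != y_test)
    print(f"lambda={lam:g}: test error {err:.3f}")
```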

Machine learning's own Industrial Revolution. (arXiv:2311.02278v1 [cs.LG])

Authors: Yuan Luo, Song Han, Jingjing Liu

Machine learning is expected to enable the next Industrial Revolution. However, lacking standardized and automated assembly networks, ML faces significant challenges in meeting ever-growing enterprise demands and empowering broad industries. In this Perspective, we argue that ML needs to first complete its own Industrial Revolution, elaborate on how to best achieve its goals, and discuss new opportunities to enable rapid translation from ML's innovation frontier to mass production and utilization.

Contrastive Multi-Modal Representation Learning for Spark Plug Fault Diagnosis. (arXiv:2311.02282v1 [cs.LG])

Authors: Ardavan Modarres, Vahid Mohammad-Zadeh Eivaghi, Mahdi Aliyari Shoorehdeli, Ashkan Moosavian

Because a single sensory measurement is often incapable of providing enough information for condition monitoring of complex engineered industrial mechanisms, and to overcome the misleading noise of a single sensor, multiple sensors are installed to improve the condition monitoring of industrial equipment. An efficient data fusion strategy is therefore demanded. In this research, we present a Denoising Multi-Modal Autoencoder with a unique training strategy based on the contrastive learning paradigm, both utilized for the first time in the machine health monitoring realm. The presented approach, which leverages the merits of both supervised and unsupervised learning, not only achieves excellent performance in fusing multiple modalities (or views) of data into an enriched common representation but also takes data fusion to the next level: one of the views can be omitted during inference time with very slight performance reduction, or even without any reduction at all. The presented methodology enables multi-modal fault diagnosis systems to perform more robustly in case of sensor failure, and one can also intentionally omit one of the sensors (the more expensive one) to build a more cost-effective condition monitoring system without sacrificing performance for practical purposes. The effectiveness of the presented methodology is examined on a real-world private multi-modal dataset gathered under non-laboratory conditions from a complex engineered mechanism, an inline four-stroke spark-ignition engine, targeting spark plug fault diagnosis. This dataset, which contains accelerometer and acoustic signals as two modalities, exhibits only very slight fault severity, and achieving good performance on it suggests that the presented method can perform well on other equipment as well.

Predicting Ground Reaction Force from Inertial Sensors. (arXiv:2311.02287v1 [cs.LG])

Authors: Bowen Song, Marco Paolieri, Harper E. Stewart, Leana Golubchik, Jill L. McNitt-Gray, Vishal Misra, Devavrat Shah

The study of ground reaction forces (GRF) is used to characterize the mechanical loading experienced by individuals in movements such as running, which is clinically applicable to identify athletes at risk for stress-related injuries. Our aim in this paper is to determine if data collected with inertial measurement units (IMUs), that can be worn by athletes during outdoor runs, can be used to predict GRF with sufficient accuracy to allow the analysis of its derived biomechanical variables (e.g., contact time and loading rate).

In this paper, we consider lightweight approaches in contrast to state-of-the-art prediction using LSTM neural networks. Specifically, we compare the use of LSTMs to k-Nearest Neighbors (KNN) regression, and we also propose a novel solution, SVD Embedding Regression (SER), which uses linear regression between singular value decomposition embeddings of IMU data (input) and GRF data (output). We evaluate the accuracy of these techniques when using training data collected from different athletes, from the same athlete, or both, and we explore the use of acceleration and angular velocity data from sensors at different locations (sacrum and shanks). Our results illustrate that simple machine learning methods such as SER and KNN can be similarly accurate or more accurate than LSTM neural networks, with much faster training times and hyperparameter optimization; in particular, SER and KNN are more accurate when personal training data are available, and KNN comes with the benefit of providing the provenance of its predictions. Notably, the use of personal data reduces the prediction errors of all methods for most biomechanical variables.
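
Since SER is described as linear regression between SVD embeddings, a compact sketch can convey the whole method; the window shapes and the rank are illustrative choices, not the paper's settings.

```python
import numpy as np

def fit_ser(X_imu, Y_grf, rank=10):
    """X_imu: (n, p) flattened IMU windows; Y_grf: (n, q) GRF windows."""
    _, _, Vx = np.linalg.svd(X_imu, full_matrices=False)
    _, _, Vy = np.linalg.svd(Y_grf, full_matrices=False)
    Zx = X_imu @ Vx[:rank].T                       # input embeddings (n, rank)
    Zy = Y_grf @ Vy[:rank].T                       # output embeddings (n, rank)
    W, *_ = np.linalg.lstsq(Zx, Zy, rcond=None)    # linear map between embeddings
    return Vx[:rank], Vy[:rank], W

def predict_ser(X_new, Vx_r, Vy_r, W):
    """Embed new IMU windows, map to GRF embeddings, decode to GRF space."""
    return (X_new @ Vx_r.T) @ W @ Vy_r
```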

Successive Model-Agnostic Meta-Learning for Few-Shot Fault Time Series Prognosis. (arXiv:2311.02300v1 [cs.LG])

Authors: Hai Su, Jiajun Hu, Songsen Yu

Meta-learning is a promising technique for solving few-shot fault prediction problems, which have attracted the attention of many researchers in recent years. Existing meta-learning methods for time series prediction, which predominantly rely on random and similarity-matching-based task partitioning, face three major limitations: (1) inefficient feature exploitation; (2) suboptimal task data allocation; and (3) limited robustness with small samples. To overcome these limitations, we introduce a novel 'pseudo meta-task' partitioning scheme that treats a continuous time period of a time series as a meta-task, composed of multiple successive short time periods. Employing continuous time series as pseudo meta-tasks allows our method to extract more comprehensive features and relationships from the data, resulting in more accurate predictions. Moreover, we introduce a differential algorithm to enhance the robustness of our method across different datasets. Through extensive experiments on several fault and time series prediction datasets, we demonstrate that our approach substantially enhances prediction performance and generalization capability under both few-shot and general conditions.
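
A small sketch of one plausible reading of the partitioning scheme; the window lengths and the support/query split below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def pseudo_meta_tasks(series, n_windows=6, window_len=24, support_frac=0.5):
    """Cut a long series into contiguous runs of successive short windows;
    each run is one pseudo meta-task (support = earlier windows, query = later)."""
    tasks = []
    task_len = n_windows * window_len
    for start in range(0, len(series) - task_len + 1, task_len):
        windows = series[start:start + task_len].reshape(n_windows, window_len)
        k = int(n_windows * support_frac)
        tasks.append((windows[:k], windows[k:]))
    return tasks

tasks = pseudo_meta_tasks(np.arange(1000.0))
print(len(tasks), tasks[0][0].shape, tasks[0][1].shape)   # 6 (3, 24) (3, 24)
```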

MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning. (arXiv:2311.02303v1 [cs.LG])

Authors: Bingchang Liu, Chaoyu Chen, Cong Liao, Zi Gong, Huan Wang, Zhichao Lei, Ming Liang, Dajun Chen, Min Shen, Hailian Zhou, Hang Yu, Jianguo Li

Code LLMs have emerged as a specialized research field, with remarkable studies dedicated to enhancing models' coding capabilities through fine-tuning of pre-trained models. Previous fine-tuning approaches were typically tailored to specific downstream tasks or scenarios, which meant separate fine-tuning for each task, requiring extensive training resources and posing challenges in terms of deployment and maintenance. Furthermore, these approaches failed to leverage the inherent interconnectedness among different code-related tasks. To overcome these limitations, we present a multi-task fine-tuning framework, MFTCoder, that enables simultaneous and parallel fine-tuning on multiple tasks. By incorporating various loss functions, we effectively address common challenges in multi-task learning, such as data imbalance, varying difficulty levels, and inconsistent convergence speeds. Extensive experiments have conclusively demonstrated that our multi-task fine-tuning approach outperforms both individual fine-tuning on single tasks and fine-tuning on a mixed ensemble of tasks. Moreover, MFTCoder offers efficient training capabilities, including efficient data tokenization modes and PEFT fine-tuning, resulting in significantly improved speed compared to traditional fine-tuning methods. MFTCoder seamlessly integrates with several mainstream open-source LLMs, such as CodeLLama and Qwen. Leveraging the CodeLLama foundation, our MFTCoder fine-tuned model, \textsc{CodeFuse-CodeLLama-34B}, achieves an impressive pass@1 score of 74.4\% on the HumanEval benchmark, surpassing GPT-4 performance (67\%, zero-shot). MFTCoder is open-sourced at \url{https://github.com/codefuse-ai/MFTCOder}

Heteroskedastic Tensor Clustering. (arXiv:2311.02306v1 [math.ST])

Authors: Yuchen Zhou, Yuxin Chen

Tensor clustering, which seeks to extract underlying cluster structures from noisy tensor observations, has gained increasing attention. One extensively studied model for tensor clustering is the tensor block model, which postulates the existence of clustering structures along each mode and has found broad applications in areas like multi-tissue gene expression analysis and multilayer network analysis. However, currently available computationally feasible methods for tensor clustering either are limited to handling i.i.d. sub-Gaussian noise or suffer from suboptimal statistical performance, which restrains their utility in applications that have to deal with heteroskedastic data and/or a low signal-to-noise ratio (SNR).

To overcome these challenges, we propose a two-stage method, named $\mathsf{High\text{-}order~HeteroClustering}$ ($\mathsf{HHC}$), which starts by performing tensor subspace estimation via a novel spectral algorithm called $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, followed by approximate $k$-means to obtain cluster nodes. Encouragingly, our algorithm provably achieves exact clustering as long as the SNR exceeds the computational limit (ignoring logarithmic factors); here, the SNR refers to the ratio of the pairwise disparity between nodes to the noise level, and the computational limit indicates the lowest SNR that enables exact clustering with polynomial runtime. Comprehensive simulation and real-data experiments suggest that our algorithm outperforms existing algorithms across various settings, delivering more reliable clustering performance.

Thermal Face Image Classification using Deep Learning Techniques. (arXiv:2311.02314v1 [cs.CV])

Authors: Prosenjit Chatterjee, ANK Zaman

Thermal images have various applications in the security, medical, and industrial domains. This paper proposes a practical deep-learning approach for thermal image classification. Accurate and efficient classification of thermal images poses a significant challenge across various fields due to complex image content and the scarcity of annotated datasets. This work uses convolutional neural network (CNN) architectures, specifically ResNet-50 and VGGNet-19, to extract features from thermal images, and applies a Kalman filter to the thermal input images for denoising. The experimental results demonstrate the effectiveness of the proposed approach in terms of accuracy and efficiency.

Self-Supervised Learning of Representations for Space Generates Multi-Modular Grid Cells. (arXiv:2311.02316v1 [cs.LG])

Authors: Rylan Schaeffer, Mikail Khona, Tzuhsuan Ma, Cristóbal Eyzaguirre, Sanmi Koyejo, Ila Rani Fiete

To solve the spatial problems of mapping, localization and navigation, the mammalian lineage has developed striking spatial representations. One important spatial representation is the Nobel-prize-winning grid cells: neurons that represent self-location, a local and aperiodic quantity, with seemingly bizarre non-local and spatially periodic activity patterns of a few discrete periods. Why has the mammalian lineage learnt this peculiar grid representation? Mathematical analysis suggests that this multi-periodic representation has excellent properties as an algebraic code with high capacity and intrinsic error-correction, but to date, there is no satisfactory synthesis of core principles that lead to multi-modular grid cells in deep recurrent neural networks. In this work, we begin by identifying key insights from four families of approaches to answering the grid cell question: coding theory, dynamical systems, function optimization and supervised deep learning. We then leverage our insights to propose a new approach that combines the strengths of all four approaches. Our approach is a self-supervised learning (SSL) framework - including data, data augmentations, loss functions and a network architecture - motivated from a normative perspective, without access to supervised position information or engineering of particular readout representations as needed in previous approaches. We show that multiple grid cell modules can emerge in networks trained on our SSL framework and that the networks and emergent representations generalize well outside their training distribution. This work contains insights for neuroscientists interested in the origins of grid cells as well as machine learning researchers interested in novel SSL frameworks.

FragXsiteDTI: Revealing Responsible Segments in Drug-Target Interaction with Transformer-Driven Interpretation. (arXiv:2311.02326v1 [cs.LG])

Authors: Ali Khodabandeh Yalabadi, Mehdi Yazdani-Jahromi, Niloofar Yousefi, Aida Tayebi, Sina Abdidizaji, Ozlem Ozmen Garibay

Drug-Target Interaction (DTI) prediction is vital for drug discovery, yet challenges persist in achieving model interpretability and optimizing performance. We propose a novel transformer-based model, FragXsiteDTI, that aims to address these challenges in DTI prediction. Notably, FragXsiteDTI is the first DTI model to simultaneously leverage drug molecule fragments and protein pockets. Our information-rich representations for both proteins and drugs offer a detailed perspective on their interaction. Inspired by the Perceiver IO framework, our model features a learnable latent array, initially interacting with protein binding site embeddings using cross-attention and later refined through self-attention and used as a query to the drug fragments in the drug's cross-attention transformer block. This learnable query array serves as a mediator and enables seamless information translation, preserving critical nuances in drug-protein interactions. Our computational results on three benchmarking datasets demonstrate the superior predictive power of our model over several state-of-the-art models. We also show the interpretability of our model in terms of the critical components of both target proteins and drug molecules within drug-target pairs.
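
The latent-mediator flow can be sketched with standard attention modules; the single-block depth, dimensions, and pooling below are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentMediator(nn.Module):
    def __init__(self, n_latents=32, dim=128, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.cross_pocket = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_frag = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pocket_emb, frag_emb):
        """pocket_emb: (B, n_pockets, dim); frag_emb: (B, n_frags, dim)."""
        z = self.latents.unsqueeze(0).expand(pocket_emb.size(0), -1, -1)
        z, _ = self.cross_pocket(z, pocket_emb, pocket_emb)  # read binding sites
        z, _ = self.self_attn(z, z, z)                       # refine latents
        z, _ = self.cross_frag(z, frag_emb, frag_emb)        # query drug fragments
        return z.mean(dim=1)                                 # pooled input to a DTI head
```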

An Operator Learning Framework for Spatiotemporal Super-resolution of Scientific Simulations. (arXiv:2311.02328v1 [cs.LG])

Authors: Valentin Duruisseaux, Amit Chakraborty

In numerous contexts, high-resolution solutions to partial differential equations are required to faithfully capture essential dynamics that occur at small spatiotemporal scales, but these solutions can be very difficult and slow to obtain using traditional methods due to limited computational resources. A recent direction to circumvent these computational limitations is to use machine learning techniques for super-resolution, reconstructing high-resolution numerical solutions from low-resolution simulations which can be obtained more efficiently. The proposed approach, the Super Resolution Operator Network (SROpNet), frames super-resolution as an operator learning problem and draws inspiration from existing architectures to learn continuous representations of solutions to parametric differential equations from low-resolution approximations, which can then be evaluated at any desired location. In addition, no restrictions are imposed on the locations of the (fixed number of) spatiotemporal sensors at which the low-resolution approximations are provided, thereby enabling the consideration of a broader spectrum of problems arising in practice, for which many existing super-resolution approaches are not well-suited.

WD3: Taming the Estimation Bias in Deep Reinforcement Learning. (arXiv:2006.12622v2 [cs.LG] UPDATED)

Authors: Qiang He, Xinwen Hou

The overestimation phenomenon caused by function approximation is a well-known issue in value-based reinforcement learning algorithms such as deep Q-networks and DDPG, and it can lead to suboptimal policies. To address this issue, TD3 takes the minimum value between a pair of critics. In this paper, we show that the TD3 algorithm introduces underestimation bias under mild assumptions. To obtain a more precise estimate of the value function, we unify these two opposites and propose a novel algorithm, \underline{W}eighted \underline{D}elayed \underline{D}eep \underline{D}eterministic Policy Gradient (WD3), which can eliminate the estimation bias and further improve performance by weighting a pair of critics. To demonstrate the effectiveness of WD3, we compare the learning process of the value function between DDPG, TD3, and WD3. The results verify that our algorithm does eliminate the estimation error of the value function. Furthermore, we evaluate our algorithm on continuous control tasks. We observe that in each test task, WD3 consistently outperforms, or at the very least matches, the state-of-the-art algorithms\footnote{Our code is available at~\href{https://sites.google.com/view/ictai20-wd3/}{https://sites.google.com/view/ictai20-wd3/}.}.
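
The weighting idea is compact enough to show directly. The interpolation below between the pessimistic minimum (TD3-style) and the average of the two critics is one plausible form, with beta as a hyperparameter; the paper's exact weighting may differ.

```python
import torch

def wd3_style_target(q1, q2, reward, done, gamma=0.99, beta=0.75):
    """q1, q2: target-critic values at the next state-action pair."""
    q_min = torch.min(q1, q2)        # tends to underestimate (TD3)
    q_avg = 0.5 * (q1 + q2)          # tends to overestimate (DDPG-like)
    q_next = beta * q_min + (1.0 - beta) * q_avg
    return reward + gamma * (1.0 - done) * q_next
```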

ProtoryNet - Interpretable Text Classification Via Prototype Trajectories. (arXiv:2007.01777v5 [cs.LG] UPDATED)

Authors: Dat Hong, Tong Wang, Stephen S. Baek

We propose a novel interpretable deep neural network for text classification, called ProtoryNet, based on a new concept of prototype trajectories. Motivated by prototype theory in modern linguistics, ProtoryNet makes a prediction by finding the most similar prototype for each sentence in a text sequence and feeding an RNN backbone with the proximity of each sentence to the corresponding active prototype. The RNN backbone then captures the temporal pattern of the prototypes, which we refer to as prototype trajectories. Prototype trajectories enable intuitive and fine-grained interpretation of the reasoning process of the RNN model, resembling how humans analyze texts. We also design a prototype pruning procedure to reduce the total number of prototypes used by the model for better interpretability. Experiments on multiple public data sets show that ProtoryNet is more accurate than the baseline prototype-based deep neural net and reduces the performance gap compared to state-of-the-art black-box models. In addition, after prototype pruning, the resulting ProtoryNet models need only around 20 or fewer prototypes for all datasets, which significantly benefits interpretability. Furthermore, we report a survey result indicating that human users find ProtoryNet more intuitive and easier to understand than other prototype-based methods.
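
A minimal sketch of the trajectory idea, with the sentence encoder abstracted away: each sentence embedding is converted to proximities against learnable prototypes, and an RNN consumes the resulting trajectory. The similarity used and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProtoTrajectoryClassifier(nn.Module):
    def __init__(self, n_prototypes=20, dim=256, n_classes=2):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))
        self.rnn = nn.LSTM(n_prototypes, 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, sentence_embs):
        """sentence_embs: (B, n_sentences, dim), one vector per sentence."""
        protos = self.prototypes.unsqueeze(0).expand(sentence_embs.size(0), -1, -1)
        dists = torch.cdist(sentence_embs, protos)   # (B, n_sentences, n_prototypes)
        traj = torch.softmax(-dists, dim=-1)         # prototype trajectory
        _, (h, _) = self.rnn(traj)                   # temporal pattern of prototypes
        return self.head(h[-1])
```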

Training Matters: Unlocking Potentials of Deeper Graph Convolutional Neural Networks. (arXiv:2008.08838v3 [cs.LG] UPDATED)

Authors: Sitao Luan, Mingde Zhao, Xiao-Wen Chang, Doina Precup

The performance limit of Graph Convolutional Networks (GCNs), and the fact that we cannot stack more of them to increase performance as we usually do for other deep learning paradigms, are pervasively thought to be caused by limitations of the GCN layers, such as insufficient expressive power. However, if that were so, then for a fixed architecture it would be unlikely that changing only the training procedure could lower the training difficulty and improve performance; we show in this paper that this is not only possible but possible in several ways. We first identify the training difficulty of GCNs from the perspective of graph signal energy loss. More specifically, we find that the loss of energy in the backward pass during training nullifies the learning of the layers closer to the input. We then propose several methodologies that mitigate the training problem by slightly modifying the GCN operator from the energy perspective. After empirical validation, we confirm that these operator changes lead to a significant decrease in training difficulty and a notable performance boost, without changing the composition of parameters. From this, we conclude that the root cause of the problem is more likely the training difficulty than the limitations of the GCN layers themselves.

A New Bandit Setting Balancing Information from State Evolution and Corrupted Context. (arXiv:2011.07989v4 [cs.LG] UPDATED)

Authors: Alexander Galozy, Slawomir Nowaczyk, Mattias Ohlsson

We propose a new sequential decision-making setting, combining key aspects of two established online learning problems with bandit feedback. The optimal action to play at any given moment is contingent on an underlying changing state which is not directly observable by the agent. Each state is associated with a context distribution, possibly corrupted, allowing the agent to identify the state. Furthermore, states evolve in a Markovian fashion, providing useful information to estimate the current state via state history. In the proposed problem setting, we tackle the challenge of deciding on which of the two sources of information the agent should base its arm selection. We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit. We capture the time-correlation of states by iteratively learning the action-reward transition model, allowing for efficient exploration of actions. Our setting is motivated by adaptive mobile health (mHealth) interventions. Users transition through different, time-correlated, but only partially observable internal states, determining their current needs. The side information associated with each internal state might not always be reliable, and standard approaches that rely solely on the context risk incurring high regret. Similarly, some users might exhibit weaker correlations between subsequent states, so approaches that rely solely on state transitions risk the same. We analyze our setting and algorithm in terms of regret lower and upper bounds and evaluate our method on simulated medication adherence intervention data and several real-world data sets, showing improved empirical performance compared to several popular algorithms.

Using reinforcement learning to autonomously identify sources of error for agents in group missions. (arXiv:2107.09232v4 [cs.RO] UPDATED)

Authors: Keishu Utimula, Ken-taro Hayaschi, Trevor J. Bihl, Kenta Hongo, Ryo Maezono

When agents swarm to execute a mission, some of them frequently exhibit sudden failure, as observed from the command base. It is generally difficult to determine whether a failure is caused by actuators (hypothesis $h_a$) or sensors (hypothesis $h_s$) by relying solely on the communication between the command base and the agent concerned. However, by instigating collusion between the agents, the cause of failure can be identified; in other words, we expect to detect corresponding displacements for $h_a$ but not for $h_s$. In this study, we considered the question of whether artificial intelligence can autonomously generate an action plan $\boldsymbol{g}$ to pinpoint the cause as described above. Because the expected response to $\boldsymbol{g}$ generally depends upon the adopted hypothesis [let the difference be denoted by $D(\boldsymbol{g})$], a formulation that uses $D\left(\boldsymbol{g}\right)$ to pinpoint the cause can be made. Although a $\boldsymbol{g}^*$ that maximizes $D(\boldsymbol{g})$ would be a suitable action plan for this task, such an optimization is difficult to achieve using the conventional gradient method, as $D(\boldsymbol{g})$ becomes nonzero only in rare events such as collisions with other agents, and most swarm actions $\boldsymbol{g}$ give $D(\boldsymbol{g})=0$. In other words, throughout almost the entire space of $\boldsymbol{g}$, $D(\boldsymbol{g})$ has zero gradient, and the gradient method is not applicable. To overcome this problem, we formulated an action plan using Q-table reinforcement learning. Surprisingly, the optimal action plan generated via reinforcement learning presented a human-like solution that pinpoints the problem by colliding other agents with the failed agent. Using this simple prototype, we demonstrated the potential of applying Q-table reinforcement learning methods to plan autonomous actions to pinpoint the causes of failure.

New Insights into Graph Convolutional Networks using Neural Tangent Kernels. (arXiv:2110.04060v2 [cs.LG] UPDATED)

Authors: Mahalakshmi Sabanayagam, Pascal Esser, Debarghya Ghoshdastidar

Graph Convolutional Networks (GCNs) have emerged as powerful tools for learning on network structured data. Although empirically successful, GCNs exhibit certain behaviour that has no rigorous explanation -- for instance, the performance of GCNs significantly degrades with increasing network depth, whereas it improves marginally with depth using skip connections. This paper focuses on semi-supervised learning on graphs, and explains the above observations through the lens of Neural Tangent Kernels (NTKs). We derive NTKs corresponding to infinitely wide GCNs (with and without skip connections). Subsequently, we use the derived NTKs to identify that, with suitable normalisation, network depth does not always drastically reduce the performance of GCNs -- a fact that we also validate through extensive simulation. Furthermore, we propose NTK as an efficient `surrogate model' for GCNs that does not suffer from performance fluctuations due to hyper-parameter tuning since it is a hyper-parameter free deterministic kernel. The efficacy of this idea is demonstrated through a comparison of different skip connections for GCNs using the surrogate NTKs.

Error-bounded Approximate Time Series Joins Using Compact Dictionary Representations of Time Series. (arXiv:2112.12965v2 [cs.DB] UPDATED)

Authors: Chin-Chia Michael Yeh, Yan Zheng, Junpeng Wang, Huiyuan Chen, Zhongfang Zhuang, Wei Zhang, Eamonn Keogh

The matrix profile is an effective data mining tool that provides similarity join functionality for time series data. Users of the matrix profile can either join a time series with itself using an intra-similarity join (i.e., a self-join) or join a time series with another time series using an inter-similarity join. By invoking either or both types of joins, the matrix profile can help users discover both conserved and anomalous structures in the data. Since the introduction of the matrix profile five years ago, multiple efforts have been made to speed up the computation with approximate joins; however, the majority of these efforts focus only on self-joins. In this work, we show that it is possible to efficiently perform approximate inter-time-series similarity joins with error-bounded guarantees by creating a compact "dictionary" representation of the time series. Using the dictionary representation instead of the original time series, we are able to improve the throughput of an anomaly mining system by at least 20X, with essentially no decrease in accuracy. As a side effect, the dictionaries also summarize the time series in a semantically meaningful way and can provide intuitive and actionable insights. We demonstrate the utility of our dictionary-based inter-time-series similarity joins on domains as diverse as medicine and transportation.

Transfer-Learning Across Datasets with Different Input Dimensions: An Algorithm and Analysis for the Linear Regression Case. (arXiv:2202.05069v4 [stat.ML] UPDATED)

Authors: Luis Pedro Silvestrin, Harry van Zanten, Mark Hoogendoorn, Ger Koole

With the development of new sensors and monitoring devices, more sources of data become available as inputs for machine learning models. On the one hand, these can help to improve the accuracy of a model; on the other hand, combining these new inputs with historical data remains a challenge that has not yet been studied in sufficient detail. In this work, we propose a transfer learning algorithm that combines new and historical data with different input dimensions. This approach is easy to implement, efficient, with computational complexity equivalent to the ordinary least-squares method, and requires no hyperparameter tuning, making it straightforward to apply when the new data is limited. Different from other approaches, we provide a rigorous theoretical study of its robustness, showing that it cannot be outperformed by a baseline that utilizes only the new data. Our approach achieves state-of-the-art performance on 9 real-life datasets, outperforming the linear DSFT, another linear transfer learning algorithm, and performing comparably to the non-linear DSFT.

Monotone Learning. (arXiv:2202.05246v3 [cs.LG] UPDATED)

Authors: Olivier Bousquet, Amit Daniely, Haim Kaplan, Yishay Mansour, Shay Moran, Uri Stemmer

The amount of training data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Gyorfi, and Lugosi (1996) ask whether there exists a monotone Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm.

We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses a black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye et al (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021).

Our transformation readily implies monotone learners in a variety of contexts: for example it extends Pestov's result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov's work which is tailored to binary classification.

In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021).

Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization. (arXiv:2202.05423v3 [cs.LG] UPDATED)

Authors: Runlong Zhou, Zelin He, Yuandong Tian, Yi Wu, Simon S. Du

In recent years, reinforcement learning (RL) has started to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, theoretical study of why RL helps is still at an early stage. This paper presents the first systematic study of policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Best Choice Problem (BCP), we formally prove that the distribution shift is reduced exponentially with curriculum learning even if the curriculum is a randomly generated BCP on a smaller scale. Our theory also shows that we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on the Best Choice Problem, Online Knapsack, and AdWords to verify our findings.

Distraction is All You Need for Fairness. (arXiv:2203.07593v3 [cs.LG] UPDATED)

Authors: Mehdi Yazdani-Jahromi, AmirArsalan Rajabi, Ali Khodabandeh Yalabadi, Aida Tayebi, Ozlem Ozmen Garibay

Bias in training datasets must be managed for various groups in classification tasks to ensure parity or equal treatment. With the recent growth in artificial intelligence models and their expanding role in automated decision-making, ensuring that these models are not biased is vital. There is an abundance of evidence suggesting that these models can contain, or even amplify, the bias present in the data on which they are trained, inherent to their objective functions and learning algorithms. Researchers have approached this issue from different directions, for example by transforming the data to be statistically independent or by adversarial training that restricts the capabilities of an adversary aiming to maximize parity. These methods can result in information loss, fail to provide a suitable balance between accuracy and fairness, or do not reliably limit bias during training. To this end, we propose a powerful strategy for training deep learning models, called the Distraction module, which can be theoretically proven effective in preventing bias from affecting the classification results. This method can be used with different data types (e.g., tabular, images, graphs). We demonstrate the potency of the proposed method by testing it on the UCI Adult and Heritage Health datasets (tabular), the POKEC-Z, POKEC-N and NBA datasets (graph), and the CelebA dataset (vision). Using state-of-the-art methods proposed in the fairness literature for each dataset, we show that our model is superior to these methods in minimizing bias while maintaining accuracy.

Model-free optimization of power/efficiency tradeoffs in quantum thermal machines using reinforcement learning. (arXiv:2204.04785v2 [quant-ph] UPDATED)

Authors: Paolo Andrea Erdman, Frank Noé

A quantum thermal machine is an open quantum system that enables the conversion between heat and work at the micro- or nano-scale. Optimally controlling such out-of-equilibrium systems is a crucial yet challenging task with applications to quantum technologies and devices. We introduce a general model-free framework based on Reinforcement Learning to identify out-of-equilibrium thermodynamic cycles that are Pareto-optimal trade-offs between power and efficiency for quantum heat engines and refrigerators. The method does not require any knowledge of the quantum thermal machine, nor of the system model, nor of the quantum state. Instead, it only observes the heat fluxes, so it is applicable both to simulations and to experimental devices. We test our method on a model of an experimentally realistic refrigerator based on a superconducting qubit, and on a heat engine based on a quantum harmonic oscillator. In both cases, we identify the Pareto front representing optimal power-efficiency trade-offs, and the corresponding cycles. Such solutions outperform previous proposals made in the literature, such as optimized Otto cycles, reducing quantum friction.

A Taxonomy of Human and ML Strengths in Decision-Making to Investigate Human-ML Complementarity. (arXiv:2204.10806v3 [cs.HC] UPDATED)

Authors: Charvi Rastogi, Liu Leqi, Kenneth Holstein, Hoda Heidari

Hybrid human-ML systems increasingly make consequential decisions in a wide range of domains. These systems are often introduced with the expectation that the combined human-ML system will achieve complementary performance, that is, the combined decision-making system will be an improvement compared with either decision-making agent in isolation. However, empirical results have been mixed, and existing research rarely articulates the sources and mechanisms by which complementary performance is expected to arise. Our goal in this work is to provide conceptual tools to advance the way researchers reason and communicate about human-ML complementarity. Drawing upon prior literature in human psychology, machine learning, and human-computer interaction, we propose a taxonomy characterizing distinct ways in which human and ML-based decision-making can differ. In doing so, we conceptually map potential mechanisms by which combining human and ML decision-making may yield complementary performance, developing a language for the research community to reason about the design of hybrid systems in any decision-making domain. To illustrate how our taxonomy can be used to investigate complementarity, we provide a mathematical aggregation framework to examine enabling conditions for complementarity. Through synthetic simulations, we demonstrate how this framework can be used to explore specific aspects of our taxonomy and shed light on the optimal mechanisms for combining human-ML judgments.

Gacs-Korner Common Information Variational Autoencoder. (arXiv:2205.12239v2 [cs.LG] UPDATED)

Authors: Michael Kleinman, Alessandro Achille, Stefano Soatto, Jonathan Kao

We propose a notion of common information that allows one to quantify and separate the information that is shared between two random variables from the information that is unique to each. Our notion of common information is defined by an optimization problem over a family of functions and recovers the G\'acs-K\"orner common information as a special case. Importantly, our notion can be approximated empirically using samples from the underlying data distribution. We then provide a method to partition and quantify the common and unique information using a simple modification of a traditional variational auto-encoder. Empirically, we demonstrate that our formulation allows us to learn semantically meaningful common and unique factors of variation even on high-dimensional data such as images and videos. Moreover, on datasets where ground-truth latent factors are known, we show that we can accurately quantify the common information between the random variables.

Detecting hidden confounding in observational data using multiple environments. (arXiv:2205.13935v4 [stat.ME] UPDATED)

Authors: Rickard K.A. Karlsson, Jesse H. Krijthe

A common assumption in causal inference from observational data is that there is no hidden confounding. Yet it is, in general, impossible to verify this assumption from a single dataset. Under the assumption of independent causal mechanisms underlying the data-generating process, we demonstrate a way to detect unobserved confounders when multiple observational datasets coming from different environments are available. We present a theory for testable conditional independencies that are absent only when there is hidden confounding, and examine cases where the assumptions are violated: degenerate and dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies and semi-synthetic data based on a real-world dataset. In most cases, the proposed procedure correctly predicts the presence of hidden confounding, particularly when the confounding bias is large.

What Knowledge Gets Distilled in Knowledge Distillation?. (arXiv:2205.16004v3 [cs.CV] UPDATED)

Authors: Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, Yong Jae Lee

Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher? Does it start to localize objects in the same way? Does it get fooled by the same adversarial samples? Do its data invariance properties become similar? Our work presents a comprehensive study that tries to answer these questions. We show that existing methods can indeed indirectly distill these properties beyond improving task performance. We further study why knowledge distillation might work this way, and show that our findings have practical implications as well.

A Robust Backpropagation-Free Framework for Images. (arXiv:2206.01820v2 [cs.NE] UPDATED)

Authors: Timothy Zee, Alexander G. Ororbia, Ankur Mali, Ifeoma Nwogu

While current deep learning algorithms have been successful for a wide variety of artificial intelligence (AI) tasks, including those involving structured image data, they present deep neurophysiological conceptual issues due to their reliance on the gradients computed by backpropagation of errors (backprop). Gradients are required to obtain synaptic weight adjustments, but backward propagation requires knowledge of feed-forward activities, a biologically implausible process. This is known as the "weight transport problem". Therefore, in this work, we present a more biologically plausible approach towards solving the weight transport problem for image data. This approach, which we name the error kernel driven activation alignment (EKDAA) algorithm, accomplishes this through the introduction of locally derived error transmission kernels and error maps. Like standard deep learning networks, EKDAA performs the standard forward process via weights and activation functions; however, its backward error computation involves adaptive error kernels that propagate local error signals through the network. The efficacy of EKDAA is demonstrated by performing visual-recognition tasks on the Fashion MNIST, CIFAR-10 and SVHN benchmarks, along with demonstrating its ability to extract visual features from natural color images. Furthermore, in order to demonstrate its non-reliance on gradient computations, results are presented for an EKDAA-trained CNN that employs a non-differentiable activation function.

Distributed Machine Learning in D2D-Enabled Heterogeneous Networks: Architectures, Performance, and Open Challenges. (arXiv:2206.01906v2 [cs.LG] UPDATED)

Authors: Zhipeng Cheng, Xuwei Fan, Minghui Liwang, Ning Chen, Xiaoyu Xia, Xianbin Wang

The ever-growing concerns regarding data privacy have led to a paradigm shift in machine learning (ML) architectures from centralized to distributed approaches, giving rise to federated learning (FL) and split learning (SL) as the two predominant privacy-preserving ML mechanisms. However, implementing FL or SL in device-to-device (D2D)-enabled heterogeneous networks with diverse clients presents substantial challenges, including architecture scalability and prolonged training delays. To address these challenges, this article introduces two innovative hybrid distributed ML architectures, namely, hybrid split FL (HSFL) and hybrid federated SL (HFSL). Such architectures combine the strengths of both FL and SL in D2D-enabled heterogeneous wireless networks. We provide a comprehensive analysis of the performance and advantages of HSFL and HFSL, while also highlighting open challenges for future exploration. We support our proposals with preliminary simulations using three datasets in non-independent and non-identically distributed settings, demonstrating the feasibility of our architectures. Our simulations reveal notable reductions in communication/computation costs and training delays as compared to conventional FL and SL.

Fast Kernel Methods for Generic Lipschitz Losses via $p$-Sparsified Sketches. (arXiv:2206.03827v7 [stat.ML] UPDATED)

Authors: Tamim El Ahmad, Pierre Laforgue, Florence d'Alché-Buc

Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists in looking for solutions among a subspace of reduced dimension, is a well studied approach to alleviate these computational burdens. However, statistically-accurate sketches, such as the Gaussian one, usually contain few null entries, such that their application to kernel methods and their non-sparse Gram matrices remains slow in practice. In this paper, we show that sparsified Gaussian (and Rademacher) sketches still produce theoretically-valid approximations while allowing for important time and space savings thanks to an efficient \emph{decomposition trick}. To support our method, we derive excess risk bounds for both single and multiple output kernel problems, with generic Lipschitz losses, hereby providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. Our theoretical results are complemented with experiments showing the empirical superiority of our approach over SOTA sketching methods.
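
A sketch of what a p-sparsified Gaussian sketching matrix could look like, as we read the abstract (function name and the uniform Bernoulli pattern are ours; the paper's decomposition trick for fast application is not reproduced here):

```python
import numpy as np

def p_sparsified_gaussian_sketch(s, n, p, rng=None):
    """Sketch matrix S of shape (s, n): i.i.d. N(0,1) entries kept with
    probability p and rescaled so that E[S.T @ S] = I. Small p makes S
    sparse, so applying it to a dense Gram matrix is cheap."""
    rng = np.random.default_rng(rng)
    mask = rng.random((s, n)) < p            # Bernoulli(p) sparsity pattern
    gauss = rng.standard_normal((s, n))
    return (mask * gauss) / np.sqrt(s * p)

# Usage: reduce an n x n Gram matrix K to an s x s sketched problem,
# e.g. K_sketched = S @ K @ S.T for a subspace of dimension s << n.
```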

Gray Learning from Non-IID Data with Out-of-distribution Samples. (arXiv:2206.09375v2 [cs.LG] UPDATED)

Authors: Zhilin Zhao, Longbing Cao, Chang-Dong Wang

The integrity of training data, even when annotated by experts, is far from guaranteed, especially for non-IID datasets comprising both in- and out-of-distribution samples. In an ideal scenario, the majority of samples would be in-distribution, while samples that deviate semantically would be identified as out-of-distribution and excluded during the annotation process. However, experts may erroneously classify these out-of-distribution samples as in-distribution, assigning them labels that are inherently unreliable. This mixture of unreliable labels and varied data types makes the task of learning robust neural networks notably challenging. We observe that both in- and out-of-distribution samples can almost invariably be ruled out from belonging to certain classes, aside from those corresponding to unreliable ground-truth labels. This opens the possibility of utilizing reliable complementary labels that indicate the classes to which a sample does not belong. Guided by this insight, we introduce a novel approach, termed \textit{Gray Learning} (GL), which leverages both ground-truth and complementary labels. Crucially, GL adaptively adjusts the loss weights for these two label types based on prediction confidence levels. By grounding our approach in statistical learning theory, we derive bounds for the generalization error, demonstrating that GL achieves tight constraints even in non-IID settings. Extensive experimental evaluations reveal that our method significantly outperforms alternative approaches grounded in robust statistics.
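
One way the adaptive combination of ground-truth and complementary labels could be written; the confidence-based weighting rule below is an illustrative stand-in, not the paper's exact scheme:

```python
import torch
import torch.nn.functional as F

def gray_learning_loss(logits, labels, complementary_labels):
    """Illustrative loss combining a ground-truth term with a
    complementary-label term ("this sample is NOT class c"), weighted
    by the model's own confidence in the given label."""
    probs = F.softmax(logits, dim=-1)
    conf = probs.gather(1, labels.unsqueeze(1)).squeeze(1).detach()
    ce = F.cross_entropy(logits, labels, reduction="none")
    # Complementary term: push down the probability of the ruled-out class.
    p_comp = probs.gather(1, complementary_labels.unsqueeze(1)).squeeze(1)
    comp = -torch.log1p(-p_comp.clamp(max=1 - 1e-6))
    # Trust the ground-truth label in proportion to prediction confidence.
    return (conf * ce + (1 - conf) * comp).mean()
```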

A Contrastive Approach to Online Change Point Detection. (arXiv:2206.10143v3 [stat.ML] UPDATED)

Authors: Artur Goldman, Nikita Puchkin, Valeriia Shcherbakova, Uliana Vinogradova

We suggest a novel procedure for online change point detection. Our approach builds on the idea of maximizing a discrepancy measure between points from the pre-change and post-change distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove non-asymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and real-world data sets.
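
To make the pre-/post-change comparison concrete, a toy detector in this spirit (the paper's discrepancy measure is more general; the mean-gap statistic and window size here are ours):

```python
import numpy as np

def online_cpd_statistic(stream, window=50):
    """Toy discrepancy-based detector: at each step, compare the means of
    the two halves of a sliding window and report the standardized gap.
    A change point is declared when the statistic crosses a threshold."""
    stats = []
    for t in range(2 * window, len(stream) + 1):
        left = stream[t - 2 * window : t - window]    # "pre-change" block
        right = stream[t - window : t]                # "post-change" block
        pooled = np.sqrt((left.var() + right.var()) / window + 1e-12)
        stats.append(abs(right.mean() - left.mean()) / pooled)
    return np.array(stats)
```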

Multi-label Classification with High-rank and High-order Label Correlations. (arXiv:2207.04197v2 [cs.LG] UPDATED)

Authors: Chongjie Si, Yuheng Jia, Ran Wang, Min-Ling Zhang, Yanghe Feng, Chongxiao Qu

Exploiting label correlations is important to multi-label classification. Previous methods capture high-order label correlations mainly by transforming the label matrix to a latent label space with low-rank matrix factorization. However, the label matrix is generally a full-rank or approximately full-rank matrix, making low-rank factorization inappropriate. Besides, in the latent space, the label correlations become implicit. To this end, we propose a simple yet effective method to depict high-order label correlations explicitly while maintaining the high rank of the label matrix. Moreover, we estimate the label correlations and infer model parameters simultaneously via the local geometric structure of the input to achieve mutual enhancement. Comparative studies over twelve benchmark data sets validate the effectiveness of the proposed algorithm in multi-label classification. Empirically, the exploited high-order label correlations are consistent with common sense. Our code is publicly available at https://github.com/Chongjie-Si/HOMI.

Monte Carlo is a good sampling strategy for polynomial approximation in high dimensions. (arXiv:2208.09045v3 [math.NA] UPDATED)

Authors: Ben Adcock, Simone Brugiapaglia

This paper concerns the approximation of smooth, high-dimensional functions from limited samples using polynomials. This task lies at the heart of many applications in computational science and engineering - notably, some of those arising from parametric modelling and computational uncertainty quantification. It is common to use Monte Carlo sampling in such applications, so as not to succumb to the curse of dimensionality. However, it is well known that such a strategy is theoretically suboptimal. Specifically, there are many polynomial spaces of dimension $n$ for which the sample complexity scales log-quadratically, i.e., like $c \cdot n^2 \cdot \log(n)$ as $n \rightarrow \infty$. This well-documented phenomenon has led to a concerted effort over the last decade to design improved, and moreover, near-optimal strategies, whose sample complexities scale log-linearly, or even linearly in $n$. In this work we demonstrate that Monte Carlo is actually a perfectly good strategy in high dimensions, despite its apparent suboptimality. We first document this phenomenon empirically via a systematic set of numerical experiments. Next, we present a theoretical analysis that rigorously justifies this fact in the case of holomorphic functions of infinitely-many variables. We show that there is a least-squares approximation based on $m$ Monte Carlo samples whose error decays algebraically fast in $m/\log(m)$, with a rate that is the same as that of the best $n$-term polynomial approximation. This result is non-constructive, since it assumes knowledge of a suitable polynomial subspace in which to perform the approximation. We next present a compressed sensing-based scheme that achieves the same rate, except for a larger polylogarithmic factor. This scheme is practical, and numerically it performs as well as or better than well-known adaptive least-squares schemes.
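
A one-dimensional illustration of the Monte Carlo least-squares recipe the paper analyzes (in high dimensions the same scheme applies with tensorized or sparse polynomial bases; the function and sizes below are arbitrary):

```python
import numpy as np
from numpy.polynomial import legendre

def mc_least_squares(f, n_terms, m_samples, rng=None):
    """Fit a degree-(n_terms-1) Legendre expansion of f on [-1, 1] by
    least squares on m Monte Carlo sample points."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(-1.0, 1.0, m_samples)        # Monte Carlo samples
    V = legendre.legvander(x, n_terms - 1)       # design matrix
    coef, *_ = np.linalg.lstsq(V, f(x), rcond=None)
    return coef

coef = mc_least_squares(np.sin, n_terms=10, m_samples=200)
```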

Rethinking Symmetric Matrix Factorization: A More General and Better Clustering Perspective. (arXiv:2209.02528v3 [cs.LG] UPDATED)

Authors: Mengyuan Zhang, Kai Liu

Nonnegative matrix factorization (NMF) is widely used for clustering with strong interpretability. Among general NMF problems, symmetric NMF is a special one that plays an important role in graph clustering, where each element measures the similarity between data points. Most existing symmetric NMF algorithms require the factor matrices to be nonnegative and focus solely on minimizing the gap between the similarity matrix and its approximation, without considering other regularization terms that can yield better clustering. In this paper, we explore factorizing a symmetric matrix that does not have to be nonnegative, presenting an efficient factorization algorithm with a regularization term to boost the clustering performance. Moreover, a more general framework is proposed to solve symmetric matrix factorization problems with different constraints on the factor matrices.
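
A minimal sketch of unconstrained symmetric factorization by gradient descent, the relaxation discussed in the abstract (learning rate, regularizer, and initialization are illustrative; the paper's algorithm and regularization term are not reproduced):

```python
import numpy as np

def symmetric_factorize(A, rank, lr=1e-2, iters=2000, reg=0.0, rng=None):
    """Gradient descent on ||A - B B^T||_F^2 + reg * ||B||_F^2 with no
    nonnegativity constraint on B. Rows of B can then be clustered,
    e.g. with k-means, for graph clustering."""
    rng = np.random.default_rng(rng)
    B = 0.1 * rng.standard_normal((A.shape[0], rank))
    for _ in range(iters):
        R = B @ B.T - A                   # symmetric residual
        grad = 4 * R @ B + 2 * reg * B    # gradient of the objective in B
        B -= lr * grad
    return B
```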

Expanding continual few-shot learning benchmarks to include recognition of specific instances. (arXiv:2209.07863v3 [cs.NE] UPDATED)

Authors: Gideon Kowadlo, Abdelrahman Ahmed, Amir Mayan, David Rawlinson

Continual learning and few-shot learning are important frontiers in progress towards broader Machine Learning (ML) capabilities. There is a growing body of work in each, but few works combine the two. One exception is the Continual few-shot Learning (CFSL) framework of Antoniou et al. arXiv:2004.11967. In this study, we extend CFSL in two ways that capture a broader range of challenges, important for intelligent agent behaviour in real-world conditions. First, we modify CFSL to make it more comparable to standard continual learning experiments, where usually a much larger number of classes are presented. Second, we introduce an 'instance test' which requires recognition of specific instances of classes -- a capability of animal cognition that is usually neglected in ML. For an initial exploration of ML model performance under these conditions, we selected representative baseline models from the original CFSL work and added a model variant with replay. As expected, learning more classes is more difficult than in the original CFSL experiments, and interestingly, the way in which image instances and classes are presented affects classification performance. Surprisingly, accuracy in the baseline instance test is comparable to that of the other classification tasks, but degrades under significant occlusion and noise. The use of replay for consolidation improves performance substantially for both types of tasks, but particularly the instance test.

Mutual Information Regularized Offline Reinforcement Learning. (arXiv:2210.07484v2 [cs.LG] UPDATED)

Authors: Xiao Ma, Bingyi Kang, Zhongwen Xu, Min Lin, Shuicheng Yan

The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy or value for deviating from the behavior policy during policy improvement or evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. Hence, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments the policy evaluation and improvement by adding mutual information regularization. MISA is a general framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. We introduce 3 different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance. In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark, e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is available at https://github.com/sail-sg/MISA.

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. (arXiv:2210.07893v3 [stat.ML] UPDATED)

Authors: Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge

Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks.
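
A greedy stand-in for the minimum-separation condition the paper ties to numerical stability (the paper's modified cover tree achieves this far more efficiently; this sketch just conveys the constraint):

```python
import numpy as np

def min_separation_inducing_points(X, spacing):
    """Greedy selection: keep a candidate point only if it is at least
    `spacing` away from every point kept so far, enforcing a minimum
    pairwise separation among inducing points. X has shape (n, d)."""
    inducing = [X[0]]
    for x in X[1:]:
        if min(np.linalg.norm(x - z) for z in inducing) >= spacing:
            inducing.append(x)
    return np.array(inducing)
```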

Multi-scale data reconstruction of turbulent rotating flows with Gappy POD, Extended POD and Generative Adversarial Networks. (arXiv:2210.11921v2 [physics.flu-dyn] UPDATED)

Authors: Tianyi Li, Michele Buzzicotti, Luca Biferale, Fabio Bonaccorso, Shiyi Chen, Minping Wan

Data reconstruction of rotating turbulent snapshots is investigated utilizing data-driven tools. This problem is crucial for numerous geophysical applications and fundamental aspects, given the concurrent effects of direct and inverse energy cascades, which lead to non-Gaussian statistics at both large and small scales. Data assimilation also serves as a tool to rank physical features within turbulence, by evaluating the performance of reconstruction in terms of the quality and quantity of the information used. Additionally, benchmarking various reconstruction techniques is essential to assess the trade-off between quantitative supremacy, implementation complexity, and explicability. In this study, we use linear and non-linear tools based on the Proper Orthogonal Decomposition (POD) and Generative Adversarial Network (GAN) for reconstructing rotating turbulence snapshots with spatial damages (inpainting). We focus on accurately reproducing both statistical properties and instantaneous velocity fields. Different gap sizes and gap geometries are investigated in order to assess the importance of coherency and multi-scale properties of the missing information. Surprisingly enough, concerning point-wise reconstruction, the non-linear GAN does not outperform one of the linear POD techniques. On the other hand, supremacy of the GAN approach is shown when the statistical multi-scale properties are compared. Similarly, extreme events in the gap region are better predicted when using GAN. The balance between point-wise error and statistical properties is controlled by the adversarial ratio, which determines the relative importance of the generator and the discriminator in the GAN training. Robustness against the measurement noise is also discussed.
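
For concreteness, a minimal Gappy POD reconstruction step, assuming a precomputed POD basis and a binary damage mask over flattened fields (shapes and names are ours; the paper's Extended POD and GAN pipelines are not reproduced):

```python
import numpy as np

def gappy_pod_reconstruct(snapshot, mask, pod_basis):
    """Fit POD coefficients by least squares using only the observed
    (unmasked) entries, then evaluate the expansion everywhere,
    filling in the gap. pod_basis: (n_points, n_modes)."""
    obs = mask.astype(bool)
    coef, *_ = np.linalg.lstsq(pod_basis[obs], snapshot[obs], rcond=None)
    return pod_basis @ coef   # reconstructed field, gap included
```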

Federated Learning and Meta Learning: Approaches, Applications, and Directions. (arXiv:2210.13111v2 [cs.LG] UPDATED)

Authors: Xiaonan Liu, Yansha Deng, Arumugam Nallanathan, Mehdi Bennis

Over the past few years, significant advancements have been made in the field of machine learning (ML) to address resource management, interference management, autonomy, and decision-making in wireless networks. Traditional ML approaches rely on centralized methods, where data is collected at a central server for training. However, this approach poses a challenge in terms of preserving the data privacy of devices. To address this issue, federated learning (FL) has emerged as an effective solution that allows edge devices to collaboratively train ML models without compromising data privacy. In FL, local datasets are not shared, and the focus is on learning a global model for a specific task involving all devices. However, FL has limitations when it comes to adapting the model to devices with different data distributions. In such cases, meta learning is considered, as it enables the adaptation of learning models to different data distributions using only a few data samples. In this tutorial, we present a comprehensive review of FL, meta learning, and federated meta learning (FedMeta). Unlike other tutorial papers, our objective is to explore how FL, meta learning, and FedMeta methodologies can be designed, optimized, and evolved, and their applications over wireless networks. We also analyze the relationships among these learning algorithms and examine their advantages and disadvantages in real-world applications.

MARLlib: A Scalable and Efficient Multi-agent Reinforcement Learning Library. (arXiv:2210.13708v4 [cs.LG] UPDATED)

Authors: Siyi Hu, Yifan Zhong, Minquan Gao, Weixun Wang, Hao Dong, Xiaodan Liang, Zhihui Li, Xiaojun Chang, Yaodong Yang

A significant challenge facing researchers in the area of multi-agent reinforcement learning (MARL) pertains to the identification of a library that can offer fast and compatible development for multi-agent tasks and algorithm combinations, while obviating the need to consider compatibility issues. In this paper, we present MARLlib, a library designed to address the aforementioned challenge by leveraging three key mechanisms: 1) a standardized multi-agent environment wrapper, 2) an agent-level algorithm implementation, and 3) a flexible policy mapping strategy. By utilizing these mechanisms, MARLlib can effectively disentangle the intertwined nature of the multi-agent task and the learning process of the algorithm, with the ability to automatically alter the training strategy based on the current task's attributes. The MARLlib library's source code is publicly accessible on GitHub: \url{https://github.com/Replicable-MARL/MARLlib}.

When Do We Need Graph Neural Networks for Node Classification?. (arXiv:2210.16979v2 [cs.LG] UPDATED)

Authors: Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Xiao-Wen Chang, Doina Precup

Graph Neural Networks (GNNs) extend basic Neural Networks (NNs) by additionally making use of graph structure based on the relational inductive bias (edge bias), rather than treating the nodes as collections of independent and identically distributed (i.i.d.) samples. Though GNNs are believed to outperform basic NNs in real-world tasks, it is found that in some cases, GNNs have little performance gain or even underperform graph-agnostic NNs. To identify these cases, based on graph signal processing and statistical hypothesis testing, we propose two measures which analyze the cases in which the edge bias in features and labels does not provide advantages. Based on the measures, a threshold value can be given to predict the potential performance advantages of graph-aware models over graph-agnostic models.

Using multimodal learning and deep generative models for corporate bankruptcy prediction. (arXiv:2211.08405v4 [q-fin.RM] UPDATED)

Authors: Rogelio A. Mancisidor, Kjersti Aas

Textual data from financial filings, e.g., the Management's Discussion \& Analysis (MDA) section in Form 10-K, has been used to improve the prediction accuracy of bankruptcy models. In practice, however, we cannot obtain the MDA section for all public companies. The two main reasons for the lack of MDA are: (i) not all companies are obliged to submit the MDA and (ii) technical problems arise when crawling and scraping the MDA section. To the best of our knowledge, this research is the first to introduce the concept of multimodal learning into bankruptcy prediction models, addressing the problem that the MDA text cannot be obtained for some companies. We use the Conditional Multimodal Discriminative (CMMD) model to learn multimodal representations that embed information from accounting, market, and textual modalities. The CMMD model needs a sample with all data modalities for model training. At test time, the CMMD model only needs access to the accounting and market modalities to generate multimodal representations, which are further used to make bankruptcy predictions. This makes the use of bankruptcy prediction models with textual data realistic and practical, since accounting and market data, unlike textual data, are available for all companies. The empirical results in this research show that the classification performance of our proposed methodology is superior to that of a large number of traditional classifier models. We also show that our proposed methodology solves the limitation of previous bankruptcy models using textual data, which can only make predictions for a small proportion of companies.

A Theory of Unsupervised Translation Motivated by Understanding Animal Communication. (arXiv:2211.11081v2 [cs.CL] UPDATED)

Authors: Shafi Goldwasser, David F. Gruber, Adam Tauman Kalai, Orr Paradise

Neural networks are capable of translating between languages -- in some cases even between two languages where there is little or no access to parallel translations, in what is known as Unsupervised Machine Translation (UMT). Given this progress, it is intriguing to ask whether machine learning tools can ultimately enable understanding animal communication, particularly that of highly intelligent animals. We propose a theoretical framework for analyzing UMT when no parallel translations are available and when it cannot be assumed that the source and target corpora address related subject domains or possess similar linguistic structure. We exemplify this theory with two stylized models of language, for which our framework provides bounds on necessary sample complexity; the bounds are formally proven and experimentally verified on synthetic data. These bounds show that the error rates are inversely related to the language complexity and amount of common ground. This suggests that unsupervised translation of animal communication may be feasible if the communication system is sufficiently complex.

BatmanNet: Bi-branch Masked Graph Transformer Autoencoder for Molecular Representation. (arXiv:2211.13979v3 [cs.LG] UPDATED)

Authors: Zhen Wang, Zheng Feng, Yanjun Li, Bowen Li, Yongrui Wang, Chulin Sha, Min He, Xiaolin Li

Although substantial efforts have been made using graph neural networks (GNNs) for AI-driven drug discovery (AIDD), effective molecular representation learning remains an open challenge, especially in the case of insufficient labeled molecules. Recent studies suggest that big GNN models pre-trained by self-supervised learning on unlabeled datasets enable better transfer performance in downstream molecular property prediction tasks. However, the approaches in these studies require multiple complex self-supervised tasks and large-scale datasets, which are time-consuming, computationally expensive, and difficult to pre-train end-to-end. Here, we design a simple yet effective self-supervised strategy to simultaneously learn local and global information about molecules, and further propose a novel bi-branch masked graph transformer autoencoder (BatmanNet) to learn molecular representations. BatmanNet features two tailored complementary and asymmetric graph autoencoders to reconstruct the missing nodes and edges, respectively, from a masked molecular graph. With this design, BatmanNet can effectively capture the underlying structure and semantic information of molecules, thus improving the performance of molecular representation. BatmanNet achieves state-of-the-art results for multiple drug discovery tasks, including molecular property prediction, drug-drug interaction, and drug-target interaction, on 13 benchmark datasets, demonstrating its great potential and superiority in molecular representation learning.

Estimation and inference for transfer learning with high-dimensional quantile regression. (arXiv:2211.14578v3 [stat.ML] UPDATED)

Authors: Jiayu Huang, Mingqiu Wang, Yuanshan Wu

Transfer learning has become an essential technique for exploiting information from a source domain to boost performance on a target task. Despite their prevalence in high-dimensional data, heterogeneity and heavy tails are insufficiently accounted for by current transfer learning approaches and may thus undermine the resulting performance. We propose a transfer learning procedure in the framework of high-dimensional quantile regression models to accommodate heterogeneity and heavy tails in the source and target domains. We establish error bounds for the transfer learning estimator based on delicately selected transferable source domains, showing that lower error bounds can be achieved with a suitable selection criterion and larger sample sizes of the source tasks. We further propose valid confidence interval and hypothesis test procedures for individual components of the high-dimensional quantile regression coefficients by advocating a double transfer learning estimator, a one-step debiased estimator for the transfer learning estimator in which the transfer learning technique is applied again. By adopting a data-splitting technique, we advocate a transferability detection approach that provably circumvents negative transfer and identifies transferable sources with high probability. Simulation results demonstrate that the proposed method exhibits favorable and compelling performance, and its practical utility is further illustrated by the analysis of a real example.

Performance evaluation of deep segmentation models for Contrails detection. (arXiv:2211.14851v4 [cs.CV] UPDATED)

Authors: Akshat Bhandari, Sriya Rallabandi, Sanchit Singhal, Aditya Kasliwal, Pratinav Seth

Contrails, short for condensation trails, are line-shaped ice clouds produced by aircraft engine exhaust when they fly through cold and humid air. They generate a greenhouse effect by absorbing or directing back to Earth approximately 33% of emitted outgoing longwave radiation, and they account for over half of the climate change resulting from aviation activities. Avoiding contrails and adjusting flight routes could be an inexpensive and effective way to reduce their impact. An accurate, automated, and reliable detection algorithm is required to develop and evaluate contrail avoidance strategies. Advancement in contrail detection has been severely limited by several factors, primarily a lack of quality-labeled data. Recently, a large human-labeled Landsat-8 contrails dataset was proposed, in which each contrail is carefully labeled with various inputs across various scenes of Landsat-8 satellite imagery. In this work, we benchmark several popular segmentation models with combinations of different loss functions and encoder backbones. This work is the first to apply state-of-the-art segmentation techniques to detect contrails in low-orbit satellite imagery. Our work can also be used as an open benchmark for contrail segmentation and is publicly available.

Are you using test log-likelihood correctly?. (arXiv:2212.00219v3 [stat.ML] UPDATED)

Authors: Sameer K. Deshpande, Soumya Ghosh, Tin D. Nguyen, Tamara Broderick

Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.
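
A minimal sketch of reporting both metrics side by side, as the paper's examples suggest one should (a Gaussian predictive distribution is assumed here for the log-likelihood; names are ours):

```python
import numpy as np
from scipy.stats import norm

def test_metrics(y_test, pred_mean, pred_std):
    """Average Gaussian predictive test log-likelihood alongside RMSE.
    The paper's point: rankings by these two metrics need not agree,
    so neither should be treated as a universal model score."""
    tll = norm.logpdf(y_test, loc=pred_mean, scale=pred_std).mean()
    rmse = np.sqrt(np.mean((y_test - pred_mean) ** 2))
    return tll, rmse
```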

Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes. (arXiv:2212.06132v3 [cs.LG] UPDATED)

Authors: Jiafan He, Heyang Zhao, Dongruo Zhou, Quanquan Gu

We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition probability can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the optimal value function, (2) monotonically decreases with respect to the number of episodes to ensure a better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator to control the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.

Efficient First-order Methods for Convex Optimization with Strongly Convex Function Constraints. (arXiv:2212.11143v3 [math.OC] UPDATED)

Authors: Zhenwei Lin, Qi Deng

In this paper, we introduce faster first-order primal-dual algorithms for minimizing a convex function subject to strongly convex function constraints. Before our work, the best complexity bound was $\mathcal{O}(1/{\varepsilon})$, and it remains unclear how to improve this result by leveraging the strong convexity assumption. We address this issue by developing novel techniques to progressively estimate the strong convexity of the Lagrangian function. Our approach yields an improved complexity of $\mathcal{O}(1/\sqrt{\varepsilon})$, matching the complexity lower bound for strongly-convex-concave saddle point optimization. We show the superior performance of our methods in sparsity-inducing constrained optimization, notably Google's personalized PageRank problem. Furthermore, we show that a restarted version of the proposed methods can effectively identify the sparsity pattern of the optimal solution within a finite number of steps, a result that appears to have independent significance.

Robust Meta-Representation Learning via Global Label Inference and Classification. (arXiv:2212.11702v2 [cs.LG] UPDATED)

Authors: Ruohan Wang, Isak Falk, Massimiliano Pontil, Carlo Ciliberto

Few-shot learning (FSL) is a central problem in meta-learning, where learners must efficiently learn from few labeled examples. Within FSL, feature pre-training has recently become an increasingly popular strategy to significantly improve generalization performance. However, the contribution of pre-training is often overlooked and understudied, with limited theoretical understanding of its impact on meta-learning performance. Further, pre-training requires a consistent set of global labels shared across training tasks, which may be unavailable in practice. In this work, we address the above issues by first showing the connection between pre-training and meta-learning. We discuss why pre-training yields more robust meta-representations and connect the theoretical analysis to existing works and empirical results. Secondly, we introduce Meta Label Learning (MeLa), a novel meta-learning algorithm that learns task relations by inferring global labels across tasks. This allows us to exploit pre-training for FSL even when global labels are unavailable or ill-defined. Lastly, we introduce an augmented pre-training procedure that further improves the learned meta-representation. Empirically, MeLa outperforms existing methods across a diverse range of benchmarks, in particular under a more challenging setting where the number of training tasks is limited and labels are task-specific. We also provide an extensive ablation study to highlight its key properties.

Robust representations of oil wells' intervals via sparse attention mechanism. (arXiv:2212.14246v3 [cs.LG] UPDATED)

Authors: Alina Ermilova, Nikita Baramiia, Valerii Kornilov, Sergey Petrakov, Alexey Zaytsev

Transformer-based neural network architectures achieve state-of-the-art results in different domains, from natural language processing (NLP) to computer vision (CV). The key idea of Transformers, the attention mechanism, has already led to significant breakthroughs in many areas, and attention has found applications to time series data as well. However, due to the quadratic complexity of the attention calculation with respect to input sequence length, the application of Transformers is limited by high resource demands. Moreover, their modifications for industrial time series need to be robust to missing or noisy values, which limits the breadth of their application. To cope with these issues, we introduce the class of efficient Transformers named Regularized Transformers (Reguformers). We implement a regularization technique inspired by dropout to improve robustness and reduce computational expenses. The focus of our experiments is on oil & gas data, namely well logs, a prominent example of multivariate time series. The goal is to solve the problems of similarity and representation learning for them. To evaluate our models on such problems, we work with an industry-scale open dataset consisting of well logs from more than 20 wells. The experiments show that all variations of Reguformers outperform the previously developed RNNs, the classical Transformer model, and robust modifications of it such as Informer and Performer in terms of well-interval classification and the quality of the obtained well-interval representations. Moreover, the robustness of our models to missing and incorrect data exceeds that of the others by a significant margin. The best result that the Reguformer achieves on the well-interval similarity task is a mean PR AUC of 0.983, which is comparable to the classical Transformer and outperforms the previous models.

Generative Adversarial Networks to infer velocity components in rotating turbulent flows. (arXiv:2301.07541v2 [physics.flu-dyn] UPDATED)

Authors: Tianyi Li, Michele Buzzicotti, Luca Biferale, Fabio Bonaccorso

Inference problems for two-dimensional snapshots of rotating turbulent flows are studied. We perform a systematic quantitative benchmark of the point-wise and statistical reconstruction capabilities of the linear Extended Proper Orthogonal Decomposition (EPOD) method, a non-linear Convolutional Neural Network (CNN) and a Generative Adversarial Network (GAN). We attack the important task of inferring one velocity component out of the measurement of a second one, and two cases are studied: (I) both components lie in the plane orthogonal to the rotation axis and (II) one of the two is parallel to the rotation axis. We show that the EPOD method works well only for the former case, where both components are strongly correlated, while CNN and GAN always outperform EPOD in both point-wise and statistical reconstructions. For case (II), when the input and output data are weakly correlated, all methods fail to reconstruct faithfully the point-wise information. In this case, only GAN is able to reconstruct the field in a statistical sense. The analysis is performed using both standard validation tools based on $L_2$ spatial distance between the prediction and the ground truth and more sophisticated multi-scale analysis using wavelet decomposition. Statistical validation is based on standard Jensen-Shannon divergence between the probability density functions, spectral properties and multi-scale flatness.

AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation. (arXiv:2301.08110v5 [cs.LG] UPDATED)

Authors: Björn Deiseroth, Mayukh Deb, Samuel Weinbach, Manuel Brack, Patrick Schramowski, Kristian Kersting

Generative transformer models have become increasingly complex, with large numbers of parameters and the ability to process multiple input modalities. Current methods for explaining their predictions are resource-intensive. Most crucially, they require prohibitively large amounts of extra memory, since they rely on backpropagation, which allocates almost twice as much GPU memory as the forward pass. This makes it difficult, if not impossible, to use them in production. We present AtMan, which provides explanations of generative transformer models at almost no extra cost. Specifically, AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input with respect to the output prediction. Instead of using backpropagation, AtMan applies a parallelizable token-based search method based on cosine similarity neighborhood in the embedding space. Our exhaustive experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics while being computationally efficient. As such, AtMan is suitable for use in large model inference deployments.
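
A toy sketch of the single-token attention-suppression idea (the paper additionally uses embedding-space cosine-similarity neighborhoods for images, and its suppression factor is tunable; the fixed shift below is ours for brevity):

```python
import torch

def suppress_token(attn_scores, token_idx, shift=10.0):
    """Perturbation in the spirit of AtMan: push down the pre-softmax
    attention scores toward one input token so its post-softmax weight
    is driven toward zero. Relevance of token i can then be estimated as
    log p(target | original) - log p(target | suppress token i).
    attn_scores: (batch, heads, n_query, n_key) pre-softmax scores."""
    perturbed = attn_scores.clone()
    perturbed[..., token_idx] = perturbed[..., token_idx] - shift
    return perturbed
```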

PhysGraph: Physics-Based Integration Using Graph Neural Networks. (arXiv:2301.11841v2 [cs.GR] UPDATED)

Authors: Oshri Halimi, Egor Larionov, Zohar Barzelay, Philipp Herholz, Tuur Stuyck

Physics-based simulation of mesh-based domains remains a challenging task. State-of-the-art techniques can produce realistic results but require expert knowledge. A major bottleneck in many approaches is the step of integrating a potential energy in order to compute velocities or displacements. Recently, learning-based methods for physics-based simulation have sparked interest, with graph-based approaches being a promising research direction. One of the challenges for these methods is to generate models that are mesh independent and generalize to different material properties. Moreover, the model should also be able to react to unforeseen external forces like ubiquitous collisions. Our contribution is based on a simple observation: evaluating forces is computationally relatively cheap for traditional simulation methods and can be computed in parallel, in contrast to their integration. If we learn how a system reacts to forces in general, irrespective of their origin, we can learn an integrator that can predict state changes due to the total forces with high generalization power. We effectively factor out the physical model behind the resulting forces by relying on an opaque force module. We demonstrate that this idea leads to a learnable module that can be trained on basic internal forces of small mesh patches and generalizes to different mesh topologies, resolutions, material parameters and unseen forces like collisions at inference time. Our proposed paradigm is general and can be used to model a variety of physical phenomena. We focus our exposition on the detail enhancement of coarse clothing geometry, which has many applications including computer games, virtual reality and virtual try-on.

A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence. (arXiv:2301.13139v3 [stat.ML] UPDATED)

Authors: Carlo Alfano, Rui Yuan, Patrick Rebeschini

Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.
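
As context for the abstract's framework, the tabular policy mirror descent update can be written as follows (our notation; the paper's parameterized scheme generalizes this):

$$ \pi_{t+1}(\cdot \mid s) \;=\; \operatorname*{arg\,max}_{p \in \Delta(\mathcal{A})} \Big\{ \eta \,\big\langle Q^{\pi_t}(s,\cdot),\, p \big\rangle \;-\; D_{\Phi}\big(p,\ \pi_t(\cdot \mid s)\big) \Big\}, $$

where $D_{\Phi}$ is the Bregman divergence of a mirror map $\Phi$. Choosing the negative entropy as $\Phi$ makes $D_{\Phi}$ the KL divergence and recovers the softmax (natural policy gradient) update mentioned in the abstract.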

The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v3 [cs.LG] UPDATED)

Authors: Xingyu Xu, Yandi Shen, Yuejie Chi, Cong Ma

We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparametrized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overparameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
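
As we read the abstract, in the symmetric factored parameterization $M = XX^\top$ the damped preconditioned update takes the shape (notation ours; constants and the asymmetric case follow the paper):

$$ X_{t+1} \;=\; X_t \;-\; \eta\, \nabla f(X_t)\,\big(X_t^\top X_t + \lambda I\big)^{-1}, $$

where the damping term $\lambda I$ keeps the preconditioner well-conditioned under overparameterization, while $\lambda = 0$ recovers plain scaled gradient descent.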

Flat Seeking Bayesian Neural Networks. (arXiv:2302.02713v5 [cs.LG] UPDATED)

Authors: Van-Anh Nguyen, Tung-Long Vuong, Hoang Phan, Thanh-Toan Do, Dinh Phung, Trung Le

Bayesian Neural Networks (BNNs) provide a probabilistic interpretation for deep learning models by imposing a prior distribution over model parameters and inferring a posterior distribution based on observed data. The model sampled from the posterior distribution can be used for providing ensemble predictions and quantifying prediction uncertainty. It is well-known that deep learning models with lower sharpness have better generalization ability. However, existing posterior inferences are not aware of sharpness/flatness in terms of formulation, possibly leading to high sharpness for the models sampled from them. In this paper, we develop theories, the Bayesian setting, and the variational inference approach for the sharpness-aware posterior. Specifically, the models sampled from our sharpness-aware posterior, and the optimal approximate posterior estimating this sharpness-aware posterior, have better flatness, hence possibly possessing higher generalization ability. We conduct experiments by leveraging the sharpness-aware posterior with state-of-the-art Bayesian Neural Networks, showing that the flat-seeking counterparts outperform their baselines in all metrics of interest.

Detection and Localization of Melanoma Skin Cancer in Histopathological Whole Slide Images. (arXiv:2302.03014v4 [eess.IV] UPDATED)

Authors: Neel Kanwal, Roger Amundsen, Helga Hardardottir, Luca Tomasetti, Erling Sandoy Undersrud, Emiel A.M. Janssen, Kjersti Engan

Melanoma diagnosed and treated in its early stages can increase the survival rate. A projected increase in skin cancer incidents and a dearth of dermatopathologists have emphasized the need for computational pathology (CPATH) systems. CPATH systems with deep learning (DL) models have the potential to identify the presence of melanoma by exploiting underlying morphological and cellular features. This paper proposes a DL method to detect melanoma and distinguish between normal skin and benign/malignant melanocytic lesions in Whole Slide Images (WSI). Our method detects lesions with high accuracy and localizes them on a WSI to identify potential regions of interest for pathologists. Interestingly, our DL method relies on using a single CNN network to create localization maps first and use them to perform slide-level predictions to determine patients who have melanoma. Our best model provides favorable patch-wise classification results with a 0.992 F1 score and 0.99 sensitivity on unseen data. The source code is https://github.com/RogerAmundsen/Melanoma-Diagnosis-and-Localization-from-Whole-Slide-Images-using-Convolutional-Neural-Networks.

A unified recipe for deriving (time-uniform) PAC-Bayes bounds. (arXiv:2302.03421v4 [stat.ML] UPDATED)

Authors: Ben Chugg, Hongjian Wang, Aaditya Ramdas

We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. Our main result is a PAC-Bayes theorem which holds for a wide class of discrete stochastic processes. We show how this result implies time-uniform versions of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds. Our framework also enables us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.

Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. (arXiv:2302.06960v3 [stat.ML] UPDATED)

Authors: Fadhel Ayed, Soufiane Hayou

Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; in [Sorscher et al.], the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law.

In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch" theorems for data pruning and present calibration protocols that enhance the performance of existing pruning algorithms in this high compression regime using randomization.
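
An illustrative randomized calibration of a score-based pruning rule (the blend ratio and selection rule below are ours; the paper's protocols are more refined):

```python
import numpy as np

def calibrated_prune(scores, keep_frac, random_frac=0.5, rng=None):
    """Keep part of the retention budget by score and fill the rest
    uniformly at random. Such a blend guards against the failure of
    pure score-based selection in the high compression regime."""
    rng = np.random.default_rng(rng)
    n_keep = int(keep_frac * len(scores))
    n_rand = int(random_frac * n_keep)
    by_score = np.argsort(scores)[::-1][: n_keep - n_rand]  # top scores
    rest = np.setdiff1d(np.arange(len(scores)), by_score)
    random_part = rng.choice(rest, size=n_rand, replace=False)
    return np.concatenate([by_score, random_part])  # indices to keep
```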

Bandit Social Learning: Exploration under Myopic Behavior. (arXiv:2302.07425v4 [cs.GT] UPDATED)

Authors: Kiarash Banihashem, MohammadTaghi Hajiaghayi, Suho Shin, Aleksandrs Slivkins

We study social learning dynamics motivated by reviews on online platforms. The agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard to exploration. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals for the arms' expected rewards. We derive stark learning failures for any such behavior, and provide matching positive results. As a special case, we obtain the first general results on the failure of the greedy algorithm in bandits, thus providing a theoretical foundation for why bandit algorithms should explore.

ClaPIM: Scalable Sequence CLAssification using Processing-In-Memory. (arXiv:2302.08284v2 [cs.LG] UPDATED)

Authors: Marcel Khalifa, Barak Hoffer, Orian Leitersdorf, Robert Hanhan, Ben Perach, Leonid Yavits, Shahar Kvatinsky

DNA sequence classification is a fundamental task in computational biology with vast implications for applications such as disease prevention and drug design. Therefore, fast high-quality sequence classifiers are significantly important. This paper introduces ClaPIM, a scalable DNA sequence classification architecture based on the emerging concept of hybrid in-crossbar and near-crossbar memristive processing-in-memory (PIM). We enable efficient and high-quality classification by uniting the filter and search stages within a single algorithm. Specifically, we propose a custom filtering technique that drastically narrows the search space and a search approach that facilitates approximate string matching through a distance function. ClaPIM is the first PIM architecture for scalable approximate string matching that benefits from the high density of memristive crossbar arrays and the massive computational parallelism of PIM. Compared with Kraken2, a state-of-the-art software classifier, ClaPIM provides significantly higher classification quality (up to 20x improvement in F1 score) and also demonstrates a 1.8x throughput improvement. Compared with EDAM, a recently-proposed SRAM-based accelerator that is restricted to small datasets, we observe both a 30.4x improvement in normalized throughput per area and a 7% increase in classification precision.

Learning a Consensus Sub-Network with Polarization Regularization and One Pass Training. (arXiv:2302.10798v4 [cs.LG] UPDATED)

Authors: Xiaoying Zhi, Varun Babbar, Pheobe Sun, Fran Silavong, Ruibo Shi, Sean Moran

The subject of green AI has been gaining attention within the deep learning community given the recent trend of ever larger and more complex neural network models. Existing solutions for reducing the computational load at inference time usually involve pruning the network parameters. Pruning schemes often create extra overhead, either through iterative training and fine-tuning for static pruning or repeated computation of a dynamic pruning graph. We propose a new parameter pruning strategy for learning a lighter-weight sub-network that minimizes the energy cost while maintaining comparable performance to the fully parameterised network on given downstream tasks. Our proposed pruning scheme is green-oriented, as it only requires a one-off training to discover the optimal static sub-networks by dynamic pruning methods. The pruning scheme consists of a binary gating module and a novel loss function to uncover sub-networks with user-defined sparsity. Our method enables pruning and training simultaneously, which saves energy in both the training and inference phases and avoids extra computational overhead from gating modules at inference time. Our results on CIFAR-10 and CIFAR-100 suggest that our scheme can remove 50% of connections in deep networks with less than 1% reduction in classification accuracy. Compared to other related pruning methods, our method demonstrates a lower drop in accuracy for equivalent reductions in computational cost.
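
A sketch of what a binary gating module with a polarization-style penalty could look like (the straight-through estimator and the specific penalty are plausible stand-ins, not the paper's exact design):

```python
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    """Linear layer whose output channels are masked by learnable
    binary gates; gated-off channels can be pruned after training."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.gate_logits = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        soft = torch.sigmoid(self.gate_logits)
        hard = (soft > 0.5).float()
        gate = hard + soft - soft.detach()   # straight-through estimator
        return self.linear(x) * gate

    def polarization_penalty(self):
        # Zero only when every gate sits at exactly 0 or 1,
        # so minimizing it polarizes the gates.
        soft = torch.sigmoid(self.gate_logits)
        return (soft * (1 - soft)).mean()
```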

Benign Overfitting for Two-layer ReLU Convolutional Neural Networks. (arXiv:2303.04145v2 [cs.LG] UPDATED)

Authors: Yiwen Kou, Zixiang Chen, Yuanzhou Chen, Quanquan Gu

Modern deep learning models with great expressive power can be trained to overfit the training data but still generalize well. This phenomenon is referred to as \textit{benign overfitting}. Recently, a few studies have attempted to theoretically understand benign overfitting in neural networks. However, these works are either limited to neural networks with smooth activation functions or to the neural tangent kernel regime. How and when benign overfitting can occur in ReLU neural networks remains an open problem. In this work, we seek to answer this question by establishing algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise. We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk. Our result also reveals a sharp transition between benign and harmful overfitting under different conditions on data distribution in terms of test risk. Experiments on synthetic data back up our theory.

Multilevel Diffusion: Infinite Dimensional Score-Based Diffusion Models for Image Generation. (arXiv:2303.04772v3 [cs.LG] UPDATED)

Authors: Paul Hagemann, Sophie Mildenberger, Lars Ruthotto, Gabriele Steidl, Nicole Tianjiao Yang

Score-based diffusion models (SBDM) have recently emerged as state-of-the-art approaches for image generation. Existing SBDMs are typically formulated in a finite-dimensional setting, where images are considered as tensors of finite size. This paper develops SBDMs in the infinite-dimensional setting, that is, we model the training data as functions supported on a rectangular domain. Besides the quest for generating images at ever higher resolution, our primary motivation is to create a well-posed infinite-dimensional learning problem so that we can discretize it consistently on multiple resolution levels. We thereby intend to obtain diffusion models that generalize across different resolution levels and improve the efficiency of the training process. We demonstrate how to overcome two shortcomings of current SBDM approaches in the infinite-dimensional setting. First, we modify the forward process to ensure that the latent distribution is well-defined in the infinite-dimensional setting using the notion of trace class operators. We derive the reverse processes for finite approximations. Second, we illustrate that approximating the score function with an operator network is beneficial for multilevel training. After deriving the convergence of the discretization and the approximation of multilevel training, we implement an infinite-dimensional SBDM approach and show the first promising results on MNIST and Fashion-MNIST, underlining our developed theory.

Flooding with Absorption: An Efficient Protocol for Heterogeneous Bandits over Complex Networks. (arXiv:2303.05445v3 [cs.LG] UPDATED)

Authors: Junghyun Lee, Laura Schmid, Se-Young Yun

Multi-armed bandits are extensively used to model sequential decision-making, making them ubiquitous in many real-life applications such as online recommender systems and wireless networking. We consider a multi-agent setting where each agent solves their own bandit instance endowed with a different set of arms. Their goal is to minimize their group regret while collaborating via some communication protocol over a given network. Previous literature on this problem only considered arm heterogeneity and networked agents separately. In this work, we introduce a setting that encompasses both features. For this novel setting, we first provide a rigorous regret analysis for a standard flooding protocol combined with the classic UCB policy. Then, to mitigate the issue of high communication costs incurred by flooding in complex networks, we propose a new protocol called Flooding with Absorption (FwA). We provide a theoretical analysis of the resulting regret bound and discuss the advantages of using FwA over flooding. Lastly, we experimentally verify, across various scenarios including dynamic networks, that FwA leads to significantly lower communication costs with minimal loss in regret performance compared to other network protocols.
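
A toy rendering of the absorption idea as we understand it from the abstract (TTL handling, deduplication, and data structures are illustrative; the protocol's details follow the paper):

```python
def flood_with_absorption(graph, arms_of, source, arm, ttl):
    """Flood a message about `arm` from `source` through the network,
    but let any agent whose own action set contains the arm absorb it,
    i.e. receive it without re-forwarding. graph: dict node -> list of
    neighbors; arms_of: dict node -> set of arms. Returns receivers."""
    frontier, seen = {source}, {source}
    for _ in range(ttl):
        nxt = set()
        for u in frontier:
            if u != source and arm in arms_of[u]:
                continue                       # absorbed: stop forwarding
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    nxt.add(v)
        frontier = nxt
    return seen
```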

Stochastic Interpolants: A Unifying Framework for Flows and Diffusions. (arXiv:2303.08797v3 [cs.LG] UPDATED)

Authors: Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden

A class of generative models that unifies flow-based and diffusion-based methods is introduced. These models extend the framework proposed in Albergo & Vanden-Eijnden (2023), enabling the use of a broad class of continuous-time stochastic processes called `stochastic interpolants' to bridge any two arbitrary probability density functions exactly in finite time. These interpolants are built by combining data from the two prescribed densities with an additional latent variable that shapes the bridge in a flexible way. The time-dependent probability density function of the stochastic interpolant is shown to satisfy a first-order transport equation as well as a family of forward and backward Fokker-Planck equations with tunable diffusion coefficient. Upon consideration of the time evolution of an individual sample, this viewpoint immediately leads to both deterministic and stochastic generative models based on probability flow equations or stochastic differential equations with an adjustable level of noise. The drift coefficients entering these models are time-dependent velocity fields characterized as the unique minimizers of simple quadratic objective functions, one of which is a new objective for the score of the interpolant density. We show that minimization of these quadratic objectives leads to control of the likelihood for generative models built upon stochastic dynamics, while likelihood control for deterministic dynamics is more stringent. We also discuss connections with other methods such as score-based diffusion models, stochastic localization processes, probabilistic denoising techniques, and rectifying flows. In addition, we demonstrate that stochastic interpolants recover the Schr\"odinger bridge between the two target densities when explicitly optimizing over the interpolant. Finally, algorithmic aspects are discussed and the approach is illustrated on numerical examples.
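
In its simplest linear form (a special case of the framework, written in our notation), a stochastic interpolant bridging densities $\rho_0$ and $\rho_1$ is

$$ x_t \;=\; \alpha(t)\, x_0 \;+\; \beta(t)\, x_1 \;+\; \gamma(t)\, z, \qquad x_0 \sim \rho_0,\; x_1 \sim \rho_1,\; z \sim \mathcal{N}(0, I), $$

with boundary conditions $\alpha(0) = \beta(1) = 1$, $\alpha(1) = \beta(0) = 0$, and $\gamma(0) = \gamma(1) = 0$, so that the two prescribed densities are reached exactly at the endpoints while the latent variable $z$ shapes the bridge in between.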

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding. (arXiv:2303.12513v2 [cs.CV] UPDATED)

Authors: Morris Alper, Michael Fiman, Hadar Averbuch-Elor

Most humans use visual imagination to understand and reason about language, but models such as BERT reason about language using knowledge acquired during text-only pretraining. In this work, we investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning, focusing primarily on zero-shot probing methods. We propose a suite of visual language understanding (VLU) tasks for probing the visual reasoning abilities of text encoder models, as well as various non-visual natural language understanding (NLU) tasks for comparison. We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks without needing a prediction head such as the masked language modelling head of models like BERT. We show that SOTA multimodally trained text encoders outperform unimodally trained text encoders on the VLU tasks while underperforming them on the NLU tasks, lending new context to previously mixed results regarding the NLU capabilities of multimodal models. We conclude that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks that require implicit visual reasoning. Our findings bear importance in the broader context of multimodal learning, providing principled guidelines for the choice of text encoders used in such contexts.

The Exact Sample Complexity Gain from Invariances for Kernel Regression. (arXiv:2303.14269v2 [cs.LG] UPDATED)

Authors: Behrooz Tahmasebi, Stefanie Jegelka

In practice, encoding invariances into models improves sample complexity. In this work, we study this phenomenon from a theoretical perspective. In particular, we provide minimax optimal rates for kernel ridge regression on compact manifolds, with a target function that is invariant to a group action on the manifold. Our results hold for any smooth compact Lie group action, even groups of positive dimension. For a finite group, the gain effectively multiplies the number of samples by the group size. For groups of positive dimension, the gain appears as a reduction in the manifold's dimension, in addition to a factor proportional to the volume of the quotient space. Our proof takes the viewpoint of differential geometry, in contrast to the more common strategy of using invariant polynomials. This new geometric viewpoint on learning with invariances may be of independent interest.
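
Schematically, and as our paraphrase of the stated gains rather than the paper's exact theorem, the rates for a target of Sobolev smoothness $s$ on a $d$-dimensional manifold would read:

    \underbrace{n^{-\frac{2s}{2s+d}}}_{\text{no invariance}}
    \;\longrightarrow\;
    \underbrace{(|G|\,n)^{-\frac{2s}{2s+d}}}_{\text{finite group } G}
    \;\longrightarrow\;
    \underbrace{n^{-\frac{2s}{2s+(d-\dim G)}}}_{\text{Lie group (up to a quotient-volume factor)}}

so a finite group acts like extra samples, while a positive-dimensional group lowers the effective dimension of the problem.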

Bi-directional Training for Composed Image Retrieval via Text Prompt Learning. (arXiv:2303.16604v2 [cs.CV] UPDATED)

Authors: Zheyuan Liu, Weixuan Sun, Yicong Hong, Damien Teney, Stephen Gould

Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text describing the desired changes. Existing approaches to solving this challenging task learn a mapping from the (reference image, modification text)-pair to an image embedding that is then matched against a large image corpus. One area that has not yet been explored is the reverse direction, which asks the question, what reference image when modified as described by the text would produce the given target image? In this work we propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures with minimal changes, improving the model's performance. To encode the bi-directional query we prepend a learnable token to the modification text that designates the direction of the query and then finetune the parameters of the text embedding module. We make no other changes to the network architecture. Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model that itself already achieves competitive performance. Our code is released at https://github.com/Cuberick-Orion/Bi-Blip4CIR.

Unified Out-Of-Distribution Detection: A Model-Specific Perspective. (arXiv:2304.06813v2 [cs.LG] UPDATED)

Authors: Reza Averly, Wei-Lun Chao

Out-of-distribution (OOD) detection aims to identify test examples that do not belong to the training distribution and are thus unlikely to be predicted reliably. Despite a plethora of existing works, most of them focused only on the scenario where OOD examples come from semantic shift (e.g., unseen categories), ignoring other possible causes (e.g., covariate shift). In this paper, we present a novel, unifying framework to study OOD detection in a broader scope. Instead of detecting OOD examples from a particular cause, we propose to detect examples that a deployed machine learning model (e.g., an image classifier) is unable to predict correctly. That is, whether a test example should be detected and rejected or not is ``model-specific''. We show that this framework unifies the detection of OOD examples caused by semantic shift and covariate shift, and closely addresses the concern of applying a machine learning model to uncontrolled environments. We provide an extensive analysis that involves a variety of models (e.g., different architectures and training strategies), sources of OOD examples, and OOD detection approaches, and reveal several insights into improving and understanding OOD detection in uncontrolled environments.

An adaptive safety layer with hard constraints for safe reinforcement learning in multi-energy management systems. (arXiv:2304.08897v3 [eess.SY] UPDATED)

Authors: Glenn Ceusters, Muhammad Andy Putratama, Rüdiger Franke, Ann Nowé, Maarten Messagie

Safe reinforcement learning (RL) with hard constraint guarantees is a promising optimal control direction for multi-energy management systems. It requires only the environment-specific constraint functions themselves a priori, not a complete model. Project-specific upfront and ongoing engineering effort is therefore reduced, better representations of the underlying system dynamics can still be learnt, and modelling bias is kept to a minimum. However, even the constraint functions alone are not always trivial to accurately provide in advance, leading to potentially unsafe behaviour. In this paper, we present two novel advancements: (I) combining the OptLayer and SafeFallback methods, named OptLayerPolicy, to increase the initial utility while keeping high sample efficiency and the possibility to formulate equality constraints; (II) introducing self-improving hard constraints, to increase the accuracy of the constraint functions as more and new data becomes available, so that better policies can be learnt. Both advancements keep the constraint formulation decoupled from the RL formulation, so new (presumably better) RL algorithms can act as drop-in replacements. We show that, in a simulated multi-energy system case study, the initial utility is increased to 92.4% (OptLayerPolicy) compared to 86.1% (OptLayer) and that the policy after training is increased to 104.9% (GreyOptLayerPolicy) compared to 103.4% (OptLayer) - all relative to a vanilla RL benchmark. Although introducing surrogate functions into the optimisation problem requires special attention, we conclude that the newly presented GreyOptLayerPolicy method is the most advantageous.

Bridging RL Theory and Practice with the Effective Horizon. (arXiv:2304.09853v2 [cs.LG] UPDATED)

Authors: Cassidy Laidlaw, Stuart Russell, Anca Dragan

Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We choose to focus on deterministic environments because they share many interesting properties of stochastic environments, but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e. when it is optimal to be greedy on the random policy's Q function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon
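
The "greedy on the random policy's Q-function" property is easy to check in a small tabular MDP. The sketch below is our own illustration (a random toy MDP, not the BRIDGE dataset): it computes random-policy and optimal Q-values by dynamic programming and tests whether their greedy actions agree.

    import numpy as np

    def q_values(P, R, gamma, steps, greedy):
        # P: (S, A, S) transition tensor, R: (S, A) rewards.
        # greedy=False evaluates the uniformly random policy;
        # greedy=True runs value iteration for the optimal policy.
        S, A = R.shape
        V = np.zeros(S)
        for _ in range(steps):
            Q = R + gamma * P @ V        # shape (S, A)
            V = Q.max(axis=1) if greedy else Q.mean(axis=1)
        return R + gamma * P @ V

    rng = np.random.default_rng(0)
    S, A, gamma = 5, 3, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))   # a random toy MDP
    R = rng.uniform(size=(S, A))
    Q_rand = q_values(P, R, gamma, 200, greedy=False)
    Q_opt = q_values(P, R, gamma, 200, greedy=True)
    # If the greedy actions agree in every state, the paper's heuristic
    # predicts that deep RL is likely to succeed on this MDP.
    print(np.array_equal(Q_rand.argmax(axis=1), Q_opt.argmax(axis=1)))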

A Closer Look at Reward Decomposition for High-Level Robotic Explanations. (arXiv:2304.12958v2 [cs.LG] UPDATED)

Authors: Wenhao Lu, Xufeng Zhao, Sven Magg, Martin Gromniak, Mengdi Li, Stefan Wermter

Explaining the behaviour of intelligent agents learned by reinforcement learning (RL) to humans is challenging yet crucial due to their incomprehensible proprioceptive states, variational intermediate goals, and resultant unpredictability. Moreover, one-step explanations for RL agents can be ambiguous as they fail to account for the agent's future behaviour at each transition, adding to the complexity of explaining robot actions. By leveraging abstracted actions that map to task-specific primitives, we avoid explanations at the movement level. To further improve the transparency and explainability of robotic systems, we propose an explainable Q-Map learning framework that combines reward decomposition (RD) with abstracted action spaces, allowing for non-ambiguous and high-level explanations based on object properties in the task. We demonstrate the effectiveness of our framework through quantitative and qualitative analysis of two robotic scenarios, showcasing visual and textual explanations, derived from the output artefacts of RD, that are easy for humans to comprehend. Additionally, we demonstrate the versatility of integrating these artefacts with large language models (LLMs) for reasoning and interactive querying.

ZRG: A Dataset for Multimodal 3D Residential Rooftop Understanding. (arXiv:2304.13219v2 [cs.CV] UPDATED)

Authors: Isaac Corley, Jonathan Lwowski, Peyman Najafirad

A crucial part of any home is the roof over our heads to protect us from the elements. In this paper we present the Zeitview Rooftop Geometry (ZRG) dataset for residential rooftop understanding. ZRG is a large-scale residential rooftop dataset of over 20k properties collected through roof inspections from across the U.S. and contains multiple modalities including high resolution aerial orthomosaics, digital surface models (DSM), colored point clouds, and 3D roof wireframe annotations. We provide an in-depth analysis and perform several experimental baselines including roof outline extraction, monocular height estimation, and planar roof structure extraction, to illustrate a few of the numerous potential applications unlocked by this dataset.

Calibrated Explanations: with Uncertainty Information and Counterfactuals. (arXiv:2305.02305v3 [cs.AI] UPDATED)

Authors: Helena Lofstrom, Tuwe Lofstrom, Ulf Johansson, Cecilia Sonstrod

While local explanations for AI models can offer insights into individual predictions, such as feature importance, they are plagued by issues like instability. The unreliability of feature weights, often skewed due to poorly calibrated ML models, deepens these challenges. Moreover, the critical aspect of feature importance uncertainty remains mostly unaddressed in Explainable AI (XAI). The novel feature importance explanation method presented in this paper, called Calibrated Explanations (CE), is designed to tackle these issues head-on. Built on the foundation of Venn-Abers, CE not only calibrates the underlying model but also delivers reliable feature importance explanations with an exact definition of the feature weights. CE goes beyond conventional solutions by addressing output uncertainty. It accomplishes this by providing uncertainty quantification for both feature weights and the model's probability estimates. Additionally, CE is model-agnostic, featuring easily comprehensible conditional rules and the ability to generate counterfactual explanations with embedded uncertainty quantification. Results from an evaluation with 25 benchmark datasets underscore the efficacy of CE, establishing it as a fast, reliable, stable, and robust solution.

Approximating CKY with Transformers. (arXiv:2305.02386v2 [cs.CL] UPDATED)

Authors: Ghazal Khalighinejad, Ollie Liu, Sam Wiseman

We investigate the ability of transformer models to approximate the CKY algorithm, using them to directly predict a sentence's parse and thus avoid the CKY algorithm's cubic dependence on sentence length. We find that on standard constituency parsing benchmarks this approach achieves competitive or better performance than comparable parsers that make use of CKY, while being faster. We also evaluate the viability of this approach for parsing under \textit{random} PCFGs. Here we find that performance declines as the grammar becomes more ambiguous, suggesting that the transformer is not fully capturing the CKY computation. However, we also find that incorporating additional inductive bias is helpful, and we propose a novel approach that makes use of gradients with respect to chart representations in predicting the parse, in analogy with the CKY algorithm being a subgradient of a partition function variant with respect to the chart.

Improving Code Example Recommendations on Informal Documentation Using BERT and Query-Aware LSH: A Comparative Study. (arXiv:2305.03017v4 [cs.SE] UPDATED)

Authors: Sajjad Rahmani, AmirHossein Naghshzan, Latifa Guerrouj

Our research investigates the recommendation of code examples to aid software developers, a practice that saves developers significant time by providing ready-to-use code snippets. The focus of our study is Stack Overflow, a commonly used resource for coding discussions and solutions, particularly in the context of the Java programming language. We applied BERT, a powerful Large Language Model (LLM) that enables us to transform code examples into numerical vectors by extracting their semantic information. Once these numerical representations are prepared, we identify Approximate Nearest Neighbors (ANN) using Locality-Sensitive Hashing (LSH). Our research employed two variants of LSH: Random Hyperplane-based LSH and Query-Aware LSH. We rigorously compared these two approaches across four metrics: HitRate, Mean Reciprocal Rank (MRR), Average Execution Time, and Relevance. Our study revealed that the Query-Aware (QA) approach showed superior performance over the Random Hyperplane-based (RH) method. Specifically, it exhibited a notable improvement of 20\% to 35\% in HitRate for query pairs compared to the RH approach. Furthermore, the QA approach proved significantly more time-efficient, with its speed in creating hashing tables and assigning data samples to buckets being at least four times faster. It can return code examples within milliseconds, whereas the RH approach typically requires several seconds to recommend code examples. Due to the superior performance of the QA approach, we tested it against PostFinder and FaCoY, the state-of-the-art baselines. Our QA method showed comparable efficiency, proving its potential for effective code recommendation.
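
For readers unfamiliar with the random hyperplane variant, here is a minimal numpy sketch (ours, not the paper's implementation; the embeddings are random stand-ins for BERT vectors): each hyperplane contributes one sign bit, and vectors with the same bit pattern fall into the same bucket.

    import numpy as np
    from collections import defaultdict

    def build_lsh_index(vectors, n_planes=16, seed=0):
        # Random hyperplane LSH: hash(v) is the sign pattern of v against
        # n_planes random hyperplanes; cosine-similar vectors tend to
        # share buckets.
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((n_planes, vectors.shape[1]))
        buckets = defaultdict(list)
        for i, v in enumerate(vectors):
            buckets[tuple((planes @ v) > 0)].append(i)
        return planes, buckets

    def query(q, planes, buckets):
        return buckets.get(tuple((planes @ q) > 0), [])

    embeddings = np.random.default_rng(1).standard_normal((1000, 64))
    planes, buckets = build_lsh_index(embeddings)
    candidates = query(embeddings[0], planes, buckets)   # ANN candidate set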

HACMan: Learning Hybrid Actor-Critic Maps for 6D Non-Prehensile Manipulation. (arXiv:2305.03942v4 [cs.RO] UPDATED)

Authors: Wenxuan Zhou, Bowen Jiang, Fan Yang, Chris Paxton, David Held

Manipulating objects without grasping them is an essential component of human dexterity, referred to as non-prehensile manipulation. Non-prehensile manipulation may enable more complex interactions with the objects, but also presents challenges in reasoning about gripper-object interactions. In this work, we introduce Hybrid Actor-Critic Maps for Manipulation (HACMan), a reinforcement learning approach for 6D non-prehensile manipulation of objects using point cloud observations. HACMan proposes a temporally-abstracted and spatially-grounded object-centric action representation that consists of selecting a contact location from the object point cloud and a set of motion parameters describing how the robot will move after making contact. We modify an existing off-policy RL algorithm to learn in this hybrid discrete-continuous action representation. We evaluate HACMan on a 6D object pose alignment task in both simulation and in the real world. On the hardest version of our task, with randomized initial poses, randomized 6D goals, and diverse object categories, our policy demonstrates strong generalization to unseen object categories without a performance drop, achieving an 89% success rate on unseen objects in simulation and 50% success rate with zero-shot transfer in the real world. Compared to alternative action representations, HACMan achieves a success rate more than three times higher than the best baseline. With zero-shot sim2real transfer, our policy can successfully manipulate unseen objects in the real world for challenging non-planar goals, using dynamic and contact-rich non-prehensile skills. Videos can be found on the project website: https://hacman-2023.github.io.

Recommender Systems with Generative Retrieval. (arXiv:2305.05065v3 [cs.IR] UPDATED)

Authors: Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, Maheswaran Sathiamoorthy

Modern recommender systems perform large-scale retrieval by first embedding queries and item candidates in the same unified space, followed by approximate nearest neighbor search to select top candidates given a query embedding. In this paper, we propose a novel generative retrieval approach, where the retrieval model autoregressively decodes the identifiers of the target candidates. To that end, we create a semantically meaningful tuple of codewords to serve as a Semantic ID for each item. Given Semantic IDs for items in a user session, a Transformer-based sequence-to-sequence model is trained to predict the Semantic ID of the next item that the user will interact with. To the best of our knowledge, this is the first Semantic ID-based generative model for recommendation tasks. We show that recommender systems trained with the proposed paradigm significantly outperform the current SOTA models on various datasets. In addition, we show that incorporating Semantic IDs into the sequence-to-sequence model enhances its ability to generalize, as evidenced by the improved retrieval performance observed for items with no prior interaction history.
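
Residual quantization is one common way to obtain such codeword tuples; the sketch below is our illustration of that generic idea (the paper describes its own Semantic ID construction), assigning each item a 3-token ID by repeatedly clustering residuals with k-means.

    import numpy as np
    from sklearn.cluster import KMeans

    def semantic_ids(embeddings, levels=3, codebook_size=8, seed=0):
        # Residual quantization: at each level, cluster the residuals and
        # record the assigned cluster index as the next codeword, so each
        # item ends up with a `levels`-tuple of discrete tokens.
        residual = embeddings.copy()
        codes = []
        for _ in range(levels):
            km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed)
            assign = km.fit_predict(residual)
            codes.append(assign)
            residual = residual - km.cluster_centers_[assign]
        return np.stack(codes, axis=1)   # shape (n_items, levels)

    items = np.random.default_rng(0).standard_normal((500, 32))  # stand-in item embeddings
    ids = semantic_ids(items)
    print(ids[0])   # e.g. [3 0 5] -- the item's Semantic ID token sequence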

Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation. (arXiv:2305.05803v4 [cs.CV] UPDATED)

Authors: Tianle Chen, Zheda Mai, Ruiwen Li, Wei-lun Chao

Weakly supervised semantic segmentation (WSSS) aims to bypass the need for laborious pixel-level annotation by using only image-level annotation. Most existing methods rely on Class Activation Maps (CAM) to derive pixel-level pseudo-labels and use them to train a fully supervised semantic segmentation model. Although these pseudo-labels are class-aware, indicating the coarse regions for particular classes, they are not object-aware and fail to delineate accurate object boundaries. To address this, we introduce a simple yet effective method harnessing the Segment Anything Model (SAM), a class-agnostic foundation model capable of producing fine-grained instance masks of objects, parts, and subparts. We use CAM pseudo-labels as cues to select and combine SAM masks, resulting in high-quality pseudo-labels that are both class-aware and object-aware. Our approach is highly versatile and can be easily integrated into existing WSSS methods without any modification. Despite its simplicity, our approach shows consistent gain over the state-of-the-art WSSS methods on both PASCAL VOC and MS-COCO datasets.

RECKONING: Reasoning through Dynamic Knowledge Encoding. (arXiv:2305.06349v3 [cs.CL] UPDATED)

Authors: Zeming Chen, Gail Weiss, Eric Mitchell, Asli Celikyilmaz, Antoine Bosselut

Recent studies on transformer-based language models show that they can answer questions by reasoning over knowledge provided as part of the context (i.e., in-context reasoning). However, since the available knowledge is often not filtered for a particular question, in-context reasoning can be sensitive to distractor facts, additional content that is irrelevant to a question but that may be relevant for a different question (i.e., not necessarily random noise). In these situations, the model fails to distinguish the knowledge that is necessary to answer the question, leading to spurious reasoning and degraded performance. This reasoning failure contrasts with the model's apparent ability to distinguish its contextual knowledge from all the knowledge it has memorized during pre-training. Following this observation, we propose teaching the model to reason more robustly by folding the provided contextual knowledge into the model's parameters before presenting it with a question. Our method, RECKONING, is a bi-level learning algorithm that teaches language models to reason by updating their parametric knowledge through back-propagation, allowing them to then answer questions using the updated parameters. During training, the inner loop rapidly adapts a copy of the model weights to encode contextual knowledge into its parameters. In the outer loop, the model learns to use the updated weights to reproduce and answer reasoning questions about the memorized knowledge. Our experiments on two multi-hop reasoning datasets show that RECKONING's performance improves over the in-context reasoning baseline (by up to 4.5%). We also find that compared to in-context reasoning, RECKONING generalizes better to longer reasoning chains unseen during training, is more robust to distractors in the context, and is more computationally efficient when multiple questions are asked about the same knowledge.
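
The bi-level update is easier to see on a toy model. The PyTorch sketch below is ours and heavily simplified: a linear layer stands in for the language model and both losses are placeholders; it only shows the inner knowledge-encoding step and the outer question-answering step.

    import torch
    import torch.nn as nn
    from torch.func import functional_call

    model = nn.Linear(16, 16)       # stand-in for a language model
    outer_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    inner_lr = 0.1

    knowledge = torch.randn(4, 16)                       # contextual knowledge
    question, answer = torch.randn(2, 16), torch.randn(2, 16)

    # Inner loop: adapt a functional copy of the weights to encode the
    # knowledge (placeholder loss; the paper uses a language-modelling loss).
    params = dict(model.named_parameters())
    inner_loss = functional_call(model, params, (knowledge,)).pow(2).mean()
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    fast = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}

    # Outer loop: answer the question with the updated weights, then
    # back-propagate through the inner update into the original weights.
    outer_loss = (functional_call(model, fast, (question,)) - answer).pow(2).mean()
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()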

NODE-ImgNet: a PDE-informed effective and robust model for image denoising. (arXiv:2305.11049v2 [eess.IV] UPDATED)

Authors: Xinheng Xie, Yue Wu, Hao Ni, Cuiyu He

Inspired by the traditional partial differential equation (PDE) approach for image denoising, we propose a novel neural network architecture, referred to as NODE-ImgNet, that combines neural ordinary differential equations (NODEs) with convolutional neural network (CNN) blocks. NODE-ImgNet is intrinsically a PDE model, where the dynamic system is learned implicitly without the explicit specification of the PDE. This naturally circumvents the typical issues associated with introducing artifacts during the learning process. By invoking such a NODE structure, which can also be viewed as a continuous variant of a residual network (ResNet) and inherits its advantage in image denoising, our model achieves enhanced accuracy and parameter efficiency. In particular, our model exhibits consistent effectiveness in different scenarios, including denoising gray and color images perturbed by Gaussian noise, as well as real-noisy images, and demonstrates superiority in learning from small image datasets.

Moment Matching Denoising Gibbs Sampling. (arXiv:2305.11650v2 [stat.ML] UPDATED)

Authors: Mingtian Zhang, Alex Hawkins-Hooker, Brooks Paige, David Barber

Energy-Based Models (EBMs) offer a versatile framework for modeling complex data distributions. However, training and sampling from EBMs continue to pose significant challenges. The widely-used Denoising Score Matching (DSM) method for scalable EBM training suffers from inconsistency issues, causing the energy model to learn a `noisy' data distribution. In this work, we propose an efficient sampling framework: (pseudo)-Gibbs sampling with moment matching, which enables effective sampling from the underlying clean model when given a `noisy' model that has been well-trained via DSM. We explore the benefits of our approach compared to related methods and demonstrate how to scale the method to high-dimensional datasets.

Learning Large Graph Property Prediction via Graph Segment Training. (arXiv:2305.12322v3 [cs.LG] UPDATED)

Authors: Kaidi Cao, Phitchaya Mangpo Phothilimthana, Sami Abu-El-Haija, Dustin Zelle, Yanqi Zhou, Charith Mendis, Jure Leskovec, Bryan Perozzi

Learning to predict properties of large graphs is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST), a general framework that utilizes a divide-and-conquer approach to allow learning large graph property prediction with a constant memory footprint. GST first divides a large graph into segments and then backpropagates through only a few segments sampled per training iteration. We refine the GST paradigm by introducing a historical embedding table to efficiently obtain embeddings for segments not sampled for backpropagation. To mitigate the staleness of historical embeddings, we design two novel techniques. First, we finetune the prediction head to fix the input distribution shift. Second, we introduce Stale Embedding Dropout to drop some stale embeddings during training to reduce bias. We evaluate our complete method GST-EFD (with all the techniques together) on two large graph property prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST-EFD is both memory-efficient and fast, while offering a slight boost on test accuracy over a typical full graph training regime.
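
The core mechanics fit in a few lines; here is a toy PyTorch sketch (ours, not the authors' code) in which one sampled segment is encoded with gradients per step while the remaining segments come from a detached historical embedding table.

    import torch
    import torch.nn as nn

    n_segments, feat_dim, emb_dim = 8, 32, 16
    encoder = nn.Linear(feat_dim, emb_dim)        # stand-in segment encoder
    head = nn.Linear(n_segments * emb_dim, 1)     # graph-level prediction head
    history = torch.zeros(n_segments, emb_dim)    # historical embedding table
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()))

    segments = torch.randn(n_segments, feat_dim)  # toy per-segment features
    target = torch.tensor([[1.0]])

    for step in range(100):
        k = torch.randint(n_segments, (1,)).item()  # segment sampled this step
        emb = history.clone()                       # stale embeddings, no grad
        emb[k] = encoder(segments[k])               # fresh embedding, with grad
        loss = (head(emb.reshape(1, -1)) - target).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        history[k] = emb[k].detach()                # refresh the table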

Continually Improving Extractive QA via Human Feedback. (arXiv:2305.12473v2 [cs.CL] UPDATED)

Authors: Ge Gao, Hung-Ting Chen, Yoav Artzi, Eunsol Choi

We study continually improving an extractive question answering (QA) system via human user feedback. We design and deploy an iterative approach, where information-seeking users ask questions, receive model-predicted answers, and provide feedback. We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time. Our experiments show effective improvement from user feedback of extractive QA models over time across different data regimes, including significant potential for domain adaptation.

DreamWaltz: Make a Scene with Complex 3D Animatable Avatars. (arXiv:2305.12529v3 [cs.CV] UPDATED)

Authors: Yukun Huang, Jianan Wang, Ailing Zeng, He Cao, Xianbiao Qi, Yukai Shi, Zheng-Jun Zha, Lei Zhang

We present DreamWaltz, a novel framework for generating and animating complex 3D avatars given text guidance and parametric human body prior. While recent methods have shown encouraging results for text-to-3D generation of common objects, creating high-quality and animatable 3D avatars remains challenging. To create high-quality 3D avatars, DreamWaltz proposes 3D-consistent occlusion-aware Score Distillation Sampling (SDS) to optimize implicit neural representations with canonical poses. It provides view-aligned supervision via 3D-aware skeleton conditioning which enables complex avatar generation without artifacts and multiple faces. For animation, our method learns an animatable 3D avatar representation from the abundant image priors of a diffusion model conditioned on various poses, which can animate complex non-rigged avatars under arbitrary poses without retraining. Extensive evaluations demonstrate that DreamWaltz is an effective and robust approach for creating 3D avatars that can take on complex shapes and appearances as well as novel poses for animation. The proposed framework further enables the creation of complex scenes with diverse compositions, including avatar-avatar, avatar-object and avatar-scene interactions. See https://dreamwaltz3d.github.io/ for more vivid 3D avatar and animation results.

Tight conditions for when the NTK approximation is valid. (arXiv:2305.13141v3 [cs.LG] UPDATED)

Authors: Enric Boix-Adsera, Etai Littwin

We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss. In the lazy training setting of Chizat et al. 2019, we show that rescaling the model by a factor of $\alpha = O(T)$ suffices for the NTK approximation to be valid until training time $T$. Our bound is tight and improves on the previous bound of Chizat et al. 2019, which required a larger rescaling factor of $\alpha = O(T^2)$.

Deep Learning with Kernels through RKHM and the Perron-Frobenius Operator. (arXiv:2305.13588v2 [stat.ML] UPDATED)

Authors: Yuka Hashimoto, Masahiro Ikeda, Hachem Kadri

Reproducing kernel Hilbert $C^*$-module (RKHM) is a generalization of reproducing kernel Hilbert space (RKHS) by means of $C^*$-algebra, and the Perron-Frobenius operator is a linear operator related to the composition of functions. Combining these two concepts, we present deep RKHM, a deep learning framework for kernel methods. We derive a new Rademacher generalization bound in this setting and provide a theoretical interpretation of benign overfitting by means of Perron-Frobenius operators. By virtue of $C^*$-algebra, the dependency of the bound on output dimension is milder than existing bounds. We show that $C^*$-algebra is a suitable tool for deep learning with kernels, enabling us to take advantage of the product structure of operators and to provide a clear connection with convolutional neural networks. Our theoretical analysis provides a new lens through which one can design and analyze deep kernel methods.

Data-Dependent Bounds for Online Portfolio Selection Without Lipschitzness and Smoothness. (arXiv:2305.13946v2 [cs.LG] UPDATED)

Authors: Chung-En Tsai, Ying-Ting Lin, Yen-Huan Li

This work introduces the first small-loss and gradual-variation regret bounds for online portfolio selection, marking the first instances of data-dependent bounds for online convex optimization with non-Lipschitz, non-smooth losses. The algorithms we propose exhibit sublinear regret rates in the worst cases and achieve logarithmic regrets when the data is "easy," with per-iteration time almost linear in the number of investment alternatives. The regret bounds are derived using novel smoothness characterizations of the logarithmic loss, a local norm-based analysis of follow-the-regularized-leader (FTRL) with self-concordant regularizers, which are not necessarily barriers, and an implicit variant of optimistic FTRL with the log-barrier.

FedZero: Leveraging Renewable Excess Energy in Federated Learning. (arXiv:2305.15092v2 [cs.LG] UPDATED)

Authors: Philipp Wiesner, Ramin Khalili, Dennis Grinwald, Pratik Agrawal, Lauritz Thamsen, Odej Kao

Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without data sharing. Yet, FL inevitably introduces inefficiencies compared to centralized model training, which will further increase the already high energy usage and associated carbon emissions of machine learning in the future. One idea to reduce FL's carbon footprint is to schedule training jobs based on the availability of renewable excess energy that can occur at certain times and places in the grid. However, in the presence of such volatile and unreliable resources, existing FL schedulers cannot always ensure fast, efficient, and fair trainings.

We propose FedZero, an FL system that operates exclusively on renewable excess energy and spare capacity of compute infrastructure to effectively reduce a training's operational carbon emissions to zero. Using energy and load forecasts, FedZero leverages the spatio-temporal availability of excess resources by selecting clients for fast convergence and fair participation. Our evaluation, based on real solar and load traces, shows that FedZero converges significantly faster than existing approaches under the mentioned constraints while consuming less energy. Furthermore, it is robust to forecasting errors and scalable to tens of thousands of clients.

C-STS: Conditional Semantic Textual Similarity. (arXiv:2305.15093v2 [cs.CL] UPDATED)

Authors: Ameet Deshpande, Carlos E. Jimenez, Howard Chen, Vishvak Murahari, Victoria Graf, Tanmay Rajpurohit, Ashwin Kalyan, Danqi Chen, Karthik Narasimhan

Semantic textual similarity (STS), a cornerstone task in NLP, measures the degree of similarity between a pair of sentences, and has broad application in fields such as information retrieval and natural language understanding. However, sentence similarity can be inherently ambiguous, depending on the specific aspect of interest. We resolve this ambiguity by proposing a novel task called Conditional STS (C-STS) which measures sentences' similarity conditioned on a feature described in natural language (hereon, condition). As an example, the similarity between the sentences "The NBA player shoots a three-pointer." and "A man throws a tennis ball into the air to serve." is higher for the condition "The motion of the ball" (both upward) and lower for "The size of the ball" (one large and one small). C-STS's advantages are two-fold: (1) it reduces the subjectivity and ambiguity of STS and (2) enables fine-grained language model evaluation through diverse natural language conditions. We put several state-of-the-art models to the test, and even those performing well on STS (e.g. SimCSE, Flan-T5, and GPT-4) find C-STS challenging; all with Spearman correlation scores below 50. To encourage a more comprehensive evaluation of semantic similarity and natural language understanding, we make nearly 19K C-STS examples and code available for others to train and test their models.

Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective. (arXiv:2305.15408v4 [cs.LG] UPDATED)

Authors: Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, Liwei Wang

Recent studies have discovered that Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs), particularly when dealing with complex tasks involving mathematics or reasoning. Despite the enormous empirical success, the underlying mechanisms behind CoT and how it unlocks the potential of LLMs remain elusive. In this paper, we take a first step towards theoretically answering these questions. Specifically, we examine the expressivity of LLMs with CoT in solving fundamental mathematical and decision-making problems. By using circuit complexity theory, we first give impossibility results showing that bounded-depth Transformers are unable to directly produce correct answers for basic arithmetic/equation tasks unless the model size grows super-polynomially with respect to the input length. In contrast, we then prove by construction that autoregressive Transformers of constant size suffice to solve both tasks by generating CoT derivations using a commonly used math language format. Moreover, we show LLMs with CoT can handle a general class of decision-making problems known as Dynamic Programming, thus justifying its power in tackling complex real-world tasks. Finally, an extensive set of experiments show that, while Transformers always fail to directly predict the answers, they can consistently learn to generate correct solutions step-by-step given sufficient CoT demonstrations.

Adaptive Data Analysis in a Balanced Adversarial Model. (arXiv:2305.15452v2 [cs.LG] UPDATED)

Authors: Kobbi Nissim, Uri Stemmer, Eliad Tsfadia

In adaptive data analysis, a mechanism gets $n$ i.i.d. samples from an unknown distribution $D$, and is required to provide accurate estimations to a sequence of adaptively chosen statistical queries with respect to $D$. Hardt and Ullman (FOCS 2014) and Steinke and Ullman (COLT 2015) showed that in general, it is computationally hard to answer more than $\Theta(n^2)$ adaptive queries, assuming the existence of one-way functions.

However, these negative results strongly rely on an adversarial model that significantly advantages the adversarial analyst over the mechanism, as the analyst, who chooses the adaptive queries, also chooses the underlying distribution $D$. This imbalance raises questions with respect to the applicability of the obtained hardness results -- an analyst who has complete knowledge of the underlying distribution $D$ would have little need, if at all, to issue statistical queries to a mechanism which only holds a finite number of samples from $D$.

We consider more restricted adversaries, called \emph{balanced}, where each such adversary consists of two separated algorithms: The \emph{sampler} who is the entity that chooses the distribution and provides the samples to the mechanism, and the \emph{analyst} who chooses the adaptive queries, but has no prior knowledge of the underlying distribution (and hence has no a priori advantage with respect to the mechanism). We improve the quality of previous lower bounds by revisiting them using an efficient \emph{balanced} adversary, under standard public-key cryptography assumptions. We show that these stronger hardness assumptions are unavoidable in the sense that any computationally bounded \emph{balanced} adversary that has the structure of all known attacks, implies the existence of public-key cryptography.

An $\varepsilon$-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond. (arXiv:2305.16041v2 [stat.ML] UPDATED)

Authors: Marc Jourdan, Rémy Degenne, Emilie Kaufmann

We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first instance of a Top Two algorithm analyzed for approximate best arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TC$\varepsilon$. First, we prove bounds on its expected sample complexity in the fixed confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TC$\varepsilon$ performs favorably compared to existing algorithms, in different settings.

Differentiable Clustering with Perturbed Spanning Forests. (arXiv:2305.16358v3 [cs.LG] UPDATED)

Authors: Lawrence Stewart (DI-ENS), Francis S Bach (DI-ENS), Felipe Llinares López, Quentin Berthet

We introduce a differentiable clustering method based on stochastic perturbations of minimum-weight spanning forests. This allows us to include clustering in end-to-end trainable pipelines, with efficient gradients. We show that our method performs well even in difficult settings, such as data sets with high noise and challenging geometries. We also formulate an ad hoc loss to efficiently learn from partial clustering data using this operation. We demonstrate its performance on several data sets for supervised and semi-supervised tasks.
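
A forward-pass sketch of the smoothing step (ours, in numpy): Kruskal's algorithm is stopped at k components, and the forest indicator is averaged over Gaussian perturbations of the edge weights. Only this Monte Carlo smoothing is shown; in the paper the gradients come from perturbation-based estimators of exactly such smoothed outputs.

    import numpy as np

    def min_spanning_forest(weights, k):
        # Kruskal's algorithm stopped at k components: returns a 0/1
        # matrix marking the edges kept in the minimum-weight forest.
        n = weights.shape[0]
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        edges = sorted((weights[i, j], i, j) for i in range(n) for j in range(i + 1, n))
        kept, components = np.zeros_like(weights), n
        for w, i, j in edges:
            ri, rj = find(i), find(j)
            if ri != rj and components > k:
                parent[ri] = rj
                kept[i, j] = kept[j, i] = 1.0
                components -= 1
        return kept

    def perturbed_forest(weights, k, n_samples=100, sigma=0.1, seed=0):
        # Monte Carlo smoothing: average the forest indicator over Gaussian
        # perturbations of the edge weights; the smoothed output is what
        # admits gradient estimators for end-to-end training.
        rng = np.random.default_rng(seed)
        out = np.zeros_like(weights)
        for _ in range(n_samples):
            noise = rng.standard_normal(weights.shape)
            noise = (noise + noise.T) / 2.0          # keep weights symmetric
            out += min_spanning_forest(weights + sigma * noise, k)
        return out / n_samples

    rng = np.random.default_rng(1)
    pts = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in (0.0, 5.0)])
    W = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    soft = perturbed_forest(W, k=2)   # soft edge memberships for 2 clusters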

PlaNeRF: SVD Unsupervised 3D Plane Regularization for NeRF Large-Scale Scene Reconstruction. (arXiv:2305.16914v4 [cs.CV] UPDATED)

Authors: Fusang Wang, Arnaud Louys, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou

Neural Radiance Fields (NeRF) enable 3D scene reconstruction from 2D images and camera poses for Novel View Synthesis (NVS). Although NeRF can produce photorealistic results, it often suffers from overfitting to training views, leading to poor geometry reconstruction, especially in low-texture areas. This limitation restricts many important applications which require accurate geometry, such as extrapolated NVS, HD mapping and scene editing. To address this limitation, we propose a new method to improve NeRF's 3D structure using only RGB images and semantic maps. Our approach introduces a novel plane regularization based on Singular Value Decomposition (SVD), that does not rely on any geometric prior. In addition, we leverage the Structural Similarity Index Measure (SSIM) in our loss design to properly initialize the volumetric representation of NeRF. Quantitative and qualitative results show that our method outperforms popular regularization approaches in accurate geometry reconstruction for large-scale outdoor scenes and achieves SoTA rendering quality on the KITTI-360 NVS benchmark.

Exact Generalization Guarantees for (Regularized) Wasserstein Distributionally Robust Models. (arXiv:2305.17076v2 [cs.LG] UPDATED)

Authors: Waïss Azizian (DAO), Franck Iutzeler (DAO), Jérôme Malick (DAO)

Wasserstein distributionally robust estimators have emerged as powerful models for prediction and decision-making under uncertainty. These estimators provide attractive generalization guarantees: the robust objective obtained from the training distribution is an exact upper bound on the true risk with high probability. However, existing guarantees either suffer from the curse of dimensionality, are restricted to specific settings, or lead to spurious error terms. In this paper, we show that these generalization guarantees actually hold on general classes of models, do not suffer from the curse of dimensionality, and can even cover distribution shifts at testing. We also prove that these results carry over to the newly-introduced regularized versions of Wasserstein distributionally robust problems.

NashFormer: Leveraging Local Nash Equilibria for Semantically Diverse Trajectory Prediction. (arXiv:2305.17600v2 [cs.LG] UPDATED)

Authors: Justin Lidard, Oswin So, Yanxia Zhang, Jonathan DeCastro, Xiongyi Cui, Xin Huang, Yen-Ling Kuo, John Leonard, Avinash Balachandran, Naomi Leonard, Guy Rosman

Interactions between road agents present a significant challenge in trajectory prediction, especially in cases involving multiple agents. Because existing diversity-aware predictors do not account for the interactive nature of multi-agent predictions, they may miss these important interaction outcomes. In this paper, we propose NashFormer, a framework for trajectory prediction that leverages game-theoretic inverse reinforcement learning to improve coverage of multi-modal predictions. We use a training-time game-theoretic analysis as an auxiliary loss resulting in improved coverage and accuracy without presuming a taxonomy of actions for the agents. We demonstrate our approach on the interactive split of the Waymo Open Motion Dataset, including four subsets involving scenarios with high interaction complexity. Experiment results show that our predictor produces accurate predictions while covering $33\%$ more potential interactions versus a baseline model.

Gaussian Process Probes (GPP) for Uncertainty-Aware Probing. (arXiv:2305.18213v2 [cs.LG] UPDATED)

Authors: Zi Wang, Alexander Ku, Jason Baldridge, Thomas L. Griffiths, Been Kim

Understanding which concepts models can and cannot represent has been fundamental to many tasks: from effective and responsible use of models to detecting out of distribution data. We introduce Gaussian process probes (GPP), a unified and simple framework for probing and measuring uncertainty about concepts represented by models. As a Bayesian extension of linear probing methods, GPP asks what kind of distribution over classifiers (of concepts) is induced by the model. This distribution can be used to measure both what the model represents and how confident the probe is about what the model represents. GPP can be applied to any pre-trained model with vector representations of inputs (e.g., activations). It does not require access to training data, gradients, or the architecture. We validate GPP on datasets containing both synthetic and real images. Our experiments show it can (1) probe a model's representations of concepts even with a very small number of examples, (2) accurately measure both epistemic uncertainty (how confident the probe is) and aleatory uncertainty (how fuzzy the concepts are to the model), and (3) detect out of distribution data using those uncertainty measures as well as classic methods do. By using Gaussian processes to expand what probing can offer, GPP provides a data-efficient, versatile and uncertainty-aware tool for understanding and evaluating the capabilities of machine learning models.
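
A stripped-down version of the idea can be tried with off-the-shelf tools; the sketch below is ours, using a linear-kernel GP classifier as a stand-in for the paper's probe, on random activations standing in for a frozen model's representations.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import DotProduct

    rng = np.random.default_rng(0)
    acts = rng.standard_normal((40, 64))        # frozen model activations
    labels = (acts[:, 0] > 0).astype(int)       # a toy "concept" to probe for

    # A GP with a linear (dot-product) kernel is a Bayesian take on linear
    # probing: it induces a distribution over linear classifiers of the
    # concept, and how far predict_proba sits from 0.5 reflects how
    # confident the probe is about what the model represents.
    probe = GaussianProcessClassifier(kernel=DotProduct()).fit(acts, labels)
    probs = probe.predict_proba(rng.standard_normal((5, 64)))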

Active Vision Reinforcement Learning under Limited Visual Observability. (arXiv:2306.00975v2 [cs.LG] UPDATED)

Authors: Jinghuan Shang, Michael S. Ryoo

In this work, we investigate Active Vision Reinforcement Learning (ActiveVision-RL), where an embodied agent simultaneously learns action policy for the task while also controlling its visual observations in partially observable environments. We denote the former as motor policy and the latter as sensory policy. For example, humans solve real world tasks by hand manipulation (motor policy) together with eye movements (sensory policy). ActiveVision-RL poses challenges on coordinating two policies given their mutual influence. We propose SUGARL, Sensorimotor Understanding Guided Active Reinforcement Learning, a framework that models motor and sensory policies separately, but jointly learns them with an intrinsic sensorimotor reward. This learnable reward, assigned by a sensorimotor reward module, incentivizes the sensory policy to select observations that are optimal for inferring its own motor action, inspired by the sensorimotor stage of human development. Through a series of experiments, we show the effectiveness of our method across a range of observability conditions and its adaptability to existing RL algorithms. The sensory policies learned through our method are observed to exhibit effective active vision strategies.

DiffLoad: Uncertainty Quantification in Load Forecasting with Diffusion Model. (arXiv:2306.01001v2 [cs.LG] UPDATED)

Authors: Zhixian Wang, Qingsong Wen, Chaoli Zhang, Liang Sun, Yi Wang

Electrical load forecasting plays a crucial role in decision-making for power systems, including unit commitment and economic dispatch. The integration of renewable energy sources and the occurrence of external events, such as the COVID-19 pandemic, have rapidly increased uncertainties in load forecasting. The uncertainties in load forecasting can be divided into two types: epistemic uncertainty and aleatoric uncertainty. Separating these types of uncertainties can help decision-makers better understand where, and to what extent, the uncertainty lies, thereby enhancing their confidence in subsequent decision-making. This paper proposes a diffusion-based Seq2Seq structure to estimate epistemic uncertainty and employs the robust additive Cauchy distribution to estimate aleatoric uncertainty. Our method not only ensures the accuracy of load forecasting but also separates the two types of uncertainty and is applicable to loads at different levels. The relevant code can be found at \url{https://anonymous.4open.science/r/DiffLoad-4714/}.
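
The robustness of the Cauchy head is easy to see in code; below is a small PyTorch sketch (ours, with a generic negative log-likelihood rather than the paper's exact loss) for a forecaster that emits a location and a log-scale.

    import torch

    def cauchy_nll(loc, log_scale, target):
        # Negative log-likelihood of a Cauchy(loc, scale) observation model.
        # Its heavy tails penalize outliers far less than a Gaussian NLL,
        # which is what makes the aleatoric-uncertainty estimate robust.
        dist = torch.distributions.Cauchy(loc, log_scale.exp())
        return -dist.log_prob(target).mean()

    loc = torch.zeros(8, requires_grad=True)        # forecaster's point output
    log_scale = torch.zeros(8, requires_grad=True)  # forecaster's scale output
    y = torch.tensor([0.1, -0.2, 0.0, 0.3, 8.0, 0.1, -0.1, 0.2])  # one outlier
    loss = cauchy_nll(loc, log_scale, y)
    loss.backward()   # gradients stay bounded even for the outlier at 8.0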

Transfer learning for atomistic simulations using GNNs and kernel mean embeddings. (arXiv:2306.01589v4 [cs.LG] UPDATED)

Authors: John Falk, Luigi Bonati, Pietro Novelli, Michele Parrinello, Massimiliano Pontil

Interatomic potentials learned using machine learning methods have been successfully applied to atomistic simulations. However, accurate models require large training datasets, while generating reference calculations is computationally demanding. To bypass this difficulty, we propose a transfer learning algorithm that leverages the ability of graph neural networks (GNNs) to represent chemical environments together with kernel mean embeddings. We extract a feature map from GNNs pre-trained on the OC20 dataset and use it to learn the potential energy surface from system-specific datasets of catalytic processes. Our method is further enhanced by incorporating into the kernel the chemical species information, resulting in improved performance and interpretability. We test our approach on a series of realistic datasets of increasing complexity, showing excellent generalization and transferability performance, and improving on methods that rely on GNNs or ridge regression alone, as well as similar fine-tuning approaches.

Evaluating Language Models for Mathematics through Interactions. (arXiv:2306.01694v2 [cs.LG] UPDATED)

Authors: Katherine M. Collins, Albert Q. Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B. Tenenbaum, William Hart, Timothy Gowers, Wenda Li, Adrian Weller, Mateja Jamnik

There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs, and is insufficient for making an informed decision about which LLMs can be sensibly used, and in which assistive settings. Static assessment fails to account for the essential interactive element in LLM deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants. Interactive evaluation is a promising way to navigate the capability of these models; humans should be aware of language models' algebraic fallibility and discern where they are appropriate to use.

GAD-NR: Graph Anomaly Detection via Neighborhood Reconstruction. (arXiv:2306.01951v5 [cs.LG] UPDATED)

Authors: Amit Roy, Juan Shu, Jia Li, Carl Yang, Olivier Elshocht, Jeroen Smeets, Pan Li

Graph Anomaly Detection (GAD) is a technique used to identify abnormal nodes within graphs, finding applications in network security, fraud detection, social media spam detection, and various other domains. A common method for GAD is Graph Auto-Encoders (GAEs), which encode graph data into node representations and identify anomalies by assessing the reconstruction quality of the graphs based on these representations. However, existing GAE models are primarily optimized for direct link reconstruction, resulting in nodes connected in the graph being clustered in the latent space. As a result, they excel at detecting cluster-type structural anomalies but struggle with more complex structural anomalies that do not conform to clusters. To address this limitation, we propose a novel solution called GAD-NR, a new variant of GAE that incorporates neighborhood reconstruction for graph anomaly detection. GAD-NR aims to reconstruct the entire neighborhood of a node, encompassing the local structure, self-attributes, and neighbor attributes, based on the corresponding node representation. By comparing the neighborhood reconstruction loss between anomalous nodes and normal nodes, GAD-NR can effectively detect any anomalies. Extensive experimentation conducted on six real-world datasets validates the effectiveness of GAD-NR, showcasing significant improvements (by up to 30% in AUC) over state-of-the-art competitors. The source code for GAD-NR is openly available. Importantly, the comparative analysis reveals that the existing methods perform well only in detecting one or two types of anomalies out of the three types studied. In contrast, GAD-NR excels at detecting all three types of anomalies across the datasets, demonstrating its comprehensive anomaly detection capabilities.

For SALE: State-Action Representation Learning for Deep Reinforcement Learning. (arXiv:2306.02451v2 [cs.LG] UPDATED)

Authors: Scott Fujimoto, Wei-Di Chang, Edward J. Smith, Shixiang Shane Gu, Doina Precup, David Meger

In the field of reinforcement learning (RL), representation learning is a proven tool for complex image-based tasks, but is often overlooked for environments with low-level states, such as physical control problems. This paper introduces SALE, a novel approach for learning embeddings that model the nuanced interaction between state and action, enabling effective representation learning from low-level states. We extensively study the design space of these embeddings and highlight important design considerations. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm, which significantly outperforms existing continuous control algorithms. On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively, and works in both the online and offline settings.

Finding Counterfactually Optimal Action Sequences in Continuous State Spaces. (arXiv:2306.03929v2 [cs.LG] UPDATED)

Authors: Stratis Tsirtsis, Manuel Gomez-Rodriguez

Whenever a clinician reflects on the efficacy of a sequence of treatment decisions for a patient, they may try to identify critical time steps where, had they made different decisions, the patient's health would have improved. While recent methods at the intersection of causal inference and reinforcement learning promise to aid human experts, like the clinician above, in retrospectively analyzing sequential decision-making processes, they have focused on environments with finitely many discrete states. However, in many practical applications, the state of the environment is inherently continuous in nature. In this paper, we aim to fill this gap. We start by formally characterizing a sequence of discrete actions and continuous states using finite horizon Markov decision processes and a broad class of bijective structural causal models. Building upon this characterization, we formalize the problem of finding counterfactually optimal action sequences and show that, in general, we cannot expect to solve it in polynomial time. Then, we develop a search method based on the $A^*$ algorithm that, under a natural form of Lipschitz continuity of the environment's dynamics, is guaranteed to return the optimal solution to the problem. Experiments on real clinical data show that our method is very efficient in practice, and it has the potential to offer interesting insights for sequential decision making tasks.

Guiding The Last Layer in Federated Learning with Pre-Trained Models. (arXiv:2306.03937v2 [cs.LG] UPDATED)

Authors: Gwen Legate, Nicolas Bernier, Lucas Caccia, Edouard Oyallon, Eugene Belilovsky

Federated Learning (FL) is an emerging paradigm that allows a model to be trained across a number of participants without sharing data. Recent works have begun to consider the effects of using pre-trained models as an initialization point for existing FL algorithms; however, these approaches ignore the vast body of efficient transfer learning literature from the centralized learning setting. Here we revisit the problem of FL from a pre-trained model considered in prior work and expand it to a set of computer vision transfer learning problems. We first observe that simply fitting a linear classification head can be efficient and effective in many cases. We then show that in the FL setting, fitting a classifier using the Nearest Class Means (NCM) can be done exactly and orders of magnitude more efficiently than existing proposals, while obtaining strong performance. Finally, we demonstrate that using a two-phase approach of obtaining the classifier and then fine-tuning the model can yield rapid convergence and improved generalization in the federated setting. We demonstrate our method's potential to reduce communication and compute costs while achieving better model performance.

Benchmarking Foundation Models with Language-Model-as-an-Examiner. (arXiv:2306.04181v2 [cs.CL] UPDATED)

Authors: Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, Lei Hou

Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we identify two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility as various LMs can be adopted as the examiner, and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise three strategies: (1) We instruct the LM examiner to generate questions across a multitude of domains to probe for breadth of knowledge, and to raise follow-up questions to engage in a more in-depth assessment. (2) Upon evaluation, the examiner combines both scoring and ranking measurements, providing a reliable result as it aligns closely with human annotations. (3) We additionally propose a decentralized Peer-examination method to address the biases in a single examiner. Our data and benchmarking results are available at: this http URL

RDumb: A simple approach that questions our progress in continual test-time adaptation. (arXiv:2306.05401v2 [cs.LG] UPDATED)

Authors: Ori Press, Steffen Schneider, Matthias Kümmerer, Matthias Bethge

Test-Time Adaptation (TTA) allows pre-trained models to be updated to changing data distributions at deployment time. While early work tested these algorithms for individual fixed distribution shifts, recent work proposed and applied methods for continual adaptation over long timescales. To examine the reported progress in the field, we propose the Continually Changing Corruptions (CCC) benchmark to measure the asymptotic performance of TTA techniques. We find that eventually all but one of the state-of-the-art methods collapse and perform worse than a non-adapting model, including models specifically proposed to be robust to performance collapse. In addition, we introduce a simple baseline, "RDumb", that periodically resets the model to its pretrained state. RDumb performs better than or on par with the previously proposed state-of-the-art in all considered benchmarks. Our results show that previous TTA approaches are neither effective at regularizing adaptation to avoid collapse nor able to outperform a simplistic resetting strategy.
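
Since the baseline is essentially "adapt, but reset periodically", it is easy to sketch. The wrapper below is an illustrative reconstruction rather than the authors' code; the reset interval and the choice of adaptation step (e.g., an entropy-minimization update) are assumptions.

```python
# Minimal sketch of an RDumb-style baseline: run any test-time adaptation
# step, but periodically restore the pretrained weights.
import copy

class ResettingAdapter:
    def __init__(self, model, adapt_step, reset_every=1000):
        # Snapshot of the pretrained weights used for every reset.
        self.initial_state = copy.deepcopy(model.state_dict())
        self.model = model
        self.adapt_step = adapt_step    # callable: (model, batch) -> predictions
        self.reset_every = reset_every  # in batches; the value is an assumption
        self.steps = 0

    def __call__(self, batch):
        if self.steps > 0 and self.steps % self.reset_every == 0:
            self.model.load_state_dict(self.initial_state)
        self.steps += 1
        return self.adapt_step(self.model, batch)
```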

Explainable Representation Learning of Small Quantum States. (arXiv:2306.05694v3 [quant-ph] UPDATED)

Authors: Felix Frohnert, Evert van Nieuwenburg

Unsupervised machine learning models build an internal representation of their training data without the need for explicit human guidance or feature engineering. This learned representation provides insights into which features of the data are relevant for the task at hand. In the context of quantum physics, training models to describe quantum states without human intervention offers a promising approach to gaining insight into how machines represent complex quantum states. The ability to interpret the learned representation may offer a new perspective on non-trivial features of quantum systems and their efficient representation. We train a generative model on two-qubit density matrices generated by a parameterized quantum circuit. In a series of computational experiments, we investigate the learned representation of the model and its internal understanding of the data. We observe that the model learns an interpretable representation which relates the quantum states to their underlying entanglement characteristics. In particular, our results demonstrate that the latent representation of the model is directly correlated with the entanglement measure concurrence. The insights from this study represent proof of concept towards interpretable machine learning of quantum states. Our approach offers insight into how machines learn to represent small-scale quantum systems autonomously.
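
The entanglement measure in question has a closed form, which makes the reported latent correlation easy to check against ground truth. As a reference point, here is Wootters' concurrence for a two-qubit density matrix in NumPy (the standard textbook formula; the Bell-state check is only a sanity test, not taken from the paper).

```python
# Wootters' concurrence of a two-qubit density matrix rho:
# C(rho) = max(0, l1 - l2 - l3 - l4), where l_i are the square roots of the
# eigenvalues of rho * rho_tilde in decreasing order.
import numpy as np

def concurrence(rho):
    sy = np.array([[0, -1j], [1j, 0]])
    yy = np.kron(sy, sy)
    rho_tilde = yy @ rho.conj() @ yy            # spin-flipped state
    evals = np.linalg.eigvals(rho @ rho_tilde)
    lam = np.sort(np.sqrt(np.abs(evals)))[::-1]
    return max(0.0, lam[0] - lam[1] - lam[2] - lam[3])

# Sanity check: a maximally entangled Bell state has concurrence 1.
bell = np.zeros((4, 4), dtype=complex)
bell[0, 0] = bell[0, 3] = bell[3, 0] = bell[3, 3] = 0.5
print(concurrence(bell))  # ~1.0
```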

Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model. (arXiv:2306.05720v2 [cs.CV] UPDATED)

Authors: Yida Chen, Fernanda Viégas, Martin Wattenberg

Latent diffusion models (LDMs) exhibit an impressive ability to produce realistic images, yet the inner workings of these models remain mysterious. Even when trained purely on images without explicit depth information, they typically output coherent pictures of 3D scenes. In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process, well before a human can easily make sense of the noisy images. Intervention experiments further indicate these representations play a causal role in image synthesis, and may be used for simple high-level editing of an LDM's output. Project page: https://yc015.github.io/scene-representation-diffusion-model/
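
A linear probe of this sort is straightforward to set up once activations are extracted. The sketch below assumes precomputed per-pixel activation/depth pairs; the file names and the choice of ridge regression are illustrative, and the paper's exact probe setup may differ.

```python
# Hypothetical linear-probe sketch: regress per-pixel depth from frozen
# intermediate LDM activations (feature extraction itself is assumed done).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Assumed precomputed arrays: activations gathered at one denoising step,
# shape (num_pixels, feature_dim), and matching depth targets, (num_pixels,).
acts = np.load("ldm_activations.npy")   # placeholder file names
depth = np.load("depth_targets.npy")

X_tr, X_te, y_tr, y_te = train_test_split(acts, depth,
                                          test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
# A high held-out R^2 indicates depth is linearly decodable from activations.
print("held-out R^2:", probe.score(X_te, y_te))
```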

FLSL: Feature-level Self-supervised Learning. (arXiv:2306.06203v4 [cs.LG] UPDATED)

Authors: Qing Su, Anton Netchaev, Hai Li, Shihao Ji

Current self-supervised learning (SSL) methods (e.g., SimCLR, DINO, VICReg, MOCOv3) focus primarily on instance-level representations and do not generalize well to dense prediction tasks, such as object detection and segmentation. Towards aligning SSL with dense predictions, this paper demonstrates for the first time the underlying mean-shift clustering process of Vision Transformers (ViT), which aligns well with natural image semantics (e.g., a world of objects and stuffs). By employing a transformer for joint embedding and clustering, we propose a two-level feature clustering SSL method, coined Feature-Level Self-supervised Learning (FLSL). We present the formal definition of the FLSL problem and construct the objectives from the mean-shift and k-means perspectives. We show that FLSL promotes remarkable semantic cluster representations and learns an embedding scheme amenable to intra-view and inter-view feature clustering. Experiments show that FLSL yields significant improvements in dense prediction tasks, achieving 44.9 (+2.8)% AP and 46.5% AP in object detection, as well as 40.8 (+2.3)% AP and 42.1% AP in instance segmentation on MS-COCO, using Mask R-CNN with ViT-S/16 and ViT-S/8 as backbones, respectively. FLSL consistently outperforms existing SSL methods across additional benchmarks, including UAV object detection on UAVDT and video instance segmentation on DAVIS 2017. We conclude by presenting visualizations and various ablation studies to better understand the success of FLSL. The source code is available at https://github.com/ISL-CV/FLSL.

MANER: Multi-Agent Neural Rearrangement Planning of Objects in Cluttered Environments. (arXiv:2306.06543v2 [cs.RO] UPDATED)

Authors: Vivek Gupta, Praphpreet Dhir, Jeegn Dani, Ahmed H. Qureshi

Object rearrangement is a fundamental problem in robotics with various practical applications ranging from managing warehouses to cleaning and organizing home kitchens. While existing research has primarily focused on single-agent solutions, real-world scenarios often require multiple robots to work together on rearrangement tasks. This paper proposes a comprehensive learning-based framework for multi-agent object rearrangement planning, addressing the challenges of task sequencing and path planning in complex environments. The proposed method iteratively selects objects, determines their relocation regions, and pairs them with available robots under kinematic feasibility and task reachability for execution to achieve the target arrangement. Our experiments on a diverse range of simulated and real-world environments demonstrate the effectiveness and robustness of the proposed framework. Furthermore, results indicate improved performance in terms of traversal time and success rate compared to baseline approaches.

VillanDiffusion: A Unified Backdoor Attack Framework for Diffusion Models. (arXiv:2306.06874v3 [cs.CR] UPDATED)

Authors: Sheng-Yen Chou, Pin-Yu Chen, Tsung-Yi Ho

Diffusion Models (DMs) are state-of-the-art generative models that learn a reversible corruption process from iterative noise addition and denoising. They are the backbone of many generative AI applications, such as text-to-image conditional generation. However, recent studies have shown that basic unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. This paper presents a unified backdoor attack framework (VillanDiffusion) to expand the current scope of backdoor analysis for DMs. Our framework covers mainstream unconditional and conditional DMs (denoising-based and score-based) and various training-free samplers for holistic evaluations. Experiments show that our unified framework facilitates the backdoor analysis of different DM configurations and provides new insights into caption-based backdoor attacks on DMs. Our code is available on GitHub: \url{https://github.com/IBM/villandiffusion}

Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models. (arXiv:2306.09869v3 [cs.CV] UPDATED)

Authors: Geon Yeong Park, Jeongsol Kim, Beomsu Kim, Sang Wan Lee, Jong Chul Ye

Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have raised the issue that generated images sometimes cannot capture the intended semantic contents of the text prompts, a phenomenon often called semantic misalignment. To address this, here we present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. Then, we obtain the gradient of the log posterior of context vectors, which can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Using extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing. Code: https://github.com/EnergyAttention/Energy-Based-CrossAttention.

Practical Equivariances via Relational Conditional Neural Processes. (arXiv:2306.10915v2 [stat.ML] UPDATED)

Authors: Daolang Huang, Manuel Haussmann, Ulpu Remes, ST John, Grégoire Clarté, Kevin Sebastian Luck, Samuel Kaski, Luigi Acerbi

Conditional Neural Processes (CNPs) are a class of metalearning models popular for combining the runtime efficiency of amortized inference with reliable uncertainty quantification. Many relevant machine learning tasks, such as in spatio-temporal modeling, Bayesian Optimization and continuous control, inherently contain equivariances -- for example to translation -- which the model can exploit for maximal performance. However, prior attempts to include equivariances in CNPs do not scale effectively beyond two input dimensions. In this work, we propose Relational Conditional Neural Processes (RCNPs), an effective approach to incorporate equivariances into any neural process model. Our proposed method extends the applicability and impact of equivariant neural processes to higher dimensions. We empirically demonstrate the competitive performance of RCNPs on a large array of tasks naturally containing equivariances.
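
To see how an equivariance can be baked into the encoder, consider translation: if each target point is described only through comparisons with the context points, shifting all inputs by the same vector leaves the encoding unchanged. The module below is a hedged sketch of such a relational encoding; the dimensions, the difference comparison, and the mean aggregation are illustrative choices, not the paper's exact architecture.

```python
# Sketch of a translation-equivariant relational encoding: targets are
# represented only via differences to context inputs.
import torch
import torch.nn as nn

class RelationalEncoder(nn.Module):
    def __init__(self, x_dim, y_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, xc, yc, xt):
        # xc: (N, x_dim) context inputs, yc: (N, y_dim) context outputs,
        # xt: (M, x_dim) target inputs.
        diff = xc[None, :, :] - xt[:, None, :]                    # (M, N, x_dim)
        pairs = torch.cat([diff, yc[None].expand(xt.shape[0], -1, -1)], dim=-1)
        # Mean over the context set gives a permutation-invariant target
        # representation; shifting every xc and xt by the same vector
        # leaves `diff`, and hence the encoding, unchanged.
        return self.net(pairs).mean(dim=1)                        # (M, hidden)
```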

CAT-Walk: Inductive Hypergraph Learning via Set Walks. (arXiv:2306.11147v2 [cs.LG] UPDATED)

Authors: Ali Behrouz, Farnoosh Hashemi, Sadaf Sadeghian, Margo Seltzer

Temporal hypergraphs provide a powerful paradigm for modeling time-dependent, higher-order interactions in complex systems. Representation learning for hypergraphs is essential for extracting patterns of the higher-order interactions that are critically important in real-world problems in social network analysis, neuroscience, finance, etc. However, existing methods are typically designed only for specific tasks or static hypergraphs. We present CAT-Walk, an inductive method that learns the underlying dynamic laws that govern the temporal and structural processes underlying a temporal hypergraph. CAT-Walk introduces a temporal, higher-order walk on hypergraphs, SetWalk, that extracts higher-order causal patterns. CAT-Walk uses a novel adaptive and permutation invariant pooling strategy, SetMixer, along with a set-based anonymization process that hides the identity of hyperedges. Finally, we present a simple yet effective neural network model to encode hyperedges. Our evaluation on 10 hypergraph benchmark datasets shows that CAT-Walk attains outstanding performance on temporal hyperedge prediction benchmarks in both inductive and transductive settings. It also shows competitive performance with state-of-the-art methods for node classification. (https://github.com/ubc-systopia/CATWalk)

Sparse Modular Activation for Efficient Sequence Modeling. (arXiv:2306.11197v4 [cs.LG] UPDATED)

Authors: Liliang Ren, Yang Liu, Shuohang Wang, Yichong Xu, Chenguang Zhu, ChengXiang Zhai

Recent hybrid models combining Linear State Space Models (SSMs) with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks. However, current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. To address this limitation, we introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely and dynamically activate sub-modules for sequence elements in a differentiable manner. Through allowing each element to skip non-activated sub-modules, SMA reduces computation and memory consumption of neural networks at both training and inference stages. To validate the effectiveness of SMA on sequence modeling, we design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned from an SSM. By constraining the GAU to only conduct local attention on the activated inputs, SeqBoat can achieve linear inference complexity with theoretically infinite attention span, and provide substantially better quality-efficiency trade-off than the chunking-based models. With experiments on a wide range of tasks, including long sequence modeling, speech classification and language modeling, SeqBoat brings new state-of-the-art results among hybrid models with linear complexity, and reveals the amount of attention needed for each task through the learned sparse activation patterns. Our code is publicly available at https://github.com/renll/SeqBoat.

One-shot Imitation Learning via Interaction Warping. (arXiv:2306.12392v2 [cs.RO] UPDATED)

Authors: Ondrej Biza, Skye Thompson, Kishore Reddy Pagidi, Abhinav Kumar, Elise van der Pol, Robin Walters, Thomas Kipf, Jan-Willem van de Meent, Lawson L.S. Wong, Robert Platt

Imitation learning of robot policies from few demonstrations is crucial in open-ended applications. We propose a new method, Interaction Warping, for learning SE(3) robotic manipulation policies from a single demonstration. We infer the 3D mesh of each object in the environment using shape warping, a technique for aligning point clouds across object instances. Then, we represent manipulation actions as keypoints on objects, which can be warped with the shape of the object. We show successful one-shot imitation learning on three simulated and real-world object re-arrangement tasks. We also demonstrate the ability of our method to predict object meshes and robot grasps in the wild.

Is RLHF More Difficult than Standard RL?. (arXiv:2306.14111v2 [cs.LG] UPDATED)

Authors: Yuanhao Wang, Qinghua Liu, Chi Jin

Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games with a restricted set of policies. The latter case can be further reduced to adversarial MDP when preferences only depend on the final state. We instantiate all reward-based RL subroutines by concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees when K-wise comparisons are available.

Reduce Computational Complexity for Convolutional Layers by Skipping Zeros. (arXiv:2306.15951v3 [cs.LG] UPDATED)

Authors: Zhiyi Zhang, Pengfei Zhang, Zhuopin Xu, Qi Wang

Convolutional neural networks require efficient algorithms to reduce their computational complexity, together with effective utilization of parallel processors for acceleration. Within convolutional layers, there are three types of operators: convolution used in forward propagation, and deconvolution and dilated-convolution utilized in backward propagation. During the execution of these operators, zeros are typically added to tensors, leading to redundant calculations and unnecessary strain on hardware. To circumvent these inefficiencies, we propose the C-K-S algorithm, accompanied by efficient GPU implementations. C-K-S trims filters to exclude zero-padding. For deconvolution and dilated-convolution, C-K-S transforms sparse tensors into dense tensors and standardizes the local computational rules to simplify the hardware control. The experimental results demonstrate that C-K-S offers good performance in terms of speed and convergence, surpassing the capabilities of PyTorch and cuDNN in certain scenarios.

Transport, Variational Inference and Diffusions: with Applications to Annealed Flows and Schr\"odinger Bridges. (arXiv:2307.01050v4 [stat.ML] UPDATED)

Authors: Francisco Vargas, Shreyas Padhy, Denis Blessing, Nikolas Nüsken

Connecting optimal transport and variational inference, we present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of the \emph{Controlled Monte Carlo Diffusion} sampler (CMCD) for Bayesian computation, a score-based annealing technique that crucially adapts both forward and backward dynamics in a diffusion model. On the way, we clarify the relationship between the EM-algorithm and iterative proportional fitting (IPF) for Schr{\"o}dinger bridges, deriving as well a regularised objective that bypasses the iterative bottleneck of standard IPF-updates. Finally, we show that CMCD has a strong foundation in the Jarzynski and Crooks identities from statistical physics, and that it convincingly outperforms competing approaches across a wide array of experiments.

Towards Symmetry-Aware Generation of Periodic Materials. (arXiv:2307.02707v2 [cs.LG] UPDATED)

Authors: Youzhi Luo, Chengkai Liu, Shuiwang Ji

We consider the problem of generating periodic materials with deep models. While symmetry-aware molecule generation has been studied extensively, periodic materials possess different symmetries, which have not been completely captured by existing methods. In this work, we propose SyMat, a novel material generation approach that can capture physical symmetries of periodic material structures. SyMat generates atom types and lattices of materials through generating atom type sets, lattice lengths and lattice angles with a variational auto-encoder model. In addition, SyMat employs a score-based diffusion model to generate atom coordinates of materials, in which a novel symmetry-aware probabilistic model is used in the coordinate diffusion process. We show that SyMat is theoretically invariant to all symmetry transformations on materials and demonstrate that SyMat achieves promising performance on random generation and property optimization tasks. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).

DEDUCE: Multi-head attention decoupled contrastive learning to discover cancer subtypes based on multi-omics data. (arXiv:2307.04075v2 [cs.LG] UPDATED)

Authors: Liangrui Pan, Dazhen Liu, Yutao Dou, Lian Wang, Zhichao Feng, Pengfei Rong, Liwen Xu, Shaoliang Peng

Due to the high heterogeneity and clinical characteristics of cancer, there are significant differences in multi-omics data and clinical features among subtypes of different cancers. Therefore, the identification and discovery of cancer subtypes are crucial for the diagnosis, treatment, and prognosis of cancer. In this study, we proposed a generalization framework based on attention mechanisms for unsupervised contrastive learning to analyze cancer multi-omics data for the identification and characterization of cancer subtypes. The framework contains a symmetric unsupervised multi-head attention encoder, which can deeply extract contextual features and long-range dependencies of multi-omics data, reducing the impact of noise in multi-omics data. Importantly, the proposed framework includes a decoupled contrastive learning model (DEDUCE) based on a multi-head attention mechanism that learns feature representations of multi-omics data, clusters them, and identifies cancer subtypes. This method clusters subtypes by calculating the similarity between samples in the feature space and sample space of multi-omics data. The basic idea is to decouple different attributes of multi-omics data features and learn them as contrasting terms. A contrastive loss function measures the difference between positive and negative examples, and minimizing this difference encourages the model to learn better feature representations. The DEDUCE model was evaluated in large-scale experiments on simulated multi-omics datasets, single-cell multi-omics datasets and cancer multi-omics datasets, outperforming 10 deep learning models. Finally, we used the DEDUCE model to reveal six cancer subtypes of AML, which we further characterized by analyzing GO functional enrichment, subtype-specific biological functions, and GSEA.

A Novel Site-Agnostic Multimodal Deep Learning Model to Identify Pro-Eating Disorder Content on Social Media. (arXiv:2307.06775v4 [cs.LG] UPDATED)

Authors: Jonathan Feldman

Over the last decade, there has been a vast increase in eating disorder diagnoses and eating disorder-attributed deaths, reaching their zenith during the Covid-19 pandemic. This immense growth derived in part from the stressors of the pandemic but also from increased exposure to social media, which is rife with content that promotes eating disorders. This study aimed to create a multimodal deep learning model that can determine if a given social media post promotes eating disorders based on a combination of visual and textual data. A labeled dataset of Tweets was collected from Twitter, recently rebranded as X, upon which twelve deep learning models were trained and evaluated. Based on model performance, the most effective deep learning model was the multimodal fusion of the RoBERTa natural language processing model and the MaxViT image classification model, attaining accuracy and F1 scores of 95.9% and 0.959, respectively. The RoBERTa and MaxViT fusion model, deployed to classify an unlabeled dataset of posts from the social media sites Tumblr and Reddit, generated results akin to those of previous research studies that did not employ artificial intelligence-based techniques, indicating that deep learning models can develop insights congruent to those of researchers. Additionally, the model was used to conduct a time-series analysis of yet unseen Tweets from eight Twitter hashtags, uncovering that, since 2014, the relative abundance of content that promotes eating disorders has decreased drastically within those communities. Despite this reduction, by 2018, content that promotes eating disorders had either stopped declining or begun to increase again on those hashtags.

Embodied Lifelong Learning for Task and Motion Planning. (arXiv:2307.06870v2 [cs.RO] UPDATED)

Authors: Jorge Mendez-Mendez, Leslie Pack Kaelbling, Tomás Lozano-Pérez

A robot deployed in a home over long stretches of time faces a true lifelong learning problem. As it seeks to provide assistance to its users, the robot should leverage any accumulated experience to improve its own knowledge and proficiency. We formalize this setting with a novel formulation of lifelong learning for task and motion planning (TAMP), which endows our learner with the compositionality of TAMP systems. Exploiting the modularity of TAMP, we develop a mixture of generative models that produces candidate continuous parameters for a planner. Whereas most existing lifelong learning approaches determine a priori how data is shared across various models, our approach learns shared and non-shared models and determines which to use online during planning based on auxiliary tasks that serve as a proxy for each model's understanding of a state. Our method exhibits substantial improvements (over time and compared to baselines) in planning success on 2D and BEHAVIOR domains.

Adaptive Linear Estimating Equations. (arXiv:2307.07320v2 [math.ST] UPDATED)

Authors: Mufang Ying, Koulik Khamaru, Cun-Hui Zhang

Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such data collection mechanisms often introduce complexities to the statistical inference procedure. For instance, the ordinary least squares (OLS) estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. In this paper, we propose a general method for constructing a debiased estimator which remedies this issue. It makes use of the idea of adaptive linear estimating equations, and we establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. A salient feature of our estimator is that, in the context of multi-armed bandits, it retains the non-asymptotic performance of the least squares estimator while attaining asymptotic normality. Consequently, this work helps connect two fruitful paradigms of adaptive inference: a) non-asymptotic inference using concentration inequalities and b) asymptotic inference via asymptotic normality.

Enabling Efficient, Reliable Real-World Reinforcement Learning with Approximate Physics-Based Models. (arXiv:2307.08168v2 [cs.LG] UPDATED)

Authors: Tyler Westenbroek, Jacob Levy, David Fridovich-Keil

We focus on developing efficient and reliable policy optimization strategies for robot learning with real-world data. In recent years, policy gradient methods have emerged as a promising paradigm for training control policies in simulation. However, these approaches often remain too data inefficient or unreliable to train on real robotic hardware. In this paper we introduce a novel policy gradient-based policy optimization framework which systematically leverages a (possibly highly simplified) first-principles model and enables learning precise control policies with limited amounts of real-world data. Our approach $1)$ uses the derivatives of the model to produce sample-efficient estimates of the policy gradient and $2)$ uses the model to design a low-level tracking controller, which is embedded in the policy class. Theoretical analysis provides insight into how the presence of this feedback controller overcomes key limitations of stand-alone policy gradient methods, while hardware experiments with a small car and quadruped demonstrate that our approach can learn precise control strategies reliably and with only minutes of real-world data.

A Theory for Emergence of Complex Skills in Language Models. (arXiv:2307.15936v2 [cs.LG] UPDATED)

Authors: Sanjeev Arora, Anirudh Goyal

A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework. Contributions include: (a) A statistical framework that relates cross-entropy loss of LLMs to competence on the basic skills that underlie language tasks. (b) Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias that allows the pre-trained model to learn very efficiently. We informally call this {\em slingshot generalization} since naively viewed it appears to give competence levels at skills that violate usual generalization theory. (c) A key example of slingshot generalization, that competence at executing tasks involving $k$-tuples of skills emerges essentially at the same scaling and same rate as competence on the elementary skills themselves.

AI Increases Global Access to Reliable Flood Forecasts. (arXiv:2307.16104v4 [cs.LG] UPDATED)

Authors: Grey Nearing, Deborah Cohen, Vusumuzi Dube, Martin Gauch, Oren Gilon, Shaun Harrigan, Avinatan Hassidim, Daniel Klotz, Frederik Kratzert, Asher Metzger, Sella Nevo, Florian Pappenberger, Christel Prudhomme, Guy Shalev, Shlomo Shenzis, Tadele Tekalign, Dana Weitzner, Yossi Matias

Floods are one of the most common natural disasters, with a disproportionate impact in developing countries that often lack dense streamflow gauge networks. Accurate and timely warnings are critical for mitigating flood risks, but hydrological simulation models typically must be calibrated to long data records in each watershed. Using AI, we achieve reliability in predicting extreme riverine events in ungauged watersheds at up to a 5-day lead time that is similar to or better than the reliability of nowcasts (0-day lead time) from a current state of the art global modeling system (the Copernicus Emergency Management Service Global Flood Awareness System). Additionally, we achieve accuracies over 5-year return period events that are similar to or better than current accuracies over 1-year return period events. This means that AI can provide flood warnings earlier and over larger and more impactful events in ungauged basins. The model developed in this paper was incorporated into an operational early warning system that produces publicly available (free and open) forecasts in real time in over 80 countries. This work highlights a need for increasing the availability of hydrological data to continue to improve global access to reliable flood warnings.

Discrete neural nets and polymorphic learning. (arXiv:2308.00677v2 [cs.NE] UPDATED)

Authors: Charlotte Aten

Theorems from universal algebra such as that of Murski\u{i} from the 1970s have a striking similarity to universal approximation results for neural nets along the lines of Cybenko's from the 1980s. We consider here a discrete analogue of the classical notion of a neural net which places these results in a unified setting. We introduce a learning algorithm based on polymorphisms of relational structures and show how to use it for a classical learning task.

Online covariance estimation for stochastic gradient descent under Markovian sampling. (arXiv:2308.01481v2 [math.ST] UPDATED)

Authors: Abhishek Roy, Krishnakumar Balasubramanian

We investigate the online overlapping batch-means covariance estimator for Stochastic Gradient Descent (SGD) under Markovian sampling. Convergence rates of order $O\big(\sqrt{d}\,n^{-1/8}(\log n)^{1/4}\big)$ and $O\big(\sqrt{d}\,n^{-1/8}\big)$ are established under state-dependent and state-independent Markovian sampling, respectively, where $d$ is the dimensionality and $n$ denotes observations or SGD iterations. These rates match the best-known convergence rate for independent and identically distributed (i.i.d) data. Our analysis overcomes significant challenges that arise due to Markovian sampling, leading to the introduction of additional error terms and complex dependencies between the blocks of the batch-means covariance estimator. Moreover, we establish the convergence rate for the first four moments of the $\ell_2$ norm of the error of SGD dynamics under state-dependent Markovian data, which holds potential interest as an independent result. Numerical illustrations provide confidence intervals for SGD in linear and logistic regression models under Markovian sampling. Additionally, our method is applied to the strategic classification with logistic regression, where adversaries adaptively modify features during training to affect target class classification.
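
For intuition, the batch-means construction itself is compact. Below is a simplified, offline, non-overlapping variant in NumPy; the paper's estimator is online and uses overlapping batches, so this is only a sketch of the underlying idea.

```python
# Simplified (offline, non-overlapping) batch-means covariance estimate
# from a trajectory of SGD iterates.
import numpy as np

def batch_means_covariance(iterates, batch_size):
    """iterates: (n, d) array of SGD iterates; returns a (d, d) estimate of
    the asymptotic covariance of sqrt(n) * (averaged iterate)."""
    n, d = iterates.shape
    k = n // batch_size
    batches = iterates[: k * batch_size].reshape(k, batch_size, d)
    batch_means = batches.mean(axis=1)          # (k, d)
    overall_mean = iterates.mean(axis=0)
    centered = batch_means - overall_mean
    # Sum of outer products of centered batch means, scaled by the batch size.
    return batch_size / (k - 1) * centered.T @ centered
```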

Fine-Tune Language Models as Differential Equation Solvers. (arXiv:2308.05061v2 [cs.LG] UPDATED)

Authors: Liu Yang, Siting Liu, Stanley J. Osher

In the growing domain of scientific machine learning, in-context operator learning has shown notable potential in learning operators and solving differential equations using prompted data during the inference stage, without weight updates. However, the current model's overdependence on function data may inadvertently overlook the invaluable human insight into the operator. To address this, we present a transformation of in-context operator learning into a multi-modal paradigm. In particular, we take inspiration from the recent success of large language models, and propose using "captions" to integrate human knowledge about the operator, expressed through natural language descriptions and equations. Also, we introduce a novel approach to train a language-model-like architecture, or directly fine-tune existing language models, for in-context operator learning. We beat the baseline on single-modal learning tasks, and also demonstrate the effectiveness of multi-modal learning in enhancing performance and reducing function data requirements. The proposed method not only significantly improves in-context operator learning, but also creates a new path for the application of language models.

Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features. (arXiv:2308.06197v2 [cs.CV] UPDATED)

Authors: Angus Maiden (1), Bahareh Nakisa (1) ((1) Deakin University)

Complex emotion recognition is a cognitive task that has so far eluded the same excellent performance of other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in complex facial expression recognition as a human, it may need to synthesise knowledge and understand new concepts in real-time, as humans do. Humans are able to learn new concepts using only few examples by distilling important information from memories. Inspired by human cognition and learning, we propose a novel continual learning method for complex facial expression recognition that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. In this work, we also use GradCAM visualisations to demonstrate the relationship between basic and compound facial expressions. Our method leverages this relationship through knowledge distillation and a novel Predictive Sorting Memory Replay, to achieve the current state-of-the-art in continual learning for complex facial expression recognition, with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. Our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using only a single training sample per class.

Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning. (arXiv:2308.09544v3 [cs.LG] UPDATED)

Authors: Filip Szatkowski, Mateusz Pyla, Marcin Przewięźlikowski, Sebastian Cygert, Bartłomiej Twardowski, Tomasz Trzciński

In this work, we investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy, aiming to prevent forgetting. KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks. Our analysis reveals that this issue originates from substantial representation shifts in the teacher network when dealing with out-of-distribution data. This causes large errors in the KD loss component, leading to performance degradation in CIL models. Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main models during incremental training. Our method seamlessly integrates with KD-based CIL approaches and allows for consistent enhancement of their performance across multiple exemplar-free CIL benchmarks. The source code for our method is available at https://github.com/fszatkowski/cl-teacher-adaptation.
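
The core idea, updating the teacher alongside the student, can be sketched in a few lines. One plausible instantiation (an assumption on our part, not necessarily the authors' exact rule) is to let the teacher's normalization statistics adapt to current-task data during the distillation forward pass:

```python
# Hedged sketch: knowledge distillation where the teacher also adapts to the
# incoming batch via its BatchNorm running statistics.
import torch
import torch.nn.functional as F

def kd_step(student, teacher, x, optimizer, T=2.0):
    teacher.train()               # BatchNorm running stats update on new data,
    with torch.no_grad():         # even though no gradients flow to the teacher
        t_logits = teacher(x)
    teacher.eval()
    s_logits = student(x)
    kd_loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                       F.softmax(t_logits / T, dim=1),
                       reduction="batchmean") * T * T
    # A supervised loss on the current task's labels would be added here
    # in a full class-incremental training loop.
    optimizer.zero_grad()
    kd_loss.backward()
    optimizer.step()
    return kd_loss.item()
```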

Flamingo: Multi-Round Single-Server Secure Aggregation with Applications to Private Federated Learning. (arXiv:2308.09883v2 [cs.CR] UPDATED)

Authors: Yiping Ma, Jess Woods, Sebastian Angel, Antigoni Polychroniadou, Tal Rabin

This paper introduces Flamingo, a system for secure aggregation of data across a large set of clients. In secure aggregation, a server sums up the private inputs of clients and obtains the result without learning anything about the individual inputs beyond what is implied by the final sum. Flamingo focuses on the multi-round setting found in federated learning in which many consecutive summations (averages) of model weights are performed to derive a good model. Previous protocols, such as Bell et al. (CCS '20), have been designed for a single round and are adapted to the federated learning setting by repeating the protocol multiple times. Flamingo eliminates the need for the per-round setup of previous protocols, and has a new lightweight dropout resilience protocol to ensure that if clients leave in the middle of a sum the server can still obtain a meaningful result. Furthermore, Flamingo introduces a new way to locally choose the so-called client neighborhood introduced by Bell et al. These techniques help Flamingo reduce the number of interactions between clients and the server, resulting in a significant reduction in the end-to-end runtime for a full training session over prior work. We implement and evaluate Flamingo and show that it can securely train a neural network on the (Extended) MNIST and CIFAR-100 datasets, and the model converges without a loss in accuracy, compared to a non-private federated learning system.

An Online Multiple Kernel Parallelizable Learning Scheme. (arXiv:2308.10101v2 [cs.LG] UPDATED)

Authors: Emilio Ruiz-Moreno, Baltasar Beferull-Lozano

The performance of reproducing kernel Hilbert space-based methods is known to be sensitive to the choice of the reproducing kernel. Choosing an adequate reproducing kernel can be challenging and computationally demanding, especially in data-rich tasks without prior information about the solution domain. In this paper, we propose a learning scheme that scalably combines several single kernel-based online methods to reduce the kernel-selection bias. The proposed learning scheme applies to any task formulated as a regularized empirical risk minimization convex problem. More specifically, our learning scheme is based on a multi-kernel learning formulation that can be applied to widen any single-kernel solution space, thus increasing the possibility of finding higher-performance solutions. In addition, it is parallelizable, allowing for the distribution of the computational load across different computing units. We show experimentally that the proposed learning scheme outperforms each of its constituent single-kernel online methods in terms of the cumulative regularized least-squares cost.
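
One way to picture "scalably combining several single-kernel online methods" is a multiplicative-weights combination of independently updated base learners, each of which could run on its own computing unit. The sketch below is an illustrative stand-in: the naive refit-everything base learner and the Hedge-style weight update are our assumptions, not the paper's exact scheme.

```python
# Illustrative online multiple-kernel combination: several single-kernel
# learners run in parallel and their predictions are mixed by loss-driven weights.
import numpy as np

def rbf(a, b, gamma):
    return np.exp(-gamma * np.sum((a - b) ** 2))

class OnlineKRLS:
    """Naive single-kernel online learner: refit kernel ridge regression on
    all samples seen so far (cubic cost, acceptable for a sketch)."""
    def __init__(self, gamma, lam=0.1):
        self.gamma, self.lam = gamma, lam
        self.X, self.y, self.alpha = [], [], None

    def predict(self, x):
        if self.alpha is None:
            return 0.0
        k = np.array([rbf(xi, x, self.gamma) for xi in self.X])
        return float(k @ self.alpha)

    def update(self, x, y):
        self.X.append(x); self.y.append(y)
        K = np.array([[rbf(a, b, self.gamma) for b in self.X] for a in self.X])
        self.alpha = np.linalg.solve(K + self.lam * np.eye(len(self.X)),
                                     np.array(self.y))

class HedgeCombiner:
    """Multiplicative-weights mixture over base learners (losses assumed bounded)."""
    def __init__(self, learners, eta=0.1):
        self.learners = learners
        self.w = np.ones(len(learners)) / len(learners)
        self.eta = eta

    def step(self, x, y):
        preds = np.array([m.predict(x) for m in self.learners])
        yhat = float(self.w @ preds)             # predict before observing y
        self.w *= np.exp(-self.eta * (preds - y) ** 2)
        self.w /= self.w.sum()                   # reweight by per-learner loss
        for m in self.learners:                  # each update is independent,
            m.update(x, y)                       # hence parallelizable
        return yhat
```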

SBSM-Pro: Support Bio-sequence Machine for Proteins. (arXiv:2308.10275v2 [q-bio.QM] UPDATED)

Authors: Yizheng Wang, Yixiao Zhai, Yijie Ding, Quan Zou

Proteins play a pivotal role in biological systems. The use of machine learning algorithms for protein classification can assist and even guide biological experiments, offering crucial insights for biotechnological applications. We introduce the Support Bio-Sequence Machine for Proteins (SBSM-Pro), a model purpose-built for the classification of biological sequences. This model starts with raw sequences and groups amino acids based on their physicochemical properties. It incorporates sequence alignment to measure the similarities between proteins and uses a novel multiple kernel learning (MKL) approach to integrate various types of information, utilizing support vector machines for classification prediction. The results indicate that our model demonstrates commendable performance across ten datasets in terms of the identification of protein function and posttranslational modification. This research not only exemplifies state-of-the-art work in protein classification but also paves avenues for new directions in this domain, representing a beneficial endeavor in the development of platforms tailored for the classification of biological sequences. SBSM-Pro is available for access at this http URL

Low Tensor Rank Learning of Neural Dynamics. (arXiv:2308.11567v2 [q-bio.NC] UPDATED)

Authors: Arthur Pellegrino, N Alex Cayco-Gajic, Angus Chadwick

Learning relies on coordinated synaptic changes in recurrently connected populations of neurons. Therefore, understanding the collective evolution of synaptic connectivity over learning is a key challenge in neuroscience and machine learning. In particular, recent work has shown that the weight matrices of task-trained RNNs are typically low rank, but how this low rank structure unfolds over learning is unknown. To address this, we investigate the rank of the 3-tensor formed by the weight matrices throughout learning. By fitting RNNs of varying rank to large-scale neural recordings during a motor learning task, we find that the inferred weights are low-tensor-rank and therefore evolve over a fixed low-dimensional subspace throughout the entire course of learning. We next validate the observation of low-tensor-rank learning on an RNN trained to solve the same task. Finally, we present a set of mathematical results bounding the matrix and tensor ranks of gradient descent learning dynamics which show that low-tensor-rank weights emerge naturally in RNNs trained to solve low-dimensional tasks. Taken together, our findings provide insight on the evolution of population connectivity over learning in both biological and artificial neural networks, and enable reverse engineering of learning-induced changes in recurrent dynamics from large-scale neural recordings.

Optimal data pooling for shared learning in maintenance operations. (arXiv:2308.12670v2 [cs.LG] UPDATED)

Authors: Collin Drent, Melvin Drent, Geert-Jan van Houtum

We study optimal data pooling for shared learning in two common maintenance operations: condition-based maintenance and spare parts management. We consider a set of systems subject to Poisson input -- the degradation or demand process -- that are coupled through an a-priori unknown rate. Decision problems involving these systems are high-dimensional Markov decision processes (MDPs) and hence notoriously difficult to solve. We present a decomposition result that reduces such an MDP to two-dimensional MDPs, enabling structural analyses and computations. Leveraging this decomposition, we (i) demonstrate that pooling data can lead to significant cost reductions compared to not pooling, and (ii) show that the optimal policy for the condition-based maintenance problem is a control limit policy, while for the spare parts management problem, it is an order-up-to level policy, both dependent on the pooled data.

Linear Oscillation: A Novel Activation Function for Vision Transformer. (arXiv:2308.13670v3 [cs.LG] UPDATED)

Authors: Juyoung Yun

Activation functions are the linchpins of deep learning, profoundly influencing both the representational capacity and training dynamics of neural networks. They shape not only the nature of representations but also optimize convergence rates and enhance generalization potential. Appreciating this critical role, we present the Linear Oscillation (LoC) activation function, defined as $f(x) = x \times \sin(\alpha x + \beta)$. Distinct from conventional activation functions which primarily introduce non-linearity, LoC seamlessly blends linear trajectories with oscillatory deviations. The nomenclature "Linear Oscillation" is a nod to its unique attribute of infusing linear activations with harmonious oscillations, capturing the essence of the "Importance of Confusion". This concept of "controlled confusion" within network activations is posited to foster more robust learning, particularly in contexts that necessitate discerning subtle patterns. Our empirical studies reveal that, when integrated into diverse neural architectures, the LoC activation function consistently outperforms established counterparts like ReLU and Sigmoid. The stellar performance exhibited by the avant-garde Vision Transformer model using LoC further validates its efficacy. This study illuminates the remarkable benefits of the LoC over other prominent activation functions. It champions the notion that intermittently introducing deliberate complexity or "confusion" during training can spur more profound and nuanced learning. This accentuates the pivotal role of judiciously selected activation functions in shaping the future of neural network training.
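
Since the activation is given in closed form, it translates directly to code. The PyTorch module below implements $f(x) = x \times \sin(\alpha x + \beta)$; treating $\alpha$ and $\beta$ as fixed hyperparameters is our assumption, as the abstract does not say whether they are learned.

```python
# Direct implementation of the Linear Oscillation activation
# f(x) = x * sin(alpha * x + beta).
import torch
import torch.nn as nn

class LinearOscillation(nn.Module):
    def __init__(self, alpha=1.0, beta=0.0):
        super().__init__()
        # Fixed scalars here; making them learnable parameters is a variant.
        self.alpha, self.beta = alpha, beta

    def forward(self, x):
        return x * torch.sin(self.alpha * x + self.beta)

# Drop-in replacement for ReLU inside a block:
block = nn.Sequential(nn.Linear(128, 128), LinearOscillation())
```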

Detecting Language Model Attacks with Perplexity. (arXiv:2308.14132v2 [cs.CL] UPDATED)

Authors: Gabriel Alon, Michael Kamfonas

A novel hack involving Large Language Models (LLMs) has emerged, leveraging adversarial suffixes to trick models into generating perilous responses. This method has garnered considerable attention from reputable media outlets such as the New York Times and Wired, thereby influencing public perception regarding the security and safety of LLMs. In this study, we advocate the utilization of perplexity as one of the means to recognize such potential attacks. The underlying concept behind these hacks revolves around appending an unusually constructed string of text to a harmful query that would otherwise be blocked. This maneuver confuses the protective mechanisms and tricks the model into generating a forbidden response. Such scenarios could result in providing detailed instructions to a malicious user for constructing explosives or orchestrating a bank heist. Our investigation demonstrates the feasibility of employing perplexity, a prevalent natural language processing metric, to detect these adversarial tactics before generating a forbidden response. By evaluating the perplexity of queries with and without such adversarial suffixes using an open-source LLM, we discovered that nearly 90 percent were above a perplexity of 1000. This contrast underscores the efficacy of perplexity for detecting this type of exploit.
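
The detection recipe is simple enough to sketch end to end: score each incoming query with an open-source causal LM and flag high-perplexity queries. The snippet below uses GPT-2 via Hugging Face transformers as an illustrative stand-in for the paper's LM, with the 1000 threshold taken from the abstract.

```python
# Perplexity-based screening of queries for adversarial suffixes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids yields the mean next-token NLL as .loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def looks_adversarial(query: str, threshold: float = 1000.0) -> bool:
    # Gibberish-like adversarial suffixes score far above the perplexity
    # of ordinary natural-language queries.
    return perplexity(query) > threshold
```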

Rethinking the Power of Graph Canonization in Graph Representation Learning with Stability. (arXiv:2309.00738v2 [cs.LG] UPDATED)

Authors: Zehao Dong, Muhan Zhang, Philip R.O. Payne, Michael A Province, Carlos Cruchaga, Tianyu Zhao, Fuhai Li, Yixin Chen

The expressivity of Graph Neural Networks (GNNs) has been studied broadly in recent years to reveal the design principles for more powerful GNNs. Graph canonization is known as a typical approach to distinguish non-isomorphic graphs, yet it is rarely adopted when developing expressive GNNs. This paper proposes to maximize the expressivity of GNNs by graph canonization, and then studies the power of such GNNs from the perspective of model stability. A stable GNN will map similar graphs to close graph representations in the vector space, and the stability of GNNs is critical to generalize their performance to unseen graphs. We theoretically reveal the trade-off of expressivity and stability in graph-canonization-enhanced GNNs. Then we introduce a notion of universal graph canonization as the general solution to address the trade-off and characterize a widely applicable sufficient condition to solve the universal graph canonization. A comprehensive set of experiments demonstrates the effectiveness of the proposed method. In many popular graph benchmark datasets, graph canonization successfully enhances GNNs and provides highly competitive performance, indicating the capability and great potential of the proposed method in general graph representation learning. In graph datasets where the sufficient condition holds, GNNs enhanced by universal graph canonization consistently outperform GNN baselines and improve the SOTA performance by up to $31\%$, providing the optimal solution to numerous challenging real-world graph analytical tasks like gene network representation learning in bioinformatics.

RADIO: Reference-Agnostic Dubbing Video Synthesis. (arXiv:2309.01950v2 [cs.CV] UPDATED)

Authors: Dongyeun Lee, Chaewon Kim, Sangjoon Yu, Jaejun Yoo, Gyeong-Moon Park

One of the most challenging problems in audio-driven talking head generation is achieving high-fidelity detail while ensuring precise synchronization. Given only a single reference image, extracting meaningful identity attributes becomes even more challenging, often causing the network to mirror the facial and lip structures too closely. To address these issues, we introduce RADIO, a framework engineered to yield high-quality dubbed videos regardless of the pose or expression in reference images. The key is to modulate the decoder layers using latent space composed of audio and reference features. Additionally, we incorporate ViT blocks into the decoder to emphasize high-fidelity details, especially in the lip region. Our experimental results demonstrate that RADIO displays high synchronization without the loss of fidelity. Especially in harsh scenarios where the reference frame deviates significantly from the ground truth, our method outperforms state-of-the-art methods, highlighting its robustness.

PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference. (arXiv:2309.02334v2 [cs.LG] UPDATED)

Authors: Marta Andronic, George A. Constantinides

Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the idea that the LUTs in an FPGA can be used to implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.

A physics-informed and attention-based graph learning approach for regional electric vehicle charging demand prediction. (arXiv:2309.05259v2 [cs.LG] UPDATED)

Authors: Haohao Qu, Haoxuan Kuang, Jun Li, Linlin You

Along with the proliferation of electric vehicles (EVs), optimizing the use of EV charging space can significantly alleviate the growing load on intelligent transportation systems. As the foundation to achieve such an optimization, a spatiotemporal method for EV charging demand prediction in urban areas is required. Although several solutions have been proposed using data-driven deep learning methods, these performance-oriented methods may fail to correctly handle the inverse relationship between charging demand and prices. To tackle the emerging challenges of training an accurate and interpretable prediction model, this paper proposes a novel approach that integrates graph and temporal attention mechanisms for feature extraction and uses physics-informed meta-learning in the model pre-training step for knowledge transfer. Evaluation results on a dataset of 18,013 EV charging piles in Shenzhen, China, show that the proposed approach, named PAG, achieves state-of-the-art forecasting performance and the ability to understand the adaptive changes in charging demand caused by price fluctuations.

PRE: Vision-Language Prompt Learning with Reparameterization Encoder. (arXiv:2309.07760v2 [cs.CV] UPDATED)

Authors: Anh Pham Thi Minh, An Duc Nguyen, Georgios Tzimiropoulos

Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge for deploying such models in practice since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, the recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp can achieve substantial improvements over manual prompts, its learned context generalizes poorly to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, PRE achieves a notable enhancement of 5.60% in average accuracy on New classes and 3% in Harmonic mean compared to CoOp in the 16-shot setting, all within a reasonable training time.

HINT: Healthy Influential-Noise based Training to Defend against Data Poisoning Attacks. (arXiv:2309.08549v2 [cs.LG] UPDATED)

Authors: Minh-Hao Van, Alycia N. Carey, Xintao Wu

While numerous defense methods have been proposed to prohibit potential poisoning attacks from untrusted data sources, most research works only defend against specific attacks, which leaves many avenues for an adversary to exploit. In this work, we propose an efficient and robust training approach to defend against data poisoning attacks based on influence functions, named Healthy Influential-Noise based Training (HINT). Using influence functions, we craft healthy noise that helps to harden the classification model against poisoning attacks without significantly affecting the generalization ability on test data. In addition, our method can perform effectively when only a subset of the training data is modified, rather than adding noise to all examples as done in several previous works. We conduct comprehensive evaluations over two image datasets with state-of-the-art poisoning attacks under different realistic attack scenarios. Our empirical results show that HINT can efficiently protect deep learning models against the effect of both untargeted and targeted poisoning attacks.

Active Learning for Semantic Segmentation with Multi-class Label Query. (arXiv:2309.09319v2 [cs.CV] UPDATED)

Authors: Sehyun Hwang, Sohyun Lee, Hoyoung Kim, Minhyeon Oh, Jungseul Ok, Suha Kwak

This paper proposes a new active learning method for semantic segmentation. The core of our method lies in a new annotation query design. It samples informative local image regions (e.g., superpixels), and for each of such regions, asks an oracle for a multi-hot vector indicating all classes existing in the region. This multi-class labeling strategy is substantially more efficient than existing ones like segmentation, polygon, and even dominant class labeling in terms of annotation time per click. However, it introduces the class ambiguity issue in training as it assigns partial labels (i.e., a set of candidate classes) to individual pixels. We thus propose a new algorithm for learning semantic segmentation while disambiguating the partial labels in two stages. In the first stage, it trains a segmentation model directly with the partial labels through two new loss functions motivated by partial label learning and multiple instance learning. In the second stage, it disambiguates the partial labels by generating pixel-wise pseudo labels, which are used for supervised learning of the model. Equipped with a new acquisition function dedicated to the multi-class labeling, our method outperforms previous work on Cityscapes and PASCAL VOC 2012 while spending less annotation cost. Our code and results are available at https://github.com/sehyun03/MulActSeg.
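
The first-stage training signal can be written down directly: each pixel carries a multi-hot candidate set, and one natural partial-label loss maximizes the probability mass the model assigns to that set. The sketch below shows this "merged positive" loss; it is one plausible instantiation, not necessarily the paper's exact pair of losses.

```python
import torch
import torch.nn.functional as F

def partial_label_loss(logits, candidate_mask):
    """logits: (N, C) per-pixel scores; candidate_mask: (N, C) multi-hot
    oracle answer (1 for each class present in the pixel's region).
    Maximizes the total probability on the candidate set."""
    probs = F.softmax(logits, dim=1)
    p_cand = (probs * candidate_mask).sum(dim=1).clamp_min(1e-8)
    return -torch.log(p_cand).mean()

logits = torch.randn(4, 5, requires_grad=True)
mask = torch.tensor([[1., 0., 1., 0., 0.],
                     [0., 1., 0., 0., 1.],
                     [1., 1., 0., 0., 0.],
                     [0., 0., 0., 1., 0.]])
loss = partial_label_loss(logits, mask)
loss.backward()
print(float(loss))
```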

Spatial-frequency channels, shape bias, and adversarial robustness. (arXiv:2309.13190v2 [cs.LG] UPDATED)

Authors: Ajay Subramanian, Elena Sizikova, Najib J. Majaj, Denis G. Pelli

What spatial frequency information do humans and neural networks use to recognize objects? In neuroscience, critical band masking is an established tool that can reveal the frequency-selective filters used for object recognition. Critical band masking measures the sensitivity of recognition performance to noise added at each spatial frequency. Existing critical band masking studies show that humans recognize periodic patterns (gratings) and letters by means of a spatial-frequency filter (or "channel") that has a frequency bandwidth of one octave (doubling of frequency). Here, we introduce critical band masking as a task for network-human comparison and test 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise. We find that humans recognize objects in natural images using the same one-octave-wide channel that they use for letters and gratings, making it a canonical feature of human object recognition. Unlike humans, the neural network channel is very broad, 2-4 times wider than the human channel. Thus, noise at certain high and low frequencies will impair network performance and spare human performance. Adversarial and augmented-image training are commonly used to increase network robustness and shape bias. Does this training align network and human object recognition channels? Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (51% variance explained) and robustness of adversarially-trained networks (66% variance explained). Adversarial training increases robustness but expands the channel bandwidth even further beyond the human bandwidth. Thus, critical band masking reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only makes it worse. Networks with narrower channels might be more robust.
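
The masking stimulus itself is simple to generate: white noise bandpass-filtered to a one-octave band around a chosen spatial frequency. A NumPy sketch, where the image size, RMS level, and band center are illustrative choices:

```python
import numpy as np

def octave_band_noise(shape, center_cpi, rms=0.1, seed=0):
    """White noise restricted to the one-octave band
    [center/sqrt(2), center*sqrt(2)] in cycles per image."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(shape)
    fy = np.fft.fftfreq(shape[0]) * shape[0]   # cycles per image
    fx = np.fft.fftfreq(shape[1]) * shape[1]
    radius = np.hypot(*np.meshgrid(fy, fx, indexing="ij"))
    band = (radius >= center_cpi / np.sqrt(2)) & (radius < center_cpi * np.sqrt(2))
    filtered = np.fft.ifft2(np.fft.fft2(noise) * band).real
    return filtered * (rms / (filtered.std() + 1e-12))

img = np.zeros((224, 224))                     # stand-in image
noisy = img + octave_band_noise(img.shape, center_cpi=16)
print(round(noisy.std(), 3))                   # ~0.1; sweep center_cpi per band
```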

DeepACO: Neural-enhanced Ant Systems for Combinatorial Optimization. (arXiv:2309.14032v2 [cs.NE] UPDATED)

Authors: Haoran Ye, Jiarui Wang, Zhiguang Cao, Helan Liang, Yong Li

Ant Colony Optimization (ACO) is a meta-heuristic algorithm that has been successfully applied to various Combinatorial Optimization Problems (COPs). Traditionally, customizing ACO for a specific problem requires the expert design of knowledge-driven heuristics. In this paper, we propose DeepACO, a generic framework that leverages deep reinforcement learning to automate heuristic designs. DeepACO serves to strengthen the heuristic measures of existing ACO algorithms and dispense with laborious manual design in future ACO applications. As a neural-enhanced meta-heuristic, DeepACO consistently outperforms its ACO counterparts on eight COPs using a single neural architecture and a single set of hyperparameters. As a Neural Combinatorial Optimization method, DeepACO performs better than or on par with problem-specific methods on canonical routing problems. Our code is publicly available at https://github.com/henry-yeh/DeepACO.
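
The division of labor is: a learned model supplies the heuristic matrix eta, while the classical ACO loop (pheromone tau, tour sampling, evaporation, deposit) is unchanged. The sketch below runs that loop on a small TSP; the `learned_heuristic` placeholder stands in for DeepACO's GNN trained with reinforcement learning.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
xy = rng.random((n, 2))
dist = np.linalg.norm(xy[:, None] - xy[None, :], axis=-1) + np.eye(n)

def learned_heuristic(d):
    # Placeholder for a neural heuristic; classical ACO uses 1/distance.
    return 1.0 / d

eta, tau = learned_heuristic(dist), np.ones((n, n))
alpha, beta, rho, n_ants = 1.0, 2.0, 0.1, 20

def tour_length(t):
    return sum(dist[t[i], t[(i + 1) % n]] for i in range(n))

best_len = np.inf
for _ in range(50):
    tours = []
    for _ in range(n_ants):
        tour, visited = [0], {0}
        while len(tour) < n:
            w = (tau[tour[-1]] ** alpha) * (eta[tour[-1]] ** beta)
            w[list(visited)] = 0.0
            nxt = int(rng.choice(n, p=w / w.sum()))
            tour.append(nxt); visited.add(nxt)
        tours.append(tour)
    tau *= 1 - rho                        # evaporation
    for t in tours:                       # pheromone deposit
        L = tour_length(t)
        best_len = min(best_len, L)
        for i in range(n):
            tau[t[i], t[(i + 1) % n]] += 1.0 / L
print(round(best_len, 3))
```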

MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks. (arXiv:2309.14118v2 [cs.LG] UPDATED)

Authors: Vinitra Swamy, Malika Satayeva, Jibril Frej, Thierry Bossy, Thijs Vogels, Martin Jaggi, Tanja Käser, Mary-Anne Hartley

Predicting multiple real-world tasks in a single model often requires a particularly diverse feature space. Multimodal (MM) models aim to extract the synergistic predictive potential of multiple data types to create a shared feature space with aligned semantic meaning across inputs of drastically varying sizes (i.e. images, text, sound). Most current MM architectures fuse these representations in parallel, which not only limits their interpretability but also creates a dependency on modality availability. We present MultiModN, a multimodal, modular network that fuses latent representations in a sequence of any number, combination, or type of modality while providing granular real-time predictive feedback on any number or combination of predictive tasks. MultiModN's composable pipeline is interpretable-by-design, as well as innately multi-task and robust to the fundamental issue of biased missingness. We perform four experiments on several benchmark MM datasets across 10 real-world tasks (predicting medical diagnoses, academic performance, and weather), and show that MultiModN's sequential MM fusion does not compromise performance compared with a baseline of parallel fusion. By simulating the challenging bias of missing not-at-random (MNAR), this work shows that, contrary to MultiModN, parallel fusion baselines erroneously learn MNAR and suffer catastrophic failure when faced with different patterns of MNAR at inference. To the best of our knowledge, this is the first inherently MNAR-resistant approach to MM modeling. In conclusion, MultiModN provides granular insights, robustness, and flexibility without compromising performance.

Uncertainty Quantification via Neural Posterior Principal Components. (arXiv:2309.15533v2 [cs.CV] UPDATED)

Authors: Elias Nehme, Omer Yair, Tomer Michaeli

Uncertainty quantification is crucial for the deployment of image restoration models in safety-critical domains, like autonomous driving and biological imaging. To date, methods for uncertainty visualization have mainly focused on per-pixel estimates. Yet, a heatmap of per-pixel variances is typically of little practical use, as it does not capture the strong correlations between pixels. A more natural measure of uncertainty corresponds to the variances along the principal components (PCs) of the posterior distribution. Theoretically, the PCs can be computed by applying PCA on samples generated from a conditional generative model for the input image. However, this requires generating a very large number of samples at test time, which is painfully slow with the current state-of-the-art (diffusion) models. In this work, we present a method for predicting the PCs of the posterior distribution for any input image, in a single forward pass of a neural network. Our method can either wrap around a pre-trained model that was trained to minimize the mean square error (MSE), or can be trained from scratch to output both a predicted image and the posterior PCs. We showcase our method on multiple inverse problems in imaging, including denoising, inpainting, super-resolution, and biological image-to-image translation. Our method reliably conveys instance-adaptive uncertainty directions, achieving uncertainty quantification comparable with posterior samplers while being orders of magnitude faster. Code and examples are available at https://eliasnehme.github.io/NPPC/

On the Computational Entanglement of Distant Features in Adversarial Machine Learning. (arXiv:2309.15669v3 [cs.LG] UPDATED)

Authors: YenLung Lai, Xingbo Dong, Zhe Jin

Adversarial examples in machine learning have emerged as a focal point of research due to their remarkable ability to deceive models with seemingly inconspicuous input perturbations, potentially resulting in severe consequences. In this study, we embark on a comprehensive exploration of adversarial machine learning models, shedding light on their intrinsic complexity and interpretability. Our investigation reveals intriguing links between machine learning model complexity and Einstein's theory of special relativity, all through the lens of entanglement. While our work does not primarily center on quantum entanglement, we instead define the entanglement correlations we have discovered to be computational, and demonstrate that distant feature samples can be entangled, strongly resembling entanglement correlation in the quantum realm. This revelation bestows fresh insights for understanding the phenomenon of emergent adversarial examples in modern machine learning, potentially paving the way for more robust and interpretable models in this rapidly evolving field.

GInX-Eval: Towards In-Distribution Evaluation of Graph Neural Network Explanations. (arXiv:2309.16223v2 [cs.AI] UPDATED)

Authors: Kenza Amara, Mennatallah El-Assady, Rex Ying

Diverse explainability methods of graph neural networks (GNN) have recently been developed to highlight the edges and nodes in the graph that contribute the most to the model predictions. However, it is not clear yet how to evaluate the correctness of those explanations, whether it is from a human or a model perspective. One unaddressed bottleneck in the current evaluation procedure is the problem of out-of-distribution explanations, whose distribution differs from those of the training data. This important issue affects existing evaluation metrics such as the popular faithfulness or fidelity score. In this paper, we show the limitations of faithfulness metrics. We propose GInX-Eval (Graph In-distribution eXplanation Evaluation), an evaluation procedure of graph explanations that overcomes the pitfalls of faithfulness and offers new insights on explainability methods. Using a fine-tuning strategy, the GInX score measures how informative removed edges are for the model and the EdgeRank score evaluates if explanatory edges are correctly ordered by their importance. GInX-Eval verifies if ground-truth explanations are instructive to the GNN model. In addition, it shows that many popular methods, including gradient-based methods, produce explanations that are not better than a random designation of edges as important subgraphs, challenging the findings of current works in the area. Results with GInX-Eval are consistent across multiple datasets and align with human evaluation.

Benefits of mirror weight symmetry for 3D mesh segmentation in biomedical applications. (arXiv:2309.17076v2 [eess.IV] UPDATED)

Authors: Vladislav Dordiuk, Maksim Dzhigil, Konstantin Ushenin

3D mesh segmentation is an important task with many biomedical applications. The human body has bilateral symmetry and some variations in organ positions. This allows us to expect a positive effect from rotation- and inversion-invariant layers in convolutional neural networks that perform biomedical segmentation. In this study, we show the impact of weight symmetry in neural networks that perform 3D mesh segmentation. We analyze the problem of 3D mesh segmentation for pathological vessel structures (aneurysms) and conventional anatomical structures (endocardium and epicardium of ventricles). Local geometrical features are encoded as samples from the signed distance function, and the neural network performs a prediction for each mesh node. We show that weight symmetry yields 1 to 3% additional accuracy and allows the number of trainable parameters to be reduced by up to 8 times without performance loss, provided the neural network has at least three convolutional layers. This also holds for very small training sets.
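
Weight symmetry can be imposed by construction: tie each kernel to its mirrored copy so the effective weights are symmetric and the free parameter count drops. The paper operates on 3D meshes; the 2D convolution below is only a compact analogue of the mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MirrorSymmetricConv2d(nn.Module):
    """Effective kernel = average of the raw kernel and its left-right
    flip, so outputs respect mirror symmetry and roughly half the
    kernel parameters are free (a 2D sketch of weight symmetry)."""
    def __init__(self, cin, cout, k=3):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(cout, cin, k, k) * 0.1)
        self.bias = nn.Parameter(torch.zeros(cout))

    def forward(self, x):
        w = 0.5 * (self.raw + torch.flip(self.raw, dims=[-1]))
        return F.conv2d(x, w, self.bias, padding=1)

layer = MirrorSymmetricConv2d(3, 8)
print(layer(torch.randn(1, 3, 16, 16)).shape)  # torch.Size([1, 8, 16, 16])
```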

Data Filtering Networks. (arXiv:2309.17425v3 [cs.AI] UPDATED)

Authors: Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, Vaishaal Shankar

Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art CLIP models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 84.4% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data.

Towards Robust Cardiac Segmentation using Graph Convolutional Networks. (arXiv:2310.01210v4 [eess.IV] UPDATED)

Authors: Gilles Van De Vyver, Sarina Thomas, Guy Ben-Yosef, Sindre Hellum Olaisen, Håvard Dalen, Lasse Løvstakken, Erik Smistad

Fully automatic cardiac segmentation can be a fast and reproducible method to extract clinical measurements from an echocardiography examination. The U-Net architecture is the current state-of-the-art deep learning architecture for medical segmentation and can segment cardiac structures in real-time with average errors comparable to inter-observer variability. However, this architecture still generates large outliers that are often anatomically incorrect. This work uses the concept of graph convolutional neural networks that predict the contour points of the structures of interest instead of labeling each pixel. We propose a graph architecture that uses two convolutional rings based on cardiac anatomy and show that this eliminates anatomically incorrect multi-structure segmentations on the publicly available CAMUS dataset. Additionally, this work contributes an ablation study on the graph convolutional architecture and an evaluation of clinical measurements on the clinical HUNT4 dataset. Finally, we propose to use the inter-model agreement of the U-Net and the graph network as a predictor of both the input and segmentation quality. We show this predictor can detect out-of-distribution and unsuitable input images in real-time. Source code is available online: https://github.com/gillesvntnu/GCN_multistructure

DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training. (arXiv:2310.02025v2 [cs.LG] UPDATED)

Authors: Aochuan Chen, Yimeng Zhang, Jinghan Jia, James Diffenderfer, Jiancheng Liu, Konstantinos Parasyris, Yihua Zhang, Zheng Zhang, Bhavya Kailkhura, Sijia Liu

Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain. However, the scalability of ZO optimization remains an open problem: Its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. To the best of our knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch through three primary innovations. First, we demonstrate the advantages of coordinate-wise gradient estimation (CGE) over randomized vector-wise gradient estimation in training accuracy and computational efficiency. Second, we propose a sparsity-induced ZO training protocol that extends the model pruning methodology using only finite differences to explore and exploit the sparse DL prior in CGE. Third, we develop the methods of feature reuse and forward parallelization to advance the practical implementations of ZO training. Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. Furthermore, we show the practical utility of DeepZero in applications of certified adversarial defense and DL-based partial differential equation error correction, achieving 10-20% improvement over SOTA. We believe our results will inspire future research on scalable ZO optimization and contribute to advancing DL with black-box optimization.
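
CGE itself is a few lines: one forward finite difference per coordinate yields a deterministic gradient estimate, and DeepZero's sparsity protocol then restricts the probed coordinates to a pruned subset. A dense, minimal version:

```python
import numpy as np

def cge_gradient(f, x, mu=1e-3):
    """Coordinate-wise gradient estimation: (f(x + mu*e_i) - f(x)) / mu
    for every coordinate i (DeepZero additionally restricts i to a
    sparse, pruning-derived subset, omitted here)."""
    fx, g = f(x), np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = mu
        g[i] = (f(x + e) - fx) / mu
    return g

f = lambda x: np.sum((x - 1.0) ** 2)    # toy objective, no autograd used
x = np.zeros(5)
for _ in range(200):
    x -= 0.1 * cge_gradient(f, x)       # plain ZO gradient descent
print(np.round(x, 3))                    # -> approx. all ones
```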

De Novo Drug Design with Joint Transformers. (arXiv:2310.02066v2 [cs.LG] UPDATED)

Authors: Adam Izdebski, Ewelina Weglarz-Tomczak, Ewa Szczurek, Jakub M. Tomczak

De novo drug design requires simultaneously generating novel molecules outside of training data and predicting their target properties, making it a hard task for generative models. To address this, we propose Joint Transformer that combines a Transformer decoder, a Transformer encoder, and a predictor in a joint generative model with shared weights. We show that training the model with a penalized log-likelihood objective results in state-of-the-art performance in molecule generation, while decreasing the prediction error on newly sampled molecules, as compared to a fine-tuned decoder-only Transformer, by 42%. Finally, we propose a probabilistic black-box optimization algorithm that employs Joint Transformer to generate novel molecules with improved target properties, as compared to the training data, outperforming other SMILES-based optimization methods in de novo drug design.

Machine learning the interaction network in coupled dynamical systems. (arXiv:2310.03378v2 [math.DS] UPDATED)

Authors: Pawan R. Bhure, M. S. Santhanam

The study of interacting dynamical systems continues to attract research interest in various fields of science and engineering. In a collection of interacting particles, the interaction network contains information about how the various components interact with one another. Inferring the interaction network from the dynamics of the agents is a problem of long-standing interest. In this work, we employ a self-supervised neural network model to achieve two outcomes: recovering the interaction network and predicting the dynamics of individual agents. Both are inferred solely from the observed trajectory data. This work presents an application of the Neural Relational Inference model to two dynamical systems: coupled particles mediated by Hooke's law interaction and coupled phase (Kuramoto) oscillators.

Sampling via Gradient Flows in the Space of Probability Measures. (arXiv:2310.03597v2 [stat.ML] UPDATED)

Authors: Yifan Chen, Daniel Zhengyu Huang, Jiaoyang Huang, Sebastian Reich, Andrew M Stuart

Sampling a target probability distribution with an unknown normalization constant is a fundamental challenge in computational science and engineering. Recent work shows that algorithms derived by considering gradient flows in the space of probability measures open up new avenues for algorithm development. This paper makes three contributions to this sampling approach by scrutinizing the design components of such gradient flows. Any instantiation of a gradient flow for sampling needs an energy functional and a metric to determine the flow, as well as numerical approximations of the flow to derive algorithms. Our first contribution is to show that the Kullback-Leibler divergence, as an energy functional, has the unique property (among all f-divergences) that gradient flows resulting from it do not depend on the normalization constant of the target distribution. Our second contribution is to study the choice of metric from the perspective of invariance. The Fisher-Rao metric is known as the unique choice (up to scaling) that is diffeomorphism invariant. As a computationally tractable alternative, we introduce a relaxed, affine invariance property for the metrics and gradient flows. In particular, we construct various affine invariant Wasserstein and Stein gradient flows. Affine invariant gradient flows are shown to behave more favorably than their non-affine-invariant counterparts when sampling highly anisotropic distributions, in theory and by using particle methods. Our third contribution is to study, and develop efficient algorithms based on Gaussian approximations of the gradient flows; this leads to an alternative to particle methods. We establish connections between various Gaussian approximate gradient flows, discuss their relation to gradient methods arising from parametric variational inference, and study their convergence properties both theoretically and numerically.
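
The normalization-independence of the KL energy is what makes such flows implementable: its Wasserstein gradient flow is the Langevin dynamics dX = grad log pi(X) dt + sqrt(2) dW, and grad log pi is unchanged by the normalization constant. Below is a particle discretization (unadjusted Langevin) on an unnormalized, anisotropic Gaussian target; note this plain flow is not affine invariant, which is exactly the regime where the paper's affine-invariant variants help.

```python
import numpy as np

def grad_log_pi(x):
    # Unnormalized target: log pi(x) = -(x1**2 + 25 * x2**2) / 2 + const;
    # the constant never appears in the gradient.
    return -x * np.array([1.0, 25.0])

rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 2))        # particle ensemble
dt = 0.01
for _ in range(2000):                     # Euler-Maruyama / ULA steps
    x += dt * grad_log_pi(x) + np.sqrt(2 * dt) * rng.standard_normal(x.shape)
print(np.round(x.std(axis=0), 2))         # approx [1.0, 0.2]
```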

PyDCM: Custom Data Center Models with Reinforcement Learning for Sustainability. (arXiv:2310.03906v5 [cs.LG] UPDATED)

Authors: Avisek Naug, Antonio Guillen, Ricardo Luna Gutiérrez, Vineet Gundecha, Dejan Markovikj, Lekhapriya Dheeraj Kashyap, Lorenz Krause, Sahand Ghorbanpour, Sajad Mousavi, Ashwin Ramesh Babu, Soumyendu Sarkar

The increasing global emphasis on sustainability and reducing carbon emissions is pushing governments and corporations to rethink their approach to data center design and operation. Given their high energy consumption and exponentially large computational workloads, data centers are prime candidates for optimizing power consumption, especially in areas such as cooling and IT energy usage. A significant challenge in this pursuit is the lack of a configurable and scalable thermal data center model that offers an end-to-end pipeline. Data centers consist of multiple IT components whose geometric configuration and heat dissipation make thermal modeling difficult. This paper presents PyDCM, a customizable Data Center Model implemented in Python, that allows users to create unique configurations of IT equipment with custom server specifications and geometric arrangements of IT cabinets. The use of vectorized thermal calculations makes PyDCM orders of magnitude faster (30 times) than current EnergyPlus modeling implementations and scales sublinearly with the number of CPUs. Also, PyDCM enables the use of Deep Reinforcement Learning via the Gymnasium wrapper to optimize data center cooling and offers a user-friendly platform for testing various data center design prototypes.

RTDK-BO: High Dimensional Bayesian Optimization with Reinforced Transformer Deep kernels. (arXiv:2310.03912v4 [cs.LG] UPDATED)

Authors: Alexander Shmakov, Avisek Naug, Vineet Gundecha, Sahand Ghorbanpour, Ricardo Luna Gutierrez, Ashwin Ramesh Babu, Antonio Guillen, Soumyendu Sarkar

Bayesian Optimization (BO), guided by Gaussian process (GP) surrogates, has proven to be an invaluable technique for efficient, high-dimensional, black-box optimization, a critical problem inherent to many applications such as industrial design and scientific computing. Recent contributions have introduced reinforcement learning (RL) to improve the optimization performance on both single function optimization and \textit{few-shot} multi-objective optimization. However, even few-shot techniques fail to exploit similarities shared between closely related objectives. In this paper, we combine recent developments in Deep Kernel Learning (DKL) and attention-based Transformer models to improve the modeling powers of GP surrogates with meta-learning. We propose a novel method for improving meta-learning BO surrogates by incorporating attention mechanisms into DKL, empowering the surrogates to adapt to contextual information gathered during the BO process. We combine this Transformer Deep Kernel with a learned acquisition function trained with continuous Soft Actor-Critic Reinforcement Learning to aid in exploration. This Reinforced Transformer Deep Kernel (RTDK-BO) approach yields state-of-the-art results in continuous high-dimensional optimization problems.

PGraphDTA: Improving Drug Target Interaction Prediction using Protein Language Models and Contact Maps. (arXiv:2310.04017v2 [cs.LG] UPDATED)

Authors: Rakesh Bal, Yijia Xiao, Wei Wang

Developing and discovering new drugs is a complex and resource-intensive endeavor that often involves substantial costs, time investment, and safety concerns. A key aspect of drug discovery involves identifying novel drug-target (DT) interactions. Existing computational methods for predicting DT interactions have primarily focused on binary classification tasks, aiming to determine whether a DT pair interacts or not. However, protein-ligand interactions exhibit a continuum of binding strengths, known as binding affinity, presenting a persistent challenge for accurate prediction. In this study, we investigate various techniques employed in Drug Target Interaction (DTI) prediction and propose novel enhancements to improve their performance. Our approaches include the integration of Protein Language Models (PLMs) and the incorporation of Contact Map information as an inductive bias within current models. Through extensive experimentation, we demonstrate that our proposed approaches outperform the baseline models considered in this study, presenting a compelling case for further development in this direction. We anticipate that the insights gained from this work will significantly narrow the search space for potential drugs targeting specific proteins, thereby accelerating drug discovery. Code and data for PGraphDTA are available at https://github.com/Yijia-Xiao/PgraphDTA/.

Harmonic Self-Conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design. (arXiv:2310.05764v2 [cs.LG] UPDATED)

Authors: Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola

A significant amount of protein function requires binding small molecules, including enzymatic catalysis. As such, designing binding pockets for small molecules has several impactful applications ranging from drug synthesis to energy storage. Towards this goal, we first develop HarmonicFlow, an improved generative process over 3D protein-ligand binding structures based on our self-conditioned flow matching objective. FlowSite extends this flow model to jointly generate a protein pocket's discrete residue types and the molecule's binding 3D structure. We show that HarmonicFlow improves upon state-of-the-art generative processes for docking in simplicity, generality, and average sample quality in pocket-level docking. Enabled by this structure modeling, FlowSite designs binding sites substantially better than baseline approaches.

A Nonlinear Method for time series forecasting using VMD-GARCH-LSTM model. (arXiv:2310.08812v2 [stat.ME] UPDATED)

Authors: Zhengtao Gui, Haoyuan Li, Sijie Xu, Yu Chen

Time series forecasting represents a significant and challenging task across various fields. Recently, methods based on mode decomposition have dominated the forecasting of complex time series because of their advantages in capturing local characteristics and extracting intrinsic modes from data. Unfortunately, most models fail to capture the implied volatilities that contain significant information. To enhance the forecasting of rapidly evolving and volatile time series, we propose a novel decomposition-ensemble paradigm, the VMD-GARCH-LSTM model. The Variational Mode Decomposition algorithm is employed to decompose the time series into K sub-modes. Subsequently, the GARCH model extracts the volatility information from these sub-modes, which serves as the input for the LSTM. The numerical and volatility information of each sub-mode is utilized to train a Long Short-Term Memory network. This network predicts the sub-mode, and then we aggregate the predictions from all sub-modes to produce the output. By integrating econometric and artificial intelligence methods, and taking into account both the numerical and volatility information of the time series, our proposed model demonstrates superior performance in time series forecasting, as evidenced by the significant decrease in MSE, RMSE, and MAPE in our comparative experimental results.
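
The pipeline order matters: decompose first, extract per-mode volatility, then feed value-plus-volatility channels to per-mode LSTMs whose forecasts are aggregated. The sketch below assembles those LSTM inputs; it assumes the third-party vmdpy and arch packages are available, and the scaling and GARCH settings are illustrative.

```python
import numpy as np
from vmdpy import VMD          # assumed: pip install vmdpy
from arch import arch_model    # assumed: pip install arch

rng = np.random.default_rng(0)
t = np.linspace(0, 8 * np.pi, 512)
series = np.sin(t) + 0.5 * np.sin(5 * t) + 0.1 * rng.standard_normal(t.size)

# 1) VMD: decompose the series into K intrinsic sub-modes.
modes, _, _ = VMD(series, alpha=2000, tau=0.0, K=3, DC=0, init=1, tol=1e-7)

# 2) GARCH(1,1) per sub-mode: conditional volatility as an extra channel.
lstm_inputs = []
for m in modes:
    fit = arch_model(np.diff(m) * 100, p=1, q=1).fit(disp="off")
    vol = fit.conditional_volatility
    # 3) (value, volatility) pairs feed one LSTM per sub-mode; the
    #    per-mode forecasts are aggregated into the final prediction.
    lstm_inputs.append(np.column_stack([m[1:], vol]))
print([a.shape for a in lstm_inputs])
```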

Causality and Independence Enhancement for Biased Node Classification. (arXiv:2310.09586v2 [cs.LG] UPDATED)

Authors: Guoxin Chen, Yongqing Wang, Fangda Guo, Qinglang Guo, Jiangli Shao, Huawei Shen, Xueqi Cheng

Most existing methods that address out-of-distribution (OOD) generalization for node classification on graphs primarily focus on a specific type of data bias, such as label selection bias or structural bias. However, anticipating the type of bias in advance is extremely challenging, and designing models solely for one specific type may not necessarily improve overall generalization performance. Moreover, limited research has focused on the impact of mixed biases, which are more prevalent and demanding in real-world scenarios. To address these limitations, we propose a novel Causality and Independence Enhancement (CIE) framework, applicable to various graph neural networks (GNNs). Our approach estimates causal and spurious features at the node representation level and mitigates the influence of spurious correlations through the backdoor adjustment. Meanwhile, an independence constraint is introduced to improve the discriminability and stability of causal and spurious features in complex biased environments. Essentially, CIE eliminates different types of data biases from a unified perspective, without the need to design separate methods for each bias as before. To evaluate the performance under specific types of data biases, mixed biases, and low-resource scenarios, we conducted comprehensive experiments on five publicly available datasets. Experimental results demonstrate that our approach CIE not only significantly enhances the performance of GNNs but also outperforms state-of-the-art debiased node classification methods.

Feature selection and regression methods for stock price prediction using technical indicators. (arXiv:2310.09903v4 [q-fin.ST] UPDATED)

Authors: Fatemeh Moodi, Amir Jahangard-Rafsanjani, Sajad Zarifzadeh

Due to the influence of many factors, including technical indicators, on stock price prediction, feature selection is important for choosing the best indicators. This study uses technical indicators together with feature selection and regression methods to predict the closing price of the stock market. The aim of this research is to predict the stock market price with the least error. In the proposed method, data from a 3-day time window are converted into suitable inputs for the regression methods. In this paper, 10 regressors and 123 technical indicators are examined on the last 13 years of Apple stock data. The results are evaluated with 5 error-based criteria. Based on the results of the proposed method, MLPSF performs 56.47% better than MLP. SVRSF is 67.42% improved compared to SVR, and LRSF is 76.7% improved compared to LR. The RISF method also improves 72.82% over Ridge regression. The DTRSB method shows a 24.23% improvement over DTR, KNNSB a 15.52% improvement over KNN regression, RFSB a 6% improvement over RF, and GBRSF a 7% improvement over GBR. Finally, ADASF and ADASB also show a 4% improvement over the ADA regression. Ridge and LinearRegression gave the best results for stock price prediction. Based on the results, the best indicators to predict the stock price are: the Squeeze_pro, Percentage Price Oscillator, Thermo, Decay, Archer On-Balance Volume, Bollinger Bands, Squeeze and Ichimoku indicators. According to the results, the use of a suitable combination of the suggested indicators along with regression methods results in high accuracy in predicting the closing price.
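
In scikit-learn terms, the "SF"-style variants amount to a feature-selection stage in front of each regressor. A sketch with toy indicator columns (the paper computes 123 real technical indicators); the indicator choices and window assembly here are illustrative:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
close = np.cumsum(rng.normal(0, 1, 600)) + 100   # toy price series
indicators = np.column_stack([
    np.convolve(close, np.ones(5) / 5, mode="same"),    # SMA(5)
    np.convolve(close, np.ones(20) / 20, mode="same"),  # SMA(20)
    np.gradient(close),                                  # momentum proxy
    rng.normal(size=close.size),                         # noise feature
])

# 3-day window of indicators -> next-day close, mirroring the paper's setup.
W = 3
X = np.stack([indicators[i - W:i].ravel() for i in range(W, len(close) - 1)])
y = close[W + 1:]

Xtr, Xte, ytr, yte = train_test_split(X, y, shuffle=False, test_size=0.2)
model = make_pipeline(SelectKBest(f_regression, k=6), Ridge(alpha=1.0))
model.fit(Xtr, ytr)
print(mean_absolute_percentage_error(yte, model.predict(Xte)))
```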

AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents. (arXiv:2310.09971v2 [cs.LG] UPDATED)

Authors: Jake Grigsby, Linxi Fan, Yuke Zhu

We introduce AMAGO, an in-context Reinforcement Learning (RL) agent that uses sequence models to tackle the challenges of generalization, long-term memory, and meta-learning. Recent works have shown that off-policy learning can make in-context RL with recurrent policies viable. Nonetheless, these approaches require extensive tuning and limit scalability by creating key bottlenecks in agents' memory capacity, planning horizon, and model size. AMAGO revisits and redesigns the off-policy in-context approach to successfully train long-sequence Transformers over entire rollouts in parallel with end-to-end RL. Our agent is uniquely scalable and applicable to a wide range of problems. We demonstrate its strong performance empirically in meta-RL and long-term memory domains. AMAGO's focus on sparse rewards and off-policy data also allows in-context learning to extend to goal-conditioned problems with challenging exploration. When combined with a novel hindsight relabeling scheme, AMAGO can solve a previously difficult category of open-world domains, where agents complete many possible instructions in procedurally generated environments. We evaluate our agent on three goal-conditioned domains and study how its individual improvements connect to create a generalist policy.

Non-parametric Conditional Independence Testing for Mixed Continuous-Categorical Variables: A Novel Method and Numerical Evaluation. (arXiv:2310.11132v2 [cs.LG] UPDATED)

Authors: Oana-Iuliana Popescu, Andreas Gerhardus, Jakob Runge

Conditional independence testing (CIT) is a common task in machine learning, e.g., for variable selection, and a main component of constraint-based causal discovery. While most current CIT approaches assume that all variables are numerical or all variables are categorical, many real-world applications involve mixed-type datasets that include numerical and categorical variables. Non-parametric CIT can be conducted using conditional mutual information (CMI) estimators combined with a local permutation scheme. Recently, two novel CMI estimators for mixed-type datasets based on k-nearest-neighbors (k-NN) have been proposed. As with any k-NN method, these estimators rely on the definition of a distance metric. One approach computes distances by a one-hot encoding of the categorical variables, essentially treating categorical variables as discrete-numerical, while the other expresses CMI by entropy terms where the categorical variables appear as conditions only. In this work, we study these estimators and propose a variation of the former approach that does not treat categorical variables as numeric. Our numerical experiments show that our variant detects dependencies more robustly across different data distributions and preprocessing types.

VQ-NeRF: Neural Reflectance Decomposition and Editing with Vector Quantization. (arXiv:2310.11864v2 [cs.CV] UPDATED)

Authors: Hongliang Zhong, Jingbo Zhang, Jing Liao

We propose VQ-NeRF, a two-branch neural network model that incorporates Vector Quantization (VQ) to decompose and edit reflectance fields in 3D scenes. Conventional neural reflectance fields use only continuous representations to model 3D scenes, despite the fact that objects are typically composed of discrete materials in reality. This lack of discretization can result in noisy material decomposition and complicated material editing. To address these limitations, our model consists of a continuous branch and a discrete branch. The continuous branch follows the conventional pipeline to predict decomposed materials, while the discrete branch uses the VQ mechanism to quantize continuous materials into individual ones. By discretizing the materials, our model can reduce noise in the decomposition process and generate a segmentation map of discrete materials. Specific materials can be easily selected for further editing by clicking on the corresponding area of the segmentation outcomes. Additionally, we propose a dropout-based VQ codeword ranking strategy to predict the number of materials in a scene, which reduces redundancy in the material segmentation process. To improve usability, we also develop an interactive interface to further assist material editing. We evaluate our model on both computer-generated and real-world scenes, demonstrating its superior performance. To the best of our knowledge, our model is the first to enable discrete material editing in 3D scenes.
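
The discretization mechanism is the standard vector-quantization layer: snap each continuous material embedding to its nearest codebook entry, and pass gradients straight through. A minimal sketch (codebook size and dimensions are arbitrary; VQ-NeRF's full model has separate continuous and discrete branches):

```python
import torch
import torch.nn.functional as F

class VectorQuantizer(torch.nn.Module):
    """Nearest-codebook quantization with a straight-through gradient
    and the usual codebook/commitment losses (a sketch of the
    discretization step, not the full two-branch model)."""
    def __init__(self, n_codes=8, dim=16, beta=0.25):
        super().__init__()
        self.codebook = torch.nn.Parameter(torch.randn(n_codes, dim))
        self.beta = beta

    def forward(self, z):                       # z: (N, dim)
        d = torch.cdist(z, self.codebook)       # (N, n_codes)
        idx = d.argmin(dim=1)                   # discrete material id
        zq = self.codebook[idx]
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        zq = z + (zq - z).detach()              # straight-through estimator
        return zq, idx, loss

vq = VectorQuantizer()
z = torch.randn(10, 16, requires_grad=True)
zq, idx, loss = vq(z)
loss.backward()
print(idx)  # per-point discrete material assignment
```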

SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation. (arXiv:2310.12508v2 [cs.LG] UPDATED)

Authors: Chongyu Fan, Jiancheng Liu, Yihua Zhang, Dennis Wei, Eric Wong, Sijia Liu

With evolving data regulations, machine unlearning (MU) has become an important tool for fostering trust and safety in today's AI models. However, existing MU methods focusing on data and/or weight perspectives often grapple with limitations in unlearning accuracy, stability, and cross-domain applicability. To address these challenges, we introduce the concept of 'weight saliency' in MU, drawing parallels with input saliency in model explanation. This innovation directs MU's attention toward specific model weights rather than the entire model, improving effectiveness and efficiency. The resultant method that we call saliency unlearning (SalUn) narrows the performance gap with 'exact' unlearning (model retraining from scratch after removing the forgetting dataset). To the best of our knowledge, SalUn is the first principled MU approach adaptable enough to effectively erase the influence of forgetting data, classes, or concepts in both image classification and generation. For example, SalUn yields a stability advantage in high-variance random data forgetting, e.g., with a 0.2% gap compared to exact unlearning on the CIFAR-10 dataset. Moreover, in preventing conditional diffusion models from generating harmful images, SalUn achieves nearly 100% unlearning accuracy, outperforming current state-of-the-art baselines like Erased Stable Diffusion and Forget-Me-Not.
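
The saliency step can be made concrete: rank weights by the magnitude of the forgetting-loss gradient and update only the top fraction during unlearning. The sketch below builds such masks; SalUn's complete recipe (e.g., its random-label forgetting updates) is not shown.

```python
import torch

def saliency_mask(model, loss_forget, quantile=0.5):
    """Gradient-based weight saliency: mark weights whose forgetting-loss
    gradient magnitude is above the per-tensor quantile (1 = salient,
    i.e., allowed to change during unlearning)."""
    model.zero_grad()
    loss_forget.backward()
    masks = {}
    for name, p in model.named_parameters():
        g = p.grad.abs()
        masks[name] = (g >= g.flatten().quantile(quantile)).float()
    return masks

model = torch.nn.Linear(4, 3)
x, y = torch.randn(8, 4), torch.randint(0, 3, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
masks = saliency_mask(model, loss)
# During unlearning, apply `p.grad *= masks[name]` before optimizer.step().
print({k: int(v.sum()) for k, v in masks.items()})
```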

Optimizing Retrieval-augmented Reader Models via Token Elimination. (arXiv:2310.13682v2 [cs.CL] UPDATED)

Authors: Moshe Berchansky, Peter Izsak, Avi Caciularu, Ido Dagan, Moshe Wasserblat

Fusion-in-Decoder (FiD) is an effective retrieval-augmented language model applied across a variety of open-domain tasks, such as question answering, fact checking, etc. In FiD, supporting passages are first retrieved and then processed using a generative model (Reader), which can cause a significant bottleneck in decoding time, particularly with long outputs. In this work, we analyze the contribution and necessity of all the retrieved passages to the performance of reader models, and propose eliminating some of the retrieved information, at the token level, that might not contribute essential information to the answer generation process. We demonstrate that our method can reduce run-time by up to 62.2%, with only a 2% reduction in performance, and in some cases, even improve the performance results.

Contrast Everything: A Hierarchical Contrastive Framework for Medical Time-Series. (arXiv:2310.14017v4 [cs.LG] UPDATED)

Authors: Yihe Wang, Yu Han, Haishuai Wang, Xiang Zhang

Contrastive representation learning is crucial in medical time series analysis as it alleviates dependency on labor-intensive, domain-specific, and scarce expert annotations. However, existing contrastive learning methods primarily focus on one single data level, which fails to fully exploit the intricate nature of medical time series. To address this issue, we present COMET, an innovative hierarchical framework that leverages data consistencies at all inherent levels in medical time series. Our meticulously designed model systematically captures data consistency from four potential levels: observation, sample, trial, and patient levels. By developing contrastive loss at multiple levels, we can learn effective representations that preserve comprehensive data consistency, maximizing information utilization in a self-supervised manner. We conduct experiments in the challenging patient-independent setting. We compare COMET against six baselines using three diverse datasets, which include ECG signals for myocardial infarction and EEG signals for Alzheimer's and Parkinson's diseases. The results demonstrate that COMET consistently outperforms all baselines, particularly in setup with 10% and 1% labeled data fractions across all datasets. These results underscore the significant impact of our framework in advancing contrastive representation learning techniques for medical time series. The source code is available at https://github.com/DL4mHealth/COMET.

On existence, uniqueness and scalability of adversarial robustness measures for AI classifiers. (arXiv:2310.14421v2 [stat.ML] UPDATED)

Authors: Illia Horenko

Simply-verifiable mathematical conditions for the existence, uniqueness and explicit analytical computation of minimal adversarial paths (MAP) and minimal adversarial distances (MAD) for (locally) uniquely-invertible classifiers, for generalized linear models (GLM), and for entropic AI (EAI) are formulated and proven. Practical computation of MAP and MAD, their comparison and interpretations for various classes of AI tools (neuronal networks, boosted random forests, GLM and EAI) are demonstrated on common synthetic benchmarks (a double Swiss roll spiral and its extensions) as well as on two biomedical data problems (health insurance claim prediction and heart attack lethality classification). On the biomedical applications, it is demonstrated how MAP provides unique, minimal, patient-specific risk-mitigating interventions in predefined subsets of accessible control variables.
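
For the GLM case the closed forms are elementary. For a linear classifier sign(w.x + b), MAD is the point's Euclidean distance to the decision hyperplane and MAP ends at its orthogonal projection onto it; a sketch:

```python
import numpy as np

def mad_map_linear(w, b, x):
    """Closed-form minimal adversarial distance (MAD) and path endpoint
    (MAP) for a linear classifier sign(w.x + b): distance to the
    hyperplane and the orthogonal projection onto it."""
    margin = w @ x + b
    mad = abs(margin) / np.linalg.norm(w)
    x_map = x - (margin / (w @ w)) * w       # nearest boundary point
    return mad, x_map

w, b = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 1.0])
mad, x_map = mad_map_linear(w, b, x)
print(mad, w @ x_map + b)   # second value ~ 0: x_map lies on the boundary
```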

Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand. (arXiv:2310.14942v2 [cs.CV] UPDATED)

Authors: Junfeng Guo, Yiming Li, Lixu Wang, Shu-Tao Xia, Heng Huang, Cong Liu, Bo Li

The prosperity of deep neural networks (DNNs) has largely benefited from open-source datasets, based on which users can evaluate and improve their methods. In this paper, we revisit backdoor-based dataset ownership verification (DOV), which is currently the only feasible approach to protect the copyright of open-source datasets. We reveal that these methods are fundamentally harmful given that they could introduce malicious misclassification behaviors to watermarked DNNs by the adversaries. In this work, we instead design DOV from another perspective by making watermarked models (trained on the protected dataset) correctly classify some `hard' samples that will be misclassified by the benign model. Our method is inspired by the generalization property of DNNs, where we find a \emph{hardly-generalized domain} for the original dataset (as its \emph{domain watermark}). It can be easily learned with the protected dataset containing modified samples. Specifically, we formulate the domain generation as a bi-level optimization and propose to optimize a set of visually-indistinguishable clean-label modified data with similar effects to domain-watermarked samples from the hardly-generalized domain to ensure watermark stealthiness. We also design a hypothesis-test-guided ownership verification via our domain watermark and provide the theoretical analyses of our method. Extensive experiments on three benchmark datasets are conducted, which verify the effectiveness of our method and its resistance to potential adaptive methods. The code for reproducing main experiments is available at \url{https://github.com/JunfengGo/Domain-Watermark}.

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI. (arXiv:2310.16787v3 [cs.CL] UPDATED)

Authors: Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker

The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: www.dataprovenance.org.

Disentangled Representation Learning with Large Language Models for Text-Attributed Graphs. (arXiv:2310.18152v2 [cs.CL] UPDATED)

Authors: Yijian Qin, Xin Wang, Ziwei Zhang, Wenwu Zhu

Text-attributed graphs (TAGs) are prevalent on the web, and research over TAGs such as citation networks, e-commerce networks and social networks has attracted considerable attention in the web community. Recently, large language models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks. However, existing works harness the potential of LLMs by relying solely on prompts to convey graph structure information to them, and thus suffer from an insufficient understanding of the complex structural relationships within TAGs. To address this problem, in this paper we present the Disentangled Graph-Text Learner (DGTL) model, which is able to enhance the reasoning and predicting capabilities of LLMs for TAGs. Our proposed DGTL model incorporates graph structure information through tailored disentangled graph neural network (GNN) layers, enabling LLMs to capture the intricate relationships hidden in text-attributed graphs from multiple structural factors. Furthermore, DGTL operates with frozen pre-trained LLMs, reducing computational costs and allowing much more flexibility in combining with different LLM models. Experimental evaluations demonstrate the effectiveness of the proposed DGTL model on achieving superior or comparable performance over state-of-the-art baselines. Additionally, we also demonstrate that our DGTL model can offer natural language explanations for predictions, thereby significantly enhancing model interpretability.

ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models. (arXiv:2310.18208v2 [cs.CL] UPDATED)

Authors: Benjamin Feuer, Yurong Liu, Chinmay Hegde, Juliana Freire

Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes a new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark. Our code is available at https://github.com/penfever/ArcheType.

Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation. (arXiv:2310.18919v2 [cs.LG] UPDATED)

Authors: Nikki Lijing Kuang, Ming Yin, Mengdi Wang, Yu-Xiang Wang, Yi-An Ma

Recent studies in reinforcement learning (RL) have made significant progress by leveraging function approximation to alleviate the sample complexity hurdle for better performance. Despite the success, existing provably efficient algorithms typically rely on the accessibility of immediate feedback upon taking actions. The failure to account for the impact of delay in observations can significantly degrade the performance of real-world systems due to the regret blow-up. In this work, we tackle the challenge of delayed feedback in RL with linear function approximation by employing posterior sampling, which has been shown to empirically outperform the popular UCB algorithms in a wide range of regimes. We first introduce Delayed-PSVI, an optimistic value-based algorithm that effectively explores the value function space via noise perturbation with posterior sampling. We provide the first analysis for posterior sampling algorithms with delayed feedback in RL and show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case regret in the presence of unknown stochastic delays. Here $E[\tau]$ is the expected delay. To further improve its computational efficiency and to expand its applicability in high-dimensional RL problems, we incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI, which maintains the same order-optimal regret guarantee with $\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to demonstrate the statistical and computational efficacy of our algorithms.

Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery. (arXiv:2310.19776v2 [cs.CV] UPDATED)

Authors: Sarah Rastegar, Hazel Doughty, Cees G. M. Snoek

In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. Our code is available at https://github.com/SarahRastegar/InfoSieve.

Early detection of inflammatory arthritis to improve referrals using multimodal machine learning from blood testing, semi-structured and unstructured patient records. (arXiv:2310.19967v2 [cs.LG] UPDATED)

Authors: Bing Wang, Weizi Li, Anthony Bradlow, Antoni T.Y. Chan, Eghosa Bazuaye

Early detection of inflammatory arthritis (IA) is critical to efficient and accurate hospital referral triage for timely treatment and preventing the deterioration of the IA disease course, especially under limited healthcare resources. The manual assessment process is the most common approach in practice for the early detection of IA, but it is extremely labor-intensive and inefficient. A large amount of clinical information needs to be assessed for every referral from General Practice (GP) to the hospitals. Machine learning shows great potential in automating repetitive assessment tasks and providing decision support for the early detection of IA. However, most machine learning-based methods for IA detection rely on blood testing results, and in practice blood testing data is not always available at the point of referral; methods that leverage multimodal data, such as semi-structured and unstructured records, are therefore needed for the early detection of IA. In this research, we present fusion and ensemble learning-based methods that use multimodal data to assist decision-making in the early detection of IA, and a conformal prediction-based method to quantify the uncertainty of the predictions and detect any unreliable ones. To the best of our knowledge, our study is the first attempt to utilize multimodal data to support the early detection of IA from GP referrals.
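
The uncertainty-quantification step admits a compact generic sketch: split conformal prediction calibrates a score threshold so that prediction sets cover the true label with probability at least 1 - alpha, and overly large sets flag unreliable referral predictions. This is a generic construction, not necessarily the paper's exact conformal variant.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for classification: calibrate a
    nonconformity threshold on held-out data, then return boolean
    prediction sets guaranteed to cover with prob. >= 1 - alpha."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)
    return test_probs >= 1.0 - q                          # boolean sets

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(2), size=200)   # toy calibration probs
cal_labels = rng.integers(0, 2, size=200)
test_probs = rng.dirichlet(np.ones(2), size=5)
print(conformal_sets(cal_probs, cal_labels, test_probs))
```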

Efficient Robust Bayesian Optimization for Arbitrary Uncertain Inputs. (arXiv:2310.20145v2 [cs.LG] UPDATED)

Authors: Lin Yang, Junlong Lyu, Wenlong Lyu, Zhitang Chen

Bayesian Optimization (BO) is a sample-efficient optimization algorithm widely employed across various applications. In some challenging BO tasks, input uncertainty arises due to the inevitable randomness in the optimization process, such as machining errors, execution noise, or contextual variability. This uncertainty causes the input to deviate from the intended value before evaluation, resulting in significant performance fluctuations in the final result. In this paper, we introduce a novel robust Bayesian Optimization algorithm, AIRBO, which can effectively identify a robust optimum that performs consistently well under arbitrary input uncertainty. Our method directly models the uncertain inputs of arbitrary distributions by empowering the Gaussian Process with the Maximum Mean Discrepancy (MMD) and further accelerates the posterior inference via Nyström approximation. A rigorous theoretical regret bound is established under the MMD estimation error, and extensive experiments on synthetic functions and real problems demonstrate that our approach can handle various input uncertainties and achieve state-of-the-art performance.

Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning. (arXiv:2310.20587v2 [cs.LG] UPDATED)

Authors: Ruizhe Shi, Yuyao Liu, Yanjie Ze, Simon S. Du, Huazhe Xu

Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets. In real-world scenarios, data collection could be costly and risky; therefore, offline RL becomes particularly challenging when the in-domain data is limited. Given recent advances in Large Language Models (LLMs) and their few-shot learning prowess, this paper introduces $\textbf{La}$nguage Models for $\textbf{Mo}$tion Control ($\textbf{LaMo}$), a general framework based on Decision Transformers to effectively use pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) Initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to combine the pre-trained knowledge from LMs and in-domain knowledge effectively, (3) using the non-linear MLP transformation instead of linear projections, to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original abilities on languages. Empirical results indicate $\textbf{LaMo}$ achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and decision transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples. Our project website is $\href{https://lamo2023.github.io}{\text{this https URL}}$.
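
Components (1)-(3) map directly onto standard tooling. The sketch below wires a GPT-2 backbone with LoRA adapters and a non-linear state embedding; the target modules, LoRA rank, and the 17-dimensional state are illustrative assumptions, and the auxiliary language loss (4) is omitted.

```python
import torch.nn as nn
from transformers import GPT2LMHeadModel
from peft import LoraConfig, get_peft_model

# (1) Sequentially pre-trained LM as the Decision Transformer backbone.
base = GPT2LMHeadModel.from_pretrained("gpt2")
n_embd = base.config.n_embd

# (2) LoRA adapters instead of full-weight fine-tuning.
lm = get_peft_model(base, LoraConfig(r=8, lora_alpha=16,
                                     target_modules=["c_attn"],
                                     lora_dropout=0.05))
lm.print_trainable_parameters()  # only the low-rank adapters train

# (3) Non-linear MLP embedding in place of the usual linear projection
#     (state_dim=17 is a hypothetical environment dimension).
state_dim = 17
embed_state = nn.Sequential(nn.Linear(state_dim, 2 * n_embd), nn.GELU(),
                            nn.Linear(2 * n_embd, n_embd))
```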

Crop Disease Classification using Support Vector Machines with Green Chromatic Coordinate (GCC) and Attention based feature extraction for IoT based Smart Agricultural Applications. (arXiv:2311.00429v2 [eess.IV] UPDATED)

Authors: Shashwat Jha, Vishvaditya Luhach, Gauri Shanker Gupta, Beependra Singh

Crops hold paramount significance as the primary provider of energy, nutrition, and medicinal benefits for the human population. Plant diseases, however, can damage leaves during agricultural cultivation, resulting in significant losses in crop output and economic value. It is therefore crucial for farmers to identify crop diseases, yet manual identification frequently demands hard work, extensive planning, and in-depth familiarity with plant pathogens. Given these obstacles, it is essential to provide solutions that interface easily with mobile and IoT devices so that farmers can ensure the best possible crop development. Various machine learning (ML) and deep learning (DL) algorithms have been developed and studied for plant disease detection, yielding substantial and promising results. This article presents a novel classification method that builds on prior work by utilising attention-based feature extraction, RGB channel-based chromatic analysis, and Support Vector Machines (SVM) for improved performance, with the ability to integrate with mobile applications and IoT devices after quantization. Several disease classification algorithms were compared with the proposed model: Vision Transformer-based feature extraction combined with an additional Green Chromatic Coordinate feature and SVM classification (GCCViT-SVM) achieved an accuracy of 99.69%, and 97.41% after quantization for IoT device integration, while reducing the model size by almost 4x. Our findings have profound implications because they can transform how farmers identify crop illnesses, providing precise and fast information, thereby preserving agricultural output and ensuring food security.
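
The Green Chromatic Coordinate itself is the standard chromatic-coordinate formula g = G / (R + G + B); a minimal numpy sketch follows. How the paper fuses this feature with the ViT embeddings is not detailed in the abstract, so the interface here is our own.

```python
import numpy as np

def green_chromatic_coordinate(img):
    """Per-pixel GCC for an (H, W, 3) RGB image: g = G / (R + G + B).

    Diseased leaf regions typically lower the green fraction, so GCC is a
    cheap, illumination-robust feature to append to learned ViT features.
    """
    img = img.astype(np.float64)
    denom = img.sum(axis=-1) + 1e-8  # avoid division by zero on black pixels
    return img[..., 1] / denom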

TRIALSCOPE: A Unifying Causal Framework for Scaling Real-World Evidence Generation with Biomedical Language Models. (arXiv:2311.01301v2 [cs.LG] UPDATED)

Authors: Javier González, Cliff Wong, Zelalem Gero, Jass Bagga, Risa Ueno, Isabel Chien, Eduard Oravkin, Emre Kiciman, Aditya Nori, Roshanthi Weerasinghe, Rom S. Leidner, Brian Piening, Tristan Naumann, Carlo Bifulco, Hoifung Poon

The rapid digitization of real-world data offers an unprecedented opportunity for optimizing healthcare delivery and accelerating biomedical discovery. In practice, however, such data is most abundantly available in unstructured forms, such as clinical notes in electronic medical records (EMRs), and it is generally plagued by confounders. In this paper, we present TRIALSCOPE, a unifying framework for distilling real-world evidence from population-level observational data. TRIALSCOPE leverages biomedical language models to structure clinical text at scale, employs advanced probabilistic modeling for denoising and imputation, and incorporates state-of-the-art causal inference techniques to combat common confounders. Using clinical trial specifications as a generic representation, TRIALSCOPE provides a turnkey solution to generate and reason about clinical hypotheses using observational data. In extensive experiments and analyses on a large-scale real-world dataset with over one million cancer patients from a large US healthcare network, we show that TRIALSCOPE can produce high-quality structuring of real-world data and generate results comparable to marquee cancer trials. In addition to facilitating in silico clinical trial design and optimization, TRIALSCOPE may be used to empower synthetic controls, pragmatic trials, and post-market surveillance, as well as support fine-grained patient-like-me reasoning in precision diagnosis and treatment.
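
TRIALSCOPE's causal-inference machinery is not specified in this abstract; purely as an illustration of the basic confounder-adjustment idea, here is a generic inverse-propensity-weighting sketch using scikit-learn. The function and its interface are our own assumptions, not part of TRIALSCOPE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, treatment, outcome):
    """Estimate the average treatment effect from observational data.

    X: (n, d) confounder matrix; treatment: (n,) binary 0/1 indicator;
    outcome: (n,) observed outcomes.
    """
    # Propensity: probability of receiving treatment given confounders X.
    e = LogisticRegression(max_iter=1000).fit(X, treatment).predict_proba(X)[:, 1]
    e = np.clip(e, 0.01, 0.99)  # clip to stabilize extreme weights
    t = treatment.astype(float)
    # Reweight each arm by the inverse of its propensity and take the difference.
    return np.mean(t * outcome / e - (1 - t) * outcome / (1 - e))
```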

Vision-Language Foundation Models as Effective Robot Imitators. (arXiv:2311.01378v2 [cs.RO] UPDATED)

Authors: Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, Tao Kong

Recent progress in vision-language foundation models has shown their ability to understand multimodal data and resolve complicated vision-language tasks, including robotic manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is only slightly fine-tuned by imitation learning on language-conditioned manipulation datasets. Such a decomposition gives RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding state-of-the-art performance by a large margin on the tested benchmark, we show that RoboFlamingo can be an effective and competitive alternative for adapting VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. We believe RoboFlamingo has the potential to be a cost-effective and easy-to-use solution for robotic manipulation, empowering everyone with the ability to fine-tune their own robotics policy.
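
As a sketch of the "explicit policy head" idea, the PyTorch module below integrates per-step VLM features with an LSTM so the VLM itself remains single-step. The dimensions and architecture are illustrative assumptions, not RoboFlamingo's exact head.

```python
import torch
import torch.nn as nn

class RecurrentPolicyHead(nn.Module):
    """LSTM policy head over per-step VLM embeddings (hypothetical sizes)."""

    def __init__(self, feat_dim=1024, hidden_dim=512, action_dim=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.action = nn.Linear(hidden_dim, action_dim)  # e.g. 6-DoF pose + gripper

    def forward(self, vlm_feats, state=None):
        # vlm_feats: (batch, time, feat_dim) single-step VLM embeddings;
        # the LSTM carries the sequential history, so the VLM need not.
        h, state = self.lstm(vlm_feats, state)
        return self.action(h), state
```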

CDGraph: Dual Conditional Social Graph Synthesizing via Diffusion Model. (arXiv:2311.01729v2 [cs.SI] UPDATED)

Authors: Jui-Yi Tsai, Ya-Wen Teng, Ho Chiok Yew, De-Nian Yang, Lydia Y. Chen

Social graphs synthesized by generative models are increasingly in demand due to data scarcity and concerns over user privacy. One of the key performance criteria for generating social networks is fidelity to specified conditions, such as users with certain membership and financial status. While recent diffusion models have shown remarkable performance in generating images, their effectiveness in synthesizing graphs has not yet been explored in the context of conditional social graphs. In this paper, we propose the first conditional diffusion model for social networks, CDGraph, which trains on and synthesizes graphs based on two specified conditions. We propose a co-evolution dependency in the denoising process of CDGraph to capture the mutual dependencies between the dual conditions, and further incorporate social homophily and social contagion to preserve connectivity between nodes while satisfying the specified conditions. Moreover, we introduce a novel classifier loss, which guides the training of the diffusion process through the mutual dependency of the dual conditions. We evaluate CDGraph against four existing graph generative methods, i.e., SPECTRE, GSM, EDGE, and DiGress, on four datasets. Our results show that the graphs generated by CDGraph achieve much higher dual-conditional validity and lower discrepancy in various social network metrics than the baselines, demonstrating its proficiency in generating dual-conditional social graphs.
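
The classifier-guided denoising idea can be sketched generically as follows. This is a schematic single step, not CDGraph's co-evolution dependency: it assumes, hypothetically, a denoiser that predicts the reverse-process mean and two condition classifiers that emit logits over condition values, and a faithful implementation would also scale the gradient by the step's variance.

```python
import torch

def guided_denoise_step(x, t, denoiser, clf1, clf2, c1, c2, scale=1.0):
    """One schematic classifier-guided denoising step under dual conditions."""
    x = x.detach().requires_grad_(True)
    # Joint log-probability of both specified conditions given the noisy sample.
    logp = (clf1(x, t).log_softmax(-1)[..., c1]
            + clf2(x, t).log_softmax(-1)[..., c2])
    grad = torch.autograd.grad(logp.sum(), x)[0]
    # Nudge the denoised estimate toward samples satisfying both conditions.
    return denoiser(x, t) + scale * grad
```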

Differentially Private Federated Clustering over Non-IID Data. (arXiv:2301.00955v3 [cs.DC] CROSS LISTED)

Authors: Yiwei Li, Shuai Wang, Chong-Yung Chi, Tony Q. S. Quek

In this paper, we investigate the federated clustering (FedC) problem, which aims to accurately partition unlabeled data samples distributed over massive clients into a finite number of clusters under the orchestration of a parameter server while preserving data privacy. Although FedC is an NP-hard optimization problem involving real variables denoting cluster centroids and binary variables denoting the cluster membership of each data sample, we judiciously reformulate it as a non-convex optimization problem with only one convex constraint, thereby yielding a soft clustering solution. We then propose a novel FedC algorithm using the differential privacy (DP) technique, referred to as DP-FedC, which also accounts for partial client participation and multiple local model updating steps. Furthermore, various properties of the proposed DP-FedC are obtained through theoretical analyses of privacy protection and convergence rate, especially for the case of non-identically and independently distributed (non-i.i.d.) data, which serve as guidelines for the design of DP-FedC. Finally, experimental results on two real datasets demonstrate the efficacy of the proposed DP-FedC, its superior performance over state-of-the-art FedC algorithms, and its consistency with all the presented analytical results.
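
As a rough sketch of one DP-FedC-style communication round, each client below perturbs its local cluster statistics with Gaussian noise before the server aggregates them. Clipping, the privacy accounting, and the paper's soft-membership relaxation are all simplified away, and every name here is our own.

```python
import numpy as np

def client_update(data, centroids, sigma):
    """One client's noisy local statistics for (k, d) centroids."""
    # Hard assignment here for brevity; DP-FedC uses a soft relaxation.
    assign = np.argmin(((data[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    k, d = centroids.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        sums[j] = data[assign == j].sum(axis=0)
        counts[j] = (assign == j).sum()
    # Gaussian mechanism: perturb statistics before sharing with the server.
    return (sums + np.random.normal(0, sigma, sums.shape),
            counts + np.random.normal(0, sigma, k))

def server_aggregate(updates):
    """Recompute centroids from the clients' noisy statistics."""
    sums = sum(u[0] for u in updates)
    counts = sum(u[1] for u in updates)
    return sums / np.maximum(counts, 1e-8)[:, None]
```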