Temporal Embeddings: Scalable Self-Supervised Temporal Representation Learning from Spatiotemporal Data for Multimodal Computer Vision. (arXiv:2401.08581v1 [cs.CV])

Authors: Yi Cao, Swetava Ganguli, Vipul Pandey

There exists a correlation between geospatial activity temporal patterns and type of land use. A novel self-supervised approach is proposed to stratify the landscape based on mobility activity time series. First, the time series signal is transformed to the frequency domain and then compressed into task-agnostic temporal embeddings by a contractive autoencoder, which preserves the cyclic temporal patterns observed in the time series. The pixel-wise embeddings are converted to image-like channels that can be used for task-based, multimodal modeling of downstream geospatial tasks using deep semantic segmentation. Experiments show that temporal embeddings are semantically meaningful representations of time series data and are effective across different tasks, such as classifying residential and commercial areas. Temporal embeddings transform sequential, spatiotemporal motion trajectory data into semantically meaningful image-like tensor representations that can be combined (multimodal fusion) with other data modalities that are, or can be transformed into, image-like tensor representations (e.g., RGB imagery, graph embeddings of road networks, passively collected imagery like SAR) to facilitate multimodal learning in geospatial computer vision. Multimodal computer vision is critical for training machine learning models for geospatial feature detection that keep a geospatial mapping service up-to-date in real time, and can significantly improve user experience and, above all, user safety.
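
The pipeline the abstract outlines (per-pixel activity series, frequency-domain transform, contractive autoencoder, embedding channels) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the weekly hourly series, layer sizes, and contraction penalty weight are all assumptions, and the Jacobian penalty uses a cheap gradient surrogate.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical per-pixel activity series: one week at hourly resolution.
ts = np.random.rand(168)
spec = np.abs(np.fft.rfft(ts))  # 85 frequency magnitudes; cyclic patterns show up as peaks

class ContractiveAE(nn.Module):
    def __init__(self, n_freq=85, d_embed=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_freq, 64), nn.Sigmoid(),
                                 nn.Linear(64, d_embed))
        self.dec = nn.Sequential(nn.Linear(d_embed, 64), nn.ReLU(),
                                 nn.Linear(64, n_freq))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def contractive_loss(model, x, lam=1e-4):
    x = x.clone().requires_grad_(True)
    z, x_hat = model(x)
    recon = ((x_hat - x) ** 2).mean()
    # Cheap surrogate for the encoder-Jacobian penalty that makes the
    # autoencoder "contractive": gradient of the summed code w.r.t. the input.
    jac = torch.autograd.grad(z.sum(), x, create_graph=True)[0]
    return recon + lam * (jac ** 2).sum(dim=1).mean()

model = ContractiveAE()
x = torch.tensor(spec, dtype=torch.float32).unsqueeze(0)
loss = contractive_loss(model, x)  # the codes z become per-pixel embedding channels
```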

Nahid: AI-based Algorithm for operating fully-automatic surgery. (arXiv:2401.08584v1 [cs.CV])

Authors: Sina Saadati

In this paper, for the first time, a method is presented that can provide fully automated surgery based on software and computer vision techniques. The advantages and challenges of computerizing medical surgery are then examined. Finally, surgery for isolated ovarian endometriosis is examined and, based on the presented method, a more detailed algorithm is given that is capable of automatically diagnosing and treating this disease during surgery, as a proof of the proposed method, in which a U-Net is trained to detect endometriosis during surgery.

Learning with Chemical versus Electrical Synapses -- Does it Make a Difference?. (arXiv:2401.08602v1 [cs.NE])

Authors: Mónika Farsang, Mathias Lechner, David Lung, Ramin Hasani, Daniela Rus, Radu Grosu

Bio-inspired neural networks have the potential to advance our understanding of neural computation and improve the state-of-the-art of AI systems. Bio-electrical synapses directly transmit neural signals by enabling fast current flow between neurons. In contrast, bio-chemical synapses transmit neural signals indirectly, through neurotransmitters. Prior work showed that interpretable dynamics for complex robotic control can be achieved by using chemical synapses within a sparse, bio-inspired architecture called Neural Circuit Policies (NCPs). However, a comparison of these two synaptic models within the same architecture remains an unexplored area. In this work we aim to determine the impact of using chemical synapses compared to electrical synapses, in both sparse and all-to-all connected networks. We conduct experiments with autonomous lane-keeping in a photorealistic autonomous driving simulator to evaluate their performance under diverse conditions and in the presence of noise. The experiments highlight the substantial influence of both the architectural and the synaptic-model choice. Our results show that employing chemical synapses yields noticeable improvements compared to electrical synapses, and that NCPs lead to better results under both synaptic models.
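
As a point of reference for the two synapse models, here is a minimal sketch of how they are commonly formulated in the liquid-time-constant/NCP literature: an electrical (gap-junction) synapse passes current proportional to the potential difference, while a chemical synapse gates a reversal-potential current through a sigmoid of the presynaptic state. All constants are illustrative, and the paper's exact parameterization may differ.

```python
import numpy as np

def electrical_synapse_current(v_pre, v_post, g=0.5):
    # Gap junction: direct, bidirectional current proportional to the
    # potential difference between the two neurons.
    return g * (v_pre - v_post)

def chemical_synapse_current(v_pre, v_post, g=0.5, e_rev=1.0, mu=0.3, sigma=5.0):
    # Neurotransmitter release modeled as a sigmoidal gate on the
    # presynaptic potential, driving v_post toward a reversal potential.
    gate = 1.0 / (1.0 + np.exp(-sigma * (v_pre - mu)))
    return g * gate * (e_rev - v_post)

def step(v_post, v_pre, current_fn, v_leak=0.0, g_leak=1.0, dt=0.01):
    # One Euler step of a leaky neuron under either synapse model.
    dv = -g_leak * (v_post - v_leak) + current_fn(v_pre, v_post)
    return v_post + dt * dv
```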

Representation Learning in a Decomposed Encoder Design for Bio-inspired Hebbian Learning. (arXiv:2401.08603v1 [cs.NE])

Authors: Achref Jaziri, Sina Ditzel, Iuliia Pliushch, Visvanathan Ramesh

Modern data-driven machine learning system designs exploit inductive biases in architectural structure, invariance and equivariance requirements, task-specific loss functions, and computational optimization tools. Previous works have illustrated that human-specified quasi-invariant filters in the early layers of the encoder can serve as a powerful inductive bias to attain better robustness and transparency in learned classifiers. This paper explores this further in the context of representation learning with local plasticity rules, i.e., bio-inspired Hebbian learning. We propose a modular framework trained with a bio-inspired variant of contrastive predictive coding (Hinge CLAPP loss). Our framework is composed of parallel encoders, each leveraging a different invariant visual descriptor as an inductive bias. We evaluate the representation learning capacity of our system in a classification scenario on image data of various difficulties (GTSRB, STL10, CODEBRIM) as well as video data (UCF101). Our findings indicate that this form of inductive bias can be beneficial in closing the gap between models with local plasticity rules and backpropagation models, as well as in learning more robust representations in general.

MATE-Pred: Multimodal Attention-based TCR-Epitope interaction Predictor. (arXiv:2401.08619v1 [cs.LG])

Authors: Etienne Goffinet, Raghvendra Mall, Ankita Singh, Rahul Kaushik, Filippo Castiglione

An accurate binding affinity prediction between T-cell receptors and epitopes contributes decisively to the development of successful immunotherapy strategies. Some state-of-the-art computational methods implement deep learning techniques by integrating evolutionary features to convert the amino acid residues of cell receptors and epitope sequences into numerical values, while other methods employ pre-trained language models to summarize the embedding vectors at the amino acid residue level to obtain sequence-wise representations.

Here, we propose a highly reliable novel method, MATE-Pred, that performs multi-modal attention-based prediction of T-cell receptor and epitope binding affinity. MATE-Pred is compared and benchmarked against other deep learning models that leverage multi-modal representations of T-cell receptors and epitopes. In the proposed method, the textual representation of proteins is embedded with a pre-trained bi-directional encoder model and combined with two additional modalities: a) a comprehensive set of selected physicochemical properties; b) predicted contact maps that estimate the 3D distances between amino acid residues in the sequences.

MATE-Pred demonstrates the potential of multi-modal models in achieving state-of-the-art performance (+8.4% MCC, +5.5% AUC compared to baselines) and in efficiently capturing contextual, physicochemical, and structural information from amino acid residues. The performance of MATE-Pred suggests its potential application in various drug discovery regimes.
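
A minimal sketch of the kind of three-modality attention fusion the abstract describes is given below. The module sizes, the single fused token per modality, and the mean pooling are illustrative assumptions; the actual MATE-Pred architecture operates on residue-level sequences and is necessarily richer.

```python
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    """Sketch: fuse a language-model embedding, a physicochemical feature
    vector, and a pooled contact-map feature into one binding-affinity score."""
    def __init__(self, d_lm=768, d_phys=30, d_map=64, d=128, n_heads=4):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d_lm, d),
                                   nn.Linear(d_phys, d),
                                   nn.Linear(d_map, d)])
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.head = nn.Linear(d, 1)

    def forward(self, lm_emb, phys, cmap_feat):
        # One token per modality; attention lets the modalities mix.
        tokens = torch.stack([p(x) for p, x in
                              zip(self.proj, (lm_emb, phys, cmap_feat))], dim=1)
        mixed, _ = self.attn(tokens, tokens, tokens)
        return self.head(mixed.mean(dim=1)).squeeze(-1)  # affinity logit
```

In practice both the TCR and the epitope sequence would be encoded this way and their representations combined before the final prediction head.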

Wake-Sleep Consolidated Learning. (arXiv:2401.08623v1 [cs.NE])

Authors: Amelia Sorrenti, Giovanni Bellitto, Federica Proietto Salanitri, Matteo Pennisi, Simone Palazzo, Concetto Spampinato

We propose Wake-Sleep Consolidated Learning (WSCL), a learning strategy leveraging Complementary Learning System theory and the wake-sleep phases of the human brain to improve the performance of deep neural networks for visual classification tasks in continual learning settings. Our method learns continually via the synchronization between distinct wake and sleep phases. During the wake phase, the model is exposed to sensory input and adapts its representations, ensuring stability through a dynamic parameter freezing mechanism and storing episodic memories in a short-term temporary memory (similarly to what happens in the hippocampus). During the sleep phase, the training process is split into NREM and REM stages. In the NREM stage, the model's synaptic weights are consolidated using replayed samples from the short-term and long-term memory and the synaptic plasticity mechanism is activated, strengthening important connections and weakening unimportant ones. In the REM stage, the model is exposed to previously-unseen realistic visual sensory experience, and the dreaming process is activated, which enables the model to explore the potential feature space, thus preparing synapses for future knowledge. We evaluate the effectiveness of our approach on three benchmark datasets: CIFAR-10, Tiny-ImageNet and FG-ImageNet. In all cases, our method outperforms the baselines and prior work, yielding a significant performance gain on continual visual classification tasks. Furthermore, we demonstrate the usefulness of all processing stages and the importance of dreaming to enable positive forward transfer.
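
Sketched as a training loop, one epoch of the wake/NREM/REM cycle described above might look like the following. Every interface here (the memory buffers, the loss functions, and the consolidation hook) is a hypothetical stand-in, not WSCL's actual API.

```python
def wake_sleep_epoch(model, task_loader, short_mem, long_mem, dream_loader, opt):
    # Wake: adapt to the current task, storing episodic memories
    # in a hippocampus-like short-term buffer.
    for x, y in task_loader:
        loss = model.loss(x, y)                    # hypothetical task loss
        opt.zero_grad(); loss.backward(); opt.step()
        short_mem.store(x, y)

    # NREM: consolidate by replaying short- and long-term memories,
    # then strengthen important connections and weaken unimportant ones.
    for x, y in list(short_mem.replay()) + list(long_mem.replay()):
        loss = model.loss(x, y)
        opt.zero_grad(); loss.backward(); opt.step()
    model.consolidate_synapses()                   # hypothetical plasticity hook

    # REM: "dream" on unseen realistic inputs to pre-shape the feature space.
    for x_dream in dream_loader:
        loss = model.exploration_loss(x_dream)     # hypothetical dreaming objective
        opt.zero_grad(); loss.backward(); opt.step()
    long_mem.absorb(short_mem)                     # promote episodes to long-term store
```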

Predicting and Interpreting Energy Barriers of Metallic Glasses with Graph Neural Networks. (arXiv:2401.08627v1 [cond-mat.dis-nn])

Authors: Haoyu Li, Shichang Zhang, Longwen Tang, Mathieu Bauchy, Yizhou Sun

Metallic Glasses (MGs) are widely used disordered materials. Understanding the relationship between the local structure and physical properties of MGs is one of the greatest challenges for both material science and condensed matter physics. In this work, we utilize Graph Neural Networks (GNNs) to model the atomic graph structure and study the connection between the structure and the corresponding local energy barrier, which is believed to govern many critical physical properties in MGs. One of our key contributions is to propose a novel Symmetrized GNN (SymGNN) model for predicting the energy barriers, which is invariant under orthogonal transformations of the structure, e.g., rotations and reflections. Such invariance is a desired property that standard GNNs like Graph Convolutional Networks cannot capture. SymGNNs handle the invariance by aggregating over orthogonal transformations of the graph structure for representation learning, and an optimal distribution over all 3D orthogonal transformations $\mathcal{O}_3$ is learned to maximize the benefit of invariance. We demonstrate in our experiments that SymGNN can significantly improve the energy barrier prediction over other GNNs and non-graph machine learning models. With such an accurate model, we also apply graph explanation algorithms to better reveal the structure-property relationship of MGs. Our GNN framework allows effective prediction of material physical properties and bolsters material science research through the use of AI models.
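
The core symmetrization idea, averaging the network's output over orthogonal transformations of the atomic coordinates, can be sketched as follows. For simplicity this samples rotations uniformly from SO(3), whereas SymGNN learns a distribution over all of O(3) (including reflections); the gnn call signature here is hypothetical.

```python
import torch
from scipy.stats import special_ortho_group

def symmetrized_embedding(gnn, node_pos, node_feat, edge_index, n_samples=8):
    # Average the GNN output over randomly sampled 3D rotations so the
    # prediction cannot depend on the frame the glass structure is given in.
    outs = []
    for _ in range(n_samples):
        R = torch.tensor(special_ortho_group.rvs(3), dtype=node_pos.dtype)
        outs.append(gnn(node_pos @ R.T, node_feat, edge_index))
    return torch.stack(outs).mean(dim=0)
```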

Synergizing Quality-Diversity with Descriptor-Conditioned Reinforcement Learning. (arXiv:2401.08632v1 [cs.NE])

Authors: Maxence Faldor, Félix Chalumeau, Manon Flageat, Antoine Cully

A fundamental trait of intelligence involves finding novel and creative solutions to address a given challenge or to adapt to unforeseen situations. Reflecting this, Quality-Diversity optimization is a family of Evolutionary Algorithms that generates collections of both diverse and high-performing solutions. Among these, MAP-Elites is a prominent example that has been successfully applied to a variety of domains, including evolutionary robotics. However, MAP-Elites performs a divergent search with random mutations originating from Genetic Algorithms and is thus limited to evolving populations of low-dimensional solutions. PGA-MAP-Elites overcomes this limitation using a gradient-based variation operator inspired by deep reinforcement learning, which enables the evolution of large neural networks. Although high-performing in many environments, PGA-MAP-Elites fails on several tasks where the convergent search of the gradient-based variation operator hinders diversity. In this work, we present three contributions: (1) we enhance the Policy Gradient variation operator with a descriptor-conditioned critic that reconciles diversity search with gradient-based methods; (2) we leverage actor-critic training to learn a descriptor-conditioned policy at no additional cost, distilling the knowledge of the population into one single versatile policy that can execute a diversity of behaviors; (3) we exploit the descriptor-conditioned actor by injecting it into the population, despite network architecture differences. Our method, DCG-MAP-Elites, achieves equal or higher QD score and coverage compared to all baselines on seven challenging continuous control locomotion tasks.
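
For readers unfamiliar with the underlying algorithm, a bare-bones MAP-Elites loop looks like this; the three function arguments are task-supplied placeholders.

```python
import random

def map_elites(init_fn, mutate_fn, evaluate_fn, n_init=100, n_iters=10_000):
    """Minimal MAP-Elites. evaluate_fn returns (fitness, descriptor_cell)."""
    archive = {}  # descriptor cell -> (fitness, solution)

    def try_insert(s):
        f, cell = evaluate_fn(s)
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, s)

    for _ in range(n_init):                 # random initialization
        try_insert(init_fn())
    for _ in range(n_iters):
        parent = random.choice(list(archive.values()))[1]
        # PGA-/DCG-MAP-Elites replace this random mutation with
        # policy-gradient (and descriptor-conditioned) variation.
        try_insert(mutate_fn(parent))
    return archive
```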

Collaborative Inference via Dynamic Composition of Tiny AI Accelerators on MCUs. (arXiv:2401.08637v1 [cs.DC])

Authors: Taesik Gong, Si Young Jang, Utku Günay Acer, Fahim Kawsar, Chulhong Min

The advent of tiny AI accelerators opens opportunities for deep neural network deployment at the extreme edge, offering reduced latency, lower power cost, and improved privacy in on-device ML inference. Despite these advancements, challenges persist due to inherent limitations of these accelerators, such as restricted onboard memory and a single-device focus. This paper introduces Synergy, a system that dynamically composes tiny AI accelerators for multi-tenant models, effectively addressing critical tinyML challenges amid the increasing demand for on-device AI. A key feature of Synergy is its virtual computing space, which provides a unified, virtualized view of resources and enables efficient task mapping to physical devices. Synergy's runtime orchestration module ensures optimal inference across dynamic and heterogeneous accelerators. Our evaluations with 7 baselines and 8 models demonstrate that Synergy improves throughput by an average of 8.0X compared to baselines.

One-Step Diffusion Distillation via Deep Equilibrium Models. (arXiv:2401.08639v1 [cs.CV])

Authors: Zhengyang Geng, Ashwini Pokle, J. Zico Kolter

Diffusion models excel at producing high-quality samples but naively require hundreds of iterations, prompting multiple attempts to distill the generation process into a faster network. However, many existing approaches suffer from a variety of challenges: the process for distillation training can be complex, often requiring multiple training stages, and the resulting models perform poorly when utilized in single-step generative applications. In this paper, we introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image. Of particular importance to our approach is to leverage a new Deep Equilibrium (DEQ) model as the distilled architecture: the Generative Equilibrium Transformer (GET). Our method enables fully offline training with just noise/image pairs from the diffusion model while achieving superior performance compared to existing one-step methods on comparable training budgets. We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores while striking a critical balance of computational cost and image quality. Code, checkpoints, and datasets are available.
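
The training signal itself is simple to sketch: regress the teacher's generated image directly from the initial noise that produced it, fully offline. The plain MSE loss here is an assumption, and the distilled GET architecture is abstracted behind `student`.

```python
import torch
import torch.nn.functional as F

def distill_offline(student, pairs_loader, opt):
    """Offline one-step distillation: pairs_loader yields (noise, image)
    pairs pre-generated by running the full diffusion sampler once."""
    for eps, img in pairs_loader:
        pred = student(eps)               # a single forward pass replaces
        loss = F.mse_loss(pred, img)      # hundreds of denoising iterations
        opt.zero_grad(); loss.backward(); opt.step()
```

Because the pairs are precomputed, the teacher is never queried during training, which is what makes the procedure fully offline.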

Deep Pulse-Coupled Neural Networks. (arXiv:2401.08649v1 [cs.NE])

Authors: Zexiang Yi, Jing Lian, Yunliang Qi, Zhaofei Yu, Huajin Tang, Yide Ma, Jizhao Liu

Spiking Neural Networks (SNNs) capture the information processing mechanism of the brain by taking advantage of spiking neurons, such as the Leaky Integrate-and-Fire (LIF) model neuron, which incorporates temporal dynamics and transmits information via discrete and asynchronous spikes. However, the simplified biological properties of LIF ignore the neuronal coupling and dendritic structure of real neurons, which limits the spatio-temporal dynamics of neurons and thus reduces the expressive power of the resulting SNNs. In this work, we leverage a more biologically plausible neural model with complex dynamics, i.e., a pulse-coupled neural network (PCNN), to improve the expressiveness and recognition performance of SNNs for vision tasks. The PCNN is a type of cortical model capable of emulating the complex neuronal activities in the primary visual cortex. We construct deep pulse-coupled neural networks (DPCNNs) by replacing commonly used LIF neurons in SNNs with PCNN neurons. The intra-coupling in existing PCNN models restricts coupling to neurons within the same channel. To address this limitation, we propose inter-channel coupling, which allows neurons in different feature maps to interact with each other. Experimental results show that inter-channel coupling can efficiently boost performance with fewer neurons and synapses and less training time than widening the networks. For instance, compared to the LIF-based SNN with wide VGG9, DPCNN with VGG9 uses only 50%, 53%, and 73% of the neurons, synapses, and training time, respectively. Furthermore, we propose receptive field and time dependent batch normalization (RFTD-BN) to speed up convergence and improve the performance of DPCNNs.
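
For reference, one iteration of the classical PCNN neuron (the model family DPCNNs build on) is sketched below. The decay and gain constants are illustrative; DPCNN's discretization and the proposed inter-channel coupling extend this single-channel form.

```python
import numpy as np
from scipy.ndimage import convolve

def pcnn_step(S, F, L, E, Y, W, beta=0.2, aF=0.1, aL=0.3, aE=0.5,
              VF=0.5, VL=0.2, VE=20.0):
    """One iteration of the classical PCNN. S: stimulus (image); F/L: feeding
    and linking inputs; E: dynamic threshold; Y: binary spike map."""
    F = np.exp(-aF) * F + VF * convolve(Y, W) + S
    L = np.exp(-aL) * L + VL * convolve(Y, W)
    U = F * (1.0 + beta * L)           # linking modulates feeding (the coupling)
    Y = (U > E).astype(float)          # fire where internal activity beats the threshold
    E = np.exp(-aE) * E + VE * Y       # threshold jumps after firing, then decays
    return F, L, E, Y

# Example 3x3 linking kernel (no self-coupling):
W = np.array([[0.5, 1.0, 0.5],
              [1.0, 0.0, 1.0],
              [0.5, 1.0, 0.5]])
```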

SAiD: Speech-driven Blendshape Facial Animation with Diffusion. (arXiv:2401.08655v1 [cs.CV])

Authors: Inkyu Park, Jaewoong Cho

Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets, despite extensive research. Most prior works, typically focused on learning regression models on a small dataset using the method of least squares, encounter difficulties generating diverse lip movements from speech and require substantial effort in refining the generated outputs. To address these issues, we propose speech-driven 3D facial animation with a diffusion model (SAiD), a lightweight Transformer-based U-Net with a cross-modality alignment bias between the audio and visual modalities to enhance lip synchronization. Moreover, we introduce BlendVOCA, a benchmark dataset of pairs of speech audio and parameters of a blendshape facial model, to address the scarcity of public resources. Our experimental results demonstrate that the proposed approach achieves comparable or superior performance in lip synchronization to baselines, ensures more diverse lip movements, and streamlines the animation editing process.

Risk-anticipatory autonomous driving strategies considering vehicles' weights, based on hierarchical deep reinforcement learning. (arXiv:2401.08661v1 [cs.RO])

Authors: Di Chen, Hao Li, Zhicheng Jin, Huizhao Tu

Autonomous vehicles (AVs) have the potential to prevent accidents caused by driver error and to reduce road traffic risks. Because collisions involving heavy vehicles cause more serious crashes, the weights of vehicles need to be considered when making driving strategies aimed at reducing potential risks and their consequences in the context of autonomous driving. This study develops an autonomous driving strategy based on risk anticipation, considering the weights of surrounding vehicles and using hierarchical deep reinforcement learning. A risk indicator integrating surrounding vehicles' weights, based on risk field theory, is proposed and incorporated into autonomous driving decisions. A hybrid action space is designed to allow for left lane changes, right lane changes, and car-following, which enables AVs to act more freely and realistically whenever possible. To solve this hybrid decision-making problem, a hierarchical proximal policy optimization (HPPO) algorithm is developed and an attention mechanism is incorporated, providing great advantages in maintaining stable performance. A new indicator, potential collision energy in conflicts (PCEC), is proposed to evaluate the performance of the developed AV driving strategy from the perspectives of both the likelihood and the consequences of potential accidents. An application is carried out, and the simulation results demonstrate that our model provides driving strategies that reduce both the likelihood and the consequences of potential accidents while maintaining driving efficiency. The developed method is especially meaningful for AVs driving on highways, where heavy vehicles make up a high proportion of the traffic.

An Integrated Imitation and Reinforcement Learning Methodology for Robust Agile Aircraft Control with Limited Pilot Demonstration Data. (arXiv:2401.08663v1 [cs.AI])

Authors: Gulay Goktas Sever, Umut Demir, Abdullah Sadik Satir, Mustafa Cagatay Sahin, Nazim Kemal Ure

In this paper, we present a methodology for constructing data-driven maneuver generation models for agile aircraft that can generalize across a wide range of trim conditions and aircraft model parameters. Maneuver generation models play a crucial role in the testing and evaluation of aircraft prototypes, providing insights into the maneuverability and agility of the aircraft. However, constructing the models typically requires extensive amounts of real pilot data, which can be time-consuming and costly to obtain. Moreover, models built with limited data often struggle to generalize beyond the specific flight conditions covered in the original dataset. To address these challenges, we propose a hybrid architecture that leverages a simulation model, referred to as the source model. This open-source agile aircraft simulator shares similar dynamics with the target aircraft and allows us to generate unlimited data for building a proxy maneuver generation model. We then fine-tune this model to the target aircraft using a limited amount of real pilot data. Our approach combines techniques from imitation learning, transfer learning, and reinforcement learning to achieve this objective. To validate our methodology, we utilize real agile pilot data provided by Turkish Aerospace Industries (TAI). By employing the F-16 as the source model, we demonstrate that it is possible to construct a maneuver generation model that generalizes across various trim conditions and aircraft parameters without requiring any additional real pilot data. Our results showcase the effectiveness of our approach in developing robust and adaptable models for agile aircraft.

Data-Driven Physics-Informed Neural Networks: A Digital Twin Perspective. (arXiv:2401.08667v1 [physics.flu-dyn])

Authors: Sunwoong Yang, Hojin Kim, Yoonpyo Hong, Kwanjung Yee, Romit Maulik, Namwoo Kang

This study explores the potential of physics-informed neural networks (PINNs) for the realization of digital twins (DT) from various perspectives. First, various adaptive sampling approaches for collocation points are investigated to verify their effectiveness in the mesh-free framework of PINNs, which allows automated construction of a virtual representation without manual mesh generation. Then, the overall performance of the data-driven PINN (DD-PINN) framework, which can utilize the acquired datasets in DT scenarios, is examined. Its scalability to more general physics is validated within parametric Navier-Stokes equations, where PINNs do not need to be retrained as the Reynolds number varies. In addition, since datasets in practice are often collected at different fidelities and sparsities, multi-fidelity DD-PINNs are also proposed and evaluated. They show remarkable prediction performance even in extrapolation tasks, with $42\sim62\%$ improvement over the single-fidelity approach. Finally, the uncertainty quantification performance of multi-fidelity DD-PINNs is investigated by the ensemble method to verify their potential in DT, where an accurate measure of predictive uncertainty is critical. The DD-PINN frameworks explored in this study are found to be more suitable for DT scenarios than traditional PINNs from the above perspectives, bringing engineers one step closer to seamless DT realization.
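
To make the data-driven PINN idea concrete, here is a minimal sketch combining a data-fit term with a physics residual, using the 1D heat equation as a toy PDE rather than the parametric Navier-Stokes equations studied in the paper; the network size and (implicit unit) loss weighting are illustrative.

```python
import torch
import torch.nn as nn

# Small MLP u(x, t); input columns are (x, t).
net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))

def dd_pinn_loss(xt_data, u_data, xt_colloc, alpha=0.1):
    # Data term: fit the measurements acquired in the DT scenario.
    data_loss = ((net(xt_data) - u_data) ** 2).mean()

    # Physics term: residual of u_t = alpha * u_xx at collocation points.
    xt = xt_colloc.clone().requires_grad_(True)
    u = net(xt)
    du = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = du[:, 0:1], du[:, 1:2]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0:1]
    return data_loss + ((u_t - alpha * u_xx) ** 2).mean()
```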

Deep Reinforcement Learning for Multi-Truck Vehicle Routing Problems with Multi-Leg Demand Routes. (arXiv:2401.08669v1 [cs.LG])

Authors: Joshua Levin, Randall Correll, Takanori Ide, Takafumi Suzuki, Takaho Saito, Alan Arai

Deep reinforcement learning (RL) has been shown to be effective in producing approximate solutions to some vehicle routing problems (VRPs), especially when using policies generated by encoder-decoder attention mechanisms. While these techniques have been quite successful for relatively simple problem instances, there are still under-researched and highly complex VRP variants for which no effective RL method has been demonstrated. In this work we focus on one such VRP variant, which contains multiple trucks and multi-leg routing requirements. In these problems, demand is required to move along sequences of nodes, instead of just from a start node to an end node. With the goal of making deep RL a viable strategy for real-world industrial-scale supply chain logistics, we develop new extensions to existing encoder-decoder attention models which allow them to handle multiple trucks and multi-leg routing requirements. Our models have the advantage that they can be trained for a small number of trucks and nodes, and then embedded into a large supply chain to yield solutions for larger numbers of trucks and nodes. We test our approach on a real supply chain environment arising in the operations of Japanese automotive parts manufacturer Aisin Corporation, and find that our algorithm outperforms Aisin's previous best solution.

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. (arXiv:2401.08671v1 [cs.PF])

Authors: Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He

The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user scenarios from interactive sessions to long-running applications. We present a detailed benchmarking methodology, analyze the performance through latency-throughput curves, and investigate scalability via load balancing. Our evaluations demonstrate substantial improvements in throughput and latency across various models and hardware configurations. We discuss our roadmap for future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is readily available for community engagement and contribution.
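
The core of Dynamic SplitFuse, as described, is to decompose long prompts and compose them with ongoing generation into fixed-size forward passes. A toy scheduler conveying the idea might look like the following; the real DeepSpeed-FastGen scheduler additionally manages KV-cache state, fairness, and much more.

```python
def split_fuse_schedule(pending_prompts, decoding_seqs, token_budget=2048):
    """Compose one forward pass under a fixed token budget: all in-flight
    decode tokens first (one each), then chunks of long pending prompts."""
    batch = [(seq_id, [tok]) for seq_id, tok in decoding_seqs]
    budget = token_budget - len(decoding_seqs)
    for seq_id, prompt_tokens in pending_prompts:
        if budget <= 0:
            break
        chunk = prompt_tokens[:budget]   # long prompts are split across passes
        batch.append((seq_id, chunk))
        budget -= len(chunk)
    return batch  # near-constant workload -> stable latency, high occupancy
```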

Concept Alignment. (arXiv:2401.08672v1 [cs.LG])

Authors: Sunayana Rane, Polyphony J. Bruna, Ilia Sucholutsky, Christopher Kello, Thomas L. Griffiths

Discussion of AI alignment (alignment between humans and AI systems) has focused on value alignment, broadly referring to creating AI systems that share human values. We argue that before we can even attempt to align values, it is imperative that AI systems and humans align the concepts they use to understand the world. We integrate ideas from philosophy, cognitive science, and deep learning to explain the need for concept alignment, not just value alignment, between humans and machines. We summarize existing accounts of how humans and machines currently learn concepts, and we outline opportunities and challenges in the path towards shared concepts. Finally, we explain how we can leverage the tools already being developed in cognitive science and AI research to accelerate progress towards concept alignment.

Zero-Shot RTL Code Generation with Attention Sink Augmented Large Language Models. (arXiv:2401.08683v1 [cs.AR])

Authors: Selim Sandal, Ismail Akturk

The design and optimization of hardware have traditionally been resource-intensive, demanding considerable expertise and dependence on established design automation tools. This paper discusses the possibility of exploiting large language models to streamline the code generation process in hardware design. In contrast to earlier studies, this paper aims to use large language models that accepts high-level design specifications through a single prompt to generate corresponding Register-Transfer Level (RTL) code. The ability to use large language models on RTL code generation not only expedites design iteration cycles but also facilitates the exploration of design spaces that have computational challenges for conventional techniques. Through our evaluation, we demonstrate the shortcoming of existing attention mechanisms, and present the abilities of language models to produce functional, optimized, and industry-standard compliant RTL code when a novel attention mechanism is used. These findings underscore the expanding role of large language models in shaping the future landscape of architectural exploration and automation in hardware design.

A Physics-informed machine learning model for time-dependent wave runup prediction. (arXiv:2401.08684v1 [physics.flu-dyn])

Authors: Saeed Saviz Naeini, Reda Snaiki

Wave runup is a critical factor affecting coastal flooding, shoreline changes, and damage to coastal structures. Climate change is also expected to amplify wave runup's impact on coastal areas. Therefore, fast and accurate wave runup estimation is essential for effective coastal engineering design and management. However, predicting the time-dependent wave runup is challenging due to the intrinsic nonlinearities and non-stationarity of the process, even with the use of the most advanced machine learning techniques. In this study, a physics-informed machine learning-based approach is proposed to efficiently and accurately simulate time-series wave runup. The methodology combines the computational efficiency of the Surfbeat (XBSB) mode with the accuracy of the nonhydrostatic (XBNH) mode of the XBeach model. Specifically, a conditional generative adversarial network (cGAN) is used to map the image representation of wave runup from XBSB to the corresponding image from XBNH. These images are generated by first converting wave runup signals into time-frequency scalograms and then transforming them into image representations. The cGAN model achieves improved performance in image-to-image mapping tasks by incorporating physics-based knowledge from XBSB. After training the model, the high-fidelity XBNH-based scalograms can be predicted, which are then employed to reconstruct the time-series wave runup using the inverse wavelet transform. The simulation results underscore the efficiency and robustness of the proposed model in predicting wave runup, suggesting its potential value for applications in risk assessment and management.
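
The signal-to-image step described above, converting a runup time series into a time-frequency scalogram suitable for image-to-image translation, can be sketched with a continuous wavelet transform; the Morlet wavelet and scale range here are illustrative choices, not necessarily the authors'.

```python
import numpy as np
import pywt

def signal_to_scalogram(runup, scales=np.arange(1, 65), wavelet="morl"):
    """Continuous wavelet transform -> time-frequency magnitude image,
    normalized to [0, 1] so it can be fed to an image-to-image cGAN."""
    coeffs, _ = pywt.cwt(runup, scales, wavelet)
    img = np.abs(coeffs)
    return (img - img.min()) / (img.max() - img.min() + 1e-8)

# The inverse direction (predicted scalogram -> time series) uses an
# inverse wavelet transform, as the abstract notes; it is not shown here.
```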

NODI: Out-Of-Distribution Detection with Noise from Diffusion. (arXiv:2401.08689v1 [cs.CV])

Authors: Jingqiu Zhou, Aojun Zou, Hongshen Li

Out-of-distribution (OOD) detection is a crucial part of deploying machine learning models safely. It has been extensively studied, with a plethora of methods developed in the literature. The problem is typically tackled by computing an OOD score, yet previous methods compute OOD scores with limited usage of the in-distribution dataset; for instance, the scores are computed with information from only a small portion of the in-distribution data. Furthermore, these methods encode images with a neural image encoder, and their robustness is rarely checked with respect to image encoders of different training methods and architectures. In this work, we introduce the diffusion process into the OOD task. The diffusion model integrates information from the whole training set into its predicted noise vectors. Moreover, we deduce a closed-form solution for the noise vector (stable point). The noise vector is then converted into our OOD score, and we test both the deep-model-predicted noise vector and the closed-form noise vector on the OpenOOD benchmarks. Our method outperforms previous OOD methods across all types of image encoders, with a 3.5% performance gain achieved with the MAE-based image encoder. Moreover, we study the robustness of OOD methods by applying different types of image encoders. While some OOD methods fail to generalize well when switching image encoders from ResNets to Vision Transformers, our method exhibits good robustness with all the image encoders.

Contrastive Learning with Negative Sampling Correction. (arXiv:2401.08690v1 [cs.LG])

Authors: Lu Wang, Chao Du, Pu Zhao, Chuan Luo, Zhangchi Zhu, Bo Qiao, Wei Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

As one of the most effective self-supervised representation learning methods, contrastive learning (CL) relies on multiple negative pairs to contrast against each positive pair. In the standard practice of contrastive learning, data augmentation methods are utilized to generate both positive and negative pairs. While existing works have focused on improving positive sampling, the negative sampling process is often overlooked. In fact, the generated negative samples are often polluted by positive samples, which leads to a biased loss and performance degradation. To correct this negative sampling bias, we propose a novel contrastive learning method named Positive-Unlabeled Contrastive Learning (PUCL). PUCL treats the generated negative samples as unlabeled samples and uses information from positive samples to correct the bias in the contrastive loss. We prove that the corrected loss used in PUCL incurs only a negligible bias compared to the unbiased contrastive loss. PUCL can be applied to general contrastive learning problems and outperforms state-of-the-art methods on various image and graph classification tasks. The code of PUCL is in the supplementary file.
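
As a hedged illustration of what a positive-unlabeled correction to a contrastive loss can look like, here is the debiased InfoNCE estimator of Chuang et al. (2020), which likewise treats sampled negatives as unlabeled and subtracts the expected positive contamination; PUCL's actual corrected loss may differ in its details. `tau_plus` is the assumed prior of positives among the unlabeled samples.

```python
import math
import torch

def debiased_infonce(z_anchor, z_pos, z_negs, tau_plus=0.1, t=0.5):
    # Similarities scaled by temperature t; embeddings assumed L2-normalized.
    pos = torch.exp((z_anchor * z_pos).sum(dim=-1) / t)   # (B,)
    neg = torch.exp(z_anchor @ z_negs.T / t)              # (B, N)
    n = z_negs.shape[0]
    # PU correction: remove the expected contribution of positives hiding
    # among the "unlabeled" negatives, then clamp to the theoretical minimum.
    ng = (neg.sum(dim=-1) - n * tau_plus * pos) / (1.0 - tau_plus)
    ng = torch.clamp(ng, min=n * math.exp(-1.0 / t))
    return -torch.log(pos / (pos + ng)).mean()
```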

Towards Responsible AI in Banking: Addressing Bias for Fair Decision-Making. (arXiv:2401.08691v1 [stat.ML])

Authors: Alessandro Castelnovo

In an era characterized by the pervasive integration of artificial intelligence into decision-making processes across diverse industries, the demand for trust has never been more pronounced. This thesis embarks on a comprehensive exploration of bias and fairness, with a particular emphasis on their ramifications within the banking sector, where AI-driven decisions bear substantial societal consequences. In this context, the seamless integration of fairness, explainability, and human oversight is of utmost importance, culminating in the establishment of what is commonly referred to as "Responsible AI". This emphasizes the critical nature of addressing biases within the development of a corporate culture that aligns seamlessly with both AI regulations and universal human rights standards, particularly in the realm of automated decision-making systems. Nowadays, embedding ethical principles into the development, training, and deployment of AI models is crucial for compliance with forthcoming European regulations and for promoting societal good. This thesis is structured around three fundamental pillars: understanding bias, mitigating bias, and accounting for bias. These contributions are validated through their practical application in real-world scenarios, in collaboration with Intesa Sanpaolo. This collaborative effort not only contributes to our understanding of fairness but also provides practical tools for the responsible implementation of AI-based decision-making systems. In line with open-source principles, we have released Bias On Demand and FairView as accessible Python packages, further promoting progress in the field of AI fairness.

Hierarchical Source-to-Post-Route QoR Prediction in High-Level Synthesis with GNNs. (arXiv:2401.08696v1 [cs.AR])

Authors: Mingzhe Gao, Jieru Zhao, Zhe Lin, Minyi Guo

High-level synthesis (HLS) notably speeds up the hardware design process by avoiding RTL programming. However, the turnaround time of HLS increases significantly when post-route quality of results (QoR) is considered during optimization. To tackle this issue, we propose a hierarchical post-route QoR prediction approach for FPGA HLS, which features: (1) a modeling flow that directly estimates latency and post-route resource usage from C/C++ programs; (2) a graph construction method that effectively represents the control and data flow graph of source code and the effects of HLS pragmas; and (3) a hierarchical GNN training and prediction method capable of capturing the impact of loop hierarchies. Experimental results show that our method achieves a prediction error of less than 10% for different types of QoR metrics, a tremendous improvement over state-of-the-art GNN methods. By adopting our proposed methodology, the runtime for design space exploration in HLS is shortened to tens of minutes and the achieved ADRS is reduced to 6.91% on average.

Computationally Efficient Optimisation of Elbow-Type Draft Tube Using Neural Network Surrogates. (arXiv:2401.08700v1 [math.OC])

Authors: Ante Sikirica, Ivana Lučin, Marta Alvir, Lado Kranjčević, Zoran Čarija

This study aims to provide a comprehensive assessment of single-objective and multi-objective optimisation algorithms for the design of an elbow-type draft tube, as well as to introduce a computationally efficient optimisation workflow. The proposed workflow leverages deep neural network surrogates trained on data obtained from numerical simulations. The use of surrogates allows for a more flexible and faster evaluation of novel designs. The success history-based adaptive differential evolution with linear reduction and the multi-objective evolutionary algorithm based on decomposition were identified as the best-performing algorithms and used to determine the influence of different objectives in the single-objective optimisation and their combined impact on the draft tube design in the multi-objective optimisation. The results for the single-objective algorithm are consistent with those of the multi-objective algorithm when the objectives are considered separately. The multi-objective approach, however, should typically be chosen, especially for computationally inexpensive surrogates. A multi-criteria decision analysis method was used to obtain optimal multi-objective results, showing an improvement of 1.5% and 17% for the pressure recovery factor and drag coefficient, respectively. The difference between the predictions and the numerical results is less than 0.5% for the pressure recovery factor and 3% for the drag coefficient. As the demand for renewable energy continues to increase, the relevance of data-driven optimisation workflows, as discussed in this study, will become increasingly important, especially in the context of global sustainability efforts.

Do We Really Even Need Data?. (arXiv:2401.08702v1 [stat.ME])

Authors: Kentaro Hoffman, Stephen Salerno, Awan Afiaz, Jeffrey T. Leek, Tyler H. McCormick

As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g. rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as outcome variables. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to this so-called "post-prediction inference" problem and elucidate three potential sources of error: (i) the relationship between predicted outcomes and their true, unobserved counterparts, (ii) robustness of the machine learning model to resampling or uncertainty about the training data, and (iii) appropriately propagating not just bias but also uncertainty from predictions into the ultimate inference procedure. We also contrast the framework for post-prediction inference with classical work spanning several related fields, including survey sampling, missing data, and semi-supervised learning. This contrast elucidates the role of design in both classical and modern inference problems.

Decoupled Prototype Learning for Reliable Test-Time Adaptation. (arXiv:2401.08703v1 [cs.LG])

Authors: Guowei Wang, Changxing Ding, Wentao Tan, Mingkui Tan

Test-time adaptation (TTA) is a task that continually adapts a pre-trained source model to the target domain during inference. One popular approach involves fine-tuning the model with a cross-entropy loss according to estimated pseudo-labels. However, its performance is significantly affected by noisy pseudo-labels. This study reveals that minimizing the classification error of each sample causes the cross-entropy loss's vulnerability to label noise. To address this issue, we propose a novel Decoupled Prototype Learning (DPL) method that features prototype-centric loss computation. First, we decouple the optimization of class prototypes. For each class prototype, we reduce its distance to positive samples and enlarge its distance to negative samples in a contrastive manner. This strategy prevents the model from overfitting to noisy pseudo-labels. Second, we propose a memory-based strategy to enhance DPL's robustness for the small batch sizes often encountered in TTA. We update each class's pseudo-feature from a memory in a momentum manner and insert an additional DPL loss. Finally, we introduce a consistency regularization-based approach to leverage samples with unconfident pseudo-labels. This approach transfers the feature styles of samples with unconfident pseudo-labels to those with confident pseudo-labels, creating more reliable samples for TTA. The experimental results demonstrate that our methods achieve state-of-the-art performance on domain generalization benchmarks and reliably improve the performance of self-training-based methods on image corruption benchmarks. The code will be released.

Survival Analysis of Young Triple-Negative Breast Cancer Patients. (arXiv:2401.08712v1 [q-bio.QM])

Authors: M. Mehdi Owrang O, Fariba Jafari Horestani, Ginger Schwarz

Breast cancer prognosis is crucial for effective treatment. The disease is more common in women over 40 years old and rare under 40, an age group that accounts for less than 5 percent of cases in the U.S. Studies indicate a worse prognosis in younger women, which varies by ethnicity. Breast cancers are classified based on receptors such as estrogen, progesterone, and HER2. Triple-negative breast cancer (TNBC), lacking these receptors, accounts for about 15 percent of cases and is more prevalent in younger patients, often resulting in poorer outcomes. Nevertheless, the impact of age on TNBC prognosis remains unclear. Factors like age, race, tumor grade, size, and lymph node status have been studied for their role in TNBC's clinical outcomes, but current research is inconclusive about age-related differences. This study uses the SEER dataset to examine the influence of younger age on survivability in TNBC patients, aiming to determine if age is a significant prognostic factor. Our experimental results on the SEER dataset confirm existing research reports that TNBC patients have a worse prognosis compared to non-TNBC patients. Our main goal was to investigate whether younger age has any significance for the survivability of TNBC patients; the experimental results do not show that younger age has any significant effect on the prognosis and survival rate of TNBC patients.

Selecting Subsets of Source Data for Transfer Learning with Applications in Metal Additive Manufacturing. (arXiv:2401.08715v1 [cs.LG])

Authors: Yifan Tang, M. Rahmani Dehaghani, Pouyan Sajadi, G. Gary Wang

Considering data insufficiency in metal additive manufacturing (AM), transfer learning (TL) has been adopted to extract knowledge from source domains (e.g., completed printings) to improve the modeling performance in target domains (e.g., new printings). Current applications use all accessible source data directly in TL with no regard to the similarity between source and target data. This paper proposes a systematic method to find appropriate subsets of source data based on similarities between the source and target datasets for a given set of limited target domain data. Such similarity is characterized by the spatial and model distance metrics. A Pareto frontier-based source data selection method is developed, where the source data located on the Pareto frontier defined by two similarity distance metrics are selected iteratively. The method is integrated into an instance-based TL method (decision tree regression model) and a model-based TL method (fine-tuned artificial neural network). Both models are then tested on several regression tasks in metal AM. Comparison results demonstrate that 1) the source data selection method is general and supports integration with various TL methods and distance metrics, 2) compared with using all source data, the proposed method can find a small subset of source data from the same domain with better TL performance in metal AM regression tasks involving different processes and machines, and 3) when multiple source domains exist, the source data selection method could find the subset from one source domain to obtain comparable or better TL performance than the model constructed using data from all source domains.
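
A minimal sketch of the frontier-peeling selection described above follows. The two distance arrays stand in for the paper's spatial and model distance metrics, whose exact definitions are not reproduced here.

```python
import numpy as np

def pareto_frontier(spatial_dist, model_dist):
    """Indices of source samples not dominated in (spatial, model) distance:
    no other sample is at least as close on both metrics and closer on one."""
    d = np.stack([spatial_dist, model_dist], axis=1)
    keep = []
    for i in range(len(d)):
        dominated = np.any(np.all(d <= d[i], axis=1) & np.any(d < d[i], axis=1))
        if not dominated:
            keep.append(i)
    return keep

def select_source_subset(spatial_dist, model_dist, n_select):
    """Iteratively peel Pareto frontiers until enough source data is selected."""
    remaining = np.arange(len(spatial_dist))
    chosen = []
    while len(chosen) < n_select and len(remaining) > 0:
        front = pareto_frontier(spatial_dist[remaining], model_dist[remaining])
        chosen.extend(remaining[front].tolist())
        remaining = np.delete(remaining, front)
    return chosen[:n_select]
```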

Investigating Fouling Efficiency in Football Using Expected Booking (xB) Model. (arXiv:2401.08718v1 [cs.LG])

Authors: Adnan Azmat, Su Su Yi

This paper introduces the Expected Booking (xB) model, a novel metric designed to estimate the likelihood of a foul resulting in a yellow card in football. Through three iterative experiments employing ensemble methods, the model demonstrates improved performance with additional features and an expanded dataset. Analysis of FIFA World Cup 2022 data validates the model's efficacy in providing insights into team and player fouling tactics, aligning with actual defensive performance. The xB model addresses a gap in the examination of fouling efficiency, emphasizing defensive strategies, which are often overlooked. Further enhancements are suggested through the incorporation of comprehensive data and spatial features.

HierSFL: Local Differential Privacy-aided Split Federated Learning in Mobile Edge Computing. (arXiv:2401.08723v1 [cs.CR])

Authors: Minh K. Quan, Dinh C. Nguyen, Van-Dinh Nguyen, Mayuri Wijayasundara, Sujeeva Setunge, Pubudu N. Pathirana

Federated Learning is a promising approach for learning from user data while preserving data privacy. However, the high requirements of the model training process make it difficult for clients with limited memory or bandwidth to participate. To tackle this problem, Split Federated Learning is utilized, where clients upload their intermediate model training outcomes to a cloud server for collaborative server-client model training. This methodology facilitates resource-constrained clients' participation in model training but also increases the training time and communication overhead. To overcome these limitations, we propose a novel algorithm, called Hierarchical Split Federated Learning (HierSFL), that amalgamates models at the edge and cloud phases, presenting qualitative directives for determining the best aggregation timeframes to reduce computation and communication expenses. By implementing local differential privacy at the client and edge server levels, we enhance privacy during local model parameter updates. Our experiments using CIFAR-10 and MNIST datasets show that HierSFL outperforms standard FL approaches with better training accuracy, training time, and communication-computing trade-offs. HierSFL offers a promising solution to mobile edge computing's challenges, ultimately leading to faster content delivery and improved mobile service quality.

MA2GCN: Multi Adjacency relationship Attention Graph Convolutional Networks for Traffic Prediction using Trajectory data. (arXiv:2401.08727v1 [cs.LG])

Authors: Zhengke Sun, Yuliang Ma

The problem of traffic congestion not only causes a large amount of economic losses, but also seriously endangers the urban environment, so predicting traffic congestion has important practical significance. So far, most studies have predicted future traffic flow and speed based on historical data from sensors placed on different roads, analyzing the congestion conditions of a given road segment. However, due to the fixed positions of sensors, it is difficult to mine new information. Vehicle trajectory data, on the other hand, is more flexible and can extract traffic information as needed. We therefore propose a new traffic congestion prediction model, Multi Adjacency relationship Attention Graph Convolutional Networks (MA2GCN). The model transforms vehicle trajectory data into graph-structured data in grid form and introduces a vehicle entry-and-exit matrix based on mobility between different grids. To improve performance, we also build a new adaptive adjacency matrix generation method and an adjacency matrix attention module. The model mainly uses gated temporal convolution and graph convolution to extract temporal and spatial information, respectively. Compared with multiple baselines, our model achieved the best performance on a Shanghai taxi GPS trajectory dataset. The code is available at https://github.com/zachysun/Taxi Traffic Benchmark.

Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information. (arXiv:2401.08732v1 [cs.LG])

Authors: Linfeng Ye, Shayan Mohajer Hamidi, Renhao Tan, En-Hui Yang

It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate of the unknown Bayes conditional probability distribution (BCPD) to be used in the student training process. Conventionally, this estimate is obtained by training the teacher using the maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into the estimation of the BCPD and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and the CMI of the teacher are simultaneously maximized when the teacher is trained. Through Eigen-CAM, it is further shown that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Via a thorough set of experiments, we show that by employing a teacher trained via MCMI estimation rather than one trained via MLL estimation in various state-of-the-art KD frameworks, the student's classification accuracy consistently increases, with gains of up to 3.32%. This suggests that the teacher's BCPD estimate provided by the MCMI method is more accurate than that provided by the MLL method. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, the student's accuracy increases by up to 5.72% when 5% of the training samples are available to the student (few-shot), and increases from 0% to as high as 84% for an omitted class (zero-shot). The code is available at https://github.com/iclr2024mcmi/ICLRMCMI.

Bag of Tricks to Boost Adversarial Transferability. (arXiv:2401.08734v1 [cs.CV])

Authors: Zeliang Zhang, Rongyi Zhu, Wei Yao, Xiaosen Wang, Chenliang Xu

Deep neural networks are widely known to be vulnerable to adversarial examples. However, vanilla adversarial examples generated under the white-box setting often exhibit low transferability across different models. Since adversarial transferability poses more severe threats to practical applications, various approaches have been proposed for better transferability, including gradient-based, input transformation-based, and model-related attacks, etc. In this work, we find that several tiny changes in existing adversarial attacks can significantly affect the attack performance, e.g., the number of iterations and the step size. Based on careful studies of existing adversarial attacks, we propose a bag of tricks to enhance adversarial transferability, including momentum initialization, scheduled step size, dual example, spectral-based input transformation, and several ensemble strategies. Extensive experiments on the ImageNet dataset validate the high effectiveness of our proposed tricks and show that combining them can further boost adversarial transferability. Our work provides practical insights and techniques to enhance adversarial transferability, and offers guidance for improving attack performance in real-world applications through simple adjustments.
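
For context, the baseline these tricks modify is the momentum iterative attack; a standard MI-FGSM loop is sketched below. The paper's momentum-initialization trick warm-starts the momentum buffer g and its scheduled step size replaces the constant alpha, neither of which is shown here.

```python
import torch
import torch.nn.functional as F

def mi_fgsm(model, x, y, eps=8/255, steps=10, mu=1.0):
    """Momentum iterative FGSM on image batches of shape (B, C, H, W)."""
    x = x.clone().detach()
    alpha = eps / steps
    g = torch.zeros_like(x)                       # momentum buffer
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Accumulate an L1-normalized gradient into the momentum buffer.
        g = mu * g + grad / grad.abs().mean(dim=(1, 2, 3), keepdim=True)
        x_adv = (x_adv.detach() + alpha * g.sign()).clamp(0, 1)
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into the eps-ball
    return x_adv.detach()
```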

A Framework for Scalable Ambient Air Pollution Concentration Estimation. (arXiv:2401.08735v1 [stat.AP])

Authors: Liam J Berrisford, Lucy S Neal, Helen J Buttery, Benjamin R Evans, Ronaldo Menezes

Ambient air pollution remains a critical issue in the United Kingdom, where data on air pollution concentrations form the foundation for interventions aimed at improving air quality. However, the current air pollution monitoring station network in the UK is characterized by spatial sparsity, heterogeneous placement, and frequent temporal data gaps, often due to issues such as power outages. We introduce a scalable data-driven supervised machine learning model framework designed to address temporal and spatial data gaps by filling in missing measurements. This approach provides a comprehensive dataset for England throughout 2018 at a 1 km x 1 km hourly resolution. Leveraging machine learning techniques and real-world data from the sparsely distributed monitoring stations, we generate 355,827 synthetic monitoring stations across the study area, yielding data valued at approximately £70 billion. Validation was conducted to assess the model's performance in forecasting, estimating missing locations, and capturing peak concentrations. The resulting dataset is of particular interest to a diverse range of stakeholders engaged in downstream assessments supported by outdoor air pollution concentration data for NO2, O3, PM10, PM2.5, and SO2. This resource empowers stakeholders to conduct studies at a higher resolution than was previously possible.

Machine Learning-Based Analysis of Ebola Virus' Impact on Gene Expression in Nonhuman Primates. (arXiv:2401.08738v1 [q-bio.GN])

Authors: Mostafa Rezapour, Muhammad Khalid Khan Niazi, Hao Lu, Aarthi Narayanan, Metin Nafi Gurcan

This study introduces the Supervised Magnitude-Altitude Scoring (SMAS) methodology, a machine learning-based approach, for analyzing gene expression data obtained from nonhuman primates (NHPs) infected with Ebola virus (EBOV). We utilize a comprehensive dataset of NanoString gene expression profiles from Ebola-infected NHPs, deploying the SMAS system for nuanced host-pathogen interaction analysis. SMAS effectively combines gene selection based on statistical significance and expression changes, employing linear classifiers such as logistic regression to accurately differentiate between RT-qPCR positive and negative NHP samples. A key finding of our research is the identification of IFI6 and IFI27 as critical biomarkers, demonstrating exceptional predictive performance with 100% accuracy and Area Under the Curve (AUC) metrics in classifying various stages of Ebola infection. Alongside IFI6 and IFI27, genes, including MX1, OAS1, and ISG15, were significantly upregulated, highlighting their essential roles in the immune response to EBOV. Our results underscore the efficacy of the SMAS method in revealing complex genetic interactions and response mechanisms during EBOV infection. This research provides valuable insights into EBOV pathogenesis and aids in developing more precise diagnostic tools and therapeutic strategies to address EBOV infection in particular and viral infection in general.

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers. (arXiv:2401.08740v1 [cs.CV])

Authors: Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Saining Xie

We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: using discrete vs. continuous time learning, deciding the objective for the model to learn, choosing the interpolant connecting the distributions, and deploying a deterministic or stochastic sampler. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 benchmark using the exact same backbone, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06.
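
One concrete instance of the interpolant framework, a linear interpolant trained with a continuous-time velocity objective, is sketched below. SiT treats the time discretization, learning target, interpolant, and sampler as interchangeable design choices, so this is just one point in that design space; the model call signature is hypothetical.

```python
import torch

def interpolant_training_step(model, x0):
    """One step under the linear interpolant x_t = (1 - t) * x0 + t * eps,
    learning the velocity field of the transport between data and noise."""
    eps = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], 1, 1, 1, device=x0.device)  # continuous time
    x_t = (1 - t) * x0 + t * eps
    v_target = eps - x0                  # d(x_t)/dt for the linear interpolant
    v_pred = model(x_t, t.flatten())
    return ((v_pred - v_target) ** 2).mean()
```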

Fixed Point Diffusion Models. (arXiv:2401.08741v1 [cs.CV])

Authors: Xingjian Bai, Luke Melas-Kyriazi

We introduce the Fixed Point Diffusion Model (FPDM), a novel approach to image generation that integrates the concept of fixed point solving into the framework of diffusion-based generative modeling. Our approach embeds an implicit fixed point solving layer into the denoising network of a diffusion model, transforming the diffusion process into a sequence of closely-related fixed point problems. Combined with a new stochastic training method, this approach significantly reduces model size, reduces memory usage, and accelerates training. Moreover, it enables the development of two new techniques to improve sampling efficiency: reallocating computation across timesteps and reusing fixed point solutions between timesteps. We conduct extensive experiments with state-of-the-art models on ImageNet, FFHQ, CelebA-HQ, and LSUN-Church, demonstrating substantial improvements in performance and efficiency. Compared to the state-of-the-art DiT model, FPDM contains 87% fewer parameters, consumes 60% less memory during training, and improves image generation quality in situations where sampling computation or time is limited. Our code and pretrained models are available at https://lukemelas.github.io/fixed-point-diffusion-models.
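
The core structural idea--an implicit layer that iterates a small cell to convergence inside the denoiser, with the previous timestep's solution available as a warm start--can be sketched as follows (the cell, tolerance, and warm-start interface are illustrative assumptions):

import torch

def fixed_point_layer(cell, x, h0=None, max_iters=50, tol=1e-4):
    # Iterate h <- cell(h, x) to an (approximate) fixed point. Passing the
    # previous diffusion timestep's solution as h0 mimics the solution-reuse
    # trick described above.
    h = torch.zeros_like(x) if h0 is None else h0
    for _ in range(max_iters):
        h_next = cell(h, x)
        if (h_next - h).norm() < tol * (h.norm() + 1e-8):
            return h_next
        h = h_next
    return h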

MMToM-QA: Multimodal Theory of Mind Question Answering. (arXiv:2401.08743v1 [cs.AI])

Authors: Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu

Theory of Mind (ToM), the ability to understand people's minds, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data, which can include visual cues, linguistic narratives, or both. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.

The weird and the wonderful in our Solar System: Searching for serendipity in the Legacy Survey of Space and Time. (arXiv:2401.08763v1 [astro-ph.EP])

Authors: Brian Rogers, Chris J. Lintott, Steve Croft, Megan E. Schwamb, James R. A. Davenport

We present a novel method for anomaly detection in Solar System object data, in preparation for the Legacy Survey of Space and Time. We train a deep autoencoder for anomaly detection and use the learned latent space to search for other interesting objects. We demonstrate the efficacy of the autoencoder approach by finding interesting examples, such as interstellar objects, and show that using the autoencoder, further examples of interesting classes can be found. We also investigate the limits of classic unsupervised approaches to anomaly detection through the generation of synthetic anomalies and evaluate the feasibility of using a supervised learning approach. Future work should consider expanding the feature space to increase the variety of anomalies that can be uncovered during the survey using an autoencoder.
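
The scoring rule behind such detectors is simple: fit an autoencoder to the bulk population and rank objects by reconstruction error, with the latent space doubling as a search index for similar objects. A generic sketch (synthetic stand-in features and hypothetical sizes, not the survey pipeline):

import torch
import torch.nn as nn

features = torch.randn(10000, 8)              # stand-in for per-object features
ae = nn.Sequential(nn.Linear(8, 3), nn.ReLU(), nn.Linear(3, 8))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(200):                          # fit on the (mostly normal) bulk
    opt.zero_grad()
    ((ae(features) - features) ** 2).mean().backward()
    opt.step()
with torch.no_grad():
    err = ((ae(features) - features) ** 2).mean(dim=1)   # per-object score
anomalies = torch.topk(err, k=20).indices     # the most poorly reconstructed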

Robust Anomaly Detection for Particle Physics Using Multi-Background Representation Learning. (arXiv:2401.08777v1 [hep-ex])

Authors: Abhijith Gandrakota, Lily Zhang, Aahlad Puli, Kyle Cranmer, Jennifer Ngadiuba, Rajesh Ranganath, Nhan Tran

Anomaly, or out-of-distribution, detection is a promising tool for aiding discoveries of new particles or processes in particle physics. In this work, we identify and address two overlooked opportunities to improve anomaly detection for high-energy physics. First, rather than train a generative model on the single most dominant background process, we build detection algorithms using representation learning from multiple background types, thus taking advantage of more information to improve estimation of what is relevant for detection. Second, we generalize decorrelation to the multi-background setting, thus directly enforcing a more complete definition of robustness for anomaly detection. We demonstrate the benefit of the proposed robust multi-background anomaly detection algorithms on a high-dimensional dataset of particle decays at the Large Hadron Collider.

The Impact of Differential Feature Under-reporting on Algorithmic Fairness. (arXiv:2401.08788v1 [cs.LG])

Authors: Nil-Jana Akpinar, Zachary C. Lipton, Alexandra Chouldechova

Predictive risk models in the public sector are commonly developed using administrative data that is more complete for subpopulations that more greatly rely on public services. In the United States, for instance, information on health care utilization is routinely available to government agencies for individuals supported by Medicaid and Medicare, but not for the privately insured. Critiques of public sector algorithms have identified such differential feature under-reporting as a driver of disparities in algorithmic decision-making. Yet this form of data bias remains understudied from a technical viewpoint. While prior work has examined the fairness impacts of additive feature noise and features that are clearly marked as missing, the setting of data missingness absent indicators (i.e. differential feature under-reporting) has been lacking in research attention. In this work, we present an analytically tractable model of differential feature under-reporting which we then use to characterize the impact of this kind of data bias on algorithmic fairness. We demonstrate how standard missing data methods typically fail to mitigate bias in this setting, and propose a new set of methods specifically tailored to differential feature under-reporting. Our results show that, in real world data settings, under-reporting typically leads to increasing disparities. The proposed solution methods show success in mitigating increases in unfairness.

Sample Relationship from Learning Dynamics Matters for Generalisation. (arXiv:2401.08808v1 [cs.LG])

Authors: Shangmin Guo, Yi Ren, Stefano V. Albrecht, Kenny Smith

Although much research has been done on proposing new models or loss functions to improve the generalisation of artificial neural networks (ANNs), less attention has been directed to the impact of the training data on generalisation. In this work, we start from approximating the interaction between samples, i.e. how learning one sample would modify the model's prediction on other samples. Through analysing the terms involved in weight updates in supervised learning, we find that labels influence the interaction between samples. Therefore, we propose the labelled pseudo Neural Tangent Kernel (lpNTK) which takes label information into consideration when measuring the interactions between samples. We first prove that lpNTK asymptotically converges to the empirical neural tangent kernel in terms of the Frobenius norm under certain assumptions. Secondly, we illustrate how lpNTK helps to understand learning phenomena identified in previous work, specifically the learning difficulty of samples and forgetting events during learning. Moreover, we also show that using lpNTK to identify and remove poisoning training samples does not hurt the generalisation performance of ANNs.

Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive. (arXiv:2401.08815v1 [cs.CV])

Authors: Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva

Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).

Link Me Baby One More Time: Social Music Discovery on Spotify. (arXiv:2401.08818v1 [cs.SI])

Authors: Shazia'Ayn Babul, Desislava Hristova, Antonio Lima, Renaud Lambiotte, Mariano Beguerisse-Díaz

We explore the social and contextual factors that influence the outcome of person-to-person music recommendations and discovery. Specifically, we use data from Spotify to investigate how a link sent from one user to another results in the receiver engaging with the music of the shared artist. We consider several factors that may influence this process, such as the strength of the sender-receiver relationship, the user's role in the Spotify social network, their music social cohesion, and how similar the new artist is to the receiver's taste. We find that the receiver of a link is more likely to engage with a new artist when (1) they have similar music taste to the sender and the shared track is a good fit for their taste, (2) they have a stronger and more intimate tie with the sender, and (3) the shared artist is popular with the receiver's connections. Finally, we use these findings to build a Random Forest classifier to predict whether a shared music track will result in the receiver's engagement with the shared artist. This model elucidates which types of social and contextual features are most predictive, although peak performance is achieved when a diverse set of features is included. These findings provide new insights into the multifaceted mechanisms underpinning the interplay between music discovery and social processes.

Learning from Sparse Offline Datasets via Conservative Density Estimation. (arXiv:2401.08819v1 [cs.LG])

Authors: Zhepeng Cen, Zuxin Liu, Zitong Wang, Yihang Yao, Henry Lam, Ding Zhao

Offline reinforcement learning (RL) offers a promising direction for learning policies from pre-collected datasets without requiring further interactions with the environment. However, existing methods struggle to handle out-of-distribution (OOD) extrapolation errors, especially in sparse reward or scarce data settings. In this paper, we propose a novel training algorithm called Conservative Density Estimation (CDE), which addresses this challenge by explicitly imposing constraints on the state-action occupancy stationary distribution. CDE overcomes the limitations of existing approaches, such as the stationary distribution correction method, by addressing the support mismatch issue in marginal importance sampling. Our method achieves state-of-the-art performance on the D4RL benchmark. Notably, CDE consistently outperforms baselines in challenging tasks with sparse rewards or insufficient data, demonstrating the advantages of our approach in addressing the extrapolation error problem in offline RL.

Surface-Enhanced Raman Spectroscopy and Transfer Learning Toward Accurate Reconstruction of the Surgical Zone. (arXiv:2401.08821v1 [eess.IV])

Authors: Ashutosh Raman, Ren A. Odion, Kent K. Yamamoto, Weston Ross, Tuan Vo-Dinh, Patrick J. Codd

Raman spectroscopy, a photonic modality based on the inelastic backscattering of coherent light, is a valuable asset to the intraoperative sensing space, offering non-ionizing potential and highly-specific molecular fingerprint-like spectroscopic signatures that can be used for diagnosis of pathological tissue in the dynamic surgical field. Though Raman suffers from weakness in intensity, Surface-Enhanced Raman Spectroscopy (SERS), which uses metal nanostructures to amplify Raman signals, can achieve detection sensitivities that rival traditional photonic modalities. In this study, we outline a robotic Raman system that can reliably pinpoint the location and boundaries of a tumor embedded in healthy tissue, modeled here as a tissue-mimicking phantom with selectively infused Gold Nanostar regions. Further, due to the relative dearth of collected biological SERS or Raman data, we implement transfer learning to achieve 100% validation classification accuracy for Gold Nanostars compared to Control Agarose, thus providing a proof-of-concept for Raman-based deep learning training pipelines. We reconstruct a surgical field of 30x60mm in 10.2 minutes, and achieve 98.2% accuracy, preserving relative measurements between features in the phantom. We also achieve an 84.3% Intersection-over-Union score, which is the extent of overlap between the ground truth and predicted reconstructions. Lastly, we also demonstrate that the Raman system and classification algorithm do not discern based on sample color, but instead on presence of SERS agents. This study provides a crucial step in the translation of intelligent Raman systems in intraoperative oncological spaces.

AiGen-FoodReview: A Multimodal Dataset of Machine-Generated Restaurant Reviews and Images on Social Media. (arXiv:2401.08825v1 [cs.LG])

Authors: Alessandro Gambetti, Qiwei Han

Online reviews in the form of user-generated content (UGC) significantly impact consumer decision-making. However, the pervasive issue of not only human fake content but also machine-generated content challenges UGC's reliability. Recent advances in Large Language Models (LLMs) may pave the way to fabricate indistinguishable fake generated content at a much lower cost. Leveraging OpenAI's GPT-4-Turbo and DALL-E-2 models, we craft AiGen-FoodReview, a multi-modal dataset of 20,144 restaurant review-image pairs divided into authentic and machine-generated. We explore unimodal and multimodal detection models, achieving 99.80% multimodal accuracy with FLAVA. We use attributes from readability and photographic theories to score reviews and images, respectively, demonstrating their utility as hand-crafted features in scalable and interpretable detection models, with comparable performance. The paper contributes by open-sourcing the dataset and releasing fake review detectors, recommending its use in unimodal and multimodal fake review detection tasks, and evaluating linguistic and visual features in synthetic versus authentic data.

Stochastic Subnetwork Annealing: A Regularization Technique for Fine Tuning Pruned Subnetworks. (arXiv:2401.08830v1 [cs.LG])

Authors: Tim Whitaker, Darrell Whitley

Pruning methods have recently grown in popularity as an effective way to reduce the size and computational complexity of deep neural networks. Large numbers of parameters can be removed from trained models with little discernible loss in accuracy after a small number of continued training epochs. However, pruning too many parameters at once often causes an initial steep drop in accuracy which can undermine convergence quality. Iterative pruning approaches mitigate this by gradually removing a small number of parameters over multiple epochs. However, this can still lead to subnetworks that overfit local regions of the loss landscape. We introduce a novel and effective approach to tuning subnetworks through a regularization technique we call Stochastic Subnetwork Annealing. Instead of removing parameters in a discrete manner, we represent subnetworks with stochastic masks where each parameter has a probabilistic chance of being included or excluded on any given forward pass. We anneal these probabilities over time such that subnetwork structure slowly evolves as mask values become more deterministic, allowing for a smoother and more robust optimization of subnetworks at high levels of sparsity.
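
A minimal sketch of the stochastic-mask idea (the linear annealing schedule, layer type, and sparsity level are illustrative assumptions, not the paper's exact recipe):

import torch

class StochasticMaskedLinear(torch.nn.Module):
    # Linear layer whose weights are kept or dropped per forward pass according
    # to per-parameter probabilities that anneal toward a deterministic mask.
    def __init__(self, in_f, out_f, target_sparsity=0.9):
        super().__init__()
        self.linear = torch.nn.Linear(in_f, out_f)
        keep = torch.rand(out_f, in_f) > target_sparsity    # final hard mask
        self.register_buffer("final_mask", keep.float())
        self.temperature = 1.0                               # 1 -> fully random

    def anneal(self, step, total_steps):
        # Linear schedule (an assumption): randomness decays to zero.
        self.temperature = max(0.0, 1.0 - step / total_steps)

    def forward(self, x):
        # Interpolate keep-probabilities from 0.5 (random) toward the 0/1 mask.
        p_keep = self.temperature * 0.5 + (1 - self.temperature) * self.final_mask
        mask = torch.bernoulli(p_keep)
        return torch.nn.functional.linear(x, self.linear.weight * mask,
                                          self.linear.bias)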

RIDGE: Reproducibility, Integrity, Dependability, Generalizability, and Efficiency Assessment of Medical Image Segmentation Models. (arXiv:2401.08847v1 [eess.IV])

Authors: Farhad Maleki, Linda Moy, Reza Forghani, Tapotosh Ghosh, Katie Ovens, Steve Langer, Pouria Rouzrokh, Bardia Khosravi, Ali Ganjizadeh, Daniel Warren, Roxana Daneshjou, Mana Moassefi, Atlas Haddadi Avval, Susan Sotardi, Neil Tenenholtz, Felipe Kitamura, Timothy Kline

Deep learning techniques, despite their potential, often suffer from a lack of reproducibility and generalizability, impeding their clinical adoption. Image segmentation is one of the critical tasks in medical image analysis, in which one or several regions/volumes of interest should be annotated. This paper introduces the RIDGE checklist, a framework for assessing the Reproducibility, Integrity, Dependability, Generalizability, and Efficiency of deep learning-based medical image segmentation models. The checklist serves as a guide for researchers to enhance the quality and transparency of their work, ensuring that segmentation models are not only scientifically sound but also clinically relevant.

REValueD: Regularised Ensemble Value-Decomposition for Factorisable Markov Decision Processes. (arXiv:2401.08850v1 [cs.LG])

Authors: David Ireland, Giovanni Montana

Discrete-action reinforcement learning algorithms often falter in tasks with high-dimensional discrete action spaces due to the vast number of possible actions. A recent advancement leverages value-decomposition, a concept from multi-agent reinforcement learning, to tackle this challenge. This study delves deep into the effects of this value-decomposition, revealing that whilst it curtails the over-estimation bias inherent to Q-learning algorithms, it amplifies target variance. To counteract this, we present an ensemble of critics to mitigate target variance. Moreover, we introduce a regularisation loss that helps to mitigate the effects that exploratory actions in one dimension can have on the value of optimal actions in other dimensions. Our novel algorithm, REValueD, tested on discretised versions of the DeepMind Control Suite tasks, showcases superior performance, especially in the challenging humanoid and dog tasks. We further dissect the factors influencing REValueD's performance, evaluating the significance of the regularisation loss and the scalability of REValueD with increasing sub-actions per dimension.

Using i-vectors for subject-independent cross-session EEG transfer learning. (arXiv:2401.08851v1 [cs.LG])

Authors: Jonathan Lasko, Jeff Ma, Mike Nicoletti, Jonathan Sussman-Fort, Sooyoung Jeong, William Hartmann

Cognitive load classification is the task of automatically determining an individual's utilization of working memory resources during performance of a task based on physiologic measures such as electroencephalography (EEG). In this paper, we follow a cross-disciplinary approach, where tools and methodologies from speech processing are used to tackle this problem. The corpus we use was released publicly in 2021 as part of the first passive brain-computer interface competition on cross-session workload estimation. We present our approach which used i-vector-based neural network classifiers to accomplish inter-subject cross-session EEG transfer learning, achieving 18% relative improvement over equivalent subject-dependent models. We also report experiments showing how our subject-independent models perform competitively on held-out subjects and improve with additional subject data, suggesting that subject-dependent training is not required for effective cognitive load determination.

Shabari: Delayed Decision-Making for Faster and Efficient Serverless Function. (arXiv:2401.08859v1 [cs.DC])

Authors: Prasoon Sinha, Kostis Kaffes, Neeraja J. Yadwadkar

Serverless computing relieves developers from the burden of resource management, thus providing ease-of-use to the users and the opportunity to optimize resource utilization for the providers. However, today's serverless systems lack performance guarantees for function invocations, thus limiting support for performance-critical applications: we observed severe performance variability (up to 6x). Providers lack visibility into user functions and hence find it challenging to right-size them: we observed heavy resource underutilization (up to 80%). To understand the causes behind the performance variability and underutilization, we conducted a measurement study of commonly deployed serverless functions and learned that the function performance and resource utilization depend crucially on function semantics and inputs. Our key insight is to delay making resource allocation decisions until after the function inputs are available. We introduce Shabari, a resource management framework for serverless systems that makes decisions as late as possible to right-size each invocation to meet functions' performance objectives (SLOs) and improve resource utilization. Shabari uses an online learning agent to right-size each function invocation based on the features of the function input and makes cold-start-aware scheduling decisions. For a range of serverless functions and inputs, Shabari reduces SLO violations by 11-73% while not wasting any vCPUs and reducing wasted memory by 64-94% in the median case, compared to state-of-the-art systems, including Aquatope, Parrotfish, and Cypress.

Semi-Supervised Learning Approach for Efficient Resource Allocation with Network Slicing in O-RAN. (arXiv:2401.08861v1 [cs.NI])

Authors: Salar Nouri, Mojdeh Karbalaee Motalleb, Vahid Shah-Mansouri, Seyed Pooya Shariatpanahi

The Open Radio Access Network (O-RAN) technology has emerged as a promising solution for network operators, providing them with an open and favorable environment. Ensuring effective coordination of x-applications (xAPPs) is crucial to enhance flexibility and optimize network performance within the O-RAN. In this paper, we introduce an innovative approach to the resource allocation problem, aiming to coordinate multiple independent xAPPs for network slicing and resource allocation in O-RAN. Our proposed method focuses on maximizing the weighted throughput among user equipments (UE), as well as allocating physical resource blocks (PRBs). We prioritize two service types, namely enhanced Mobile Broadband and Ultra Reliable Low Latency Communication. To achieve this, we have designed two xAPPs: a power control xAPP for each UE and a PRB allocation xAPP. The proposed method consists of a two-part training phase, where the first part uses supervised learning with a Variational Autoencoder trained to regress the power transmission as well as the user association and PRB allocation decisions, and the second part uses unsupervised learning with a contrastive loss approach to improve the generalization and robustness of the model. We evaluate the performance of our proposed method by comparing its results to those obtained from an exhaustive search algorithm, deep Q-network algorithm, and by reporting performance metrics for the regression task. We also evaluate the proposed model's performance in different scenarios among the service types. The results show that the proposed method is a more efficient and effective solution for network slicing problems compared to state-of-the-art methods.

Robust Localization of Key Fob Using Channel Impulse Response of Ultra Wide Band Sensors for Keyless Entry Systems. (arXiv:2401.08863v1 [cs.LG])

Authors: Abhiram Kolli, Filippo Casamassima, Horst Possegger, Horst Bischof

Using neural networks for localization of a key fob within and surrounding a car as a security feature for keyless entry is fast emerging. In this paper we: 1) study the performance of pre-computed features of neural-network-based UWB (ultra wide band) localization classification, forming the baseline of our experiments; 2) investigate the inherent robustness of various neural networks, studying robustness to adversarial examples without any adversarial training; and 3) propose a multi-head self-supervised neural network architecture which outperforms the baseline neural networks without any adversarial training. The model's performance improved by 67% at certain ranges of adversarial magnitude for the fast gradient sign method and by 37% each for the basic iterative method and the projected gradient descent method.
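
For reference, the fast gradient sign method used in the robustness evaluation perturbs an input along the sign of the loss gradient; a standard sketch (input-range clamping omitted for brevity):

import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # Classic FGSM: x_adv = x + eps * sign(dL/dx), with x treated as a leaf.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()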

Binaural Angular Separation Network. (arXiv:2401.08864v1 [eess.AS])

Authors: Yang Yang, George Sung, Shao-Fu Shih, Hakan Erdogan, Chehung Lee, Matthias Grundmann

We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones. The model is trained with simulated room impulse responses (RIRs) using omni-directional microphones without needing to collect real RIRs. By relying on specific angular regions and multiple room simulations, the model utilizes consistent time difference of arrival (TDOA) cues, or what we call delay contrast, to separate target and interference sources while remaining robust in various reverberation environments. We demonstrate the model is not only generalizable to a commercially available device with a slightly different microphone geometry, but also outperforms our previous work which uses one additional microphone on the same device. The model runs in real-time on-device and is suitable for low-latency streaming applications such as telephony and video conferencing.

The Effect of Intrinsic Dataset Properties on Generalization: Unraveling Learning Differences Between Natural and Medical Images. (arXiv:2401.08865v1 [cs.CV])

Authors: Nicholas Konz, Maciej A. Mazurowski

This paper investigates discrepancies in how neural networks learn from different imaging domains, which are commonly overlooked when adopting computer vision techniques from the domain of natural images to other specialized domains such as medical images. Recent works have found that the generalization error of a trained network typically increases with the intrinsic dimension ($d_{data}$) of its training set. Yet, the steepness of this relationship varies significantly between medical (radiological) and natural imaging domains, with no existing theoretical explanation. We address this gap in knowledge by establishing and empirically validating a generalization scaling law with respect to $d_{data}$, and propose that the substantial scaling discrepancy between the two considered domains may be at least partially attributed to the higher intrinsic "label sharpness" ($K_F$) of medical imaging datasets, a metric which we propose. Next, we demonstrate an additional benefit of measuring the label sharpness of a training set: it is negatively correlated with the trained model's adversarial robustness, which notably leads to models for medical images having a substantially higher vulnerability to adversarial attack. Finally, we extend our $d_{data}$ formalism to the related metric of learned representation intrinsic dimension ($d_{repr}$), derive a generalization scaling law with respect to $d_{repr}$, and show that $d_{data}$ serves as an upper bound for $d_{repr}$. Our theoretical results are supported by thorough experiments with six models and eleven natural and medical imaging datasets over a range of training set sizes. Our findings offer insights into the influence of intrinsic dataset properties on generalization, representation learning, and robustness in deep neural networks.

MambaTab: A Simple Yet Effective Approach for Handling Tabular Data. (arXiv:2401.08867v1 [cs.LG])

Authors: Md Atik Ahamed, Qiang Cheng

Tabular data remains ubiquitous across domains despite growing use of images and texts for machine learning. While deep learning models like convolutional neural networks and transformers achieve strong performance on tabular data, they require extensive data preprocessing, tuning, and resources, limiting accessibility and scalability. This work develops an innovative approach based on a structured state-space model (SSM), MambaTab, for tabular data. SSMs have strong capabilities for efficiently extracting effective representations from data with long-range dependencies. MambaTab leverages Mamba, an emerging SSM variant, for end-to-end supervised learning on tables. Compared to state-of-the-art baselines, MambaTab delivers superior performance while requiring significantly fewer parameters and minimal preprocessing, as empirically validated on diverse benchmark datasets. MambaTab's efficiency, scalability, generalizability, and predictive gains signify it as a lightweight, "out-of-the-box" solution for diverse tabular data with promise for enabling wider practical applications.

DCRMTA: Unbiased Causal Representation for Multi-touch Attribution. (arXiv:2401.08875v1 [cs.LG])

Authors: Jiaming Tang

Multi-touch attribution (MTA) currently plays a pivotal role in achieving a fair estimation of the contributions of each advertising touchpoint towards conversion behavior, deeply influencing budget allocation and advertising recommendation. Traditional multi-touch attribution methods initially build a conversion prediction model, anticipating learning the inherent relationship between touchpoint sequences and user purchasing behavior through historical data. Based on this, counterfactual touchpoint sequences are constructed from the original sequence subset, and conversions are estimated using the prediction model, thus calculating advertising contributions. A covert assumption of these methods is the unbiased nature of conversion prediction models. However, due to confounding factors arising from user preferences and internet recommendation mechanisms, such as the homogenization of ad recommendations resulting from past shopping records, bias can easily occur in conversion prediction models trained on observational data. This paper redefines the causal effect of user features on conversions and proposes a novel end-to-end approach, Deep Causal Representation for MTA (DCRMTA). Our model extracts features with causal relations to conversions from users while eliminating confounding variables. Furthermore, extensive experiments on both synthetic and real-world Criteo data demonstrate DCRMTA's superior performance in conversion prediction across varying data distributions, while also effectively attributing value across different advertising channels.

Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling. (arXiv:2401.08876v1 [cs.HC])

Authors: Dongping Zhang, Angelos Chatzimparmpas, Negar Kamali, Jessica Hullman

As deep neural networks are more commonly deployed in high-stakes domains, their lack of interpretability makes uncertainty quantification challenging. We investigate the effects of presenting conformal prediction sets--a method for generating valid confidence sets in distribution-free uncertainty quantification--to express uncertainty in AI-advised decision-making. Through a large pre-registered experiment, we compare the utility of conformal prediction sets to displays of Top-1 and Top-k predictions for AI-advised image labeling. We find that the utility of prediction sets for accuracy varies with the difficulty of the task: while they result in accuracy on par with or less than Top-1 and Top-k displays for easy images, prediction sets excel at assisting humans in labeling out-of-distribution (OOD) images, especially when the set size is small. Our results empirically pinpoint the practical challenges of conformal prediction sets and provide implications on how to incorporate them for real-world decision-making.
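
For context, a split-conformal prediction set for classification can be built in a few lines; the construction below is the standard one, not necessarily the authors' experimental stack:

import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    # Nonconformity score: 1 - softmax probability of the true class.
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile gives 1 - alpha coverage.
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    # Prediction set: all classes whose score falls below the quantile.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]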

RiemannONets: Interpretable Neural Operators for Riemann Problems. (arXiv:2401.08886v1 [cs.LG])

Authors: Ahmad Peyvan, Vivek Oommen, Ameya D. Jagtap, George Em Karniadakis

Developing the proper representations for simulating high-speed flows with strong shock waves, rarefactions, and contact discontinuities has been a long-standing question in numerical analysis. Herein, we employ neural operators to solve Riemann problems encountered in compressible flows for extreme pressure jumps (up to $10^{10}$ pressure ratio). In particular, we first consider the DeepONet that we train in a two-stage process, following the recent work of Lee and Shin, wherein, in the first stage, a basis is extracted from the trunk net, which is orthonormalized and subsequently used in the second stage in training the branch net. This simple modification of DeepONet has a profound effect on its accuracy, efficiency, and robustness and leads to very accurate solutions to Riemann problems compared to the vanilla version. It also enables us to interpret the results physically, as the hierarchical, data-driven basis reflects all the flow features that would otherwise be introduced using ad hoc feature expansion layers. We also compare the results with another neural operator based on the U-Net for low, intermediate, and very high pressure ratios; U-Net-based operators are very accurate for Riemann problems, especially for large pressure ratios, due to their multiscale nature, but are computationally more expensive. Overall, our study demonstrates that simple neural network architectures, if properly pre-trained, can achieve very accurate solutions of Riemann problems for real-time forecasting.
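
One plausible reading of the two-stage recipe described above, sketched with hypothetical names and tensor shapes (a trained trunk, an untrained branch, and training solutions sampled on x_grid as rows of solutions):

import torch

def two_stage_deeponet(trunk, branch, x_grid, solutions, inputs, steps=1000):
    # Stage 1 is assumed done: `trunk` is already trained. Extract its basis
    # on the grid and orthonormalize it (QR is one standard choice).
    with torch.no_grad():
        Q, _ = torch.linalg.qr(trunk(x_grid))     # (n_points, n_basis)
        targets = solutions @ Q                   # (n_samples, n_basis) coeffs
    # Stage 2: train the branch net to predict those basis coefficients.
    opt = torch.optim.Adam(branch.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        ((branch(inputs) - targets) ** 2).mean().backward()
        opt.step()
    return lambda u: branch(u) @ Q.T              # reconstruct on the grid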

On the Effect of Data-Augmentation on Local Embedding Properties in the Contrastive Learning of Music Audio Representations. (arXiv:2401.08889v1 [cs.SD])

Authors: Matthew C. McCallum, Matthew E. P. Davies, Florian Henkel, Jaehun Kim, Samuel E. Sandberg

Audio embeddings are crucial tools in understanding large catalogs of music. Typically embeddings are evaluated on the basis of the performance they provide in a wide range of downstream tasks; however, few studies have investigated the local properties of the embedding spaces themselves, which are important in nearest neighbor algorithms commonly used in music search and recommendation. In this work we show that when learning audio representations on music datasets via contrastive learning, musical properties that are typically homogeneous within a track (e.g., key and tempo) are reflected in the locality of neighborhoods in the resulting embedding space. By applying appropriate data augmentation strategies, the localisation of such properties can not only be reduced, but the localisation of other attributes can be increased. For example, the locality of features such as pitch and tempo, which are less relevant to non-expert listeners, may be mitigated while improving the locality of more salient features such as genre and mood, achieving state-of-the-art performance in nearest neighbor retrieval accuracy. Similarly, we show that the optimal selection of data augmentation strategies for contrastive learning of music audio embeddings is dependent on the downstream task, highlighting this as an important embedding design decision.

Tempo estimation as fully self-supervised binary classification. (arXiv:2401.08891v1 [cs.SD])

Authors: Florian Henkel, Jaehun Kim, Matthew C. McCallum, Samuel E. Sandberg, Matthew E. P. Davies

This paper addresses the problem of global tempo estimation in musical audio. Given that annotating tempo is time-consuming and requires certain musical expertise, few publicly available data sources exist to train machine learning models for this task. Towards alleviating this issue, we propose a fully self-supervised approach that does not rely on any human labeled data. Our method builds on the fact that generic (music) audio embeddings already encode a variety of properties, including information about tempo, making them easily adaptable for downstream tasks. While recent work in self-supervised tempo estimation aimed to learn a tempo specific representation that was subsequently used to train a supervised classifier, we reformulate the task into the binary classification problem of predicting whether a target track has the same or a different tempo compared to a reference. While the former still requires labeled training data for the final classification model, our approach uses arbitrary unlabeled music data in combination with time-stretching for model training as well as a small set of synthetically created reference samples for predicting the final tempo. Evaluation of our approach in comparison with the state-of-the-art reveals highly competitive performance when the constraint of finding the precise tempo octave is relaxed.
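
The self-supervised labels come for free from time-stretching: a pair shares a tempo exactly when both excerpts were stretched by the same rate. A sketch of the pair construction (librosa's time-stretch is a stand-in for any such routine; the rate range and margin are hypothetical):

import numpy as np
import librosa   # any time-stretch implementation would do

def make_pair(audio, rng):
    # Positive pair: both excerpts stretched by the same rate (same tempo).
    # Negative pair: clearly different rates. Labels require no annotation.
    rate_a = rng.uniform(0.7, 1.4)
    same = rng.random() < 0.5
    rate_b = rate_a
    while not same and abs(rate_b - rate_a) < 0.05:
        rate_b = rng.uniform(0.7, 1.4)
    a = librosa.effects.time_stretch(audio, rate=rate_a)
    b = librosa.effects.time_stretch(audio, rate=rate_b)
    return a, b, int(same)                    # 1 = same tempo, 0 = different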

MADA: Meta-Adaptive Optimizers through hyper-gradient Descent. (arXiv:2401.08893v1 [cs.LG])

Authors: Kaan Ozkara, Can Karakus, Parameswaran Raman, Mingyi Hong, Shoham Sabach, Branislav Kveton, Volkan Cevher

Since Adam was introduced, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during training. The key idea in MADA is to parameterize the space of optimizers and search through it using hyper-gradient descent. Numerical results suggest that MADA is robust against sub-optimally tuned hyper-parameters, and outperforms Adam, Lion, and Adan with their default hyper-parameters, often even with optimized hyper-parameters. We also propose AVGrad, a variant of AMSGrad where the maximum operator is replaced with averaging, and observe that it performs better within MADA. Finally, we provide a convergence analysis to show that interpolation of optimizers (specifically, AVGrad and Adam) can improve their error bounds (up to constants), hinting at an advantage for meta-optimizers.
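
As a concrete instance, AVGrad as described above can be sketched by swapping AMSGrad's running maximum of the second moment for a running average (a paraphrase of the abstract, not the authors' code; bias correction omitted for brevity):

import torch

def avgrad_step(p, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    # AMSGrad keeps v_hat = max(v_hat, v); AVGrad replaces that maximum
    # with a running average of v. Assumes p.grad is already populated.
    if not state:
        state.update(m=torch.zeros_like(p), v=torch.zeros_like(p),
                     v_hat=torch.zeros_like(p), t=0)
    g = p.grad
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * g
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * g * g
    state["v_hat"] = ((state["t"] - 1) * state["v_hat"] + state["v"]) / state["t"]
    with torch.no_grad():
        p -= lr * state["m"] / (state["v_hat"].sqrt() + eps)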

cedar: Composable and Optimized Machine Learning Input Data Pipelines. (arXiv:2401.08895v1 [cs.LG])

Authors: Mark Zhao, Emanuel Adamiak, Christos Kozyrakis

The input data pipeline is an essential component of each machine learning (ML) training job. It is responsible for reading massive amounts of training data, processing batches of samples using complex transformations, and loading them onto training nodes at low latency and high throughput. Performant input data systems are becoming increasingly critical, driven by skyrocketing data volumes and training throughput demands. Unfortunately, current input data systems cannot fully leverage key performance optimizations, resulting in hugely inefficient infrastructures that require significant resources -- or worse -- underutilize expensive accelerators.

To address these demands, we present cedar, a programming model and framework that allows users to easily build, optimize, and execute input data pipelines. cedar presents an easy-to-use programming interface, allowing users to define input data pipelines using composable operators that support arbitrary ML frameworks and libraries. Meanwhile, cedar transparently applies a complex and extensible set of optimization techniques (e.g., offloading, caching, prefetching, fusion, and reordering). It then orchestrates processing across a customizable set of local and distributed compute resources in order to maximize processing performance and efficiency, all without user input. On average across six diverse input data pipelines, cedar achieves a 2.49x, 1.87x, 2.18x, and 2.74x higher performance compared to tf.data, tf.data service, Ray Data, and PyTorch's DataLoader, respectively.

CFASL: Composite Factor-Aligned Symmetry Learning for Disentanglement in Variational AutoEncoder. (arXiv:2401.08897v1 [cs.LG])

Authors: Hee-Jun Jung, Jaehyoung Jeong, Kangil Kim

Symmetries of input and latent vectors have provided valuable insights for disentanglement learning in VAEs. However, only a few works have been proposed as unsupervised methods, and even these works require known factor information in training data. We propose a novel method, Composite Factor-Aligned Symmetry Learning (CFASL), which is integrated into VAEs for learning symmetry-based disentanglement in unsupervised learning without any knowledge of the dataset factor information. CFASL incorporates three novel features for learning symmetry-based disentanglement: 1) injecting inductive bias to align latent vector dimensions to factor-aligned symmetries within an explicit learnable symmetry codebook; 2) learning a composite symmetry to express unknown factor changes between two random samples by learning factor-aligned symmetries within the codebook; and 3) inducing a group equivariant encoder and decoder in training VAEs under the two conditions. In addition, we propose an extended evaluation metric for multi-factor changes in comparison to disentanglement evaluation in VAEs. In quantitative and in-depth qualitative analysis, CFASL demonstrates a significant improvement of disentanglement under single-factor change and multi-factor change conditions compared to state-of-the-art methods.

Bridging State and History Representations: Understanding Self-Predictive RL. (arXiv:2401.08898v1 [cs.LG])

Authors: Tianwei Ni, Benjamin Eysenbach, Erfan Seyedsalehi, Michel Ma, Clement Gehring, Aditya Mahajan, Pierre-Luc Bacon

Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of practical guidelines for RL practitioners.

Similar but Faster: Manipulation of Tempo in Music Audio Embeddings for Tempo Prediction and Search. (arXiv:2401.08902v1 [cs.SD])

Authors: Matthew C. McCallum, Florian Henkel, Jaehun Kim, Samuel E. Sandberg, Matthew E. P. Davies

Audio embeddings enable large scale comparisons of the similarity of audio files for applications such as search and recommendation. Due to the subjectivity of audio similarity, it can be desirable to design systems that answer not only whether audio is similar, but similar in what way (e.g., wrt. tempo, mood or genre). Previous works have proposed disentangled embedding spaces where subspaces representing specific, yet possibly correlated, attributes can be weighted to emphasize those attributes in downstream tasks. However, no research has been conducted into the independence of these subspaces, nor their manipulation, in order to retrieve tracks that are similar but different in a specific way. Here, we explore the manipulation of tempo in embedding spaces as a case-study towards this goal. We propose tempo translation functions that allow for efficient manipulation of tempo within a pre-existing embedding space whilst maintaining other properties such as genre. As this translation is specific to tempo it enables retrieval of tracks that are similar but have specifically different tempi. We show that such a function can be used as an efficient data augmentation strategy for both training of downstream tempo predictors, and improved nearest neighbor retrieval of properties largely independent of tempo.

PPR: Enhancing Dodging Attacks while Maintaining Impersonation Attacks on Face Recognition Systems. (arXiv:2401.08903v1 [cs.CV])

Authors: Fengfan Zhou, Heifei Ling

Adversarial attacks on Face Recognition (FR) encompass two types: impersonation attacks and dodging (evasion) attacks. We observe that achieving a successful impersonation attack on FR does not necessarily ensure a successful dodging attack on FR in the black-box setting. Introducing a novel attack method named Pre-training Pruning Restoration Attack (PPR), we aim to enhance the performance of dodging attacks whilst avoiding the degradation of impersonation attacks. Our method employs adversarial example pruning, enabling a portion of adversarial perturbations to be set to zero while tending to maintain the attack performance. By utilizing adversarial example pruning, we can prune the pre-trained adversarial examples and selectively free up certain adversarial perturbations. Thereafter, we embed adversarial perturbations in the pruned area, which enhances the dodging performance of the adversarial face examples. The effectiveness of our proposed attack method is demonstrated through our experimental results, showcasing its superior performance.

Herding LLaMaS: Using LLMs as an OS Module. (arXiv:2401.08908v1 [cs.OS])

Authors: Aditya K Kamath, Sujay Yadalam

Computer systems are becoming increasingly heterogeneous with the emergence of new memory technologies and compute devices. GPUs alongside CPUs have become commonplace and CXL is poised to be a mainstay of cloud systems. The operating system is responsible for managing these hardware resources, requiring modification every time a new device is released. Years of research and development are sunk into tuning the OS for high performance with each new heterogeneous device. With the recent explosion in memory technologies and domain-specific accelerators, it would be beneficial to have an OS that could provide high performance for new devices without significant effort.

We propose LLaMaS which can adapt to new devices easily. LLaMaS uses Large Language Models (LLMs) to extract the useful features of new devices from their textual description and uses these features to make operating system decisions at runtime. Adding support to LLaMaS for a new device is as simple as describing the system and new device properties in plaintext.

LLaMaS reduces the burden on system administrators to enable easy integration of new devices into production systems.

Preliminary evaluation using ChatGPT shows that LLMs are capable of extracting device features from text and making correct OS decisions based on those features.

Characterising Gradients for Unsupervised Accuracy Estimation under Distribution Shift. (arXiv:2401.08909v1 [cs.LG])

Authors: Renchunzi Xie, Ambroise Odonnat, Vasilii Feofanov, Ievgen Redko, Jianfeng Zhang, Bo An

Estimating test accuracy without access to the ground-truth test labels under varying test environments is a challenging, yet extremely important problem in the safe deployment of machine learning algorithms. Existing works rely on the information from either the outputs or the extracted features of neural networks to formulate an estimation score correlating with the ground-truth test accuracy. In this paper, we investigate--both empirically and theoretically--how the information provided by the gradients can be predictive of the ground-truth test accuracy even under a distribution shift. Specifically, we use the norm of classification-layer gradients, backpropagated from the cross-entropy loss after only one gradient step over test data. Our key idea is that the model should be adjusted with a higher magnitude of gradients when it does not generalize to the test dataset with a distribution shift. We provide theoretical insights highlighting the main ingredients of such an approach ensuring its empirical success. Extensive experiments conducted on diverse distribution shifts and model structures demonstrate that our method significantly outperforms state-of-the-art algorithms.
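
A hedged sketch of the score follows. Since ground-truth test labels are unavailable, the model's own argmax predictions serve as stand-in targets here, which is an illustrative assumption rather than necessarily the paper's choice; the last parameter group is likewise assumed to be the classification-layer weight:

import torch
import torch.nn.functional as F

def grad_norm_score(model, test_loader):
    # One cross-entropy backward pass over a batch of test inputs; the norm
    # of the classification-layer gradient is the estimation score. Larger
    # norms suggest the model would need bigger adjustments, i.e. worse
    # generalization under the shift.
    model.zero_grad()
    x, _ = next(iter(test_loader))           # labels ignored (unavailable)
    logits = model(x)
    F.cross_entropy(logits, logits.argmax(dim=1)).backward()
    head_weight = list(model.parameters())[-2]   # assumed classifier weight
    return head_weight.grad.norm().item()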

Partial Diacritization: A Context-Contrastive Inference Approach. (arXiv:2401.08919v1 [cs.CL])

Authors: Muhammad ElNokrashy, Badr AlKhamissi

Diacritization plays a pivotal role in improving readability and disambiguating the meaning of Arabic texts. Efforts have so far focused on marking every eligible character (Full Diacritization). Comparatively overlooked, Partial Diacritization (PD) is the selection of a subset of characters to be marked to aid comprehension where needed. Research has indicated that excessive diacritic marks can hinder skilled readers--reducing reading speed and accuracy. We conduct a behavioral experiment and show that partially marked text is often easier to read than fully marked text, and sometimes easier than plain text. In this light, we introduce Context-Contrastive Partial Diacritization (CCPD)--a novel approach to PD which integrates seamlessly with existing Arabic diacritization systems. CCPD processes each word twice, once with context and once without, and diacritizes only the characters with disparities between the two inferences. Further, we introduce novel indicators for measuring partial diacritization quality (SR, PDER, HDER, ERE), essential for establishing this as a machine learning task. Lastly, we introduce TD2, a Transformer variant of an established model which offers a markedly different performance profile on our proposed indicators compared to all other known systems.
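
The CCPD inference rule is compact enough to sketch directly; the snippet below simplifies it to word level (keep a word's marks only where context changes its diacritization, whereas the paper operates per character) and assumes a hypothetical diacritize backend:

def ccpd(words, diacritize):
    # diacritize(list_of_words) -> list of fully diacritized words; any
    # existing Arabic diacritization system could fill this role
    # (hypothetical interface).
    with_ctx = diacritize(words)
    without_ctx = [diacritize([w])[0] for w in words]
    out = []
    for w_ctx, w_solo, plain in zip(with_ctx, without_ctx, words):
        # Keep marks only where the contextual inference disagrees with
        # the context-free one; otherwise leave the word unmarked.
        out.append(w_ctx if w_ctx != w_solo else plain)
    return out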

DeLF: Designing Learning Environments with Foundation Models. (arXiv:2401.08936v1 [cs.AI])

Authors: Aida Afshar, Wenchao Li

Reinforcement learning (RL) offers a capable and intuitive structure for the fundamental sequential decision-making problem. Despite impressive breakthroughs, it can still be difficult to employ RL in practice in many simple applications. In this paper, we try to address this issue by introducing a method for designing the components of the RL environment for a given, user-intended application. We provide an initial formalization for the problem of RL component design, that concentrates on designing a good representation for observation and action space. We propose a method named DeLF: Designing Learning Environments with Foundation Models, that employs large language models to design and codify the user's intended learning scenario. By testing our method on four different learning environments, we demonstrate that DeLF can obtain executable environment codes for the corresponding RL problems.

CEL: A Continual Learning Model for Disease Outbreak Prediction by Leveraging Domain Adaptation via Elastic Weight Consolidation. (arXiv:2401.08940v1 [cs.LG])

Authors: Saba Aslam, Abdur Rasool, Hongyan Wu, Xiaoli Li

Continual learning, the ability of a model to learn over time without forgetting previous knowledge and, therefore, be adaptive to new data, is paramount in dynamic fields such as disease outbreak prediction. Deep neural networks, e.g., LSTMs, are prone to error due to catastrophic forgetting. This study introduces a novel CEL model for continual learning by leveraging domain adaptation via Elastic Weight Consolidation (EWC). This model aims to mitigate the catastrophic forgetting phenomenon in a domain incremental setting. The Fisher Information Matrix (FIM) is constructed with EWC to develop a regularization term that penalizes changes to important parameters, namely, the important previous knowledge. CEL's performance is evaluated on three distinct diseases, Influenza, Mpox, and Measles, with different metrics. The high R-squared values during evaluation and reevaluation outperform the other state-of-the-art models in several contexts, indicating that CEL adapts to incremental data well. CEL's robustness and reliability are underscored by its minimal 65% forgetting rate and 18% higher memory stability compared to existing benchmark studies. This study highlights CEL's versatility in disease outbreak prediction, addressing evolving data with temporal patterns. It offers a valuable model for proactive disease control with accurate, timely predictions.
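
For reference, the EWC regularizer penalizes parameter drift in proportion to estimated importance: with new-task loss $L_{new}(\theta)$, previous optimum $\theta^*$, diagonal Fisher Information Matrix entries $F_i$ (squared gradients of the log-likelihood at $\theta^*$), and strength $\lambda$, training minimizes $L(\theta) = L_{new}(\theta) + \frac{\lambda}{2}\sum_i F_i(\theta_i - \theta_i^*)^2$, so parameters important to earlier outbreak patterns resist change while unimportant ones remain free to adapt.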

AntiPhishStack: LSTM-based Stacked Generalization Model for Optimized Phishing URLs Detection. (arXiv:2401.08947v1 [cs.CR])

Authors: Saba Aslam, Hafsa Aslam, Arslan Manzoor, Chen Hui, Abdur Rasool

The escalating reliance on revolutionary online web services has introduced heightened security risks, with persistent challenges posed by phishing despite extensive security measures. Traditional phishing systems, reliant on machine learning and manual features, struggle with evolving tactics. Recent advances in deep learning offer promising avenues for tackling novel phishing challenges and malicious URLs. This paper introduces a two-phase stack generalized model named AntiPhishStack, designed to detect phishing sites. The model leverages the learning of URLs and character-level TF-IDF features symmetrically, enhancing its ability to combat emerging phishing threats. In Phase I, features are trained on a base machine learning classifier, employing K-fold cross-validation for robust mean prediction. Phase II employs a two-layered stacked-based LSTM network with five adaptive optimizers for dynamic compilation, ensuring premier prediction on these features. Additionally, the symmetrical predictions from both phases are optimized and integrated to train a meta-XGBoost classifier, contributing to a final robust prediction. The significance of this work lies in advancing phishing detection with AntiPhishStack, operating without prior phishing-specific feature knowledge. Experimental validation on two benchmark datasets, comprising benign and phishing or malicious URLs, demonstrates the model's exceptional performance, achieving a notable 96.04% accuracy compared to existing studies. This research adds value to the ongoing discourse on symmetry and asymmetry in information security and provides a forward-thinking solution for enhancing network security in the face of evolving cyber threats.

Towards Off-Policy Reinforcement Learning for Ranking Policies with Human Feedback. (arXiv:2401.08959v1 [cs.LG])

Authors: Teng Xiao, Suhang Wang

Probabilistic learning to rank (LTR) has been the dominating approach for optimizing the ranking metric, but cannot maximize long-term rewards. Reinforcement learning models have been proposed to maximize user long-term rewards by formulating the recommendation as a sequential decision-making problem, but could only achieve inferior accuracy compared to LTR counterparts, primarily due to the lack of online interactions and the characteristics of ranking. In this paper, we propose a new off-policy value ranking (VR) algorithm that can simultaneously maximize user long-term rewards and optimize the ranking metric offline for improved sample efficiency in a unified Expectation-Maximization (EM) framework. We theoretically and empirically show that the EM process guides the learned policy to enjoy the benefit of integrating the future reward and the ranking metric, and to learn without any online interactions. Extensive offline and online experiments demonstrate the effectiveness of our methods.

Cascading Reinforcement Learning. (arXiv:2401.08961v1 [cs.LG])

Authors: Yihan Du, R. Srikant, Wei Chen

Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep, an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. Then, the user examines the list, and clicks the first attractive item (if any), and after that, the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influences of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this fact, we propose a generalized cascading RL framework, which considers the impact of user states and state transition into decisions. In cascading RL, we need to select items not only with large attraction probabilities but also leading to good successor states. This imposes a huge computational challenge due to the combinatorial action space. To tackle this challenge, we delve into the properties of value functions, and design an oracle BestPerm to efficiently find the optimal item list. Equipped with BestPerm, we develop two algorithms CascadingVI and CascadingBPI, which are both computationally-efficient and sample-efficient, and provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice.
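
The cascade click model underlying this framework is easy to state: the user scans the recommended list top-down and clicks the first attractive item, if any. A tiny simulator of that feedback (attraction probabilities hypothetical):

import numpy as np

def cascade_feedback(item_list, attract_prob, rng):
    # User examines items in order and clicks the first attractive one.
    for pos, item in enumerate(item_list):
        if rng.random() < attract_prob[item]:
            return pos, 1.0            # click position, reward
    return None, 0.0                   # examined everything, no click

rng = np.random.default_rng(0)
attract_prob = {0: 0.05, 1: 0.3, 2: 0.1, 3: 0.2}
print(cascade_feedback([1, 3, 2], attract_prob, rng))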

DOO-RE: A dataset of ambient sensors in a meeting room for activity recognition. (arXiv:2401.08962v1 [cs.HC])

Authors: Hyunju Kim, Geon Kim, Taehoon Lee, Kisoo Kim, Dongman Lee

With the advancement of IoT technology, recognizing user activities with machine learning methods is a promising way to provide various smart services to users. High-quality data with privacy protection is essential for deploying such services in the real world. Data streams from surrounding ambient sensors are well suited to this requirement. Existing ambient sensor datasets only cover constrained private spaces, and datasets for public spaces have yet to be explored despite growing research interest. To meet this need, we build a dataset collected from a meeting room equipped with ambient sensors. The dataset, DOO-RE, includes data streams from various ambient sensor types such as Sound and Projector. Each sensor data stream is segmented into activity units, and multiple annotators provide activity labels through a cross-validation annotation process to improve annotation quality. The final dataset covers 9 types of activities. To the best of our knowledge, DOO-RE is the first dataset to support the recognition of both single and group activities in a real meeting room with reliable annotations.

ACT-GAN: Radio map construction based on generative adversarial networks with ACT blocks. (arXiv:2401.08976v1 [cs.LG])

Authors: Chen Qi, Yang Jingjing, Huang Ming, Zhou Qiang

The radio map, serving as a visual representation of electromagnetic spatial characteristics, plays a pivotal role in the assessment of wireless communication networks and radio monitoring coverage. Addressing the low accuracy of current radio map construction methods, this paper presents a novel construction method based on a generative adversarial network (GAN), in which the Aggregated Contextual-Transformation (AOT) block, the Convolutional Block Attention Module (CBAM), and the Transposed Convolution (T-Conv) block are applied to the generator; we name this model ACT-GAN. It significantly improves the reconstruction accuracy and local texture of the radio maps. The performance of ACT-GAN is demonstrated across three different scenarios. Experimental results reveal that in the scenario without sparse discrete observations, the proposed method reduces the root mean square error (RMSE) by 14.6% in comparison to state-of-the-art models. In the scenario with sparse discrete observations, the RMSE is diminished by 13.2%. Furthermore, the predictive results of the proposed model show a more lucid representation of the electromagnetic spatial field distribution. To verify the universality of this model in radio map construction tasks, the scenario of an unknown radio emission source is investigated. The results indicate that the proposed model is robust in radio map construction and accurate in predicting the location of the emission source.

FedLoGe: Joint Local and Generic Federated Learning under Long-tailed Data. (arXiv:2401.08977v1 [cs.LG])

Authors: Zikai Xiao, Zihan Chen, Liyinglan Liu, Yang Feng, Jian Wu, Wanlu Liu, Joey Tianyi Zhou, Howard Hao Yang, Zuozhu Liu

Federated Long-Tailed Learning (Fed-LT), a paradigm wherein data collected from decentralized local clients manifests a globally prevalent long-tailed distribution, has garnered considerable attention in recent times. In the context of Fed-LT, existing works have predominantly centered on addressing the data imbalance issue to enhance the efficacy of the generic global model while neglecting performance at the local level. In contrast, conventional Personalized Federated Learning (pFL) techniques are primarily devised to optimize personalized local models under the presumption of a balanced global data distribution. This paper introduces an approach termed Federated Local and Generic Model Training in Fed-LT (FedLoGe), which enhances both local and generic model performance through the integration of representation learning and classifier alignment within a neural collapse framework. Our investigation reveals the feasibility of employing a shared backbone as a foundational framework for capturing overarching global trends, while concurrently employing individualized classifiers to encapsulate distinct refinements stemming from each client's local features. Building upon this discovery, we establish the Static Sparse Equiangular Tight Frame Classifier (SSE-C), inspired by neural collapse principles, which naturally prunes extraneous noisy features and fosters the acquisition of potent data representations. Furthermore, leveraging insights from the classifier norm patterns of neural collapse under data imbalance, we develop Global and Local Adaptive Feature Realignment (GLA-FR) via an auxiliary global classifier and personalized Euclidean norm transfer to align global features with client preferences. Extensive experimental results on CIFAR-10/100-LT, ImageNet, and iNaturalist demonstrate the advantage of our method over state-of-the-art pFL and Fed-LT approaches.

A GAN-based data poisoning framework against anomaly detection in vertical federated learning. (arXiv:2401.08984v1 [cs.LG])

Authors: Xiaolin Chen, Daoguang Zan, Wei Li, Bei Guan, Yongji Wang

In vertical federated learning (VFL), commercial entities collaboratively train a model while preserving data privacy. However, a malicious participant's poisoning attack may degrade the performance of this collaborative model. The main challenge in achieving the poisoning attack is the absence of access to the server-side top model, leaving the malicious participant without a clear target model. To address this challenge, we introduce an innovative end-to-end poisoning framework, P-GAN. Specifically, the malicious participant initially employs semi-supervised learning to train a surrogate target model. Subsequently, this participant employs a GAN-based method to produce adversarial perturbations that degrade the surrogate target model's performance. Finally, the generator is obtained and tailored for VFL poisoning. Besides, we develop an anomaly detection algorithm based on a deep auto-encoder (DAE), offering a robust defense mechanism for VFL scenarios. Through extensive experiments, we evaluate the efficacy of P-GAN and DAE, and further analyze the factors that influence their performance.

Rigid Protein-Protein Docking via Equivariant Elliptic-Paraboloid Interface Prediction. (arXiv:2401.08986v1 [cs.LG])

Authors: Ziyang Yu, Wenbing Huang, Yang Liu

The study of rigid protein-protein docking plays an essential role in a variety of tasks such as drug design and protein engineering. Recently, several learning-based methods have been proposed for the task, exhibiting much faster docking than traditional computational methods. In this paper, we propose a novel learning-based method called ElliDock, which predicts an elliptic paraboloid to represent the protein-protein docking interface. Specifically, our model estimates elliptic paraboloid interfaces for the two input proteins respectively, and obtains the roto-translation transformation for docking by making the two interfaces coincide. By design, ElliDock is independently equivariant with respect to arbitrary rotations/translations of the proteins, which is an indispensable property for ensuring the generalization of the docking process. Experimental evaluations show that ElliDock achieves the fastest inference time among all compared methods and is strongly competitive with current state-of-the-art learning-based models such as DiffDock-PP and Multimer, particularly for antibody-antigen docking.

Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR. (arXiv:2401.08992v1 [cs.CL])

Authors: Junwen Bai, Bo Li, Qiujia Li, Tara N. Sainath, Trevor Strohman

The end-to-end ASR model is often desired in the streaming multilingual scenario since it is easier to deploy and can benefit from pre-trained speech models such as powerful foundation models. Meanwhile, the heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation, leading to asynchronous peak performance for different languages during training, especially on tail ones. Sometimes the data itself may even become unavailable as a result of enhanced privacy protection. Existing works tend to significantly increase the model size or learn language-specific decoders to accommodate each language separately. In this study, we explore simple yet effective Language-Dependent Adapter (LDA) finetuning under a cascaded Conformer transducer framework enhanced by teacher pseudo-labeling for tail languages in streaming multilingual ASR. The adapter accounts for only 0.4% of the full model per language. It is plugged into the frozen foundation model and is the only trainable module during the finetuning process with noisy student training. The final model merges the adapter parameters from different checkpoints for different languages. The model performance is validated on a challenging multilingual dictation dataset, which includes 39 tail languages across Latin, Greek, Arabic, etc. Our proposed method brings a 12.2% word error rate reduction on average and up to 37.5% on a single locale. Furthermore, we show that our parameter-efficient LDA can match the quality of full-model finetuning, thus greatly alleviating the asynchronous peak performance issue.
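
A residual bottleneck adapter of the kind described, plugged onto a frozen encoder layer, can be sketched in a few lines of PyTorch. The module names and sizes here are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class LanguageAdapter(nn.Module):
        def __init__(self, d_model: int, bottleneck: int = 64):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.down = nn.Linear(d_model, bottleneck)
            self.up = nn.Linear(bottleneck, d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Residual bottleneck: only these few parameters are trained.
            return x + self.up(torch.relu(self.down(self.norm(x))))

    encoder = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    for p in encoder.parameters():
        p.requires_grad = False        # the foundation model stays frozen

    adapter = LanguageAdapter(d_model=512)
    x = torch.randn(2, 50, 512)        # (batch, frames, features)
    out = adapter(encoder(x))          # the adapter is the only trainable path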

MicroNAS: Zero-Shot Neural Architecture Search for MCUs. (arXiv:2401.08996v1 [cs.LG])

Authors: Ye Qiao, Haocheng Xu, Yifan Zhang, Sitao Huang

Neural Architecture Search (NAS) effectively discovers new Convolutional Neural Network (CNN) architectures, particularly for accuracy optimization. However, prior approaches often require resource-intensive training on super networks or extensive architecture evaluations, limiting practical applications. To address these challenges, we propose MicroNAS, a hardware-aware zero-shot NAS framework designed for microcontroller units (MCUs) in edge computing. MicroNAS considers target hardware optimality during the search, utilizing specialized performance indicators to identify optimal neural architectures without high computational costs. Compared to previous works, MicroNAS achieves up to a 1104x improvement in search efficiency and discovers models with over 3.23x faster MCU inference while maintaining similar accuracy.

Attack and Reset for Unlearning: Exploiting Adversarial Noise toward Machine Unlearning through Parameter Re-initialization. (arXiv:2401.08998v1 [cs.LG])

Authors: Yoonhwa Jung, Ikhyun Cho, Shun-Hsiang Hsu, Julia Hockenmaier

With growing concerns surrounding privacy and regulatory compliance, the concept of machine unlearning has gained prominence, aiming to selectively forget or erase specific learned information from a trained model. In response to this critical need, we introduce a novel approach called Attack-and-Reset for Unlearning (ARU). This algorithm leverages meticulously crafted adversarial noise to generate a parameter mask, effectively resetting certain parameters and rendering them unlearnable. ARU outperforms current state-of-the-art results on two facial machine-unlearning benchmark datasets, MUFAC and MUCAC. In particular, we present the steps involved in attacking and masking that strategically filter and re-initialize network parameters biased towards the forget set. Our work represents a significant advancement in rendering data unexploitable to deep learning models through parameter re-initialization, achieved by harnessing adversarial noise to craft a mask.
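
A rough sketch of the attack-and-mask step, as I read the abstract: gradients computed on adversarially noised forget-set inputs select the parameters to re-initialize. The toy model, the noise scale, and the 5% reset ratio are assumptions for illustration.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    forget_x = torch.randn(16, 32)
    forget_y = torch.randint(0, 10, (16,))

    # "Attack": perturb forget-set inputs with adversarial-style noise.
    noise = 0.1 * torch.randn_like(forget_x)
    loss = nn.functional.cross_entropy(model(forget_x + noise), forget_y)
    loss.backward()

    # "Reset": re-initialize the parameters with the largest gradients, taken
    # here to be the ones most biased towards the forget set.
    with torch.no_grad():
        for p in model.parameters():
            k = max(1, int(0.05 * p.numel()))
            thresh = p.grad.abs().flatten().topk(k).values.min()
            mask = p.grad.abs() >= thresh
            p[mask] = 0.02 * torch.randn_like(p)[mask]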

Continuous Time Continuous Space Homeostatic Reinforcement Learning (CTCS-HRRL) : Towards Biological Self-Autonomous Agent. (arXiv:2401.08999v1 [cs.AI])

Authors: Hugo Laurencon, Yesoda Bhargava, Riddhi Zantye, Charbel-Raphaël Ségerie, Johann Lussange, Veeky Baths, Boris Gutkin

Homeostasis is a biological process by which living beings maintain their internal balance. Previous research suggests that homeostasis is a learned behaviour. The recently introduced Homeostatic Regulated Reinforcement Learning (HRRL) framework attempts to explain this learned homeostatic behavior by linking Drive Reduction Theory and Reinforcement Learning. This linkage has been proven in discrete time and space, but not in continuous time and space. In this work, we advance the HRRL framework to a continuous time-space environment and validate the CTCS-HRRL (Continuous Time Continuous Space HRRL) framework. We achieve this by designing a model that mimics the homeostatic mechanisms of a real-world biological agent. This model uses the Hamilton-Jacobi-Bellman equation, together with neural-network-based function approximation and Reinforcement Learning. Through a simulation-based experiment, we demonstrate the efficacy of this model and uncover evidence that the agent can dynamically choose policies that favor homeostasis in a continuously changing internal-state milieu. The results of our experiments demonstrate that the agent learns homeostatic behaviour in a CTCS environment, making CTCS-HRRL a promising framework for modelling animal dynamics and decision-making.

Augmenting Math Word Problems via Iterative Question Composing. (arXiv:2401.09003v1 [cs.CL])

Authors: Haoxiong Liu, Andrew Chi-Chih Yao

Despite recent progress in improving the mathematical reasoning ability of large language models (LLMs), solving competition-level math problems without the use of external tools remains challenging for open-source LLMs. In this work, we introduce the MMIQC dataset, a mixture of processed web data and synthetic question-response pairs, to equip base models with better mathematical reasoning skills. Mistral-7B-MMIQC, the model obtained by fine-tuning Mistral-7B (arXiv:2310.06825) on MMIQC, achieves 36.0\% accuracy on MATH (arXiv:2103.03874), 5.8\% higher than the previous (model size $\sim$7B) SOTA. Our experiments also show that a large part of the improvement is attributable to our novel augmentation method IQC (Iterative Question Composing), in which we iteratively ask an LLM to compose new questions from the given seed problems and perform rejection sampling with another LLM. MMIQC has been released at https://huggingface.co/datasets/Vivacem/MMIQC.
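
The IQC loop itself is simple to express. In the sketch below, ask_llm and check_llm are stub callables I introduce to stand in for the two LLMs (the composer and the rejection-sampling judge) that the paper actually uses.

    def iterative_question_composing(ask_llm, check_llm, seeds,
                                     rounds=3, threshold=0.5):
        pool = list(seeds)
        for _ in range(rounds):
            fresh = []
            for q in pool:
                candidate = ask_llm(f"Compose a new question based on: {q}")
                # Rejection sampling: keep only candidates the judge accepts.
                if check_llm(candidate) >= threshold:
                    fresh.append(candidate)
            pool.extend(fresh)
        return pool

    demo = iterative_question_composing(lambda p: p + " (variant)",
                                        lambda c: 1.0, ["What is 2 + 3?"])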

Inductive Models for Artificial Intelligence Systems are Insufficient without Good Explanations. (arXiv:2401.09011v1 [cs.LG])

Authors: Udesh Habaraduwa

This paper discusses the limitations of machine learning (ML), particularly deep artificial neural networks (ANNs), which are effective at approximating complex functions but often lack transparency and explanatory power. It highlights the `problem of induction': the philosophical issue that past observations may not necessarily predict future events, a challenge that ML models face when encountering new, unseen data. The paper argues for the importance of not just making predictions but also providing good explanations, a feature that current models often fail to deliver. It suggests that for AI to progress, we must seek models that offer insights and explanations, not just predictions.

Residual Alignment: Uncovering the Mechanisms of Residual Networks. (arXiv:2401.09018v1 [cs.LG])

Authors: Jianing Li, Vardan Papyan

The ResNet architecture has been widely adopted in deep learning due to its significant boost to performance through the use of simple skip connections, yet the underlying mechanisms leading to its success remain largely unknown. In this paper, we conduct a thorough empirical study of the ResNet architecture in classification tasks by linearizing its constituent residual blocks using Residual Jacobians and measuring their singular value decompositions. Our measurements reveal a process called Residual Alignment (RA) characterized by four properties:

(RA1) intermediate representations of a given input are equispaced on a line, embedded in high dimensional space, as observed by Gai and Zhang [2021];

(RA2) top left and right singular vectors of Residual Jacobians align with each other and across different depths;

(RA3) Residual Jacobians are at most rank C for fully-connected ResNets, where C is the number of classes; and

(RA4) top singular values of Residual Jacobians scale inversely with depth.

RA consistently occurs in models that generalize well, in both fully-connected and convolutional architectures, across various depths and widths, for varying numbers of classes, on all tested benchmark datasets, but ceases to occur once the skip connections are removed. It also provably occurs in a novel mathematical model we propose. This phenomenon reveals a strong alignment between residual branches of a ResNet (RA2+4), imparting a highly rigid geometric structure to the intermediate representations as they progress linearly through the network (RA1) up to the final layer, where they undergo Neural Collapse.
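
The measurement behind these observations can be reproduced on a toy block: linearize a residual branch by taking its Jacobian at an input and inspect the singular value decomposition. The two-layer block below is a stand-in for a trained ResNet's residual branch, not the paper's experimental setup.

    import torch
    import torch.nn as nn

    block = nn.Sequential(nn.Linear(16, 16), nn.Tanh(), nn.Linear(16, 16))
    x = torch.randn(16)

    # Residual Jacobian: derivative of the residual branch's output with
    # respect to its input, evaluated at x.
    J = torch.autograd.functional.jacobian(lambda z: block(z), x)
    U, S, Vh = torch.linalg.svd(J)

    # RA2 compares the top singular vectors across depths; RA3/RA4 inspect
    # the rank profile and how the top singular values scale with depth.
    print(S[:4])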

Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation. (arXiv:2401.09031v1 [cs.LG])

Authors: Tong Xie, Haoyu Li, Andrew Bai, Cho-Jui Hsieh

Data attribution methods trace model behavior back to its training dataset, offering an effective approach to better understand ``black-box'' neural networks. While prior research has established quantifiable links between model output and training data in diverse settings, interpreting diffusion model outputs in relation to training samples remains underexplored. In particular, diffusion models operate over a sequence of timesteps instead of the instantaneous input-output relationships of previous contexts, posing a significant challenge to extending existing frameworks to diffusion models directly. Notably, we present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples' loss gradient norms are highly dependent on the timestep. This trend leads to a prominent bias in influence estimation, and is particularly noticeable for samples trained on large-norm-inducing timesteps, causing them to appear generally influential. To mitigate this effect, we introduce Diffusion-ReTrac as a re-normalized adaptation that enables the retrieval of training samples more targeted to the test sample of interest, facilitating a localized measurement of influence and a considerably more intuitive visualization. We demonstrate the efficacy of our approach through various evaluation metrics and auxiliary tasks, reducing the amount of generally influential samples to $\frac{1}{3}$ of its original quantity.
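
The timestep-dependent gradient norms at the heart of this bias are easy to observe on a toy denoiser. The forward process and the architecture below are deliberately simplistic stand-ins, not the paper's setup.

    import torch
    import torch.nn as nn

    denoiser = nn.Sequential(nn.Linear(8 + 1, 64), nn.ReLU(), nn.Linear(64, 8))
    x0 = torch.randn(4, 8)

    for t in [0.1, 0.5, 0.9]:
        noise = torch.randn_like(x0)
        xt = (1 - t) * x0 + t * noise                  # crude forward process
        inp = torch.cat([xt, torch.full((4, 1), t)], dim=1)
        loss = ((denoiser(inp) - noise) ** 2).mean()
        grads = torch.autograd.grad(loss, list(denoiser.parameters()))
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        print(t, norm.item())  # TracIn-style influence is weighted by these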

Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior. (arXiv:2401.09050v1 [cs.CV])

Authors: Zike Wu, Pan Zhou, Xuanyu Yi, Xiaoding Yuan, Hanwang Zhang

Score distillation sampling (SDS) and its variants have greatly boosted the development of text-to-3D generation, but remain vulnerable to geometry collapse and poor textures. To solve this issue, we first analyze SDS in depth and find that its distillation sampling process in fact corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample which then serves as guidance to optimize a 3D model. However, the randomness in SDE sampling often leads to a diverse and unpredictable sample which is not always less noisy, and thus is not a consistently correct guidance, explaining the vulnerability of SDS. Since for any SDE there always exists an ordinary differential equation (ODE) whose trajectory sampling can deterministically and consistently converge to the same desired target point, we propose a novel and effective "Consistent3D" method that explores the ODE deterministic sampling prior for text-to-3D generation. Specifically, at each training iteration, given a rendered image from a 3D model, we first estimate its desired 3D score function using a pre-trained 2D diffusion model, and build an ODE for trajectory sampling. Next, we design a consistency distillation sampling loss which samples along the ODE trajectory to generate two adjacent samples and uses the less noisy sample to guide the noisier one, distilling the deterministic prior into the 3D model. Experimental results show the efficacy of our Consistent3D in generating high-fidelity and diverse 3D objects and large-scale scenes, as shown in Fig. 1. The code is available at https://github.com/sail-sg/Consistent3D.

Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding. (arXiv:2401.09067v1 [cs.LG])

Authors: Depeng Li, Tianqi Wang, Junwei Chen, Qining Ren, Kenji Kawaguchi, Zhigang Zeng

Deep neural networks are susceptible to catastrophic forgetting when trained on sequential tasks. Various continual learning (CL) methods often rely on exemplar buffers and/or network expansion for balancing model stability and plasticity, which, however, compromises their practical value due to privacy and memory concerns. Instead, this paper considers a strict yet realistic setting, where the training data from previous tasks is unavailable and the model size remains relatively constant during sequential training. To achieve these desiderata, we propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion. This is achieved by the synergy between two key components: HSIC-Bottleneck Orthogonalization (HBO), which implements non-overwritten parameter updates mediated by the Hilbert-Schmidt independence criterion in an orthogonal space, and EquiAngular Embedding (EAE), which enhances decision boundary adaptation between old and new tasks with predefined basis vectors. Extensive experiments demonstrate that our method achieves competitive accuracy, despite requiring zero exemplar buffer and only 1.02x the base model size.

DTMM: Deploying TinyML Models on Extremely Weak IoT Devices with Pruning. (arXiv:2401.09068v1 [cs.LG])

Authors: Lixiang Han, Zhen Xiao, Zhenjiang Li

DTMM is a library designed for efficient deployment and execution of machine learning models on weak IoT devices such as microcontroller units (MCUs). The motivation for designing DTMM comes from the emerging field of tiny machine learning (TinyML), which explores extending the reach of machine learning to many low-end IoT devices to achieve ubiquitous intelligence. Due to the weak capability of embedded devices, it is necessary to compress models by pruning enough weights before deploying. Although pruning has been studied extensively on many computing platforms, two key issues with pruning methods are exacerbated on MCUs: models need to be deeply compressed without significantly compromising accuracy, and they should perform efficiently after pruning. Current solutions only achieve one of these objectives, but not both. In this paper, we find that pruned models have great potential for efficient deployment and execution on MCUs. Therefore, we propose DTMM with pruning unit selection, pre-execution pruning optimizations, runtime acceleration, and post-execution low-cost storage to fill the gap for efficient deployment and execution of pruned models. It can be integrated into commercial ML frameworks for practical deployment, and a prototype system has been developed. Extensive experiments on various models show promising gains compared to state-of-the-art methods.

Rethinking Spectral Graph Neural Networks with Spatially Adaptive Filtering. (arXiv:2401.09071v1 [cs.LG])

Authors: Jingwei Guo, Kaizhu Huang, Xinping Yi, Zixian Su, Rui Zhang

Whilst spectral Graph Neural Networks (GNNs) are theoretically well-founded in the spectral domain, their practical reliance on polynomial approximation implies a profound linkage to the spatial domain. As previous studies rarely examine spectral GNNs from the spatial perspective, their spatial-domain interpretability remains elusive, e.g., what information is essentially encoded by spectral GNNs in the spatial domain? In this paper, to answer this question, we establish a theoretical connection between spectral filtering and spatial aggregation, unveiling an intrinsic interaction that spectral filtering implicitly leads the original graph to an adapted new graph, explicitly computed for spatial aggregation. Both theoretical and empirical investigations reveal that the adapted new graph not only exhibits non-locality but also accommodates signed edge weights to reflect label consistency between nodes. These findings thus highlight the interpretable role of spectral GNNs in the spatial domain and inspire us to rethink graph spectral filters beyond the fixed-order polynomials, which neglect global information. Built upon the theoretical findings, we revisit the state-of-the-art spectral GNNs and propose a novel Spatially Adaptive Filtering (SAF) framework, which leverages the adapted new graph by spectral filtering for an auxiliary non-local aggregation. Notably, our proposed SAF comprehensively models both node similarity and dissimilarity from a global perspective, therefore alleviating persistent deficiencies of GNNs related to long-range dependencies and graph heterophily. Extensive experiments over 13 node classification benchmarks demonstrate the superiority of our proposed framework to the state-of-the-art models.

Fixed-Budget Differentially Private Best Arm Identification. (arXiv:2401.09073v1 [cs.LG])

Authors: Zhirui Chen, P. N. Karthik, Yeow Meng Chee, Vincent Y. F. Tan

We study best arm identification (BAI) in linear bandits in the fixed-budget regime under differential privacy constraints, when the arm rewards are supported on the unit interval. Given a finite budget $T$ and a privacy parameter $\varepsilon>0$, the goal is to minimise the error probability in finding the arm with the largest mean after $T$ sampling rounds, subject to the constraint that the policy of the decision maker satisfies a certain {\em $\varepsilon$-differential privacy} ($\varepsilon$-DP) constraint. We construct a policy satisfying the $\varepsilon$-DP constraint (called {\sc DP-BAI}) by proposing the principle of {\em maximum absolute determinants}, and derive an upper bound on its error probability. Furthermore, we derive a minimax lower bound on the error probability, and demonstrate that the lower and the upper bounds decay exponentially in $T$, with exponents in the two bounds matching order-wise in (a) the sub-optimality gaps of the arms, (b) $\varepsilon$, and (c) the problem complexity that is expressible as the sum of two terms, one characterising the complexity of standard fixed-budget BAI (without privacy constraints), and the other accounting for the $\varepsilon$-DP constraint. Additionally, we present some auxiliary results that contribute to the derivation of the lower bound on the error probability. These results, we posit, may be of independent interest and could prove instrumental in proving lower bounds on error probabilities in several other bandit problems. Whereas prior works provide results for BAI in the fixed-budget regime without privacy constraints or in the fixed-confidence regime with privacy constraints, our work fills the gap in the literature by providing the results for BAI in the fixed-budget regime under the $\varepsilon$-DP constraint.

Code Simulation Challenges for Large Language Models. (arXiv:2401.09074v1 [cs.LG])

Authors: Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge

We investigate the extent to which Large Language Models (LLMs) can simulate the execution of computer code and algorithms. We begin by looking at straight-line programs, and show that current LLMs demonstrate poor performance even on such simple programs -- performance rapidly degrades with the length of the code. We then investigate the ability of LLMs to simulate programs that contain critical paths and redundant instructions. We also go beyond straight-line program simulation with sorting algorithms and nested loops, and we show that the computational complexity of a routine directly affects the ability of an LLM to simulate its execution. We observe that LLMs execute instructions sequentially and with a low error margin only for short programs or standard procedures. LLMs' code simulation is in tension with their pattern recognition and memorisation capabilities: on tasks where memorisation is detrimental, we propose a novel prompting method to simulate code execution line by line. Empirically, our new Chain of Simulation (CoSm) method improves on the standard Chain of Thought prompting approach by avoiding the pitfalls of memorisation.
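
One plausible rendering of line-by-line simulation prompting, paraphrasing the idea rather than quoting the paper's exact CoSm template:

    program = "x = 3\ny = x * 2\nx = y - 1"
    prompt = (
        "Simulate this program one line at a time. After each line, write "
        "the value of every variable before moving on. Do not skip ahead or "
        "guess the final state from memory.\n\n" + program
    )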

RWKV-TS: Beyond Traditional Recurrent Neural Network for Time Series Tasks. (arXiv:2401.09093v1 [cs.LG])

Authors: Haowen Hou, F. Richard Yu

Traditional Recurrent Neural Network (RNN) architectures, such as LSTM and GRU, have historically held prominence in time series tasks. However, they have recently seen a decline in their dominant position across various time series tasks. As a result, recent advancements in time series forecasting have seen a notable shift away from RNNs towards alternative architectures such as Transformers, MLPs, and CNNs. To go beyond the limitations of traditional RNNs, we design an efficient RNN-based model for time series tasks, named RWKV-TS, with three distinctive features: (i) A novel RNN architecture characterized by $O(L)$ time complexity and memory usage. (ii) An enhanced ability to capture long-term sequence information compared to traditional RNNs. (iii) High computational efficiency coupled with the capacity to scale up effectively. Through extensive experimentation, our proposed RWKV-TS model demonstrates competitive performance when compared to state-of-the-art Transformer-based or CNN-based models. Notably, RWKV-TS exhibits not only comparable performance but also demonstrates reduced latency and memory utilization. The success of RWKV-TS encourages further exploration and innovation in leveraging RNN-based approaches within the domain of Time Series. The combination of competitive performance, low latency, and efficient memory usage positions RWKV-TS as a promising avenue for future research in time series tasks. Code is available at:\href{https://github.com/howard-hou/RWKV-TS}{ https://github.com/howard-hou/RWKV-TS}
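
The $O(L)$ claim comes from a linear recurrence with constant-size state. A minimal cell in that spirit, not the released RWKV-TS code, looks like this:

    import torch
    import torch.nn as nn

    class LinearRecurrence(nn.Module):
        def __init__(self, d: int):
            super().__init__()
            self.decay = nn.Parameter(torch.zeros(d))  # per-channel forgetting
            self.inp = nn.Linear(d, d)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, length, d); one pass over time, O(1) state size.
            a = torch.sigmoid(self.decay)
            state = torch.zeros(x.size(0), x.size(2))
            outs = []
            for t in range(x.size(1)):
                state = a * state + (1 - a) * self.inp(x[:, t])
                outs.append(state)
            return torch.stack(outs, dim=1)

    y = LinearRecurrence(32)(torch.randn(4, 96, 32))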

Understanding Heterophily for Graph Neural Networks. (arXiv:2401.09125v1 [cs.LG])

Authors: Junfu Wang, Yuanfang Guo, Liang Yang, Yunhong Wang

Graphs with heterophily have been regarded as challenging scenarios for Graph Neural Networks (GNNs), where nodes are connected with dissimilar neighbors through various patterns. In this paper, we present theoretical understandings of the impacts of different heterophily patterns for GNNs by incorporating the graph convolution (GC) operations into fully connected networks via the proposed Heterophilous Stochastic Block Models (HSBM), a general random graph model that can accommodate diverse heterophily patterns. Firstly, we show that by applying a GC operation, the separability gains are determined by two factors, i.e., the Euclidean distance of the neighborhood distributions and $\sqrt{\mathbb{E}\left[\operatorname{deg}\right]}$, where $\mathbb{E}\left[\operatorname{deg}\right]$ is the averaged node degree. It reveals that the impact of heterophily on classification needs to be evaluated alongside the averaged node degree. Secondly, we show that the topological noise has a detrimental impact on separability, which is equivalent to degrading $\mathbb{E}\left[\operatorname{deg}\right]$. Finally, when applying multiple GC operations, we show that the separability gains are determined by the normalized distance of the $l$-powered neighborhood distributions. It indicates that the nodes still possess separability as $l$ goes to infinity in a wide range of regimes. Extensive experiments on both synthetic and real-world data verify the effectiveness of our theory.

Asynchronous Local-SGD Training for Language Modeling. (arXiv:2401.09135v1 [cs.LG])

Authors: Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato

Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of {\it asynchronous} Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that utilizes a delayed Nesterov momentum update and adjusts the workers' local training steps based on their computation speed. This approach, evaluated with models up to 150M parameters on the C4 dataset, matches the performance of synchronous Local-SGD in terms of perplexity per update step, and significantly surpasses it in terms of wall clock time.
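
A server-side sketch of the delayed-momentum idea, as I read the abstract: each worker delta is applied immediately, while the Nesterov-style momentum term is folded in only every few updates so that stale gradients do not compound through the momentum. All hyper-parameters are placeholders, and the paper additionally adapts each worker's local step count to its speed.

    import torch

    def server_step(params, delta, momentum, buffer, step,
                    lr=0.7, beta=0.9, delay=4):
        # delta: parameter change sent by one (possibly stale) worker.
        buffer.add_(delta)
        params.add_(lr * delta)                 # apply the delta right away
        if step % delay == 0:
            # Delayed Nesterov: momentum only sees the aggregated deltas.
            momentum.mul_(beta).add_(buffer)
            params.add_(lr * beta * momentum)
            buffer.zero_()
        return params

    p, m, buf = torch.zeros(10), torch.zeros(10), torch.zeros(10)
    for step in range(1, 9):
        server_step(p, 0.01 * torch.randn(10), m, buf, step)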

ADCNet: a unified framework for predicting the activity of antibody-drug conjugates. (arXiv:2401.09176v1 [cs.LG])

Authors: Liye Chen, Biaoshun Li, Yihao Chen, Mujie Lin, Shipeng Zhang, Chenxin Li, Yu Pang, Ling Wang

Antibody-drug conjugates (ADCs) have revolutionized the field of cancer treatment in the era of precision medicine due to their ability to precisely target cancer cells and release highly effective drugs. Nevertheless, the rational design of ADCs is very difficult because the relationship between their structures and activities is poorly understood. In the present study, we introduce a unified deep learning framework called ADCNet to help design potential ADCs. ADCNet tightly integrates the protein representation learning language model ESM-2 and the small-molecule representation learning language model FG-BERT to achieve activity prediction by learning meaningful features from the antigen and antibody protein sequences of an ADC, the SMILES strings of the linker and payload, and the drug-antibody ratio (DAR) value. Based on a carefully designed and manually tailored ADC data set, extensive evaluation results reveal that ADCNet performs best on the test set compared to baseline machine learning models across all evaluation metrics. For example, it achieves an average prediction accuracy of 87.12%, a balanced accuracy of 0.8689, and an area under the receiver operating characteristic curve of 0.9293 on the test set. In addition, cross-validation, ablation experiments, and external independent testing further confirm the stability, superiority, and robustness of the ADCNet architecture. For the convenience of the community, we develop the first online platform (https://ADCNet.idruglab.cn) for the prediction of ADC activity based on the optimal ADCNet model, and the source code is publicly available at https://github.com/idrugLab/ADCNet.

Unsupervised Multiple Domain Translation through Controlled Disentanglement in Variational Autoencoder. (arXiv:2401.09180v1 [cs.LG])

Authors: Almudévar Antonio, Mariotte Théo, Ortega Alfonso, Tahon Marie

Unsupervised Multiple Domain Translation is the task of transforming data from one domain to other domains without paired data to train the systems. Typically, methods based on Generative Adversarial Networks (GANs) are used to address this task. However, our proposal relies exclusively on a modified version of a Variational Autoencoder. This modification consists of the use of two latent variables disentangled in a controlled way by design. One of these latent variables is constrained to depend exclusively on the domain, while the other must depend on the remaining variability factors of the data. Additionally, the conditions imposed on the domain latent variable allow for better control and understanding of the latent space. We empirically demonstrate that our approach works on different vision datasets, improving on the performance of other well-known methods. Finally, we show that, indeed, one of the latent variables stores all the information related to the domain while the other contains hardly any domain information.
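
A condensed sketch of the two-latent design (the dimensions and the auxiliary head are my choices for illustration, not the paper's architecture): z_d is tied to the domain through a classifier, z_c absorbs the remaining variability, and translation amounts to swapping z_d.

    import torch
    import torch.nn as nn

    class TwoLatentVAE(nn.Module):
        def __init__(self, dim=784, zd=8, zc=32, n_domains=3):
            super().__init__()
            self.enc = nn.Linear(dim, 2 * (zd + zc))     # means and log-vars
            self.dec = nn.Linear(zd + zc, dim)
            self.domain_head = nn.Linear(zd, n_domains)  # ties z_d to domain
            self.zd, self.zc = zd, zc

        def forward(self, x):
            mu, logvar = self.enc(x).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            z_d, z_c = z.split([self.zd, self.zc], dim=-1)
            # Translation would replace z_d with the target domain's code.
            recon = self.dec(torch.cat([z_d, z_c], dim=-1))
            return recon, self.domain_head(z_d)

    recon, domain_logits = TwoLatentVAE()(torch.rand(4, 784))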

Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer. (arXiv:2401.09181v1 [cs.LG])

Authors: Junhao Zheng, Qianli Ma, Zhen Liu, Binquan Wu, Huawen Feng

Multimodal Continual Instruction Tuning (MCIT) enables Multimodal Large Language Models (MLLMs) to meet continuously emerging requirements without expensive retraining. MCIT faces two major obstacles: catastrophic forgetting (where old knowledge is forgotten) and negative forward transfer (where the performance of future tasks is degraded). Although existing methods have greatly alleviated catastrophic forgetting, they still suffer from negative forward transfer. By performing singular value decomposition (SVD) on input embeddings, we discover a large discrepancy in different input embeddings. The discrepancy results in the model learning irrelevant information for old and pre-trained tasks, which leads to catastrophic forgetting and negative forward transfer. To address these issues, we propose Fwd-Prompt, a prompt-based method projecting prompt gradient to the residual space to minimize the interference between tasks and to the pre-trained subspace for reusing pre-trained knowledge. Our experiments demonstrate that Fwd-Prompt achieves state-of-the-art performance while updating fewer parameters and requiring no old samples. Our research sheds light on the potential of continuously adapting MLLMs to new tasks under the instruction tuning paradigm and encourages future studies to explore MCIT. The code will soon be publicly available.

A Two-Scale Complexity Measure for Deep Learning Models. (arXiv:2401.09184v1 [stat.ML])

Authors: Massimiliano Datres, Gian Paolo Leonardi, Alessio Figalli, David Sutter

We introduce a novel capacity measure 2sED for statistical models based on the effective dimension. The new quantity provably bounds the generalization error under mild assumptions on the model. Furthermore, simulations on standard data sets and popular model architectures show that 2sED correlates well with the training error. For Markovian models, we show how to efficiently approximate 2sED from below through a layerwise iterative approach, which allows us to tackle deep learning models with a large number of parameters. Simulation results suggest that the approximation is good for different prominent models and data sets.

Exploring the Role of Convolutional Neural Networks (CNN) in Dental Radiography Segmentation: A Comprehensive Systematic Literature Review. (arXiv:2401.09190v1 [cs.CV])

Authors: Walid Brahmi, Imen Jdey, Fadoua Drira

In the field of dentistry, there is a growing demand for increased precision in diagnostic tools, with a specific focus on advanced imaging techniques such as computed tomography, cone beam computed tomography, magnetic resonance imaging, ultrasound, and traditional intra-oral periapical X-rays. Deep learning has emerged as a pivotal tool in this context, enabling the implementation of automated segmentation techniques crucial for extracting essential diagnostic data. This integration of cutting-edge technology addresses the urgent need for effective management of dental conditions, which, if left undetected, can have a significant impact on human health. The impressive track record of deep learning across various domains, including dentistry, underscores its potential to revolutionize early detection and treatment of oral health issues. Objective: Having demonstrated significant results in diagnosis and prediction, deep convolutional neural networks (CNNs) represent an emerging field of multidisciplinary research. The goals of this study were to provide a concise overview of the state of the art, standardize the current debate, and establish baselines for future research. Method: In this study, a systematic literature review is employed as a methodology to identify and select relevant studies that specifically investigate the deep learning technique for dental imaging analysis. This study elucidates the methodological approach, including the systematic collection of data, statistical analysis, and subsequent dissemination of outcomes. Conclusion: This work demonstrates how Convolutional Neural Networks (CNNs) can be employed to analyze images, serving as effective tools for detecting dental pathologies. Although this research acknowledged some limitations, CNNs utilized for segmenting and categorizing teeth exhibited their highest level of performance overall.

An Optimal Transport Approach for Computing Adversarial Training Lower Bounds in Multiclass Classification. (arXiv:2401.09191v1 [cs.LG])

Authors: Nicolas Garcia Trillos, Matt Jacobs, Jakwang Kim, Matthew Werenski

Despite the success of deep learning-based algorithms, it is widely known that neural networks may fail to be robust. A popular paradigm to enforce robustness is adversarial training (AT), however, this introduces many computational and theoretical difficulties. Recent works have developed a connection between AT in the multiclass classification setting and multimarginal optimal transport (MOT), unlocking a new set of tools to study this problem. In this paper, we leverage the MOT connection to propose computationally tractable numerical algorithms for computing universal lower bounds on the optimal adversarial risk and identifying optimal classifiers. We propose two main algorithms based on linear programming (LP) and entropic regularization (Sinkhorn). Our key insight is that one can harmlessly truncate the higher order interactions between classes, preventing the combinatorial run times typically encountered in MOT problems. We validate these results with experiments on MNIST and CIFAR-$10$, which demonstrate the tractability of our approach.

Preparing Lessons for Progressive Training on Language Models. (arXiv:2401.09192v1 [cs.LG])

Authors: Yu Pan, Ye Yuan, Yichun Yin, Jiaxin Shi, Zenglin Xu, Ming Zhang, Lifeng Shang, Xin Jiang, Qun Liu

The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prep\textbf{a}res lessons for ex\textbf{p}anding \textbf{o}perations by \textbf{l}earning high-\textbf{l}ayer functi\textbf{o}nality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.

GNN-LoFI: a Novel Graph Neural Network through Localized Feature-based Histogram Intersection. (arXiv:2401.09193v1 [cs.LG])

Authors: Alessandro Bicciato, Luca Cosmo, Giorgia Minello, Luca Rossi, Andrea Torsello

Graph neural networks are increasingly becoming the framework of choice for graph-based machine learning. In this paper, we propose a new graph neural network architecture that substitutes classical message passing with an analysis of the local distribution of node features. To this end, we extract the distribution of features in the egonet for each local neighbourhood and compare them against a set of learned label distributions by taking the histogram intersection kernel. The similarity information is then propagated to other nodes in the network, effectively creating a message passing-like mechanism where the message is determined by the ensemble of the features. We perform an ablation study to evaluate the network's performance under different choices of its hyper-parameters. Finally, we test our model on standard graph classification and regression benchmarks, and we find that it outperforms widely used alternative approaches, including both graph kernels and graph neural networks.
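
The core local operation is a histogram intersection between an egonet's feature distribution and a learned label distribution. A toy numpy version follows; the bin counts and the "learned" histogram are made-up values, not the authors' implementation.

    import numpy as np

    def histogram_intersection(h1, h2):
        return np.minimum(h1, h2).sum()

    # Scalar features of a node's egonet, binned into a normalized histogram.
    ego_feats = np.random.rand(6)                   # 6 neighbours
    ego_hist, _ = np.histogram(ego_feats, bins=4, range=(0, 1))
    ego_hist = ego_hist / ego_hist.sum()

    learned_hist = np.array([0.1, 0.4, 0.4, 0.1])   # learned label distribution
    similarity = histogram_intersection(ego_hist, learned_hist)
    # This similarity plays the role of the "message" sent to neighbours.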

Space and Time Continuous Physics Simulation From Partial Observations. (arXiv:2401.09198v1 [cs.LG])

Authors: Janny Steeven, Nadri Madiha, Digne Julie, Wolf Christian

Modern techniques for physical simulations rely on numerical schemes and mesh-refinement methods to address trade-offs between precision and complexity, but these handcrafted solutions are tedious and require high computational power. Data-driven methods based on large-scale machine learning promise high adaptivity by integrating long-range dependencies more directly and efficiently. In this work, we focus on fluid dynamics and address the shortcomings of a large part of the literature, which is based on fixed support for computations and predictions in the form of regular or irregular grids. We propose a novel setup to perform predictions in a continuous spatial and temporal domain while being trained on sparse observations. We formulate the task as a double observation problem and propose a solution with two interlinked dynamical systems defined on, respectively, the sparse positions and the continuous domain, which allows forecasting and interpolating a solution from the initial condition. Our practical implementation involves recurrent GNNs and a spatio-temporal attention observer capable of interpolating the solution at arbitrary locations. Our model not only generalizes to new initial conditions (as standard auto-regressive models do) but also performs evaluation at arbitrary space and time locations. We evaluate on three standard datasets in fluid dynamics and compare to strong baselines, which are outperformed both in classical settings and in the extended new task requiring continuous predictions.

A Real-Time Lyrics Alignment System Using Chroma And Phonetic Features For Classical Vocal Performance. (arXiv:2401.09200v1 [cs.SD])

Authors: Jiyun Park, Sangeon Yong, Taegyun Kwon, Juhan Nam

The goal of real-time lyrics alignment is to take live singing audio as input and to pinpoint the exact position within the given lyrics on the fly. The task can benefit real-world applications such as the automatic subtitling of live concerts or operas. However, designing a real-time model poses a great challenge due to the constraints of only using past input and operating within minimal latency. Furthermore, due to the lack of datasets for real-time lyrics alignment, previous studies have mostly evaluated on private in-house datasets, resulting in a lack of standard evaluation methods. This paper presents a real-time lyrics alignment system for classical vocal performances with two contributions. First, we improve the lyrics alignment algorithm by finding an optimal combination of chromagram and phonetic posteriorgram (PPG) features, which capture the melodic and phonetic features of the singing voice, respectively. Second, we recast the Schubert Winterreise Dataset (SWD), which contains multiple performance renditions of the same pieces, as an evaluation set for real-time lyrics alignment.

A Characterization Theorem for Equivariant Networks with Point-wise Activations. (arXiv:2401.09235v1 [cs.LG])

Authors: Marco Pacini, Xiaowen Dong, Bruno Lepri, Gabriele Santin

Equivariant neural networks have shown improved performance, expressiveness and sample complexity on symmetrical domains. But for some specific symmetries, representations, and choice of coordinates, the most common point-wise activations, such as ReLU, are not equivariant, hence they cannot be employed in the design of equivariant neural networks. The theorem we present in this paper describes all possible combinations of finite-dimensional representations, choice of coordinates and point-wise activations to obtain an exactly equivariant layer, generalizing and strengthening existing characterizations. Notable cases of practical relevance are discussed as corollaries. Indeed, we prove that rotation-equivariant networks can only be invariant, as it happens for any network which is equivariant with respect to connected compact groups. Then, we discuss implications of our findings when applied to important instances of exactly equivariant networks. First, we completely characterize permutation equivariant networks such as Invariant Graph Networks with point-wise nonlinearities and their geometric counterparts, highlighting a plethora of models whose expressive power and performance are still unknown. Second, we show that feature spaces of disentangled steerable convolutional neural networks are trivial representations.

Classification and Reconstruction Processes in Deep Predictive Coding Networks: Antagonists or Allies?. (arXiv:2401.09237v1 [cs.LG])

Authors: Jan Rathjens, Laurenz Wiskott

Predictive coding-inspired deep networks for visual computing integrate classification and reconstruction processes in shared intermediate layers. Although synergy between these processes is commonly assumed, it has yet to be convincingly demonstrated. In this study, we take a critical look at how classifying and reconstructing interact in deep learning architectures. Our approach utilizes a purposefully designed family of model architectures reminiscent of autoencoders, each equipped with an encoder, a decoder, and a classification head featuring varying modules and complexities. We meticulously analyze the extent to which classification- and reconstruction-driven information can seamlessly coexist within the shared latent layer of the model architectures. Our findings underscore a significant challenge: Classification-driven information diminishes reconstruction-driven information in intermediate layers' shared representations and vice versa. While expanding the shared representation's dimensions or increasing the network's complexity can alleviate this trade-off effect, our results challenge prevailing assumptions in predictive coding and offer guidance for future iterations of predictive coding concepts in deep networks.

DiffClone: Enhanced Behaviour Cloning in Robotics with Diffusion-Driven Policy Learning. (arXiv:2401.09243v1 [cs.RO])

Authors: Sabariswaran Mani, Abhranil Chandra, Sreyas Venkataraman, Adyan Rizvi, Yash Sirvi, Soumojit Bhattacharya, Aritra Hazra

Robot learning tasks are extremely compute-intensive and hardware-specific. Thus, tackling these challenges using a diverse dataset of offline demonstrations to train robot manipulation agents is very appealing. The Train-Offline-Test-Online (TOTO) Benchmark provides a well-curated open-source dataset for offline training, comprised mostly of expert data, along with benchmark scores of common offline-RL and behaviour cloning agents. In this paper, we introduce DiffClone, an offline algorithm that enhances behaviour cloning with diffusion-based policy learning, and measure the efficacy of our method on real online physical robots at test time. This is also our official submission to the Train-Offline-Test-Online (TOTO) Benchmark Challenge organized at NeurIPS 2023. We experimented with both pre-trained visual representations and agent policies. In our experiments, we find that a MOCO-finetuned ResNet50 performs best in comparison to other finetuned representations. Goal-state conditioning and mapping to transitions resulted in a minute increase in the success rate and mean reward. As for the agent policy, we developed DiffClone, a behaviour cloning agent improved using conditional diffusion.

Bridging the Gap Between General and Down-Closed Convex Sets in Submodular Maximization. (arXiv:2401.09251v1 [cs.LG])

Authors: Loay Mualem, Murad Tukan, Moran Feldman

Optimization of DR-submodular functions has experienced a notable surge in significance in recent times, marking a pivotal development within the domain of non-convex optimization. Motivated by real-world scenarios, some recent works have delved into the maximization of non-monotone DR-submodular functions over general (not necessarily down-closed) convex set constraints. Up to this point, these works have all used the minimum $\ell_\infty$ norm of any feasible solution as a parameter. Unfortunately, a recent hardness result due to Mualem \& Feldman~\cite{mualem2023resolving} shows that this approach cannot yield a smooth interpolation between down-closed and non-down-closed constraints. In this work, we suggest novel offline and online algorithms that provably provide such an interpolation based on a natural decomposition of the convex body constraint into two distinct convex bodies: a down-closed convex body and a general convex body. We also empirically demonstrate the superiority of our proposed algorithms across three offline and two online applications.

3D Scene Geometry Estimation from 360$^\circ$ Imagery: A Survey. (arXiv:2401.09252v1 [cs.CV])

Authors: Thiago Lopes Trugillo da Silveira, Paulo Gamarra Lessa Pinto, Jeffri Erwin Murrugarra Llerena, Claudio Rosito Jung

This paper provides a comprehensive survey on pioneer and state-of-the-art 3D scene geometry estimation methodologies based on single, two, or multiple images captured under the omnidirectional optics. We first revisit the basic concepts of the spherical camera model, and review the most common acquisition technologies and representation formats suitable for omnidirectional (also called 360$^\circ$, spherical or panoramic) images and videos. We then survey monocular layout and depth inference approaches, highlighting the recent advances in learning-based solutions suited for spherical data. The classical stereo matching is then revised on the spherical domain, where methodologies for detecting and describing sparse and dense features become crucial. The stereo matching concepts are then extrapolated for multiple view camera setups, categorizing them among light fields, multi-view stereo, and structure from motion (or visual simultaneous localization and mapping). We also compile and discuss commonly adopted datasets and figures of merit indicated for each purpose and list recent results for completeness. We conclude this paper by pointing out current and future trends.

A First-Order Multi-Gradient Algorithm for Multi-Objective Bi-Level Optimization. (arXiv:2401.09257v1 [cs.LG])

Authors: Feiyang Ye, Baijiong Lin, Xiaofeng Cao, Yu Zhang, Ivor Tsang

In this paper, we study the Multi-Objective Bi-Level Optimization (MOBLO) problem, where the upper-level subproblem is a multi-objective optimization problem and the lower-level subproblem is a scalar optimization problem. Existing gradient-based MOBLO algorithms need to compute the Hessian matrix, which is computationally inefficient. To address this, we propose an efficient first-order multi-gradient method for MOBLO, called FORUM. Specifically, we reformulate MOBLO problems as a constrained multi-objective optimization (MOO) problem via the value-function approach. Then we propose a novel multi-gradient aggregation method to solve the challenging constrained MOO problem. Theoretically, we provide a complexity analysis showing the efficiency of the proposed method and a non-asymptotic convergence result. Empirically, extensive experiments demonstrate the effectiveness and efficiency of the proposed FORUM method on different learning problems. In particular, it achieves state-of-the-art performance on three multi-task learning benchmark datasets.

MSHyper: Multi-Scale Hypergraph Transformer for Long-Range Time Series Forecasting. (arXiv:2401.09261v1 [cs.LG])

Authors: Zongjiang Shang, Ling Chen

Demystifying interactions between temporal patterns of different scales is fundamental to precise long-range time series forecasting. However, previous works lack the ability to model high-order interactions. To promote more comprehensive pattern interaction modeling for long-range time series forecasting, we propose a Multi-Scale Hypergraph Transformer (MSHyper) framework. Specifically, a multi-scale hypergraph is introduced to provide foundations for modeling high-order pattern interactions. Then by treating hyperedges as nodes, we also build a hyperedge graph to enhance hypergraph modeling. In addition, a tri-stage message passing mechanism is introduced to aggregate pattern information and learn the interaction strength between temporal patterns of different scales. Extensive experiments on five real-world datasets demonstrate that MSHyper achieves state-of-the-art performance, reducing prediction errors by an average of 8.73% and 7.15% over the best baseline in MSE and MAE, respectively.

Risk-Aware Accelerated Wireless Federated Learning with Heterogeneous Clients. (arXiv:2401.09267v1 [cs.LG])

Authors: Mohamed Ads, Hesham ElSawy, Hossam S. Hassanein

Wireless Federated Learning (FL) is an emerging distributed machine learning paradigm, particularly gaining momentum in domains with confidential and private data on mobile clients. However, the location-dependent performance, in terms of transmission rates and susceptibility to transmission errors, poses major challenges for wireless FL's convergence speed and accuracy. The challenge is more acute in hostile environments that lack a metric authenticating the data quality and security profile of the clients. In this context, this paper proposes a novel risk-aware accelerated FL framework that accounts for the clients' heterogeneity in the amount of possessed data, transmission rates, transmission errors, and trustworthiness. Classifying clients according to their location-dependent performance and trustworthiness profiles, we propose a dynamic risk-aware global model aggregation scheme that admits clients in descending order of their transmission rates under a progressively ascending trustworthiness constraint. In particular, the transmission rate is the dominant participation criterion in the initial rounds to accelerate convergence. Our model then progressively relaxes the transmission rate restriction to explore more training data at cell-edge clients. The aggregation rounds incorporate a debiasing factor that accounts for transmission errors. Risk-awareness is enabled by a validation set, where the base station eliminates non-trustworthy clients at the fine-tuning stage. The proposed scheme is benchmarked against a conservative scheme (i.e., only allowing trustworthy devices) and an aggressive scheme (i.e., oblivious to the trust metric). The numerical results highlight the superiority of the proposed scheme over both benchmarks in terms of accuracy and convergence speed.
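
A toy participation schedule in the spirit of this description might look as follows; the linear schedule and thresholds are purely illustrative assumptions, not the paper's scheme.

```python
# Toy risk-aware client schedule: admit high-rate clients first, relax the
# rate cutoff over rounds, and progressively tighten the trust requirement.
def select_clients(clients, round_idx, n_rounds):
    """clients: list of dicts with 'rate' and 'trust' values in [0, 1]."""
    frac = round_idx / max(n_rounds - 1, 1)  # 0 at the start, 1 at the end
    rate_cut = 1.0 - frac                    # relaxes over rounds
    trust_cut = frac                         # tightens over rounds
    eligible = [c for c in clients
                if c["rate"] >= rate_cut and c["trust"] >= trust_cut]
    return sorted(eligible, key=lambda c: -c["rate"])

clients = [{"rate": 0.9, "trust": 0.5}, {"rate": 0.4, "trust": 0.9}]
print(select_clients(clients, round_idx=2, n_rounds=10))  # fast client only
print(select_clients(clients, round_idx=8, n_rounds=10))  # trusted client only
```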

A Dempster-Shafer approach to trustworthy AI with application to fetal brain MRI segmentation. (arXiv:2204.02779v4 [eess.IV] UPDATED)

Authors: Lucas Fidon, Michael Aertsen, Florian Kofler, Andrea Bink, Anna L. David, Thomas Deprest, Doaa Emam, Frédéric Guffens, András Jakab, Gregor Kasprian, Patric Kienast, Andrew Melbourne, Bjoern Menze, Nada Mufti, Ivana Pogledic, Daniela Prayer, Marlene Stuempflen, Esther Van Elslander, Sébastien Ourselin, Jan Deprest, Tom Vercauteren

Deep learning models for medical image segmentation can fail unexpectedly and spectacularly for pathological cases and for images acquired at centers different from those of the training images, with labeling errors that violate expert knowledge. Such errors undermine the trustworthiness of deep learning models for medical image segmentation. Mechanisms for detecting and correcting such failures are essential for safely translating this technology into clinics and are likely to be a requirement of future regulations on artificial intelligence (AI). In this work, we propose a trustworthy AI theoretical framework and a practical system that can augment any backbone AI system using a fallback method and a fail-safe mechanism based on Dempster-Shafer theory. Our approach relies on an actionable definition of trustworthy AI. Our method automatically discards any voxel-level labeling predicted by the backbone AI that violates expert knowledge and relies on a fallback for those voxels. We demonstrate the effectiveness of the proposed trustworthy AI approach on the largest reported annotated dataset of fetal MRI, consisting of 540 manually annotated fetal brain 3D T2w MRIs from 13 centers. Our trustworthy AI method improves the robustness of a state-of-the-art backbone AI for fetal brain MRIs acquired across various centers and for fetuses with various brain abnormalities.
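
For readers unfamiliar with the underlying calculus, the sketch below implements Dempster's rule of combination for two mass functions over a discrete frame of discernment; the paper's voxel-level fail-safe builds on this theory but is not reproduced here, and the masses in the example are made up.

```python
# Dempster's rule of combination for two mass functions.
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """m1, m2 map frozenset focal elements to masses summing to 1."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("total conflict: combination undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

# Two evidence sources over the labels {tissue, background} (toy values).
m1 = {frozenset({"tissue"}): 0.7, frozenset({"tissue", "background"}): 0.3}
m2 = {frozenset({"background"}): 0.4, frozenset({"tissue", "background"}): 0.6}
print(dempster_combine(m1, m2))
```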

Model-Informed Generative Adversarial Network (MI-GAN) for Learning Optimal Power Flow. (arXiv:2206.01864v2 [cs.LG] UPDATED)

Authors: Yuxuan Li, Chaoyue Zhao, Chenang Liu

The optimal power flow (OPF) problem, as a critical component of power system operations, becomes increasingly difficult to solve due to the variability, intermittency, and unpredictability of renewable energy brought into the power system. Although traditional optimization techniques, such as stochastic and robust optimization, can be leveraged to address the OPF problem under renewable energy uncertainty, i.e., dynamic coefficients in the optimization model, their effectiveness on large-scale problems remains limited. As a result, deep learning techniques, such as neural networks, have recently been developed to improve computational efficiency in solving OPF problems from data. However, the feasibility and optimality of the solution may not be guaranteed, and the system dynamics cannot be properly addressed. In this paper, we propose an optimization model-informed generative adversarial network (MI-GAN) framework to solve OPF under uncertainty. The main contributions are threefold: (1) to ensure feasibility and improve the optimality of generated solutions, we propose three important layers: a feasibility filter layer, a comparison layer, and a gradient-guided layer; (2) within the GAN-based framework, we establish an efficient model-informed selector incorporating these three new layers; and (3) we propose a new recursive iteration algorithm to further improve solution optimality and handle the system dynamics. Numerical results on IEEE test systems show that the proposed method is effective and promising.

Lyapunov Function Consistent Adaptive Network Signal Control with Back Pressure and Reinforcement Learning. (arXiv:2210.02612v2 [eess.SY] UPDATED)

Authors: Chaolun Ma, Bruce Wang, Zihao Li, Ahmadreza Mahmoudzadeh, Yunlong Zhang

In traffic signal control, flow-based methods (optimizing the overall flow) and pressure-based methods (equalizing and alleviating congestion) are commonly used but often considered separately. This study introduces a unified framework based on Lyapunov control theory, defining specific Lyapunov functions for each class of methods. This perspective yields interesting results: for example, the well-recognized back-pressure method is equivalent to weighting differential queue lengths by intersection lane saturation flows. We further improve it by incorporating basic traffic flow theory. Beyond ensuring that the control system is stable, the controller should also adapt to various performance metrics. Building on insights from Lyapunov theory, this study designs a reward function for Reinforcement Learning (RL)-based network signal control, whose agent is trained with a Double Deep Q-Network (DDQN) for effective control over complex traffic networks. The proposed algorithm is compared with several traditional and RL-based methods under pure passenger car flow and under heterogeneous traffic flow including freight. The numerical tests demonstrate that the proposed method outperforms the alternative control methods across different traffic scenarios, covering corridor and general network situations with varying traffic demands, in terms of average vehicle waiting time.
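
To make the weighted differential-queue reading of back-pressure concrete, here is a minimal phase selector on a toy intersection; the lane layout, queues, and saturation flows are hypothetical, and real controllers add timing constraints omitted here.

```python
# Back-pressure phase selection: pick the phase whose movements have the
# largest saturation-flow-weighted upstream-minus-downstream queue difference.
def backpressure_phase(phases, queue, sat_flow):
    def pressure(movements):
        return sum(sat_flow[(u, d)] * (queue[u] - queue[d])
                   for u, d in movements)
    return max(phases, key=lambda p: pressure(phases[p]))

# Toy intersection: approaches N/S/E/W discharge onto exit links n/s/e/w.
queue = {"N": 12, "S": 8, "E": 3, "W": 5, "n": 1, "s": 1, "e": 1, "w": 1}
sat_flow = {("N", "s"): 1.0, ("S", "n"): 1.0, ("E", "w"): 0.9, ("W", "e"): 0.9}
phases = {"NS": [("N", "s"), ("S", "n")], "EW": [("E", "w"), ("W", "e")]}
print(backpressure_phase(phases, queue, sat_flow))  # -> "NS"
```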

Exploring Contextual Representation and Multi-Modality for End-to-End Autonomous Driving. (arXiv:2210.06758v2 [cs.RO] UPDATED)

Authors: Shoaib Azam, Farzeen Munir, Ville Kyrki, Moongu Jeon, Witold Pedrycz

Learning contextual and spatial environmental representations enhances an autonomous vehicle's hazard anticipation and decision-making in complex scenarios. Recent perception systems enhance spatial understanding with sensor fusion but often lack full environmental context. Humans, when driving, naturally employ neural maps that integrate various factors such as historical data, situational subtleties, and behavioral predictions of other road users to form a rich contextual understanding of their surroundings. This neural map-based comprehension is integral to making informed decisions on the road. In contrast, even with their significant advancements, autonomous systems have yet to fully harness this depth of human-like contextual understanding. Motivated by this, our work draws inspiration from human driving patterns and seeks to formalize the sensor fusion approach within an end-to-end autonomous driving framework. We introduce a framework that integrates three cameras (left, right, and center) to emulate the human field of view, coupled with top-down bird's-eye-view semantic data to enhance contextual representation. The sensor data is fused and encoded using a self-attention mechanism, leading to an auto-regressive waypoint prediction module. We treat feature representation as a sequential problem, employing a vision transformer to distill the contextual interplay between sensor modalities. The efficacy of the proposed method is experimentally evaluated in both open- and closed-loop settings. Our method achieves a displacement error of 0.67 m in the open-loop setting, surpassing current methods by 6.9% on the nuScenes dataset. In closed-loop evaluations on CARLA's Town05 Long and Longest6 benchmarks, the proposed method improves driving performance and route completion and reduces infractions.

Degeneracy is OK: Logarithmic Regret for Network Revenue Management with Indiscrete Distributions. (arXiv:2210.07996v3 [cs.LG] UPDATED)

Authors: Jiashuo Jiang, Will Ma, Jiawei Zhang

We study the classical Network Revenue Management (NRM) problem with accept/reject decisions and $T$ IID arrivals. We consider a distributional form where each arrival must fall under a finite number of possible categories, each with a deterministic resource consumption vector, but a random value distributed continuously over an interval. We develop an online algorithm that achieves $O(\log^2 T)$ regret under this model, with the only (necessary) assumption being that the probability densities are bounded away from 0. We derive a second result that achieves $O(\log T)$ regret under an additional assumption of second-order growth. To our knowledge, these are the first results achieving logarithmic-level regret in an NRM model with continuous values that do not require any kind of ``non-degeneracy'' assumptions. Our results are achieved via new techniques including a new method of bounding myopic regret, a ``semi-fluid'' relaxation of the offline allocation, and an improved bound on the ``dual convergence''.

Online Loss Function Learning. (arXiv:2301.13247v2 [cs.LG] UPDATED)

Authors: Christian Raymond, Qi Chen, Bing Xue, Mengjie Zhang

Loss function learning is a new meta-learning paradigm that aims to automate the essential task of designing a loss function for a machine learning model. Existing techniques for loss function learning have shown promising results, often improving a model's training dynamics and final inference performance. However, a significant limitation of these techniques is that the loss functions are meta-learned in an offline fashion, where the meta-objective only considers the very first few steps of training, a significantly shorter horizon than the one typically used for training deep neural networks. This causes a significant bias towards loss functions that perform well at the very start of training but poorly at the end. To address this issue, we propose a new loss function learning technique that adaptively updates the loss function online after each update to the base model parameters. The experimental results show that our proposed method consistently outperforms the cross-entropy loss and offline loss function learning techniques on a diverse range of neural network architectures and datasets.
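
The core mechanism, differentiating task performance through a one-step lookahead to update the loss parameters after every base update, can be sketched on a toy regression problem; the parametric loss, learning rates, and data below are illustrative assumptions, not the paper's setup.

```python
# Toy online loss-function learning: a scalar parameter phi scales the loss,
# and is updated by backpropagating through a differentiable lookahead step.
import torch

torch.manual_seed(0)
X = torch.randn(64, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(64)

w = torch.zeros(3, requires_grad=True)       # base model parameters
phi = torch.tensor(0.0, requires_grad=True)  # learned loss parameter
lr_w, lr_phi = 0.1, 0.01

for _ in range(100):
    base_loss = torch.exp(phi) * ((X @ w - y) ** 2).mean()
    (g_w,) = torch.autograd.grad(base_loss, w, create_graph=True)
    w_lookahead = w - lr_w * g_w                     # differentiable step
    meta_loss = ((X @ w_lookahead - y) ** 2).mean()  # task performance
    (g_phi,) = torch.autograd.grad(meta_loss, phi)
    with torch.no_grad():
        phi -= lr_phi * g_phi  # loss update after each base update
        w -= lr_w * g_w        # commit the base update

print(f"final mse: {((X @ w - y) ** 2).mean().item():.4f}")
```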

Supporting Safety Analysis of Image-processing DNNs through Clustering-based Approaches. (arXiv:2301.13506v3 [cs.SE] UPDATED)

Authors: Mohammed Oualid Attaoui, Hazem Fahmy, Fabrizio Pastore, Lionel Briand

The adoption of deep neural networks (DNNs) in safety-critical contexts is often prevented by the lack of effective means to explain their results, especially when they are erroneous. In our previous work, we proposed a white-box approach (HUDD) and a black-box approach (SAFE) to automatically characterize DNN failures. They both identify clusters of similar images from a potentially large set of images leading to DNN failures. However, the analysis pipelines for HUDD and SAFE were instantiated in specific ways according to common practices, deferring the analysis of other pipelines to future work. In this paper, we report on an empirical evaluation of 99 different pipelines for root cause analysis of DNN failures. They combine transfer learning, autoencoders, heatmaps of neuron relevance, dimensionality reduction techniques, and different clustering algorithms. Our results show that the best pipeline combines transfer learning, DBSCAN, and UMAP. It leads to clusters almost exclusively capturing images of the same failure scenario, thus facilitating root cause analysis. Further, it generates distinct clusters for each root cause of failure, thus enabling engineers to detect all the unsafe scenarios. Interestingly, these results hold even for failure scenarios that are only observed in a small percentage of the failing images.

A Comparative Study of Deep Learning and Iterative Algorithms for Joint Channel Estimation and Signal Detection. (arXiv:2303.03678v2 [eess.SP] UPDATED)

Authors: Haocheng Ju, Haimiao Zhang, Lin Li, Xiao Li, Bin Dong

Joint channel estimation and signal detection (JCESD) in wireless communication systems is a crucial and challenging task, especially since it inherently poses a nonlinear inverse problem. This challenge is further highlighted in low signal-to-noise ratio (SNR) scenarios, where traditional algorithms often perform poorly. Deep learning (DL) methods have been investigated, but concerns regarding computational expense and lack of validation in low-SNR settings remain. Hence, the development of a robust and low-complexity model that can deliver excellent performance across a wide range of SNRs is highly desirable. In this paper, we aim to establish a benchmark where traditional algorithms and DL methods are validated on different channel models, Doppler, and SNR settings. In particular, we propose a new DL model where the backbone network is formed by unrolling the iterative algorithm, and the hyperparameters are estimated by hypernetworks. Additionally, we adapt a lightweight DenseNet to the task of JCESD for comparison. We evaluate different methods in three aspects: generalization in terms of bit error rate (BER), robustness, and complexity. Our results indicate that DL approaches outperform traditional algorithms in the challenging low-SNR setting, while the iterative algorithm performs better in high-SNR settings. Furthermore, the iterative algorithm is more robust in the presence of carrier frequency offset, whereas DL methods excel when signals are corrupted by asymmetric Gaussian noise.

HomPINNs: homotopy physics-informed neural networks for solving the inverse problems of nonlinear differential equations with multiple solutions. (arXiv:2304.02811v2 [cs.LG] UPDATED)

Authors: Haoyang Zheng, Yao Huang, Ziyang Huang, Wenrui Hao, Guang Lin

Due to the complex behavior arising from non-uniqueness, symmetry, and bifurcations in the solution space, solving inverse problems of nonlinear differential equations (DEs) with multiple solutions is a challenging task. To address this, we propose homotopy physics-informed neural networks (HomPINNs), a novel framework that leverages homotopy continuation and neural networks (NNs) to solve inverse problems. The proposed framework begins with the use of NNs to simultaneously approximate unlabeled observations across diverse solutions while adhering to DE constraints. Through homotopy continuation, the proposed method solves the inverse problem by tracing the observations and identifying multiple solutions. The experiments involve testing the performance of the proposed method on one-dimensional DEs and applying it to solve a two-dimensional Gray-Scott simulation. Our findings demonstrate that the proposed method is scalable and adaptable, providing an effective solution for solving DEs with multiple solutions and unknown parameters. Moreover, it has significant potential for various applications in scientific computing, such as modeling complex systems and solving inverse problems in physics, chemistry, biology, etc.

ID-MixGCL: Identity Mixup for Graph Contrastive Learning. (arXiv:2304.10045v2 [cs.LG] UPDATED)

Authors: Gehang Zhang, Bowen Yu, Jiangxia Cao, Xinghua Zhang, Jiawei Sheng, Chuan Zhou, Tingwen Liu

Graph contrastive learning (GCL) has recently achieved substantial advancements. Existing GCL approaches compare two different ``views'' of the same graph in order to learn node/graph representations. The underlying assumption of these studies is that the graph augmentation strategy is capable of generating several graph views that are structurally different but semantically similar to the original graph, so that the ground-truth labels of the original and augmented graphs/nodes can be regarded as identical in contrastive learning. However, we observe that this assumption does not always hold. For instance, the deletion of a super-node within a social network can exert a substantial influence on the partitioning of communities for other nodes. Similarly, any perturbation to nodes or edges in a molecular graph will change the labels of the graph. Therefore, we believe that augmenting the graph, accompanied by an adaptation of the labels used for the contrastive loss, will facilitate the encoder to learn a better representation. Based on this idea, we propose ID-MixGCL, which simultaneously interpolates input nodes and the corresponding identity labels to obtain soft-confidence samples with a controllable degree of change, leading to the capture of fine-grained representations from self-supervised training on unlabeled graphs. Experimental results demonstrate that ID-MixGCL improves performance on graph classification and node classification tasks, with significant improvements of 3-29 absolute percentage points over state-of-the-art techniques on the Cora, IMDB-B, IMDB-M, and PROTEINS datasets.
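
The interpolation step can be illustrated with a generic mixup of node embeddings and one-hot identity labels; the Beta parameter and shapes are assumptions, and ID-MixGCL's contrastive objective over the resulting soft labels is omitted.

```python
# Mix two node embeddings and their identity labels with one coefficient.
import numpy as np

def identity_mixup(x_i, x_j, y_i, y_j, alpha=0.2, rng=None):
    """Return a soft-confidence sample interpolating features and labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient, mostly near 0 or 1
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

x_mix, y_mix = identity_mixup(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                              np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(x_mix, y_mix)
```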

Segment Anything Model for Medical Images?. (arXiv:2304.14660v7 [eess.IV] UPDATED)

Authors: Yuhao Huang, Xin Yang, Lian Liu, Han Zhou, Ao Chang, Xinrui Zhou, Rusi Chen, Junxuan Yu, Jiongquan Chen, Chaoyu Chen, Sijing Liu, Haozhe Chi, Xindi Hu, Kejuan Yue, Lei Li, Vicente Grau, Deng-Ping Fan, Fajin Dong, Dong Ni

The Segment Anything Model (SAM) is the first foundation model for general image segmentation. It has achieved impressive results on various natural image segmentation tasks. However, medical image segmentation (MIS) is more challenging because of complex modalities, fine anatomical structures, uncertain and complex object boundaries, and wide-ranging object scales. To fully validate SAM's performance on medical data, we collected and sorted 53 open-source datasets and built a large medical segmentation dataset with 18 modalities, 84 objects, 125 object-modality paired targets, 1050K 2D images, and 6033K masks. We comprehensively analyzed different models and strategies on the so-called COSMOS 1050K dataset. Our findings mainly include the following: 1) SAM showed remarkable performance on some specific objects but was unstable, imperfect, or even failed totally in other situations. 2) SAM with the large ViT-H showed better overall performance than with the small ViT-B. 3) SAM performed better with manual hints, especially boxes, than in Everything mode. 4) SAM could help human annotation with high labeling quality and less time. 5) SAM was sensitive to randomness in the center point and tight box prompts, which may cause a serious performance drop. 6) SAM performed better than interactive methods with one or a few points, but was outpaced as the number of points increased. 7) SAM's performance correlated with different factors, including boundary complexity, intensity differences, etc. 8) Finetuning SAM on specific medical tasks could improve its average DICE performance by 4.39% and 6.68% for ViT-B and ViT-H, respectively. We hope that this comprehensive report can help researchers explore the potential of SAM applications in MIS, and guide the appropriate use and development of SAM.

Flame: Simplifying Topology Extension in Federated Learning. (arXiv:2305.05118v2 [cs.LG] UPDATED)

Authors: Harshit Daga, Jaemin Shin, Dhruv Garg, Ada Gavrilovska, Myungjin Lee, Ramana Rao Kompella

Distributed machine learning approaches, including a broad class of federated learning (FL) techniques, present a number of benefits when deploying machine learning applications over widely distributed infrastructures. The benefits are highly dependent on the details of the underlying machine learning topology, which specifies the functionality executed by the participating nodes, their dependencies and interconnections. Current systems lack the flexibility and extensibility necessary to customize the topology of a machine learning deployment. We present Flame, a new system that provides flexibility of the topology configuration of distributed FL applications around the specifics of a particular deployment context, and is easily extensible to support new FL architectures. Flame achieves this via a new high-level abstraction Topology Abstraction Graphs (TAGs). TAGs decouple the ML application logic from the underlying deployment details, making it possible to specialize the application deployment with reduced development effort. Flame is released as an open source project, and its flexibility and extensibility support a variety of topologies and mechanisms, and can facilitate the development of new FL methodologies.

A Scalable Neural Network for DSIC Affine Maximizer Auction Design. (arXiv:2305.12162v3 [cs.GT] UPDATED)

Authors: Zhijian Duan, Haoran Sun, Yurong Chen, Xiaotie Deng

Automated auction design aims to find empirically high-revenue mechanisms through machine learning. Existing works on multi-item auction scenarios can be roughly divided into RegretNet-like and affine maximizer auction (AMA) approaches. However, the former cannot strictly ensure dominant strategy incentive compatibility (DSIC), while the latter faces scalability issues due to the large number of allocation candidates. To address these limitations, we propose AMenuNet, a scalable neural network that constructs the AMA parameters (including the allocation menu) from bidder and item representations. AMenuNet is always DSIC and individually rational (IR) due to the properties of AMAs, and it enhances scalability by generating candidate allocations through a neural network. Additionally, AMenuNet is permutation equivariant, and its number of parameters is independent of auction scale. We conduct extensive experiments to demonstrate that AMenuNet outperforms strong baselines in both contextual and non-contextual multi-item auctions, scales well to larger auctions, generalizes well to different settings, and identifies useful deterministic allocations. Overall, our proposed approach offers an effective solution to automated DSIC auction design, with improved scalability and strong revenue performance in various settings.
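
As background on why AMA-based designs are DSIC and IR by construction, the sketch below evaluates a tiny affine maximizer auction with the weighted-VCG payment rule; the weights, boosts, and menu are toy values, while AMenuNet's contribution is producing such parameters with a neural network.

```python
# Affine maximizer auction: choose the menu entry maximizing the affine
# welfare sum_i w_i * v_i(a) + boost(a); charge weighted-VCG payments.
import numpy as np

def ama_outcome(values, w, boosts, menu):
    """values: (n, m) reported values; menu: list of (n, m) 0/1 allocations."""
    n = len(w)
    def welfare(a, skip=None):
        return sum(w[i] * float((values[i] * a[i]).sum())
                   for i in range(n) if i != skip)
    scores = [welfare(a) + b for a, b in zip(menu, boosts)]
    k = int(np.argmax(scores))
    payments = [(max(welfare(a, skip=i) + b for a, b in zip(menu, boosts))
                 - (welfare(menu[k], skip=i) + boosts[k])) / w[i]
                for i in range(n)]
    return k, payments

# Two bidders, one item; menu: give to bidder 0, bidder 1, or no one.
values = np.array([[3.0], [2.0]])
menu = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]]), np.zeros((2, 1))]
print(ama_outcome(values, w=[1.0, 1.0], boosts=[0.0, 0.0, 0.0], menu=menu))
# -> (0, [2.0, 0.0]): second-price behavior recovered as a special case
```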

Demystifying Oversmoothing in Attention-Based Graph Neural Networks. (arXiv:2305.16102v3 [cs.LG] UPDATED)

Authors: Xinyi Wu, Amir Ajorlou, Zihui Wu, Ali Jadbabaie

Oversmoothing in Graph Neural Networks (GNNs) refers to the phenomenon where increasing network depth leads to homogeneous node representations. While previous work has established that Graph Convolutional Networks (GCNs) exponentially lose expressive power, it remains controversial whether the graph attention mechanism can mitigate oversmoothing. In this work, we provide a definitive answer to this question through a rigorous mathematical analysis, by viewing attention-based GNNs as nonlinear time-varying dynamical systems and incorporating tools and techniques from the theory of products of inhomogeneous matrices and the joint spectral radius. We establish that, contrary to popular belief, the graph attention mechanism cannot prevent oversmoothing and loses expressive power exponentially. The proposed framework extends the existing results on oversmoothing for symmetric GCNs to a significantly broader class of GNN models, including random walk GCNs, Graph Attention Networks (GATs) and (graph) transformers. In particular, our analysis accounts for asymmetric, state-dependent and time-varying aggregation operators and a wide range of common nonlinear activation functions, such as ReLU, LeakyReLU, GELU and SiLU.

Causal Component Analysis. (arXiv:2305.17225v3 [stat.ML] UPDATED)

Authors: Liang Wendong, Armin Kekić, Julius von Kügelgen, Simon Buchholz, Michel Besserve, Luigi Gresele, Bernhard Schölkopf

Independent Component Analysis (ICA) aims to recover independent latent variables from observed mixtures thereof. Causal Representation Learning (CRL) aims instead to infer causally related (thus often statistically dependent) latent variables, together with the unknown graph encoding their causal relationships. We introduce an intermediate problem termed Causal Component Analysis (CauCA). CauCA can be viewed as a generalization of ICA, modelling the causal dependence among the latent components, and as a special case of CRL. In contrast to CRL, it presupposes knowledge of the causal graph, focusing solely on learning the unmixing function and the causal mechanisms. Any impossibility results regarding the recovery of the ground truth in CauCA also apply for CRL, while possibility results may serve as a stepping stone for extensions to CRL. We characterize CauCA identifiability from multiple datasets generated through different types of interventions on the latent causal variables. As a corollary, this interventional perspective also leads to new identifiability results for nonlinear ICA -- a special case of CauCA with an empty graph -- requiring strictly fewer datasets than previous results. We introduce a likelihood-based approach using normalizing flows to estimate both the unmixing function and the causal mechanisms, and demonstrate its effectiveness through extensive synthetic experiments in the CauCA and ICA setting.

Improved Probabilistic Image-Text Representations. (arXiv:2305.18171v3 [cs.CV] UPDATED)

Authors: Sanghyuk Chun

The Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic functions are not sufficiently powerful to capture this ambiguity, prompting the exploration of probabilistic embeddings to tackle the challenge. However, the existing probabilistic ITM approach encounters two key shortcomings: the burden of heavy computations due to the Monte Carlo approximation, and loss saturation in the face of abundant false negatives. To overcome these issues, this paper presents improved Probabilistic Cross-Modal Embeddings (named PCME++) by introducing a new probabilistic distance with a closed-form solution. In addition, two optimization techniques are proposed to enhance PCME++ further: first, the incorporation of pseudo-positives to prevent loss saturation under massive false negatives; second, mixed sample data augmentation for probabilistic matching. Experimental results on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. The robustness of PCME++ is also evaluated under noisy image-text correspondences. In addition, the potential applicability of PCME++ in automatic prompt tuning for zero-shot classification is shown. The code is available at https://github.com/naver-ai/pcmepp.

Intensity Profile Projection: A Framework for Continuous-Time Representation Learning for Dynamic Networks. (arXiv:2306.06155v3 [cs.LG] UPDATED)

Authors: Alexander Modell, Ian Gallagher, Emma Ceccherini, Nick Whiteley, Patrick Rubin-Delanchy

We present a new representation learning framework, Intensity Profile Projection, for continuous-time dynamic network data. Given triples $(i,j,t)$, each representing a time-stamped ($t$) interaction between two entities ($i,j$), our procedure returns a continuous-time trajectory for each node, representing its behaviour over time. The framework consists of three stages: estimating pairwise intensity functions, e.g. via kernel smoothing; learning a projection which minimises a notion of intensity reconstruction error; and constructing evolving node representations via the learned projection. The trajectories satisfy two properties, known as structural and temporal coherence, which we see as fundamental for reliable inference. Moreover, we develop estimation theory providing tight control on the error of any estimated trajectory, indicating that the representations could even be used in quite noise-sensitive follow-on analyses. The theory also elucidates the role of smoothing as a bias-variance trade-off, and shows how we can reduce the level of smoothing as the signal-to-noise ratio increases on account of the algorithm `borrowing strength' across the network.

Creating Multi-Level Skill Hierarchies in Reinforcement Learning. (arXiv:2306.09980v2 [cs.LG] UPDATED)

Authors: Joshua B. Evans, Özgür Şimşek

What is a useful skill hierarchy for an autonomous agent? We propose an answer based on a graphical representation of how the interaction between an agent and its environment may unfold. Our approach uses modularity maximisation as a central organising principle to expose the structure of the interaction graph at multiple levels of abstraction. The result is a collection of skills that operate at varying time scales, organised into a hierarchy, where skills that operate over longer time scales are composed of skills that operate over shorter time scales. The entire skill hierarchy is generated automatically, with no human intervention, including the skills themselves (their behaviour, when they can be called, and when they terminate) as well as the hierarchical dependency structure between them. In a wide range of environments, this approach generates skill hierarchies that are intuitively appealing and that considerably improve the learning performance of the agent.

Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs. (arXiv:2306.11700v2 [math.OC] UPDATED)

Authors: Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, Alejandro Ribeiro

We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP). Despite the popularity of Lagrangian-based policy search methods used in practice, the oscillation of policy iterates in these methods has not been fully understood, bringing out issues such as violation of constraints and sensitivity to hyper-parameters. To fill this gap, we employ the Lagrangian method to cast a constrained MDP into a constrained saddle-point problem in which max/min players correspond to primal/dual variables, respectively, and develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy. Specifically, we first propose a regularized policy gradient primal-dual (RPG-PD) method that updates the policy using an entropy-regularized policy gradient, and the dual variable via a quadratic-regularized gradient ascent, simultaneously. We prove that the policy primal-dual iterates of RPG-PD converge to a regularized saddle point with a sublinear rate, while the policy iterates converge sublinearly to an optimal constrained policy. We further instantiate RPG-PD in large state or action spaces by including function approximation in policy parametrization, and establish similar sublinear last-iterate policy convergence. Second, we propose an optimistic policy gradient primal-dual (OPG-PD) method that employs the optimistic gradient method to update primal/dual variables, simultaneously. We prove that the policy primal-dual iterates of OPG-PD converge to a saddle point that contains an optimal constrained policy, with a linear rate. To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. (arXiv:2306.13649v3 [cs.LG] UPDATED)

Authors: Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem

Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from distribution mismatch between output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks, and task-agnostic distillation for instruction-tuning.
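
The distinguishing step, scoring the student's own samples with the teacher, can be sketched as below; the logits are placeholders for teacher and student forward passes on a student-generated sequence, and forward KL is only one of the divergences GKD admits.

```python
# Mean per-token KL(teacher || student) on a student-sampled sequence.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def on_policy_distill_loss(teacher_logits, student_logits):
    """Logit arrays of shape (seq_len, vocab) from scoring the same
    student-generated tokens with each model."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

rng = np.random.default_rng(0)
print(on_policy_distill_loss(rng.normal(size=(5, 10)),
                             rng.normal(size=(5, 10))))
```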

Score-based Source Separation with Applications to Digital Communication Signals. (arXiv:2306.14411v3 [cs.LG] UPDATED)

Authors: Tejas Jayashankar, Gary C.F. Lee, Alejandro Lancho, Amir Weiss, Yury Polyanskiy, Gregory W. Wornell

We propose a new method for separating superimposed sources using diffusion-based generative models. Our method relies only on separately trained statistical priors of independent sources to establish a new objective function guided by maximum a posteriori estimation with an $\alpha$-posterior, across multiple levels of Gaussian smoothing. Motivated by applications in radio-frequency (RF) systems, we are interested in sources with underlying discrete nature and the recovery of encoded bits from a signal of interest, as measured by the bit error rate (BER). Experimental results with RF mixtures demonstrate that our method results in a BER reduction of 95% over classical and existing learning-based methods. Our analysis demonstrates that our proposed method yields solutions that asymptotically approach the modes of an underlying discrete distribution. Furthermore, our method can be viewed as a multi-source extension to the recently proposed score distillation sampling scheme, shedding additional light on its use beyond conditional sampling. The project webpage is available at https://alpha-rgs.github.io

VertiBench: Advancing Feature Distribution Diversity in Vertical Federated Learning Benchmarks. (arXiv:2307.02040v2 [cs.LG] UPDATED)

Authors: Zhaomin Wu, Junyi Hou, Bingsheng He

Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and these represent a limited array of feature distributions. Existing benchmarks often resort to synthetic datasets, derived from arbitrary feature splits from a global set, which only capture a subset of feature distributions, leading to inadequate algorithm performance assessment. This paper addresses these shortcomings by introducing two key factors affecting VFL performance - feature importance and feature correlation - and proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the deficit in image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.

Federated Classification in Hyperbolic Spaces via Secure Aggregation of Convex Hulls. (arXiv:2308.06895v2 [cs.LG] UPDATED)

Authors: Saurav Prakash, Jin Sima, Chao Pan, Eli Chien, Olgica Milenkovic

Hierarchical and tree-like data sets arise in many applications, including language processing, graph data mining, phylogeny and genomics. It is known that tree-like data cannot be embedded into Euclidean spaces of finite dimension with small distortion. This problem can be mitigated through the use of hyperbolic spaces. When such data also has to be processed in a distributed and privatized setting, it becomes necessary to work with new federated learning methods tailored to hyperbolic spaces. As an initial step towards the development of the field of federated learning in hyperbolic spaces, we propose the first known approach to federated classification in hyperbolic spaces. Our contributions are as follows. First, we develop distributed versions of convex SVM classifiers for Poincar\'e discs. In this setting, the information conveyed from clients to the global classifier consists of the convex hulls of clusters present in individual client data. Second, to avoid label switching issues, we introduce a number-theoretic approach for label recovery based on the so-called integer $B_h$ sequences. Third, we compute the complexity of the convex hulls in hyperbolic spaces to assess the extent of data leakage; at the same time, in order to limit communication cost for the hulls, we propose a new quantization method for the Poincar\'e disc coupled with Reed-Solomon-like encoding. Fourth, at the server level, we introduce a new approach for aggregating convex hulls of the clients based on balanced graph partitioning. We test our method on a collection of diverse data sets, including hierarchical single-cell RNA-seq data from different patients distributed across different repositories that have stringent privacy constraints. The classification accuracy of our method is up to $\sim 11\%$ better than its Euclidean counterpart, demonstrating the importance of privacy-preserving learning in hyperbolic spaces.

KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods. (arXiv:2308.11863v2 [eess.AS] UPDATED)

Authors: Antoine Nzeyimana

Despite the recent availability of large transcribed Kinyarwanda speech datasets, achieving robust speech recognition for Kinyarwanda is still challenging. In this work, we show that using self-supervised pre-training, following a simple curriculum schedule during fine-tuning, and using semi-supervised learning to leverage large unlabelled speech data significantly improve speech recognition performance for Kinyarwanda. Our approach focuses on using public-domain data only. A new studio-quality speech dataset is collected from a public website, then used to train a clean baseline model. The clean baseline model is then used to rank examples from a more diverse and noisy public dataset, defining a simple curriculum training schedule. Finally, we apply semi-supervised learning to label and learn from large unlabelled data in four successive generations. Our final model achieves a 3.2% word error rate (WER) on the new dataset and a 15.9% WER on the Mozilla Common Voice benchmark, which is state-of-the-art to the best of our knowledge. Our experiments also indicate that syllabic rather than character-based tokenization results in better speech recognition performance for Kinyarwanda.

A Probabilistic Fluctuation based Membership Inference Attack for Diffusion Models. (arXiv:2308.12143v3 [cs.LG] UPDATED)

Authors: Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, Tao Jiang

Membership Inference Attack (MIA) identifies whether a record exists in a machine learning model's training set by querying the model. MIAs on classic classification models have been well studied, and recent works have started to explore how to transplant MIA onto generative models. Our investigation indicates that existing MIAs designed for generative models mainly depend on overfitting in the target models. However, overfitting can be avoided by employing various regularization techniques, in which case existing MIAs perform poorly in practice. Unlike overfitting, memorization is essential for deep learning models to attain optimal performance, making it a more prevalent phenomenon. Memorization in generative models manifests as an increased probability of generating records around the member records. Therefore, we propose a Probabilistic Fluctuation Assessing Membership Inference Attack (PFAMI), a black-box MIA that infers membership by detecting these trends via the overall probabilistic fluctuations around given records. We conduct extensive experiments across multiple generative models and datasets, which demonstrate that PFAMI can improve the attack success rate (ASR) by about 27.9% compared with the best baseline.
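
Schematically, the fluctuation test compares the model's probability at a record with the probabilities of perturbed neighbours; `log_prob` and `perturb` below are hypothetical black-box interfaces, and PFAMI's actual estimator for generative models is considerably more elaborate.

```python
# Fluctuation-based membership score: members tend to sit on local
# probability peaks of the model (memorization), giving positive scores.
import numpy as np

def fluctuation_score(x, log_prob, perturb, n_neighbors=16):
    neighbors = [log_prob(perturb(x)) for _ in range(n_neighbors)]
    return log_prob(x) - float(np.mean(neighbors))

# Tiny demo with a 1-D Gaussian standing in for the generative model.
rng = np.random.default_rng(0)
log_prob = lambda x: -0.5 * float(x) ** 2
perturb = lambda x: x + rng.normal(scale=0.3)
print(fluctuation_score(0.0, log_prob, perturb))  # > 0 at the mode
```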

How Safe Am I Given What I See? Calibrated Prediction of Safety Chances for Image-Controlled Autonomy. (arXiv:2308.12252v2 [cs.LG] UPDATED)

Authors: Zhenjiang Mao, Carson Sobolewski, Ivan Ruchkin

End-to-end learning has emerged as a major paradigm for developing autonomous systems. Unfortunately, with its performance and convenience comes an even greater challenge of safety assurance. A key factor of this challenge is the absence of the notion of a low-dimensional and interpretable dynamical state, around which traditional assurance methods revolve. Focusing on the online safety prediction problem, this paper proposes a configurable family of learning pipelines based on generative world models, which do not require low-dimensional states. To implement these pipelines, we overcome the challenges of learning safety-informed latent representations and missing safety labels under prediction-induced distribution shift. These pipelines come with statistical calibration guarantees on their safety chance predictions based on conformal prediction. We perform an extensive evaluation of the proposed learning pipelines on two case studies of image-controlled systems: a racing car and a cartpole.

MMSFormer: Multimodal Transformer for Material and Semantic Segmentation. (arXiv:2309.04001v3 [cs.CV] UPDATED)

Authors: Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif

Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different modality combinations. We also propose a new model named Multi-Modal Segmentation TransFormer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material and semantic segmentation tasks. MMSFormer outperforms current state-of-the-art models on three different datasets. Starting from a single input modality, performance improves progressively as additional modalities are incorporated, showcasing the effectiveness of the fusion block in combining useful information from diverse input modalities. Ablation studies show that different modules in the fusion block are crucial for overall model performance. Furthermore, our ablation studies also highlight the capacity of different input modalities to improve performance in the identification of different types of materials. The code and pretrained models will be made available at https://github.com/csiplab/MMSFormer.

A DenseNet-based method for decoding auditory spatial attention with EEG. (arXiv:2309.07690v2 [eess.SP] UPDATED)

Authors: Xiran Xu, Bo Wang, Yujie Yan, Xihong Wu, Jing Chen

Auditory spatial attention detection (ASAD) aims to decode the attended spatial location from EEG in a multiple-speaker setting. ASAD methods are inspired by the brain lateralization of cortical neural responses during the processing of auditory spatial attention, and show promising performance for the task of auditory attention decoding (AAD) with neural recordings. Previous ASAD methods do not fully exploit the spatial distribution of EEG electrodes, which may limit their performance. In the present work, by transforming the original EEG channels into a two-dimensional (2D) spatial topological map, the EEG data is transformed into a three-dimensional (3D) arrangement containing spatial-temporal information. Then, a 3D deep convolutional neural network (DenseNet-3D) is used to extract temporal and spatial features of the neural representation for the attended locations. The results show that the proposed method achieves higher decoding accuracy than the state-of-the-art (SOTA) method (94.3% compared to XANet's 90.6%) with a 1-second decision window on the widely used KULeuven (KUL) dataset. The code to implement our work is available at https://github.com/xuxiran/ASAD_DenseNet.

FedDCSR: Federated Cross-domain Sequential Recommendation via Disentangled Representation Learning. (arXiv:2309.08420v7 [cs.LG] UPDATED)

Authors: Hongyu Zhang, Dongyi Zheng, Xu Yang, Jiyuan Feng, Qing Liao

Cross-domain Sequential Recommendation (CSR), which leverages user sequence data from multiple domains, has received extensive attention in recent years. However, existing CSR methods require sharing original user data across domains, which violates the General Data Protection Regulation (GDPR). Thus, it is necessary to combine federated learning (FL) and CSR to fully utilize knowledge from different domains while preserving data privacy. Nonetheless, the sequence feature heterogeneity across different domains significantly impacts the overall performance of FL. In this paper, we propose FedDCSR, a novel federated cross-domain sequential recommendation framework via disentangled representation learning. Specifically, to address the sequence feature heterogeneity across domains, we introduce an approach called inter-intra domain sequence representation disentanglement (SRD) to disentangle user sequence features into domain-shared and domain-exclusive features. In addition, we design an intra-domain contrastive infomax (CIM) strategy to learn richer domain-exclusive features of users by performing data augmentation on user sequences. Extensive experiments on three real-world scenarios demonstrate that FedDCSR achieves significant improvements over existing baselines.

Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits. (arXiv:2309.08748v3 [cs.LG] UPDATED)

Authors: Yi Shen, Pan Xu, Michael M. Zavlanos

Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data without direct interaction with the environment. Often, the environment in which the data are collected differs from the environment in which the learned policy is applied. To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed that compute worst-case bounds on the policy values assuming that the distribution of the new environment lies within an uncertainty set. Typically, this uncertainty set is defined based on the KL divergence around the empirical distribution computed from the logging dataset. However, the KL uncertainty set fails to encompass distributions with varying support and lacks awareness of the geometry of the distribution support. As a result, KL approaches fall short in addressing practical environment mismatches and lead to over-fitting to worst-case scenarios. To overcome these limitations, we propose a novel DRO approach that employs the Wasserstein distance instead. While Wasserstein DRO is generally computationally more expensive compared to KL DRO, we present a regularized method and a practical (biased) stochastic gradient descent method to optimize the policy efficiently. We also provide a theoretical analysis of the finite sample complexity and iteration complexity for our proposed method. We further validate our approach using a public dataset that was recorded in a randomized stroke trial.

Regularized Contrastive Pre-training for Few-shot Bioacoustic Sound Detection. (arXiv:2309.08971v2 [cs.SD] UPDATED)

Authors: Ilyass Moummad, Romain Serizel, Nicolas Farrugia

Bioacoustic sound event detection allows for a better understanding of animal behavior and better biodiversity monitoring using audio. Deep learning systems can help achieve this goal, but it is difficult to acquire sufficient annotated data to train these systems from scratch. To address this limitation, the Detection and Classification of Acoustic Scenes and Events (DCASE) community has recast the problem within the framework of few-shot learning and organizes an annual challenge for learning to detect animal sounds from only five annotated examples. In this work, we regularize supervised contrastive pre-training to learn features that transfer well to new target tasks with animal sounds unseen during training, achieving a high F-score of 61.52% (0.48) when no feature adaptation is applied, and an F-score of 68.19% (0.75) when we further adapt the learned features for each new target task. This work aims to lower the entry barrier to few-shot bioacoustic sound event detection by proposing a simple yet effective framework for this task and by providing open-source code.

Semantic similarity prediction is better than other semantic similarity measures. (arXiv:2309.12697v2 [cs.CL] UPDATED)

Authors: Steffen Herbold

Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task. Using a fine-tuned model for the Semantic Textual Similarity Benchmark tasks (STS-B) from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations on a robust semantic similarity measure than other approaches.
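
In practice, directly predicting similarity can be as simple as querying a cross-encoder fine-tuned on STS-B; the checkpoint name below is an assumption for illustration and may differ from the exact model used in the paper.

```python
# Predict semantic similarity with an STS-B fine-tuned cross-encoder.
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-base")  # assumed checkpoint
score = model.predict([("A man is playing a guitar.",
                        "Someone strums a guitar.")])[0]
print(score)  # similarity on the (normalized) STS-B scale
```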

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. (arXiv:2309.16042v2 [cs.LG] UPDATED)

Authors: Fred Zhang, Neel Nanda

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for the best practices of activation patching going forwards.
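
A minimal PyTorch rendering of the technique: cache one module's activation on a clean run, then substitute it during a corrupted run; the module choice, inputs, and downstream metric are deliberately left abstract.

```python
# Activation patching with forward hooks: a hook returning a tensor
# replaces the module's output in the ongoing forward pass.
import torch

def activation_patch(model, module, clean_x, corrupt_x):
    cache = {}
    handle = module.register_forward_hook(
        lambda m, inp, out: cache.__setitem__("act", out.detach()))
    with torch.no_grad():
        model(clean_x)                       # populate the cache
    handle.remove()
    handle = module.register_forward_hook(lambda m, inp, out: cache["act"])
    with torch.no_grad():
        patched = model(corrupt_x)           # corrupted run, clean activation
    handle.remove()
    return patched

net = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU(),
                          torch.nn.Linear(4, 2))
print(activation_patch(net, net[0], torch.randn(1, 4), torch.randn(1, 4)))
```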

Implicit Gaussian process representation of vector fields over arbitrary latent manifolds. (arXiv:2309.16746v2 [cs.LG] UPDATED)

Authors: Robert L. Peach, Matteo Vinao-Carl, Nir Grossman, Michael David, Emma Mallas, David Sharp, Paresh A. Malhotra, Pierre Vandergheynst, Adam Gosztolai

Gaussian processes (GPs) are popular nonparametric statistical models for learning unknown functions and quantifying the spatiotemporal uncertainty in data. Recent works have extended GPs to model scalar and vector quantities distributed over non-Euclidean domains, including smooth manifolds appearing in numerous fields such as computer vision, dynamical systems, and neuroscience. However, these approaches assume that the manifold underlying the data is known, limiting their practical utility. We introduce RVGP, a generalisation of GPs for learning vector signals over latent Riemannian manifolds. Our method uses positional encoding with eigenfunctions of the connection Laplacian, associated with the tangent bundle, readily derived from common graph-based approximation of data. We demonstrate that RVGP possesses global regularity over the manifold, which allows it to super-resolve and inpaint vector fields while preserving singularities. Furthermore, we use RVGP to reconstruct high-density neural dynamics derived from low-density EEG recordings in healthy individuals and Alzheimer's patients. We show that vector field singularities are important disease markers and that their reconstruction leads to a comparable classification accuracy of disease states to high-density recordings. Thus, our method overcomes a significant practical limitation in experimental and clinical applications.

Combining Spatial and Temporal Abstraction in Planning for Better Generalization. (arXiv:2310.00229v2 [cs.AI] UPDATED)

Authors: Mingde Zhao, Safa Alver, Harm van Seijen, Romain Laroche, Doina Precup, Yoshua Bengio

Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning agent utilizing spatio-temporal abstractions to generalize learned skills in novel situations. It automatically decomposes the given task into smaller, more manageable subtasks, and hence enables sparse decision-making and focused computation on the relevant parts of the environment. This relies on the extraction of an abstracted proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end from hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper's significant advantage in zero-shot generalization, compared to existing state-of-the-art hierarchical planning methods.

UniPredict: Large Language Models are Universal Tabular Classifiers. (arXiv:2310.03266v2 [cs.LG] UPDATED)

Authors: Ruiyu Wang, Zifeng Wang, Jimeng Sun

Tabular data prediction is a fundamental machine learning task for many applications. Existing methods predominantly employ discriminative modeling and operate under the assumption of a fixed target column, necessitating re-training for every new predictive task. Inspired by the generative power of large language models (LLMs), this paper exploits the idea of building universal tabular data predictors based on generative modeling, namely UniPredict. Here, we demonstrate the scalability of an LLM to extensive tabular datasets, enabling it to comprehend diverse tabular inputs and predict target variables following the provided instructions. Specifically, we train a single LLM on an aggregation of 169 tabular datasets with diverse targets and compare its performance against baselines trained on each dataset separately. We observe that this versatile UniPredict model demonstrates an advantage over other models, ranging from 5.4% to 13.4% when compared with the best tree-boosting baseline and the best neural network baseline, respectively. We further test UniPredict in few-shot learning settings on another 62 tabular datasets. Our method achieves strong performance in quickly adapting to new tasks. In the low-resource few-shot setup, we observe a performance advantage of over 100% compared with XGBoost, and a significant margin over all baselines. We envision that UniPredict sheds light on developing a universal tabular data prediction system that learns from data at scale and serves a wide range of prediction tasks.

Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning. (arXiv:2310.03838v2 [cs.LG] UPDATED)

Authors: Harsh Chaudhari, Giorgio Severi, Alina Oprea, Jonathan Ullman

The integration of machine learning (ML) in numerous critical applications introduces a range of privacy concerns for individuals who provide their datasets for model training. One such privacy risk is Membership Inference (MI), in which an attacker seeks to determine whether a particular data sample was included in the training dataset of a model. Current state-of-the-art MI attacks capitalize on access to the model's predicted confidence scores to successfully perform membership inference, and employ data poisoning to further enhance their effectiveness. In this work, we focus on the less explored and more realistic label-only setting, where the model provides only the predicted label on a queried sample. We show that existing label-only MI attacks are ineffective at inferring membership in the low False Positive Rate (FPR) regime. To address this challenge, we propose a new attack Chameleon that leverages a novel adaptive data poisoning strategy and an efficient query selection method to achieve significantly more accurate membership inference than existing label-only attacks, especially at low FPRs.

Balancing stability and plasticity in continual learning: the readout-decomposition of activation change (RDAC) framework. (arXiv:2310.04741v4 [cs.LG] UPDATED)

Authors: Daniel Anthes, Sushrut Thorat, Peter König, Tim C. Kietzmann

Continual learning (CL) algorithms strive to acquire new knowledge while preserving prior information. However, this stability-plasticity trade-off remains a central challenge. This paper introduces a framework that dissects this trade-off, offering valuable insights into CL algorithms. The Readout-Decomposition of Activation Change (RDAC) framework first addresses the stability-plasticity dilemma and its relation to catastrophic forgetting. It relates learning-induced activation changes in the range of prior readouts to the degree of stability, and changes in the null space to the degree of plasticity. In deep non-linear networks tackling split-CIFAR-110 tasks, the framework clarifies the stability-plasticity trade-offs of the popular regularization algorithms Synaptic Intelligence (SI), Elastic Weight Consolidation (EWC), and Learning without Forgetting (LwF), and the replay-based algorithms Gradient Episodic Memory (GEM) and data replay. GEM and data replay preserved both stability and plasticity, while SI, EWC, and LwF traded off plasticity for stability. The inability of the regularization algorithms to maintain plasticity was linked to them restricting the change of activations in the null space of the prior readout. Additionally, for one-hidden-layer linear neural networks, we derived a gradient decomposition algorithm that restricts activation change only in the range of the prior readouts, maintaining high stability without further sacrificing plasticity. Results demonstrate that the algorithm maintained stability without significant plasticity loss. The RDAC framework informs the behavior of existing CL algorithms and paves the way for novel CL approaches. Finally, it sheds light on the connection between learning-induced activation/representation changes and the stability-plasticity dilemma, also offering insights into representational drift in biological systems.
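
The core split is plain linear algebra: the part of an activation change that the prior readout can see lies in the row space of the readout matrix (stability), while the part it cannot see lies in its null space (plasticity). A minimal numpy sketch, assuming a linear readout W:

    import numpy as np

    def rdac_split(delta_h, W):
        # W: (outputs, hidden) prior readout; delta_h: (hidden,) activation change.
        # A @ pinv(A) projects onto the column space of A, so W.T @ pinv(W.T)
        # projects onto the row space of W.
        P = W.T @ np.linalg.pinv(W.T)
        range_part = P @ delta_h           # visible to prior readouts (stability)
        null_part = delta_h - range_part   # invisible to prior readouts (plasticity)
        return range_part, null_part

    W = np.random.default_rng(0).normal(size=(10, 100))
    dh = np.random.default_rng(1).normal(size=100)
    stable, plastic = rdac_split(dh, W)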

Post-hoc Bias Scoring Is Optimal For Fair Classification. (arXiv:2310.05725v2 [stat.ML] UPDATED)

Authors: Wenlong Chen, Yegor Klochkov, Yang Liu

We consider a binary classification problem under group fairness constraints, which can be one of Demographic Parity (DP), Equalized Opportunity (EOp), or Equalized Odds (EO). We propose an explicit characterization of the Bayes optimal classifier under the fairness constraints, which turns out to be a simple modification rule of the unconstrained classifier. Namely, we introduce a novel instance-level measure of bias, which we call the bias score, and the modification rule is a simple linear rule on top of a finite number of bias scores. Based on this characterization, we develop a post-hoc approach that allows us to adapt to fairness constraints while maintaining high accuracy. In the case of DP and EOp constraints, the modification rule is thresholding a single bias score, while in the case of EO constraints we are required to fit a linear modification rule with 2 parameters. The method can also be applied to composite group-fairness criteria, such as ones involving several sensitive attributes.
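
For intuition, the DP case reduces to thresholding one score per instance. A toy sketch of post-hoc per-group thresholding for Demographic Parity, assuming predicted scores and group labels are available (the paper's bias score comes from the Bayes-optimal characterization, which this sketch does not reproduce):

    import numpy as np

    def dp_thresholds(scores, groups, target_rate):
        # Pick each group's cutoff so that its positive-prediction rate
        # matches target_rate, equalizing rates across groups.
        return {g: np.quantile(scores[groups == g], 1.0 - target_rate)
                for g in np.unique(groups)}

    def predict(scores, groups, thresholds):
        cuts = np.array([thresholds[g] for g in groups])
        return (scores >= cuts).astype(int)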

Efficient Adaptation of Large Vision Transformer via Adapter Re-Composing. (arXiv:2310.06234v2 [cs.CV] UPDATED)

Authors: Wei Dong, Dawei Yan, Zhijun Lin, Peng Wang

The advent of high-capacity pre-trained models has revolutionized problem-solving in computer vision, shifting the focus from training task-specific models to adapting pre-trained models. Consequently, effectively adapting large pre-trained models to downstream tasks in an efficient manner has become a prominent research area. Existing solutions primarily concentrate on designing lightweight adapters and their interaction with pre-trained models, with the goal of minimizing the number of parameters requiring updates. In this study, we propose a novel Adapter Re-Composing (ARC) strategy that addresses efficient pre-trained model adaptation from a fresh perspective. Our approach considers the reusability of adaptation parameters and introduces a parameter-sharing scheme. Specifically, we leverage symmetric down-/up-projections to construct bottleneck operations, which are shared across layers. By learning low-dimensional re-scaling coefficients, we can effectively re-compose layer-adaptive adapters. This parameter-sharing strategy in adapter design allows us to significantly reduce the number of new parameters while maintaining satisfactory performance, thereby offering a promising approach to compress the adaptation cost. We conduct experiments on 24 downstream image classification tasks using various Vision Transformer variants to evaluate our method. The results demonstrate that our approach achieves compelling transfer learning performance with a reduced parameter count. Our code is available at \href{https://github.com/DavidYanAnDe/ARC}{https://github.com/DavidYanAnDe/ARC}.
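
A PyTorch sketch of the parameter-sharing arithmetic: one down-projection shared by every layer, its transpose reused as the up-projection, and only a low-dimensional re-scaling vector learned per layer. The residual connection, initialization, and placement inside the Transformer block are assumptions here; the sketch only shows how the parameter count shrinks:

    import torch
    import torch.nn as nn

    d, r, num_layers = 768, 8, 12
    shared_down = nn.Parameter(torch.randn(d, r) * 0.02)  # shared by all layers

    class ARCAdapter(nn.Module):
        def __init__(self, down):
            super().__init__()
            self.down = down                          # shared bottleneck projection
            self.scale = nn.Parameter(torch.ones(r))  # layer-specific coefficients

        def forward(self, x):                         # x: (..., d)
            z = (x @ self.down) * self.scale          # down-project, then re-scale
            return x + z @ self.down.t()              # symmetric up-projection

    adapters = nn.ModuleList(ARCAdapter(shared_down) for _ in range(num_layers))
    # New parameters: d*r shared + r per layer, versus roughly 2*d*r per layer
    # for a standard bottleneck adapter.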

Learning from Label Proportions: Bootstrapping Supervised Learners via Belief Propagation. (arXiv:2310.08056v3 [cs.LG] UPDATED)

Authors: Shreyas Havaldar, Navodita Sharma, Shubhi Sareen, Karthikeyan Shanmugam, Aravindan Raghuveer

Learning from Label Proportions (LLP) is a learning problem where only aggregate-level labels are available for groups of instances, called bags, during training, and the aim is to achieve the best performance at the instance level on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. For the first step (Pseudo Labeling) in every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information through the constraint that instances with similar covariates should have similar labels and b) the bag-level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. Further, we iterate on the two steps again by using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines (up to 15%) for the LLP binary classification problem on various dataset types, both tabular and image. The computational overhead of Belief Propagation above standard supervised learning remains minimal, even for large bag sizes and datasets of up to a million samples.
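
For intuition, one plausible form of this Gibbs distribution, with covariate-similarity weights $w_{ij}$ and bag-level aggregate $B$ (an illustrative assumption; the paper's exact energy may differ), is $p(\mathbf{y}) \propto \exp\big(\beta \sum_{i,j} w_{ij}\,\mathbf{1}[y_i = y_j] - \lambda (\sum_i y_i - B)^2\big)$ with $y_i \in \{0,1\}$, whose per-instance marginals $p(y_i = 1)$, approximated by Belief Propagation, yield the pseudo labels.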

Generalizing Medical Image Representations via Quaternion Wavelet Networks. (arXiv:2310.10224v3 [eess.IV] UPDATED)

Authors: Luigi Sigillo, Eleonora Grassucci, Aurelio Uncini, Danilo Comminiello

Neural network generalizability is becoming a broad research field due to the increasing availability of datasets from different sources and for various tasks. This issue is even wider when processing medical data, where a lack of methodological standards causes large variations across data provided by different imaging centers or acquired with various devices and cofactors. To overcome these limitations, we introduce a novel, generalizable, data- and task-agnostic framework able to extract salient features from medical images. The proposed quaternion wavelet network (QUAVE) can be easily integrated with any pre-existing medical image analysis or synthesis task, and it can be combined with real, quaternion, or hypercomplex-valued models, generalizing their adoption to single-channel data. QUAVE first extracts different sub-bands through the quaternion wavelet transform, resulting in both low-frequency/approximation bands and high-frequency/fine-grained features. Then, it weighs the most representative set of sub-bands to be used as input to any other neural model for image processing, replacing standard data samples. We conduct an extensive experimental evaluation comprising different datasets and diverse image analysis and synthesis tasks, including reconstruction, segmentation, and modality translation. We also evaluate QUAVE in combination with both real and quaternion-valued models. Results demonstrate the effectiveness and the generalizability of the proposed framework, which improves network performance while being flexible enough to be adopted in manifold scenarios and robust to domain shifts. The full code is available at: https://github.com/ispamm/QWT.

Understanding Addition in Transformers. (arXiv:2310.13121v5 [cs.LG] UPDATED)

Authors: Philip Quirke, Fazl Barez

Understanding the inner workings of machine learning models like Transformers is vital for their safe and ethical use. This paper presents an in-depth analysis of a one-layer Transformer model trained for n-digit integer addition. We reveal that the model divides the task into parallel, digit-specific streams and employs distinct algorithms for different digit positions. Our study also finds that the model starts calculations late but executes them rapidly. A rare use case with high loss is identified and explained. Overall, the model's algorithm is explained in detail. These findings are validated through rigorous testing and mathematical modeling, contributing to the broader works in Mechanistic Interpretability, AI safety, and alignment. Our approach opens the door for analyzing more complex tasks and multi-layer Transformer models.

Stochastic Thermodynamics of Learning Parametric Probabilistic Models. (arXiv:2310.19802v5 [cs.LG] UPDATED)

Authors: Shervin Sadat Parsi

We formulate a family of machine learning problems as the time evolution of Parametric Probabilistic Models (PPMs), which inherently renders learning a thermodynamic process. Our primary motivation is to leverage the rich toolbox of the thermodynamics of information to assess the information-theoretic content of learning a probabilistic model. We first introduce two information-theoretic metrics, Memorized-information (M-info) and Learned-information (L-info), which trace the flow of information during the learning process of PPMs. Then, we demonstrate that the accumulation of L-info during the learning process is associated with entropy production, and that the parameters serve as a heat reservoir in this process, capturing learned information in the form of M-info.

Patch-Based Deep Unsupervised Image Segmentation using Graph Cuts. (arXiv:2311.01475v2 [cs.CV] UPDATED)

Authors: Isaac Wasserman, Jeova Farias Sales Rocha Neto

Unsupervised image segmentation aims at grouping different semantic patterns in an image without the use of human annotation. Similarly, image clustering searches for groupings of images based on their semantic content without supervision. Classically, both problems have captivated researchers as they drew from sound mathematical concepts to produce concrete applications. With the emergence of deep learning, the scientific community turned its attention to complex neural network-based solvers that achieved impressive results in those domains but rarely leveraged the advances made by classical methods. In this work, we propose a patch-based unsupervised image segmentation strategy that bridges advances in unsupervised feature extraction from deep clustering methods with the algorithmic help of classical graph-based methods. We show that a simple convolutional neural network, trained to classify image patches and iteratively regularized using graph cuts, naturally leads to a state-of-the-art fully-convolutional unsupervised pixel-level segmenter. Furthermore, we demonstrate that this is the ideal setting for leveraging the patch-level pairwise features generated by vision transformer models. Our results on real image data demonstrate the effectiveness of our proposed methodology.

Efficient Generalized Low-Rank Tensor Contextual Bandits. (arXiv:2311.01771v3 [cs.LG] UPDATED)

Authors: Qianxin Yi, Yiyang Yang, Shaojie Tang, Jiapeng Liu, Yao Wang

In this paper, we aim to build a novel bandits algorithm that is capable of fully harnessing the power of multi-dimensional data and the inherent non-linearity of reward functions to provide highly usable and accountable decision-making services. To this end, we introduce a generalized low-rank tensor contextual bandits model in which an action is formed from three feature vectors and thus can be represented by a tensor. In this formulation, the reward is determined through a generalized linear function applied to the inner product of the action's feature tensor and a fixed but unknown parameter tensor with a low tubal rank. To effectively achieve the trade-off between exploration and exploitation, we introduce a novel algorithm called "Generalized Low-Rank Tensor Exploration Subspace then Refine" (G-LowTESTR). This algorithm first collects raw data to explore the intrinsic low-rank tensor subspace information embedded in the decision-making scenario, and then converts the original problem into an almost lower-dimensional generalized linear contextual bandits problem. Rigorous theoretical analysis shows that the regret bound of G-LowTESTR is superior to those in the vectorization and matricization cases. We conduct a series of simulations and real-data experiments to further highlight the effectiveness of G-LowTESTR, leveraging its ability to capitalize on the low-rank tensor structure for enhanced learning.

FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification. (arXiv:2311.10359v3 [cs.DC] UPDATED)

Authors: Wenqing Wu

Highly parallelized workloads like machine learning training, inference, and general HPC tasks are greatly accelerated by GPU devices. In a cloud computing cluster, serving a GPU's computation power through multi-task sharing is in high demand, since there are always more task requests than the number of GPUs available. Existing GPU sharing solutions focus on reducing task-level waiting time or task-level switching costs when multiple jobs compete for a single GPU. Incoming computation requests arrive with different priorities, which have asymmetric impacts on the QoS of sharing a GPU device. Existing work misses the kernel-level optimization opportunity brought by this setting. To address this problem, we present a novel kernel-level scheduling strategy called FIKIT: Filling Inter-kernel Idle Time. FIKIT incorporates task-level priority information, fine-grained kernel identification, and kernel measurement, allowing low-priority tasks to execute during high-priority tasks' inter-kernel idle time. It thereby fills the GPU's device runtime more fully and reduces the overall impact of GPU sharing on cloud services. Across a set of ML models, the FIKIT-based inference system accelerated high-priority tasks by 1.33 to 14.87 times compared to the JCT in GPU sharing mode, and more than half of the cases were accelerated by more than 3.5 times. Alternatively, under preemptive sharing, the low-priority tasks have a JCT comparable to the default GPU sharing mode, with a ratio of 0.84 to 1. We further limit the combined overhead of kernel measurement and runtime fine-grained kernel scheduling to less than 10%.

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning. (arXiv:2311.12023v2 [cs.CL] UPDATED)

Authors: Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim

We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and enables aggressive quantization to sub-3 bits with only minor performance degradations. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression; in this setting our 2.75-bit LLaMA-2-70B model (which has 2.85 bits on average when including the low-rank components and requires 27GB of GPU memory) performs respectably compared to the 16-bit baseline.
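
A toy numpy sketch of the alternating decomposition $W \approx Q + L$: quantize the residual, then refit the low-rank part by truncated SVD. The uniform quantizer below is a stand-in; the paper uses NormalFloat-style block quantization whose configuration is chosen per matrix via integer linear programming:

    import numpy as np

    def quantize(W, bits=3):
        # Toy symmetric uniform quantizer (placeholder for NF quantization).
        step = np.abs(W).max() / (2 ** (bits - 1) - 1)
        return np.round(W / step) * step

    def lq_decompose(W, rank=16, iters=10):
        L = np.zeros_like(W)
        for _ in range(iters):
            Q = quantize(W - L)                       # frozen during finetuning
            U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
            L = (U[:, :rank] * S[:rank]) @ Vt[:rank]  # trainable low-rank part
        return Q, L

    W = np.random.default_rng(0).normal(size=(256, 256))
    Q, L = lq_decompose(W)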

Dynamic Fault Characteristics Evaluation in Power Grid. (arXiv:2311.16522v3 [cs.LG] UPDATED)

Authors: Hao Pei, Si Lin, Chuanfu Li, Che Wang, Haoming Chen, Sizhe Li

To enhance the degree of intelligence in operation and maintenance, a novel method for fault detection in power grids is proposed. The proposed GNN-based approach first identifies fault nodes through a specialized feature extraction method coupled with a knowledge graph. By incorporating temporal data, the method leverages the status of nodes from preceding and subsequent time periods to aid current fault detection. To validate the effectiveness of the node features, a correlation analysis of the output features from each node was conducted. The experimental results show that this method can locate fault nodes in simulation scenarios with remarkable accuracy. Additionally, the graph neural network based feature modeling allows for a qualitative examination of how faults spread across nodes, providing valuable insights for analyzing fault nodes.

A Comparison Between Invariant and Equivariant Classical and Quantum Graph Neural Networks. (arXiv:2311.18672v2 [quant-ph] UPDATED)

Authors: Roy T. Forestano, Marçal Comajoan Cara, Gopal Ramesh Dahale, Zhongtian Dong, Sergei Gleyzer, Daniel Justice, Kyoungchul Kong, Tom Magorsch, Konstantin T. Matchev, Katia Matcheva, Eyup B. Unlu

Machine learning algorithms are heavily relied on to understand the vast amounts of data from high-energy particle collisions at the CERN Large Hadron Collider (LHC). The data from such collision events can naturally be represented with graph structures. Therefore, deep geometric methods, such as graph neural networks (GNNs), have been leveraged for various data analysis tasks in high-energy physics. One typical task is jet tagging, where jets are viewed as point clouds with distinct features and edge connections between their constituent particles. The increasing size and complexity of the LHC particle datasets, as well as the computational models used for their analysis, greatly motivate the development of alternative fast and efficient computational paradigms such as quantum computation. In addition, to enhance the validity and robustness of deep networks, one can leverage the fundamental symmetries present in the data through the use of invariant inputs and equivariant layers. In this paper, we perform a fair and comprehensive comparison between classical graph neural networks (GNNs) and equivariant graph neural networks (EGNNs) and their quantum counterparts: quantum graph neural networks (QGNNs) and equivariant quantum graph neural networks (EQGNNs). The four architectures were benchmarked on a binary classification task to classify the parton-level particle initiating the jet. Based on their AUC scores, the quantum networks were shown to outperform the classical networks. However, seeing the computational advantage of the quantum networks in practice may have to wait for the further development of quantum technology and its associated APIs.

Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play. (arXiv:2312.04118v2 [cs.CV] UPDATED)

Authors: Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch

Infants' ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in suitable contexts for word learning like dyadic play sessions, caregivers' utterances are sparse and ambiguous, often referring to objects that are different from the one to which the child attends. Here, we systematically investigate to what extent caregivers' utterances can nevertheless enhance visual representations. To this end, we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of ego-centric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers' utterances, modeled as captions. We propose to model toddlers' learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition. Our analysis reveals that a small decrease/increase in object-relevant naming frequencies can drastically impact the learned representations. This affects the attention on object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers' naming utterances can improve toddlers' visual representations.

CLadder: Assessing Causal Reasoning in Language Models. (arXiv:2312.04350v3 [cs.CL] UPDATED)

Authors: Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, Bernhard Schölkopf

The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insights into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training. (arXiv:2312.08846v2 [cs.LG] UPDATED)

Authors: Chaoya Jiang, Wei ye, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang

Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noise in web-harvested text-image pairs, however, scaling up the training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix achieves performance comparable to existing methods on downstream tasks, even with a reduced amount of training data and shorter training time. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios.

Language Modeling on a SpiNNaker 2 Neuromorphic Chip. (arXiv:2312.09084v2 [cs.NE] UPDATED)

Authors: Khaleelulla Khan Nazeer, Mark Schöne, Rishav Mukherji, Bernhard Vogginger, Christian Mayr, David Kappel, Anand Subramoney

As large language models continue to scale in size rapidly, so too does the computational power required to run them. Event-based networks on neuromorphic devices offer a potential way to reduce energy consumption for inference significantly. However, to date, most event-based networks that can run on neuromorphic hardware, including spiking neural networks (SNNs), have not achieved task performance even on par with LSTM models for language modeling. As a result, language modeling on neuromorphic devices has seemed a distant prospect. In this work, we demonstrate the first-ever implementation of a language model on a neuromorphic device - specifically the SpiNNaker 2 chip - based on a recently published event-based architecture called the EGRU. SpiNNaker 2 is a many-core neuromorphic chip designed for large-scale asynchronous processing, while the EGRU is architected to leverage such hardware efficiently while maintaining competitive task performance. This implementation marks the first time a neuromorphic language model matches LSTMs, setting the stage for taking task performance to the level of large language models. We also demonstrate results on a gesture recognition task based on inputs from a DVS camera. Overall, our results showcase the feasibility of this neuro-inspired neural network in hardware, highlighting significant gains versus conventional hardware in energy efficiency for the common use case of single batch inference.

Beyond Empirical Windowing: An Attention-Based Approach for Trust Prediction in Autonomous Vehicles. (arXiv:2312.10209v2 [cs.HC] UPDATED)

Authors: Minxue Niu, Zhaobo Zheng, Kumar Akash, Teruhisa Misu

Humans' internal states play a key role in human-machine interaction, leading to the rise of human state estimation as a prominent field. Compared to swift state changes such as surprise and irritation, modeling gradual states like trust and satisfaction is further challenged by label sparsity: long time-series signals are usually associated with a single label, making it difficult to identify the critical span of state shifts. Windowing has been one widely used technique to enable localized analysis of long time-series data. However, the performance of downstream models can be sensitive to the window size, and determining the optimal window size demands domain expertise and extensive search. To address this challenge, we propose a Selective Windowing Attention Network (SWAN), which employs window prompts and masked attention transformation to enable the selection of attended intervals with flexible lengths. We evaluate SWAN on the task of trust prediction on a new multimodal driving simulation dataset. Experiments show that SWAN significantly outperforms an existing empirical window selection baseline and neural network baselines including CNN-LSTM and Transformer. Furthermore, it shows robustness across a wide span of windowing ranges, compared to the traditional windowing approach.

Self-Supervised Learning for Few-Shot Bird Sound Classification. (arXiv:2312.15824v3 [cs.SD] UPDATED)

Authors: Ilyass Moummad, Romain Serizel, Nicolas Farrugia

Self-supervised learning (SSL) in audio holds significant potential across various domains, particularly in situations where abundant, unlabeled data is readily available at no cost. This is particularly pertinent in bioacoustics, where biologists routinely collect extensive sound datasets from the natural environment. In this study, we demonstrate that SSL is capable of acquiring meaningful representations of bird sounds from audio recordings without the need for annotations. Our experiments showcase that these learned representations exhibit the capacity to generalize to new bird species in few-shot learning (FSL) scenarios. Additionally, we show that selecting windows with high bird activation for self-supervised learning, using a pretrained audio neural network, significantly enhances the quality of the learned representations.

Efficient Reinforcement Learning via Decoupling Exploration and Utilization. (arXiv:2312.15965v3 [cs.LG] UPDATED)

Authors: Jingpu Yang, Qirui Zhao, Helin Wang, Yuxiao Huang, Zirui Song, Miao Fang

Deep neural network (DNN) generalization is limited by the over-reliance of current offline reinforcement learning techniques on conservative processing of existing datasets. This method frequently results in algorithms that settle for suboptimal solutions that only adjust to a certain dataset. Similarly, in online reinforcement learning, the previously imposed punitive pessimism also deprives the model of its exploratory potential. Our research proposes a novel framework, Optimistic and Pessimistic Actor Reinforcement Learning (OPARL). OPARL employs a unique dual-actor approach: an optimistic actor dedicated to exploration and a pessimistic actor focused on utilization, thereby effectively differentiating between exploration and utilization strategies. This unique combination in reinforcement learning methods fosters a more balanced and efficient approach. It enables the optimization of policies that focus on actions yielding high rewards through pessimistic utilization strategies, while also ensuring extensive state coverage via optimistic exploration. Experiments and theoretical analysis demonstrate that OPARL improves agents' capacities for utilization and exploration. In most tasks of the DMControl benchmark and the MuJoCo environment, OPARL performed better than state-of-the-art methods. Our code has been released at https://github.com/yydsok/OPARL
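
A common way to instantiate the two views over a critic ensemble is to use the pessimistic estimate (minimum) for the exploitation actor and the optimistic estimate (maximum) for the exploration actor; whether OPARL uses exactly this estimator is an assumption here:

    import torch

    def dual_actor_targets(q1, q2):
        # q1, q2: value estimates from two critics for the same state-action.
        pessimistic = torch.minimum(q1, q2)  # target for the utilization actor
        optimistic = torch.maximum(q1, q2)   # target for the exploration actor
        return pessimistic, optimistic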

Matching of Users and Creators in Two-Sided Markets with Departures. (arXiv:2401.00313v2 [cs.GT] UPDATED)

Authors: Daniel Huttenlocher, Hannah Li, Liang Lyu, Asuman Ozdaglar, James Siderius

Many online platforms of today, including social media sites, are two-sided markets bridging content creators and users. Most of the existing literature on platform recommendation algorithms largely focuses on user preferences and decisions, and does not simultaneously address creator incentives. We propose a model of content recommendation that explicitly focuses on the dynamics of user-content matching, with the novel property that both users and creators may leave the platform permanently if they do not experience sufficient engagement. In our model, each player decides to participate at each time step based on utilities derived from the current match: users based on alignment of the recommended content with their preferences, and creators based on their audience size. We show that a user-centric greedy algorithm that does not consider creator departures can result in arbitrarily poor total engagement, relative to an algorithm that maximizes total engagement while accounting for two-sided departures. Moreover, in stark contrast to the case where only users or only creators leave the platform, we prove that with two-sided departures, approximating maximum total engagement within any constant factor is NP-hard. We present two practical algorithms, one with performance guarantees under mild assumptions on user preferences, and another that tends to outperform algorithms that ignore two-sided departures in practice.

Multi-Lattice Sampling of Quantum Field Theories via Neural Operator-based Flows. (arXiv:2401.00828v3 [cs.LG] UPDATED)

Authors: Bálint Máté, François Fleuret

We consider the problem of sampling discrete field configurations $\phi$ from the Boltzmann distribution $[d\phi] Z^{-1} e^{-S[\phi]}$, where $S$ is the lattice-discretization of the continuous Euclidean action $\mathcal S$ of some quantum field theory. Since such densities arise as the approximation of the underlying functional density $[\mathcal D\phi(x)] \mathcal Z^{-1} e^{-\mathcal S[\phi(x)]}$, we frame the task as an instance of operator learning. In particular, we propose to approximate a time-dependent operator $\mathcal V_t$ whose time integral provides a mapping between the functional distributions of the free theory $[\mathcal D\phi(x)] \mathcal Z_0^{-1} e^{-\mathcal S_{0}[\phi(x)]}$ and of the target theory $[\mathcal D\phi(x)]\mathcal Z^{-1}e^{-\mathcal S[\phi(x)]}$. Whenever a particular lattice is chosen, the operator $\mathcal V_t$ can be discretized to a finite dimensional, time-dependent vector field $V_t$ which in turn induces a continuous normalizing flow between finite dimensional distributions over the chosen lattice. This flow can then be trained to be a diffeomorphism between the discretized free and target theories $[d\phi] Z_0^{-1} e^{-S_{0}[\phi]}$, $[d\phi] Z^{-1}e^{-S[\phi]}$. We run experiments on the $\phi^4$-theory to explore to what extent such operator-based flow architectures generalize to lattice sizes they were not trained on, and show that pretraining on smaller lattices can lead to a speedup over training on the target lattice size alone.

AIRI: Predicting Retention Indices and their Uncertainties using Artificial Intelligence. (arXiv:2401.01506v2 [cs.LG] UPDATED)

Authors: Lewis Y. Geer, Stephen E. Stein, William Gary Mallard, Douglas J. Slotta

The Kov\'ats Retention index (RI) is a quantity measured using gas chromatography and commonly used in the identification of chemical structures. Creating libraries of observed RI values is a laborious task, so we explore the use of a deep neural network for predicting RI values from structure for standard semipolar columns. This network generated predictions with a mean absolute error of 15.1 and, in a quantification of the tail of the error distribution, a 95th percentile absolute error of 46.5. Because of the Artificial Intelligence Retention Indices (AIRI) network's accuracy, it was used to predict RI values for the NIST EI-MS spectral libraries. These RI values are used to improve chemical identification methods and the quality of the library. Estimating uncertainty is an important practical need when using prediction models. To quantify the uncertainty of our network for each individual prediction, we used the outputs of an ensemble of 8 networks to calculate a predicted standard deviation for each RI value prediction. This predicted standard deviation was corrected to follow the error between observed and predicted RI values. The Z scores using these predicted standard deviations had a standard deviation of 1.52 and a 95th percentile absolute Z score corresponding to a mean RI value of 42.6.
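
A numpy sketch of the ensemble-based uncertainty recipe with stand-in numbers: the per-prediction standard deviation comes from the spread of the 8 networks and is then calibrated against observed errors (the linear calibration coefficients below are hypothetical placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    preds = rng.normal(1500.0, 10.0, size=(8, 100))  # stand-in: 8 nets x 100 RIs

    ri_hat = preds.mean(axis=0)              # ensemble prediction per compound
    sigma = preds.std(axis=0, ddof=1)        # raw predicted standard deviation
    sigma_cal = 1.3 * sigma + 2.0            # toy calibration against observed errors
    ri_obs = rng.normal(ri_hat, sigma_cal)   # stand-in observed RI values
    z = (ri_obs - ri_hat) / sigma_cal        # well-calibrated if std(z) is near 1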

Approximating Numerical Fluxes Using Fourier Neural Operators for Hyperbolic Conservation Laws. (arXiv:2401.01783v3 [math.NA] UPDATED)

Authors: Taeyoung Kim, Myungjoo Kang

Traditionally, classical numerical schemes have been employed to solve partial differential equations (PDEs) using computational methods. Recently, neural network-based methods have emerged. Despite these advancements, neural network-based methods, such as physics-informed neural networks (PINNs) and neural operators, exhibit deficiencies in robustness and generalization. To address these issues, numerous studies have integrated classical numerical frameworks with machine learning techniques, incorporating neural networks into parts of traditional numerical methods. In this study, we focus on hyperbolic conservation laws by replacing traditional numerical fluxes with neural operators. To this end, we developed loss functions inspired by established numerical schemes related to conservation laws and approximated numerical fluxes using Fourier neural operators (FNOs). Our experiments demonstrated that our approach combines the strengths of both traditional numerical schemes and FNOs, outperforming standard FNO methods in several respects. For instance, we demonstrate that our method is robust, has resolution invariance, and is feasible as a data-driven method. In particular, our method can make continuous predictions over time and exhibits superior generalization capabilities with out-of-distribution (OOD) samples, which are challenges that existing neural operator methods encounter.
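
The scheme being learned is the standard conservative finite-volume update. A numpy sketch in which a hand-written Lax-Friedrichs flux for Burgers' equation stands in for the trained FNO flux:

    import numpy as np

    def fv_step(u, num_flux, dt, dx):
        # u_i <- u_i - dt/dx * (F_{i+1/2} - F_{i-1/2}), periodic boundary;
        # num_flux(u_i, u_{i+1}) approximates the flux at interface i+1/2.
        F = num_flux(u, np.roll(u, -1))
        return u - dt / dx * (F - np.roll(F, 1))

    def lax_friedrichs(ul, ur, c=1.0):       # placeholder for the neural flux
        f = lambda v: 0.5 * v ** 2           # Burgers flux f(u) = u^2 / 2
        return 0.5 * (f(ul) + f(ur)) - 0.5 * c * (ur - ul)

    x = np.linspace(0.0, 1.0, 200, endpoint=False)
    u = np.sin(2 * np.pi * x)
    u = fv_step(u, lax_friedrichs, dt=1e-3, dx=1.0 / 200)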

The Rise of Diffusion Models in Time-Series Forecasting. (arXiv:2401.03006v2 [cs.LG] UPDATED)

Authors: Caspar Meijer, Lydia Y. Chen

This survey delves into the application of diffusion models in time-series forecasting. Diffusion models are demonstrating state-of-the-art results in various fields of generative AI. The paper includes comprehensive background information on diffusion models, detailing their conditioning methods and reviewing their use in time-series forecasting. The analysis covers 11 specific time-series implementations, the intuition and theory behind them, their effectiveness on different datasets, and a comparison among them. Key contributions of this work are the thorough exploration of diffusion models' applications in time-series forecasting and a chronologically ordered overview of these models. Additionally, the paper offers an insightful discussion on the current state-of-the-art in this domain and outlines potential future research directions. This serves as a valuable resource for researchers in AI and time-series analysis, offering a clear view of the latest advancements and future potential of diffusion models.

DiarizationLM: Speaker Diarization Post-Processing with Large Language Models. (arXiv:2401.03506v2 [eess.AS] UPDATED)

Authors: Quan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, Hank Liao

In this paper, we introduce DiarizationLM, a framework to leverage large language models (LLMs) to post-process the outputs from a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented as a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by rel. 55.5% on the Fisher telephone conversation dataset, and rel. 44.9% on the Callhome English dataset.
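
A Python sketch of the compact textual serialization step, assuming a hypothetical speaker-tag syntax (the paper's exact format may differ):

    def to_compact_text(words, speakers):
        # Merge ASR words with diarization labels, emitting a speaker tag
        # only when the speaker changes.
        out, prev = [], None
        for w, s in zip(words, speakers):
            if s != prev:
                out.append(f"<spk:{s}>")
                prev = s
            out.append(w)
        return " ".join(out)

    print(to_compact_text(["good", "morning", "hi", "there"], [1, 1, 2, 2]))
    # -> "<spk:1> good morning <spk:2> hi there"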

Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series. (arXiv:2401.03955v3 [cs.LG] UPDATED)

Authors: Vijay Ekambaram, Arindam Jati, Nam H. Nguyen, Pankaj Dayama, Chandra Reddy, Wesley M. Gifford, Jayant Kalagnanam

Large pre-trained models for zero/few-shot learning excel in language and vision domains but encounter challenges in multivariate time series (TS) due to the diverse nature and scarcity of publicly available pre-training data. Consequently, there has been a recent surge in utilizing pre-trained large language models (LLMs) with token adaptations for TS forecasting. These approaches employ cross-domain transfer learning and surprisingly yield impressive results. However, these models are typically very slow and large (~billion parameters) and do not consider cross-channel correlations. To address this, we present Tiny Time Mixers (TTM), a significantly smaller model based on the lightweight TSMixer architecture. TTM marks the first success in developing fast and tiny general pre-trained models (<1M parameters), exclusively trained on public TS datasets, with effective transfer learning capabilities for forecasting. To tackle the complexity of pre-training on multiple datasets with varied temporal resolutions, we introduce several novel enhancements such as adaptive patching, dataset augmentation via downsampling, and resolution prefix tuning. Moreover, we employ a multi-level modeling strategy to effectively model channel correlations and infuse exogenous signals during fine-tuning, a crucial capability lacking in existing methods. TTM shows significant accuracy gains (12-38\%) over popular benchmarks in few/zero-shot forecasting. It also drastically reduces the compute needs as compared to LLM-TS methods, with a 14X cut in learnable parameters, 106X fewer total parameters, and substantial reductions in fine-tuning (65X) and inference time (54X). In fact, TTM's zero-shot results often surpass the few-shot results on many popular benchmarks, highlighting the efficacy of our approach. Code and pre-trained models will be open-sourced.
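
Of the listed enhancements, dataset augmentation via downsampling is the simplest to illustrate: average every k consecutive steps to synthesize a coarser-resolution copy of a series. A sketch of the idea (TTM's exact recipe may differ):

    import numpy as np

    def downsample(series, k):
        # Pool non-overlapping windows of length k into one value each.
        n = len(series) // k * k
        return series[:n].reshape(-1, k).mean(axis=1)

    hourly = np.arange(48.0)
    coarse = downsample(hourly, 4)  # 48 hourly points -> 12 coarser points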

AUTOACT: Automatic Agent Learning from Scratch via Self-Planning. (arXiv:2401.05268v2 [cs.CL] UPDATED)

Authors: Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, Huajun Chen

Language agents have achieved considerable performance on various complex tasks. Despite the incessant exploration in this field, existing language agent systems still struggle with costly, non-reproducible data reliance and face the challenge of compelling a single model to serve multiple functions. To this end, we introduce AutoAct, an automatic agent learning framework that does not rely on large-scale annotated data or synthetic trajectories from closed-source models (e.g., GPT-4). Given limited data and a tool library, AutoAct first automatically synthesizes planning trajectories without any assistance from humans or strong closed-source models. Then, AutoAct leverages a division-of-labor strategy to automatically differentiate based on the target task information and synthesized trajectories, producing a sub-agent group to complete the task. We conduct comprehensive experiments with different LLMs, which demonstrate that AutoAct yields better or comparable performance to various strong baselines. We even notice that AutoAct, when using the Llama-2-13b model, can achieve performance comparable to that of the zero-shot GPT-3.5-Turbo agent. Code will be available at https://github.com/zjunlp/AutoAct.

Bounds on the price of feedback for mistake-bounded online learning. (arXiv:2401.05794v2 [cs.LG] UPDATED)

Authors: Jesse Geneson, Linus Tang

We improve several worst-case bounds for various online learning scenarios from (Auer and Long, Machine Learning, 1999). In particular, we sharpen an upper bound for delayed ambiguous reinforcement learning by a factor of 2 and an upper bound for learning compositions of families of functions by a factor of 2.41. We also improve a lower bound from the same paper for learning compositions of $k$ families of functions by a factor of $\Theta(\ln{k})$, matching the upper bound up to a constant factor. In addition, we solve a problem from (Long, Theoretical Computer Science, 2020) on the price of bandit feedback with respect to standard feedback for multiclass learning, and we improve an upper bound from (Feng et al., Theoretical Computer Science, 2023) on the price of $r$-input delayed ambiguous reinforcement learning by a factor of $r$, matching a lower bound from the same paper up to the leading term.

Use of Prior Knowledge to Discover Causal Additive Models with Unobserved Variables and its Application to Time Series Data. (arXiv:2401.07231v2 [cs.LG] UPDATED)

Authors: Takashi Nicholas Maeda, Shohei Shimizu

This paper proposes two methods for causal additive models with unobserved variables (CAM-UV). CAM-UV assumes that the causal functions take the form of generalized additive models and that latent confounders are present. First, we propose a method that leverages prior knowledge for efficient causal discovery. Then, we propose an extension of this method for inferring causality in time series data. The original CAM-UV algorithm differs from other existing causal function models in that it does not seek the causal order between observed variables, but rather aims to identify the causes of each observed variable. Therefore, the first method proposed in this paper utilizes prior knowledge, such as the understanding that certain variables cannot be causes of specific others. Moreover, by incorporating the prior knowledge that causes precede their effects in time, we extend the first algorithm to the second method for causal discovery in time series data. We validate the first proposed method by using simulated data to demonstrate that the accuracy of causal discovery increases as more prior knowledge is accumulated. Additionally, we test the second proposed method by comparing it with existing time series causal discovery methods, using both simulated data and real-world data.

E3x: $\mathrm{E}(3)$-Equivariant Deep Learning Made Easy. (arXiv:2401.07595v2 [cs.LG] UPDATED)

Authors: Oliver T. Unke, Hartmut Maennel

This work introduces E3x, a software package for building neural networks that are equivariant with respect to the Euclidean group $\mathrm{E}(3)$, consisting of translations, rotations, and reflections of three-dimensional space. Compared to ordinary neural networks, $\mathrm{E}(3)$-equivariant models promise benefits whenever input and/or output data are quantities associated with three-dimensional objects. This is because the numeric values of such quantities (e.g. positions) typically depend on the chosen coordinate system. Under transformations of the reference frame, the values change predictably, but the underlying rules can be difficult to learn for ordinary machine learning models. With built-in $\mathrm{E}(3)$-equivariance, neural networks are guaranteed to satisfy the relevant transformation rules exactly, resulting in superior data efficiency and accuracy. The code for E3x is available at https://github.com/google-research/e3x; detailed documentation and usage examples can be found at https://e3x.readthedocs.io.
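
What equivariance buys can be checked numerically: for a vector-valued function on point clouds, rotating the input must rotate the output identically. A generic numpy check of that property (this illustrates what E3x guarantees by construction; it does not use the e3x API):

    import numpy as np

    def center_offsets(x):
        # Toy equivariant map on a point cloud: offsets from the centroid.
        return x - x.mean(axis=0)

    def is_rotation_equivariant(f, x, seed=0, tol=1e-8):
        rng = np.random.default_rng(seed)
        Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
        return np.allclose(f(x @ Q.T), f(x) @ Q.T, atol=tol)

    x = np.random.default_rng(1).normal(size=(5, 3))
    print(is_rotation_equivariant(center_offsets, x))  # True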

CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures. (arXiv:2401.07671v2 [cs.AR] UPDATED)

Authors: Rebecca Pelke, Jose Cubero-Cascante, Nils Bosbach, Felix Staudigl, Rainer Leupers, Jan Moritz Joseph

The demand for efficient machine learning (ML) accelerators is growing rapidly, driving the development of novel computing concepts such as resistive random access memory (RRAM)-based tiled computing-in-memory (CIM) architectures. CIM allows computation within the memory unit, resulting in faster data processing and reduced power consumption. Efficient compiler algorithms are essential to exploit the potential of tiled CIM architectures. While conventional ML compilers focus on code generation for CPUs, GPUs, and other von Neumann architectures, adaptations are needed to cover CIM architectures. Cross-layer scheduling is a promising approach, as it enhances the utilization of CIM cores, thereby accelerating computations. Although similar concepts are implicitly used in previous work, there is a lack of clear and quantifiable algorithmic definitions for cross-layer scheduling on tiled CIM architectures. To close this gap, we present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures. We integrate CLSA-CIM with existing weight-mapping strategies and compare its performance against state-of-the-art (SOTA) scheduling algorithms. CLSA-CIM improves utilization by up to 17.9x, resulting in an overall speedup of up to 29.2x compared to SOTA.

Deep Evolutional Instant Interest Network for CTR Prediction in Trigger-Induced Recommendation. (arXiv:2401.07769v2 [cs.IR] UPDATED)

Authors: Zhibo Xiao, Luwei Yang, Tao Zhang, Wen Jiang, Wei Ning, Yujiu Yang

Recommendation has been playing a key role in many industries, e.g., e-commerce, streaming media, social media, etc. Recently, a new recommendation scenario, called Trigger-Induced Recommendation (TIR), where users are able to explicitly express their instant interests via trigger items, has emerged as an essential component of many e-commerce platforms, e.g., Alibaba.com and Amazon. Without explicitly modeling the user's instant interest, traditional recommendation methods usually obtain sub-optimal results in TIR. Even though a few methods consider the trigger and target items simultaneously to solve this problem, they still do not take into account the temporal information of user behaviors, the dynamic change of user instant interest as the user scrolls down, and the interactions between the trigger and target items. To tackle these problems, we propose a novel method, the Deep Evolutional Instant Interest Network (DEI2N), for click-through rate prediction in TIR scenarios. Specifically, we design a User Instant Interest Modeling Layer to predict the dynamic change in the intensity of instant interest as the user scrolls down. Temporal information is utilized in user behavior modeling. Moreover, an Interaction Layer is introduced to learn better interactions between the trigger and target items. We evaluate our method on several offline and real-world industrial datasets. Experimental results show that our proposed DEI2N outperforms state-of-the-art baselines. In addition, online A/B testing demonstrates its superiority over the existing baseline in real-world production environments.

Carrying over algorithm in transformers. (arXiv:2401.07993v2 [cs.LG] UPDATED)

Authors: Jorrit Kruthoff

Addition is perhaps one of the simplest arithmetic tasks one can think of and is usually performed using the carrying over algorithm. This algorithm consists of two tasks: adding digits in the same position and carrying over a one whenever necessary. We study how transformer models implement this algorithm and how the two aforementioned tasks are allocated to different parts of the network. We first focus on two-layer encoder-only models and show that the carrying over algorithm is implemented in a modular fashion. The first layer is mostly responsible for adding digits in the same position. The second layer first decides, in the attention, which positions need a carried one or not, and then performs the carrying of the one in the final MLP. We provide a simple way of precisely identifying which neurons are responsible for that task. This implementation of the carrying over algorithm occurs across a range of hyperparameters for two as well as three-layer models. For small decoder-only models, we observe the same implementation and provide suggestive evidence for its existence in three 7B large language models.
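
For reference, the algorithm the model is probed against, written out explicitly in Python (digits stored least-significant first):

    def add_digits(a, b):
        # Subtask 1: add same-position digits; subtask 2: carry a one
        # whenever the positional sum exceeds 9.
        out, carry = [], 0
        for x, y in zip(a, b):
            s = x + y + carry
            out.append(s % 10)
            carry = s // 10
        if carry:
            out.append(carry)
        return out

    print(add_digits([7, 5, 3], [8, 9, 1]))  # 357 + 198 = 555 -> [5, 5, 5]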

Matrix Completion with Hypergraphs: Sharp Thresholds and Efficient Algorithms. (arXiv:2401.08197v2 [cs.LG] UPDATED)

Authors: Zhongtian Ma, Qiaosheng Zhang, Zhen Wang

This paper considers the problem of completing a rating matrix based on sub-sampled matrix entries as well as observed social graphs and hypergraphs. We show that there exists a \emph{sharp threshold} on the sample probability for the task of exactly completing the rating matrix -- the task is achievable when the sample probability is above the threshold, and is impossible otherwise -- demonstrating a phase transition phenomenon. The threshold can be expressed as a function of the ``quality'' of hypergraphs, enabling us to \emph{quantify} the amount of reduction in sample probability due to the exploitation of hypergraphs. This also highlights the usefulness of hypergraphs in the matrix completion problem. En route to discovering the sharp threshold, we develop a computationally efficient matrix completion algorithm that effectively exploits the observed graphs and hypergraphs. Theoretical analyses show that our algorithm succeeds with high probability as long as the sample probability exceeds the aforementioned threshold, and this theoretical result is further validated by synthetic experiments. Moreover, our experiments on a real social network dataset (with both graphs and hypergraphs) show that our algorithm outperforms other state-of-the-art matrix completion algorithms.

An Explainable Proxy Model for Multilabel Audio Segmentation. (arXiv:2401.08268v2 [eess.AS] UPDATED)

Authors: Théo Mariotte, Antonio Almudévar, Marie Tahon, Alfonso Ortega

Audio signal segmentation is a key task for automatic audio indexing. It consists of detecting the boundaries of class-homogeneous segments in the signal. In many applications, explainable AI is a vital process for the transparency of decision-making with machine learning. In this paper, we propose an explainable multilabel segmentation model that solves speech activity (SAD), music (MD), noise (ND), and overlapped speech detection (OSD) simultaneously. This proxy model uses non-negative matrix factorization (NMF) to map the embedding used for segmentation to the frequency domain. Experiments conducted on two datasets show performance similar to that of the pre-trained black-box model while exhibiting strong explainability features. Specifically, the frequency bins used for the decision can be easily identified at both the segment level (local explanations) and the global level (class prototypes).
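
A sketch of how an NMF factorization exposes per-component frequency profiles, using sklearn on a stand-in spectrogram (the paper factorizes the segmentation embeddings; this toy factorizes a spectrogram directly):

    import numpy as np
    from sklearn.decomposition import NMF

    S = np.random.default_rng(0).random((500, 128))  # (time frames, freq bins)
    nmf = NMF(n_components=4, init="nndsvda", max_iter=500)
    activations = nmf.fit_transform(S)  # (time, comp): when each component fires
    profiles = nmf.components_          # (comp, freq): which bins drive each one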

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference. (arXiv:2401.08383v2 [cs.LG] UPDATED)

Authors: Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda

In large language models like the Generative Pre-trained Transformer, the Mixture of Experts (MoE) paradigm has emerged as a powerful technique for enhancing model expressiveness and accuracy. However, deploying GPT MoE models for parallel inference on distributed systems presents significant challenges, primarily due to the extensive Alltoall communication required for expert routing and aggregation. This communication bottleneck exacerbates the already complex computational landscape, hindering the efficient utilization of high-performance computing resources. In this paper, we propose a lightweight optimization technique called ExFlow to largely accelerate the inference of these MoE models. We take a new perspective on alleviating the communication overhead by exploiting inter-layer expert affinity. Unlike previous methods, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation. By proposing a context-coherent expert parallelism on distributed systems, our design uses only one Alltoall communication to deliver the same functionality, whereas previous methods all require two Alltoalls. By carefully examining the conditional probability in tokens' routing across multiple layers, we prove that pre-trained GPT MoE models implicitly exhibit a strong inter-layer expert affinity. We then design an efficient integer programming model to capture such features and show that by properly placing the experts on the corresponding GPUs, we can reduce cross-GPU routing latency by up to 67%. Our solution beats cutting-edge MoE implementations with expert counts from 8 to 64, with up to 2.2x improvement in inference throughput. We further provide a detailed study of how the model implicitly acquires this expert affinity at the very early training stage and how this affinity evolves and stabilizes during training.
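
The quantity the placement optimizes is simple to state: how often a token's experts at consecutive layers live on different GPUs. A numpy sketch of that cost function (not the paper's integer-programming code):

    import numpy as np

    def cross_gpu_hops(routes, placement):
        # routes[t, l]: expert chosen by token t at MoE layer l;
        # placement[e]: GPU hosting expert e.  Co-locating experts with high
        # inter-layer affinity on one GPU drives this count down.
        gpus = placement[routes]
        return int((gpus[:, 1:] != gpus[:, :-1]).sum())

    rng = np.random.default_rng(0)
    routes = rng.integers(0, 16, size=(1024, 12))  # 16 experts, 12 MoE layers
    placement = rng.integers(0, 4, size=16)        # 16 experts over 4 GPUs
    print(cross_gpu_hops(routes, placement))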