Timeline-based Process Discovery. (arXiv:2401.04114v1 [cs.HC])

Authors: Harleen Kaur, Jan Mendling, Christoffer Rubensson, Timotheus Kampik

A key concern of automatic process discovery is to provide insights into performance aspects of business processes. Waiting times are of particular importance in this context. For that reason, it is surprising that current techniques for automatic process discovery generate directly-follows graphs and comparable process models, but often miss the opportunity to explicitly represent the time axis. In this paper, we present an approach for automatically constructing process models that explicitly align with a time axis. We exemplify our approach for directly-follows graphs. Our evaluation using two BPIC datasets and a proprietary dataset highlights the benefits of this representation in comparison to standard layout techniques.
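
As a loose illustration of the idea (not the authors' implementation), the sketch below computes, from a toy event log, the quantities such a layout needs: each activity's mean elapsed time since case start, usable as its x-coordinate on the time axis, and mean waiting times on directly-follows edges. The column names and the toy log are assumptions.

```python
# Minimal sketch: derive time-axis positions for a directly-follows graph.
# Assumed event-log columns: case_id, activity, timestamp.
import pandas as pd

log = pd.DataFrame({
    "case_id":  [1, 1, 1, 2, 2, 2],
    "activity": ["register", "review", "pay", "register", "review", "pay"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 10:30", "2024-01-02 09:00",
        "2024-01-03 08:00", "2024-01-03 12:00", "2024-01-04 08:00",
    ]),
})

log = log.sort_values(["case_id", "timestamp"])
# Elapsed time (hours) since the start of each case.
log["elapsed_h"] = (
    log["timestamp"] - log.groupby("case_id")["timestamp"].transform("min")
).dt.total_seconds() / 3600

# x-position of each activity node on the time axis.
x_pos = log.groupby("activity")["elapsed_h"].mean()

# Directly-follows edges annotated with mean waiting time.
log["next_activity"] = log.groupby("case_id")["activity"].shift(-1)
log["wait_h"] = log.groupby("case_id")["elapsed_h"].shift(-1) - log["elapsed_h"]
edges = (
    log.dropna(subset=["next_activity"])
       .groupby(["activity", "next_activity"])["wait_h"].mean()
)
print(x_pos, edges, sep="\n\n")
```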

Why is the User Interface a Dark Pattern? : Explainable Auto-Detection and its Analysis. (arXiv:2401.04119v1 [cs.HC])

Authors: Yuki Yada, Tsuneo Matsumoto, Fuyuko Kido, Hayato Yamana

Dark patterns are deceptive user interface designs for online services that steer users into unintended behaviors. They can harm users through privacy invasion, financial loss, and emotional distress, and have been the subject of considerable debate in recent years. In this paper, we study interpretable dark pattern auto-detection, that is, explaining why a particular user interface is detected as containing dark patterns. First, we trained a model for the automatic detection of dark patterns in e-commerce, using a transformer-based pre-trained language model, BERT, on a text-based dataset. Then, we applied post-hoc explanation techniques, including local interpretable model-agnostic explanations (LIME) and Shapley additive explanations (SHAP), to the trained model, revealing which terms influence each prediction of a dark pattern. In addition, we extracted and analyzed the terms that affected the dark pattern predictions. Our findings may help prevent users from being manipulated by dark patterns and aid the construction of more equitable internet services. Our code is available at https://github.com/yamanalab/why-darkpattern.
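
As a hedged sketch of the explanation step, the snippet below applies LIME to a BERT-based text classifier. The checkpoint name and label set are hypothetical placeholders rather than the authors' released model; see the repository above for the original code.

```python
# Sketch: LIME explanations for a BERT dark-pattern classifier.
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "bert-base-uncased"  # placeholder; assume fine-tuned on dark-pattern labels
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

def predict_proba(texts):
    # LIME expects a list of raw strings -> (n, n_classes) probabilities.
    batch = tokenizer(list(texts), padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["not_dark", "dark_pattern"])
exp = explainer.explain_instance(
    "Hurry! Only 2 left in stock - 9 people are viewing this now.",
    predict_proba, num_features=8)
print(exp.as_list())  # terms weighted by their influence on the prediction
```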

DeepPhysiNet: Bridging Deep Learning and Atmospheric Physics for Accurate and Continuous Weather Modeling. (arXiv:2401.04125v1 [physics.ao-ph])

Authors: Wenyuan Li, Zili Liu, Keyan Chen, Hao Chen, Shunlin Liang, Zhengxia Zou, Zhenwei Shi

Accurate weather forecasting holds significant importance to human activities. Currently, there are two paradigms for weather forecasting: Numerical Weather Prediction (NWP) and Deep Learning-based Prediction (DLP). NWP utilizes atmospheric physics for weather modeling but suffers from poor data utilization and high computational costs, while DLP can learn weather patterns from vast amounts of data directly but struggles to incorporate physical laws. The two paradigms have complementary strengths and weaknesses but are incompatible, because the physical laws adopted in NWP describe the relationship between coordinates and meteorological variables, while DLP directly learns the relationships between meteorological variables without consideration of coordinates. To address these problems, we introduce the DeepPhysiNet framework, incorporating physical laws into deep learning models for accurate and continuous weather system modeling. First, we construct physics networks based on multilayer perceptrons (MLPs) for individual meteorological variables, such as temperature, pressure, and wind speed. Physics networks establish relationships between variables and coordinates by taking coordinates as input and producing variable values as output. The physical laws, in the form of partial differential equations (PDEs), can be incorporated as part of the loss function. Next, we construct hyper-networks based on deep learning methods to directly learn weather patterns from large amounts of meteorological data. The output of the hyper-networks constitutes a part of the weights for the physics networks. Experimental results demonstrate that, upon successful integration of physical laws, DeepPhysiNet can accomplish multiple tasks simultaneously, not only enhancing forecast accuracy but also obtaining results at continuous spatiotemporal resolution, which is unattainable by either NWP or DLP.
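
The physics-network idea can be illustrated with a minimal PINN-style sketch: an MLP maps coordinates to a variable value, and a PDE residual computed via automatic differentiation enters the loss. For brevity, 1-D advection stands in for the atmospheric equations used in the paper; the architecture, loss weighting, and equation are illustrative assumptions.

```python
# Sketch: coordinate-to-variable MLP with a PDE residual in the loss.
# Stand-in PDE: du/dt + c * du/dx = 0 (1-D advection).
import torch
import torch.nn as nn

physics_net = nn.Sequential(
    nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))

c = 1.0  # advection speed (assumed constant here)

def pde_residual(coords):
    coords = coords.requires_grad_(True)       # coords[:, 0] = t, coords[:, 1] = x
    u = physics_net(coords)
    grads = torch.autograd.grad(u.sum(), coords, create_graph=True)[0]
    u_t, u_x = grads[:, 0:1], grads[:, 1:2]
    return u_t + c * u_x                       # ~0 where the PDE is satisfied

# Observed data (t, x) -> u, plus unlabeled collocation points for the PDE term.
obs_coords = torch.rand(128, 2)
obs_values = torch.sin(obs_coords[:, 1:2] - c * obs_coords[:, 0:1])
colloc = torch.rand(512, 2)

opt = torch.optim.Adam(physics_net.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    data_loss = ((physics_net(obs_coords) - obs_values) ** 2).mean()
    pde_loss = (pde_residual(colloc) ** 2).mean()
    (data_loss + 0.1 * pde_loss).backward()
    opt.step()
```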

Unsupervised Test-Time Adaptation via Plug-and-Play Transformer Modules. (arXiv:2401.04130v1 [cs.LG])

Authors: Xiangyu Chang, Sk Miraj Ahmed, Basak Guler, Srikanth V. Krishnamurthy, Ananthram Swami, Samet Oymak, Amit K. Roy-Chowdhury

Parameter-efficient tuning (PET) methods such as LoRA, Adapter, and Visual Prompt Tuning (VPT) have found success in enabling adaptation to new domains by tuning small modules within a transformer model. However, the number of domains encountered during test time can be very large, and the data is usually unlabeled. Thus, adaptation to new domains is challenging; it is also impractical to generate customized tuned modules for each such domain. Toward addressing these challenges, this work introduces PLUTO: a Plug-and-pLay modUlar Test-time domain adaptatiOn strategy. We pre-train a large set of modules, each specialized for a different source domain, effectively creating a "module store". Given a target domain with few-shot unlabeled data, we introduce an unsupervised test-time adaptation (TTA) method to (1) select a sparse subset of relevant modules from this store and (2) create a weighted combination of the selected modules without tuning their weights. This plug-and-play nature enables us to harness the most relevant source domains in a single inference call. Comprehensive evaluations demonstrate that PLUTO uniformly outperforms alternative TTA methods and that selecting $\leq$5 modules suffices to extract most of the benefit. At a high level, our method equips pre-trained transformers with the capability to dynamically adapt to new domains, motivating a new paradigm for efficient and scalable domain adaptation.
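
A hedged sketch of the plug-and-play combination is shown below: given a store of per-domain adapter modules, a sparse top-k subset is selected and mixed with softmax weights. How PLUTO actually scores module relevance is simplified here to prototype similarity, an assumption on our part for illustration.

```python
# Sketch: sparse selection and weighted combination of stored adapter modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_modules, k = 256, 20, 5
store = nn.ModuleList(nn.Linear(d, d) for _ in range(n_modules))  # "module store"
prototypes = torch.randn(n_modules, d)  # assumed per-source-domain summaries

def adapted_forward(h):
    # h: (batch, d) features of few-shot unlabeled target data.
    target_summary = h.mean(dim=0)
    scores = F.cosine_similarity(prototypes, target_summary[None, :], dim=-1)
    top_scores, top_idx = scores.topk(k)          # sparse module selection
    weights = top_scores.softmax(dim=0)           # no tuning of module weights
    out = sum(w * store[i](h) for w, i in zip(weights, top_idx.tolist()))
    return h + out                                # residual adapter-style update

print(adapted_forward(torch.randn(8, d)).shape)
```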

SynHIN: Generating Synthetic Heterogeneous Information Network for Explainable AI. (arXiv:2401.04133v1 [cs.LG])

Authors: Ming-Yi Hong, Yi-Hsiang Huang, You-Chen Teng, Chih-Yu Wang, Che Lin

Graph Neural Networks (GNNs) excel in various domains, from detecting e-commerce spam to social network classification problems. However, the lack of public graph datasets hampers research progress, particularly in heterogeneous information networks (HIN). The demand for datasets enabling fair HIN comparisons is growing due to advancements in GNN interpretation models. In response, we propose SynHIN, a unique method for generating synthetic heterogeneous information networks. SynHIN identifies motifs in real-world datasets, summarizes graph statistics, and constructs a synthetic network. Our approach utilizes In-Cluster and Out-Cluster Merge modules to build the synthetic HIN from primary motif clusters. After the In/Out-Cluster merges and a post-pruning process fitting the real dataset constraints, we ensure that the synthetic graph statistics align closely with those of the reference dataset. SynHIN generates a synthetic heterogeneous graph dataset for node classification tasks, using the primary motif as the explanation ground truth. It can be adapted to address the lack of heterogeneous graph datasets and motif ground truths, proving beneficial for assessing heterogeneous graph neural network (HGNN) explainers. We further present a benchmark dataset for future heterogeneous graph explainer model research. Our work marks a significant step towards explainable AI for HGNNs.

Global-Aware Enhanced Spatial-Temporal Graph Recurrent Networks: A New Framework For Traffic Flow Prediction. (arXiv:2401.04135v1 [cs.LG])

Authors: Haiyang Liu, Chunjiang Zhu, Detian Zhang

Traffic flow prediction plays a crucial role in alleviating traffic congestion and enhancing transport efficiency. While combining graph convolution networks with recurrent neural networks for spatial-temporal modeling is a common strategy in this realm, the restricted structure of recurrent neural networks limits their ability to capture global information. For spatial modeling, many prior studies learn a graph structure that is assumed to be fixed and uniform at all time steps, which may not be true. This paper introduces a novel traffic prediction framework, Global-Aware Enhanced Spatial-Temporal Graph Recurrent Network (GA-STGRN), comprising two core components: a spatial-temporal graph recurrent neural network and a global awareness layer. Within this framework, three innovative prediction models are formulated. A sequence-aware graph neural network is proposed and integrated into the Gated Recurrent Unit (GRU) to learn non-fixed graphs at different time steps and capture local temporal relationships. To enhance the model's global perception, three distinct global spatial-temporal transformer-like architectures (GST^2) are devised for the global awareness layer. We conduct extensive experiments on four real traffic datasets and the results demonstrate the superiority of our framework and the three concrete models.

CCNETS: A Novel Brain-Inspired Approach for Enhanced Pattern Recognition in Imbalanced Datasets. (arXiv:2401.04139v1 [cs.LG])

Authors: Hanbeot Park, Yunjeong Cho, Hoon-Hee Kim

This study introduces CCNETS (Causal Learning with Causal Cooperative Nets), a novel generative model-based classifier designed to tackle the challenge of generating data for imbalanced datasets in pattern recognition. CCNETS is uniquely crafted to emulate brain-like information processing and comprises three main components: Explainer, Producer, and Reasoner. Each component is designed to mimic specific brain functions, which aids in generating high-quality datasets and enhancing classification performance.

The model is particularly focused on addressing the common and significant challenge of handling imbalanced datasets in machine learning. CCNETS's effectiveness is demonstrated through its application to a "fraud dataset," where normal transactions significantly outnumber fraudulent ones (99.83% vs. 0.17%). Traditional methods often struggle with such imbalances, leading to skewed performance metrics. However, CCNETS exhibits superior classification ability, as evidenced by its performance metrics. Specifically, it achieved an F1-score of 0.7992, outperforming traditional models like Autoencoders and Multi-layer Perceptrons (MLP) in the same context. This performance indicates CCNETS's proficiency in more accurately distinguishing between normal and fraudulent patterns.

The innovative structure of CCNETS enhances the coherence between generative and classification models, helping to overcome the limitations of pattern recognition approaches that rely solely on generative models. This study emphasizes CCNETS's potential in diverse applications where quality data generation and pattern recognition are key, showing that brain-inspired approaches can address the challenges that imbalanced datasets pose for machine learning.

On The Potential of The Fractal Geometry and The CNNs Ability to Encode it. (arXiv:2401.04141v1 [cs.LG])

Authors: Julia El Zini, Bassel Musharrafieh, Mariette Awad

The fractal dimension provides a statistical index of object complexity by studying how the pattern changes with the measuring scale. Although useful in several classification tasks, the fractal dimension is under-explored in deep learning applications. In this work, we investigate the features that are learned by deep models and study whether these deep networks are able to encode features as complex and high-level as the fractal dimension. Specifically, we conduct a correlation analysis experiment showing that deep networks are not able to extract such a feature in any of their layers. We combine our analytical study with a human evaluation to investigate the differences between deep learning networks and models that operate solely on the fractal feature. Moreover, we show the effectiveness of fractal features in applications where the object structure is crucial for the classification task. We empirically show that training a shallow network on fractal features achieves performance comparable, and in specific cases even superior, to that of deep networks trained on raw data, while requiring fewer computational resources. Fractal features improved classification accuracy by 30% on average while requiring up to 84% less time to train. We couple our empirical study with a complexity analysis of the computational cost of extracting the proposed fractal features, and we study their limitations.
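
For concreteness, a standard box-counting estimate of the fractal dimension of a binary image looks as follows; this is illustrative, not the paper's feature-extraction pipeline.

```python
# Box-counting dimension: count occupied boxes at several scales and fit
# a line to log(count) vs log(1/scale).
import numpy as np

def box_counting_dimension(img, sizes=(2, 4, 8, 16, 32)):
    counts = []
    for s in sizes:
        h, w = (img.shape[0] // s) * s, (img.shape[1] // s) * s
        blocks = img[:h, :w].reshape(h // s, s, w // s, s)
        occupied = blocks.any(axis=(1, 3)).sum()  # boxes containing the object
        counts.append(occupied)
    # Slope of log(counts) against log(1/size) estimates the dimension.
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
    return slope

rng = np.random.default_rng(0)
img = rng.random((256, 256)) > 0.7  # toy binary pattern
print(box_counting_dimension(img))
```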

Robust Calibration For Improved Weather Prediction Under Distributional Shift. (arXiv:2401.04144v1 [cs.LG])

Authors: Sankalp Gilda, Neel Bhandari, Wendy Mak, Andrea Panizza

In this paper, we present results on improving out-of-domain weather prediction and uncertainty estimation as part of the Shifts Challenge on Robustness and Uncertainty under Real-World Distributional Shift. We find that by leveraging a mixture of experts together with an advanced data augmentation technique borrowed from the computer vision domain and robust post-hoc calibration of predictive uncertainties, we can potentially achieve more accurate and better-calibrated results with deep neural networks than with boosted tree models for tabular data. We quantify our predictions using several metrics and propose several future lines of inquiry and experimentation to boost performance.
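
As a generic illustration of post-hoc calibration for regression uncertainties (the challenge entry's exact procedure may differ), one can fit a single variance-scaling factor on held-out data by minimizing the Gaussian negative log-likelihood:

```python
# Sketch: variance scaling for predictive uncertainties.
import torch

def fit_variance_scale(mu, sigma, y, steps=300, lr=0.05):
    log_s = torch.zeros(1, requires_grad=True)   # optimise log(s) for positivity
    opt = torch.optim.Adam([log_s], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        scaled = sigma * log_s.exp()
        # Gaussian NLL up to constants: 0.5*((y-mu)/s)^2 + log(s).
        nll = (0.5 * ((y - mu) / scaled) ** 2 + scaled.log()).mean()
        nll.backward()
        opt.step()
    return log_s.exp().item()

# Toy miscalibrated predictions: true noise is 2x the predicted sigma.
mu = torch.zeros(5000)
sigma = torch.ones(5000)
y = mu + 2.0 * torch.randn(5000)
print(fit_variance_scale(mu, sigma, y))  # ~2.0
```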

Learn Once Plan Arbitrarily (LOPA): Attention-Enhanced Deep Reinforcement Learning Method for Global Path Planning. (arXiv:2401.04145v1 [cs.LG])

Authors: Guoming Huang, Mingxin Hou, Xiaofang Yuan, Shuqiao Huang, Yaonan Wang

Deep reinforcement learning (DRL) methods have recently shown promise in path planning tasks. However, when dealing with global planning tasks, these methods face serious challenges such as poor convergence and generalization. To this end, we propose an attention-enhanced DRL method called LOPA (Learn Once Plan Arbitrarily). Firstly, we analyze the reasons for these problems from the perspective of DRL's observation, revealing that the traditional design causes DRL to be interfered with by irrelevant map information. Secondly, we develop LOPA, which utilizes a novel attention-enhanced mechanism to attain improved attention towards the key information in the observation. This mechanism is realized in two steps: (1) an attention model is built to transform the DRL's observation into two dynamic views, local and global, significantly guiding LOPA to focus on the key information on the given maps; (2) a dual-channel network is constructed to process these two views and integrate them to attain improved reasoning capability. LOPA is validated via multi-objective global path planning experiments. The results suggest that LOPA achieves improved convergence and generalization performance as well as great path planning efficiency.

Online Test-Time Adaptation of Spatial-Temporal Traffic Flow Forecasting. (arXiv:2401.04148v1 [cs.LG])

Authors: Pengxin Guo, Pengrong Jin, Ziyue Li, Lei Bai, Yu Zhang

Accurate spatial-temporal traffic flow forecasting is crucial in aiding traffic managers in implementing control measures and assisting drivers in selecting optimal travel routes. Traditional deep-learning based methods for traffic flow forecasting typically rely on historical data to train their models, which are then used to make predictions on future data. However, the performance of the trained model usually degrades due to the temporal drift between the historical and future data. To make the model trained on historical data better adapt to future data in a fully online manner, this paper conducts the first study of online test-time adaptation techniques for spatial-temporal traffic flow forecasting problems. To this end, we propose an Adaptive Double Correction by Series Decomposition (ADCSD) method, which first decomposes the output of the trained model into seasonal and trend-cyclical parts and then corrects them by two separate modules during the testing phase using the latest observed data, entry by entry. In the proposed ADCSD method, instead of fine-tuning the whole trained model during the testing phase, a lite network is attached after the trained model, and only the lite network is fine-tuned in the testing process each time a data entry is observed. Moreover, to account for the fact that different time series variables may have different levels of temporal drift, two adaptive vectors are adopted to provide different weights for different time series variables. Extensive experiments on four real-world traffic flow forecasting datasets demonstrate the effectiveness of the proposed ADCSD method. The code is available at https://github.com/Pengxin-Guo/ADCSD.
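
A simplified sketch of the correction idea follows: the frozen backbone's forecast is decomposed into a moving-average trend and a seasonal remainder, each corrected by a small module with adaptive per-variable weights that are fine-tuned online. Shapes and hyperparameters are illustrative; see the repository above for the actual implementation.

```python
# Sketch: lite test-time correction with series decomposition.
import torch
import torch.nn as nn

class LiteCorrection(nn.Module):
    def __init__(self, num_vars, kernel=5):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2)
        self.f_trend = nn.Linear(num_vars, num_vars)
        self.f_season = nn.Linear(num_vars, num_vars)
        # Adaptive per-variable weights: each variable may drift differently.
        self.w_trend = nn.Parameter(torch.zeros(num_vars))
        self.w_season = nn.Parameter(torch.zeros(num_vars))

    def forward(self, y_hat):                    # y_hat: (batch, time, num_vars)
        trend = self.pool(y_hat.transpose(1, 2)).transpose(1, 2)  # moving average
        season = y_hat - trend                                    # remainder
        return (y_hat
                + self.w_trend * self.f_trend(trend)
                + self.w_season * self.f_season(season))

corr = LiteCorrection(num_vars=207)              # e.g., 207 traffic sensors
opt = torch.optim.Adam(corr.parameters(), lr=1e-3)
# Online loop: only the lite module is updated, entry by entry.
for _ in range(3):
    y_hat = torch.randn(1, 12, 207)              # frozen backbone's forecast
    y_obs = torch.randn(1, 12, 207)              # latest observed ground truth
    loss = ((corr(y_hat) - y_obs) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```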

Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning. (arXiv:2401.04151v1 [cs.LG])

Authors: Wenhan Xia, Chengwei Qin, Elad Hazan

Fine-tuning is the primary methodology for tailoring pre-trained large language models to specific tasks. As models scale and the diversity of tasks expands, parameter-efficient fine-tuning methods are of paramount importance. One of the most widely used families of methods is low-rank adaptation (LoRA) and its variants. LoRA encodes the weight update as the product of two low-rank matrices. Despite its advantages, LoRA falls short of full-parameter fine-tuning in terms of generalization error for certain tasks.

We introduce Chain of LoRA (COLA), an iterative optimization framework inspired by the Frank-Wolfe algorithm, to bridge the gap between LoRA and full-parameter fine-tuning without incurring additional computational costs or memory overheads. COLA employs a residual learning procedure: it merges learned LoRA modules into the pre-trained language model parameters and re-initializes optimization for newly added LoRA modules. We provide theoretical convergence guarantees as well as empirical results to validate the effectiveness of our algorithm. Across various models (OPT and Llama-2) and seven benchmarking tasks, we demonstrate that COLA can consistently outperform LoRA without additional computational or memory costs.
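
A hedged sketch of one COLA round for a single weight matrix is given below: the learned low-rank update is merged into the frozen weight, and a fresh LoRA pair is re-initialized for the next residual round. The scaling and initialization follow common LoRA practice and may differ from the paper's exact procedure.

```python
# Sketch: merge-then-reinitialize step of a Chain-of-LoRA-style loop.
import torch
import torch.nn as nn

d, r, alpha = 768, 8, 16
W = torch.randn(d, d)                       # frozen pre-trained weight
A = torch.randn(r, d) * 0.01                # LoRA down-projection
B = torch.zeros(d, r)                       # LoRA up-projection (starts at 0)

def cola_step(W, A, B):
    # 1) Merge the current LoRA residual into the base parameters.
    W = W + (alpha / r) * (B @ A)
    # 2) Re-initialize a fresh LoRA pair so the next round starts from a
    #    zero residual (B = 0 means no change to the merged W yet).
    A_new = torch.empty(r, d)
    nn.init.kaiming_uniform_(A_new, a=5 ** 0.5)
    B_new = torch.zeros(d, r)
    return W, A_new, B_new

for round_ in range(3):
    # ... train (A, B) with the task loss here, keeping W frozen ...
    W, A, B = cola_step(W, A, B)
```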

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification. (arXiv:2401.04154v1 [cs.CV])

Authors: Wentao Zhu

Audio and video are the two most common modalities on mainstream media platforms such as YouTube. To learn from multimodal videos effectively, in this work we propose a novel audio-video recognition approach termed the Audio-Video Transformer (AVT), leveraging the effective spatio-temporal representation of the video Transformer to improve action recognition accuracy. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds by 8%. AVT also surpasses one of the previous state-of-the-art video Transformers [25] by 10% on VGGSound by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal methods, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and improves the accuracy by 3.8% on Epic-Kitchens-100.

FlopPITy: Enabling self-consistent exoplanet atmospheric retrievals with machine learning. (arXiv:2401.04168v1 [astro-ph.EP])

Authors: Francisco Ardévol Martínez, Michiel Min, Daniela Huppenkothen, Inga Kamp, Paul I. Palmer

Interpreting the observations of exoplanet atmospheres to constrain physical and chemical properties is typically done using Bayesian retrieval techniques. Because these methods require many model computations, a compromise is made between model complexity and run time. Reaching this compromise leads to the simplification of many physical and chemical processes (e.g. parameterised temperature structure). Here we implement and test sequential neural posterior estimation (SNPE), a machine learning inference algorithm, for exoplanet atmospheric retrievals. The goal is to speed up retrievals so they can be run with more computationally expensive atmospheric models, such as those computing the temperature structure using radiative transfer. We generate 100 synthetic observations using ARCiS (ARtful Modeling Code for exoplanet Science, an atmospheric modelling code with the flexibility to compute models in varying degrees of complexity) and perform retrievals on them to test the faithfulness of the SNPE posteriors. The faithfulness quantifies whether the posteriors contain the ground truth as often as we expect. We also generate a synthetic observation of a cool brown dwarf using the self-consistent capabilities of ARCiS and run a retrieval with self-consistent models to showcase the possibilities that SNPE opens. We find that SNPE provides faithful posteriors and is therefore a reliable tool for exoplanet atmospheric retrievals. We are able to run a self-consistent retrieval of a synthetic brown dwarf spectrum using only 50,000 forward model evaluations. We find that SNPE can speed up retrievals between $\sim2\times$ and $\geq10\times$ depending on the computational load of the forward model, the dimensionality of the observation, and the signal-to-noise ratio of the observation. We make the code publicly available for the community on GitHub.
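
As a minimal sketch of the inference machinery (standing in for the FlopPITy pipeline, whose forward model is ARCiS), the `sbi` package implements SNPE; the toy simulator below is our placeholder, and API details may vary across sbi versions.

```python
# Sketch: single-round neural posterior estimation with the sbi package.
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

prior = BoxUniform(low=torch.zeros(2), high=torch.ones(2))

def simulator(theta):
    # Toy "spectrum": two parameters -> 50 noisy wavelength bins.
    grid = torch.linspace(0, 1, 50)
    return (theta[:, :1] * torch.sin(10 * grid) + theta[:, 1:]
            + 0.01 * torch.randn(theta.shape[0], 50))

inference = SNPE(prior=prior)
theta = prior.sample((5000,))
x = simulator(theta)
density_estimator = inference.append_simulations(theta, x).train()
posterior = inference.build_posterior(density_estimator)
# Sequential SNPE would repeat this loop, simulating from the current posterior.

x_obs = simulator(prior.sample((1,)))       # stand-in for a real observation
samples = posterior.sample((1000,), x=x_obs)
print(samples.mean(dim=0))
```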

Dense Hopfield Networks in the Teacher-Student Setting. (arXiv:2401.04191v1 [cond-mat.dis-nn])

Authors: Robin Thériault, Daniele Tantari

Dense Hopfield networks are known for their feature-to-prototype transition and adversarial robustness. However, previous theoretical studies have mostly been concerned with their storage capacity. We bridge this gap by studying the phase diagram of p-body Hopfield networks in the teacher-student setting of an unsupervised learning problem, uncovering ferromagnetic phases reminiscent of the prototype and feature learning regimes. On the Nishimori line, we find the critical size of the training set necessary for efficient pattern retrieval. Interestingly, we find that the paramagnetic to ferromagnetic transition of the teacher-student setting coincides with the paramagnetic to spin-glass transition of the direct model, i.e. with random patterns. Outside of the Nishimori line, we investigate the learning performance in relation to the inference temperature and dataset noise. Moreover, we show that using a larger p for the student than the teacher gives the student an extensive tolerance to noise. We then derive a closed-form expression measuring the adversarial robustness of such a student at zero temperature, corroborating the positive correlation between number of parameters and robustness observed in large neural networks. We also use our model to clarify why the prototype phase of modern Hopfield networks is adversarially robust.
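
The p-body retrieval dynamics at the heart of such models can be illustrated with a toy numpy example: the local field weighs each stored pattern by its overlap with the state raised to the power p-1, which sharpens retrieval as p grows (the feature-to-prototype transition). This is illustrative, not the paper's analysis code.

```python
# Sketch: retrieval in a p-body (dense) Hopfield network.
import numpy as np

rng = np.random.default_rng(0)
N, K, p = 200, 10, 4                       # neurons, patterns, interaction order
xi = rng.choice([-1, 1], size=(K, N))      # stored binary patterns

def update(x):
    overlaps = xi @ x / N                  # pattern overlaps m_mu
    field = (overlaps ** (p - 1)) @ xi     # p-body local field
    return np.sign(field)

x = xi[0].copy()
flip = rng.choice(N, size=40, replace=False)
x[flip] *= -1                              # corrupt 20% of the pattern
for _ in range(5):
    x = update(x)
print("overlap with stored pattern:", (x @ xi[0]) / N)  # ~1.0 if retrieved
```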

Curiosity & Entropy Driven Unsupervised RL in Multiple Environments. (arXiv:2401.04198v1 [cs.LG])

Authors: Shaurya Dewan, Anisha Jain, Zoe LaLena, Lifan Yu

The authors of 'Unsupervised Reinforcement Learning in Multiple Environments' propose a method, alpha-MEPOL, to tackle unsupervised RL across multiple environments. They pre-train a task-agnostic exploration policy using interactions from an entire environment class and then fine-tune this policy for various tasks using supervision. We expanded upon this work with the goal of improving performance. We primarily propose and experiment with five new modifications to the original work: sampling trajectories using an entropy-based probability distribution, dynamic alpha, a higher KL divergence threshold, curiosity-driven exploration, and alpha-percentile sampling on curiosity. Dynamic alpha and the higher KL-divergence threshold both provided a significant improvement over the baseline from the earlier work. PDF-sampling failed to provide any improvement because it is approximately equivalent to the baseline method when the sample space is small. In high-dimensional environments, the addition of curiosity-driven exploration enhances learning by encouraging the agent to seek diverse experiences and explore the unknown more. However, its benefits are limited in low-dimensional and simpler environments where exploration possibilities are constrained and there is little that is truly unknown to the agent. Overall, some of our experiments did boost performance over the baseline, and a few directions seem promising for further research.

Towards a Machine Learning-Based Approach to Predict Space Object Density Distributions. (arXiv:2401.04212v1 [physics.space-ph])

Authors: Victor Rodriguez-Fernandez, Sumiyajav Sarangerel, Peng Mun Siew, Pablo Machuca, Daniel Jang, Richard Linares

With the rapid increase in the number of Anthropogenic Space Objects (ASOs), Low Earth Orbit (LEO) is facing significant congestion, thereby posing challenges to space operators and risking the viability of the space environment for varied uses. Current models for examining this evolution, while detailed, are computationally demanding. To address these issues, we propose a novel machine learning-based model as an extension of the MIT Orbital Capacity Tool (MOCAT). This model is designed to accelerate the propagation of ASO density distributions and is trained on hundreds of simulations generated by an established and accurate model of space environment evolution. We study how different deep learning-based solutions can potentially be good candidates for ASO propagation and manage the high dimensionality of the data. To assess the model's capabilities, we conduct experiments in long-term forecasting scenarios (around 100 years), analyze how and why performance degrades over time, and discuss potential improvements.

A learning-based mathematical programming formulation for the automatic configuration of optimization solvers. (arXiv:2401.04237v1 [math.OC])

Authors: Gabriele Iommazzo, Claudia D'Ambrosio, Antonio Frangioni, Leo Liberti

We propose a methodology, based on machine learning and optimization, for selecting a solver configuration for a given instance. First, we employ a set of solved instances and configurations in order to learn a performance function of the solver. Second, we formulate a mixed-integer nonlinear program whose objective/constraints explicitly encode the learnt information, and which we solve, upon the arrival of an unknown instance, to find the best solver configuration for that instance based on the performance function. The main novelty of our approach lies in the fact that the configuration set search problem is formulated as a mathematical program, which allows us to (a) enforce hard dependence and compatibility constraints on the configurations, and (b) solve it efficiently with off-the-shelf optimization tools.

Scalable Normalizing Flows Enable Boltzmann Generators for Macromolecules. (arXiv:2401.04246v1 [cs.LG])

Authors: Joseph C. Kim, David Bloore, Karan Kapoor, Jun Feng, Ming-Hong Hao, Mengdi Wang

The Boltzmann distribution of a protein provides a roadmap to all of its functional states. Normalizing flows are a promising tool for modeling this distribution, but current methods are intractable for typical pharmacological targets: the size of the system, the heterogeneity of the intra-molecular potential energy, and long-range interactions make them computationally infeasible. To remedy these issues, we present a novel flow architecture that utilizes split channels and gated attention to efficiently learn the conformational distribution of proteins defined by internal coordinates. We show that by utilizing a 2-Wasserstein loss, one can smooth the transition from maximum likelihood training to energy-based training, enabling the training of Boltzmann Generators for macromolecules. We evaluate our model and training strategy on villin headpiece HP35(nle-nle), a 35-residue subdomain, and protein G, a 56-residue protein. We demonstrate that standard architectures and training strategies, such as maximum likelihood alone, fail, while our novel architecture and multi-stage training strategy are able to model the conformational distributions of protein G and HP35.

Explaining the Power of Topological Data Analysis in Graph Machine Learning. (arXiv:2401.04250v1 [cs.LG])

Authors: Funmilola Mary Taiwo, Umar Islambekov, Cuneyt Gurcan Akcora

Topological Data Analysis (TDA) has been praised by researchers for its ability to capture intricate shapes and structures within data. TDA is considered robust in handling noisy and high-dimensional datasets, and its interpretability is believed to promote an intuitive understanding of model behavior. However, claims regarding the power and usefulness of TDA have only been partially tested in application domains where TDA-based models are compared to other graph machine learning approaches, such as graph neural networks. We meticulously test claims on TDA through a comprehensive set of experiments and validate their merits. Our results affirm TDA's robustness against outliers and its interpretability, aligning with proponents' arguments. However, we find that TDA does not significantly enhance the predictive power of existing methods in our specific experiments, while incurring significant computational costs. We investigate phenomena related to graph characteristics, such as small diameters and high clustering coefficients, to mitigate the computational expenses of TDA computations. Our results offer valuable perspectives on integrating TDA into graph machine learning tasks.

Attention versus Contrastive Learning of Tabular Data -- A Data-centric Benchmarking. (arXiv:2401.04266v1 [cs.LG])

Authors: Shourav B. Rabbani, Ivan V. Medri, Manar D. Samad

Despite groundbreaking success in image and text learning, deep learning has not achieved significant improvements over traditional machine learning (ML) when it comes to tabular data. This performance gap underscores the need for data-centric treatment and benchmarking of learning algorithms. Recently, attention and contrastive learning breakthroughs have shifted computer vision and natural language processing paradigms. However, the effectiveness of these advanced deep models on tabular data has been studied only sparsely, using a few data sets with very large sample sizes and reporting mixed findings after benchmarking against a limited number of baselines. We argue that the heterogeneity of tabular data sets and selective baselines in the literature can bias the benchmarking outcomes. This article extensively evaluates state-of-the-art attention and contrastive learning methods on a wide selection of 28 tabular data sets (14 easy and 14 hard-to-classify) against traditional deep and machine learning. Our data-centric benchmarking demonstrates when traditional ML is preferred over deep learning and vice versa, because no single best learning method exists for all tabular data sets. Combining between-sample and between-feature attention beats otherwise-dominant traditional ML on tabular data sets by a significant margin but fails on high-dimensional data, where contrastive learning takes a robust lead. While a hybrid attention-contrastive learning strategy mostly wins on hard-to-classify data sets, traditional methods are frequently superior on easy-to-classify data sets with presumably simpler decision boundaries. To the best of our knowledge, this is the first benchmarking paper with statistical analyses of attention and contrastive learning performances on a diverse selection of tabular data sets against traditional deep and machine learning baselines to facilitate further advances in this field.

Predicting the structure of dynamic graphs. (arXiv:2401.04280v1 [cs.LG])

Authors: Sevvandi Kandanaarachchi

Dynamic graph embeddings, together with inductive and incremental learning, facilitate predictive tasks such as node classification and link prediction. However, predicting the structure of a graph at a future time step from a time series of graphs, while allowing for new nodes, has not gained much attention. In this paper, we present such an approach. We use time series methods to predict the node degree at future time points and combine it with flux balance analysis, a linear programming method used in biochemistry, to obtain the structure of future graphs. Furthermore, we explore the predictive graph distribution for different parameter values. We evaluate this method using synthetic and real datasets and demonstrate its utility and applicability.

A Fast Graph Search Algorithm with Dynamic Optimization and Reduced Histogram for Discrimination of Binary Classification Problem. (arXiv:2401.04282v1 [cs.LG])

Authors: Qinwu Xu

This study develops a graph search algorithm to find the optimal discrimination path for a binary classification problem. The objective function is defined as the difference of variations between the true positives (TP) and false positives (FP). It uses the depth-first search (DFS) algorithm to find the top-down paths for discrimination. It proposes a dynamic optimization procedure to optimize TP at the upper levels and then reduce FP at the lower levels. To accelerate computing speed while improving accuracy, it proposes a reduced-histogram algorithm with variable bin size that finds the feature threshold of discrimination without looping over all data points. The algorithm is applied on top of a Support Vector Machine (SVM) model for a binary classification problem on whether a person is fit or unfit. It significantly improves TP and reduces FP relative to the SVM results (e.g., it reduced FP by 90% with a loss of only 5% TP). The graph search auto-generates 39 ranked discrimination paths within 9 seconds on an input of 328,464 objects in total, using a dual-core laptop computer with a 2.59 GHz processor.
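
A hedged sketch of the reduced-histogram idea: rather than scanning every sample value, bin the scores of positives and negatives once and pick the threshold whose cumulative counts maximize TP minus FP. The bin count trades off speed against threshold resolution; the toy scores are assumptions.

```python
# Sketch: histogram-based threshold search maximizing TP - FP.
import numpy as np

def best_threshold(scores_pos, scores_neg, bins=256):
    lo = min(scores_pos.min(), scores_neg.min())
    hi = max(scores_pos.max(), scores_neg.max())
    edges = np.linspace(lo, hi, bins + 1)
    tp_hist, _ = np.histogram(scores_pos, bins=edges)
    fp_hist, _ = np.histogram(scores_neg, bins=edges)
    # Classify "positive" above a threshold: cumulative counts from the right.
    tp = tp_hist[::-1].cumsum()[::-1]
    fp = fp_hist[::-1].cumsum()[::-1]
    objective = tp - fp
    i = int(objective.argmax())
    return edges[i], tp[i], fp[i]

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, 10_000)   # e.g., classifier scores of "fit"
neg = rng.normal(-1.0, 1.0, 10_000)
print(best_threshold(pos, neg))
```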

Universal Consistency of Wide and Deep ReLU Neural Networks and Minimax Optimal Convergence Rates for Kolmogorov-Donoho Optimal Function Classes. (arXiv:2401.04286v1 [stat.ML])

Authors: Hyunouk Ko, Xiaoming Huo

In this paper, we first extend the result of FL93 and prove universal consistency for a classification rule based on wide and deep ReLU neural networks trained on the logistic loss. Unlike the approach in FL93, which decomposes the error into estimation and empirical components, we directly analyze the classification risk based on the observation that a realization of a sufficiently wide neural network is capable of interpolating an arbitrary number of points. Second, we give sufficient conditions on a class of probability measures under which classifiers based on neural networks achieve minimax optimal rates of convergence. Our result is motivated by the practitioner's observation that neural networks are often trained to achieve zero training error, which is the case for our proposed neural network classifiers. Our proofs hinge on recent developments in empirical risk minimization and on approximation rates of deep ReLU neural networks for various function classes of interest. Applications to classical function spaces of smoothness illustrate the usefulness of our result.

Setting the Record Straight on Transformer Oversmoothing. (arXiv:2401.04301v1 [cs.LG])

Authors: Gbètondji J-S Dovonon, Michael M. Bronstein, Matt J. Kusner

Transformer-based models have recently become wildly successful across a diverse set of domains. At the same time, recent work has shown that Transformers are inherently low-pass filters that gradually oversmooth the inputs, reducing the expressivity of their representations. A natural question is: How can Transformers achieve these successes given this shortcoming? In this work we show that in fact Transformers are not inherently low-pass filters. Instead, whether Transformers oversmooth or not depends on the eigenspectrum of their update equations. Our analysis extends prior work in oversmoothing and in the closely-related phenomenon of rank collapse. We show that many successful Transformer models have attention and weights which satisfy conditions that avoid oversmoothing. Based on this analysis, we derive a simple way to parameterize the weights of the Transformer update equations that allows for control over its spectrum, ensuring that oversmoothing does not occur. Compared to a recent solution for oversmoothing, our approach improves generalization, even when training with more layers, fewer datapoints, and data that is corrupted.
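
An illustrative diagnostic in the spirit of this analysis: for a linearized attention update X -> A X W (attention matrix A, combined value/output weights W), the eigenvalues of the update are products of those of A and W, so their magnitudes indicate whether repeated layers are forced to smooth the signal. Toy numbers, not the paper's code.

```python
# Sketch: eigenspectrum of a linearized attention update.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 16, 32

logits = rng.normal(size=(n_tokens, n_tokens))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-stochastic
W = rng.normal(size=(d, d)) / np.sqrt(d)                        # W_V @ W_O

eig_A = np.linalg.eigvals(A)   # leading eigenvalue of a stochastic A is 1
eig_W = np.linalg.eigvals(W)
# Spectrum of X -> A X W is the set of products lambda_i(A) * mu_j(W).
products = np.abs(np.outer(eig_A, eig_W))
print("max |eigenvalue| of the update:", products.max())
# > 1: some components can grow (no forced oversmoothing); if all
# non-leading products were < 1, iteration would contract toward the
# smooth/constant mode.
```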

Advancing Deep Active Learning & Data Subset Selection: Unifying Principles with Information-Theory Intuitions. (arXiv:2401.04305v1 [cs.LG])

Authors: Andreas Kirsch

At its core, this thesis aims to enhance the practicality of deep learning by improving the label and training efficiency of deep learning models. To this end, we investigate data subset selection techniques, specifically active learning and active sampling, grounded in information-theoretic principles. Active learning improves label efficiency, while active sampling enhances training efficiency. Supervised deep learning models often require extensive training with labeled data. Label acquisition can be expensive and time-consuming, and training large models is resource-intensive, hindering adoption outside academic research and "big tech". Existing methods for data subset selection in deep learning often rely on heuristics or lack a principled information-theoretic foundation. In contrast, this thesis examines several objectives for data subset selection and their applications within deep learning, striving for a more principled approach inspired by information theory. We begin by disentangling epistemic and aleatoric uncertainty in single forward-pass deep neural networks, which provides helpful intuitions and insights into different forms of uncertainty and their relevance for data subset selection. We then propose and investigate various approaches for active learning and data subset selection in (Bayesian) deep learning. Finally, we relate various existing and proposed approaches to approximations of information quantities in weight or prediction space. Underpinning this work is a principled and practical notation for information-theoretic quantities that includes both random variables and observed outcomes. This thesis demonstrates the benefits of working from a unified perspective and highlights the potential impact of our contributions to the practical application of deep learning.

Private Truly-Everlasting Robust-Prediction. (arXiv:2401.04311v1 [cs.LG])

Authors: Uri Stemmer

Private Everlasting Prediction (PEP), recently introduced by Naor et al. [2023], is a model for differentially private learning in which the learner never publicly releases a hypothesis. Instead, it provides black-box access to a "prediction oracle" that can predict the labels of an endless stream of unlabeled examples drawn from the underlying distribution. Importantly, PEP provides privacy both for the initial training set and for the endless stream of classification queries. We present two conceptual modifications to the definition of PEP, as well as new constructions exhibiting significant improvements over prior work. Specifically,

(1) Robustness: PEP only guarantees accuracy provided that all the classification queries are drawn from the correct underlying distribution. A few out-of-distribution queries might break the validity of the prediction oracle for future queries, even ones sampled from the correct distribution. We incorporate robustness against such poisoning attacks into the definition of PEP, and show how to obtain it.

(2) Dependence of the privacy parameter $\delta$ on the time horizon: We present a relaxed privacy definition, suitable for PEP, that allows us to disconnect the privacy parameter $\delta$ from the number of total time steps $T$. This allows us to obtain algorithms for PEP whose sample complexity is independent of $T$, thereby making them "truly everlasting". This is in contrast to prior work, where the sample complexity grows with $polylog(T)$.

(3) New constructions: Prior constructions for PEP exhibit sample complexity that is quadratic in the VC dimension of the target class. We present new constructions of PEP for axis-aligned rectangles and for decision-stumps that exhibit sample complexity linear in the dimension (instead of quadratic). We show that our constructions satisfy very strong robustness properties.

Coupling Graph Neural Networks with Fractional Order Continuous Dynamics: A Robustness Study. (arXiv:2401.04331v1 [cs.LG])

Authors: Qiyu Kang, Kai Zhao, Yang Song, Yihang Xie, Yanan Zhao, Sijie Wang, Rui She, Wee Peng Tay

In this work, we rigorously investigate the robustness of graph neural fractional-order differential equation (FDE) models. This framework extends beyond traditional graph neural (integer-order) ordinary differential equation (ODE) models by implementing the time-fractional Caputo derivative. Utilizing fractional calculus allows our model to consider long-term memory during the feature updating process, diverging from the memoryless Markovian updates seen in traditional graph neural ODE models. The superiority of graph neural FDE models over graph neural ODE models has been established in environments free from attacks or perturbations. While traditional graph neural ODE models have been verified to possess a degree of stability and resilience in the presence of adversarial attacks in existing literature, the robustness of graph neural FDE models, especially under adversarial conditions, remains largely unexplored. This paper undertakes a detailed assessment of the robustness of graph neural FDE models. We establish a theoretical foundation outlining the robustness characteristics of graph neural FDE models, highlighting that they maintain more stringent output perturbation bounds in the face of input and graph topology disturbances, compared to their integer-order counterparts. Our empirical evaluations further confirm the enhanced robustness of graph neural FDE models, highlighting their potential in adversarially robust applications.

Deep Efficient Private Neighbor Generation for Subgraph Federated Learning. (arXiv:2401.04336v1 [cs.LG])

Authors: Ke Zhang, Lichao Sun, Bolin Ding, Siu Ming Yiu, Carl Yang

Behemoth graphs are often fragmented and separately stored by multiple data owners as distributed subgraphs in many realistic applications. Without harming data privacy, it is natural to consider the subgraph federated learning (subgraph FL) scenario, where each local client holds a subgraph of the entire global graph, to obtain globally generalized graph mining models. To overcome the unique challenge of incomplete information propagation on local subgraphs due to missing cross-subgraph neighbors, previous works resort to the augmentation of local neighborhoods through the joint FL of missing neighbor generators and GNNs. Yet their technical designs have profound limitations regarding the utility, efficiency, and privacy goals of FL. In this work, we propose FedDEP to comprehensively tackle these challenges in subgraph FL. FedDEP consists of a series of novel technical designs: (1) deep neighbor generation that leverages the GNN embeddings of potential missing neighbors; (2) efficient pseudo-FL for neighbor generation through embedding prototyping; and (3) privacy protection through noise-less edge-local-differential-privacy. We analyze the correctness and efficiency of FedDEP, and provide theoretical guarantees on its privacy. Empirical results on four real-world datasets justify the clear benefits of the proposed techniques.

G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems. (arXiv:2401.04338v1 [cs.LG])

Authors: Youshao Xiao, Shangchun Zhao, Zhenglei Zhou, Zhaoxin Huan, Lin Ju, Xiaolu Zhang, Lin Wang, Jun Zhou

Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and significantly improves statistical performance, especially in cold-start scenarios. However, existing systems are not tailored to meta learning-based DLRM models and have critical efficiency problems for distributed training on GPU clusters, because the conventional deep learning pipeline is not optimized for the two task-specific datasets and two update loops of meta learning. This paper provides a high-performance framework for large-scale training of optimization-based meta DLRM models on GPU clusters, namely G-Meta. Firstly, G-Meta utilizes both data parallelism and model parallelism with careful orchestration of computation and communication efficiency to enable high-speed distributed training. Secondly, it proposes a Meta-IO pipeline for efficient data ingestion to alleviate the I/O bottleneck. Various experimental results show that G-Meta achieves notable training speed without loss of statistical performance. Since early 2022, G-Meta has been deployed in Alipay's core advertising and recommender system, shortening the continuous delivery of models fourfold. It also obtains a 6.48% improvement in Conversion Rate (CVR) and a 1.06% increase in CPM (Cost Per Mille) in Alipay's homepage display advertising, with the benefit of larger training samples and tasks.

Private Fine-tuning of Large Language Models with Zeroth-order Optimization. (arXiv:2401.04343v1 [cs.LG])

Authors: Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, Prateek Mittal

Fine-tuning large pretrained models on private datasets may run the risk of violating privacy. Differential privacy is a framework for mitigating privacy risks by enforcing algorithmic stability. DP-SGD enables training models with private data in a privacy-preserving manner, but raises new obstacles in the form of performance loss and significant engineering challenges. We introduce DP-ZO, a new method for fine-tuning large language models that preserves the privacy of training data by privatizing zeroth-order optimization. A key insight into the design of our method is that the direction of the gradient in SPSA, the zeroth-order algorithm we use, is always random and the only information that depends on private data is the step size, i.e., a scalar. Therefore, we only need to privatize the scalar step size, which is memory-efficient. DP-ZO, which can be instantiated with either Laplace or Gaussian noise, provides a strong privacy-utility trade-off across different tasks, and model sizes, under conservative privacy budgets. One noteworthy result is that DP-ZO exhibits just $1.86\%$ performance degradation due to privacy at $(1,10^{-5})$-DP when fine-tuning OPT-66B on 1000 training samples from SQuAD.
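
A hedged sketch of one DP-ZO-style update follows: the SPSA direction z is data-independent randomness, so only the scalar finite-difference estimate is clipped and noised before stepping. Hyperparameters and noise calibration are placeholders, and a full implementation would clip per-example scalars for proper sensitivity accounting.

```python
# Sketch: privatized zeroth-order (SPSA) step; only the scalar is noised.
import torch

def dp_zo_step(params, loss_fn, eps=1e-3, lr=1e-4, clip=1.0, sigma=1.0):
    # Shared random direction (regenerable from a seed; not private).
    zs = [torch.randn_like(p) for p in params]
    with torch.no_grad():
        for p, z in zip(params, zs): p += eps * z
        loss_plus = loss_fn()                      # L(theta + eps*z)
        for p, z in zip(params, zs): p -= 2 * eps * z
        loss_minus = loss_fn()                     # L(theta - eps*z)
        for p, z in zip(params, zs): p += eps * z  # restore theta
        g = (loss_plus - loss_minus) / (2 * eps)   # scalar depending on data
        g = torch.clamp(g, -clip, clip)            # bound sensitivity
        g = g + sigma * clip * torch.randn(())     # privatize the scalar only
        for p, z in zip(params, zs):
            p -= lr * g * z                        # step along z

# Toy usage: "fine-tune" a linear model on a private batch.
w = torch.zeros(10)
x, y = torch.randn(32, 10), torch.randn(32)
dp_zo_step([w], lambda: ((x @ w - y) ** 2).mean())
```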

A Change Point Detection Integrated Remaining Useful Life Estimation Model under Variable Operating Conditions. (arXiv:2401.04351v1 [cs.LG])

Authors: Anushiya Arunan, Yan Qin, Xiaoli Li, Chau Yuen

By informing the onset of the degradation process, health status evaluation serves as a significant preliminary step for reliable remaining useful life (RUL) estimation of complex equipment. This paper proposes a novel temporal dynamics learning-based model for detecting change points of individual devices, even under variable operating conditions, and utilises the learnt change points to improve the RUL estimation accuracy. During offline model development, the multivariate sensor data are decomposed to learn fused temporal correlation features that are generalisable and representative of normal operation dynamics across multiple operating conditions. Monitoring statistics and control limit thresholds for normal behaviour are dynamically constructed from these learnt temporal features for the unsupervised detection of device-level change points. The detected change points then inform the degradation data labelling for training a long short-term memory (LSTM)-based RUL estimation model. During online monitoring, the temporal correlation dynamics of a query device is monitored for breach of the control limit derived in offline training. If a change point is detected, the device's RUL is estimated with the well-trained offline model for early preventive action. Using C-MAPSS turbofan engines as the case study, the proposed method improved the accuracy by 5.6\% and 7.5\% for two scenarios with six operating conditions, when compared to existing LSTM-based RUL estimation models that do not consider heterogeneous change points.
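
The online monitoring step can be illustrated with a minimal sketch: a control limit is derived from the distribution of a health statistic during normal operation, and a change point is flagged when the statistic breaches the limit persistently. The simple 3-sigma limit and persistence rule below stand in for the paper's learnt temporal-correlation features and dynamically constructed thresholds.

```python
# Sketch: control-limit based change point detection.
import numpy as np

rng = np.random.default_rng(0)
normal_stats = rng.normal(1.0, 0.2, 500)              # offline, healthy devices
limit = normal_stats.mean() + 3 * normal_stats.std()  # e.g., 3-sigma limit

online = np.concatenate([rng.normal(1.0, 0.2, 200),   # healthy operation
                         rng.normal(2.0, 0.3, 100)])  # degradation onset

consecutive, needed = 0, 5     # require persistence to avoid false alarms
for t, s in enumerate(online):
    consecutive = consecutive + 1 if s > limit else 0
    if consecutive >= needed:
        print(f"change point detected at t={t - needed + 1}")
        break
```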

SoK: Facial Deepfake Detectors. (arXiv:2401.04364v1 [cs.CV])

Authors: Binh M. Le, Jiwon Kim, Shahroz Tariq, Kristen Moore, Alsharif Abuadbba, Simon S. Woo

Deepfakes have rapidly emerged as a profound and serious threat to society, primarily due to their ease of creation and dissemination. This situation has triggered an accelerated development of deepfake detection technologies. However, many existing detectors rely heavily on lab-generated datasets for validation, which may not effectively prepare them for novel, emerging, and real-world deepfake techniques. In this paper, we conduct an extensive and comprehensive review and analysis of the latest state-of-the-art deepfake detectors, evaluating them against several critical criteria. These criteria facilitate the categorization of these detectors into 4 high-level groups and 13 fine-grained sub-groups, all aligned with a unified standard conceptual framework. This classification and framework offer deep and practical insights into the factors that affect detector efficacy. We assess the generalizability of 16 leading detectors across various standard attack scenarios, including black-box, white-box, and gray-box settings. Our systematized analysis and experimentation lay the groundwork for a deeper understanding of deepfake detectors and their generalizability, paving the way for future research focused on creating detectors adept at countering various attack scenarios. Additionally, this work offers insights for developing more proactive defenses against deepfakes.

Enhancing Acute Kidney Injury Prediction through Integration of Drug Features in Intensive Care Units. (arXiv:2401.04368v1 [cs.LG])

Authors: Gabriel D. M. Manalu, Mulomba Mukendi Christian, Songhee You, Hyebong Choi

The relationship between acute kidney injury (AKI) prediction and nephrotoxic drugs, i.e., drugs that adversely affect kidney function, has yet to be explored in the critical care setting. One contributing factor to this gap in research is the limited investigation of drug modalities in the intensive care unit (ICU) context, due to the challenges of processing prescription data into corresponding drug representations and a lack of comprehensive understanding of these representations. This study addresses this gap by proposing a novel approach that leverages patient prescription data as a modality to improve existing models for AKI prediction. We base our research on Electronic Health Record (EHR) data, extracting the relevant patient prescription information and converting it into the drug representation selected for our research, the extended-connectivity fingerprint (ECFP). Furthermore, we adopt a unique multimodal approach, developing machine learning models and 1D Convolutional Neural Networks (CNNs) applied to clinical drug representations, a procedure not used in previous studies predicting AKI. The findings showcase a notable improvement in AKI prediction through the integration of drug embeddings and other patient cohort features. Using drug features represented as ECFP molecular fingerprints along with common cohort features such as demographics and lab test values, we achieved a considerable improvement in model performance for the AKI prediction task over a baseline model that does not include drug representations as features, indicating that our approach enhances existing baseline techniques and highlights the relevance of drug data for predicting AKI in the ICU setting.
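
For concreteness, converting a drug into the ECFP representation used here is a one-liner with RDKit (ECFP4 corresponds to a Morgan fingerprint of radius 2); the SMILES string below (aspirin) is just an example, and mapping ICU prescriptions to SMILES is a separate lookup step.

```python
# Sketch: ECFP molecular fingerprint of a drug via RDKit.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)Oc1ccccc1C(=O)O"            # aspirin, as an example
mol = Chem.MolFromSmiles(smiles)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
features = np.array(fp)                      # 2048-d binary drug feature
print(features.shape, int(features.sum()))
```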

Air Quality Forecasting Using Machine Learning: A Global perspective with Relevance to Low-Resource Settings. (arXiv:2401.04369v1 [cs.LG])

Authors: Mulomba Mukendi Christian, Hyebong Choi

Air pollution stands as the fourth leading cause of death globally. While extensive research has been conducted in this domain, most approaches rely on large datasets for prediction, which limits their applicability in low-resource settings even though such settings are often more vulnerable. This study addresses this gap by proposing a novel machine learning approach for accurate air quality prediction using only two months of air quality data. Leveraging the World Weather Repository, meteorological, air pollutant, and Air Quality Index features from 197 capital cities were used to predict air quality for the next day. An evaluation of several machine learning models demonstrates the effectiveness of the Random Forest algorithm in generating reliable predictions, particularly when applied to classification rather than regression, an approach which enhances the model's generalizability by 42%, achieving a cross-validation score of 0.38 for regression and 0.89 for classification. To instill confidence in the predictions, interpretable machine learning was considered. Finally, a cost estimate comparing the implementation of this solution in high-resource and low-resource settings is presented, including a tentative technology-licensing business model. This research highlights the potential for resource-limited countries to independently predict air quality while awaiting larger datasets to further refine their predictions.
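
A hedged sketch of the classification setup described above: discretize the next-day target into bands and cross-validate a random forest. The synthetic features, band edges, and hyperparameters are illustrative placeholders, not the study's configuration.

```python
# Sketch: next-day air quality as classification with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 12))       # meteorological + pollutant features
aqi_next_day = X[:, 0] * 30 + 100 + rng.normal(0, 10, 2000)

# Classification instead of regression: discretize AQI into bands.
y = np.digitize(aqi_next_day, bins=[50, 100, 150, 200])

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```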

Stable generative modeling using diffusion maps. (arXiv:2401.04372v1 [stat.ML])

Authors: Georg Gottwald, Fengyi Li, Youssef Marzouk, Sebastian Reich

We consider the problem of sampling from an unknown distribution for which only a sufficiently large number of training samples are available. Such settings have recently drawn considerable interest in the context of generative modelling. In this paper, we propose a generative model combining diffusion maps and Langevin dynamics. Diffusion maps are used to approximate the drift term from the available training samples, which is then implemented in a discrete-time Langevin sampler to generate new samples. By setting the kernel bandwidth to match the time step size used in the unadjusted Langevin algorithm, our method effectively circumvents any stability issues typically associated with time-stepping stiff stochastic differential equations. More precisely, we introduce a novel split-step scheme, ensuring that the generated samples remain within the convex hull of the training samples. Our framework can be naturally extended to generate conditional samples. We demonstrate the performance of our proposed scheme through experiments on synthetic datasets with increasing dimensions and on a stochastic subgrid-scale parametrization conditional sampling problem.

Towards Explainable Artificial Intelligence (XAI): A Data Mining Perspective. (arXiv:2401.04374v1 [cs.AI])

Authors: Haoyi Xiong, Xuhong Li, Xiaofei Zhang, Jiamin Chen, Xinhao Sun, Yuchen Li, Zeyi Sun, Mengnan Du

Given the complexity and lack of transparency in deep neural networks (DNNs), extensive efforts have been made to make these systems more interpretable or explain their behaviors in accessible terms. Unlike most reviews, which focus on algorithmic and model-centric perspectives, this work takes a "data-centric" view, examining how data collection, processing, and analysis contribute to explainable AI (XAI). We categorize existing work into three categories according to their purposes: interpretations of deep models, referring to feature attributions and reasoning processes that correlate data points with model outputs; influences of training data, examining the impact of training data nuances, such as data valuation and sample anomalies, on decision-making processes; and insights of domain knowledge, discovering latent patterns and fostering new knowledge from data and models to advance social values and scientific discovery. Specifically, we distill XAI methodologies into data mining operations on training and testing data across modalities, such as images, text, and tabular data, as well as on training logs, checkpoints, models and other DNN behavior descriptors. In this way, our study offers a comprehensive, data-centric examination of XAI from a lens of data mining methods and applications.

Machine unlearning through fine-grained model parameters perturbation. (arXiv:2401.04385v1 [cs.LG])

Authors: Zhiwei Zuo, Zhuo Tang, Kenli Li, Anwitaman Datta

Machine unlearning techniques, which involve retracting data records and reducing the influence of said data on trained models, help with the user privacy protection objective but incur significant computational costs. Weight perturbation-based unlearning is a general approach, but it typically involves globally modifying the parameters. We propose fine-grained Top-K and Random-k parameter-perturbation strategies for inexact machine unlearning that address privacy needs while keeping the computational costs tractable.

In order to demonstrate the efficacy of our strategies, we also tackle the challenge of evaluating the effectiveness of machine unlearning by considering the model's generalization performance across both the unlearning and remaining data. To better assess the unlearning effect and model generalization, we propose novel metrics, namely the forgetting rate and the memory retention rate. However, for inexact machine unlearning, current metrics are inadequate for quantifying the degree of forgetting that occurs after unlearning strategies are applied. To address this, we introduce SPD-GAN, which subtly perturbs the distribution of data targeted for unlearning. We then evaluate the degree of unlearning by measuring the performance difference of the models on the perturbed unlearning data before and after the unlearning process. By implementing these techniques and metrics, we achieve computationally efficient privacy protection in machine learning applications without significant sacrifice of model performance. Furthermore, this approach provides a novel method for evaluating the degree of unlearning.
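
A minimal PyTorch sketch of the Top-K idea as described: accumulate gradients on the data to be forgotten and perturb only the parameters with the largest gradient magnitudes. The selection criterion, noise scale `sigma`, and fraction `k_frac` are assumptions of this sketch, not the paper's exact procedure.

```python
import torch

def topk_perturb_unlearn(model, forget_loader, loss_fn, k_frac=0.01, sigma=0.05):
    """Perturb only the parameters most influential on the data to be forgotten.

    Assumes every parameter receives a gradient from the forget-set loss.
    """
    model.zero_grad()
    for x, y in forget_loader:                          # accumulate forget-set gradients
        loss_fn(model(x), y).backward()
    flat = torch.cat([p.grad.abs().flatten() for p in model.parameters()])
    k = max(1, int(k_frac * flat.numel()))
    cutoff = flat.topk(k).values.min()                  # magnitude threshold for the top k
    with torch.no_grad():
        for p in model.parameters():
            mask = (p.grad.abs() >= cutoff).float()
            p.add_(sigma * torch.randn_like(p) * mask)  # noise only on selected weights
    model.zero_grad()
```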

The Role of Higher-Order Cognitive Models in Active Learning. (arXiv:2401.04397v1 [cs.LG])

Authors: Oskar Keurulainen, Gokhan Alcan, Ville Kyrki

Building machines capable of efficiently collaborating with humans has been a longstanding goal in artificial intelligence. Especially in the presence of uncertainties, optimal cooperation often requires that humans and artificial agents model each other's behavior and use these models to infer underlying goals, beliefs or intentions, potentially involving multiple levels of recursion. Empirical evidence for such higher-order cognition in human behavior is also provided by previous works in cognitive science, linguistics, and robotics. We advocate for a new paradigm for active learning for human feedback that utilises humans as active data sources while accounting for their higher levels of agency. In particular, we discuss how increasing levels of agency result in qualitatively different forms of rational communication between an active learning system and a teacher. Additionally, we provide a practical example of active learning using a higher-order cognitive model. This is accompanied by a computational study that underscores the unique behaviors that this model produces.

IGNITE: Individualized GeNeration of Imputations in Time-series Electronic health records. (arXiv:2401.04402v1 [cs.LG])

Authors: Ghadeer O. Ghosheh, Jin Li, Tingting Zhu

Electronic Health Records (EHRs) present a valuable modality for driving personalized medicine, where treatment is tailored to fit individual-level differences. For this purpose, many data-driven machine learning and statistical models rely on the wealth of longitudinal EHRs to study patients' physiological and treatment effects. However, longitudinal EHRs tend to be sparse and exhibit high missingness, where the missingness itself can be informative, reflecting the underlying patient's health status. Therefore, the success of data-driven models for personalized medicine depends heavily on how the physiological data, treatments, and missing values in the EHR data are represented. To this end, we propose a novel deep-learning model that learns the underlying patient dynamics over time across multivariate data to generate personalized realistic values, conditioning on an individual's demographic characteristics and treatments. Our proposed model, IGNITE (Individualized GeNeration of Imputations in Time-series Electronic health records), utilises a conditional dual-variational autoencoder augmented with dual-stage attention to generate missing values for an individual. In IGNITE, we further propose a novel individualized missingness mask (IMM), which helps our model generate values based on the individual's observed data and missingness patterns. We further extend the use of IGNITE from imputing missingness to a personalized data synthesizer, where it generates missing EHRs that were never observed before, or even generates new patients for various applications. We validate our model on three large publicly available datasets and show that IGNITE outperforms state-of-the-art approaches in missing data reconstruction and task prediction.

Fine-Grained Embedding Dimension Optimization During Training for Recommender Systems. (arXiv:2401.04408v1 [cs.IR])

Authors: Qinyi Luo, Penghan Wang, Wei Zhang, Fan Lai, Jiachen Mao, Xiaohan Wei, Jun Song, Wei-Yu Tsai, Shuai Yang, Yuxi Hu, Xuehai Qian

Huge embedding tables in modern Deep Learning Recommender Models (DLRM) require prohibitively large memory during training and inference. Aiming to reduce the memory footprint of training, this paper proposes FIne-grained In-Training Embedding Dimension optimization (FIITED). Given the observation that embedding vectors are not equally important, FIITED adjusts the dimension of each individual embedding vector continuously during training, assigning longer dimensions to more important embeddings while adapting to dynamic changes in data. A novel embedding storage system based on virtually-hashed physically-indexed hash tables is designed to efficiently implement the embedding dimension adjustment and effectively enable memory saving. Experiments on two industry models show that FIITED is able to reduce the size of embeddings by more than 65% while maintaining the trained model's quality, saving significantly more memory than a state-of-the-art in-training embedding pruning method. On public click-through rate prediction datasets, FIITED is able to prune 93.75%-99.75% of embeddings without significant accuracy loss.
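
The per-vector dimension adjustment can be pictured with a small sketch: each embedding row keeps an individually adjustable active dimension that a training-time policy can shrink or grow. The importance-based policy and the simple masking here are assumptions standing in for the paper's virtually-hashed physically-indexed storage.

```python
import numpy as np

class AdaptiveDimEmbedding:
    """Embedding table whose rows expose individually adjustable dimensions."""

    def __init__(self, n_rows, max_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(scale=0.01, size=(n_rows, max_dim))
        self.active = np.full(n_rows, max_dim)        # current dimension per row

    def lookup(self, idx):
        v = self.table[idx].copy()
        v[self.active[idx]:] = 0.0                    # mask out pruned dimensions
        return v

    def resize(self, idx, importance, max_dim):
        # Hypothetical policy: more important rows keep longer embeddings.
        self.active[idx] = max(1, int(importance * max_dim))

emb = AdaptiveDimEmbedding(n_rows=10, max_dim=16)
emb.resize(idx=3, importance=0.25, max_dim=16)        # row 3 shrinks to 4 dims
print(emb.lookup(3))
```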

Meta-forests: Domain generalization on random forests with meta-learning. (arXiv:2401.04425v1 [cs.CV])

Authors: Yuyang Sun, Panagiotis Kosmas

Domain generalization is a popular machine learning technique that enables models to perform well on the unseen target domain, by learning from multiple source domains. Domain generalization is useful in cases where data is limited, difficult, or expensive to collect, such as in object recognition and biomedicine. In this paper, we propose a novel domain generalization algorithm called "meta-forests", which builds upon the basic random forests model by incorporating the meta-learning strategy and maximum mean discrepancy measure. The aim of meta-forests is to enhance the generalization ability of classifiers by reducing the correlation among trees and increasing their strength. More specifically, meta-forests conducts meta-learning optimization during each meta-task, while also utilizing the maximum mean discrepancy as a regularization term to penalize poor generalization performance in the meta-test process. To evaluate the effectiveness of our algorithm, we test it on two public object recognition datasets and a glucose monitoring dataset that we have used in a previous study. Our results show that meta-forests outperforms state-of-the-art approaches in terms of generalization performance on both object recognition and glucose monitoring datasets.

Sea wave data reconstruction using micro-seismic measurements and machine learning methods. (arXiv:2401.04431v1 [physics.ins-det])

Authors: Lorenzo Iafolla, Emiliano Fiorenza, Massimo Chiappini, Cosmo Carmisciano, Valerio Antonio Iafolla

Sea wave monitoring is key in many applications in oceanography such as the validation of weather and wave models. Conventional in situ solutions are based on moored buoys whose measurements are often recognized as a standard. However, being exposed to a harsh environment, they are not reliable, need frequent maintenance, and their datasets feature many gaps. To overcome these limitations, we propose a system including a buoy, a micro-seismic measuring station, and a machine learning algorithm. The working principle is based on measuring the micro-seismic signals generated by the sea waves. Thus, the machine learning algorithm is trained to reconstruct the missing buoy data from the micro-seismic data. As the micro-seismic station can be installed indoors, it assures high reliability while the machine learning algorithm provides accurate reconstruction of the missing buoy data. In this work, we present the methods to process the data, develop and train the machine learning algorithm, and assess the reconstruction accuracy. As a case study, we used experimental data collected in 2014 from the Northern Tyrrhenian Sea, demonstrating that the data reconstruction can be done both for significant wave height and wave period. The proposed approach was inspired by Data Science, whose methods were the foundation for the new solutions presented in this work. For example, estimating the period of the sea waves, often not discussed in previous works, was relatively simple with machine learning. In conclusion, the experimental results demonstrated that the new system can overcome the reliability issues of the buoy while keeping the same accuracy.

AI Competitions and Benchmarks, Practical issues: Proposals, grant money, sponsors, prizes, dissemination, publicity. (arXiv:2401.04452v1 [cs.LG])

Authors: Magali Richard (TIMC-MAGe), Yuna Blum (IGDR), Justin Guinney, Gustavo Stolovitzky, Adrien Pavão (LRI)

This chapter provides a comprehensive overview of the pragmatic aspects involved in organizing AI competitions. We begin by discussing strategies to incentivize participation, touching upon effective communication techniques, aligning with trending topics in the field, structuring awards, potential recruitment opportunities, and more. We then shift to the essence of community engagement, organizational best practices, and effective means of disseminating challenge outputs. Lastly, the chapter addresses the logistics, detailing costs, required manpower, and resource allocation for effectively managing and executing a challenge. By examining these practical problems, readers will gain actionable insights to navigate the multifaceted landscape of AI competition organization, from inception to completion.

PhilEO Bench: Evaluating Geo-Spatial Foundation Models. (arXiv:2401.04464v1 [cs.CV])

Authors: Casper Fibaek, Luke Camilleri, Andreas Luyts, Nikolaos Dionelis, Bertrand Le Saux

Massive amounts of unlabelled data are captured by Earth Observation (EO) satellites, with the Sentinel-2 constellation generating 1.6 TB of data daily. This makes Remote Sensing a data-rich domain well suited to Machine Learning (ML) solutions. However, a bottleneck in applying ML models to EO is the lack of annotated data, as annotation is a labour-intensive and costly process. As a result, research in this domain has focused on Self-Supervised Learning and Foundation Model approaches. This paper addresses the need to evaluate different Foundation Models on a fair and uniform benchmark by introducing the PhilEO Bench, a novel evaluation framework for EO Foundation Models. The framework comprises a testbed and a novel 400 GB Sentinel-2 dataset containing labels for three downstream tasks: building density estimation, road segmentation, and land cover classification. We present experiments using our framework to evaluate different Foundation Models, including Prithvi and SatMAE, at multiple n-shots and convergence rates.

A Survey on Efficient Federated Learning Methods for Foundation Model Training. (arXiv:2401.04472v1 [cs.LG])

Authors: Herbert Woisetschläger, Alexander Isenko, Shiqiang Wang, Ruben Mayer, Hans-Arno Jacobsen

Federated Learning (FL) has become an established technique to facilitate privacy-preserving collaborative training. However, new approaches to FL often discuss their contributions involving small deep-learning models only. With the tremendous success of transformer models, the following question arises: What is necessary to operationalize foundation models in an FL application? Knowing that computation and communication often take up similar amounts of time in FL, we introduce a novel taxonomy focused on computational and communication efficiency methods in FL applications. These methods aim to optimize the training time and reduce communication between clients and the server. We also look at the current state of widely used FL frameworks and discuss future research potentials based on existing approaches in FL research and beyond.

TwinBooster: Synergising Large Language Models with Barlow Twins and Gradient Boosting for Enhanced Molecular Property Prediction. (arXiv:2401.04478v1 [q-bio.BM])

Authors: Maximilian G. Schuh, Davide Boldini, Stephan A. Sieber

The success of drug discovery and development relies on the precise prediction of molecular activities and properties. While in silico molecular property prediction has shown remarkable potential, its use has been limited so far to assays for which large amounts of data are available. In this study, we use a fine-tuned large language model to integrate biological assays based on their textual information, coupled with Barlow Twins, a Siamese neural network using a novel self-supervised learning approach. This architecture uses both assay information and molecular fingerprints to extract the true molecular information. TwinBooster enables the prediction of properties of unseen bioassays and molecules, delivering state-of-the-art zero-shot learning performance. Remarkably, our artificial intelligence pipeline shows excellent performance on the FS-Mol benchmark. This breakthrough demonstrates the application of deep learning to critical property prediction tasks where data is typically scarce. By accelerating the early identification of active molecules in drug discovery and development, this method has the potential to help streamline the identification of novel therapeutics.

Continuously Learning New Words in Automatic Speech Recognition. (arXiv:2401.04482v1 [cs.CL])

Authors: Christian Huber, Alexander Waibel

Despite recent advances, Automatic Speech Recognition (ASR) systems are still far from perfect. Typical errors include acronyms, named entities, and domain-specific special words for which little or no data is available. To address the problem of recognizing these words, we propose a self-supervised continual learning approach. Given the audio of a lecture talk with corresponding slides, we bias the model towards decoding new words from the slides by using a memory-enhanced ASR model from previous work. Then, we perform inference on the talk, collecting utterances that contain detected new words into an adaptation dataset. Continual learning is then performed on this set by adapting low-rank matrices added to each weight matrix of the model. The whole procedure is iterated for many talks. We show that with this approach, we obtain increasing performance on the new words as they occur more frequently (more than 80% recall) while preserving the general performance of the model.
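
The adaptation step resembles standard low-rank adapters; a minimal sketch of adding a trainable low-rank residual to a frozen weight matrix follows (the rank and initialization are assumed, not taken from the paper).

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank residual."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T  # adds (B @ A) x

layer = LowRankAdapter(nn.Linear(256, 256))
print(layer(torch.randn(4, 256)).shape)                # torch.Size([4, 256])
```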

Optimal Survival Trees: A Dynamic Programming Approach. (arXiv:2401.04489v1 [cs.LG])

Authors: Tim Huisman, Jacobus G. M. van der Linden, Emir Demirović

Survival analysis studies and predicts the time of death, or other singular unrepeated events, based on historical data, while the true time of death for some instances is unknown. Survival trees enable the discovery of complex nonlinear relations in a compact, human-comprehensible model by recursively splitting the population and predicting a distinct survival distribution in each leaf node. We use dynamic programming to provide the first survival tree method with optimality guarantees, enabling the assessment of the optimality gap of heuristics. We improve the scalability of our method through a special algorithm for computing trees up to depth two. The experiments show that our method's run time even outperforms some heuristics for realistic cases, while obtaining similar out-of-sample performance to the state of the art.

SpiNNaker2: A Large-Scale Neuromorphic System for Event-Based and Asynchronous Machine Learning. (arXiv:2401.04491v1 [cs.ET])

Authors: Hector A. Gonzalez, Jiaxin Huang, Florian Kelber, Khaleelulla Khan Nazeer, Tim Langer, Chen Liu, Matthias Lohrmann, Amirhossein Rostami, Mark Schöne, Bernhard Vogginger, Timo C. Wunderlich, Yexin Yan, Mahmoud Akl, Christian Mayr

The joint progress of artificial neural networks (ANNs) and domain-specific hardware accelerators such as GPUs and TPUs has transformed many domains of machine learning research. This development is accompanied by rapid growth in the computational demands for larger models and more data. Concurrently, emerging properties of foundation models such as in-context learning drive new opportunities for machine learning applications. However, the computational cost of such applications is a limiting factor of the technology in data centers, and more importantly in mobile devices and edge systems. To mitigate the energy footprint and non-trivial latency of contemporary systems, neuromorphic computing systems deeply integrate computational principles of neurobiological systems by leveraging low-power analog and digital technologies. SpiNNaker2 is a digital neuromorphic chip developed for scalable machine learning. The event-based and asynchronous design of SpiNNaker2 allows the composition of large-scale systems involving thousands of chips. This work features the operating principles of SpiNNaker2 systems, outlining prototypes of novel machine learning applications. These applications range from ANNs through bio-inspired spiking neural networks to generalized event-based neural networks. With the successful development and deployment of SpiNNaker2, we aim to facilitate the advancement of event-based and asynchronous algorithms for future generations of machine learning systems.

Data-driven Nonlinear Model Reduction using Koopman Theory: Integrated Control Form and NMPC Case Study. (arXiv:2401.04508v1 [eess.SY])

Authors: Jan C. Schulze, Alexander Mitsos

We use Koopman theory for data-driven model reduction of nonlinear dynamical systems with controls. We propose generic model structures combining delay-coordinate encoding of measurements and full-state decoding to integrate reduced Koopman modeling and state estimation. We present a deep-learning approach to train the proposed models. A case study demonstrates that our approach provides accurate control models and enables real-time capable nonlinear model predictive control of a high-purity cryogenic distillation column.
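
A toy rendition of the two ingredients, delay-coordinate encoding and a linear surrogate with control inputs, using a plain least-squares (DMDc-style) fit in place of the paper's deep-learning training; all names and the toy data are illustrative.

```python
import numpy as np

def delay_embed(y, n_delays):
    """Stack n_delays consecutive measurements into one lifted state per step."""
    cols = [y[i : len(y) - n_delays + i + 1] for i in range(n_delays)]
    return np.column_stack(cols)

def fit_linear_surrogate(Z, U):
    """Least-squares fit of z_{k+1} = A z_k + B u_k on lifted states Z."""
    Zk, Zk1, Uk = Z[:-1], Z[1:], U[: len(Z) - 1]
    G = np.column_stack([Zk, Uk])
    AB = np.linalg.lstsq(G, Zk1, rcond=None)[0].T
    return AB[:, : Z.shape[1]], AB[:, Z.shape[1]:]     # A, B

t = np.linspace(0.0, 20.0, 400)
u = np.sin(0.5 * t)                                    # toy control input
y = np.sin(t) + 0.1 * u                                # toy scalar measurement
Z = delay_embed(y, n_delays=5)
A, B = fit_linear_surrogate(Z, u[4:, None])            # align controls with lifted states
print(A.shape, B.shape)                                # (5, 5) (5, 1)
```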

Zero Shot Audio to Audio Emotion Transfer With Speaker Disentanglement. (arXiv:2401.04511v1 [eess.AS])

Authors: Soumya Dutta, Sriram Ganapathy

The problem of audio-to-audio (A2A) style transfer involves replacing the style features of the source audio with those from the target audio while preserving the content-related attributes of the source audio. In this paper, we propose an efficient approach, termed Zero-shot Emotion Style Transfer (ZEST), that replaces the emotional content present in the given source audio with the one embedded in the target audio while retaining the speaker and speech content from the source. The proposed system builds upon decomposing speech into semantic tokens, speaker representations and emotion embeddings. Using these factors, we propose a framework to reconstruct the pitch contour of the given speech signal and train a decoder that reconstructs the speech signal. The model is trained using a self-supervision based reconstruction loss. During conversion, the emotion embedding alone is derived from the target audio, while the rest of the factors are derived from the source audio. In our experiments, we illustrate the zero-shot emotion transfer capabilities of the proposed ZEST model using objective and subjective quality evaluations, even without using parallel training data or labels from the source or target audio.

Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search. (arXiv:2401.04514v1 [cs.SE])

Authors: Haochen Li, Xin Zhou, Zhiqi Shen

In code search, the Generation-Augmented Retrieval (GAR) framework, which generates exemplar code snippets to augment queries, has emerged as a promising strategy to address the principal challenge of modality misalignment between code snippets and natural language queries, particularly with the demonstrated code generation capabilities of Large Language Models (LLMs). Nevertheless, our preliminary investigations indicate that the improvements conferred by such an LLM-augmented framework are somewhat constrained. This limitation could potentially be ascribed to the fact that the generated codes, albeit functionally accurate, frequently display a pronounced stylistic deviation from the ground truth code in the codebase. In this paper, we extend the foundational GAR framework and propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization. Experimental results demonstrate that ReCo significantly boosts retrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%), and fine-tuned dense (up to 23.6%) retrieval settings in diverse search scenarios. To further elucidate the advantages of ReCo and stimulate research in code style normalization, we introduce Code Style Similarity, the first metric tailored to quantify stylistic similarities in code. Notably, our empirical findings reveal the inadequacy of existing metrics in capturing stylistic nuances.

Semi-Supervised Deep Sobolev Regression: Estimation, Variable Selection and Beyond. (arXiv:2401.04535v1 [stat.ML])

Authors: Zhao Ding, Chenguang Duan, Yuling Jiao, Jerry Zhijian Yang

We propose SDORE, a semi-supervised deep Sobolev regressor, for the nonparametric estimation of the underlying regression function and its gradient. SDORE employs deep neural networks to minimize empirical risk with gradient norm regularization, allowing computation of the gradient norm on unlabeled data. We conduct a comprehensive analysis of the convergence rates of SDORE and establish a minimax optimal rate for the regression function. Crucially, we also derive a convergence rate for the associated plug-in gradient estimator, even in the presence of significant domain shift. These theoretical findings offer valuable prior guidance for selecting regularization parameters and determining the size of the neural network, while showcasing the provable advantage of leveraging unlabeled data in semi-supervised learning. To the best of our knowledge, SDORE is the first provable neural network-based approach that simultaneously estimates the regression function and its gradient, with diverse applications including nonparametric variable selection and inverse problems. The effectiveness of SDORE is validated through an extensive range of numerical simulations and real data analysis.
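
A minimal PyTorch sketch of the stated objective: empirical risk on labeled pairs plus a gradient-norm penalty evaluated on unlabeled inputs. The weighting `lam` and the architecture are assumed hyperparameters, not the paper's settings.

```python
import torch

def sdore_style_loss(model, x_lab, y_lab, x_unlab, lam=0.1):
    """Labeled MSE plus a gradient-norm regularizer computed on unlabeled data."""
    risk = torch.mean((model(x_lab).squeeze(-1) - y_lab) ** 2)
    x_unlab = x_unlab.clone().requires_grad_(True)
    grads = torch.autograd.grad(model(x_unlab).sum(), x_unlab, create_graph=True)[0]
    penalty = grads.pow(2).sum(dim=1).mean()     # squared gradient norm, unlabeled inputs
    return risk + lam * penalty

net = torch.nn.Sequential(torch.nn.Linear(3, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
loss = sdore_style_loss(net, torch.randn(16, 3), torch.randn(16), torch.randn(64, 3))
loss.backward()
```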

Evaluating Language Model Agency through Negotiations. (arXiv:2401.04536v1 [cs.CL])

Authors: Tim R. Davidson, Veniamin Veselovsky, Martin Josifoski, Maxime Peyrard, Antoine Bosselut, Michal Kosinski, Robert West

Companies, organizations, and governments increasingly exploit Language Models' (LM) remarkable capability to display agent-like behavior. As LMs are adopted to perform tasks with growing autonomy, there exists an urgent need for reliable and scalable evaluation benchmarks. Current, predominantly static LM benchmarks are ill-suited to evaluate such dynamic applications. Thus, we propose jointly evaluating LM performance and alignment through the lenses of negotiation games. We argue that this common task better reflects real-world deployment conditions while offering insights into LMs' decision-making processes. Crucially, negotiation games allow us to study multi-turn and cross-model interactions, modulate complexity, and side-step accidental data leakage in evaluation. We report results for six publicly accessible LMs from several major providers on a variety of negotiation games, evaluating both self-play and cross-play performance. Noteworthy findings include: (i) open-source models are currently unable to complete these tasks; (ii) cooperative bargaining games prove challenging; and (iii) the most powerful models do not always "win".

Linear Recursive Feature Machines provably recover low-rank matrices. (arXiv:2401.04553v1 [stat.ML])

Authors: Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy

A fundamental problem in machine learning is to understand how neural networks make accurate predictions, while seemingly bypassing the curse of dimensionality. A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning. Recent work posited that the effects of feature learning can be elicited from a classical statistical estimator called the average gradient outer product (AGOP). The authors proposed Recursive Feature Machines (RFMs) as an algorithm that explicitly performs feature learning by alternating between (1) reweighting the feature vectors by the AGOP and (2) learning the prediction function in the transformed space. In this work, we develop the first theoretical guarantees for how RFM performs dimensionality reduction by focusing on the class of overparametrized problems arising in sparse linear regression and low-rank matrix recovery. Specifically, we show that RFM restricted to linear models (lin-RFM) generalizes the well-studied Iteratively Reweighted Least Squares (IRLS) algorithm. Our results shed light on the connection between feature learning in neural networks and classical sparse recovery algorithms. In addition, we provide an implementation of lin-RFM that scales to matrices with millions of missing entries. Our implementation is faster than the standard IRLS algorithm as it is SVD-free. It also outperforms deep linear networks for sparse linear regression and low-rank matrix completion.
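
A compact sketch of the alternating scheme in the linear case: fit a ridge model in the reweighted space, form the AGOP (for a linear predictor, the outer product of its gradient), and update the reweighting matrix. The matrix power `alpha` and ridge strength are illustrative choices, not the paper's settings.

```python
import numpy as np

def matrix_power_psd(S, alpha):
    # Fractional power of a positive semidefinite matrix via its eigendecomposition.
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.clip(vals, 0, None) ** alpha) @ vecs.T

def lin_rfm(X, y, n_iters=10, ridge=1e-3, alpha=0.5):
    d = X.shape[1]
    M = np.eye(d)                                       # feature-reweighting matrix
    for _ in range(n_iters):
        Z = X @ M                                       # transformed features
        w = np.linalg.solve(Z.T @ Z + ridge * np.eye(d), Z.T @ y)  # ridge fit
        grad = M @ w                                    # gradient of x -> w^T (M x)
        M = matrix_power_psd(np.outer(grad, grad), alpha)  # AGOP reweighting
    return M, w

# Toy sparse regression: only the first two coordinates matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1]
M, w = lin_rfm(X, y)
print(np.round(np.diag(M), 2))   # mass concentrates on the informative features
```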

HyperGANStrument: Instrument Sound Synthesis and Editing with Pitch-Invariant Hypernetworks. (arXiv:2401.04558v1 [cs.SD])

Authors: Zhe Zhang, Taketo Akama

GANStrument, exploiting GANs with a pitch-invariant feature extractor and instance conditioning technique, has shown remarkable capabilities in synthesizing realistic instrument sounds. To further improve the reconstruction ability and pitch accuracy to enhance the editability of user-provided sound, we propose HyperGANStrument, which introduces a pitch-invariant hypernetwork to modulate the weights of a pre-trained GANStrument generator, given a one-shot sound as input. The hypernetwork modulation provides feedback for the generator in the reconstruction of the input sound. In addition, we take advantage of an adversarial fine-tuning scheme for the hypernetwork to improve the reconstruction fidelity and generation diversity of the generator. Experimental results show that the proposed model not only enhances the generation capability of GANStrument but also significantly improves the editability of synthesized sounds. Audio examples are available at the online demo page.

Robust Imitation Learning for Automated Game Testing. (arXiv:2401.04572v1 [cs.LG])

Authors: Pierluigi Vito Amadori, Timothy Bradley, Ryan Spick, Guy Moss

Game development is a long process that involves many stages before a product is ready for the market. Human play testing is among the most time consuming, as testers are required to repeatedly perform tasks in the search for errors in the code. Therefore, automated testing is seen as a key technology for the gaming industry, as it would dramatically improve development costs and efficiency. Toward this end, we propose EVOLUTE, a novel imitation learning-based architecture that combines behavioural cloning (BC) with energy based models (EBMs). EVOLUTE is a two-stream ensemble model that splits the action space of autonomous agents into continuous and discrete tasks. The EBM stream handles the continuous tasks, to have a more refined and adaptive control, while the BC stream handles discrete actions, to ease training. We evaluate the performance of EVOLUTE in a shooting-and-driving game, where the agent is required to navigate and continuously identify targets to attack. The proposed model has higher generalisation capabilities than standard BC approaches, showing a wider range of behaviours and higher performance. Also, EVOLUTE is easier to train than a pure end-to-end EBM model, as discrete tasks can be quite sparse in the dataset and cause training to explore a much wider set of possible actions.

Masked Audio Generation using a Single Non-Autoregressive Transformer. (arXiv:2401.04577v1 [cs.SD])

Authors: Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which are then used for later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is decoded in parallel. We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. The proposed approach is comparable to the evaluated baselines, while being significantly faster (7x faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each of the components comprising MAGNeT, together with pointing to the trade-offs between autoregressive and non-autoregressive modeling, considering latency, throughput, and generation quality. Samples are available on our demo page https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.
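
The masked iterative decoding loop can be sketched independently of audio specifics: start fully masked and, at each step, commit the most confident predictions. The linear unmasking schedule and confidence rule below are generic MaskGIT-style assumptions rather than MAGNeT's exact recipe.

```python
import torch

@torch.no_grad()
def iterative_decode(model, seq_len, vocab_size, n_steps=10):
    MASK = vocab_size                                  # assume mask id = vocab_size
    tokens = torch.full((1, seq_len), MASK, dtype=torch.long)
    for step in range(1, n_steps + 1):
        probs = model(tokens).softmax(-1)              # (1, seq_len, vocab_size + 1)
        conf, pred = probs[..., :vocab_size].max(-1)   # never predict the mask id
        conf = conf.masked_fill(tokens != MASK, -1.0)  # rank only still-masked slots
        n_commit = seq_len * step // n_steps - (tokens != MASK).sum().item()
        if n_commit > 0:
            idx = conf.topk(n_commit, dim=-1).indices  # most confident masked slots
            tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens

# Usage with a stand-in "model": random logits of shape (1, seq_len, vocab + 1).
fake_model = lambda t: torch.randn(t.shape[0], t.shape[1], 101)
print(iterative_decode(fake_model, seq_len=20, vocab_size=100))
```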

Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models. (arXiv:2401.04585v1 [cs.CV])

Authors: Xuewen Liu, Zhikai Li, Junrui Xiao, Qingyi Gu

Diffusion models have achieved great success in image generation tasks through iterative noise estimation. However, the heavy denoising process and complex neural networks hinder their low-latency applications in real-world scenarios. Quantization can effectively reduce model complexity, and post-training quantization (PTQ), which does not require fine-tuning, is highly promising in accelerating the denoising process. Unfortunately, we find that due to the highly dynamic distribution of activations in different denoising steps, existing PTQ methods for diffusion models suffer from distribution mismatch issues at both calibration sample level and reconstruction output level, which makes the performance far from satisfactory, especially in low-bit cases. In this paper, we propose Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models (EDA-DM) to address the above issues. Specifically, at the calibration sample level, we select calibration samples based on the density and diversity in the latent space, thus facilitating the alignment of their distribution with the overall samples; and at the reconstruction output level, we propose Fine-grained Block Reconstruction, which can align the outputs of the quantized model and the full-precision model at different network granularity. Extensive experiments demonstrate that EDA-DM outperforms the existing post-training quantization frameworks in both unconditional and conditional generation scenarios. At low-bit precision, the quantized models with our method even outperform the full-precision models on most datasets.

Distribution-Free Conformal Joint Prediction Regions for Neural Marked Temporal Point Processes. (arXiv:2401.04612v1 [cs.LG])

Authors: Victor Dheur, Tanguy Bosser, Rafael Izbicki, Souhaib Ben Taieb

Sequences of labeled events observed at irregular intervals in continuous time are ubiquitous across various fields. Temporal Point Processes (TPPs) provide a mathematical framework for modeling these sequences, enabling inferences such as predicting the arrival time of future events and their associated label, called mark. However, due to model misspecification or lack of training data, these probabilistic models may provide a poor approximation of the true, unknown underlying process, with prediction regions extracted from them being unreliable estimates of the underlying uncertainty. This paper develops more reliable methods for uncertainty quantification in neural TPP models via the framework of conformal prediction. A primary objective is to generate a distribution-free joint prediction region for the arrival time and mark, with a finite-sample marginal coverage guarantee. A key challenge is to handle both a strictly positive, continuous response and a categorical response, without distributional assumptions. We first consider a simple but overly conservative approach that combines individual prediction regions for the event arrival time and mark. Then, we introduce a more effective method based on bivariate highest density regions derived from the joint predictive density of event arrival time and mark. By leveraging the dependencies between these two variables, this method excludes unlikely combinations of the two, resulting in sharper prediction regions while still attaining the pre-specified coverage level. We also explore the generation of individual univariate prediction regions for arrival times and marks through conformal regression and classification techniques. Moreover, we investigate the stronger notion of conditional coverage. Finally, through extensive experimentation on both simulated and real-world datasets, we assess the validity and efficiency of these methods.
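
The simple conservative combination mentioned above can be sketched with split conformal quantiles: calibrating each marginal at level alpha/2 covers the (time, mark) pair with probability at least 1 - alpha. The nonconformity scores and toy data here are assumptions.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

rng = np.random.default_rng(0)
alpha = 0.1
time_scores = np.abs(rng.gamma(2.0, size=500) - 2.0)   # stand-in |t - t_hat| residuals
mark_scores = 1.0 - rng.beta(5, 2, size=500)           # stand-in 1 - p_model(true mark)

t_q = conformal_quantile(time_scores, alpha / 2)       # Bonferroni split over the pair
m_q = conformal_quantile(mark_scores, alpha / 2)

t_hat, prob_marks = 3.0, np.array([0.6, 0.3, 0.1])     # one hypothetical test prediction
time_region = (max(0.0, t_hat - t_q), t_hat + t_q)     # arrival times are positive
mark_region = [k for k, p in enumerate(prob_marks) if 1.0 - p <= m_q]
print(time_region, mark_region)
```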

Deep Reinforcement Multi-agent Learning framework for Information Gathering with Local Gaussian Processes for Water Monitoring. (arXiv:2401.04631v1 [cs.AI])

Authors: Samuel Yanes Luis, Dmitriy Shutin, Juan Marchal Gómez, Daniel Gutiérrez Reina, Sergio Toral Marín

The conservation of hydrological resources involves continuously monitoring their contamination. A multi-agent system composed of autonomous surface vehicles is proposed in this paper to efficiently monitor the water quality. To achieve safe control of the fleet, the fleet policy should be able to act based on measurements and on the fleet state. It is proposed to use Local Gaussian Processes and Deep Reinforcement Learning to jointly obtain effective monitoring policies. Local Gaussian processes, unlike classical global Gaussian processes, can accurately model information across regions with dissimilar spatial correlations, capturing the water quality information more accurately. A deep convolutional policy is proposed that bases its decisions on observations of the mean and variance of this model, by means of an information-gain reward. Using a Double Deep Q-Learning algorithm, agents are trained to minimize the estimation error in a safe manner thanks to a consensus-based heuristic. Simulation results indicate an improvement of up to 24% in terms of the mean absolute error with the proposed models. Also, training results with 1-3 agents indicate that our proposed approach returns 20% and 24% smaller average estimation errors for, respectively, monitoring water quality variables and monitoring algae blooms, as compared to state-of-the-art approaches.

Hypercomplex neural network in time series forecasting of stock data. (arXiv:2401.04632v1 [cs.NE])

Authors: Radosław Kycia, Agnieszka Niemczynowicz

Three classes of architectures for time series prediction were tested. They differ in their input layers, which contain either convolutional, LSTM, or dense hypercomplex layers for 4D algebras. The input was four related stock market time series, and the task was to predict one of them. Hyperparameters related to each class of architectures were optimized in order to compare the best neural networks within each class. The results show that in most cases, the architecture with a hypercomplex dense layer provides MAE accuracy similar to the other architectures, but with considerably fewer trainable parameters. Thanks to this, hypercomplex neural networks can be trained and can process data faster than the other tested architectures. Moreover, the order of the input time series has an impact on effectiveness.

Applying Large Language Models API to Issue Classification Problem. (arXiv:2401.04637v1 [cs.SE])

Authors: Gabriel Aracena, Kyle Luster, Fabio Santos, Igor Steinmacher, Marco A. Gerosa

Effective prioritization of issue reports is crucial in software engineering to optimize resource allocation and address critical problems promptly. However, the manual classification of issue reports for prioritization is laborious and lacks scalability. Alternatively, many open source software (OSS) projects employ automated processes for this task, albeit relying on substantial datasets for adequate training. This research seeks to devise an automated approach that ensures reliability in issue prioritization, even when trained on smaller datasets. Our proposed methodology harnesses the power of Generative Pre-trained Transformers (GPT), recognizing their potential to efficiently handle this task. By leveraging the capabilities of such models, we aim to develop a robust system for prioritizing issue reports accurately, mitigating the necessity for extensive training data while maintaining reliability. In our research, we have developed a reliable GPT-based approach to accurately label and prioritize issue reports with a reduced training dataset. By reducing reliance on massive data requirements and focusing on few-shot fine-tuning, our methodology offers a more accessible and efficient solution for issue prioritization in software engineering. Our model predicted issue types in individual projects with up to 93.2% precision, 95% recall, and an 89.3% F1-score.

Advancing Ante-Hoc Explainable Models through Generative Adversarial Networks. (arXiv:2401.04647v1 [cs.CV])

Authors: Tanmay Garg, Deepika Vemuri, Vineeth N Balasubramanian

This paper presents a novel concept learning framework for enhancing model interpretability and performance in visual classification tasks. Our approach appends an unsupervised explanation generator to the primary classifier network and makes use of adversarial training. During training, the explanation module is optimized to extract visual concepts from the classifier's latent representations, while the GAN-based module aims to discriminate images generated from concepts, from true images. This joint training scheme enables the model to implicitly align its internally learned concepts with human-interpretable visual properties. Comprehensive experiments demonstrate the robustness of our approach, while producing coherent concept activations. We analyse the learned concepts, showing their semantic concordance with object parts and visual attributes. We also study how perturbations in the adversarial training protocol impact both classification and concept acquisition. In summary, this work presents a significant step towards building inherently interpretable deep vision models with task-aligned concept representations - a key enabler for developing trustworthy AI for real-world perception tasks.

A novel framework for generalization of deep hidden physics models. (arXiv:2401.04648v1 [cs.LG])

Authors: Vijay Kag, Birupaksha Pal

Modelling of systems where the full system information is unknown is an oft-encountered problem in various engineering and industrial applications, as it is either impossible to consider all the complex physics involved, or simpler models are chosen to keep within the limits of the available resources. Recent advances in greybox modelling, like the deep hidden physics models, address this space by combining data and physics. However, for most real-life applications, model generalizability is a key issue, as retraining a model for every small change in system inputs and parameters, or modification in domain configuration, can render the model economically unviable. In this work, we present a novel enhancement to the idea of hidden physics models which can generalize for changes in system inputs, parameters and domains. We also show that this approach holds promise for system discovery and helps learn the hidden physics for the changed system inputs, parameters and domain configuration.

Benchmark Analysis of Various Pre-trained Deep Learning Models on ASSIRA Cats and Dogs Dataset. (arXiv:2401.04666v1 [cs.CV])

Authors: Galib Muhammad Shahriar Himel, Md. Masudul Islam

As the most basic application and implementation of deep learning, image classification has grown in popularity. Various datasets are provided by renowned data science communities for benchmarking machine learning algorithms and pre-trained models. The ASSIRA Cats & Dogs dataset is one of them and is used in this research for its overall acceptance and benchmark standards. A comparison of various pre-trained models is demonstrated by using different types of optimizers and loss functions. Hyper-parameters are changed to gain the best result from a model. By applying this approach, we obtained higher accuracy without major changes to the training model. To run the experiment, we used three different computer architectures: a laptop equipped with an NVIDIA GeForce GTX 1070, a laptop equipped with an NVIDIA GeForce RTX 3080Ti, and a desktop equipped with an NVIDIA GeForce RTX 3090. The acquired results demonstrate superior accuracy over previous experiments on this dataset. From this experiment, the highest accuracy, 99.65%, was obtained using NASNet Large.

Transfer-Learning-Based Autotuning Using Gaussian Copula. (arXiv:2401.04669v1 [cs.LG])

Authors: Thomas Randall (1), Jaehoon Koo (2), Brice Videau (3), Michael Kruse (3), Xingfu Wu (3), Paul Hovland (3), Mary Hall (4), Rong Ge (1), Prasanna Balaprakash (5) ((1) Clemson University, (2) Hanyang University, (3) Argonne National Laboratory, (4) University of Utah, (5) Oak Ridge National Laboratory)

As diverse high-performance computing (HPC) systems are built, many opportunities arise for applications to solve larger problems than ever before. Given the significantly increased complexity of these HPC systems and application tuning, empirical performance tuning, such as autotuning, has emerged as a promising approach in recent years. Despite its effectiveness, autotuning is often a computationally expensive approach. Transfer learning (TL)-based autotuning seeks to address this issue by leveraging the data from prior tuning. Current TL methods for autotuning spend significant time modeling the relationship between parameter configurations and performance, which is ineffective for few-shot (that is, few empirical evaluations) tuning on new tasks. We introduce the first generative TL-based autotuning approach based on the Gaussian copula (GC) to model the high-performing regions of the search space from prior data and then generate high-performing configurations for new tasks. This allows a sampling-based approach that maximizes few-shot performance and provides the first probabilistic estimation of the few-shot budget for effective TL-based autotuning. We compare our generative TL approach with state-of-the-art autotuning techniques on several benchmarks. We find that the GC is capable of achieving 64.37% of peak few-shot performance in its first evaluation. Furthermore, the GC model can determine a few-shot transfer budget that yields up to 33.39$\times$ speedup, a dramatic improvement over the 20.58$\times$ speedup using prior techniques.
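
A hedged sketch of the generative step: fit a Gaussian copula to the top-performing prior configurations and sample new candidates from it. The top-10% cutoff and the toy data are assumptions, not the paper's protocol.

```python
import numpy as np
from scipy import stats

def copula_sample(X_top, n_samples, seed=0):
    """Fit a Gaussian copula to high-performing configurations and sample new ones."""
    rng = np.random.default_rng(seed)
    n, d = X_top.shape
    U = stats.rankdata(X_top, axis=0) / (n + 1)          # empirical CDF values
    corr = np.corrcoef(stats.norm.ppf(U), rowvar=False)  # dependence structure
    Z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    U_new = stats.norm.cdf(Z_new)
    # invert through each parameter's empirical quantiles
    return np.column_stack([np.quantile(X_top[:, j], U_new[:, j]) for j in range(d)])

# Usage: keep the best 10% of previously evaluated configurations, then sample.
rng = np.random.default_rng(1)
configs, runtime = rng.uniform(size=(300, 4)), rng.gamma(2.0, size=300)
X_top = configs[runtime <= np.quantile(runtime, 0.1)]
print(copula_sample(X_top, n_samples=5))
```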

RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation. (arXiv:2401.04679v1 [cs.CL])

Authors: Mahdi Nikdan, Soroush Tabesh, Dan Alistarh

We investigate parameter-efficient fine-tuning (PEFT) methods that can provide good accuracy under limited computational and memory budgets in the context of large language models (LLMs). We present a new PEFT method called Robust Adaptation (RoSA) inspired by robust principal component analysis (PCA) that jointly trains $\textit{low-rank}$ and $\textit{highly-sparse}$ components on top of a set of fixed pretrained weights to efficiently approximate the performance of a full-fine-tuning (FFT) solution. Across a series of challenging generative tasks such as grade-school math and SQL query generation, which require fine-tuning for good performance, we show that RoSA outperforms both LoRA and pure sparse fine-tuning, at the same parameter budget. We provide system support for RoSA to complement the training algorithm, specifically in the form of sparse GPU kernels which enable memory- and computationally-efficient training. Our code will be made available at https://github.com/IST-DASLab/RoSA.
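
A rough PyTorch sketch of the adapter shape implied by the abstract: a frozen base weight plus a trainable low-rank term and a trainable sparse term on a fixed support. The random support and sizes are assumptions of this sketch; the paper's support selection and kernels are more sophisticated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoSAStyleLinear(nn.Module):
    """Frozen base weight + trainable low-rank (B @ A) and sparse (S on a mask) terms."""

    def __init__(self, base: nn.Linear, rank=8, density=0.01):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # pretrained weights stay fixed
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        self.S = nn.Parameter(torch.zeros(out_f, in_f))  # dense here for simplicity;
        # fixed random support for the sparse component (an assumption; real
        # implementations store and update only the masked entries)
        self.register_buffer("mask", (torch.rand(out_f, in_f) < density).float())

    def forward(self, x):
        w = self.base.weight + self.B @ self.A + self.S * self.mask
        return F.linear(x, w, self.base.bias)

layer = RoSAStyleLinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)                # torch.Size([2, 512])
```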

Mixture of multilayer stochastic block models for multiview clustering. (arXiv:2401.04682v1 [cs.LG])

Authors: Kylliann De Santiago, Marie Szafranski, Christophe Ambroise

In this work, we propose an original method for aggregating multiple clusterings coming from different sources of information. Each partition is encoded by a co-membership matrix between observations. Our approach uses a mixture of multilayer Stochastic Block Models (SBM) to group co-membership matrices with similar information into components and to partition observations into different clusters, taking into account their specificities within the components. The identifiability of the model parameters is established, and a variational Bayesian EM algorithm is proposed for the estimation of these parameters. The Bayesian framework allows for selecting an optimal number of clusters and components. The proposed approach is compared, using synthetic data, with consensus clustering and tensor-based algorithms for community detection in large-scale complex networks. Finally, the method is utilized to analyze global food trading networks, leading to structures of interest.

AI-based Mapping of the Conservation Status of Orchid Assemblages at Global Scale. (arXiv:2401.04691v1 [cs.LG])

Authors: Joaquim Estopinan, Maximilien Servajean, Pierre Bonnet, Alexis Joly, François Munoz

Although increasing threats on biodiversity are now widely recognised, there are no accurate global maps showing whether and where species assemblages are at risk. We hereby assess and map at kilometre resolution the conservation status of the iconic orchid family, and discuss the insights conveyed at multiple scales. We introduce a new Deep Species Distribution Model trained on 1M occurrences of 14K orchid species to predict their assemblages at global scale and at kilometre resolution. We propose two main indicators of the conservation status of the assemblages: (i) the proportion of threatened species, and (ii) the status of the most threatened species in the assemblage. We show and analyze the variation of these indicators at the global scale and in relation to currently protected areas on Sumatra island. Global and interactive maps available online show the indicators of conservation status of orchid assemblages, with sharp spatial variations at all scales. The highest level of threat is found in Madagascar and the neighbouring islands. In Sumatra, we found good correspondence of protected areas with our indicators, but supplementing current IUCN assessments with status predictions results in alarming levels of species threat across the island. Recent advances in deep learning enable reliable mapping of the conservation status of species assemblages on a global scale. As an umbrella taxon, the orchid family provides a reference for identifying vulnerable ecosystems worldwide, and for prioritising conservation actions at both international and local levels.

U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation. (arXiv:2401.04722v1 [eess.IV])

Authors: Jun Ma, Feifei Li, Bo Wang

Convolutional Neural Networks (CNNs) and Transformers have been the most popular architectures for biomedical image segmentation, but both of them have limited ability to handle long-range dependencies because of inherent locality or computational complexity. To address this challenge, we introduce U-Mamba, a general-purpose network for biomedical image segmentation. Inspired by the State Space Sequence Models (SSMs), a new family of deep sequence models known for their strong capability in handling long sequences, we design a hybrid CNN-SSM block that integrates the local feature extraction power of convolutional layers with the abilities of SSMs for capturing the long-range dependency. Moreover, U-Mamba enjoys a self-configuring mechanism, allowing it to automatically adapt to various datasets without manual intervention. We conduct extensive experiments on four diverse tasks, including the 3D abdominal organ segmentation in CT and MR images, instrument segmentation in endoscopy images, and cell segmentation in microscopy images. The results reveal that U-Mamba outperforms state-of-the-art CNN-based and Transformer-based segmentation networks across all tasks. This opens new avenues for efficient long-range dependency modeling in biomedical image analysis. The code, models, and data are publicly available at https://wanglab.ai/u-mamba.html.

On the Effect of Contextual Information on Human Delegation Behavior in Human-AI collaboration. (arXiv:2401.04729v1 [cs.HC])

Authors: Philipp Spitzer, Joshua Holstein, Patrick Hemmer, Michael Vössing, Niklas Kühl, Dominik Martin, Gerhard Satzger

The constantly increasing capabilities of artificial intelligence (AI) open new possibilities for human-AI collaboration. One promising approach to leverage existing complementary capabilities is allowing humans to delegate individual instances to the AI. However, enabling humans to delegate instances effectively requires them to assess both their own and the AI's capabilities in the context of the given task. In this work, we explore the effects of providing contextual information on human decisions to delegate instances to an AI. We find that providing participants with contextual information significantly improves the human-AI team performance. Additionally, we show that the delegation behavior changes significantly when participants receive varying types of contextual information. Overall, this research advances the understanding of human-AI interaction in human delegation and provides actionable insights for designing more effective collaborative systems.

Convergence of stochastic gradient descent schemes for Lojasiewicz-landscapes. (arXiv:2102.09385v3 [cs.LG] UPDATED)

Authors: Steffen Dereich, Sebastian Kassing

In this article, we consider convergence of stochastic gradient descent schemes (SGD), including momentum stochastic gradient descent (MSGD), under weak assumptions on the underlying landscape. More explicitly, we show that on the event that the SGD stays bounded we have convergence of the SGD if there is only a countable number of critical points or if the objective function satisfies Lojasiewicz-inequalities around all critical levels as all analytic functions do. In particular, we show that for neural networks with analytic activation function such as softplus, sigmoid and the hyperbolic tangent, SGD converges on the event of staying bounded, if the random variables modelling the signal and response in the training are compactly supported.

Privacy-Preserving Logistic Regression Training with A Faster Gradient Variant. (arXiv:2201.10838v5 [cs.CR] UPDATED)

Authors: John Chiang

Logistic regression training over encrypted data has been an attractive approach to addressing security concerns for years. In this paper, we propose a faster gradient variant called $\texttt{quadratic gradient}$ for privacy-preserving logistic regression training. The core of $\texttt{quadratic gradient}$ can be seen as an extension of the simplified fixed Hessian.

We enhance Nesterov's accelerated gradient (NAG) and the Adaptive Gradient Algorithm (Adagrad) respectively with $\texttt{quadratic gradient}$ and evaluate the enhanced algorithms on several datasets. Experiments show that the enhanced methods achieve state-of-the-art performance in convergence speed compared to the raw first-order gradient methods. We then adopt the enhanced NAG method to implement homomorphic logistic regression training, obtaining a comparable result in only $3$ iterations.

There is a promising chance that $\texttt{quadratic gradient}$ could be used to enhance other first-order gradient methods for general numerical optimization problems.

Weighted Isolation and Random Cut Forest Algorithms for Anomaly Detection. (arXiv:2202.01891v5 [cs.LG] UPDATED)

Authors: Sijin Yeom, Jae-Hun Jung

Random cut forest (RCF) algorithms have been developed for anomaly detection, particularly in time series data. The RCF algorithm is an improved version of the isolation forest (IF) algorithm. Unlike the IF algorithm, the RCF algorithm can determine whether real-time input contains an anomaly by inserting the input into the constructed tree network. Various RCF algorithms, including Robust RCF (RRCF), have been developed, where the cutting procedure is adaptively chosen probabilistically. The RRCF algorithm demonstrates better performance than the IF algorithm, as dimension cuts are decided based on the geometric range of the data, whereas the IF algorithm randomly chooses dimension cuts. However, the overall data structure is not considered in both IF and RRCF, given that split values are chosen randomly. In this paper, we propose new IF and RCF algorithms, referred to as the weighted IF (WIF) and weighted RCF (WRCF) algorithms, respectively. Their split values are determined by considering the density of the given data. To introduce the WIF and WRCF, we first present a new geometric measure, a density measure, which is crucial for constructing the WIF and WRCF. We provide various mathematical properties of the density measure, accompanied by theorems that support and validate our claims through numerical examples.

Molecule Generation for Drug Design: a Graph Learning Perspective. (arXiv:2202.09212v2 [cs.LG] UPDATED)

Authors: Nianzu Yang, Huaijin Wu, Kaipeng Zeng, Yang Li, Junchi Yan

Machine learning, particularly graph learning, is gaining increasing recognition for its transformative impact across various fields. One such promising application is in the realm of molecule design and discovery, notably within the pharmaceutical industry. Our survey offers a comprehensive overview of state-of-the-art methods in molecule design, particularly focusing on \emph{de novo} drug design, which incorporates (deep) graph learning techniques. We categorize these methods into three distinct groups: \emph{i)} \emph{all-at-once}, \emph{ii)} \emph{fragment-based}, and \emph{iii)} \emph{node-by-node}. Additionally, we introduce some key public datasets and outline the commonly used evaluation metrics for both the generation and optimization of molecules. In the end, we discuss the existing challenges in this field and suggest potential directions for future research.

Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition. (arXiv:2205.01982v5 [cs.RO] UPDATED)

Authors: Hamidreza Kasaei, Songsong Xiong

Service robots are integrating more and more into our daily lives to help us with various tasks. In such environments, robots frequently face new objects while working in the environment and need to learn them in an open-ended fashion. Furthermore, such robots must be able to recognize a wide range of object categories. In this paper, we present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem. In particular, we form ensemble methods based on deep representations and handcrafted 3D shape descriptors. To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly. The proposed model is suitable for open-ended learning scenarios where the number of 3D object categories is not fixed and can grow over time. We have performed extensive sets of experiments to assess the performance of the proposed approach in offline, and open-ended scenarios. For the evaluation purpose, in addition to real object datasets, we generate a large synthetic household objects dataset consisting of 27000 views of 90 objects. Experimental results demonstrate the effectiveness of the proposed method on online few-shot 3D object recognition tasks, as well as its superior performance over the state-of-the-art open-ended learning approaches. Furthermore, our results show that while ensemble learning is modestly beneficial in offline settings, it is significantly beneficial in lifelong few-shot learning situations. Additionally, we demonstrated the effectiveness of our approach in both simulated and real-robot settings, where the robot rapidly learned new categories from limited examples.

Diverse super-resolution with pretrained deep hierarchical VAEs. (arXiv:2205.10347v4 [cs.CV] UPDATED)

Authors: Jean Prost, Antoine Houdard, Andrés Almansa, Nicolas Papadakis

We investigate the problem of producing diverse solutions to an image super-resolution problem. From a probabilistic perspective, this can be done by sampling from the posterior distribution of an inverse problem, which requires the definition of a prior distribution on the high-resolution images. In this work, we propose to use a pretrained hierarchical variational autoencoder (HVAE) as a prior. We train a lightweight stochastic encoder to encode low-resolution images in the latent space of a pretrained HVAE. At inference, we combine the low-resolution encoder and the pretrained generative model to super-resolve an image. We demonstrate on the task of face super-resolution that our method provides an advantageous trade-off between the computational efficiency of conditional normalizing flow techniques and the sample quality of diffusion-based methods.
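
The inference pipeline is simple enough to sketch. Below, `ToyHVAEDecoder` is a toy stand-in for the frozen, pretrained hierarchical VAE (e.g., a VDVAE-style model), and only the lightweight stochastic low-resolution encoder would be trained; all module names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyHVAEDecoder(nn.Module):
    """Stand-in for the frozen, pretrained hierarchical VAE decoder."""
    def __init__(self, z_dim=32, out_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_pixels), nn.Sigmoid())
    def forward(self, z):
        return self.net(z)

class LowResEncoder(nn.Module):
    """Lightweight stochastic encoder q(z | x_lr): outputs mean and
    log-variance, so different z samples yield diverse HR solutions."""
    def __init__(self, in_pixels=16 * 16 * 3, z_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(),
                                      nn.Linear(in_pixels, 128), nn.ReLU())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)
    def forward(self, x_lr):
        h = self.backbone(x_lr)
        return self.mu(h), self.logvar(h)

def super_resolve(x_lr, encoder, decoder, n_samples=4):
    """Sample several plausible high-resolution reconstructions."""
    mu, logvar = encoder(x_lr)
    z = mu + (0.5 * logvar).exp() * torch.randn(n_samples, mu.shape[-1])
    return decoder(z)                     # (n_samples, 64*64*3) flat images

decoder = ToyHVAEDecoder()                # pretrained and frozen in the paper
encoder = LowResEncoder()                 # the only trained component
x_lr = torch.rand(1, 3, 16, 16)
print(super_resolve(x_lr, encoder, decoder).shape)  # torch.Size([4, 12288])
```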

On the Evolution of A.I. and Machine Learning: Towards a Meta-level Measuring and Understanding Impact, Influence, and Leadership at Premier A.I. Conferences. (arXiv:2205.13131v2 [cs.AI] UPDATED)

Authors: Rafael B. Audibert, Henrique Lemos, Pedro Avelar, Anderson R. Tavares, Luís C. Lamb

Artificial Intelligence is now recognized as a general-purpose technology with ample impact on human life. This work aims at understanding the evolution of AI and, in particular, machine learning from the perspective of researchers' contributions to the field. In order to do so, we present several measures allowing the analyses of AI and machine learning researchers' impact, influence, and leadership over the last decades. This work also contributes, to a certain extent, to shedding new light on the history and evolution of AI by exploring the dynamics of the field through papers published at the flagship AI and machine learning conferences since the first International Joint Conference on Artificial Intelligence (IJCAI) held in 1969. AI development and evolution have led to increasing research output, reflected in the number of articles published over the last sixty years. We construct comprehensive citation, collaboration, and paper-author datasets and compute corresponding centrality measures to carry out our analyses. These analyses allow a better understanding of how AI has reached its current state of affairs in research. Throughout the process, we correlate these datasets with the work of the ACM Turing Award winners and the so-called two AI winters the field has gone through. We also look at self-citation trends and new authors' behaviors. Finally, we present a novel way to infer the country of affiliation of a paper from its organization. Therefore, this work provides a deep analysis of Artificial Intelligence history from information gathered and analyzed from large technical venue datasets and suggests novel insights that can contribute to understanding and measuring AI's evolution.

Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis. (arXiv:2209.08891v3 [cs.CV] UPDATED)

Authors: Lukas Struppek, Dominik Hintersdorf, Felix Friedrich, Manuel Brack, Patrick Schramowski, Kristian Kersting

Models for text-to-image synthesis, such as DALL-E 2 and Stable Diffusion, have recently drawn a lot of interest from academia and the general public. These models are capable of producing high-quality images that depict a variety of concepts and styles when conditioned on textual descriptions. However, these models adopt cultural characteristics associated with specific Unicode scripts from their vast amount of training data, which may not be immediately apparent. We show that by simply inserting single non-Latin characters in a textual description, common models reflect cultural stereotypes and biases in their generated images. We analyze this behavior both qualitatively and quantitatively, and identify a model's text encoder as the root cause of the phenomenon. Additionally, malicious users or service providers may try to intentionally bias the image generation to create racist stereotypes by replacing Latin characters with similar-looking characters from non-Latin scripts, so-called homoglyphs. To mitigate such unnoticed script attacks, we propose a novel homoglyph unlearning method to fine-tune a text encoder, making it robust against homoglyph manipulations.

LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics. (arXiv:2209.14065v5 [cs.AR] UPDATED)

Authors: Zhiqiang Que, Hongxiang Fan, Marcus Loo, He Li, Michaela Blott, Maurizio Pierini, Alexander Tapper, Wayne Luk

This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors, delivering unprecedented low latency performance. Incorporating FPGA-based GNNs into particle detectors presents a unique challenge since it requires sub-microsecond latency to deploy the networks for online event selection with a data rate of hundreds of terabytes per second in the Level-1 triggers at the CERN Large Hadron Collider experiments. This paper proposes a novel outer-product based matrix multiplication approach, which is enhanced by exploiting the structured adjacency matrix and a column-major data layout. Moreover, a fusion step is introduced to further reduce the end-to-end design latency by eliminating unnecessary boundaries. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented which not only finds a design with a much better latency but also finds a high accuracy design under given latency constraints. To facilitate this, a customizable template for this low latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 9.0 times faster and achieves up to 13.1 times higher power efficiency than a GPU implementation. Compared to the previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy. The proposed LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.

Learning image representations for anomaly detection: application to discovery of histological alterations in drug development. (arXiv:2210.07675v7 [cs.CV] UPDATED)

Authors: Igor Zingman, Birgit Stierstorfer, Charlotte Lempp, Fabian Heinemann

We present a system for anomaly detection in histopathological images. In histology, normal samples are usually abundant, whereas anomalous (pathological) cases are scarce or not available. Under such settings, one-class classifiers trained on healthy data can detect out-of-distribution anomalous samples. Such approaches combined with pre-trained Convolutional Neural Network (CNN) representations of images were previously employed for anomaly detection (AD). However, pre-trained off-the-shelf CNN representations may not be sensitive to abnormal conditions in tissues, while natural variations of healthy tissue may result in distant representations. To adapt representations to relevant details in healthy tissue we propose training a CNN on an auxiliary task that discriminates healthy tissue of different species, organs, and staining reagents. Almost no additional labeling workload is required, since healthy samples come automatically with aforementioned labels. During training we enforce compact image representations with a center-loss term, which further improves representations for AD. The proposed system outperforms established AD methods on a published dataset of liver anomalies. Moreover, it provided comparable results to conventional methods specifically tailored for quantification of liver anomalies. We show that our approach can be used for toxicity assessment of candidate drugs at early development stages and thereby may reduce expensive late-stage drug attrition.

DyG2Vec: Efficient Representation Learning for Dynamic Graphs. (arXiv:2210.16906v3 [cs.LG] UPDATED)

Authors: Mohammad Ali Alomrani, Mahdi Biparva, Yingxue Zhang, Mark Coates

Temporal graph neural networks have shown promising results in learning inductive representations by automatically extracting temporal patterns. However, previous works often rely on complex memory modules or inefficient random walk methods to construct temporal representations. To address these limitations, we present an efficient yet effective attention-based encoder that leverages temporal edge encodings and window-based subgraph sampling to generate task-agnostic embeddings. Moreover, we propose a joint-embedding architecture using non-contrastive SSL to learn rich temporal embeddings without labels. Experimental results on 7 benchmark datasets indicate that on average, our model outperforms SoTA baselines on the future link prediction task by 4.23% for the transductive setting and 3.30% for the inductive setting while only requiring 5-10x less training/inference time. Lastly, different aspects of the proposed framework are investigated through experimental analysis and ablation studies. The code is publicly available at https://github.com/huawei-noah/noah-research/tree/master/graph_atlas.

Distribution Free Prediction Sets for Node Classification. (arXiv:2211.14555v3 [stat.ML] UPDATED)

Authors: Jase Clarkson

Graph Neural Networks (GNNs) are able to achieve high classification accuracy on many important real world datasets, but provide no rigorous notion of predictive uncertainty. Quantifying the confidence of GNN models is difficult due to the dependence between datapoints induced by the graph structure. We leverage recent advances in conformal prediction to construct prediction sets for node classification in inductive learning scenarios. We do this by taking an existing approach for conformal classification that relies on \textit{exchangeable} data and modifying it by appropriately weighting the conformal scores to reflect the network structure. We show through experiments on standard benchmark datasets using popular GNN models that our approach provides tighter and better calibrated prediction sets than a naive application of conformal prediction.
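
The modification to standard conformal classification can be sketched compactly: compute nonconformity scores on calibration nodes, weight them to reflect the network structure, and threshold with a weighted quantile. The weighting scheme below is a placeholder, not the paper's derivation; only the overall recipe is conveyed.

```python
import numpy as np

def weighted_quantile(scores, weights, q):
    """Quantile of `scores` under normalized `weights`; plain conformal
    prediction is recovered with uniform weights."""
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cdf = np.cumsum(w) / w.sum()
    return s[min(np.searchsorted(cdf, q), len(s) - 1)]

def prediction_set(probs_test, probs_cal, y_cal, weights_cal, alpha=0.1):
    """Conformal prediction set for one test node.

    probs_*: softmax outputs of a trained GNN; weights_cal: per-node
    weights reflecting how comparable each calibration node is to the
    test node (a placeholder for the paper's structure-aware weighting).
    """
    cal_scores = 1.0 - probs_cal[np.arange(len(y_cal)), y_cal]
    qhat = weighted_quantile(cal_scores, weights_cal, 1 - alpha)
    return np.where(1.0 - probs_test <= qhat)[0]  # labels kept in the set

rng = np.random.default_rng(1)
probs_cal = rng.dirichlet(np.ones(5), size=200)   # toy calibration outputs
y_cal = probs_cal.argmax(axis=1)                  # toy labels
weights = rng.uniform(0.5, 1.5, size=200)         # toy structure weights
print(prediction_set(probs_cal[0], probs_cal, y_cal, weights, alpha=0.1))
```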

General-Purpose In-Context Learning by Meta-Learning Transformers. (arXiv:2212.04458v2 [cs.LG] UPDATED)

Authors: Louis Kirsch, James Harrison, Jascha Sohl-Dickstein, Luke Metz

Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general-purpose in-context learning algorithms from scratch, using only black-box models with minimal inductive bias. Such a model takes in training data, and produces test-set predictions across a wide range of problems, without any explicit definition of an inference model, training loss, or optimization algorithm. In this paper we show that Transformers and other black-box models can be meta-trained to act as general-purpose in-context learners. We characterize transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks, and meta-optimization. We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size (memory) determining the next prediction, unlike standard models which are thought to be bottlenecked by parameter count. Finally, we propose practical interventions such as biasing the training distribution that improve the meta-training and meta-generalization of general-purpose in-context learning algorithms.

A Data-Driven Gaussian Process Filter for Electrocardiogram Denoising. (arXiv:2301.02607v2 [eess.SP] UPDATED)

Authors: Mircea Dumitru, Qiao Li, Erick Andres Perez Alday, Ali Bahrami Rad, Gari D. Clifford, Reza Sameni

Objective: Gaussian Process (GP)-based filters, which have been effectively used for various applications including electrocardiogram (ECG) filtering, can be computationally demanding, and the choice of their hyperparameters is typically ad hoc. Methods: We develop a data-driven GP filter to address both issues, using the notion of the ECG phase domain -- a time-warped representation of the ECG beats onto a fixed number of samples and aligned R-peaks, which is assumed to follow a Gaussian distribution. Under this assumption, the computation of the sample mean and covariance matrix is simplified, enabling an efficient implementation of the GP filter in a data-driven manner, with no ad hoc hyperparameters. The proposed filter is evaluated and compared with a state-of-the-art wavelet-based filter on the PhysioNet QT Database. The performance is evaluated by measuring the signal-to-noise ratio (SNR) improvement of the filter at SNR levels ranging from -5 to 30 dB, in 5 dB steps, using additive noise. For a clinical evaluation, the error between the estimated QT-intervals of the original and filtered signals is measured and compared with the benchmark filter. Results: It is shown that the proposed GP filter outperforms the benchmark filter for all the tested noise levels. It also outperforms the state-of-the-art filter in terms of QT-interval estimation error bias and variance. Conclusion: The proposed GP filter is a versatile technique for preprocessing the ECG in clinical and research applications, is applicable to ECG of arbitrary lengths and sampling frequencies, and provides confidence intervals for its performance.
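
Once beats are time-warped and R-peak-aligned, the filter reduces to a Gaussian posterior whose prior mean and covariance are the sample statistics of the beat matrix, which is easy to sketch. The toy beat template and noise levels below are illustrative, not taken from the paper.

```python
import numpy as np

def gp_denoise_beat(y, beats, noise_var):
    """Denoise one phase-domain ECG beat with a data-driven GP filter.

    beats: (n_beats, n_phase) matrix of time-warped, R-peak-aligned beats.
    The GP prior mean and covariance are their sample statistics, so no
    hyperparameters need tuning (a sketch of the paper's idea).
    """
    mu = beats.mean(axis=0)
    K = np.cov(beats, rowvar=False)            # sample covariance, (n, n)
    n = len(mu)
    # Posterior mean of the clean beat given y = beat + white noise:
    return mu + K @ np.linalg.solve(K + noise_var * np.eye(n), y - mu)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)
clean = np.exp(-((t - 0.5) ** 2) / 0.002)      # crude R-wave template
beats = clean + 0.05 * rng.standard_normal((50, 100))
noisy = clean + 0.3 * rng.standard_normal(100)
denoised = gp_denoise_beat(noisy, beats, noise_var=0.3 ** 2)
print(np.linalg.norm(noisy - clean), np.linalg.norm(denoised - clean))
```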

Two-Stage Constrained Actor-Critic for Short Video Recommendation. (arXiv:2302.01680v3 [cs.LG] UPDATED)

Authors: Qingpeng Cai, Zhenghai Xue, Chi Zhang, Wanqi Xue, Shuchang Liu, Ruohan Zhan, Xueliang Wang, Tianyou Zuo, Wentao Xie, Dong Zheng, Peng Jiang, Kun Gai

The wide popularity of short videos on social media poses new opportunities and challenges for optimizing recommender systems on video-sharing platforms. Users sequentially interact with the system and provide complex and multi-faceted responses, including watch time and various types of interactions with multiple videos. On the one hand, the platform aims to optimize users' cumulative watch time (the main goal) in the long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platform also needs to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals), such as likes, follows, and shares. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms cannot work well in this setting. We propose a novel two-stage constrained actor-critic method: At stage one, we learn individual policies to optimize each auxiliary signal. At stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate the effectiveness of our method over alternatives in both optimizing the main goal and balancing the others. We further show the advantage of our method in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of both watch time and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.

Non-separable Covariance Kernels for Spatiotemporal Gaussian Processes based on a Hybrid Spectral Method and the Harmonic Oscillator. (arXiv:2302.09580v3 [stat.ML] UPDATED)

Authors: Dionissios T. Hristopulos

Gaussian processes provide a flexible, non-parametric framework for the approximation of functions in high-dimensional spaces. The covariance kernel is the main engine of Gaussian processes, incorporating correlations that underpin the predictive distribution. For applications with spatiotemporal datasets, suitable kernels should model joint spatial and temporal dependence. Separable space-time covariance kernels offer simplicity and computational efficiency. However, non-separable kernels include space-time interactions that better capture observed correlations. Most non-separable kernels that admit explicit expressions are based on mathematical considerations (admissibility conditions) rather than first-principles derivations. We present a hybrid spectral approach for generating covariance kernels which is based on physical arguments. We use this approach to derive a new class of physically motivated, non-separable covariance kernels which have their roots in the stochastic, linear, damped, harmonic oscillator (LDHO). The new kernels incorporate functions with both monotonic and oscillatory decay of space-time correlations. The LDHO covariance kernels involve space-time interactions which are introduced by dispersion relations that modulate the oscillator coefficients. We derive explicit relations for the spatiotemporal covariance kernels in the three oscillator regimes (underdamping, critical damping, overdamping) and investigate their properties. We further illustrate the hybrid spectral method by deriving covariance kernels that are based on the Ornstein-Uhlenbeck model.

Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning. (arXiv:2302.10639v2 [cs.AI] UPDATED)

Authors: Yuxiao Lu, Arunesh Sinha, Pradeep Varakantham

Safety in goal-directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories, which have demonstrated good performance primarily in short-horizon tasks. In this paper, we are specifically interested in the problem of solving temporally extended decision making problems, such as robots cleaning different areas in a house while avoiding slippery and unsafe areas (e.g., stairs) and retaining enough charge to move to a charging dock, in the presence of complex safety constraints. Our key contribution is a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper-level constrained search agent (which computes a reward-maximizing policy from a given start to a faraway goal state while satisfying cost constraints) with a low-level goal-conditioned RL agent (which estimates cost and reward values to move between nearby states). A major advantage of CoSHRL is that it can handle constraints on the cost value distribution (e.g., on Conditional Value at Risk, CVaR) and can adjust to flexible constraint thresholds without retraining. We perform extensive experiments with different types of safety constraints to demonstrate the utility of our approach over leading approaches in constrained and hierarchical RL.

Conformal Prediction Regions for Time Series using Linear Complementarity Programming. (arXiv:2304.01075v5 [eess.SY] UPDATED)

Authors: Matthew Cleaveland, Insup Lee, George J. Pappas, Lars Lindemann

Conformal prediction is a statistical tool for producing prediction regions of machine learning models that are valid with high probability. However, applying conformal prediction to time series data leads to conservative prediction regions. In fact, to obtain prediction regions over $T$ time steps with confidence $1-\delta$, previous works require that each individual prediction region be valid with confidence $1-\delta/T$. We propose an optimization-based method for reducing this conservatism to enable long horizon planning and verification when using learning-enabled time series predictors. Instead of considering prediction errors individually at each time step, we consider a parameterized prediction error over multiple time steps. By optimizing the parameters over an additional dataset, we find prediction regions that are not overly conservative. We show that this problem can be cast as a mixed integer linear complementarity program (MILCP), which we then relax into a linear complementarity program (LCP). Additionally, we prove that the relaxed LCP has the same optimal cost as the original MILCP. Finally, we demonstrate the efficacy of our method on case studies using pedestrian trajectory predictors and F16 fighter jet altitude predictors.
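
The calibration step can be sketched as follows: score each calibration trajectory with a parameterized worst-case error over the horizon and take a single quantile, instead of $T$ separate ones. Here the parameters `alphas` are fixed by hand; in the paper they are optimized via the LCP relaxation.

```python
import numpy as np

def calibrate_region(errors, alphas, delta=0.1):
    """Calibrate one joint prediction region over a T-step horizon.

    errors: (n, T) absolute prediction errors on a calibration set;
    alphas: (T,) positive weights defining the parameterized score
    max_t alpha_t * |err_t|. Returns per-step region half-widths that
    are jointly valid with probability ~1 - delta.
    """
    scores = (errors * alphas).max(axis=1)        # one score per trajectory
    n = len(scores)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - delta)) / n)
    return q / alphas                             # region: |err_t| <= q/alpha_t

rng = np.random.default_rng(0)
T = 10
errors = np.abs(rng.standard_normal((500, T))) * np.arange(1, T + 1)
alphas = 1.0 / np.arange(1, T + 1)                # loosen the harder, later steps
print(calibrate_region(errors, alphas).round(2))
```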

Optimal rates of approximation by shallow ReLU$^k$ neural networks and applications to nonparametric regression. (arXiv:2304.01561v3 [stat.ML] UPDATED)

Authors: Yunfei Yang, Ding-Xuan Zhou

We study the approximation capacity of some variation spaces corresponding to shallow ReLU$^k$ neural networks. It is shown that sufficiently smooth functions are contained in these spaces with finite variation norms. For functions with less smoothness, the approximation rates in terms of the variation norm are established. Using these results, we are able to prove the optimal approximation rates in terms of the number of neurons for shallow ReLU$^k$ neural networks. It is also shown how these results can be used to derive approximation bounds for deep neural networks and convolutional neural networks (CNNs). As applications, we study convergence rates for nonparametric regression using three ReLU neural network models: shallow neural network, over-parameterized neural network, and CNN. In particular, we show that shallow neural networks can achieve the minimax optimal rates for learning H\"older functions, which complements recent results for deep neural networks. It is also proven that over-parameterized (deep or shallow) neural networks can achieve nearly optimal rates for nonparametric regression.
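
To make the model class concrete, here is a toy illustration of shallow ReLU$^k$ approximation with random inner weights and least-squares outer coefficients; it illustrates the setting only and is not the paper's rate-optimal construction.

```python
import numpy as np

def relu_k(x, k=2):
    """ReLU^k activation: max(0, x)^k, which is (k-1)-times continuously
    differentiable, unlike the plain ReLU (k=1)."""
    return np.maximum(x, 0.0) ** k

# Random inner weights, outer coefficients fitted by least squares to a
# Hölder-type target. Purely illustrative of the model class.
rng = np.random.default_rng(0)
W = rng.standard_normal(50)               # inner weights w_j
b = rng.uniform(-1, 1, 50)                # biases b_j
x = np.linspace(-1, 1, 400)
target = np.abs(x) ** 1.5                 # Hölder-smooth, not twice differentiable
features = relu_k(np.outer(x, W) + b)     # phi_j(x) = ReLU^2(w_j x + b_j)
a, *_ = np.linalg.lstsq(features, target, rcond=None)
print(f"sup-norm error with 50 neurons: {np.max(np.abs(features @ a - target)):.4f}")
```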

Cross-Class Feature Augmentation for Class Incremental Learning. (arXiv:2304.01899v3 [cs.CV] UPDATED)

Authors: Taehoon Kim, Jaeyoo Park, Bohyung Han

We propose a novel class incremental learning approach by incorporating a feature augmentation technique motivated by adversarial attacks. We employ a classifier learned in the past to complement training examples rather than simply play a role as a teacher for knowledge distillation towards subsequent models. The proposed approach has a unique perspective to utilize the previous knowledge in class incremental learning since it augments features of arbitrary target classes using examples in other classes via adversarial attacks on a previously learned classifier. By allowing the cross-class feature augmentations, each class in the old tasks conveniently populates samples in the feature space, which alleviates the collapse of the decision boundaries caused by sample deficiency for the previous tasks, especially when the number of stored exemplars is small. This idea can be easily incorporated into existing class incremental learning algorithms without any architecture modification. Extensive experiments on the standard benchmarks show that our method consistently outperforms existing class incremental learning methods by significant margins in various scenarios, especially under an environment with an extremely limited memory budget.
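
The augmentation step can be sketched as a targeted adversarial attack in feature space against the frozen classifier from a previous task. The step size, iteration count, and sign-gradient update below are illustrative choices, not the paper's exact attack.

```python
import torch
import torch.nn.functional as F

def cross_class_augment(features, old_classifier, target_class,
                        step=0.1, n_steps=5):
    """Push features of other-class examples toward `target_class` by
    iterated sign-gradient steps on a frozen, previously learned
    classifier (an FGSM-style sketch; hyperparameters are illustrative).
    """
    x = features.clone().detach().requires_grad_(True)
    target = torch.full((x.shape[0],), target_class, dtype=torch.long)
    for _ in range(n_steps):
        loss = F.cross_entropy(old_classifier(x), target)
        grad, = torch.autograd.grad(loss, x)
        x = (x - step * grad.sign()).detach().requires_grad_(True)
    return x.detach()

old_classifier = torch.nn.Linear(64, 10)  # stands in for the frozen past-task model
feats = torch.randn(8, 64)                # features of arbitrary source classes
augmented = cross_class_augment(feats, old_classifier, target_class=3)
print(F.softmax(old_classifier(augmented), dim=-1)[:, 3].mean())
```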

Auditing and Generating Synthetic Data with Controllable Trust Trade-offs. (arXiv:2304.10819v3 [cs.LG] UPDATED)

Authors: Brian Belgodere, Pierre Dognin, Adam Ivankay, Igor Melnyk, Youssef Mroueh, Aleksandra Mojsilovic, Jiri Navratil, Apoorva Nitsure, Inkit Padhi, Mattia Rigotti, Jerret Ross, Yair Schiff, Radhika Vedpathak, Richard A. Young

Real-world data often exhibits bias, imbalance, and privacy risks. Synthetic datasets have emerged to address these issues. This paradigm relies on generative AI models to generate unbiased, privacy-preserving data while maintaining fidelity to the original data. However, assessing the trustworthiness of synthetic datasets and models is a critical challenge. We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation. We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases like education, healthcare, banking, and human resources, spanning different data modalities such as tabular, time-series, vision, and natural language. This holistic assessment is essential for compliance with regulatory safeguards. We introduce a trustworthiness index to rank synthetic datasets based on their safeguards trade-offs. Furthermore, we present a trustworthiness-driven model selection and cross-validation process during training, exemplified with "TrustFormers" across various data types. This approach allows for controllable trustworthiness trade-offs in synthetic data creation. Our auditing framework fosters collaboration among stakeholders, including data scientists, governance experts, internal reviewers, external certifiers, and regulators. This transparent reporting should become a standard practice to prevent bias, discrimination, and privacy violations, ensuring compliance with policies and providing accountability, safety, and performance guarantees.

Multi-Source to Multi-Target Decentralized Federated Domain Adaptation. (arXiv:2304.12422v2 [cs.DC] UPDATED)

Authors: Su Wang, Seyyedali Hosseinalipour, Christopher G. Brinton

Heterogeneity across devices in federated learning (FL) typically refers to statistical (e.g., non-i.i.d. data distributions) and resource (e.g., communication bandwidth) dimensions. In this paper, we focus on another important dimension that has received less attention: varying quantities/distributions of labeled and unlabeled data across devices. In order to leverage all data, we develop a decentralized federated domain adaptation methodology which considers the transfer of ML models from devices with high quality labeled data (called sources) to devices with low quality or unlabeled data (called targets). Our methodology, Source-Target Determination and Link Formation (ST-LF), optimizes both (i) classification of devices into sources and targets and (ii) source-target link formation, in a manner that considers the trade-off between ML model accuracy and communication energy efficiency. To obtain a concrete objective function, we derive a measurable generalization error bound that accounts for estimates of source-target hypothesis deviations and divergences between data distributions. The resulting optimization problem is a mixed-integer signomial program, a class of NP-hard problems, for which we develop an algorithm based on successive convex approximations to solve it tractably. Subsequent numerical evaluations of ST-LF demonstrate that it improves classification accuracy and energy efficiency over state-of-the-art baselines.

FedNC: A Secure and Efficient Federated Learning Method with Network Coding. (arXiv:2305.03292v3 [cs.LG] UPDATED)

Authors: Yuchen Shi, Zheqi Zhu, Pingyi Fan, Khaled B. Letaief, Chenghui Peng

Federated Learning (FL) is a promising distributed learning mechanism which still faces two major challenges, namely privacy breaches and system efficiency. In this work, we reconceptualize the FL system from the perspective of network information theory, and formulate an original FL communication framework, FedNC, which is inspired by Network Coding (NC). The main idea of FedNC is mixing the information of the local models by making random linear combinations of the original parameters, before uploading for further aggregation. Due to the benefits of the coding scheme, both theoretical and experimental analysis indicate that FedNC improves the performance of traditional FL in several important ways, including security, efficiency, and robustness. To the best of our knowledge, this is the first framework where NC is introduced in FL. As FL continues to evolve within practical network frameworks, more variants can be further designed based on FedNC.

Variational Classification. (arXiv:2305.10406v5 [cs.LG] UPDATED)

Authors: Shehzaad Dhuliawala, Mrinmaya Sachan, Carl Allen

We present a latent variable model for classification that provides a novel probabilistic interpretation of neural network softmax classifiers. We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders, that generalises the softmax cross-entropy loss. Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency between their anticipated distribution, required for accurate label predictions, and their empirical distribution found in practice. We augment the variational objective to mitigate such inconsistency and induce a chosen latent distribution, instead of the implicit assumption found in a standard softmax layer. Overall, we provide new theoretical insight into the inner workings of widely-used softmax classifiers. Empirical evaluation on image and text classification datasets demonstrates that our proposed approach, variational classification, maintains classification accuracy while the reshaped latent space improves other desirable properties of a classifier, such as calibration, adversarial robustness, robustness to distribution shift and sample efficiency useful in low data settings.

Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design. (arXiv:2305.13035v5 [cs.CV] UPDATED)

Authors: Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer

Scaling laws have recently been employed to derive the compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, while also having less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA, and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path for more informed scaling.

Union Subgraph Neural Networks. (arXiv:2305.15747v3 [cs.LG] UPDATED)

Authors: Jiaxing Xu, Aihu Zhang, Qingtian Bian, Vijay Prakash Dwivedi, Yiping Ke

Graph Neural Networks (GNNs) are widely used for graph representation learning in many application domains. The expressiveness of vanilla GNNs is upper-bounded by 1-dimensional Weisfeiler-Leman (1-WL) test as they operate on rooted subtrees through iterative message passing. In this paper, we empower GNNs by injecting neighbor-connectivity information extracted from a new type of substructure. We first investigate different kinds of connectivities existing in a local neighborhood and identify a substructure called union subgraph, which is able to capture the complete picture of the 1-hop neighborhood of an edge. We then design a shortest-path-based substructure descriptor that possesses three nice properties and can effectively encode the high-order connectivities in union subgraphs. By infusing the encoded neighbor connectivities, we propose a novel model, namely Union Subgraph Neural Network (UnionSNN), which is proven to be strictly more powerful than 1-WL in distinguishing non-isomorphic graphs. Additionally, the local encoding from union subgraphs can also be injected into arbitrary message-passing neural networks (MPNNs) and Transformer-based models as a plugin. Extensive experiments on 18 benchmarks of both graph-level and node-level tasks demonstrate that UnionSNN outperforms state-of-the-art baseline models, with competitive computational efficiency. The injection of our local encoding to existing models is able to boost the performance by up to 11.09%. Our code is available at https://github.com/AngusMonroe/UnionSNN.
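
Reading the union subgraph of an edge as the subgraph induced by the closed 1-hop neighborhoods of both endpoints, extraction is only a few lines. This is our reading of the definition from the abstract, sketched over a simple adjacency-set representation.

```python
def union_subgraph(adj, u, v):
    """Nodes and edges of the union subgraph of edge (u, v): the subgraph
    induced by the closed 1-hop neighborhoods of both endpoints.
    adj maps each node to the set of its neighbors.
    """
    nodes = adj[u] | adj[v] | {u, v}
    edges = {(a, b) for a in nodes for b in adj[a] if b in nodes and a < b}
    return nodes, edges

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4}, 4: {3}}
print(union_subgraph(adj, 0, 1))  # full 1-hop picture around edge (0, 1)
```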

Clarify Confused Nodes Through Separated Learning. (arXiv:2306.02285v2 [cs.LG] UPDATED)

Authors: Jiajun Zhou, Shengbo Gong, Chenxuan Xie, Shanqing Yu, Qi Xuan, Xiaoniu Yang

Graph neural networks (GNNs) have achieved remarkable advances in graph-oriented tasks. However, real-world graphs invariably contain a certain proportion of heterophilous nodes, challenging the homophily assumption of classical GNNs and hindering their performance. Most existing studies continue to design generic models with shared weights between heterophilous and homophilous nodes. Despite the incorporation of high-order messages or multi-channel architectures, these efforts often fall short. A minority of studies attempt to train different node groups separately but suffer from inappropriate separation metrics and low efficiency. In this paper, we first propose a new metric, termed Neighborhood Confusion (NC), to facilitate a more reliable separation of nodes. We observe that node groups with different levels of NC values exhibit certain differences in intra-group accuracy and visualized embeddings. These pave the way for Neighborhood Confusion-guided Graph Convolutional Network (NCGCN), in which nodes are grouped by their NC values and accept intra-group weight sharing and message passing. Extensive experiments on both homophilous and heterophilous benchmarks demonstrate that our framework can effectively separate nodes and yield significant performance improvement compared to the latest methods. The source code will be released soon.
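
The abstract does not spell out the NC formula, so the sketch below uses a simple proxy, one minus the dominant label fraction in a node's closed neighborhood, purely to convey what a confusion score of this kind measures; the paper's actual metric may differ.

```python
from collections import Counter

def neighborhood_confusion(adj, labels, v):
    """Proxy confusion score for node v: one minus the fraction of the
    closed neighborhood carrying the dominant label (0 = homophilous
    neighborhood, values near 1 = heavily mixed)."""
    neigh = list(adj[v]) + [v]
    dominant = Counter(labels[u] for u in neigh).most_common(1)[0][1]
    return 1.0 - dominant / len(neigh)

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
labels = {0: "a", 1: "a", 2: "a", 3: "b"}
print(neighborhood_confusion(adj, labels, 2))  # 0.25: one dissenting neighbor
```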

Wind Noise Reduction with a Diffusion-based Stochastic Regeneration Model. (arXiv:2306.12867v2 [eess.AS] UPDATED)

Authors: Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann

In this paper we present a method for single-channel wind noise reduction using our previously proposed diffusion-based stochastic regeneration model combining predictive and generative modelling. We introduce a non-additive speech in noise model to account for the non-linear deformation of the membrane caused by the wind flow and possible clipping. We show that our stochastic regeneration model outperforms other neural-network-based wind noise reduction methods as well as purely predictive and generative models, on a dataset using simulated and real-recorded wind noise. We further show that the proposed method generalizes well by testing on an unseen dataset with real-recorded wind noise. Audio samples, data generation scripts and code for the proposed methods can be found online (https://uhh.de/inf-sp-storm-wind).

Multigrid-Augmented Deep Learning Preconditioners for the Helmholtz Equation using Compact Implicit Layers. (arXiv:2306.17486v2 [cs.LG] UPDATED)

Authors: Bar Lerer, Ido Ben-Yair, Eran Treister

We present a deep learning-based iterative approach to solve the discrete heterogeneous Helmholtz equation for high wavenumbers. Combining classical iterative multigrid solvers and convolutional neural networks (CNNs) via preconditioning, we obtain a learned neural solver that is faster and scales better than a standard multigrid solver. Our approach offers three main contributions over previous neural methods of this kind. First, we construct a multilevel U-Net-like encoder-solver CNN with an implicit layer on the coarsest grid of the U-Net, where convolution kernels are inverted. This alleviates the field of view problem in CNNs and allows better scalability. Second, we improve upon the previous CNN preconditioner in terms of the number of parameters, computation time, and convergence rates. Third, we propose a multiscale training approach that enables the network to scale to problems of previously unseen dimensions while still maintaining a reasonable training procedure. Our encoder-solver architecture can be used to generalize over different slowness models of various difficulties and is efficient at solving for many right-hand sides per slowness model. We demonstrate the benefits of our novel architecture with numerical experiments on a variety of heterogeneous two-dimensional problems at high wavenumbers.

Online Laplace Model Selection Revisited. (arXiv:2307.06093v2 [cs.LG] UPDATED)

Authors: Jihao Andreas Lin, Javier Antorán, José Miguel Hernández-Lobato

The Laplace approximation provides a closed-form model selection objective for neural networks (NN). Online variants, which optimise NN parameters jointly with hyperparameters, like weight decay strength, have seen renewed interest in the Bayesian deep learning community. However, these methods violate Laplace's method's critical assumption that the approximation is performed around a mode of the loss, calling into question their soundness. This work re-derives online Laplace methods, showing them to target a variational bound on a mode-corrected variant of the Laplace evidence which does not make stationarity assumptions. Online Laplace and its mode-corrected counterpart share stationary points where 1. the NN parameters are a maximum a posteriori, satisfying the Laplace method's assumption, and 2. the hyperparameters maximise the Laplace evidence, motivating online methods. We demonstrate that these optima are roughly attained in practise by online algorithms using full-batch gradient descent on UCI regression datasets. The optimised hyperparameters prevent overfitting and outperform validation-based early stopping.

Reinforcement Learning for Photonic Component Design. (arXiv:2307.11075v2 [physics.optics] UPDATED)

Authors: Donald Witt, Jeff Young, Lukas Chrostowski

We present a new fab-in-the-loop reinforcement learning algorithm for the design of nano-photonic components that accounts for the imperfections present in nanofabrication processes. As a demonstration of the potential of this technique, we apply it to the design of photonic crystal grating couplers fabricated on an air clad 220 nm silicon on insulator single etch platform. This fab-in-the-loop algorithm improves the insertion loss from 8.8 to 3.24 dB. The widest bandwidth designs produced using our fab-in-the-loop algorithm can cover a 150 nm bandwidth with less than 10.2 dB of loss at their lowest point.

Attention to Entropic Communication. (arXiv:2307.11423v2 [cs.IT] UPDATED)

Authors: Torsten Enßlin, Carolin Weidinger, Philipp Frank

The concept of attention, numerical weights that emphasize the importance of particular data, has proven to be very relevant in artificial intelligence. Relative entropy (RE, aka Kullback-Leibler divergence) plays a central role in communication theory. Here we combine these concepts, attention and RE. RE guides optimal encoding of messages in bandwidth-limited communication as well as optimal message decoding via the maximum entropy principle (MEP). In the coding scenario, RE can be derived from four requirements, namely being analytical, local, proper, and calibrated. Weighted RE, used for attention steering in communications, turns out to be improper. To see how proper attention communication can emerge, we analyze a scenario of a message sender who wants to ensure that the receiver of the message can perform well-informed actions. If the receiver decodes the message using the MEP, the sender only needs to know the receiver's utility function to inform optimally, but not the receiver's initial knowledge state. In case only the curvature of the utility function maxima are known, it becomes desirable to accurately communicate an attention function, in this case a by this curvature weighted and re-normalized probability function. Entropic attention communication is here proposed as the desired generalization of entropic communication that permits weighting while being proper, thereby aiding the design of optimal communication protocols in technical applications and helping to understand human communication. For example, our analysis shows how to derive the level of cooperation expected under misaligned interests of otherwise honest communication partners.

Dynamic algorithms for k-center on graphs. (arXiv:2307.15557v2 [cs.DS] UPDATED)

Authors: Emilio Cruciani, Sebastian Forster, Gramoz Goranci, Yasamin Nazari, Antonis Skarlatos

In this paper we give the first efficient algorithms for the $k$-center problem on dynamic graphs undergoing edge updates. In this problem, the goal is to partition the input into $k$ sets by choosing $k$ centers such that the maximum distance from any data point to its closest center is minimized. It is known that it is NP-hard to get a better than $2$ approximation for this problem.

While in many applications the input may naturally be modeled as a graph, all prior works on the $k$-center problem in dynamic settings consider point sets in arbitrary metric spaces. In this paper, we give a deterministic decremental $(2+\epsilon)$-approximation algorithm and a randomized incremental $(4+\epsilon)$-approximation algorithm, both with amortized update time $kn^{o(1)}$ for weighted graphs. Moreover, we show a reduction that leads to a fully dynamic $(2+\epsilon)$-approximation algorithm for the $k$-center problem, with worst-case update time that is within a factor $k$ of the state-of-the-art fully dynamic $(1+\epsilon)$-approximation single-source shortest paths algorithm in graphs. Matching this bound is a natural goalpost because the approximate distances of each vertex to its center can be used to maintain a $(2+\epsilon)$-approximation of the graph diameter, and the fastest known algorithms for such a diameter approximation also rely on maintaining approximate single-source distances.
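
For context, the classical static baseline behind these guarantees is Gonzalez's farthest-first traversal, which achieves the (tight) factor-2 approximation. The paper's contribution is maintaining such solutions under edge updates, which this static sketch does not attempt.

```python
import numpy as np

def gonzalez_k_center(dist, k, rng=None):
    """Farthest-first traversal: a 2-approximation for static k-center.

    dist: (n, n) matrix of pairwise (graph) distances. Returns the chosen
    centers and the resulting radius (max distance to a nearest center).
    """
    rng = np.random.default_rng() if rng is None else rng
    centers = [int(rng.integers(dist.shape[0]))]
    d_near = dist[centers[0]].copy()             # distance to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(d_near))             # farthest point so far
        centers.append(nxt)
        d_near = np.minimum(d_near, dist[nxt])
    return centers, float(d_near.max())

rng = np.random.default_rng(0)
pts = rng.random((50, 2))
dist = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
print(gonzalez_k_center(dist, k=4, rng=rng))
```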

Price-Discrimination Game for Distributed Resource Management in Federated Learning. (arXiv:2308.13838v3 [cs.LG] UPDATED)

Authors: Han Zhang, Halvin Yang, Guopeng Zhang

In vanilla federated learning (FL) such as FedAvg, the parameter server (PS) and multiple distributed clients can form a typical buyer's market, where the number of PS/buyers of FL services is far less than the number of clients/sellers. In order to improve the performance of FL and reduce the cost of motivating clients to participate in FL, this paper proposes to differentiate the pricing for services provided by different clients rather than applying uniform pricing across clients. The price is differentiated based on the performance improvements each client brings to FL and on their heterogeneous computing and communication capabilities. To this end, a price-discrimination game (PDG) is formulated to comprehensively address the distributed resource management problems in FL, including multi-objective trade-off, client selection, and incentive mechanism design. As the PDG is a mixed-integer nonlinear programming (MINLP) problem, a distributed semi-heuristic algorithm with low computational complexity and low communication overhead is designed to solve it. The simulation results verify the effectiveness of the proposed approach.

s-ID: Causal Effect Identification in a Sub-Population. (arXiv:2309.02281v2 [cs.LG] UPDATED)

Authors: Amir Mohammad Abouei, Ehsan Mokhtarian, Negar Kiyavash

Causal inference in a sub-population involves identifying the causal effect of an intervention on a specific subgroup, which is distinguished from the whole population through the influence of systematic biases in the sampling process. However, ignoring the subtleties introduced by sub-populations can either lead to erroneous inference or limit the applicability of existing methods. We introduce and advocate for a causal inference problem in sub-populations (henceforth called s-ID), in which we merely have access to observational data of the targeted sub-population (as opposed to the entire population). Existing inference problems in sub-populations operate on the premise that the given data distributions originate from the entire population, thus, cannot tackle the s-ID problem. To address this gap, we provide necessary and sufficient conditions that must hold in the causal graph for a causal effect in a sub-population to be identifiable from the observational distribution of that sub-population. Given these conditions, we present a sound and complete algorithm for the s-ID problem.

Long-term drought prediction using deep neural networks based on geospatial weather data. (arXiv:2309.06212v3 [cs.LG] UPDATED)

Authors: Vsevolod Grabar, Alexander Marusov, Yury Maximov, Nazar Sotiriadi, Alexander Bulkin, Alexey Zaytsev

The problem of high-quality drought forecasting up to a year in advance is critical for agriculture planning and insurance. Yet, it remains unsolved with reasonable accuracy due to data complexity and the stochastic nature of aridity. We tackle drought data by introducing an end-to-end approach that adopts a spatio-temporal neural network model with accessible open monthly climate data as the input.

Our systematic research employs diverse proposed models and five distinct environmental regions as a testbed to evaluate the efficacy of Palmer Drought Severity Index (PDSI) prediction. A key aggregated finding is the exceptional performance of a Transformer model, EarthFormer, in making accurate short-term (up to six months) forecasts, while the Convolutional LSTM excels in longer-term forecasting. The models achieved ROC AUC scores of 0.948 for one-month-ahead and 0.617 for twelve-month-ahead forecasts.

FedDCSR: Federated Cross-domain Sequential Recommendation via Disentangled Representation Learning. (arXiv:2309.08420v6 [cs.LG] UPDATED)

Authors: Hongyu Zhang, Dongyi Zheng, Xu Yang, Jiyuan Feng, Qing Liao

Cross-domain Sequential Recommendation (CSR), which leverages user sequence data from multiple domains, has received extensive attention in recent years. However, existing CSR methods require sharing original user data across domains, which violates the General Data Protection Regulation (GDPR). Thus, it is necessary to combine federated learning (FL) and CSR to fully utilize knowledge from different domains while preserving data privacy. Nonetheless, the sequence feature heterogeneity across different domains significantly impacts the overall performance of FL. In this paper, we propose FedDCSR, a novel federated cross-domain sequential recommendation framework via disentangled representation learning. Specifically, to address the sequence feature heterogeneity across domains, we introduce an approach called inter-intra domain sequence representation disentanglement (SRD) to disentangle user sequence features into domain-shared and domain-exclusive features. In addition, we design an intra domain contrastive infomax (CIM) strategy to learn richer domain-exclusive features of users by performing data augmentation on user sequences. Extensive experiments on three real-world scenarios demonstrate that FedDCSR achieves significant improvements over existing baselines.

Understanding Deep Gradient Leakage via Inversion Influence Functions. (arXiv:2309.13016v3 [cs.LG] UPDATED)

Authors: Haobo Zhang, Junyuan Hong, Yuyang Deng, Mehrdad Mahdavi, Jiayu Zhou

Deep Gradient Leakage (DGL) is a highly effective attack that recovers private training images from gradient vectors. This attack casts significant privacy challenges on distributed learning from clients with sensitive data, where clients are required to share gradients. Defending against such attacks requires but lacks an understanding of when and how privacy leakage happens, mostly because of the black-box nature of deep networks. In this paper, we propose a novel Inversion Influence Function (I$^2$F) that establishes a closed-form connection between the recovered images and the private gradients by implicitly solving the DGL problem. Compared to directly solving DGL, I$^2$F is scalable for analyzing deep networks, requiring only oracle access to gradients and Jacobian-vector products. We empirically demonstrate that I$^2$F effectively approximates DGL across different model architectures, datasets, modalities, attack implementations, and perturbation-based defenses. With this novel tool, we provide insights into effective gradient perturbation directions, the unfairness of privacy protection, and privacy-preferred model initialization. Our code is provided at https://github.com/illidanlab/inversion-influence-function.

BiSinger: Bilingual Singing Voice Synthesis. (arXiv:2309.14089v3 [eess.AS] UPDATED)

Authors: Huali Zhou, Yueqian Lin, Yao Shi, Peng Sun, Ming Li

Although Singing Voice Synthesis (SVS) has made great strides with Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains relatively unexplored. This paper presents BiSinger, a bilingual pop SVS system for English and Chinese Mandarin. Current systems require separate models per language and cannot accurately represent both Chinese and English, hindering code-switch SVS. To address this gap, we design a shared representation between Chinese and English singing voices, achieved by using the CMU dictionary with mapping rules. We fuse monolingual singing datasets with open-source singing voice conversion techniques to generate bilingual singing voices while also exploring the potential use of bilingual speech data. Experiments affirm that our language-independent representation and incorporation of related datasets enable a single model with enhanced performance in English and code-switch SVS while maintaining Chinese song performance. Audio samples are available at https://bisinger-svs.github.io.

A Comparative Study of Filters and Deep Learning Models to predict Diabetic Retinopathy. (arXiv:2309.15216v3 [cs.LG] UPDATED)

Authors: Roshan Vasu Muddaluru, Sharvaani Ravikumar Thoguluva, Shruti Prabha, Tanuja Konda Reddy, Dr. Suja Palaniswamy

The retina is an essential component of the visual system, and maintaining eyesight depends on the timely and accurate detection of disorders. The primary goal of this work is the early-stage detection and severity classification of Diabetic Retinopathy (DR), a significant risk to public health. This study compares the outcomes of various deep learning models, including InceptionNetV3, DenseNet121, and other CNN-based models, utilizing a variety of image filters, including Gaussian, grayscale, and Gabor. These models can detect subtle pathological alterations and use that information to estimate the risk of retinal illnesses. The objective is to improve the diagnostic process for DR, the primary cause of diabetes-related blindness, by utilizing deep learning models. A comparative analysis of the grayscale, Gaussian, and Gabor filters is provided after applying them to the retinal images. The Gaussian filter was identified as the most promising, yielding 96% accuracy with InceptionNetV3.
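
The preprocessing stage can be sketched as below; the filter parameters (Gaussian sigma, Gabor frequency and orientation) are illustrative assumptions rather than the study's settings, and a real Gabor bank would sweep several frequencies and orientations.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def preprocess(img, mode="gaussian"):
    """Apply one of the compared preprocessing filters to an RGB image."""
    if mode == "grayscale":
        return img.mean(axis=-1)                          # simple luminance
    if mode == "gaussian":
        return gaussian_filter(img, sigma=(1.5, 1.5, 0))  # smooth H, W only
    if mode == "gabor":
        # One horizontal Gabor kernel; a real bank varies frequency/angle.
        y, x = np.mgrid[-7:8, -7:8]
        kernel = np.exp(-(x**2 + y**2) / 18.0) * np.cos(2 * np.pi * x / 4.0)
        return np.stack([convolve(img[..., c], kernel) for c in range(3)], -1)
    raise ValueError(f"unknown mode: {mode}")

img = np.random.rand(64, 64, 3)                           # stand-in fundus image
for mode in ("grayscale", "gaussian", "gabor"):
    print(mode, preprocess(img, mode).shape)
```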

Homotopy Relaxation Training Algorithms for Infinite-Width Two-Layer ReLU Neural Networks. (arXiv:2309.15244v2 [cs.LG] UPDATED)

Authors: Yahong Yang, Qipin Chen, Wenrui Hao

In this paper, we present a novel training approach called the Homotopy Relaxation Training Algorithm (HRTA), aimed at accelerating the training process in contrast to traditional methods. Our algorithm incorporates two key mechanisms: one involves building a homotopy activation function that seamlessly connects the linear activation function with the ReLU activation function; the other entails relaxing the homotopy parameter to enhance the training refinement process. We have conducted an in-depth analysis of this novel method within the context of the neural tangent kernel (NTK), revealing significantly improved convergence rates. Our experimental results, especially when considering networks with larger widths, validate the theoretical conclusions. The proposed HRTA also shows potential for other activation functions and deep neural networks.

Functional Geometry Guided Protein Sequence and Backbone Structure Co-Design. (arXiv:2310.04343v3 [cs.LG] UPDATED)

Authors: Zhenqiao Song, Yunlong Zhao, Wenxian Shi, Yang Yang, Lei Li

Proteins are macromolecules responsible for essential functions in almost all living organisms. Designing reasonable proteins with desired functions is crucial. A protein's sequence and structure are strongly correlated, and together they determine its function. In this paper, we propose NAEPro, a model to jointly design protein sequence and structure based on automatically detected functional sites. NAEPro is powered by an interleaving network of attention and equivariant layers, which can capture global correlation in a whole sequence and local influence from nearest amino acids in three-dimensional (3D) space. Such an architecture facilitates effective yet economic message passing at two levels. We evaluate our model and several strong baselines on two protein datasets, $\beta$-lactamase and myoglobin. Experimental results show that our model consistently achieves the highest amino acid recovery rate, TM-score, and the lowest RMSD among all competitors. These findings prove the capability of our model to design protein sequences and structures that closely resemble their natural counterparts. Furthermore, in-depth analysis confirms our model's ability to generate highly effective proteins capable of binding to their target metallocofactors. We provide code, data, and models on GitHub.

FABind: Fast and Accurate Protein-Ligand Binding. (arXiv:2310.06763v5 [cs.LG] UPDATED)

Authors: Qizhi Pei, Kaiyuan Gao, Lijun Wu, Jinhua Zhu, Yingce Xia, Shufang Xie, Tao Qin, Kun He, Tie-Yan Liu, Rui Yan

Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose $\mathbf{FABind}$, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. $\mathbf{FABind}$ incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed $\mathbf{FABind}$ demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at https://github.com/QizhiPei/FABind.

Risk Assessment and Statistical Significance in the Age of Foundation Models. (arXiv:2310.07132v2 [cs.LG] UPDATED)

Authors: Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jerret Ross

We propose a distributional framework for assessing socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative testing based on first and second order stochastic dominance of real random variables. We show that the second order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a metrics portfolio for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. The statistical significance of our tests is backed theoretically by an asymptotic analysis via central limit theorems instantiated in practice via a bootstrap variance estimate. We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content.
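
The backbone test statistic, empirical first- and second-order stochastic dominance between two models' metric portfolios, can be sketched as follows. The paper's machinery for statistical significance (central limit theorems and bootstrap variance estimates) is omitted here.

```python
import numpy as np

def dominates(a, b, order=1, grid_size=512):
    """Empirical stochastic dominance of sample a over b (higher = better).

    order=1 checks F_a <= F_b pointwise; order=2 compares integrated CDFs,
    which links to the mean-risk models mentioned in the abstract. This is
    the raw statistic only, without the paper's significance machinery.
    """
    grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), grid_size)
    Fa = (a[None, :] <= grid[:, None]).mean(axis=1)   # empirical CDF of a
    Fb = (b[None, :] <= grid[:, None]).mean(axis=1)
    if order == 2:
        dx = grid[1] - grid[0]
        Fa, Fb = np.cumsum(Fa) * dx, np.cumsum(Fb) * dx
    return bool(np.all(Fa <= Fb + 1e-12))

rng = np.random.default_rng(0)
model_a = rng.normal(0.8, 0.1, 1000)   # e.g. aggregated metric-portfolio scores
model_b = rng.normal(0.6, 0.1, 1000)
print(dominates(model_a, model_b, order=1), dominates(model_a, model_b, order=2))
```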

CORN: Co-Trained Full- And No-Reference Speech Quality Assessment. (arXiv:2310.09388v2 [eess.AS] UPDATED)

Authors: Pranay Manocha, Donald Williamson, Adam Finkelstein

Perceptual evaluation constitutes a crucial aspect of various audio-processing tasks. Full-reference (FR) or similarity-based metrics rely on high-quality reference recordings, to which lower-quality or corrupted versions of the recording may be compared for evaluation. In contrast, no-reference (NR) metrics evaluate a recording without relying on a reference. Both the FR and NR approaches exhibit advantages and drawbacks relative to each other. In this paper, we present a novel framework called CORN that amalgamates these dual approaches, concurrently training both FR and NR models together. After training, the models can be applied independently. We evaluate CORN by predicting several common objective metrics, across two different architectures. The NR model trained using CORN has access to a reference recording during training, and thus, as one would expect, it consistently outperforms baseline NR models trained independently. Perhaps even more remarkable is that the CORN FR model also outperforms its baseline counterpart, even though it relies on the same training data and the same model architecture. Thus, a single training regime produces two independently useful models, each outperforming its independently trained counterpart.
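
A minimal co-training sketch, assuming the general setup but not the paper's architecture (CORNLike and the 128-dimensional toy inputs below are our own): one shared encoder feeds an FR head, which sees both reference and degraded embeddings, and an NR head, which sees the degraded embedding only, and both losses shape the shared encoder.

import torch
import torch.nn as nn

class CORNLike(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(),
                                     nn.Linear(dim, dim))
        self.fr_head = nn.Linear(2 * dim, 1)  # full-reference: degraded + reference
        self.nr_head = nn.Linear(dim, 1)      # no-reference: degraded only
    def forward(self, degraded, reference):
        zd, zr = self.encoder(degraded), self.encoder(reference)
        return self.fr_head(torch.cat([zd, zr], -1)), self.nr_head(zd)

model = CORNLike()
deg, ref, target = torch.randn(8, 128), torch.randn(8, 128), torch.rand(8, 1)
fr_pred, nr_pred = model(deg, ref)
loss = nn.functional.mse_loss(fr_pred, target) + nn.functional.mse_loss(nr_pred, target)
loss.backward()  # gradients from both heads update the shared encoder

After training, each head can be used on its own, which is the property the abstract highlights.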

Data Augmentations for Improved (Large) Language Model Generalization. (arXiv:2310.12803v2 [cs.LG] UPDATED)

Authors: Amir Feder, Yoav Wald, Claudia Shi, Suchi Saria, David Blei

The reliance of text classifiers on spurious correlations can lead to poor generalization at deployment, raising concerns about their use in safety-critical domains such as healthcare. In this work, we propose to use counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features and to learn more robust text classifiers. We show that this strategy is appropriate in prediction problems where the label is spuriously correlated with an attribute. Under the assumptions of such problems, we discuss the favorable sample complexity of counterfactual data augmentation, compared to importance re-weighting. Pragmatically, we match examples using auxiliary data, based on diff-in-diff methodology, and use a large language model (LLM) to represent a conditional probability of text. Through extensive experimentation on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data, we demonstrate that our method for simulating interventions improves out-of-distribution (OOD) accuracy compared to baseline invariant learning algorithms.

PolyThrottle: Energy-efficient Neural Network Inference on Edge Devices. (arXiv:2310.19991v2 [cs.LG] UPDATED)

Authors: Minghao Yan, Hongyi Wang, Shivaram Venkataraman

As neural networks (NN) are deployed across diverse sectors, their energy demand correspondingly grows. While several prior works have focused on reducing energy consumption during training, the continuous operation of ML-powered systems leads to significant energy use during inference. This paper investigates how the configuration of on-device hardware elements, such as GPU, memory, and CPU frequencies, which is often neglected in prior studies, affects energy consumption for NN inference with regular fine-tuning. We propose PolyThrottle, a solution that optimizes configurations across individual hardware components using Constrained Bayesian Optimization in an energy-conserving manner. Our empirical evaluation uncovers novel facets of the energy-performance equilibrium, showing that we can save up to 36 percent of energy for popular models. We also validate that PolyThrottle can quickly converge towards near-optimal settings while satisfying application constraints.
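
The following sketch shows one standard way to realize constrained Bayesian optimization over hardware frequencies; the search grid, the measure stand-in, and the 6 ms latency constraint are all hypothetical, and PolyThrottle's actual formulation may differ.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical grid of (gpu_GHz, mem_GHz, cpu_GHz) configurations.
configs = np.array([[g, m, c] for g in (0.9, 1.1, 1.3)
                              for m in (6.0, 8.0)
                              for c in (1.2, 1.8, 2.4)])

def measure(cfg):  # stand-in for a real on-device energy/latency measurement
    energy = 5 * cfg[0] + 0.5 * cfg[1] + 3 * cfg[2] + np.random.randn() * 0.1
    latency = 5 / cfg[0] + 2 / cfg[2]
    return energy, latency

rng = np.random.default_rng(0)
X, E, L = [], [], []
for cfg in configs[rng.choice(len(configs), 4, replace=False)]:
    e, l = measure(cfg); X.append(cfg); E.append(e); L.append(l)

gp_e, gp_l = GaussianProcessRegressor(alpha=1e-3), GaussianProcessRegressor(alpha=1e-3)
for _ in range(10):
    gp_e.fit(X, E); gp_l.fit(X, L)
    mu_e, sd_e = gp_e.predict(configs, return_std=True)
    mu_l, sd_l = gp_l.predict(configs, return_std=True)
    best = min(E)
    z = (best - mu_e) / (sd_e + 1e-9)
    ei = (best - mu_e) * norm.cdf(z) + sd_e * norm.pdf(z)  # expected improvement
    feasible = norm.cdf((6.0 - mu_l) / (sd_l + 1e-9))      # P(latency <= 6 ms SLO)
    cfg = configs[np.argmax(ei * feasible)]                # constrained acquisition
    e, l = measure(cfg); X.append(cfg); E.append(e); L.append(l)

print(X[int(np.argmin(E))], min(E))  # lowest-energy configuration found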

Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination. (arXiv:2311.02960v2 [cs.LG] UPDATED)

Authors: Peng Wang, Xiao Li, Can Yaras, Zhihui Zhu, Laura Balzano, Wei Hu, Qing Qu

Over the past decade, deep learning has proven to be a highly effective tool for learning meaningful features from raw data. However, it remains an open question how deep networks perform hierarchical feature learning across layers. In this work, we attempt to unveil this mystery by investigating the structures of intermediate features. Motivated by our empirical findings that linear layers mimic the roles of deep layers in nonlinear networks for feature learning, we explore how deep linear networks transform input data into output by investigating the output (i.e., features) of each layer after training in the context of multi-class classification problems. Toward this goal, we first define metrics to measure within-class compression and between-class discrimination of intermediate features, respectively. Through theoretical analysis of these two metrics, we show that the evolution of features follows a simple and quantitative pattern from shallow to deep layers when the input data is nearly orthogonal and the network weights are minimum-norm, balanced, and approximately low-rank: each layer of the linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate with respect to the number of layers that data have passed through. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks. Empirically, our extensive experiments not only validate our theoretical results numerically but also reveal a similar pattern in deep nonlinear networks, which aligns well with recent empirical studies. Moreover, we demonstrate the practical implications of our results in transfer learning. Our code is available at \url{https://github.com/Heimine/PNC_DLN}.
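
A generic way to probe this layerwise trend is to track a within-class to between-class scatter ratio of the features after each layer; the paper defines its own metrics, so the ratio below (class_scatter, our simplification) only mirrors the qualitative idea.

import torch

def class_scatter(features, labels):
    # Ratio of within-class scatter to between-class scatter; smaller values
    # indicate more compressed and more discriminative features.
    mu = features.mean(0)
    within, between = 0.0, 0.0
    for c in labels.unique():
        fc = features[labels == c]
        mu_c = fc.mean(0)
        within = within + ((fc - mu_c) ** 2).sum()
        between = between + fc.shape[0] * ((mu_c - mu) ** 2).sum()
    return (within / between).item()

# Untrained deep linear network, purely to show how the probe is applied;
# the geometric decay appears after training under the stated conditions.
net = torch.nn.Sequential(*[torch.nn.Linear(32, 32, bias=False) for _ in range(6)])
x, y = torch.randn(256, 32), torch.randint(0, 4, (256,))
h = x
for depth, layer in enumerate(net, 1):
    h = layer(h)
    print(depth, class_scatter(h.detach(), y))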

LLMs cannot find reasoning errors, but can correct them!. (arXiv:2311.08516v2 [cs.AI] UPDATED)

Authors: Gladys Tyen, Hassan Mansoor, Victor Cărbune, Peter Chen, Tony Mak

While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023; Madaan et al., 2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performance overall (Huang et al., 2023). In this paper, we break down the self-correction process into two core components: mistake finding and output correction. For mistake finding, we release BIG-Bench Mistake, a dataset of logical mistakes in Chain-of-Thought reasoning traces. We provide benchmark numbers for several state-of-the-art LLMs, and demonstrate that LLMs generally struggle with finding logical mistakes. For output correction, we propose a backtracking method which provides large improvements when given information on the mistake location. We construe backtracking as a lightweight alternative to reinforcement learning methods, and show that it remains effective with a reward model at 60-70% accuracy.
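
A minimal sketch of backtracking given a known mistake location; generate_step is a hypothetical stand-in for any LLM call that produces the next chain-of-thought step from the trace so far, and the retry policy below is our own simplification.

def backtrack_correct(question, steps, mistake_idx, generate_step,
                      max_tries=8, temperature=0.7):
    prefix = steps[:mistake_idx]          # keep every step before the mistake
    bad = steps[mistake_idx]
    for _ in range(max_tries):            # resample the offending step
        candidate = generate_step(question, prefix, temperature=temperature)
        if candidate.strip() != bad.strip():   # require a genuinely different step
            return prefix + [candidate]   # downstream steps are then regenerated
    return steps                          # give up: return the original trace

# Toy stand-in so the sketch runs end to end:
toy = lambda q, prefix, temperature: "recompute: 12 * 4 = 48"
print(backtrack_correct("Q", ["compute 12 * 4", "12 * 4 = 58"], 1, toy))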

Deep Interactive Segmentation of Medical Images: A Systematic Review and Taxonomy. (arXiv:2311.13964v2 [eess.IV] UPDATED)

Authors: Zdravko Marinov, Paul F. Jäger, Jan Egger, Jens Kleesiek, Rainer Stiefelhagen

Interactive segmentation is a crucial research area in medical image analysis aiming to boost the efficiency of costly annotations by incorporating human feedback. This feedback takes the form of clicks, scribbles, or masks and allows for iterative refinement of the model output so as to efficiently guide the system towards the desired behavior. In recent years, deep learning-based approaches have propelled results to a new level, causing rapid growth in the field, with 121 methods proposed in the medical imaging domain alone. In this review, we provide a structured overview of this emerging field featuring a comprehensive taxonomy, a systematic review of existing methods, and an in-depth analysis of current practices. Based on these contributions, we discuss the challenges and opportunities in the field. For instance, we find that there is a severe lack of comparison across methods, which needs to be tackled by standardized baselines and benchmarks.

Advanced Large Language Model (LLM)-Driven Verilog Development: Enhancing Power, Performance, and Area Optimization in Code Synthesis. (arXiv:2312.01022v2 [cs.LG] UPDATED)

Authors: Kiran Thorat, Jiahui Zhao, Yaotian Liu, Hongwu Peng, Xi Xie, Bin Lei, Jeff Zhang, Caiwen Ding

The increasing use of Advanced Language Models (ALMs) in diverse sectors, particularly due to their impressive capability to generate top-tier content following linguistic instructions, forms the core of this investigation. This study probes ALMs' deployment in electronic hardware design, with a specific emphasis on the synthesis and enhancement of Verilog programming. We introduce an innovative framework crafted to assess and amplify ALMs' productivity in this niche. The methodology commences with the initial crafting of Verilog programming via ALMs, succeeded by a distinct dual-stage refinement protocol. The first stage prioritizes augmenting the code's operational and linguistic precision, while the second is dedicated to aligning the code with Power-Performance-Area (PPA) benchmarks, a pivotal component in proficient hardware design. This bifurcated strategy, merging error remediation with PPA enhancement, has yielded substantial upgrades in the caliber of ALM-created Verilog programming. Our framework achieves 81.37% linguistic accuracy and 62.0% operational efficacy in programming synthesis, surpassing current leading-edge techniques, which attain 73% linguistic accuracy and 46% operational efficacy. These findings illuminate ALMs' aptitude in tackling complex technical domains and signal a positive shift in the mechanization of hardware design operations.

Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications. (arXiv:2312.02828v2 [stat.ML] UPDATED)

Authors: Rajeeva L. Karandikar, M. Vidyasagar

The Stochastic Approximation (SA) algorithm introduced by Robbins and Monro in 1951 has been a standard method for solving equations of the form $\mathbf{f}({\boldsymbol {\theta}}) = \mathbf{0}$, when only noisy measurements of $\mathbf{f}(\cdot)$ are available. If $\mathbf{f}({\boldsymbol {\theta}}) = \nabla J({\boldsymbol {\theta}})$ for some function $J(\cdot)$, then SA can also be used to find a stationary point of $J(\cdot)$. At each time $t$, the current guess ${\boldsymbol {\theta}}_t$ is updated to ${\boldsymbol {\theta}}_{t+1}$ using a noisy measurement of the form $\mathbf{f}({\boldsymbol {\theta}}_t) + {\boldsymbol {\xi}}_{t+1}$. In much of the literature, it is assumed that the error term ${\boldsymbol {\xi}}_{t+1}$ has zero conditional mean, and/or that its conditional variance is bounded as a function of $t$ (though not necessarily with respect to ${\boldsymbol {\theta}}_t$). Over the years, SA has been applied to a variety of areas, of which the focus in this paper is convex and nonconvex optimization. As it turns out, in these applications, the above-mentioned assumptions on the measurement error do not always hold. In zero-order methods, the error has neither zero mean nor bounded conditional variance. In the present paper, we extend SA theory to encompass errors with nonzero conditional mean and/or unbounded conditional variance. In addition, we derive estimates for the rate of convergence of the algorithm, and compute the ``optimal step size sequences'' to maximize the estimated rate of convergence.
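
A minimal Robbins-Monro sketch (our illustration, with the target root, step sizes, and noise model chosen for display): the iterate is updated with a noisy measurement whose conditional variance grows with the iterate, hinting at the unbounded-variance regime the paper analyzes.

import numpy as np

rng = np.random.default_rng(0)
f = lambda theta: 2.0 * (theta - 3.0)      # true root at theta = 3

theta = 0.0
for t in range(1, 10001):
    noisy = f(theta) + rng.normal(scale=1.0 + abs(theta))  # state-dependent noise
    theta -= (0.5 / t) * noisy             # step sizes alpha_t = a / t
print(theta)                               # converges toward 3.0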

STEAM & MoSAFE: SOTIF Error-and-Failure Model & Analysis for AI-Enabled Driving Automation. (arXiv:2312.09559v2 [cs.LG] UPDATED)

Authors: Krzysztof Czarnecki, Hiroshi Kuwajima

Driving Automation Systems (DAS) are subject to complex road environments and vehicle behaviors and increasingly rely on sophisticated sensors and Artificial Intelligence (AI). These properties give rise to unique safety faults stemming from specification insufficiencies and technological performance limitations, where sensors and AI introduce errors that vary in magnitude and temporal patterns, posing potential safety risks. The Safety of the Intended Functionality (SOTIF) standard emerges as a promising framework for addressing these concerns, focusing on scenario-based analysis to identify hazardous behaviors and their causes. Although the current standard provides a basic cause-and-effect model and high-level process guidance, it lacks concepts required to identify and evaluate hazardous errors, especially within the context of AI.

This paper introduces two key contributions to bridge this gap. First, it defines the SOTIF Temporal Error and Failure Model (STEAM) as a refinement of the SOTIF cause-and-effect model, offering a comprehensive system-design perspective. STEAM refines error definitions, introduces error sequences, and classifies them as error sequence patterns, with particular relevance to systems employing advanced sensors and AI. Second, this paper proposes the Model-based SOTIF Analysis of Failures and Errors (MoSAFE) method, which allows instantiating STEAM based on system-design models by deriving hazardous error sequence patterns at module level from hazardous behaviors at vehicle level via weakest precondition reasoning. Finally, the paper presents a case study centered on an automated speed-control feature, illustrating the practical applicability of the refined model and the MoSAFE method in addressing complex safety challenges in DAS.

Time-Transformer: Integrating Local and Global Features for Better Time Series Generation. (arXiv:2312.11714v2 [cs.LG] UPDATED)

Authors: Yuansan Liu, Sudanthi Wijewickrema, Ang Li, Christofer Bester, Stephen O'Leary, James Bailey

Generating time series data is a promising approach to address data deficiency problems. However, it is also challenging due to the complex temporal properties of time series data, including local correlations as well as global dependencies. Most existing generative models have failed to effectively learn both the local and global properties of time series data. To address this open problem, we propose a novel time series generative model named 'Time-Transformer AAE', which consists of an adversarial autoencoder (AAE) and a newly designed architecture named 'Time-Transformer' within the decoder. The Time-Transformer first simultaneously learns local and global features in a layer-wise parallel design, combining the abilities of Temporal Convolutional Networks and Transformers in extracting local features and global dependencies, respectively. Second, a bidirectional cross attention is proposed to provide complementary guidance across the two branches and achieve proper fusion between local and global features. Experimental results demonstrate that our model outperforms existing state-of-the-art models on 5 out of 6 datasets, specifically on those containing both global and local properties. Furthermore, we highlight our model's advantage in handling this kind of data via an artificial dataset. Finally, we show our model's ability to address a real-world problem: data augmentation to support learning with small and imbalanced datasets.
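
As a sketch of the layer-wise parallel design (our simplification; the paper's exact block differs), a dilated-convolution branch extracts local features, a self-attention branch extracts global ones, and two cross-attention passes let each branch query the other before fusion:

import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, d=32, heads=4):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size=3, padding=2, dilation=2)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_lg = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_gl = nn.MultiheadAttention(d, heads, batch_first=True)
    def forward(self, x):                                     # x: (batch, time, d)
        local = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local branch
        global_, _ = self.attn(x, x, x)                       # global branch
        l2g, _ = self.cross_lg(local, global_, global_)   # local queries global
        g2l, _ = self.cross_gl(global_, local, local)     # global queries local
        return l2g + g2l                                  # fused features

x = torch.randn(8, 50, 32)
print(ParallelBlock()(x).shape)   # torch.Size([8, 50, 32])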

Federated Class-Incremental Learning with New-Class Augmented Self-Distillation. (arXiv:2401.00622v2 [cs.LG] UPDATED)

Authors: Zhiyuan Wu, Tianliu He, Sheng Sun, Yuwei Wang, Min Liu, Bo Gao, Xuefeng Jiang

Federated Learning (FL) enables collaborative model training among participants while guaranteeing the privacy of raw data. Mainstream FL methodologies overlook the dynamic nature of real-world data, particularly its tendency to grow in volume and diversify in classes over time. This oversight results in FL methods suffering from catastrophic forgetting, where the trained models inadvertently discard previously learned information upon assimilating new data. In response to this challenge, we propose a novel Federated Class-Incremental Learning (FCIL) method, named \underline{Fed}erated \underline{C}lass-Incremental \underline{L}earning with New-Class \underline{A}ugmented \underline{S}elf-Di\underline{S}tillation (FedCLASS). The core of FedCLASS is to enrich the class scores of historical models with new-class scores predicted by current models and use the combined knowledge for self-distillation, enabling more thorough and precise knowledge transfer from historical models to current models. Theoretical analyses demonstrate that FedCLASS stands on reliable foundations, treating the scores of old classes predicted by historical models as conditional probabilities in the absence of new classes, and the scores of new classes predicted by current models as the conditional probabilities of class scores derived from historical models. Empirical experiments demonstrate the superiority of FedCLASS over four baseline algorithms in reducing the average forgetting rate and boosting global accuracy.
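
The core distillation step can be sketched as follows (our reading of the abstract, not the released implementation; the temperature and class split are illustrative): the historical model's old-class scores are enriched with the current model's new-class scores, and the combined distribution serves as the self-distillation target.

import torch
import torch.nn.functional as F

def fedclass_distill_loss(cur_logits, hist_logits, n_old, T=2.0):
    with torch.no_grad():
        new_scores = cur_logits[:, n_old:]                  # new classes: current model
        teacher = torch.cat([hist_logits, new_scores], 1)   # enriched class scores
        p_teacher = F.softmax(teacher / T, dim=1)
    log_p_student = F.log_softmax(cur_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

cur = torch.randn(4, 10)    # current model: 6 old + 4 new classes
hist = torch.randn(4, 6)    # historical model: old classes only
print(fedclass_distill_loss(cur, hist, n_old=6))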

A Comprehensive Study of Knowledge Editing for Large Language Models. (arXiv:2401.01286v3 [cs.CL] UPDATED)

Authors: Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, Huajun Chen

Large Language Models (LLMs) have shown extraordinary capabilities in understanding and generating text that closely mirrors human communication. However, a primary limitation lies in the significant computational demands during training, arising from their extensive parameterization. This challenge is further intensified by the dynamic nature of the world, necessitating frequent updates to LLMs to correct outdated information or integrate new knowledge, thereby ensuring their continued relevance. Note that many applications demand continual model adjustments post-training to address deficiencies or undesirable behaviors. There is an increasing interest in efficient, lightweight methods for on-the-fly model modifications. To this end, recent years have seen a burgeoning of knowledge editing techniques for LLMs, which aim to efficiently modify LLMs' behaviors within specific domains while preserving overall performance across various inputs. In this paper, we first define the knowledge editing problem and then provide a comprehensive review of cutting-edge approaches. Drawing inspiration from educational and cognitive research theories, we propose a unified categorization criterion that classifies knowledge editing methods into three groups: resorting to external knowledge, merging knowledge into the model, and editing intrinsic knowledge. Furthermore, we introduce a new benchmark, KnowEdit, for a comprehensive empirical evaluation of representative knowledge editing approaches. Additionally, we provide an in-depth analysis of knowledge location, which can give a deeper understanding of the knowledge structures inherent within LLMs. Finally, we discuss several potential applications of knowledge editing, outlining its broad and impactful implications.

Multilingual Instruction Tuning With Just a Pinch of Multilinguality. (arXiv:2401.01854v2 [cs.CL] UPDATED)

Authors: Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, Matan Eyal

As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. One promising approach is cross-lingual transfer, where a model acquires specific functionality in one language by fine-tuning on another language. In this work, we investigate how multilinguality during instruction tuning of a multilingual LLM affects instruction-following across languages. We first show that many languages transfer some instruction-following capabilities to other languages even from monolingual tuning. Furthermore, we find that only 40 multilingual examples in an English tuning set substantially improve multilingual instruction-following, in languages both seen and unseen during tuning. In general, we observe that models tuned on multilingual mixtures exhibit comparable or superior performance in several languages compared to monolingually tuned models, despite training on 10x fewer examples in those languages. Finally, we find that increasing the number of languages in the instruction tuning set from 1 to only 2, 3, or 4 increases cross-lingual generalization. Our results suggest that building massively multilingual instruction-tuned models can be done with only a very small set of multilingual instruction-response pairs.

GPS-SSL: Guided Positive Sampling to Inject Prior Into Self-Supervised Learning. (arXiv:2401.01990v2 [cs.CV] UPDATED)

Authors: Aarash Feizi, Randall Balestriero, Adriana Romero-Soriano, Reihaneh Rabbany

We propose Guided Positive Sampling Self-Supervised Learning (GPS-SSL), a general method to inject a priori knowledge into the selection of positive samples in Self-Supervised Learning (SSL). Current SSL methods rely on Data-Augmentations (DA) to generate positive samples and to incorporate prior knowledge; an incorrect or too-weak DA will drastically reduce the quality of the learned representation. GPS-SSL instead designs a metric space in which Euclidean distances become a meaningful proxy for semantic relationship, so that positive samples can be generated by nearest-neighbor sampling. Any prior knowledge can then be embedded into that metric space independently of the employed DA. Owing to its simplicity, GPS-SSL is applicable to any SSL method, e.g. SimCLR or BYOL. A key benefit of GPS-SSL is that it reduces the pressure of tailoring strong DAs: for example, GPS-SSL reaches 85.58% on CIFAR-10 with weak DA while the baseline only reaches 37.51%. We therefore move a step forward towards the goal of making SSL less reliant on DA. We also show that even when using strong DAs, GPS-SSL outperforms the baselines on under-studied domains. We evaluate GPS-SSL along with multiple baseline SSL methods on numerous downstream datasets from different domains when the models use strong or minimal data augmentations. We hope that GPS-SSL will open new avenues in studying how to inject a priori knowledge into SSL in a principled manner.
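
A minimal sketch of the guided positive sampling step, assuming a prior-informed embedding space is already available (the pretrained-feature stand-in and k=5 below are our choices): positives are simply nearest neighbors in that space, independent of any DA.

import torch

def gps_positive_indices(embeddings, k=5):
    d = torch.cdist(embeddings, embeddings)   # Euclidean distances in metric space
    d.fill_diagonal_(float("inf"))            # exclude each sample itself
    return d.topk(k, largest=False).indices   # (n, k) indices of positive samples

prior_space = torch.randn(128, 16)   # e.g., features from a pretrained encoder
print(gps_positive_indices(prior_space).shape)   # torch.Size([128, 5])

These indices can then feed any SSL objective (SimCLR, BYOL, ...) in place of DA-generated views.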

Weakly Semi-supervised Tool Detection in Minimally Invasive Surgery Videos. (arXiv:2401.02791v2 [cs.CV] UPDATED)

Authors: Ryo Fujii, Ryo Hachiuma, Hideo Saito

Surgical tool detection is essential for analyzing and evaluating minimally invasive surgery videos. Current approaches are mostly based on supervised methods that require large, fully instance-level labels (i.e., bounding boxes). However, large image datasets with instance-level labels are often limited because of the annotation burden. Detecting tools from image-level labels instead of instance-level labels is therefore attractive, since image-level annotations are considerably more time-efficient to collect than instance-level annotations. In this work, we propose to strike a balance between the extremely costly annotation burden and detection performance. We further propose a co-occurrence loss, which exploits the characteristic that certain tool pairs often co-occur in an image, to leverage image-level labels. Encapsulating the knowledge of co-occurrence through this loss helps to overcome the difficulty in classification that originates from the fact that some tools have similar shapes and textures. Extensive experiments conducted on the Endovis2018 dataset in various data settings show the effectiveness of our method.
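
One way to encode such a co-occurrence prior as a loss (our formulation, not necessarily the paper's; the 7-class setup and the symmetric prior are illustrative) is to penalize the gap between predicted pairwise tool probabilities and known co-occurrence rates:

import torch

def cooccurrence_loss(probs, cooc):
    # probs: (B, C) image-level tool probabilities; cooc: (C, C) prior pair rates
    pair = probs.unsqueeze(2) * probs.unsqueeze(1)   # (B, C, C) predicted pair probs
    return ((pair.mean(0) - cooc) ** 2).mean()

probs = torch.sigmoid(torch.randn(16, 7))            # 7 tool classes
cooc = torch.rand(7, 7); cooc = (cooc + cooc.T) / 2  # symmetric co-occurrence prior
print(cooccurrence_loss(probs, cooc))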

Graph2Tac: Learning Hierarchical Representations of Math Concepts in Theorem proving. (arXiv:2401.02949v2 [cs.LG] UPDATED)

Authors: Jason Rute, Miroslav Olšák, Lasse Blaauwbroek, Fidel Ivan Schaposnik Massolo, Jelle Piepenbrock, Vasily Pestun

Concepts abound in mathematics and its applications. They vary greatly between subject areas, and new ones are introduced in each mathematical paper or application. A formal theory builds a hierarchy of definitions, theorems and proofs that reference each other. When an AI agent is proving a new theorem, most of the mathematical concepts and lemmas relevant to that theorem may have never been seen during training. This is especially true in the Coq proof assistant, which has a diverse library of Coq projects, each with its own definitions, lemmas, and even custom tactic procedures used to prove those lemmas. It is essential for agents to incorporate such new information into their knowledge base on the fly. We work towards this goal by utilizing a new, large-scale, graph-based dataset for machine learning in Coq. We leverage a faithful graph-representation of Coq terms that induces a directed graph of dependencies between definitions to create a novel graph neural network, Graph2Tac (G2T), that takes into account not only the current goal, but also the entire hierarchy of definitions that led to the current goal. G2T is an online model that is deeply integrated into the users' workflow and can adapt in real time to new Coq projects and their definitions. It complements other online models that learn in real time from new proof scripts. Our novel definition embedding task, which is trained to compute representations of mathematical concepts not seen during training, boosts the performance of the neural network to rival state-of-the-art k-nearest neighbor predictors.

The Tactician's Web of Large-Scale Formal Knowledge. (arXiv:2401.02950v2 [cs.LO] UPDATED)

Authors: Lasse Blaauwbroek

The Tactician's Web is a platform offering a large web of strongly interconnected, machine-checked, formal mathematical knowledge conveniently packaged for machine learning, analytics, and proof engineering. Built on top of the Coq proof assistant, the platform exports a dataset containing a wide variety of formal theories, presented as a web of definitions, theorems, proof terms, tactics, and proof states. Theories are encoded both as a semantic graph and as human-readable text, each with a unique set of advantages and disadvantages. Proving agents may interact with Coq through the same rich data representation and can be automatically benchmarked on a set of theorems. Tight integration with Coq provides the unique possibility to make agents available to proof engineers as practical tools.

Fair Sampling in Diffusion Models through Switching Mechanism. (arXiv:2401.03140v2 [cs.LG] UPDATED)

Authors: Yujin Choi, Jinseong Park, Hoki Kim, Jaewook Lee, Saeroom Park

Diffusion models have shown their effectiveness in generation tasks by well-approximating the underlying probability distribution. However, diffusion models are known to suffer from an amplified inherent bias from the training data in terms of fairness. While the sampling process of diffusion models can be controlled by conditional guidance, previous works have attempted to find empirical guidance to achieve quantitative fairness. To address this limitation, we propose a fairness-aware sampling method, called the \textit{attribute switching} mechanism, for diffusion models. Without additional training, the proposed sampling can obfuscate sensitive attributes in generated data without relying on classifiers. We mathematically prove and experimentally demonstrate the effectiveness of the proposed method on two key aspects: (i) the generation of fair data and (ii) the preservation of the utility of the generated data.
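
A sketch of attribute switching at sampling time (our reading of the abstract; denoise, the step count, and the switching point are placeholders): condition the reverse process on one attribute during the early, high-noise steps and on the other afterwards, with no retraining.

import torch

def sample_with_switch(denoise, x_T, attr_a, attr_b, T=1000, t_switch=600):
    x = x_T
    for t in reversed(range(T)):
        attr = attr_a if t >= t_switch else attr_b   # switch the condition at t_switch
        x = denoise(x, t, attr)                      # one reverse-diffusion step
    return x

# Toy stand-in for a real conditional sampler so the sketch runs:
denoise = lambda x, t, attr: x - 0.001 * (x - attr)
out = sample_with_switch(denoise, torch.randn(4, 8), torch.zeros(8), torch.ones(8))
print(out.shape)   # torch.Size([4, 8])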

Multi-Modal Representation Learning for Molecular Property Prediction: Sequence, Graph, Geometry. (arXiv:2401.03369v2 [q-bio.MN] UPDATED)

Authors: Zeyu Wang, Tianyi Jiang, Jinhuan Wang, Qi Xuan

Molecular property prediction refers to the task of labeling molecules with biochemical properties and plays a pivotal role in the drug discovery and design process. Recently, with the advancement of machine learning, deep learning-based molecular property prediction has emerged as a solution to the resource-intensive nature of traditional methods and has garnered significant attention. Within it, molecular representation learning is the key factor for prediction performance, and many sequence-based, graph-based, and geometry-based methods have been proposed. However, the majority of existing studies focus solely on one modality for learning molecular representations, failing to comprehensively capture molecular characteristics and information. In this paper, a novel multi-modal representation learning model called SGGRL, which integrates sequence, graph, and geometry characteristics, is proposed for molecular property prediction. Specifically, we design a fusion layer to fuse the representations of the different modalities. Furthermore, to ensure consistency across modalities, SGGRL is trained to maximize the similarity of representations for the same molecule while minimizing the similarity for different molecules. To verify the effectiveness of SGGRL, seven molecular datasets and several baselines are used for evaluation and comparison. The experimental results demonstrate that SGGRL consistently outperforms the baselines in most cases, further underscoring its capability to comprehensively capture molecular information. Overall, the proposed SGGRL model showcases its potential to revolutionize molecular property prediction by leveraging multi-modal representation learning to extract diverse and comprehensive molecular insights. Our code is released at https://github.com/Vencent-Won/SGGRL.
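
The cross-modality consistency objective can be sketched as a standard contrastive loss (our simplification; SGGRL's exact objective may differ): embeddings of the same molecule from two modalities are pulled together, while embeddings of different molecules are pushed apart.

import torch
import torch.nn.functional as F

def cross_modal_nce(z_a, z_b, tau=0.1):
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.T / tau               # (n, n) cross-modal similarities
    targets = torch.arange(z_a.shape[0])     # matched molecules on the diagonal
    return F.cross_entropy(logits, targets)

seq, graph = torch.randn(32, 64), torch.randn(32, 64)  # per-modality embeddings
print(cross_modal_nce(seq, graph))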

Token-free LLMs Can Generate Chinese Classical Poetry with More Accurate Format. (arXiv:2401.03512v2 [cs.CL] UPDATED)

Authors: Chengyue Yu, Lei Zang, Jiaotuan Wang, Chenyi Zhuang, Jinjie Gu

Finetuned large language models (such as ChatGPT and Qwen-chat) can generate Chinese classical poetry following human instructions. LLMs perform well on content but are usually lacking in format, occasionally producing an excess or insufficient number of characters in each line. Since most SOTA LLMs are token-based, we assume that the format inaccuracy stems from the difficulty of the "token planning" task: the LLM needs to know exactly how many characters are contained in each token and do length-control planning based on that knowledge. In this paper, we first confirm our assumption by showing that existing token-based large language models have limited knowledge of the token-character relationship. Using a spelling bee probing procedure, we find that Qwen-chat fails nearly 15% of Chinese spelling tests. We then show that a token-based model can be easily tailored into a token-free model (in terms of Chinese), which largely solves the format accuracy problem. Our tailoring procedure removes long tokens from the vocabulary and the language model head, keeping only character-level or byte-level tokens. As part of our contribution, we release the finetuned token-free model (based on Qwen-chat-7B), which can generate Chinese classical poetry following complex instructions like LLMs (such as story paraphrasing) and also performs well in format. On the test set, our token-free model achieves a format accuracy of 0.96, compared to 0.84 for token-based equivalents and 0.38 for GPT-4.
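
The tailoring step can be sketched as vocabulary pruning (our simplification; the described procedure also removes the matching rows of the language model head, and the toy vocabulary below is hypothetical):

def tailor_vocab(vocab):
    # vocab: token string -> id; keep only single-character and special tokens,
    # so every generated token contributes exactly one character to a line.
    return {tok: i for tok, i in vocab.items()
            if len(tok) == 1 or tok.startswith("<")}

vocab = {"<bos>": 0, "月": 1, "光": 2, "月光": 3, "a": 4, "ab": 5}
print(tailor_vocab(vocab))   # multi-character tokens "月光" and "ab" are dropped

With only character-level tokens left, counting generated tokens counts characters, which makes line-length control trivial.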

Generalized Lagrangian Neural Networks. (arXiv:2401.03728v2 [math.DS] UPDATED)

Authors: Shanshan Xiao, Jiawei Zhang, Yifa Tang

Incorporating neural networks for the solution of Ordinary Differential Equations (ODEs) represents a pivotal research direction within computational mathematics. Within neural network architectures, the integration of the intrinsic structure of ODEs offers advantages such as enhanced predictive capabilities and reduced data requirements. Among these structural ODE forms, the Lagrangian representation stands out due to its significant physical underpinnings. Building upon this framework, Bhattoo introduced the concept of Lagrangian Neural Networks (LNNs). In this article, we introduce Generalized Lagrangian Neural Networks (GLNNs), an extension of LNNs innovatively tailored for non-conservative systems. By leveraging the foundational importance of the Lagrangian within Lagrange's equations, we formulate the model based on the generalized Lagrange's equation. This modification not only enhances prediction accuracy but also guarantees a Lagrangian representation in non-conservative systems. Furthermore, we perform various experiments, encompassing 1-dimensional and 2-dimensional examples, along with an examination of the impact of network parameters, which demonstrate the superiority of GLNNs.
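
Under a Lagrangian with a non-conservative force $F$, the equations of motion take the form $\ddot{q} = H^{-1}(\nabla_q L - (\partial^2 L/\partial \dot{q}\,\partial q)\,\dot{q} + F)$ with $H = \partial^2 L/\partial \dot{q}^2$; the sketch below (our construction, with a toy network and a friction-like force) derives accelerations from a learned scalar L by automatic differentiation.

import torch
from torch.autograd.functional import hessian, jacobian

net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
L = lambda q, qd: net(torch.cat([q, qd])).squeeze()   # learned Lagrangian L(q, qdot)

def acceleration(q, qd, force):
    H = hessian(L, (q, qd))
    mass = H[1][1]                       # d^2 L / d qdot^2
    mixed = H[1][0]                      # d^2 L / d qdot d q
    grad_q = jacobian(L, (q, qd))[0]     # d L / d q
    return torch.linalg.solve(mass, grad_q - mixed @ qd + force)

q, qd = torch.randn(1), torch.randn(1)
print(acceleration(q, qd, force=-0.1 * qd))   # e.g., a friction-like force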

Lessons Learned: Reproducibility, Replicability, and When to Stop. (arXiv:2401.03736v2 [cs.LG] UPDATED)

Authors: Milton S. Gomez, Tom Beucler

While extensive guidance exists for ensuring the reproducibility of one's own study, there is little discussion regarding the reproduction and replication of external studies within one's own research. To initiate this discussion, drawing lessons from our experience reproducing an operational product for predicting tropical cyclogenesis, we present a two-dimensional framework to offer guidance on reproduction and replication. Our framework, representing model fitting on one axis and its use in inference on the other, builds upon three key aspects: the dataset, the metrics, and the model itself. By assessing the trajectories of our studies on this 2D plane, we can better inform the claims made using our research. Additionally, we use this framework to contextualize the utility of benchmark datasets in the atmospheric sciences. Our two-dimensional framework provides a tool for researchers, especially early career researchers, to incorporate prior work in their own research and to inform the claims they can make in this context.

A Primer on Temporal Graph Learning. (arXiv:2401.03988v2 [cs.LG] UPDATED)

Authors: Aniq Ur Rahman, Justin P. Coon

This document aims to familiarize readers with temporal graph learning (TGL) through a concept-first approach. We have systematically presented vital concepts essential for understanding the workings of a TGL framework. In addition to qualitative explanations, we have incorporated mathematical formulations where applicable, enhancing the clarity of the text. Since TGL involves temporal and spatial learning, we introduce relevant learning architectures ranging from recurrent and convolutional neural networks to transformers and graph neural networks. We also discuss classical time series forecasting methods to inspire interpretable learning solutions for TGL.

Isolated pulsar population synthesis with simulation-based inference. (arXiv:2312.14848v1 [astro-ph.HE] CROSS LISTED)

Authors: Vanessa Graber, Michele Ronchi, Celsa Pardo-Araujo, Nanda Rea

We combine pulsar population synthesis with simulation-based inference to constrain the magneto-rotational properties of isolated Galactic radio pulsars. We first develop a flexible framework to model neutron-star birth properties and evolution, focusing on their dynamical, rotational and magnetic characteristics. In particular, we sample initial magnetic-field strengths, $B$, and spin periods, $P$, from log-normal distributions and capture the late-time magnetic-field decay with a power law. Each log-normal is described by a mean, $\mu_{\log B}, \mu_{\log P}$, and standard deviation, $\sigma_{\log B}, \sigma_{\log P}$, while the power law is characterized by the index, $a_{\rm late}$, resulting in five free parameters. We subsequently model the stars' radio emission and observational biases to mimic detections with three radio surveys, and produce a large database of synthetic $P$-$\dot{P}$ diagrams by varying our input parameters. We then follow a simulation-based inference approach that focuses on neural posterior estimation and employ this database to train deep neural networks to directly infer the posterior distributions of the five model parameters. After successfully validating these individual neural density estimators on simulated data, we use an ensemble of networks to infer the posterior distributions for the observed pulsar population. We obtain $\mu_{\log B} = 13.10^{+0.08}_{-0.10}$, $\sigma_{\log B} = 0.45^{+0.05}_{-0.05}$ and $\mu_{\log P} = -1.00^{+0.26}_{-0.21}$, $\sigma_{\log P} = 0.38^{+0.33}_{-0.18}$ for the log-normal distributions, and $a_{\rm late} = -1.80^{+0.65}_{-0.61}$ for the power law, at the $95\%$ credible level. Our approach represents a crucial step towards robust statistical inference for complex population-synthesis frameworks and forms the basis for future multi-wavelength analyses of Galactic pulsars.
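
The sampling stage of such a pipeline can be sketched in a few lines (our simplification, using the inferred posterior means above purely as illustrative inputs): draw initial fields and periods from base-10 log-normals and apply the late-time power-law decay.

import numpy as np

rng = np.random.default_rng(1)
mu_logB, sd_logB = 13.10, 0.45        # log10 of initial field strength (G)
mu_logP, sd_logP = -1.00, 0.38        # log10 of initial spin period (s)
a_late = -1.80                        # late-time power-law decay index

B0 = 10 ** rng.normal(mu_logB, sd_logB, 100_000)
P0 = 10 ** rng.normal(mu_logP, sd_logP, 100_000)
t, t_late = 1e6, 1e5                  # illustrative ages (yr)
B_t = B0 * (t / t_late) ** a_late if t > t_late else B0   # late-time decay
print(np.median(B0), np.median(B_t))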

View-based Explanations for Graph Neural Networks. (arXiv:2401.02086v2 [cs.LG] CROSS LISTED)

Authors: Tingyang Chen, Dazhuo Qiu, Yinghui Wu, Arijit Khan, Xiangyu Ke, Yunjun Gao

Generating explanations for graph neural networks (GNNs) has been studied to understand their behavior in analytical tasks such as graph classification. Existing approaches aim to understand the overall results of GNNs rather than providing explanations for specific class labels of interest, and may return explanation structures that are hard to access and not directly queryable. We propose GVEX, a novel paradigm that generates Graph Views for EXplanation. (1) We design a two-tier explanation structure called explanation views. An explanation view consists of a set of graph patterns and a set of induced explanation subgraphs. Given a database G of multiple graphs and a specific class label l assigned by a GNN-based classifier M, it concisely describes the fraction of G that best explains why l is assigned by M. (2) We propose quality measures and formulate an optimization problem to compute optimal explanation views for GNN explanation. We show that the problem is $\Sigma^2_P$-hard. (3) We present two algorithms. The first follows an explain-and-summarize strategy that first generates high-quality explanation subgraphs which best explain GNNs in terms of feature influence maximization, and then performs a summarization step to generate patterns. We show that this strategy provides an approximation ratio of 1/2. Our second algorithm performs a single pass over an input node stream in batches to incrementally maintain explanation views, with an anytime quality guarantee of 1/4 approximation. Using real-world benchmark data, we experimentally demonstrate the effectiveness, efficiency, and scalability of GVEX. Through case studies, we showcase the practical applications of GVEX.

On Sample-Efficient Offline Reinforcement Learning: Data Diversity, Posterior Sampling, and Beyond. (arXiv:2401.03301v1 [cs.LG] CROSS LISTED)

Authors: Thanh Nguyen-Tang, Raman Arora

We seek to understand what facilitates sample-efficient learning from historical datasets for sequential decision-making, a problem that is popularly known as offline reinforcement learning (RL). Further, we are interested in algorithms that enjoy sample efficiency while leveraging (value) function approximation. In this paper, we address these fundamental questions by (i) proposing a notion of data diversity that subsumes the previous notions of coverage measures in offline RL and (ii) using this notion to {unify} three distinct classes of offline RL algorithms based on version spaces (VS), regularized optimization (RO), and posterior sampling (PS). We establish that, under standard assumptions, VS-based, RO-based, and PS-based algorithms achieve \emph{comparable} sample efficiency, recovering the state-of-the-art sub-optimality bounds for finite and linear model classes. This result is surprising, given that prior work suggested an unfavorable sample complexity of the RO-based algorithm compared to the VS-based algorithm, whereas posterior sampling is rarely considered in offline RL due to its explorative nature. Notably, our proposed model-free PS-based algorithm for offline RL is {novel}, with sub-optimality bounds that are {frequentist} (i.e., worst-case) in nature.