Determinants of renewable energy consumption in Madagascar: Evidence from feature selection algorithms. (arXiv:2401.13671v1 [econ.GN])

Authors: Franck Ramaharo, Fitiavana Randriamifidy

The aim of this note is to identify the factors influencing renewable energy consumption in Madagascar. We tested 12 features covering macroeconomic, financial, social, and environmental aspects, including economic growth, domestic investment, foreign direct investment, financial development, industrial development, inflation, income distribution, trade openness, exchange rate, tourism development, environmental quality, and urbanization. To assess their significance, we assumed a linear relationship between renewable energy consumption and these features over the 1990-2021 period. Next, we applied different machine learning feature selection algorithms classified as filter-based (relative importance for linear regression, correlation method), embedded (LASSO), and wrapper-based (best subset regression, stepwise regression, recursive feature elimination, iterative predictor weighting partial least squares, Boruta, simulated annealing, and genetic algorithms) methods. Our analysis revealed that the five most influential drivers stem from macroeconomic aspects. We found that domestic investment, foreign direct investment, and inflation positively contribute to the adoption of renewable energy sources. On the other hand, industrial development and trade openness negatively affect renewable energy consumption in Madagascar.
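
As a rough illustration of how such embedded and wrapper-based selectors can be applied (the file name, column names, and selector settings below are hypothetical, not the authors' setup), a minimal scikit-learn sketch:

    # Minimal sketch: LASSO (embedded) and RFE (wrapper) feature selection on
    # annual observations over 1990-2021; data loading is hypothetical.
    import pandas as pd
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LassoCV, LinearRegression
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("madagascar_energy.csv")          # hypothetical file
    features = df.drop(columns=["renewable_consumption"])
    X = StandardScaler().fit_transform(features)
    y = df["renewable_consumption"].values

    lasso = LassoCV(cv=5).fit(X, y)                    # nonzero coefficients survive
    rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)

    print("LASSO kept:", [f for f, c in zip(features.columns, lasso.coef_) if abs(c) > 1e-8])
    print("RFE kept:  ", list(features.columns[rfe.support_]))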

Process Mining for Unstructured Data: Challenges and Research Directions. (arXiv:2401.13677v1 [cs.DB])

Authors: Agnes Koschmider, Milda Aleknonytė-Resch, Frederik Fonger, Christian Imenkamp, Arvid Lepsien, Kaan Apaydin, Maximilian Harms, Dominik Janssen, Dominic Langhammer, Tobias Ziolkowski, Yorck Zisgen

The application of process mining to unstructured data could yield significant novel insights in disciplines where unstructured data is a common data format. However, efficiently analyzing unstructured data with process mining, and instilling confidence in the analysis results, requires bridging multiple challenges. The purpose of this paper is to discuss these challenges, present initial solutions, and describe future research directions. We hope that this article lays the foundations for future collaboration on this topic.

Inverse analysis of granular flows using differentiable graph neural network simulator. (arXiv:2401.13695v1 [physics.geo-ph])

Authors: Yongjin Choi, Krishna Kumar

Inverse problems in granular flows, such as landslides and debris flows, involve estimating material parameters or boundary conditions based on a target runout profile. Traditional high-fidelity simulators for these inverse problems are computationally demanding, restricting the number of simulations possible. Additionally, their non-differentiable nature makes gradient-based optimization methods, known for their efficiency in high-dimensional problems, inapplicable. While machine learning-based surrogate models offer computational efficiency and differentiability, they often struggle to generalize beyond their training data due to their reliance on low-dimensional input-output mappings that fail to capture the complete physics of granular flows. We propose a novel differentiable graph neural network simulator (GNS) by combining reverse mode automatic differentiation of graph neural networks with gradient-based optimization for solving inverse problems. GNS learns the dynamics of granular flow by representing the system as a graph and predicts the evolution of the graph at the next time step, given the current state. The differentiable GNS shows optimization capabilities beyond the training data. We demonstrate the effectiveness of our method for inverse estimation across single and multi-parameter optimization problems, including evaluating material properties and boundary conditions for a target runout distance and designing baffle locations to limit a landslide runout. Our proposed differentiable GNS framework offers an orders-of-magnitude faster solution to these inverse problems than the conventional finite difference approach to gradient-based optimization.
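
To make the core idea concrete (reverse-mode autodiff through a differentiable surrogate, optimized by gradient descent), here is a minimal PyTorch sketch; the untrained MLP stands in for the trained GNS rollout, and all sizes and values are illustrative assumptions:

    import torch

    # Stand-in differentiable surrogate mapping a material parameter
    # (e.g., friction) to a runout distance; the paper uses a trained GNS.
    surrogate = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                                    torch.nn.Linear(32, 1))

    target_runout = torch.tensor([1.5])                # target observation
    phi = torch.tensor([0.4], requires_grad=True)      # parameter to recover
    opt = torch.optim.Adam([phi], lr=1e-2)

    for step in range(500):
        opt.zero_grad()
        loss = (surrogate(phi) - target_runout).pow(2).mean()
        loss.backward()                                # reverse-mode autodiff
        opt.step()                                     # gradient-based update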

Generative AI-Driven Human Digital Twin in IoT-Healthcare: A Comprehensive Survey. (arXiv:2401.13699v1 [cs.HC])

Authors: Jiayuan Chen, You Shi, Changyan Yi, Hongyang Du, Jiawen Kang, Dusit Niyato

The Internet of Things (IoT) can significantly enhance the quality of human life, specifically in healthcare, attracting extensive attention to IoT-healthcare services. Meanwhile, the human digital twin (HDT) is proposed as an innovative paradigm that can comprehensively characterize the replication of the individual human body in the digital world and reflect its physical status in real time. Naturally, HDT is envisioned to empower IoT-healthcare beyond the application of healthcare monitoring by acting as a versatile and vivid human digital testbed, simulating outcomes and guiding practical treatments. However, successfully establishing HDT requires high-fidelity virtual modeling and strong information interactions, possibly with scarce, biased, and noisy data. Fortunately, a recently popular technology called generative artificial intelligence (GAI) may be a promising solution, because it can leverage advanced AI algorithms to automatically create, manipulate, and modify valuable yet diverse data. This survey particularly focuses on the implementation of GAI-driven HDT in IoT-healthcare. We start by introducing the background of IoT-healthcare and the potential of GAI-driven HDT. Then, we delve into the fundamental techniques and present the overall framework of GAI-driven HDT. After that, we explore the realization of GAI-driven HDT in detail, including GAI-enabled data acquisition, communication, data management, digital modeling, and data analysis. Besides, we discuss typical IoT-healthcare applications that can be revolutionized by GAI-driven HDT, namely personalized health monitoring and diagnosis, personalized prescription, and personalized rehabilitation. Finally, we conclude this survey by highlighting some future research directions.

Accelerating hyperbolic t-SNE. (arXiv:2401.13708v1 [cs.HC])

Authors: Martin Skrodzki, Hunter van Geffen, Nicolas F. Chaves-de-Plaza, Thomas Höllt, Elmar Eisemann, Klaus Hildebrandt

The need to understand the structure of hierarchical or high-dimensional data is present in a variety of fields. Hyperbolic spaces have proven to be an important tool for embedding computations and analysis tasks as their non-linear nature lends itself well to tree or graph data. Subsequently, they have also been used in the visualization of high-dimensional data, where they exhibit increased embedding performance. However, none of the existing dimensionality reduction methods for embedding into hyperbolic spaces scale well with the size of the input data. That is because the embeddings are computed via iterative optimization schemes and the computation cost of every iteration is quadratic in the size of the input. Furthermore, due to the non-linear nature of hyperbolic spaces, Euclidean acceleration structures cannot directly be translated to the hyperbolic setting. This paper introduces the first acceleration structure for hyperbolic embeddings, building upon a polar quadtree. We compare our approach with existing methods and demonstrate that it computes embeddings of similar quality in significantly less time. Implementation and scripts for the experiments can be found at https://graphics.tudelft.nl/accelerating-hyperbolic-tsne.

EMP: Effective Multidimensional Persistence for Graph Representation Learning. (arXiv:2401.13713v1 [cs.LG])

Authors: Ignacio Segovia-Dominguez, Yuzhou Chen, Cuneyt G. Akcora, Zhiwei Zhen, Murat Kantarcioglu, Yulia R. Gel, Baris Coskunuzer

Topological data analysis (TDA) is gaining prominence across a wide spectrum of machine learning tasks that span from manifold learning to graph classification. A pivotal technique within TDA is persistent homology (PH), which furnishes an exclusive topological imprint of data by tracing the evolution of latent structures as a scale parameter changes. Present PH tools are confined to analyzing data through a single filter parameter. However, many scenarios necessitate the consideration of multiple relevant parameters to attain finer insights into the data. We address this issue by introducing the Effective Multidimensional Persistence (EMP) framework. This framework empowers the exploration of data by simultaneously varying multiple scale parameters. The framework integrates descriptor functions into the analysis process, yielding a highly expressive data summary. It seamlessly integrates established single PH summaries into multidimensional counterparts like EMP Landscapes, Silhouettes, Images, and Surfaces. These summaries represent data's multidimensional aspects as matrices and arrays, aligning effectively with diverse ML models. We provide theoretical guarantees and stability proofs for EMP summaries. We demonstrate EMP's utility in graph classification tasks, showing its effectiveness. Results reveal that EMP enhances various single PH descriptors, outperforming cutting-edge methods on multiple benchmark datasets.

Value-Driven Mixed-Precision Quantization for Patch-Based Inference on Microcontrollers. (arXiv:2401.13714v1 [cs.CV])

Authors: Wei Tao, Shenglin He, Kai Lu, Xiaoyang Qu, Guokuan Li, Jiguang Wan, Jianzong Wang, Jing Xiao

Deploying neural networks on microcontroller units (MCUs) presents substantial challenges due to their constrained computation and memory resources. Previous research has explored patch-based inference as a strategy to conserve memory without sacrificing model accuracy. However, this technique suffers from severe redundant computation overhead, leading to a substantial increase in execution latency. A feasible solution to address this issue is mixed-precision quantization, but it faces the challenges of accuracy degradation and a time-consuming search process. In this paper, we propose QuantMCU, a novel patch-based inference method that utilizes value-driven mixed-precision quantization to reduce redundant computation. We first utilize value-driven patch classification (VDPC) to maintain the model accuracy. VDPC classifies patches into two classes based on whether they contain outlier values. For patches containing outlier values, we apply 8-bit quantization to the feature maps on the dataflow branches that follow. In addition, for patches without outlier values, we utilize value-driven quantization search (VDQS) on the feature maps of their following dataflow branches to reduce search time. Specifically, VDQS introduces a novel quantization search metric that takes into account both computation and accuracy, and it employs entropy as an accuracy representation to avoid additional training. VDQS also adopts an iterative approach to determine the bitwidth of each feature map to further accelerate the search process. Experimental results on real-world MCU devices show that QuantMCU can reduce computation by 2.2x on average while maintaining comparable model accuracy compared to state-of-the-art patch-based inference methods.
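
A toy sketch of the outlier-based patch split that VDPC performs (the z-score rule and the 4-/8-bit choice here are illustrative assumptions, not the paper's exact criterion):

    import numpy as np

    def has_outliers(patch, mu, sigma, z_thresh=3.0):
        """Flag a patch containing activations far from the global statistics
        (illustrative stand-in for VDPC's outlier test)."""
        return bool(np.any(np.abs(patch - mu) > z_thresh * sigma))

    patches = [np.random.randn(8, 8) for _ in range(16)]   # toy feature patches
    flat = np.concatenate([p.ravel() for p in patches])
    mu, sigma = flat.mean(), flat.std()

    for p in patches:
        bits = 8 if has_outliers(p, mu, sigma) else 4
        # ... quantize the patch's downstream feature maps at `bits` bits ...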

Can I trust my fake data -- A comprehensive quality assessment framework for synthetic tabular data in healthcare. (arXiv:2401.13716v1 [cs.LG])

Authors: Vibeke Binz Vallevik, Aleksandar Babic, Serena Elizabeth Marshall, Severin Elvatun, Helga Brøgger, Sharmini Alagaratnam, Bjørn Edwin, Narasimha Raghavan Veeraragavan, Anne Kjersti Befring, Jan Franz Nygård

Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing, and validation. In response to privacy concerns and regulatory requirements, using synthetic data (SD) has been suggested. Synthetic data is created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been suggested, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. We performed a comprehensive literature review on the use of quality evaluation metrics for SD within the scope of tabular healthcare data and SD made using deep generative methods. Based on this review and our collective team experience, we developed a conceptual framework for quality assurance. Its applicability was benchmarked against a practical case from the Dutch National Cancer Registry. We present a conceptual framework for quality assurance of SD for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review; the overwhelming focus was on statistical similarity using distance metrics, while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of SD.

Inference Attacks Against Face Recognition Model without Classification Layers. (arXiv:2401.13719v1 [cs.CV])

Authors: Yuanqing Huang, Huilong Chen, Yinggui Wang, Lei Wang

Face recognition (FR) has been applied to nearly every aspect of daily life, but it is always accompanied by the underlying risk of leaking private information. At present, almost all attack models against FR rely heavily on the presence of a classification layer. However, in practice, the FR model can obtain complex features of the input via the model backbone, and then compare them with the target for inference, which does not explicitly involve the outputs of a classification layer adopting logit or other losses. In this work, we advocate a novel inference attack composed of two stages for practical FR models without a classification layer. The first stage is the membership inference attack. Specifically, we analyze the distances between the intermediate features and batch normalization (BN) parameters. The results indicate that this distance is a critical metric for membership inference. We thus design a simple but effective attack model that can determine whether a face image is from the training dataset or not. The second stage is the model inversion attack, where sensitive private data is reconstructed using a pre-trained generative adversarial network (GAN) guided by the attack model from the first stage. To the best of our knowledge, the proposed attack model is the very first in the literature developed for FR models without a classification layer. We illustrate the application of the proposed attack model in the establishment of privacy-preserving FR techniques.
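
A simplified reading of the first-stage signal, sketched in PyTorch (the pooling, the distance form, and the threshold fitting are assumptions; the paper's attack model is learned, not a fixed rule):

    import torch

    def bn_distance_score(feats, bn):
        """Distance between pooled intermediate features (batch, C) and a
        BatchNorm layer's running statistics; smaller values suggest membership."""
        mu = bn.running_mean
        std = torch.sqrt(bn.running_var + bn.eps)
        return ((feats.mean(dim=0) - mu) / std).norm().item()

    # Usage sketch: `backbone` and `bn_layer` come from the target FR model,
    # `faces` is a batch of probe images; the threshold is fit on shadow data.
    # score = bn_distance_score(backbone(faces), bn_layer)
    # is_member = score < threshold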

Uncertainty-Guided Alignment for Unsupervised Domain Adaptation in Regression. (arXiv:2401.13721v1 [cs.CV])

Authors: Ismail Nejjar, Gaetan Frusque, Florent Forest, Olga Fink

Unsupervised Domain Adaptation for Regression (UDAR) aims to adapt a model from a labeled source domain to an unlabeled target domain for regression tasks. Recent successful works in UDAR mostly focus on subspace alignment, involving the alignment of a selected subspace within the entire feature space. This contrasts with the feature alignment methods used for classification, which aim at aligning the entire feature space and have proven effective but are less so in regression settings. Specifically, while classification aims to identify separate clusters across the entire embedding dimension, regression induces less structure in the data representation, necessitating additional guidance for efficient alignment. In this paper, we propose an effective method for UDAR by incorporating guidance from uncertainty. Our approach serves a dual purpose: providing a measure of confidence in predictions and acting as a regularization of the embedding space. Specifically, we leverage the Deep Evidential Learning framework, which outputs both predictions and uncertainties for each input sample. We propose aligning the parameters of higher-order evidential distributions between the source and target domains using traditional alignment methods at the feature or posterior level. Additionally, we propose to augment the feature space representation by mixing source samples with pseudo-labeled target samples based on label similarity. This cross-domain mixing strategy produces more realistic samples than random mixing and introduces higher uncertainty, facilitating further alignment. We demonstrate the effectiveness of our approach on four benchmarks for UDAR, on which we outperform existing methods.

Supporting Sensemaking of Large Language Model Outputs at Scale. (arXiv:2401.13726v1 [cs.HC])

Authors: Katy Ilonka Gero, Chelse Swoopes, Ziwei Gu, Jonathan K. Kummerfeld, Elena L. Glassman

Large language models (LLMs) are capable of generating multiple responses to a single prompt, yet little effort has been expended to help end-users or system designers make use of this capability. In this paper, we explore how to present many LLM responses at once. We design five features, which include both pre-existing and novel methods for computing similarities and differences across textual documents, as well as how to render their outputs. We report on a controlled user study (n=24) and eight case studies evaluating these features and how they support users in different tasks. We find that the features support a wide variety of sensemaking tasks and even make tasks previously considered to be too difficult by our participants now tractable. Finally, we present design guidelines to inform future explorations of new LLM interfaces.

Conformal Prediction Sets Improve Human Decision Making. (arXiv:2401.13744v1 [cs.LG])

Authors: Jesse C. Cresswell, Yi Sui, Bhargava Kumar, Noël Vouitsis

In response to everyday queries, humans explicitly signal uncertainty and offer alternative answers when they are unsure. Machine learning models that output calibrated prediction sets through conformal prediction mimic this human behaviour; larger sets signal greater uncertainty while providing alternatives. In this work, we study the usefulness of conformal prediction sets as an aid for human decision making by conducting a pre-registered randomized controlled trial with conformal prediction sets provided to human subjects. With statistical significance, we find that when humans are given conformal prediction sets their accuracy on tasks improves compared to fixed-size prediction sets with the same coverage guarantee. The results show that quantifying model uncertainty with conformal prediction is helpful for human-in-the-loop decision making and human-AI teams.
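
For reference, a minimal split-conformal construction of such prediction sets (the standard recipe, with the usual textbook nonconformity score and alpha, not necessarily the authors' exact configuration):

    import numpy as np

    def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
        """Split conformal prediction: returns one label set per test point
        with 1 - alpha marginal coverage; inputs are softmax probabilities."""
        n = len(cal_labels)
        scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity
        q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n,
                        method="higher")
        return [np.where(1.0 - p <= q)[0] for p in test_probs]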

A Systematic Approach to Robustness Modelling for Deep Convolutional Neural Networks. (arXiv:2401.13751v1 [cs.LG])

Authors: Charles Meyers, Mohammad Reza Saleh Sedghpour, Tommy Löfstedt, Erik Elmroth

Convolutional neural networks have been shown to be widely applicable to a large number of fields when large amounts of labelled data are available. The recent trend has been to use models with increasingly larger sets of tunable parameters to increase model accuracy, reduce model loss, or create more adversarially robust models -- goals that are often at odds with one another. In particular, recent theoretical work raises questions about the ability of ever-larger models to generalize to data outside of the controlled train and test sets. As such, we examine the role of the number of hidden layers in the ResNet model, demonstrated on the MNIST, CIFAR10, and CIFAR100 datasets. We test a variety of parameters including the size of the model, the floating-point precision, and the noise level of both the training data and the model output. To encapsulate the model's predictive power and computational cost, we provide a method that uses induced failures to model the probability of failure as a function of time and relate that to a novel metric that allows us to quickly determine whether or not the cost of training a model outweighs the cost of attacking it. Using this approach, we are able to approximate the expected failure rate using a small number of specially crafted samples rather than increasingly larger benchmark datasets. We demonstrate the efficacy of this technique on both the MNIST and CIFAR10 datasets using 8-, 16-, 32-, and 64-bit floating-point numbers, various data pre-processing techniques, and several attacks on five configurations of the ResNet model. Then, using empirical measurements, we examine the various trade-offs between cost, robustness, latency, and reliability to find that larger models do not significantly aid in adversarial robustness despite costing significantly more to train.

NLICE: Synthetic Medical Record Generation for Effective Primary Healthcare Differential Diagnosis. (arXiv:2401.13756v1 [cs.LG])

Authors: Zaid Al-Ars, Obinna Agba, Zhuoran Guo, Christiaan Boerkamp, Ziyaad Jaber, Tareq Jaber

This paper offers a systematic method for creating medical knowledge-grounded patient records for use in activities involving differential diagnosis. Additionally, an assessment of machine learning models that can differentiate between various conditions based on given symptoms is also provided. We use a public disease-symptom data source called SymCat in combination with Synthea to construct the patient records. In order to increase the expressive nature of the synthetic data, we use a medically-standardized symptom modeling method called NLICE to augment the synthetic data with additional contextual information for each condition. In addition, Naive Bayes and Random Forest models are evaluated and compared on the synthetic data. The paper shows how to successfully construct SymCat-based and NLICE-based datasets. We also show results for the effectiveness of using the datasets to train predictive disease models. The SymCat-based dataset is able to train Naive Bayes and Random Forest models yielding Top-1 accuracy scores of 58.8% and 57.1%, respectively. In contrast, the NLICE-based dataset improves the results, with a Top-1 accuracy of 82.0% and Top-5 accuracy values of more than 90% for both models. Our proposed data generation approach removes a major barrier to the application of artificial intelligence methods in the healthcare domain. Our novel NLICE symptom modeling approach addresses the incomplete and insufficient information problem in the current binary symptom representation approach. The NLICE code is open-sourced at https://github.com/guozhuoran918/NLICE.
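
A bare-bones version of the model comparison described above (random data stands in for the SymCat/NLICE records; encodings and sizes are assumptions):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB

    # Hypothetical encoding: rows = synthetic patients, binary symptom columns,
    # y = condition label (NLICE would add contextual columns per symptom).
    X = np.random.randint(0, 2, size=(1000, 50))
    y = np.random.randint(0, 10, size=1000)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

    for model in (BernoulliNB(), RandomForestClassifier(n_estimators=200)):
        print(type(model).__name__, model.fit(Xtr, ytr).score(Xte, yte))  # Top-1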

Multiview Graph Learning with Consensus Graph. (arXiv:2401.13769v1 [eess.SP])

Authors: Abdullah Karaaslanli, Selin Aviyente

Graph topology inference, i.e., learning graphs from a given set of nodal observations, is a significant task in many application domains. Existing approaches are mostly limited to learning a single graph assuming that the observed data is homogeneous. This is problematic because many modern datasets are heterogeneous or mixed and involve multiple related graphs, i.e., multiview graphs. Recent work proposing to learn multiview graphs ensures the similarity of learned view graphs through pairwise regularization, where each pair of views is encouraged to have similar structures. However, this approach cannot infer the shared structure across views. In this work, we propose an alternative method based on consensus regularization, where views are ensured to be similar through a learned consensus graph representing the common structure of the views. In particular, we propose an optimization problem where graph data is assumed to be smooth over the multiview graph and the topology of the individual views and that of the consensus graph are learned simultaneously. Our optimization problem is designed to be general in the sense that different regularization functions can be used depending on what the shared structure across views is. Moreover, we propose two regularization functions that extend fused and group graphical lasso to consensus-based regularization. The proposed multiview graph learning method is evaluated on simulated data and shown to have better performance than existing methods. It is also employed to infer the functional brain connectivity networks of multiple subjects from their electroencephalogram (EEG) recordings. The proposed method reveals the structure shared by subjects as well as the characteristics unique to each subject.

Faster Convergence with Less Communication: Broadcast-Based Subgraph Sampling for Decentralized Learning over Wireless Networks. (arXiv:2401.13779v1 [cs.IT])

Authors: Daniel Pérez Herrera, Zheng Chen, Erik G. Larsson

Consensus-based decentralized stochastic gradient descent (D-SGD) is a widely adopted algorithm for decentralized training of machine learning models across networked agents. A crucial part of D-SGD is the consensus-based model averaging, which heavily relies on information exchange and fusion among the nodes. Specifically, for consensus averaging over wireless networks, communication coordination is necessary to determine when and how a node can access the channel and transmit (or receive) information to (or from) its neighbors. In this work, we propose $\texttt{BASS}$, a broadcast-based subgraph sampling method designed to accelerate the convergence of D-SGD while considering the actual communication cost per iteration. $\texttt{BASS}$ creates a set of mixing matrix candidates that represent sparser subgraphs of the base topology. In each consensus iteration, one mixing matrix is sampled, leading to a specific scheduling decision that activates multiple collision-free subsets of nodes. The sampling occurs in a probabilistic manner, and the elements of the mixing matrices, along with their sampling probabilities, are jointly optimized. Simulation results demonstrate that $\texttt{BASS}$ enables faster convergence with fewer transmission slots compared to existing link-based scheduling methods. In conclusion, the inherent broadcasting nature of wireless channels offers intrinsic advantages in accelerating the convergence of decentralized optimization and learning.
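
The core iteration can be sketched in a few lines (the candidate matrices and their sampling probabilities are taken as given here; in BASS they are jointly optimized):

    import numpy as np

    def dsgd_step(x, grads, candidates, probs, lr=0.1):
        """One D-SGD iteration with subgraph sampling: draw a mixing matrix
        (a sparser subgraph of the base topology), average, then apply local
        gradient updates. x: (n_nodes, dim) stacked local models."""
        W = candidates[np.random.choice(len(candidates), p=probs)]
        return W @ x - lr * grads     # consensus averaging + SGD step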

Tweets to Citations: Unveiling the Impact of Social Media Influencers on AI Research Visibility. (arXiv:2401.13782v1 [cs.DL])

Authors: Iain Xie Weissburg, Mehir Arora, Liangming Pan, William Yang Wang

As the number of accepted papers at AI and ML conferences reaches into the thousands, it has become unclear how researchers access and read research publications. In this paper, we investigate the role of social media influencers in enhancing the visibility of machine learning research, particularly the citation counts of papers they share. We have compiled a comprehensive dataset of over 8,000 papers, spanning tweets from December 2018 to October 2023, alongside 1:1 matched controls based on publication year, venue, and abstract topics. Our analysis reveals a significant increase in citations for papers endorsed by these influencers, with median citation counts 2-3 times higher than those of the control group. Additionally, the study delves into the geographic, gender, and institutional diversity of highlighted authors. These findings highlight the expanding influence of social media in scholarly communication and underscore the importance of an evolving ecosystem in today's digital academic landscape.

Traffic Pattern Classification in Smart Cities Using Deep Recurrent Neural Network. (arXiv:2401.13794v1 [cs.LG])

Authors: Ayad Ghany Ismaeel, Krishnadas Janardhanan, Manishankar Sankar, Yuvaraj Natarajan, Sarmad Nozad Mahmood, Sameer Alani, Akram H. Shather

This paper examines the use of deep recurrent neural networks to classify traffic patterns in smart cities. We propose a novel approach to traffic pattern classification based on deep recurrent neural networks, which can effectively capture the dynamic and sequential features of traffic patterns. The proposed model combines convolutional and recurrent layers to extract features from traffic pattern data and a SoftMax layer to classify traffic patterns. The model is evaluated on a real-world traffic pattern dataset and compared with existing classification methods. Experimental results show that the proposed model outperforms existing methods regarding accuracy, precision, recall, and F1 score, classifying traffic patterns with a precision as high as 95%. Furthermore, we provide an in-depth analysis of the results and discuss the implications of the proposed model for smart cities.
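
An illustrative convolutional-recurrent classifier in this spirit (layer sizes and the framework choice are assumptions, not the paper's exact architecture):

    import torch
    import torch.nn as nn

    class TrafficClassifier(nn.Module):
        """Conv1d extracts local features from traffic sequences, an LSTM
        captures their dynamics, and a linear head feeds the softmax."""
        def __init__(self, n_features=8, n_classes=4):
            super().__init__()
            self.conv = nn.Conv1d(n_features, 32, kernel_size=3, padding=1)
            self.lstm = nn.LSTM(32, 64, batch_first=True)
            self.head = nn.Linear(64, n_classes)

        def forward(self, x):                  # x: (batch, time, features)
            h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
            _, (hn, _) = self.lstm(h)
            return self.head(hn[-1])           # logits; softmax at inference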

Don't Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning. (arXiv:2401.13796v1 [cs.LG])

Authors: Andrea Apicella, Francesco Isgrò, Roberto Prevete

Machine Learning (ML) has revolutionized various domains, offering predictive capabilities in several areas. However, with the increasing accessibility of ML tools, many practitioners, lacking deep ML expertise, adopt a "push the button" approach, utilizing user-friendly interfaces without a thorough understanding of underlying algorithms. While this approach provides convenience, it raises concerns about the reliability of outcomes, leading to challenges such as incorrect performance evaluation. This paper addresses a critical issue in ML, known as data leakage, where unintended information contaminates the training data, impacting model performance evaluation. Users, due to a lack of understanding, may inadvertently overlook crucial steps, leading to optimistic performance estimates that may not hold in real-world scenarios. The discrepancy between evaluated and actual performance on new data is a significant concern. In particular, this paper categorizes data leakage in ML, discussing how certain conditions can propagate through the ML workflow. Furthermore, it explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning, and compares standard inductive ML with transductive ML frameworks. The conclusion summarizes key findings, emphasizing the importance of addressing data leakage for robust and reliable ML applications.
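
A canonical instance of the leakage the paper warns about is preprocessing fitted before the train/test split; a minimal sketch of the leaky versus correct workflow (toy data):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = np.random.randn(200, 5), np.random.randint(0, 2, 200)

    # Leaky: the scaler sees test-set statistics before the split.
    X_leaky = StandardScaler().fit_transform(X)
    Xtr, Xte, ytr, yte = train_test_split(X_leaky, y, random_state=0)

    # Correct: fit preprocessing on the training split only.
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(Xtr)
    Xtr, Xte = scaler.transform(Xtr), scaler.transform(Xte)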

Investigating the Efficacy of Large Language Models for Code Clone Detection. (arXiv:2401.13802v1 [cs.SE])

Authors: Mohamad Khajezade, Jie Wu, Fatemeh Hendijani Fard, Gema Rodríguez-Pérez, Mohamed Sami Shehata

Large Language Models (LLMs) have demonstrated remarkable success in various natural language processing and software engineering tasks, such as code generation. LLMs are mainly utilized in the prompt-based zero/few-shot paradigm to guide the model in accomplishing the task. GPT-based models are among the most popular ones studied for tasks such as code comment generation or test generation; these are 'generative' tasks. However, there is limited research on the usage of LLMs for 'non-generative' tasks such as classification using the prompt-based paradigm. In this preliminary exploratory study, we investigated the applicability of LLMs for Code Clone Detection (CCD), a non-generative task. By building a mono-lingual and cross-lingual CCD dataset derived from CodeNet, we first investigated two different prompts using ChatGPT to detect Type-4 code clones in Java-Java and Java-Ruby pairs in a zero-shot setting. We then conducted an analysis to understand the strengths and weaknesses of ChatGPT in CCD. ChatGPT surpasses the baselines in cross-language CCD, attaining an F1-score of 0.877, and achieves comparable performance to fully fine-tuned models for mono-lingual CCD, with an F1-score of 0.878. Also, the prompt and the difficulty level of the problems have an impact on the performance of ChatGPT. Finally, we provide insights and future directions based on our initial analysis.
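
A zero-shot CCD query of the kind studied here might be issued as follows (the prompt wording and model choice are assumptions; the paper's exact prompts differ):

    from openai import OpenAI

    client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment
    PROMPT = ("Do the following two code snippets solve the same problem? "
              "Answer yes or no.\n\n# Snippet 1 (Java)\n{a}\n\n"
              "# Snippet 2 (Ruby)\n{b}")

    def is_clone(a: str, b: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": PROMPT.format(a=a, b=b)}])
        return resp.choices[0].message.content.strip().lower().startswith("yes")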

Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face. (arXiv:2401.13822v1 [cs.LG])

Authors: Xinyu Yang, Weixin Liang, James Zou

Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding of current dataset documentation practices. To shed light on this question, here we take Hugging Face -- one of the largest platforms for sharing and collaborating on ML models and datasets -- as a prominent case study. By analyzing all 7,433 dataset documentation files on Hugging Face, our investigation provides an overview of the Hugging Face dataset ecosystem and insights into dataset documentation practices, yielding 5 main findings: (1) The dataset card completion rate shows marked heterogeneity correlated with dataset popularity. (2) A granular examination of each section within the dataset card reveals that practitioners seem to prioritize Dataset Description and Dataset Structure sections, while the Considerations for Using the Data section receives the lowest proportion of content. (3) By analyzing the subsections within each section and utilizing topic modeling to identify key topics, we uncover what is discussed in each section, and underscore significant themes encompassing both technical and social impacts, as well as limitations within the Considerations for Using the Data section. (4) Our findings also highlight the need for improved accessibility and reproducibility of datasets in the Usage sections. (5) In addition, our human annotation evaluation emphasizes the pivotal role of comprehensive dataset content in shaping individuals' perceptions of a dataset card's overall quality. Overall, our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis and underlines the need for more thorough dataset documentation in machine learning research.

Traffic Learning and Proactive UAV Trajectory Planning for Data Uplink in Markovian IoT Models. (arXiv:2401.13827v1 [cs.LG])

Authors: Eslam Eldeeb, Mohammad Shehab, Hirley Alves

The age of information (AoI) is used to measure the freshness of data. In IoT networks, traditional resource management schemes rely on a message exchange between the devices and the base station (BS) before communication, which causes high AoI, high energy consumption, and low reliability. Unmanned aerial vehicles (UAVs) as flying BSs have many advantages in minimizing the AoI, saving energy, and improving throughput. In this paper, we present a novel learning-based framework that estimates the traffic arrival of IoT devices based on Markovian events. The learning proceeds to optimize the trajectory of multiple UAVs and their scheduling policy. First, the BS predicts the future traffic of the devices. We compare two traffic predictors: the forward algorithm (FA) and long short-term memory (LSTM). Afterward, we propose a deep reinforcement learning (DRL) approach to optimize the policy of each UAV. Finally, we design an optimized reward function for the proposed DRL approach. Simulation results show that the proposed algorithm outperforms the random-walk (RW) baseline model regarding the AoI, scheduling accuracy, and transmission power.
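
For concreteness, the forward-algorithm predictor can be written in a few lines of numpy (the two-state idle/active chain below is a toy assumption):

    import numpy as np

    def forward_predict(obs, A, B, pi):
        """HMM forward algorithm: filter the hidden state from an observation
        sequence, then predict the next-step state distribution.
        A: (S, S) transitions, B: (S, O) emissions, pi: (S,) initial."""
        alpha = pi * B[:, obs[0]]
        alpha /= alpha.sum()
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
            alpha /= alpha.sum()               # normalize for stability
        return alpha @ A                       # next-step prediction

    A = np.array([[0.9, 0.1], [0.3, 0.7]])     # toy idle/active dynamics
    B = np.eye(2); pi = np.array([0.5, 0.5])
    print(forward_predict([0, 0, 1], A, B, pi))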

The Calibration Gap between Model and Human Confidence in Large Language Models. (arXiv:2401.13835v1 [cs.LG])

Authors: Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas Mayer, Padhraic Smyth

For large language models (LLMs) to be trusted by humans they need to be well-calibrated in the sense that they can accurately assess and communicate how likely it is that their predictions are correct. Recent work has focused on the quality of internal LLM confidence assessments, but the question remains of how well LLMs can communicate this internal model confidence to human users. This paper explores the disparity between external human confidence in an LLM's responses and the internal confidence of the model. Through experiments involving multiple-choice questions, we systematically examine human users' ability to discern the reliability of LLM outputs. Our study focuses on two key areas: (1) assessing users' perception of true LLM confidence and (2) investigating the impact of tailored explanations on this perception. The research highlights that default explanations from LLMs often lead to user overestimation of both the model's confidence and its accuracy. By modifying the explanations to more accurately reflect the LLM's internal confidence, we observe a significant shift in user perception, aligning it more closely with the model's actual confidence levels. This adjustment in explanatory approach demonstrates potential for enhancing user trust and accuracy in assessing LLM outputs. The findings underscore the importance of transparent communication of confidence levels in LLMs, particularly in high-stakes applications where understanding the reliability of AI-generated information is essential.

Machine learning for industrial sensing and control: A survey and practical perspective. (arXiv:2401.13836v1 [eess.SY])

Authors: Nathan P. Lawrence, Seshu Kumar Damarla, Jong Woo Kim, Aditya Tulsyan, Faraz Amjad, Kai Wang, Benoit Chachuat, Jong Min Lee, Biao Huang, R. Bhushan Gopaluni

With the rise of deep learning, there has been renewed interest within the process industries to utilize data on large-scale nonlinear sensing and control problems. We identify key statistical and machine learning techniques that have seen practical success in the process industries. To do so, we start with hybrid modeling to provide a methodological framework underlying core application areas: soft sensing, process optimization, and control. Soft sensing contains a wealth of industrial applications of statistical and machine learning methods. We quantitatively identify research trends, allowing insight into the most successful techniques in practice.

We consider two distinct flavors for data-driven optimization and control: hybrid modeling in conjunction with mathematical programming techniques and reinforcement learning. Throughout these application areas, we discuss their respective industrial requirements and challenges.

A common challenge is the interpretability and efficiency of purely data-driven methods. This suggests a need to carefully balance deep learning techniques with domain knowledge. As a result, we highlight ways prior knowledge may be integrated into industrial machine learning applications. The treatment of methods, problems, and applications presented here is poised to inform and inspire practitioners and researchers to develop impactful data-driven sensing, optimization, and control solutions in the process industries.

Enumerating the k-fold configurations in multi-class classification problems. (arXiv:2401.13843v1 [cs.LG])

Authors: Attila Fazekas, Gyorgy Kovacs

K-fold cross-validation is a widely used tool for assessing classifier performance. The reproducibility crisis faced by artificial intelligence partly results from the irreproducibility of reported k-fold cross-validation-based performance scores. Recently, we introduced numerical techniques to test the consistency of claimed performance scores and experimental setups. In a crucial use case, the method relies on the combinatorial enumeration of all k-fold configurations, for which we proposed an algorithm in the binary classification case. In this work, we extend that enumeration to multi-class classification problems.
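
In the binary case, the enumeration amounts to listing how the positives can be split across folds of fixed size; a brute-force sketch (equal-size folds are an assumption made for simplicity):

    from itertools import product

    def kfold_configurations(p, n, k):
        """Enumerate all (positives, negatives) splits of p positive and n
        negative samples across k folds of (near-)equal size."""
        total = p + n
        sizes = [total // k + (1 if i < total % k else 0) for i in range(k)]
        for pos in product(*[range(min(p, s) + 1) for s in sizes]):
            if sum(pos) == p:
                yield [(c, s - c) for c, s in zip(pos, sizes)]

    print(len(list(kfold_configurations(p=5, n=7, k=3))))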

A V2X-based Privacy Preserving Federated Measuring and Learning System. (arXiv:2401.13848v1 [cs.LG])

Authors: Levente Alekszejenkó, Tadeusz Dobrowiecki

Future autonomous vehicles (AVs) will use a variety of sensors that generate a vast amount of data. Naturally, this data not only serves self-driving algorithms but can also assist other vehicles or the infrastructure in real-time decision-making. Consequently, vehicles shall exchange their measurement data over Vehicle-to-Everything (V2X) technologies. Moreover, predicting the state of the road network might be beneficial too. With such a prediction, we might mitigate road congestion, balance parking lot usage, or optimize the traffic flow. That would decrease transportation costs as well as reduce its environmental impact.

In this paper, we propose a federated measurement and learning system that provides real-time data to fellow vehicles over Vehicle-to-Vehicle (V2V) communication while also operating a federated learning (FL) scheme over the Vehicle-to-Network (V2N) link to create a predictive model of the transportation network. As we are yet to have real-world AV data, we model it with a non-IID (independent and identically distributed) dataset to evaluate the capabilities of the proposed system in terms of performance and privacy. Results indicate that the proposed FL scheme improves learning performance and prevents eavesdropping at the aggregator server side.

Scaling NVIDIA's multi-speaker multi-lingual TTS systems with voice cloning to Indic Languages. (arXiv:2401.13851v1 [cs.SD])

Authors: Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro

In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.

Embedding Attack Project (Work Report). (arXiv:2401.13854v1 [cs.LG])

Authors: Jiameng Pu, Zafar Takhirov

This report summarizes all the membership inference attack (MIA) experiments of the Embedding Attack Project, including threat models, experimental setup, experimental results, findings, and discussion. Current results cover the evaluation of two main MIA strategies (loss-based and embedding-based MIAs) on 6 AI models ranging from computer vision to language modelling. Two further experiments, on MIA defense and on neighborhood-comparison embedding attacks, are ongoing.

The current work on MIA and PIA can be summarized into six conclusions: (1) the amount of overfitting is directly proportional to a model's vulnerability; (2) early embedding layers in the model are less susceptible to privacy leaks; (3) deeper model layers contain more membership information; (4) models are more vulnerable to MIA if both embeddings and corresponding training labels are compromised; (5) it is possible to use pseudo-labels to increase MIA success; and (6) although MIA and PIA success rates are proportional, reducing MIA does not necessarily reduce PIA.
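
For reference, the loss-based strategy reduces to a threshold rule; a minimal sketch (the threshold choice is an illustrative assumption; practical attacks calibrate it on shadow models):

    import numpy as np

    def loss_based_mia(member_losses, nonmember_losses, query_losses):
        """Loss-threshold MIA: samples whose loss falls below a threshold
        fit on shadow data are predicted to be training members."""
        thr = 0.5 * (np.mean(member_losses) + np.mean(nonmember_losses))
        return query_losses < thr          # True => predicted member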

Inverse Molecular Design with Multi-Conditional Diffusion Guidance. (arXiv:2401.13858v1 [cs.LG])

Authors: Gang Liu, Jiaxin Xu, Tengfei Luo, Meng Jiang

Inverse molecular design with diffusion models holds great potential for advancements in material and drug discovery. Despite success in unconditional molecule generation, integrating multiple properties such as synthetic score and gas permeability as condition constraints into diffusion models remains unexplored. We introduce multi-conditional diffusion guidance. The proposed Transformer-based denoising model has a condition encoder that learns the representations of numerical and categorical conditions. The denoising model, consisting of a structure encoder-decoder, is trained for denoising under the representation of conditions. The diffusion process becomes graph-dependent to accurately estimate graph-related noise in molecules, unlike the previous models that focus solely on the marginal distributions of atoms or bonds. We extensively validate our model for multi-conditional polymer and small molecule generation. Results demonstrate our superiority across metrics from distribution learning to condition control for molecular properties. An inverse polymer design task for gas separation with feedback from domain experts further demonstrates its practical utility.

Edge Conditional Node Update Graph Neural Network for Multi-variate Time Series Anomaly Detection. (arXiv:2401.13872v1 [cs.LG])

Authors: Hayoung Jo, Seong-Whan Lee

With the rapid advancement of cyber-physical systems, the increasing number of sensors has significantly complicated manual monitoring of system states. Consequently, graph-based time-series anomaly detection methods have gained attention due to their ability to explicitly represent relationships between sensors. However, these methods often apply a uniform source-node representation across all connected target nodes, even when updating different target-node representations. Moreover, the graph attention mechanism, commonly used to infer unknown graph structures, can constrain the diversity of source-node representations. In this paper, we introduce the Edge Conditional Node-update Graph Neural Network (ECNU-GNN). Our model, equipped with an edge conditional node update module, dynamically transforms source-node representations based on connected edges to aptly represent target nodes. We validate performance on three real-world datasets: SWaT, WADI, and PSM. Our model demonstrates 5.4%, 12.4%, and 6.0% higher performance, respectively, compared to the best-performing baseline models in terms of F1 score.
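
One simple way to realize an edge-conditional transform of source representations is a per-edge scale-and-shift; the FiLM-style module below is an illustrative simplification, not the paper's exact design:

    import torch
    import torch.nn as nn

    class EdgeConditionalUpdate(nn.Module):
        """Transforms each source-node representation conditioned on its edge
        before aggregation, so different targets can receive different views
        of the same source node."""
        def __init__(self, d_node, d_edge):
            super().__init__()
            self.film = nn.Linear(d_edge, 2 * d_node)  # per-edge scale/shift

        def forward(self, h_src, e):   # h_src: (E, d_node), e: (E, d_edge)
            scale, shift = self.film(e).chunk(2, dim=-1)
            return h_src * torch.sigmoid(scale) + shift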

Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts?. (arXiv:2401.13875v1 [stat.ML])

Authors: Huy Nguyen, Pedram Akbarian, Nhat Ho

Dense-to-sparse gating mixture of experts (MoE) has recently become an effective alternative to the well-known sparse MoE. Rather than fixing the number of activated experts as in the latter model, which could limit the investigation of potential experts, the former model utilizes the temperature to control the softmax weight distribution and the sparsity of the MoE during training in order to stabilize the expert specialization. Nevertheless, while there are previous attempts to theoretically comprehend the sparse MoE, a comprehensive analysis of the dense-to-sparse gating MoE has remained elusive. Therefore, we aim to explore the impacts of the dense-to-sparse gate on the maximum likelihood estimation under the Gaussian MoE in this paper. We demonstrate that due to interactions between the temperature and other model parameters via some partial differential equations, the convergence rates of parameter estimations are slower than any polynomial rates, and could be as slow as $\mathcal{O}(1/\log(n))$, where $n$ denotes the sample size. To address this issue, we propose using a novel activation dense-to-sparse gate, which routes the output of a linear layer to an activation function before delivering it to the softmax function. By imposing linear independence conditions on the activation function and its derivatives, we show that the parameter estimation rates are significantly improved to polynomial rates.
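
The temperature mechanism itself is one line; a quick numerical illustration of how it moves the gate from dense to sparse (values are illustrative):

    import torch

    def gate(logits, temperature):
        """Softmax gating with temperature: high temperature spreads weight
        over all experts (dense); low temperature concentrates it (sparse)."""
        return torch.softmax(logits / temperature, dim=-1)

    logits = torch.tensor([2.0, 1.0, 0.1])
    for t in (10.0, 1.0, 0.1):
        print(t, gate(logits, t).tolist())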

Constant Stepsize Q-learning: Distributional Convergence, Bias and Extrapolation. (arXiv:2401.13884v1 [stat.ML])

Authors: Yixuan Zhang, Qiaomin Xie

Stochastic Approximation (SA) is a widely used algorithmic approach in various fields, including optimization and reinforcement learning (RL). Among RL algorithms, Q-learning is particularly popular due to its empirical success. In this paper, we study asynchronous Q-learning with constant stepsize, which is commonly used in practice for its fast convergence. By connecting the constant stepsize Q-learning to a time-homogeneous Markov chain, we show the distributional convergence of the iterates in Wasserstein distance and establish its exponential convergence rate. We also establish a Central Limit Theorem for Q-learning iterates, demonstrating the asymptotic normality of the averaged iterates. Moreover, we provide an explicit expansion of the asymptotic bias of the averaged iterate in the stepsize. Specifically, the bias is proportional to the stepsize up to higher-order terms and we provide an explicit expression for the linear coefficient. This precise characterization of the bias allows the application of the Richardson-Romberg (RR) extrapolation technique to construct a new estimate that is provably closer to the optimal Q function. Numerical results corroborate our theoretical findings on the improvement of the RR extrapolation method.
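
A toy sketch of the extrapolation idea (the environment interface and averaging scheme are illustrative assumptions, not the paper's setting):

    import numpy as np

    def avg_q_learning(env_step, n_iters, alpha, gamma=0.9, nS=4, nA=2, seed=0):
        """Constant-stepsize asynchronous Q-learning; returns the averaged
        iterate. `env_step(s, a, rng) -> (next_state, reward)` is user-supplied."""
        rng = np.random.default_rng(seed)
        Q = np.zeros((nS, nA)); Qbar = np.zeros_like(Q); s = 0
        for t in range(n_iters):
            a = int(rng.integers(nA))
            s2, r = env_step(s, a, rng)
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            Qbar += (Q - Qbar) / (t + 1)       # running average of iterates
            s = s2
        return Qbar

    # Richardson-Romberg: the averaged iterate's bias is ~linear in the
    # stepsize, so combining two runs cancels the leading term:
    # Q_rr = 2 * avg_q_learning(step, N, alpha / 2) - avg_q_learning(step, N, alpha)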

A comparative study of zero-shot inference with large language models and supervised modeling in breast cancer pathology classification. (arXiv:2401.13887v1 [cs.CL])

Authors: Madhumita Sushil, Travis Zack, Divneet Mandair, Zhiwei Zheng, Ahmed Wali, Yan-Ning Yu, Yuwei Quan, Atul J. Butte

Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs can reduce the need for large-scale data annotations. We curated a manually-labeled dataset of 769 breast cancer pathology reports, labeled with 13 categories, to compare zero-shot classification capability of the GPT-4 model and the GPT-3.5 model with supervised classification performance of three model architectures: random forests classifier, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model. Across all 13 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, the LSTM-Att model (average macro F1 score of 0.83 vs. 0.75). On tasks with high imbalance between labels, the differences were more prominent. Frequent sources of GPT-4 errors included inferences from multiple samples and complex task design. On complex tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of large-scale data labeling. However, if the use of LLMs is prohibitive, the use of simpler supervised models with large annotated datasets can provide comparable results. LLMs demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for curating large annotated datasets. This may result in an increase in the utilization of NLP-based variables and outcomes in observational clinical studies.

Cross-Modal Prototype based Multimodal Federated Learning under Severely Missing Modality. (arXiv:2401.13898v1 [cs.LG])

Authors: Huy Q. Le, Chu Myaet Thwal, Yu Qiao, Ye Lin Tun, Minh N. H. Nguyen, Choong Seon Hong

Multimodal federated learning (MFL) has emerged as a decentralized machine learning paradigm, allowing multiple clients with different modalities to collaborate on training a machine learning model across diverse data sources without sharing their private data. However, challenges such as data heterogeneity and severely missing modalities pose crucial hindrances to the robustness of MFL, significantly impacting the performance of the global model. The absence of a modality introduces misalignment during the local training phase, stemming from zero-filling in the case of clients with missing modalities. Consequently, achieving robust generalization in the global model becomes imperative, especially when dealing with clients that have incomplete data. In this paper, we propose Multimodal Federated Cross Prototype Learning (MFCPL), a novel approach for MFL under severely missing modalities that constructs complete prototypes to provide diverse modality knowledge at the modality-shared level with cross-modal regularization and at the modality-specific level with a cross-modal contrastive mechanism. Additionally, our approach introduces cross-modal alignment to provide regularization for modality-specific features, thereby enhancing overall performance, particularly in scenarios involving severely missing modalities. Through extensive experiments on three multimodal datasets, we demonstrate the effectiveness of MFCPL in mitigating these challenges and improving overall performance.

Empowering Machines to Think Like Chemists: Unveiling Molecular Structure-Polarity Relationships with Hierarchical Symbolic Regression. (arXiv:2401.13904v1 [cs.LG])

Authors: Siyu Lou, Chengchun Liu, Yuntian Chen, Fanyang Mo

Thin-layer chromatography (TLC) is a crucial technique in molecular polarity analysis. Despite its importance, the interpretability of predictive models for TLC, especially those driven by artificial intelligence, remains a challenge. Current approaches, utilizing either high-dimensional molecular fingerprints or domain-knowledge-driven feature engineering, often face a dilemma between expressiveness and interpretability. To bridge this gap, we introduce Unsupervised Hierarchical Symbolic Regression (UHiSR), combining hierarchical neural networks and symbolic regression. UHiSR automatically distills chemically intuitive polarity indices and discovers interpretable equations that link molecular structure to chromatographic behavior.

A Survey of Deep Learning and Foundation Models for Time Series Forecasting. (arXiv:2401.13912v1 [cs.LG])

Authors: John A. Miller, Mohammed Aldosari, Farah Saeed, Nasid Habib Barna, Subas Rana, I. Budak Arpinar, Ninghao Liu

Deep Learning has been successfully applied to many application domains, yet its advantages have been slow to emerge for time series forecasting. For example, in the well-known Makridakis (M) Competitions, hybrids of traditional statistical or machine learning techniques have only recently become the top performers. With the recent architectural advances in deep learning being applied to time series forecasting (e.g., encoder-decoders with attention, transformers, and graph neural networks), deep learning has begun to show significant advantages. Still, in the area of pandemic prediction, there remain challenges for deep learning models: the time series is not long enough for effective training, unawareness of accumulated scientific knowledge, and interpretability of the model. To this end, the development of foundation models (large deep learning models with extensive pre-training) allows models to understand patterns and acquire knowledge that can be applied to new related problems before extensive training data becomes available. Furthermore, there is a vast amount of knowledge available that deep learning models can tap into, including Knowledge Graphs and Large Language Models fine-tuned with scientific domain knowledge. There is ongoing research examining how to utilize or inject such knowledge into deep learning models. In this survey, several state-of-the-art modeling techniques are reviewed, and suggestions for further work are provided.

Spectral Clustering for Discrete Distributions. (arXiv:2401.13913v1 [cs.LG])

Authors: Zixiao Wang, Dong Qiao, Jicong Fan

Discrete distribution clustering (D2C) has often been solved by Wasserstein barycenter methods. These methods rest on the common assumption that clusters can be well represented by barycenters, which may not hold in many real applications. In this work, we propose a simple yet effective framework based on spectral clustering and distribution affinity measures (e.g., maximum mean discrepancy and Wasserstein distance) for D2C. To improve scalability, we propose to use linear optimal transport to construct affinity matrices efficiently on large datasets. We provide theoretical guarantees for the success of the proposed methods in clustering distributions. Experiments on synthetic and real data show that our methods substantially outperform the baselines in terms of both clustering accuracy and computational efficiency.
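
The framework's main loop is compact: build an affinity matrix from a distribution distance, then cluster it spectrally. A toy sketch with an RBF-kernel MMD (the affinity transform and kernel width are illustrative assumptions):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def mmd(X, Y, gamma=1.0):
        """Squared MMD between two empirical samples under an RBF kernel."""
        k = lambda A, B: np.exp(-gamma * ((A[:, None] - B[None]) ** 2).sum(-1))
        return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

    dists = [np.random.randn(50, 2) + c for c in (0, 0, 5, 5)]  # toy distributions
    n = len(dists)
    A = np.array([[np.exp(-mmd(dists[i], dists[j])) for j in range(n)]
                  for i in range(n)])                           # affinity matrix
    labels = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(A)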

LocMoE: A Low-overhead MoE for Large Language Model Training. (arXiv:2401.13920v1 [cs.LG])

Authors: Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, Xin Chen

The Mixture-of-Experts (MoE) model is a widespread distributed and integrated learning method for large language models (LLMs), favored for its ability to sparsify and expand models efficiently. However, the performance of MoE is limited by load imbalance and the high latency of All-To-All communication, along with relatively redundant computation owing to large expert capacity. Load imbalance may result from existing routing policies that consistently tend to select certain experts. The frequent inter-node communication in the All-To-All procedure also significantly prolongs the training time. To alleviate the above performance problems, we propose a novel routing strategy that combines load balance and locality by converting partial inter-node communication to intra-node communication. Notably, we elucidate that there is a minimum threshold for expert capacity, calculated through the maximal angular deviation between the gating weights of the experts and the assigned tokens. We port these modifications onto the PanGu-Sigma model based on the MindSpore framework with multi-level routing and conduct experiments on Ascend clusters. The experiment results demonstrate that the proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers, such as the hash router and switch router, without impacting model accuracy.

Towards 3D Molecule-Text Interpretation in Language Models. (arXiv:2401.13923v1 [cs.LG])

Authors: Sihang Li, Zhiyuan Liu, Yanchen Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, Qi Tian

Language Models (LMs) have greatly influenced diverse domains. However, their inherent limitation in comprehending 3D molecular structures has considerably constrained their potential in the biomolecular domain. To bridge this gap, we focus on 3D molecule-text interpretation, and propose 3D-MoLM: 3D-Molecular Language Modeling. Specifically, 3D-MoLM enables an LM to interpret and analyze 3D molecules by equipping the LM with a 3D molecular encoder. This integration is achieved by a 3D molecule-text projector, bridging the 3D molecular encoder's representation space and the LM's input space. Moreover, to enhance 3D-MoLM's ability in cross-modal molecular understanding and instruction following, we meticulously curated a 3D molecule-centric instruction tuning dataset -- 3D-MoIT. Through 3D molecule-text alignment and 3D molecule-centric instruction tuning, 3D-MoLM establishes an integration of the 3D molecular encoder and the LM. It significantly surpasses existing baselines on downstream tasks, including molecule-text retrieval, molecule captioning, and more challenging open-text molecular QA tasks, especially those focusing on 3D-dependent properties.

Reinforcement Learning with Hidden Markov Models for Discovering Decision-Making Dynamics. (arXiv:2401.13929v1 [cs.LG])

Authors: Xingche Guo, Donglin Zeng, Yuanjia Wang

Major depressive disorder (MDD) presents challenges in diagnosis and treatment due to its complex and heterogeneous nature. Emerging evidence indicates that reward processing abnormalities may serve as a behavioral marker for MDD. To measure reward processing, patients perform computer-based behavioral tasks that involve making choices or responding to stimuli that are associated with different outcomes. Reinforcement learning (RL) models are fitted to extract parameters that measure various aspects of reward processing to characterize how patients make decisions in behavioral tasks. Recent findings suggest the inadequacy of characterizing reward learning solely based on a single RL model; instead, there may be a switching of decision-making processes between multiple strategies. An important scientific question is how the dynamics of learning strategies in decision-making affect the reward learning ability of individuals with MDD. Motivated by the probabilistic reward task (PRT) within the EMBARC study, we propose a novel RL-HMM framework for analyzing reward-based decision-making. Our model accommodates learning strategy switching between two distinct approaches under a hidden Markov model (HMM): subjects making decisions based on the RL model or opting for random choices. We account for a continuous RL state space and allow time-varying transition probabilities in the HMM. We introduce a computationally efficient EM algorithm for parameter estimation and employ a nonparametric bootstrap for inference. We apply our approach to the EMBARC study to show that MDD patients are less engaged in RL compared to the healthy controls, and engagement is associated with brain activities in the negative affect circuitry during an emotional conflict task.

Networked Multiagent Reinforcement Learning for Peer-to-Peer Energy Trading. (arXiv:2401.13947v1 [eess.SY])

Authors: Chen Feng, Andrew L. Liu

Utilizing distributed renewable and energy storage resources in local distribution networks via peer-to-peer (P2P) energy trading has long been touted as a solution to improve energy systems' resilience and sustainability. Consumers and prosumers (those who have energy generation resources), however, do not have the expertise to engage in repeated P2P trading, and the zero marginal costs of renewables present challenges in determining fair market prices. To address these issues, we propose multi-agent reinforcement learning (MARL) frameworks to help automate consumers' bidding and management of their solar PV and energy storage resources, under a specific P2P clearing mechanism that utilizes the so-called supply-demand ratio. In addition, we show how the MARL frameworks can integrate physical network constraints to realize voltage control, hence ensuring physical feasibility of the P2P energy trading and paving the way for real-world implementations.
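
As a concrete illustration of a supply-demand-ratio clearing rule, the sketch below interpolates internal buy and sell prices between the utility's retail price and feed-in tariff as the community's supply-to-demand ratio varies. The functional form follows a common SDR mechanism from the P2P trading literature and is an assumption here, as are the placeholder price constants; the paper's exact clearing rule is not reproduced.

```python
# Illustrative supply-demand-ratio (SDR) internal pricing. The functional
# form is a common SDR mechanism from the P2P literature, used here as an
# assumption; retail and feed-in prices are placeholder values ($/kWh).
def sdr_prices(supply, demand, retail=0.30, feed_in=0.10):
    sdr = supply / max(demand, 1e-9)
    if sdr >= 1.0:            # community surplus: trade at the feed-in tariff
        return feed_in, feed_in
    sell = (retail * feed_in) / ((retail - feed_in) * sdr + feed_in)
    buy = sell * sdr + retail * (1.0 - sdr)
    return sell, buy          # (price paid to sellers, price paid by buyers)

for s in (0.0, 5.0, 10.0):    # increasing local supply, fixed demand of 10
    print(sdr_prices(s, 10.0))  # prices fall from retail toward feed-in
```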

Dynamic Long-Term Time-Series Forecasting via Meta Transformer Networks. (arXiv:2401.13968v1 [cs.LG])

Authors: Muhammad Anwar Ma'sum, MD Rasel Sarkar, Mahardhika Pratama, Savitha Ramasamy, Sreenatha Anavatti, Lin Liu, Habibullah, Ryszard Kowalczyk

A reliable long-term time-series forecaster is highly demanded in practice but faces many challenges, such as maintaining low computational and memory footprints and remaining robust in dynamic learning environments. This paper proposes Meta-Transformer Networks (MANTRA) to deal with dynamic long-term time-series forecasting tasks. MANTRA relies on the concept of fast and slow learners, where a collection of fast learners learns different aspects of the data distribution while adapting quickly to changes, and a slow learner tailors suitable representations for the fast learners. Fast adaptation to dynamic environments is achieved using universal representation transformer layers that produce task-adapted representations with a small number of parameters. Our experiments using four datasets with different prediction lengths demonstrate the advantage of our approach, with at least $3\%$ improvement over the baseline algorithms for both multivariate and univariate settings. Source codes of MANTRA are publicly available at \url{https://github.com/anwarmaxsum/MANTRA}.

Stochastic Weakly Convex Optimization Beyond Lipschitz Continuity. (arXiv:2401.13971v1 [math.OC])

Authors: Wenzhi Gao, Qi Deng

This paper considers stochastic weakly convex optimization without the standard Lipschitz continuity assumption. Based on new adaptive regularization (stepsize) strategies, we show that a wide class of stochastic algorithms, including the stochastic subgradient method, preserve the $\mathcal{O} ( 1 / \sqrt{K})$ convergence rate with constant failure rate. Our analyses rest on rather weak assumptions: the Lipschitz parameter can be either bounded by a general growth function of $\|x\|$ or locally estimated through independent random samples.
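
As a minimal illustration of the kind of stepsize adaptation the abstract alludes to, the sketch below normalizes each stochastic subgradient step by its magnitude, so unbounded subgradients cannot destabilize the iteration. This is one simple adaptive strategy chosen for illustration; the paper's regularization schemes and analysis are not reproduced here.

```python
# Minimal normalized stochastic subgradient method: each step is scaled by
# 1 / max(1, ||g||), one simple way to cope with non-Lipschitz growth.
# (Illustrative assumption; the paper's adaptive strategies differ in detail.)
import numpy as np

def normalized_ssgd(subgrad, x0, K=1000, c=1.0, seed=0):
    """subgrad(x, rng) returns a stochastic subgradient at x."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for _ in range(K):
        g = subgrad(x, rng)
        eta = c / (np.sqrt(K) * max(1.0, np.linalg.norm(g)))
        x -= eta * g
    return x

# Toy example: f(x) = ||x||_1 with additive gradient noise.
sg = lambda x, rng: np.sign(x) + 0.1 * rng.normal(size=x.shape)
print(normalized_ssgd(sg, np.array([5.0, -3.0])))  # approaches the origin
```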

Evaluating the Determinants of Mode Choice Using Statistical and Machine Learning Techniques in the Indian Megacity of Bengaluru. (arXiv:2401.13977v1 [cs.LG])

Authors: Tanmay Ghosh, Nithin Nagaraj

The decision making behind mode choice is critical for transportation planning. While statistical learning techniques like discrete choice models have traditionally been used, machine learning (ML) models have recently gained traction among transportation planners due to their higher predictive performance. However, the black-box nature of ML models poses significant interpretability challenges, limiting their practical application in decision and policy making. This study utilised a dataset of $1350$ households belonging to the low and low-middle income brackets in the city of Bengaluru to investigate mode choice decision-making behaviour using a multinomial logit model and ML classifiers such as decision trees, random forests, extreme gradient boosting, and support vector machines. In terms of accuracy, the random forest model performed best ($0.788$ on training data and $0.605$ on testing data) compared to all the other models. This research adopts modern interpretability techniques like feature importance and individual conditional expectation plots to explain the decision-making behaviour of the ML models. Higher travel costs significantly reduce the predicted probability of bus usage compared to other modes (a $0.66\%$ and $0.34\%$ reduction using the random forest and XGBoost models, respectively, for a $10\%$ increase in travel cost). Conversely, reducing travel time by $10\%$ increases the preference for the metro ($0.16\%$ in random forests and $0.42\%$ in XGBoost). This research augments ongoing work on mode choice analysis using machine learning techniques, helping to improve the understanding of these models' performance on real-world data in terms of both accuracy and interpretability.

Leeroo Orchestrator: Elevating LLMs Performance Through Model Integration. (arXiv:2401.13979v1 [cs.CL])

Authors: Alireza Mohammadshahi, Ali Shaikh, Majid Yazdani

In this paper, we propose an architecture to harness the collective knowledge of multiple trained LLMs to create a new state of the art. At the core of this framework is an LLM-based orchestrator that is adept at picking the right underlying LLM experts for optimal task execution. Inspired by self-play in reinforcement learning, we created a loop of query generation, orchestration, and evaluation to generate training data for the orchestrator. Our evaluation focused on the MMLU benchmark, employing models with 7B, 13B, and 34B parameters available on Hugging Face. The results demonstrate new state-of-the-art open-source models: our Leeroo orchestrator achieves performance on par with the Mixtral model while incurring only two-thirds of its cost. Moreover, when the allowed cost budget is raised to match Mixtral's, the orchestrator surpasses Mixtral's accuracy by over 5%, reaching 75.9%. Further enhancements were observed when integrating GPT4 into the underlying model pool. The Leeroo orchestrator nearly matches GPT4's performance at half the cost and even exceeds GPT4's results with a 25% cost reduction. These findings illustrate the potential of our architecture in creating state-of-the-art and cost-effective LLMs by optimizing the synergy between multiple LLMs to achieve superior performance outcomes.

Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning. (arXiv:2401.13986v1 [cs.CL])

Authors: Yanda Chen, Chandan Singh, Xiaodong Liu, Simiao Zuo, Bin Yu, He He, Jianfeng Gao

Large language models (LLMs) often generate convincing, fluent explanations. However, unlike humans, they often generate inconsistent explanations on different inputs. For example, an LLM may generate the explanation "all birds can fly" when answering the question "Can sparrows fly?" but meanwhile answer "no" to the related question "Can penguins fly?". Explanations should be consistent across related examples so that they allow a human to simulate the LLM's decision process on multiple examples. We propose explanation-consistency finetuning (EC-finetuning), a method that adapts LLMs to generate more consistent natural-language explanations on related examples. EC-finetuning involves finetuning LLMs on synthetic data that is carefully constructed to contain consistent explanations. Across a variety of question-answering datasets in various domains, EC-finetuning yields a 10.0% relative explanation consistency improvement on four finetuning datasets, and generalizes to seven out-of-distribution datasets not seen during finetuning (+4.5% relative). Code is available at https://github.com/yandachen/explanation-consistency-finetuning .

Cross-Domain Few-Shot Learning via Adaptive Transformer Networks. (arXiv:2401.13987v1 [cs.LG])

Authors: Naeem Paeedeh, Mahardhika Pratama, Muhammad Anwar Ma'sum, Wolfgang Mayer, Zehong Cao, Ryszard Kowalczyk

Most few-shot learning works rely on the assumption that the base and target tasks share the same domain, hindering their practical application. This paper proposes an adaptive transformer network (ADAPTER), a simple but effective solution for cross-domain few-shot learning where there exist large domain shifts between the base task and the target task. ADAPTER is built upon the idea of bidirectional cross-attention to learn transferable features between the two domains. The proposed architecture is trained with DINO to produce diverse, less biased features and thus avoid the supervision collapse problem. Furthermore, a label smoothing approach is proposed to improve the consistency and reliability of the predictions by also considering the predicted labels of nearby samples in the embedding space. The performance of ADAPTER is rigorously evaluated on the BSCD-FSL benchmarks, where it outperforms prior art by significant margins.

Accelerating Retrieval-Augmented Language Model Serving with Speculation. (arXiv:2401.14021v1 [cs.LG])

Authors: Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia

Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. Instead of fine-tuning a fully parametric model, RaLM offers low-cost adaptation to the latest data and better source attribution. Among various RaLM approaches, iterative RaLM delivers better generation quality due to more frequent interaction between the retriever and the language model. Despite the benefits, iterative RaLM usually incurs high overheads due to the frequent retrieval step. To this end, we propose RaLMSpec, a speculation-inspired framework that provides generic speed-up over iterative RaLM while preserving the same model outputs through speculative retrieval and batched verification. By further incorporating prefetching, an optimal speculation stride scheduler, and asynchronous verification, RaLMSpec can automatically exploit the acceleration potential to the fullest. For naive iterative RaLM serving, extensive evaluations over three language models on four downstream QA datasets demonstrate that RaLMSpec can achieve a speed-up ratio of 1.75-2.39x, 1.04-1.39x, and 1.31-1.77x when the retriever is an exact dense retriever, approximate dense retriever, and sparse retriever, respectively, compared with the baseline. For KNN-LM serving, RaLMSpec can achieve a speed-up ratio of up to 7.59x and 2.45x when the retriever is an exact dense retriever and approximate dense retriever, respectively, compared with the baseline.

DNA Sequence Classification with Compressors. (arXiv:2401.14025v1 [q-bio.GN])

Authors: Şükrü Ozan

Recent studies in DNA sequence classification have leveraged sophisticated machine learning techniques, achieving notable accuracy in categorizing complex genomic data. Among these, methods such as k-mer counting have proven effective in distinguishing sequences from varied species like chimpanzees, dogs, and humans, becoming a staple in contemporary genomic research. However, these approaches often demand extensive computational resources, posing a challenge in terms of scalability and efficiency. Addressing this issue, our study introduces a novel adaptation of Jiang et al.'s compressor-based, parameter-free classification method, specifically tailored for DNA sequence analysis. This innovative approach utilizes a variety of compression algorithms, such as Gzip, Brotli, and LZMA, to efficiently process and classify genomic sequences. Not only does this method align with the current state-of-the-art in terms of accuracy, but it also offers a more resource-efficient alternative to traditional machine learning methods. Our comprehensive evaluation demonstrates the proposed method's effectiveness in accurately classifying DNA sequences from multiple species. We present a detailed analysis of the performance of each algorithm used, highlighting the strengths and limitations of our approach in various genomic contexts. Furthermore, we discuss the broader implications of our findings for bioinformatics, particularly in genomic data processing and analysis. The results of our study pave the way for more efficient and scalable DNA sequence classification methods, offering significant potential for advancements in genomic research and applications.
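
As a concrete illustration of compressor-based, parameter-free classification, the sketch below pairs the normalized compression distance (computed with gzip from the Python standard library) with a nearest-neighbor rule, in the spirit of the approach the abstract adapts. The toy sequences and labels are placeholders, and the paper additionally evaluates Brotli and LZMA.

```python
# Minimal sketch of compressor-based classification with the normalized
# compression distance (NCD) and a 1-NN rule; sequences are toy placeholders.
import gzip

def clen(s: str) -> int:
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    ca, cb, cab = clen(a), clen(b), clen(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query, train):
    """train: list of (sequence, label); returns the nearest neighbor's label."""
    return min(train, key=lambda t: ncd(query, t[0]))[1]

train = [("ATGCGATTACAGGCT" * 10, "human"), ("GGGCCCTTTAAAGGC" * 10, "dog")]
print(classify("ATGCGATTACAGGCT" * 8 + "ATGC", train))  # -> "human"
```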

The Risk of Federated Learning to Skew Fine-Tuning Features and Underperform Out-of-Distribution Robustness. (arXiv:2401.14027v1 [cs.LG])

Authors: Mengyao Du, Miao Zhang, Yuwen Pu, Kai Xu, Shouling Ji, Quanjun Yin

To tackle the scarcity and privacy issues associated with domain-specific datasets, the integration of federated learning in conjunction with fine-tuning has emerged as a practical solution. However, our findings reveal that federated learning has the risk of skewing fine-tuning features and compromising the out-of-distribution robustness of the model. By introducing three robustness indicators and conducting experiments across diverse robust datasets, we elucidate these phenomena by scrutinizing the diversity, transferability, and deviation within the model feature space. To mitigate the negative impact of federated learning on model robustness, we introduce GNP, a \underline{G}eneral \underline{N}oisy \underline{P}rojection-based robust algorithm, ensuring no deterioration of accuracy on the target distribution. Specifically, the key strategy for enhancing model robustness entails the transfer of robustness from the pre-trained model to the fine-tuned model, coupled with adding a small amount of Gaussian noise to augment the representative capacity of the model. Comprehensive experimental results demonstrate that our approach markedly enhances the robustness across diverse scenarios, encompassing various parameter-efficient fine-tuning methods and confronting different levels of data heterogeneity.

Towards a Systems Theory of Algorithms. (arXiv:2401.14029v1 [math.OC])

Authors: Florian Dörfler, Zhiyu He, Giuseppe Belgioioso, Saverio Bolognani, John Lygeros, Michael Muehlebach

Traditionally, numerical algorithms are seen as isolated pieces of code confined to an {\em in silico} existence. However, this perspective is not appropriate for many modern computational approaches in control, learning, or optimization, wherein {\em in vivo} algorithms interact with their environment. Examples of such {\em open} algorithms include various real-time optimization-based control strategies, reinforcement learning, decision-making architectures, online optimization, and many more. Further, even {\em closed} algorithms in learning or optimization are increasingly abstracted in block diagrams with interacting dynamic modules and pipelines. In this opinion paper, we state our vision on a to-be-cultivated {\em systems theory of algorithms} and argue in favour of viewing algorithms as open dynamical systems interacting with other algorithms, physical systems, humans, or databases. Remarkably, the manifold tools developed under the umbrella of systems theory also provide valuable insights into this burgeoning paradigm shift and its accompanying challenges in the algorithmic world. We survey various instances where the principles of algorithmic systems theory are being developed and outline pertinent modeling, analysis, and design challenges.

Sparse and Transferable Universal Singular Vectors Attack. (arXiv:2401.14031v1 [cs.LG])

Authors: Kseniia Kuvshinova, Olga Tsymboi, Ivan Oseledets

The research in the field of adversarial attacks and model vulnerability is one of the fundamental directions in modern machine learning. Recent studies reveal the vulnerability phenomenon, and understanding the mechanisms behind it is essential for improving neural network characteristics and interpretability. In this paper, we propose a novel sparse universal white-box adversarial attack. Our approach is based on truncated power iteration, providing sparsity to the $(p,q)$-singular vectors of the hidden layers' Jacobian matrices. Using the ImageNet benchmark validation subset, we analyze the proposed method in various settings, achieving results comparable to dense baselines with a fooling rate of more than 50% while damaging only 5% of pixels and utilizing 256 samples for perturbation fitting. We also show that our algorithm admits higher attack magnitudes without affecting the human ability to solve the task. Furthermore, we show that the constructed perturbations are highly transferable among different models without significantly decreasing the fooling rate. Our findings demonstrate the vulnerability of state-of-the-art models to sparse attacks and highlight the importance of developing robust machine learning systems.

Novel Quadratic Constraints for Extending LipSDP beyond Slope-Restricted Activations. (arXiv:2401.14033v1 [cs.LG])

Authors: Patricia Pauli, Aaron Havens, Alexandre Araujo, Siddharth Garg, Farshad Khorrami, Frank Allgöwer, Bin Hu

Recently, semidefinite programming (SDP) techniques have shown great promise in providing accurate Lipschitz bounds for neural networks. Specifically, the LipSDP approach (Fazlyab et al., 2019) has received much attention and provides the least conservative Lipschitz upper bounds that can be computed with polynomial time guarantees. However, one main restriction of LipSDP is that its formulation requires the activation functions to be slope-restricted on $[0,1]$, preventing its further use for more general activation functions such as GroupSort, MaxMin, and Householder. One can, for example, rewrite MaxMin activations as residual ReLU networks. However, a direct application of LipSDP to the resultant residual ReLU networks is conservative and even fails to recover the well-known fact that the MaxMin activation is 1-Lipschitz. Our paper bridges this gap and extends LipSDP beyond slope-restricted activation functions. To this end, we provide novel quadratic constraints for GroupSort, MaxMin, and Householder activations by leveraging their underlying properties, such as sum preservation. Our proposed analysis is general and provides a unified approach for estimating $\ell_2$ and $\ell_\infty$ Lipschitz bounds for a rich class of neural network architectures, including non-residual and residual neural networks and implicit models, with GroupSort, MaxMin, and Householder activations. Finally, we illustrate the utility of our approach with a variety of experiments and show that our proposed SDPs generate less conservative Lipschitz bounds in comparison to existing approaches.

Left/Right Brain, human motor control and the implications for robotics. (arXiv:2401.14057v1 [cs.RO])

Authors: Jarrad Rinaldo, Levin Kuhlmann, Jason Friedman, Gideon Kowadlo

Neural network movement controllers promise a variety of advantages over conventional control methods; however, they are not widely adopted due to their inability to produce reliably precise movements. This research explores a bilateral neural network architecture as a control system for motor tasks. We aimed to achieve hemispheric specialisation similar to what is observed in humans across different tasks: the dominant system (usually the right hand, left hemisphere) excels at tasks involving coordination and efficiency of movement, while the non-dominant system performs better at tasks requiring positional stability. Specialisation was achieved by training the hemispheres with different loss functions tailored toward the expected behaviour of the respective hemispheres. We compared bilateral models with and without specialised hemispheres, with and without inter-hemispheric connectivity (representing the biological Corpus Callosum), and unilateral models with and without specialisation. The models were trained and tested on two tasks common in the human motor control literature: the random reach task, suited to the dominant system (the model with better coordination), and the hold position task, suited to the non-dominant system (the model with more stable movement). Each system outperformed the non-favoured system in its preferred task. For both tasks, a bilateral model outperforms the 'non-preferred' hand and is as good as or better than the 'preferred' hand. The Corpus Callosum tends to improve performance, but not always for the specialised models.

Novel application of Relief Algorithm in cascaded artificial neural network to predict wind speed for wind power resource assessment in India. (arXiv:2401.14065v1 [cs.LG])

Authors: Hasmat Malik, Amit Kumar Yadav, Fausto Pedro García Márquez, Jesús María Pinar-Pérez

Wind power generation is inherently non-schedulable due to the stochastic nature of meteorological variables. The energy business and the control of wind power generation therefore require wind speed (WS) predictions from a few seconds to several time steps in advance. Various WS prediction methods have been used to address this need. Predictive data mining offers a variety of methods for WS prediction, among which the artificial neural network (ANN) is one of the most reliable and accurate. The results of this study show that ANNs achieve better accuracy than conventional models. The accuracy of WS prediction models is found to depend on the input parameters and the architecture of the algorithms used, making the selection of the most relevant input parameters an important research area in the WS prediction field. The objective of this paper is twofold. First, an extensive review of ANNs for wind power and WS prediction is carried out. Second, feature selection using the Relief Algorithm (RA) for WS prediction is discussed and analysed for different Indian sites. The RA identifies atmospheric pressure, solar radiation, and relative humidity as the relevant input variables. Based on these variables, a cascade ANN model is developed and its prediction accuracy is evaluated. The root mean square error (RMSE) between predicted and measured WS is found to be 1.44 m/s for training and 1.49 m/s for testing. The developed cascade ANN model can be used to predict wind speed at Indian sites where no WS measuring instruments are installed.

Neural Sinkhorn Gradient Flow. (arXiv:2401.14069v1 [cs.LG])

Authors: Huminhao Zhu, Fangyikang Wang, Chao Zhang, Hanbin Zhao, Hui Qian

Wasserstein Gradient Flows (WGF) with respect to specific functionals have been widely used in the machine learning literature. Recently, neural networks have been adopted to approximate certain intractable parts of the underlying Wasserstein gradient flow and result in efficient inference procedures. In this paper, we introduce the Neural Sinkhorn Gradient Flow (NSGF) model, which parametrizes the time-varying velocity field of the Wasserstein gradient flow w.r.t. the Sinkhorn divergence to the target distribution, starting from a given source distribution. We utilize the velocity field matching training scheme in NSGF, which only requires samples from the source and target distribution to compute an empirical velocity field approximation. Our theoretical analyses show that as the sample size increases to infinity, the mean-field limit of the empirical approximation converges to the true underlying velocity field. To further enhance model efficiency on high-dimensional tasks, a two-phase NSGF++ model is devised, which first follows the Sinkhorn flow to approach the image manifold quickly ($\le 5$ NFEs) and then refines the samples along a simple straight flow. Numerical experiments with synthetic and real-world benchmark datasets support our theoretical results and demonstrate the effectiveness of the proposed methods.

ProCNS: Progressive Prototype Calibration and Noise Suppression for Weakly-Supervised Medical Image Segmentation. (arXiv:2401.14074v1 [cs.CV])

Authors: Y. Liu, L. Lin, K. K. Y. Wong, X. Tang

Weakly-supervised segmentation (WSS) has emerged as a solution to mitigate the conflict between annotation cost and model performance by adopting sparse annotation formats (e.g., point, scribble, block, etc.). Typical approaches attempt to exploit anatomy and topology priors to directly expand sparse annotations into pseudo-labels. However, due to a lack of attention to the ambiguous edges in medical images and insufficient exploration of sparse supervision, existing approaches tend to generate erroneous and overconfident pseudo proposals in noisy regions, leading to cumulative model error and performance degradation. In this work, we propose a novel WSS approach, named ProCNS, encompassing two synergistic modules devised with the principles of progressive prototype calibration and noise suppression. Specifically, we design a Prototype-based Regional Spatial Affinity (PRSA) loss to maximize the pair-wise affinities between spatial and semantic elements, providing our model of interest with more reliable guidance. The affinities are derived from the input images and the prototype-refined predictions. Meanwhile, we propose an Adaptive Noise Perception and Masking (ANPM) module to obtain more enriched and representative prototype representations, which adaptively identifies and masks noisy regions within the pseudo proposals, reducing potential erroneous interference during prototype computation. Furthermore, we generate specialized soft pseudo-labels for the noisy regions identified by ANPM, providing supplementary supervision. Extensive experiments on three medical image segmentation tasks involving different modalities demonstrate that the proposed framework significantly outperforms representative state-of-the-art methods.

Accelerating Fractional PINNs using Operational Matrices of Derivative. (arXiv:2401.14081v1 [cs.LG])

Authors: Tayebeh Taheri, Alireza Afzal Aghaei, Kourosh Parand

This paper presents a novel operational matrix method to accelerate the training of fractional Physics-Informed Neural Networks (fPINNs). Our approach involves a non-uniform discretization of the fractional Caputo operator, facilitating swift computation of fractional derivatives within Caputo-type fractional differential problems with $0<\alpha<1$. In this methodology, the operational matrix is precomputed, and during the training phase, automatic differentiation is replaced with a matrix-vector product. While our methodology is compatible with any network, we particularly highlight its successful implementation in PINNs, emphasizing the enhanced accuracy achieved when utilizing the Legendre Neural Block (LNB) architecture. LNB incorporates Legendre polynomials into the PINN structure, providing a significant boost in accuracy. The effectiveness of our proposed method is validated across diverse differential equations, including Delay Differential Equations (DDEs) and Systems of Differential Algebraic Equations (DAEs). To demonstrate its versatility, we extend the application of the method to systems of differential equations, specifically addressing nonlinear Pantograph fractional-order DDEs/DAEs. The results are supported by a comprehensive analysis of numerical outcomes.
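
To make the operational-matrix idea concrete, the sketch below precomputes a Caputo-derivative matrix on a uniform grid using the classical L1 discretization, so that during training a matrix-vector product can replace automatic differentiation. The uniform grid and L1 weights are simplifying assumptions; the paper uses a non-uniform discretization and Legendre Neural Blocks, which are not reproduced here.

```python
# Illustrative operational matrix for the Caputo derivative (0 < alpha < 1)
# using the classical uniform-grid L1 scheme; a simplified stand-in for the
# paper's non-uniform discretization.
import numpy as np
from math import gamma

def caputo_l1_matrix(n, h, alpha):
    """Matrix D with (D @ u)[k] approximating the Caputo derivative at t_k = k*h."""
    c = 1.0 / (gamma(2 - alpha) * h ** alpha)
    j = np.arange(n)
    b = (j + 1) ** (1 - alpha) - j ** (1 - alpha)  # L1 weights b_j
    D = np.zeros((n + 1, n + 1))
    for k in range(1, n + 1):
        for m in range(k):          # term b_m * (u[k-m] - u[k-m-1])
            D[k, k - m] += c * b[m]
            D[k, k - m - 1] -= c * b[m]
    return D

# Sanity check on u(t) = t: the Caputo derivative is t^(1-alpha)/Gamma(2-alpha).
n, h, alpha = 100, 0.01, 0.5
t = np.arange(n + 1) * h
D = caputo_l1_matrix(n, h, alpha)
print(np.allclose((D @ t)[1:], t[1:] ** (1 - alpha) / gamma(2 - alpha)))
```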

Generating Likely Counterfactuals Using Sum-Product Networks. (arXiv:2401.14086v1 [cs.AI])

Authors: Jiri Nemecek, Tomas Pevny, Jakub Marecek

Due to user demand and recent regulation (GDPR, AI Act), decisions made by AI systems need to be explained. These decisions are often explainable only post hoc, where counterfactual explanations are popular. The question of what constitutes the best counterfactual explanation must consider multiple aspects, where "distance from the sample" is the most common. We argue that this requirement frequently leads to explanations that are unlikely and, therefore, of limited value. Here, we present a system that provides high-likelihood explanations. We show that the search for the most likely explanations satisfying many common desiderata for counterfactual explanations can be modeled using mixed-integer optimization (MIO). In the process, we propose an MIO formulation of a Sum-Product Network (SPN) and use the SPN to estimate the likelihood of a counterfactual, which can be of independent interest. A numerical comparison against several methods for generating counterfactual explanations is provided.

A Modular Approach to Automatic Cyber Threat Attribution using Opinion Pools. (arXiv:2401.14090v1 [cs.CR])

Authors: Koen T.W. Teuwen

Cyber threat attribution can play an important role in increasing resilience against digital threats. Recent research focuses on automating the threat attribution process and on integrating it with other efforts, such as threat hunting. To support increasing automation of the cyber threat attribution process, this paper proposes a modular architecture as an alternative to current monolithic automated approaches. The modular architecture can utilize opinion pools to combine the output of concrete attributors. The proposed solution increases the tractability of the threat attribution problem and offers increased usability and interpretability, as opposed to monolithic alternatives. In addition, a Pairing Aggregator is proposed as an aggregation method that forms pairs of attributors based on distinct features to produce intermediary results before finally producing a single Probability Mass Function (PMF) as output. The Pairing Aggregator sequentially applies both the logarithmic opinion pool and the linear opinion pool. An experimental validation suggests that the modular approach does not result in decreased performance and can even enhance precision and recall compared to monolithic alternatives. The results also suggest that the Pairing Aggregator can improve precision over the linear and logarithmic opinion pools. Furthermore, the improved k-accuracy in the experiment suggests that forensic experts can leverage the resulting PMF during their manual attribution processes to enhance their efficiency.
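
For reference, minimal linear and logarithmic opinion pools over attributor PMFs might look as follows. The uniform weights and the small epsilon guarding zero probabilities are assumptions of the sketch, and the Pairing Aggregator's pairing logic is not shown.

```python
# Minimal linear and logarithmic opinion pools for combining attributor
# PMFs over candidate threat actors (uniform weights are an assumption).
import numpy as np

def linear_pool(pmfs, w=None):
    P = np.asarray(pmfs)                    # (n_attributors, n_actors)
    w = np.full(len(P), 1 / len(P)) if w is None else np.asarray(w)
    return w @ P                            # weighted arithmetic mean

def log_pool(pmfs, w=None, eps=1e-12):
    P = np.asarray(pmfs) + eps              # keep zero entries from vetoing
    w = np.full(len(P), 1 / len(P)) if w is None else np.asarray(w)
    g = np.exp(w @ np.log(P))               # weighted geometric mean
    return g / g.sum()                      # renormalize to a PMF

pmfs = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
print(linear_pool(pmfs), log_pool(pmfs))
```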

McUDI: Model-Centric Unsupervised Degradation Indicator for Failure Prediction AIOps Solutions. (arXiv:2401.14093v1 [cs.SE])

Authors: Lorena Poenaru-Olaru, Luis Cruz, Jan Rellermeyer, Arie van Deursen

Due to continuous change in operational data, AIOps solutions suffer from performance degradation over time. Although periodic retraining is the state-of-the-art technique for preserving failure prediction AIOps models' performance over time, this technique requires a considerable amount of labeled data for retraining. In AIOps, obtaining labeled data is expensive since it requires domain experts to annotate it intensively. In this paper, we present McUDI, a model-centric unsupervised degradation indicator that is capable of detecting the exact moment the AIOps model requires retraining as a result of changes in data. We further show how employing McUDI in the maintenance pipeline of AIOps solutions can reduce the number of samples that require annotation by 30k for job failure prediction and 260k for disk failure prediction, while achieving performance similar to periodic retraining.

Learning under Label Noise through Few-Shot Human-in-the-Loop Refinement. (arXiv:2401.14107v1 [cs.LG])

Authors: Aaqib Saeed, Dimitris Spathis, Jungwoo Oh, Edward Choi, Ali Etemad

Wearable technologies enable continuous monitoring of various health metrics, such as physical activity, heart rate, sleep, and stress levels. A key challenge with wearable data is obtaining quality labels. Unlike modalities such as video, where the videos themselves can effectively be used to label objects or events, wearable data do not contain obvious cues about the physical manifestation of the users and usually require rich metadata. As a result, label noise can become an increasingly thorny issue when labeling such data. In this paper, we propose a novel solution to address noisy label learning, entitled Few-Shot Human-in-the-Loop Refinement (FHLR). Our method initially learns a seed model using weak labels. Next, it fine-tunes the seed model using a handful of expert corrections. Finally, it achieves better generalizability and robustness by merging the seed and fine-tuned models via weighted parameter averaging. We evaluate our approach on four challenging tasks and datasets, and compare it against eight competitive baselines designed to deal with noisy labels. We show that FHLR achieves significantly better performance when learning from noisy labels and achieves state-of-the-art by a large margin, with up to 19% accuracy improvement under symmetric and asymmetric noise. Notably, we find that FHLR is particularly robust to increased label noise, unlike prior works that suffer from severe performance degradation. Our work not only achieves better generalization in high-stakes health sensing benchmarks but also sheds light on how noise affects commonly-used models.
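
A minimal sketch of the final merging step, weighted parameter averaging of the seed and fine-tuned models, is shown below; the interpolation weight and the decision to skip non-floating-point buffers are assumptions of the sketch.

```python
# Minimal sketch of merging seed and fine-tuned models via weighted
# parameter averaging; `lam` is an assumed interpolation weight.
import torch

def merge_models(seed, finetuned, lam=0.5):
    sd_seed, sd_ft = seed.state_dict(), finetuned.state_dict()
    merged = {k: (1 - lam) * v + lam * sd_ft[k] if v.is_floating_point() else v
              for k, v in sd_seed.items()}
    seed.load_state_dict(merged)  # seed now holds the merged weights
    return seed
```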

CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks. (arXiv:2401.14109v1 [cs.CL])

Authors: Andrei Tomut, Saeed S. Jahromi, Sukhbinder Singh, Faysal Ishtiaq, Cesar Muñoz, Prabdeep Singh Bajaj, Ali Elborady, Gianni del Bimbo, Mehrazin Alizadeh, David Montero, Pablo Martin-Ramiro, Muhammad Ibrahim, Oussama Tahiri Alaoui, John Malcolm, Samuel Mugel, Roman Orus

Large Language Models (LLMs) such as ChatGPT and LlaMA are advancing rapidly in generative Artificial Intelligence (AI), but their immense size poses significant challenges, such as huge training and inference costs, substantial energy demands, and limitations for on-site deployment. Traditional compression methods such as pruning, distillation, and low-rank approximation focus on reducing the effective number of neurons in the network, while quantization focuses on reducing the numerical precision of individual weights to reduce the model size while keeping the number of neurons fixed. While these compression methods have been relatively successful in practice, there's no compelling reason to believe that truncating the number of neurons is an optimal strategy. In this context, this paper introduces CompactifAI, an innovative LLM compression approach using quantum-inspired Tensor Networks that focuses on the model's correlation space instead, allowing for a more controlled, refined and interpretable model compression. Our method is versatile and can be implemented with - or on top of - other compression techniques. As a benchmark, we demonstrate that CompactifAI alone enables compression of the LlaMA-2 7B model to only $30\%$ of its original size while recovering over $90\%$ of the original accuracy after a brief distributed retraining.

Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators. (arXiv:2401.14110v1 [cs.LG])

Authors: Yaniv Blumenfeld, Itay Hubara, Daniel Soudry

The majority of the research on the quantization of Deep Neural Networks (DNNs) is focused on reducing the precision of tensors visible by high-level frameworks (e.g., weights, activations, and gradients). However, current hardware still relies on high-accuracy core operations. Most significant is the operation of accumulating products. This high-precision accumulation operation is gradually becoming the main computational bottleneck. This is because, so far, the usage of low-precision accumulators led to a significant degradation in performance. In this work, we present a simple method to train and fine-tune high-end DNNs, to allow, for the first time, utilization of cheaper, $12$-bits accumulators, with no significant degradation in accuracy. Lastly, we show that as we decrease the accumulation precision further, using fine-grained gradient approximations can improve the DNN accuracy.
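
The sketch below illustrates the bottleneck in question by simulating a matrix multiplication whose partial sums are clipped to a saturating 12-bit integer accumulator; the saturation rule and toy operand ranges are assumptions, and the paper's training and fine-tuning procedure is not reproduced.

```python
# Toy simulation of matmul with a saturating 12-bit integer accumulator
# (illustrative; real hardware accumulates in tiles, and the clipping rule
# here is an assumption).
import numpy as np

def matmul_lowbit_acc(A, B, acc_bits=12):
    lo, hi = -(2 ** (acc_bits - 1)), 2 ** (acc_bits - 1) - 1
    T, K = A.shape
    _, N = B.shape
    C = np.zeros((T, N), dtype=np.int64)
    for k in range(K):  # accumulate one rank-1 product at a time, saturating
        C = np.clip(C + np.outer(A[:, k], B[k, :]), lo, hi)
    return C

A = np.random.randint(-8, 8, (4, 64))
B = np.random.randint(-8, 8, (64, 4))
print(np.abs(matmul_lowbit_acc(A, B) - A @ B).max())  # saturation error
```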

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design. (arXiv:2401.14112v1 [cs.LG])

Authors: Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. It is challenging to support FP6 quantization on GPUs due to (1) unfriendly memory access of model weights with irregular bit-width and (2) high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support of float-point weights for various quantization bit-width. We integrate TC-FPx kernel into an existing inference system, providing new end-to-end support (called FP6-LLM) for quantized LLM inference, where better trade-offs between inference cost and model quality are achieved. Experiments show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline. The source code will be publicly available soon.

Attention-based Efficient Classification for 3D MRI Image of Alzheimer's Disease. (arXiv:2401.14130v1 [eess.IV])

Authors: Yihao Lin, Ximeng Li, Yan Zhang, Jinshan Tang

Early diagnosis of Alzheimer's disease (AD) is a challenging task due to its subtle and complex clinical symptoms. Deep learning-assisted medical diagnosis using image recognition techniques has become an important research topic in this field. The features must accurately capture the main variations of anatomical brain structures. However, feature extraction through deep learning training is time-consuming and expensive. This study proposes a novel Alzheimer's disease detection model based on Convolutional Neural Networks. The model utilizes a pre-trained ResNet network as the backbone, incorporating a post-fusion algorithm for 3D medical images and attention mechanisms. The experimental results indicate that the employed 2D fusion algorithm effectively reduces the model's training expense, and the introduced attention mechanism accurately weights important regions in the images, further enhancing the model's diagnostic accuracy.

Equivariant Manifold Neural ODEs and Differential Invariants. (arXiv:2401.14131v1 [cs.LG])

Authors: Emma Andersdotter, Fredrik Ohlsson

In this paper we develop a manifestly geometric framework for equivariant manifold neural ordinary differential equations (NODEs), and use it to analyse their modelling capabilities for symmetric data. First, we consider the action of a Lie group $G$ on a smooth manifold $M$ and establish the equivalence between equivariance of vector fields, symmetries of the corresponding Cauchy problems, and equivariance of the associated NODEs. We also propose a novel formulation of the equivariant NODEs in terms of the differential invariants of the action of $G$ on $M$, based on Lie theory for symmetries of differential equations, which provides an efficient parameterisation of the space of equivariant vector fields in a way that is agnostic to both the manifold $M$ and the symmetry group $G$. Second, we construct augmented manifold NODEs, through embeddings into equivariant flows, and show that they are universal approximators of equivariant diffeomorphisms on any path-connected $M$. Furthermore, we show that the augmented NODEs can be incorporated in the geometric framework and parameterised using higher order differential invariants. Finally, we consider the induced action of $G$ on different fields on $M$ and show how it can be used to generalise previous work, on, e.g., continuous normalizing flows, to equivariant models in any geometry.

Convolutional Neural Networks can achieve binary bail judgement classification. (arXiv:2401.14135v1 [cs.CL])

Authors: Amit Barman, Devangan Roy, Debapriya Paul, Indranil Dutta, Shouvik Kumar Guha, Samir Karmakar, Sudip Kumar Naskar

There is an evident lack of implementation of Machine Learning (ML) in the legal domain in India, and any research that does take place in this domain is usually based on data from the higher courts of law and works with English data. The lower courts and data from the different regional languages of India are often overlooked. In this paper, we deploy a Convolutional Neural Network (CNN) architecture on a corpus of Hindi legal documents. We perform a bail prediction task with the help of a CNN model and achieve an overall accuracy of 93\%, an improvement on the benchmark accuracy set by Kapoor et al. (2022), albeit on data from 20 districts of the Indian state of Uttar Pradesh.
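
For context, a generic Kim-style text CNN for binary judgement classification might look like the sketch below; the embedding size, filter widths, and other hyperparameters are assumptions, not the authors' architecture.

```python
# A generic text CNN for binary (bail / no bail) classification
# (illustrative assumptions; not the authors' exact architecture).
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab, emb=128, n_filters=100, widths=(3, 4, 5)):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.convs = nn.ModuleList(nn.Conv1d(emb, n_filters, w) for w in widths)
        self.fc = nn.Linear(n_filters * len(widths), 2)  # two classes

    def forward(self, ids):                  # ids: (B, L) token indices, L >= 5
        x = self.emb(ids).transpose(1, 2)    # (B, emb, L) for Conv1d
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))

model = TextCNN(vocab=30000)
print(model(torch.randint(0, 30000, (2, 64))).shape)  # torch.Size([2, 2])
```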

Energy-Based Concept Bottleneck Models: Unifying Prediction, Concept Intervention, and Conditional Interpretations. (arXiv:2401.14142v1 [cs.CV])

Authors: Xinyue Xu, Yi Qin, Lu Mi, Hao Wang, Xiaomeng Li

Existing methods, such as concept bottleneck models (CBMs), have been successful in providing concept-based interpretations for black-box deep learning models. They typically work by predicting concepts given the input and then predicting the final class label given the predicted concepts. However, (1) they often fail to capture the high-order, nonlinear interaction between concepts, e.g., correcting a predicted concept (e.g., "yellow breast") does not help correct highly correlated concepts (e.g., "yellow belly"), leading to suboptimal final accuracy; (2) they cannot naturally quantify the complex conditional dependencies between different concepts and class labels (e.g., for an image with the class label "Kentucky Warbler" and a concept "black bill", what is the probability that the model correctly predicts another concept "black crown"), therefore failing to provide deeper insight into how a black-box model works. In response to these limitations, we propose Energy-based Concept Bottleneck Models (ECBMs). Our ECBMs use a set of neural networks to define the joint energy of candidate (input, concept, class) tuples. With such a unified interface, prediction, concept correction, and conditional dependency quantification are then represented as conditional probabilities, which are generated by composing different energy functions. Our ECBMs address both limitations of existing CBMs, providing higher accuracy and richer concept interpretations. Empirical results show that our approach outperforms the state-of-the-art on real-world datasets.

True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning. (arXiv:2401.14151v1 [cs.LG])

Authors: Weihao Tan, Wentao Zhang, Shanqi Liu, Longtao Zheng, Xinrun Wang, Bo An

Despite their impressive performance across numerous tasks, large language models (LLMs) often fail at solving simple decision-making tasks due to the misalignment between the knowledge in LLMs and the environments. In contrast, reinforcement learning (RL) agents learn policies from scratch, which keeps them aligned with their environments but makes it difficult to incorporate prior knowledge for efficient exploration. To narrow the gap, we propose TWOSOME, a novel general online framework that deploys LLMs as decision-making agents to efficiently interact and align with embodied environments via RL, without requiring any prepared datasets or prior knowledge of the environments. Firstly, we query the joint probabilities of each valid action with LLMs to form behavior policies. Then, to enhance the stability and robustness of the policies, we propose two normalization methods and summarize four prompt design principles. Finally, we design a novel parameter-efficient training architecture where the actor and critic share one frozen LLM equipped with low-rank adapters (LoRA) updated by PPO. We conduct extensive experiments to evaluate TWOSOME. i) TWOSOME exhibits significantly better sample efficiency and performance compared to the conventional RL method PPO and the prompt tuning method SayCan, in both the classical decision-making environment Overcooked and the simulated household environment VirtualHome. ii) Benefiting from LLMs' open-vocabulary feature, TWOSOME shows superior generalization ability to unseen tasks. iii) Under our framework, there is no significant loss of the LLMs' original ability during online PPO finetuning.
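
As a sketch of the first step, forming a behavior policy by scoring each valid action with the LLM, the snippet below sums per-token log-probabilities for each action and optionally normalizes by token count (one plausible normalization; the paper proposes two). The `token_logprobs` callable is a placeholder for whatever LLM API returns per-token log-probabilities of a continuation given a prompt.

```python
# Hedged sketch: turn LLM action scores into a policy over valid actions.
# `token_logprobs(prompt, action)` is a placeholder for an LLM API that
# returns the per-token log-probabilities of `action` given `prompt`.
import numpy as np

def action_policy(prompt, actions, token_logprobs, length_normalize=True):
    scores = []
    for a in actions:
        lps = token_logprobs(prompt, a)          # list of token log-probs
        scores.append(sum(lps) / (len(lps) if length_normalize else 1.0))
    scores = np.array(scores)
    z = np.exp(scores - scores.max())            # numerically stable softmax
    return z / z.sum()
```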

Alleviating Structural Distribution Shift in Graph Anomaly Detection. (arXiv:2401.14155v1 [cs.LG])

Authors: Yuan Gao, Xiang Wang, Xiangnan He, Zhenguang Liu, Huamin Feng, Yongdong Zhang

Graph anomaly detection (GAD) is a challenging binary classification problem due to the different structural distributions of anomalies and normal nodes -- abnormal nodes are a minority, therefore holding high heterophily and low homophily compared to normal nodes. Furthermore, due to various time factors and the annotation preferences of human experts, the heterophily and homophily can change across training and testing data, which is called structural distribution shift (SDS) in this paper. The mainstream methods are built on graph neural networks (GNNs), benefiting the classification of normals from aggregating homophilous neighbors, yet ignoring the SDS issue for anomalies and suffering from poor generalization.

This work solves the problem from a feature view. We observe that the degree of SDS varies between anomalies and normal nodes. Hence, the key to addressing the issue lies in resisting high heterophily for anomalies while benefiting the learning of normals from homophily. We tease out the anomaly features, on which we impose constraints to mitigate the effect of heterophilous neighbors and make them invariant. We term our proposed framework Graph Decomposition Network (GDN). Extensive experiments are conducted on two benchmark datasets, and the proposed framework achieves a remarkable performance boost in GAD, especially in an SDS environment where anomalies have a largely different structural distribution across training and testing environments. Codes are open-sourced at https://github.com/blacksingular/wsdm_GDN.

Friendly Attacks to Improve Channel Coding Reliability. (arXiv:2401.14184v1 [cs.IT])

Authors: Anastasiia Kurmukova, Deniz Gunduz

This paper introduces a novel approach called "friendly attack" aimed at enhancing the performance of error correction channel codes. Inspired by the concept of adversarial attacks, our method leverages the idea of introducing slight perturbations to the neural network input, resulting in a substantial impact on the network's performance. By introducing small perturbations to fixed-point modulated codewords before transmission, we effectively improve the decoder's performance without violating the input power constraint. The perturbation design is accomplished by a modified iterative fast gradient method. This study investigates various decoder architectures suitable for computing gradients to obtain the desired perturbations. Specifically, we consider belief propagation (BP) for LDPC codes; the error correcting code transformer, BP and neural BP (NBP) for polar codes, and neural BCJR for convolutional codes. We demonstrate that the proposed friendly attack method can improve the reliability across different channels, modulations, codes, and decoders. This method allows us to increase the reliability of communication with a legacy receiver by simply modifying the transmitted codeword appropriately.

How Can Large Language Models Understand Spatial-Temporal Data?. (arXiv:2401.14192v1 [cs.LG])

Authors: Lei Liu, Shuo Yu, Runze Wang, Zhenxun Ma, Yanming Shen

While Large Language Models (LLMs) dominate tasks like natural language processing and computer vision, harnessing their power for spatial-temporal forecasting remains challenging. The disparity between sequential text and complex spatial-temporal data hinders this application. To address this issue, this paper introduces STG-LLM, an innovative approach empowering LLMs for spatial-temporal forecasting. We tackle the data mismatch by proposing: 1) STG-Tokenizer: This spatial-temporal graph tokenizer transforms intricate graph data into concise tokens capturing both spatial and temporal relationships; 2) STG-Adapter: This minimalistic adapter, consisting of linear encoding and decoding layers, bridges the gap between tokenized data and LLM comprehension. By fine-tuning only a small set of parameters, it can effectively grasp the semantics of tokens generated by STG-Tokenizer, while preserving the original natural language understanding capabilities of LLMs. Extensive experiments on diverse spatial-temporal benchmark datasets show that STG-LLM successfully unlocks LLM potential for spatial-temporal forecasting. Remarkably, our approach achieves competitive performance on par with dedicated SOTA methods.
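
A minimal sketch of the adapter idea, linear layers mapping spatial-temporal graph tokens into and back out of a frozen LLM's embedding space, is given below; the tensor shapes, the forecast head, and the Hugging Face-style `inputs_embeds` interface are all assumptions of the sketch.

```python
# Minimal sketch of a linear encode/decode adapter around a frozen LLM.
# Shapes and the HF-style `inputs_embeds` interface are assumptions.
import torch.nn as nn

class STGAdapter(nn.Module):
    def __init__(self, llm, token_dim, llm_dim, horizon):
        super().__init__()
        self.encode = nn.Linear(token_dim, llm_dim)  # graph tokens -> LLM space
        self.decode = nn.Linear(llm_dim, horizon)    # LLM state -> forecast
        self.llm = llm
        for p in self.llm.parameters():              # keep the LLM frozen
            p.requires_grad_(False)

    def forward(self, stg_tokens):                   # (B, T, token_dim)
        h = self.encode(stg_tokens)
        out = self.llm(inputs_embeds=h).last_hidden_state
        return self.decode(out[:, -1])               # (B, horizon) forecast
```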

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence. (arXiv:2401.14196v1 [cs.SE])

Authors: Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

MTRGL: Effective Temporal Correlation Discerning through Multi-modal Temporal Relational Graph Learning. (arXiv:2401.14199v1 [cs.LG])

Authors: Junwei Su, Shan Wu, Jinhui Li

In this study, we explore the synergy of deep learning and financial market applications, focusing on pair trading. This market-neutral strategy is integral to quantitative finance and is apt for advanced deep-learning techniques. A pivotal challenge in pair trading is discerning temporal correlations among entities, necessitating the integration of diverse data modalities. Addressing this, we introduce a novel framework, Multi-modal Temporal Relation Graph Learning (MTRGL). MTRGL combines time series data and discrete features into a temporal graph and employs a memory-based temporal graph neural network. This approach reframes temporal correlation identification as a temporal graph link prediction task, which has shown empirical success. Our experiments on real-world datasets confirm the superior performance of MTRGL, emphasizing its promise in refining automated pair trading strategies.

At the junction between deep learning and statistics of extremes: formalizing the landslide hazard definition. (arXiv:2401.14210v1 [cs.LG])

Authors: Ashok Dahal, Raphaël Huser, Luigi Lombardo

The most adopted definition of landslide hazard combines spatial information about landslide location (susceptibility), threat (intensity), and frequency (return period). Only the first two elements are usually considered and estimated when working over vast areas. Even then, separate models constitute the standard, with frequency being rarely investigated. Frequency and intensity are intertwined and depend on each other because larger events occur less frequently and vice versa. However, due to the lack of multi-temporal inventories and joint statistical models, modelling such properties via a unified hazard model has always been challenging and has yet to be attempted. Here, we develop a unified model to estimate landslide hazard at the slope unit level to address such gaps. We employed deep learning, combined with a model motivated by extreme-value theory, to analyse an inventory of 30 years of observed rainfall-triggered landslides in Nepal and assess landslide hazard for multiple return periods. We also use our model to further explore landslide hazard for the same return periods under different climate change scenarios up to the end of the century. Our results show that the proposed model performs excellently and can be used to model landslide hazard in a unified manner. Geomorphologically, we find that under both climate change scenarios (SSP245 and SSP585), landslide hazard is likely to increase by up to two times on average in the lower Himalayan regions, while remaining the same in the middle Himalayan region and decreasing slightly in the upper Himalayan region.

Communication-Efficient Federated Learning through Adaptive Weight Clustering and Server-Side Distillation. (arXiv:2401.14211v1 [cs.LG])

Authors: Vasileios Tsouvalas, Aaqib Saeed, Tanir Ozcelebi, Nirvana Meratnia

Federated Learning (FL) is a promising technique for the collaborative training of deep neural networks across multiple devices while preserving data privacy. Despite its potential benefits, FL is hindered by excessive communication costs due to repeated server-client communication during training. To address this challenge, model compression techniques such as sparsification and weight clustering are applied; however, these often require modifying the underlying model aggregation schemes or involve cumbersome hyperparameter tuning, where the latter not only adjusts the model's compression rate but also limits the model's potential for continuous improvement over growing data. In this paper, we propose FedCompress, a novel approach that combines dynamic weight clustering and server-side knowledge distillation to reduce communication costs while learning highly generalizable models. Through a comprehensive evaluation on diverse public datasets, we demonstrate the efficacy of our approach compared to baselines in terms of communication costs and inference speed. We will make our implementation public upon acceptance.
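
To illustrate why weight clustering reduces communication, the sketch below quantizes a layer's weights to k shared centroids via k-means, after which a client only needs to transmit the centroids plus a low-bit cluster index per weight. The cluster count and use of scikit-learn's KMeans are assumptions of the sketch; FedCompress's dynamic clustering and server-side distillation are not shown.

```python
# Sketch of weight clustering for communication reduction: after k-means,
# only k float centroids and a log2(k)-bit id per weight need transmitting.
# (k=16 and the KMeans settings are illustrative assumptions; k <= 256 here.)
import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(w, k=16):
    flat = w.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(flat)
    ids = km.labels_.astype(np.uint8)        # per-weight cluster index
    centroids = km.cluster_centers_.ravel()  # shared codebook
    return centroids, ids, centroids[ids].reshape(w.shape)

w = np.random.randn(64, 64).astype(np.float32)
centroids, ids, w_hat = cluster_weights(w)
print(np.abs(w - w_hat).mean())              # mean quantization error
```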

Sample Efficient Reinforcement Learning by Automatically Learning to Compose Subtasks. (arXiv:2401.14226v1 [cs.LG])

Authors: Shuai Han, Mehdi Dastani, Shihan Wang

Improving sample efficiency is central to Reinforcement Learning (RL), especially in environments where rewards are sparse. Some recent approaches propose to specify reward functions in terms of manually designed or learned reward structures, whose integration into RL algorithms is claimed to significantly improve learning efficiency. Manually designed reward structures can suffer from inaccuracy, and existing methods for learning them automatically are often computationally intractable for complex tasks. Integrating inaccurate or partial reward structures into RL algorithms can prevent the agent from learning optimal policies. In this work, we propose an RL algorithm that can automatically structure the reward function for sample efficiency, given a set of labels that signify subtasks. Given such minimal knowledge about the task, we train a high-level policy that selects optimal subtasks in each state together with a low-level policy that efficiently learns to complete each subtask. We evaluate our algorithm in a variety of sparse-reward environments. The experimental results show that our approach significantly outperforms the state-of-the-art baselines as the difficulty of the task increases.
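
A minimal sketch of a generic two-level scheme of this kind (a toy stand-in, not the paper's algorithm): a high-level value table selects a subtask per state, and a low-level table acts to complete it.

```python
import numpy as np

n_states, n_subtasks, n_actions = 10, 3, 4
Q_high = np.zeros((n_states, n_subtasks))            # high level: pick a subtask
Q_low = np.zeros((n_subtasks, n_states, n_actions))  # low level: complete it

rng = np.random.default_rng(0)

def act(state, eps=0.1):
    """Epsilon-greedy selection at both levels: the high-level policy
    chooses which subtask to pursue in this state, and the low-level
    policy chooses the primitive action for that subtask."""
    subtask = (Q_high[state].argmax() if rng.random() > eps
               else rng.integers(n_subtasks))
    action = (Q_low[subtask, state].argmax() if rng.random() > eps
              else rng.integers(n_actions))
    return subtask, action

# Learning would proceed with standard Q-updates at both levels, e.g. the
# high level being rewarded when its chosen subtask's label is achieved.
print(act(0))
```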

Assessing the Portability of Parameter Matrices Trained by Parameter-Efficient Finetuning Methods. (arXiv:2401.14228v1 [cs.CL])

Authors: Mohammed Sabry, Anya Belz

As the cost of training ever larger language models has grown, so has the interest in reusing previously learnt knowledge. Transfer learning methods have shown how reusing non-task-specific knowledge can help in subsequent task-specific learning. In this paper, we investigate the inverse: porting whole functional modules that encode task-specific knowledge from one model to another. We designed a study comprising 1,440 training/testing runs to test the portability of modules trained by parameter-efficient finetuning (PEFT) techniques, using sentiment analysis as an example task. We test portability in a wide range of scenarios, involving different PEFT techniques and different pretrained host models, among other dimensions. We compare the performance of ported modules with that of equivalent modules trained (i) from scratch, and (ii) from parameters sampled from the same distribution as the ported module. We find that the ported modules far outperform the two alternatives tested, but that there are interesting performance differences between the four PEFT techniques. We conclude that task-specific knowledge in the form of structurally modular sets of parameters as produced by PEFT techniques is highly portable, but that degree of success depends on type of PEFT and on differences between originating and receiving pretrained models.
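
The sketch below illustrates the porting operation on a hypothetical LoRA-style module implemented from scratch; actual PEFT libraries and the study's exact setup will differ:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank update:
    y = W x + (B A) x. Only A and B form the portable module."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

# Two hypothetical host models with a layer of the same shape.
host_a = LoRALinear(nn.Linear(768, 768))   # originating model (trained)
host_b = LoRALinear(nn.Linear(768, 768))   # receiving model

# "Porting": copy only the task-specific low-rank matrices from A to B.
module = {k: v for k, v in host_a.state_dict().items() if k in ("A", "B")}
host_b.load_state_dict(module, strict=False)
```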

AR-GAN: Generative Adversarial Network-Based Defense Method Against Adversarial Attacks on the Traffic Sign Classification System of Autonomous Vehicles. (arXiv:2401.14232v1 [cs.CV])

Authors: M Sabbir Salek, Abdullah Al Mamun, Mashrur Chowdhury

This study developed a generative adversarial network (GAN)-based defense method for traffic sign classification in an autonomous vehicle (AV), referred to as the attack-resilient GAN (AR-GAN). The novelty of the AR-GAN lies in (i) assuming zero knowledge of adversarial attack models and samples and (ii) providing consistently high traffic sign classification performance under various adversarial attack types. The AR-GAN classification system consists of a generator that denoises an image by reconstruction, and a classifier that classifies the reconstructed image. The authors tested the AR-GAN under no-attack conditions and under various adversarial attacks, such as Fast Gradient Sign Method (FGSM), DeepFool, Carlini and Wagner (C&W), and Projected Gradient Descent (PGD). The authors considered two forms of these attacks, i.e., (i) black-box attacks (assuming the attackers possess no prior knowledge of the classifier) and (ii) white-box attacks (assuming the attackers possess full knowledge of the classifier). The classification performance of the AR-GAN was compared with several benchmark adversarial defense methods. The results showed that both the AR-GAN and the benchmark defense methods are resilient against black-box attacks and could achieve classification performance similar to that on unperturbed images. However, for all the white-box attacks considered in this study, the AR-GAN method outperformed the benchmark defense methods. In addition, the AR-GAN was able to maintain its high classification performance under varied white-box adversarial perturbation magnitudes, whereas the performance of the other defense methods dropped abruptly at increased perturbation magnitudes.
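
A minimal sketch of the denoise-then-classify inference pipeline described above, with both networks as untrained stand-ins rather than the paper's architectures:

```python
import torch

def ar_gan_predict(generator, classifier, image):
    """Defense pipeline in the AR-GAN spirit: first denoise the (possibly
    adversarial) traffic-sign image by reconstructing it with a generator
    trained on clean data, then classify the reconstruction. No knowledge
    of the attack is assumed at any point."""
    with torch.no_grad():
        reconstructed = generator(image)              # remove perturbations
        return classifier(reconstructed).argmax(dim=1)
```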

Enhanced Labeling Technique for Reddit Text and Fine-Tuned Longformer Models for Classifying Depression Severity in English and Luganda. (arXiv:2401.14240v1 [cs.CL])

Authors: Richard Kimera, Daniela N. Rim, Joseph Kirabira, Ubong Godwin Udomah, Heeyoul Choi

Depression is a global burden and one of the most challenging mental health conditions to control. Experts can detect its severity early using the Beck Depression Inventory (BDI) questionnaire, administer appropriate medication to patients, and impede its progression. Due to the fear of potential stigmatization, many patients turn to social media platforms like Reddit for advice and assistance at various stages of their journey. This research extracts text from Reddit to facilitate the diagnostic process. It employs a proposed labeling approach to categorize the text and subsequently fine-tunes the Longformer model. The model's performance is compared against baseline models, including Naive Bayes, Random Forest, Support Vector Machines, and Gradient Boosting. Our findings reveal that the Longformer model outperforms the baseline models in both the English (48%) and Luganda (45%) languages on a custom-made dataset.
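
A minimal fine-tuning setup along these lines, assuming the Hugging Face transformers API and a hypothetical four-class severity scheme; the paper's label set, checkpoints, and preprocessing may differ:

```python
from transformers import (LongformerTokenizerFast,
                          LongformerForSequenceClassification)

# Hypothetical setup: four BDI-inspired severity classes (label scheme assumed).
name = "allenai/longformer-base-4096"
tokenizer = LongformerTokenizerFast.from_pretrained(name)
model = LongformerForSequenceClassification.from_pretrained(name, num_labels=4)

texts = ["I haven't slept properly in weeks and nothing feels worth doing."]
batch = tokenizer(texts, truncation=True, max_length=4096,
                  padding=True, return_tensors="pt")
logits = model(**batch).logits   # fine-tune with cross-entropy as usual
print(logits.shape)              # (1, 4)
```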

Interpretable Solutions for Breast Cancer Diagnosis with Grammatical Evolution and Data Augmentation. (arXiv:2401.14255v1 [cs.LG])

Authors: Yumnah Hasan, Allan de Lima, Fatemeh Amerehi, Darian Reyes Fernandez de Bulnes, Patrick Healy, Conor Ryan

Medical imaging diagnosis increasingly relies on Machine Learning (ML) models. This is a task that is often hampered by severely imbalanced datasets, where positive cases can be quite rare. Their use is further compromised by their limited interpretability, which is becoming increasingly important. While post-hoc interpretability techniques such as SHAP and LIME have been used with some success on so-called black box models, the use of inherently understandable models makes such endeavors more fruitful. This paper addresses these issues by demonstrating how a relatively new synthetic data generation technique, STEM, can be used to produce data to train models produced by Grammatical Evolution (GE) that are inherently understandable. STEM is a recently introduced combination of the Synthetic Minority Oversampling Technique (SMOTE), Edited Nearest Neighbour (ENN), and Mixup; it has previously been successfully used to tackle both between class and within class imbalance issues. We test our technique on the Digital Database for Screening Mammography (DDSM) and the Wisconsin Breast Cancer (WBC) datasets and compare Area Under the Curve (AUC) results with an ensemble of the top three performing classifiers from a set of eight standard ML classifiers with varying degrees of interpretability. We demonstrate that the GE-derived models present the best AUC while still maintaining interpretable solutions.
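
A sketch of the STEM ingredients using off-the-shelf components: imbalanced-learn's SMOTEENN for the SMOTE+ENN steps followed by a plain Mixup pass. STEM's exact recipe and hyperparameters may differ; the toy features below stand in for DDSM/WBC data:

```python
import numpy as np
from imblearn.combine import SMOTEENN   # SMOTE oversampling + ENN cleaning
from sklearn.datasets import make_classification

# Toy imbalanced data (illustrative).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SMOTE then Edited Nearest Neighbours, in one estimator.
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)

# Mixup within the resampled set, producing soft labels.
rng = np.random.default_rng(0)
lam = rng.beta(0.4, 0.4, size=len(X_res))[:, None]
perm = rng.permutation(len(X_res))
X_mix = lam * X_res + (1 - lam) * X_res[perm]
y_mix = lam.ravel() * y_res + (1 - lam.ravel()) * y_res[perm]
```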

Producing Plankton Classifiers that are Robust to Dataset Shift. (arXiv:2401.14256v1 [cs.CV])

Authors: Cheng Chen, Sreenath Kyathanahally, Marta Reyes, Stefanie Merkli, Ewa Merz, Emanuele Francazi, Marvin Hoege, Francesco Pomati, Marco Baity-Jesi

Modern plankton high-throughput monitoring relies on deep learning classifiers for species recognition in water ecosystems. Despite satisfactory nominal performance, a significant challenge arises from Dataset Shift, which causes performance to drop during deployment. In our study, we integrate the ZooLake dataset with manually-annotated images from 10 independent days of deployment, serving as test cells to benchmark Out-Of-Dataset (OOD) performance. Our analysis reveals instances where classifiers, initially performing well in In-Dataset conditions, encounter notable failures in practical scenarios. For example, a MobileNet with a 92% nominal test accuracy shows a 77% OOD accuracy. We systematically investigate conditions leading to OOD performance drops, propose a preemptive assessment method to identify potential pitfalls when classifying new data, and pinpoint features in OOD images that adversely impact classification. We present a three-step pipeline: (i) identifying OOD degradation compared to nominal test performance, (ii) conducting a diagnostic analysis of degradation causes, and (iii) providing solutions. We find that ensembles of BEiT vision transformers, with targeted augmentations addressing OOD robustness, geometric ensembling, and rotation-based test-time augmentation, constitute the most robust model, which we call the BEsT model. It achieves an 83% OOD accuracy, with errors concentrated on container classes. Moreover, it exhibits lower sensitivity to dataset shift and reproduces plankton abundances well. Our proposed pipeline is applicable to generic plankton classifiers, contingent on the availability of suitable test cells. By identifying critical shortcomings and offering practical procedures to fortify models against dataset shift, our study contributes to the development of more reliable plankton classification technologies.
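
A minimal sketch of two of the ingredients, rotation-based test-time augmentation and geometric ensembling; `model` and `models` are assumed to be callables returning probability vectors, standing in for trained classifiers:

```python
import numpy as np

def predict_tta(model, image):
    """Average a classifier's probabilities over the four 90-degree
    rotations of the input (rotation-based test-time augmentation)."""
    probs = [model(np.rot90(image, k=k, axes=(0, 1))) for k in range(4)]
    return np.mean(probs, axis=0)

def predict_geometric_ensemble(models, image):
    """Geometric ensembling: combine ensemble members by the normalized
    geometric mean of their TTA probabilities."""
    logp = np.mean([np.log(predict_tta(m, image) + 1e-12) for m in models],
                   axis=0)
    p = np.exp(logp)
    return p / p.sum()
```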

Information Leakage Detection through Approximate Bayes-optimal Prediction. (arXiv:2401.14283v1 [stat.ML])

Authors: Pritha Gupta, Marcel Wever, Eyke Hüllermeier

In today's data-driven world, the proliferation of publicly available information intensifies the challenge of information leakage (IL), raising security concerns. IL involves unintentionally exposing secret (sensitive) information to unauthorized parties via systems' observable information. Conventional statistical approaches, which estimate mutual information (MI) between observable and secret information for detecting IL, face challenges such as the curse of dimensionality, convergence, computational complexity, and MI misestimation. Furthermore, emerging supervised machine learning (ML) methods, though effective, are limited to binary system-sensitive information and lack a comprehensive theoretical framework. To address these limitations, we establish a theoretical framework using statistical learning theory and information theory to accurately quantify and detect IL. We demonstrate that MI can be accurately estimated by approximating the log-loss and accuracy of the Bayes predictor. As the Bayes predictor is typically unknown in practice, we propose to approximate it with the help of automated machine learning (AutoML). First, we compare our MI estimation approaches against current baselines, using synthetic data sets generated using the multivariate normal (MVN) distribution with known MI. Second, we introduce a cut-off technique using one-sided statistical tests to detect IL, employing the Holm-Bonferroni correction to increase confidence in detection decisions. Our study evaluates IL detection performance on real-world data sets, highlighting the effectiveness of the Bayes predictor's log-loss estimation, and finds our proposed method to effectively estimate MI on synthetic data sets and thus detect ILs accurately.
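
The core estimation idea can be sketched as follows: since MI(X;Y) = H(Y) - H(Y|X) and the log-loss of a good probabilistic classifier upper-bounds H(Y|X), a trained classifier yields an MI estimate. Logistic regression stands in here for the paper's AutoML-selected approximation of the Bayes predictor:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic leakage channel: observable X carries information about secret Y.
rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=5000)
X = rng.normal(loc=Y[:, None], scale=1.5, size=(5000, 4))  # MVN-style data

X_tr, X_te, y_tr, y_te = train_test_split(X, Y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)   # stand-in for the AutoML model

# MI(X;Y) = H(Y) - H(Y|X); the classifier's log-loss (nats) estimates H(Y|X).
p = np.bincount(y_te) / len(y_te)
h_y = -np.sum(p * np.log(p))
h_y_given_x = log_loss(y_te, clf.predict_proba(X_te))  # natural log by default
print(f"estimated MI ~= {h_y - h_y_given_x:.3f} nats")
```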

Speech foundation models on intelligibility prediction for hearing-impaired listeners. (arXiv:2401.14289v1 [cs.SD])

Authors: Santiago Cuervo, Ricard Marxer

Speech foundation models (SFMs) have been benchmarked on many speech processing tasks, often achieving state-of-the-art performance with minimal adaptation. However, the SFM paradigm has been significantly less explored for applications of interest to the speech perception community. In this paper we present a systematic evaluation of 10 SFMs on one such application: Speech intelligibility prediction. We focus on the non-intrusive setup of the Clarity Prediction Challenge 2 (CPC2), where the task is to predict the percentage of words correctly perceived by hearing-impaired listeners from speech-in-noise recordings. We propose a simple method that learns a lightweight specialized prediction head on top of frozen SFMs to approach the problem. Our results reveal statistically significant differences in performance across SFMs. Our method resulted in the winning submission in the CPC2, demonstrating its promise for speech perception applications.
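
A minimal sketch of a lightweight prediction head on top of frozen SFM features; the pooling, layer sizes, and the random tensor standing in for SFM outputs are all assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class IntelligibilityHead(nn.Module):
    """A small head on frozen speech-foundation-model features, predicting
    the percentage of words a listener perceives correctly."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, feats):          # feats: (batch, time, feat_dim)
        pooled = feats.mean(dim=1)     # temporal average pooling
        return 100 * torch.sigmoid(self.mlp(pooled)).squeeze(-1)  # 0-100%

with torch.no_grad():                  # the SFM itself stays frozen
    feats = torch.randn(2, 300, 768)   # stand-in for SFM outputs
head = IntelligibilityHead()
loss = nn.functional.mse_loss(head(feats), torch.tensor([85.0, 40.0]))
loss.backward()                        # only the head receives gradients
```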

Topologies of Reasoning: Demystifying Chains, Trees, and Graphs of Thoughts. (arXiv:2401.14295v1 [cs.CL])

Authors: Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, Ales Kubicek, Hubert Niewiadomski, Onur Mutlu, Torsten Hoefler

The field of natural language processing (NLP) has witnessed significant progress in recent years, with a notable focus on improving large language models' (LLM) performance through innovative prompting techniques. Among these, prompt engineering coupled with structures has emerged as a promising paradigm, with designs such as Chain-of-Thought, Tree of Thoughts, or Graph of Thoughts, in which the overall LLM reasoning is guided by a structure such as a graph. As illustrated with numerous examples, this paradigm significantly enhances the LLM's capability to solve numerous tasks, ranging from logical or mathematical reasoning to planning or creative writing. To facilitate the understanding of this growing field and pave the way for future developments, we devise a general blueprint for effective and efficient LLM reasoning schemes. For this, we conduct an in-depth analysis of the prompt execution pipeline, clarifying and clearly defining different concepts. We then build the first taxonomy of structure-enhanced LLM reasoning schemes. We focus on identifying fundamental classes of harnessed structures, and we analyze the representations of these structures, algorithms executed with these structures, and many others. We refer to these structures as reasoning topologies, because their representation becomes to a degree spatial, as they are contained within the LLM context. Our study compares existing prompting schemes using the proposed taxonomy, discussing how certain design choices lead to different patterns in performance and cost. We also outline theoretical underpinnings, relationships between prompting and other parts of the LLM ecosystem such as knowledge bases, and the associated research challenges. Our work will help to advance future prompt engineering techniques.

"All of Me": Mining Users' Attributes from their Public Spotify Playlists. (arXiv:2401.14296v1 [cs.CR])

Authors: Pier Paolo Tricomi, Luca Pajola, Luca Pasa, Mauro Conti

In the age of digital music streaming, playlists on platforms like Spotify have become an integral part of individuals' musical experiences. People create and publicly share their own playlists to express their musical tastes, promote the discovery of their favorite artists, and foster social connections. These publicly accessible playlists transcend the boundaries of mere musical preferences: they serve as sources of rich insights into users' attributes and identities. For example, the musical preferences of elderly individuals may lean more towards Frank Sinatra, while Billie Eilish remains a favored choice among teenagers. These playlists thus become windows into the diverse and evolving facets of one's musical identity.

In this work, we investigate the relationship between Spotify users' attributes and their public playlists. In particular, we focus on identifying recurring musical characteristics associated with users' individual attributes, such as demographics, habits, or personality traits. To this end, we conducted an online survey involving 739 Spotify users, yielding a dataset of 10,286 publicly shared playlists encompassing over 200,000 unique songs and 55,000 artists. Through extensive statistical analyses, we first assess a deep connection between a user's Spotify playlists and their real-life attributes. For instance, we found individuals high in openness often create playlists featuring a diverse array of artists, while female users prefer Pop and K-pop music genres. Building upon these observed associations, we create accurate predictive models for users' attributes, presenting a novel DeepSet application that outperforms baselines on most of these attributes.
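
A minimal DeepSet sketch for playlist-level prediction: a shared encoder per song, a permutation-invariant sum over the playlist, and a decoder producing an attribute logit. Sizes and features are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class DeepSet(nn.Module):
    """Permutation-invariant predictor: encode each song, sum over the
    playlist, then decode into an attribute (here a binary one)."""
    def __init__(self, song_dim=16, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(song_dim, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, songs):          # songs: (batch, n_songs, song_dim)
        return self.rho(self.phi(songs).sum(dim=1)).squeeze(-1)  # logits

playlists = torch.randn(8, 50, 16)    # 8 playlists of 50 song-feature vectors
logits = DeepSet()(playlists)         # one attribute logit per user/playlist
```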

SunBlock: Cloudless Protection for IoT Systems. (arXiv:2401.14332v1 [cs.CR])

Authors: Vadim Safronov, Anna Maria Mandalari, Daniel J. Dubois, David Choffnes, Hamed Haddadi

With an increasing number of Internet of Things (IoT) devices present in homes, there is a rise in the number of potential information leakage channels and their associated security threats and privacy risks. Despite a long history of attacks on IoT devices in unprotected home networks, the problem of accurate, rapid detection and prevention of such attacks remains open. Many existing IoT protection solutions are cloud-based, sometimes ineffective, and might share consumer data with unknown third parties. This paper investigates the potential for effective IoT threat detection locally, on a home router, using AI tools combined with classic rule-based traffic-filtering algorithms. Our results show that, with a modest increase in router hardware usage from the machine learning and traffic-filtering logic, a typical home router instrumented with our solution is able to effectively detect risks and protect a typical home IoT network, equaling or outperforming existing popular solutions, without any effect on benign IoT functionality and without relying on cloud services or third parties.

Estimation of partially known Gaussian graphical models with score-based structural priors. (arXiv:2401.14340v1 [stat.ML])

Authors: Martín Sevilla, Antonio García Marques, Santiago Segarra

We propose a novel algorithm for the support estimation of partially known Gaussian graphical models that incorporates prior information about the underlying graph. In contrast to classical approaches that provide a point estimate based on a maximum likelihood or a maximum a posteriori criterion using (simple) priors on the precision matrix, we consider a prior on the graph and rely on annealed Langevin diffusion to generate samples from the posterior distribution. Since the Langevin sampler requires access to the score function of the underlying graph prior, we use graph neural networks to effectively estimate the score from a graph dataset (either available beforehand or generated from a known distribution). Numerical experiments demonstrate the benefits of our approach.
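
A minimal sketch of annealed Langevin sampling with a pluggable score function; in the paper the score comes from a GNN trained on a graph dataset, while here a toy Gaussian score stands in:

```python
import numpy as np

def annealed_langevin(score, x0, noise_levels, steps=50, eps=1e-4, seed=0):
    """Annealed Langevin dynamics: at each noise level sigma, run `steps`
    updates x <- x + (a/2) * score(x, sigma) + sqrt(a) * z with Gaussian
    noise z and step size a scaled to the noise level."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for sigma in noise_levels:                 # from large to small sigma
        a = eps * (sigma / noise_levels[-1]) ** 2
        for _ in range(steps):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * a * score(x, sigma) + np.sqrt(a) * z
    return x

# Toy score of a Gaussian prior, standing in for the GNN-estimated score.
toy_score = lambda x, sigma: -x / (1 + sigma ** 2)
sample = annealed_langevin(toy_score, np.zeros((10, 10)),
                           noise_levels=[1.0, 0.5, 0.1])
```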

Class-attribute Priors: Adapting Optimization to Heterogeneity and Fairness Objective. (arXiv:2401.14343v1 [cs.LG])

Authors: Xuechen Zhang, Mingchen Li, Jiasi Chen, Christos Thrampoulidis, Samet Oymak

Modern classification problems exhibit heterogeneities across individual classes: each class may have unique attributes, such as sample size, label quality, or predictability (easy vs. difficult), and variable importance at test time. Without care, these heterogeneities impede the learning process, most notably when optimizing fairness objectives. Confirming this, under a Gaussian mixture setting, we show that the optimal SVM classifier for balanced accuracy needs to be adaptive to the class attributes. This motivates us to propose CAP: an effective and general method that generates a class-specific learning strategy (e.g. hyperparameters) based on the attributes of that class. This way, the optimization process adapts better to heterogeneities. CAP leads to substantial improvements over the naive approach of assigning separate hyperparameters to each class. We instantiate CAP for loss function design and post-hoc logit adjustment, with emphasis on label-imbalanced problems. We show that CAP is competitive with prior art and that its flexibility unlocks clear benefits for fairness objectives beyond balanced accuracy. Finally, we evaluate CAP on problems with label noise as well as weighted test objectives to showcase how CAP can jointly adapt to different heterogeneities.
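
One concrete instance of mapping a class attribute to a class-specific strategy is post-hoc logit adjustment by label frequency, sketched below; CAP itself is more general and learns the mapping rather than fixing it:

```python
import numpy as np

def logit_adjust(logits, class_priors, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) per class so
    rare classes are not crowded out by frequent ones. This is one simple
    mapping from a class attribute (sample size) to a per-class correction."""
    return logits - tau * np.log(np.asarray(class_priors))

logits = np.array([[2.0, 1.5, -0.5]])         # raw model scores
priors = [0.7, 0.25, 0.05]                    # long-tailed label frequencies
print(logit_adjust(logits, priors).argmax())  # prediction after adjustment
```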

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models. (arXiv:2401.14351v1 [cs.LG])

Authors: Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai

This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses state-of-the-art systems by 10 - 200X in latency performance when running various LLM inference workloads.

MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving. (arXiv:2401.14361v1 [cs.LG])

Authors: Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina

This paper presents MoE-Infinity, a cost-efficient mixture-of-expert (MoE) serving system that realizes activation-aware expert offloading. MoE-Infinity features sequence-level expert activation tracing, a new approach adept at identifying sparse activations and capturing the temporal locality of MoE inference. By analyzing these traces, MoE-Infinity performs novel activation-aware expert prefetching and caching, substantially reducing the latency overheads usually associated with offloading experts for improved cost performance. Extensive experiments in a cluster show that MoE-Infinity outperforms numerous existing systems and approaches, reducing latency by 4 - 20X and decreasing deployment costs by over 8X for various MoEs. MoE-Infinity's source code is publicly available at https://github.com/TorchMoE/MoE-Infinity

Genie: Achieving Human Parity in Content-Grounded Datasets Generation. (arXiv:2401.14367v1 [cs.CL])

Authors: Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen

The lack of high-quality data for content-grounded generation tasks has been identified as a major obstacle to advancing these tasks. To address this gap, we propose Genie, a novel method for automatically generating high-quality content-grounded data. It consists of three stages: (a) Content Preparation; (b) Generation, creating task-specific examples from the content (e.g., question-answer pairs or summaries); and (c) Filtering, aimed at ensuring the quality and faithfulness of the generated data. We showcase this methodology by generating three large-scale synthetic datasets (making wishes) for Long-Form Question-Answering (LFQA), summarization, and information extraction. In a human evaluation, our generated data was found to be natural and of high quality. Furthermore, we compare models trained on our data with models trained on human-written data -- ELI5 and ASQA for LFQA and CNN-DailyMail for summarization. We show that our models are on par with or outperform models trained on human-generated data and consistently outperform them in faithfulness. Finally, we applied our method to create LFQA data within the medical domain and compared a model trained on it with models trained on other domains.

TURNA: A Turkish Encoder-Decoder Language Model for Enhanced Understanding and Generation. (arXiv:2401.14373v1 [cs.CL])

Authors: Gökçe Uludoğan, Zeynep Yirmibeşoğlu Balal, Furkan Akkurt, Melikşah Türker, Onur Güngör, Susan Üsküdarlı

The recent advances in natural language processing have predominantly favored well-resourced English-centric models, resulting in a significant gap with low-resource languages. In this work, we introduce the language model TURNA, which is developed for the low-resource language Turkish and is capable of both natural language understanding and generation tasks. TURNA is pretrained with an encoder-decoder architecture based on the unified framework UL2 with a diverse corpus that we specifically curated for this purpose. We evaluated TURNA with three generation tasks and five understanding tasks for Turkish. The results show that TURNA outperforms several multilingual models in both understanding and generation tasks, and competes with monolingual Turkish models in understanding tasks. TURNA is made available at https://huggingface.co/boun-tabi-LMG/TURNA .

UrbanGenAI: Reconstructing Urban Landscapes using Panoptic Segmentation and Diffusion Models. (arXiv:2401.14379v1 [cs.CV])

Authors: Timo Kapsalis

In contemporary design practices, the integration of computer vision and generative artificial intelligence (genAI) represents a transformative shift towards more interactive and inclusive processes. These technologies offer new dimensions of image analysis and generation, which are particularly relevant in the context of urban landscape reconstruction. This paper presents a novel workflow encapsulated within a prototype application, designed to leverage the synergies between advanced image segmentation and diffusion models for a comprehensive approach to urban design. Our methodology encompasses the OneFormer model for detailed image segmentation and the Stable Diffusion XL (SDXL) diffusion model, implemented through ControlNet, for generating images from textual descriptions. Validation results indicated a high degree of performance by the prototype application, showcasing significant accuracy in both object detection and text-to-image generation. This was evidenced by superior Intersection over Union (IoU) and CLIP scores across iterative evaluations for various categories of urban landscape features. Preliminary testing included utilising UrbanGenAI as an educational tool enhancing the learning experience in design pedagogy, and as a participatory instrument facilitating community-driven urban planning. Early results suggested that UrbanGenAI not only advances the technical frontiers of urban landscape reconstruction but also provides significant pedagogical and participatory planning benefits. The ongoing development of UrbanGenAI aims to further validate its effectiveness across broader contexts and integrate additional features such as real-time feedback mechanisms and 3D modelling capabilities. Keywords: generative AI; panoptic image segmentation; diffusion models; urban landscape design; design pedagogy; co-design

Manifold GCN: Diffusion-based Convolutional Neural Network for Manifold-valued Graphs. (arXiv:2401.14381v1 [cs.LG])

Authors: Martin Hanik, Gabriele Steidl, Christoph von Tycowicz

We propose two graph neural network layers for graphs with features in a Riemannian manifold. First, based on a manifold-valued graph diffusion equation, we construct a diffusion layer that can be applied to an arbitrary number of nodes and graph connectivity patterns. Second, we model a tangent multilayer perceptron by transferring ideas from the vector neuron framework to our general setting. Both layers are equivariant with respect to node permutations and isometries of the feature manifold. These properties have been shown to lead to a beneficial inductive bias in many deep learning tasks. Numerical examples on synthetic data as well as on triangle meshes of the right hippocampus to classify Alzheimer's disease demonstrate the very good performance of our layers.

An Orthogonal Polynomial Kernel-Based Machine Learning Model for Differential-Algebraic Equations. (arXiv:2401.14382v1 [math.NA])

Authors: Tayebeh Taheri, Alireza Afzal Aghaei, Kourosh Parand

The recent introduction of the Least-Squares Support Vector Regression (LS-SVR) algorithm for solving differential and integral equations has sparked interest. In this study, we expand the application of this algorithm to address systems of differential-algebraic equations (DAEs). Our work presents a novel approach to solving general DAEs in an operator format by establishing connections between the LS-SVR machine learning model, weighted residual methods, and Legendre orthogonal polynomials. To assess the effectiveness of our proposed method, we conduct simulations involving various DAE scenarios, such as nonlinear systems, fractional-order derivatives, integro-differential, and partial DAEs. Finally, we carry out comparisons between our proposed method and currently established state-of-the-art approaches, demonstrating its reliability and effectiveness.

Smooth Ranking SVM via Cutting-Plane Method. (arXiv:2401.14388v1 [cs.LG])

Authors: Erhan Can Ozcan, Berk Görgülü, Mustafa G. Baydogan, Ioannis Ch. Paschalidis

The most popular classification algorithms are designed to maximize classification accuracy during training. However, this strategy may fail in the presence of class imbalance, since it is possible to train models with high accuracy by overfitting to the majority class. On the other hand, the Area Under the Curve (AUC) is a widely used metric for comparing the classification performance of different algorithms when there is a class imbalance, and various approaches focusing on the direct optimization of this metric during training have been proposed. Among them, SVM-based formulations are especially popular, as this formulation allows incorporating different regularization strategies easily. In this work, we develop a prototype learning approach that relies on the cutting-plane method, similar to Ranking SVM, to maximize AUC. Our algorithm learns simpler models by iteratively introducing cutting planes, thus preventing overfitting in an unconventional way. Furthermore, it penalizes changes in the weights at each iteration to avoid large jumps that might otherwise be observed in test performance, thus facilitating a smooth learning process. Based on experiments conducted on 73 binary classification datasets, our method yields the best test AUC in 25 datasets among its relevant competitors.

pix2gestalt: Amodal Segmentation by Synthesizing Wholes. (arXiv:2401.14398v1 [cs.CV])

Authors: Ege Ozguroglu, Ruoshi Liu, Dídac Surís, Dian Chen, Achal Dave, Pavel Tokmakov, Carl Vondrick

We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task, we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases, including examples that break natural and physical priors, such as art. As training data, we use a synthetically curated dataset containing occluded objects paired with their whole counterparts. Experiments show that our approach outperforms supervised baselines on established benchmarks. Our model can furthermore be used to significantly improve the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions.

Adaptive Mobile Manipulation for Articulated Objects In the Open World. (arXiv:2401.14403v1 [cs.RO])

Authors: Haoyu Xiong, Russell Mendonca, Kenneth Shaw, Deepak Pathak

Deploying robots in open-ended unstructured environments such as homes has been a long-standing research problem. However, robots are often studied only in closed-off lab settings, and prior mobile manipulation work is restricted to pick-move-place, which is arguably just the tip of the iceberg in this area. In this paper, we introduce the Open-World Mobile Manipulation System, a full-stack approach to tackle realistic articulated object operation, e.g. real-world doors, cabinets, drawers, and refrigerators, in open-ended unstructured environments. The robot utilizes an adaptive learning framework: it initially learns from a small set of data through behavior cloning, and then learns from online practice on novel objects that fall outside the training distribution. We also develop a low-cost mobile manipulation hardware platform capable of safe and autonomous online adaptation in unstructured environments, at a cost of around 20,000 USD. In our experiments we utilize 20 articulated objects across 4 buildings on the CMU campus. With less than an hour of online learning per object, the system is able to increase the success rate from the 50% of BC pre-training to 95% with online adaptation. Video results at https://open-world-mobilemanip.github.io/

Deconstructing Denoising Diffusion Models for Self-Supervised Learning. (arXiv:2401.14404v1 [cs.CV])

Authors: Xinlei Chen, Zhuang Liu, Saining Xie, Kaiming He

In this study, we examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to deconstruct a DDM, gradually transforming it into a classical Denoising Autoencoder (DAE). This deconstructive procedure allows us to explore how various components of modern DDMs influence self-supervised representation learning. We observe that only a very few modern components are critical for learning good representations, while many others are nonessential. Our study ultimately arrives at an approach that is highly simplified and to a large extent resembles a classical DAE. We hope our study will rekindle interest in a family of classical methods within the realm of modern self-supervised learning.

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities. (arXiv:2401.14405v1 [cs.CV])

Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue

We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway - given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and task-specific head as usual but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference costs. On the image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT.

Gradient Flows for Regularized Stochastic Control Problems. (arXiv:2006.05956v5 [math.OC] UPDATED)

Authors: David Šiška, Łukasz Szpruch

This paper studies stochastic control problems with the action space taken to be probability measures, with the objective penalised by the relative entropy. We identify a suitable metric space on which we construct a gradient flow for the measure-valued control process, within the set of admissible controls, along which the cost functional is guaranteed to decrease. It is shown that any invariant measure of this gradient flow satisfies the Pontryagin optimality principle. If the problem we work with is sufficiently convex, the gradient flow converges exponentially fast. Furthermore, the optimal measure-valued control process admits a Bayesian interpretation, which means that one can incorporate prior knowledge when solving such stochastic control problems. This work is motivated by a desire to extend the theoretical underpinning for the convergence of stochastic gradient type algorithms widely employed in the reinforcement learning community to solve control problems.

Adversarial Graph Disentanglement. (arXiv:2103.07295v4 [cs.LG] UPDATED)

Authors: Shuai Zheng, Zhenfeng Zhu, Zhizhe Liu, Jian Cheng, Yao Zhao

A real-world graph has a complex topological structure, which is often formed by the interaction of different latent factors. However, most existing methods lack consideration of the intrinsic differences in relations between nodes caused by factor entanglement. In this paper, we propose an Adversarial Disentangled Graph Convolutional Network (ADGCN) for disentangled graph representation learning. To begin with, we point out two aspects of graph disentanglement that need to be considered, i.e., micro-disentanglement and macro-disentanglement. For them, a component-specific aggregation approach is proposed to achieve micro-disentanglement by inferring latent components that cause the links between nodes. On the basis of micro-disentanglement, we further propose a macro-disentanglement adversarial regularizer to improve the separability among component distributions, thus restricting the interdependence among components. Additionally, to reveal the topological graph structure, a diversity-preserving node sampling approach is proposed, by which the graph structure can be progressively refined in a way of local structure awareness. The experimental results on various real-world graph data verify that our ADGCN obtains more favorable performance over currently available alternatives. The source codes of ADGCN are available at https://github.com/SsGood/ADGCN.

A Link between Coding Theory and Cross-Validation with Applications. (arXiv:2103.11856v2 [cs.LG] UPDATED)

Authors: Tapio Pahikkala, Parisa Movahedi, Ileana Montoya, Havu Miikonen, Stephan Foldes, Antti Airola, Laszlo Major

How many different binary classification problems can a single learning algorithm solve on a fixed dataset with exactly zero, or at most a given number of, cross-validation errors? While the number in the former case is known to be limited by the no-free-lunch theorem, we show that the exact answers are given by the theory of error detecting codes. As a case study, we focus on the AUC performance measure and leave-pair-out cross-validation (LPOCV), in which every possible pair of data points with different class labels is held out at a time. We show that the maximal number of classification problems with fixed class proportion, for which a learning algorithm can achieve zero LPOCV error, equals the maximal number of code words in a constant weight code (CWC) with certain technical properties. We then generalize CWCs by introducing light CWCs and prove an analogous result for nonzero LPOCV errors and light CWCs. Moreover, we prove both upper and lower bounds on the maximal numbers of code words in light CWCs. Finally, as an immediate practical application, we develop new LPOCV-based randomization tests for learning algorithms that generalize the classical Wilcoxon-Mann-Whitney U test.

Derivative-free Alternating Projection Algorithms for General Nonconvex-Concave Minimax Problems. (arXiv:2108.00473v5 [math.OC] UPDATED)

Authors: Zi Xu, Ziqi Wang, Jingjing Shen, Yuhong Dai

In this paper, we study zeroth-order algorithms for nonconvex-concave minimax problems, which have attracted wide attention in machine learning, signal processing, and many other fields in recent years. We propose a zeroth-order alternating randomized gradient projection (ZO-AGP) algorithm for smooth nonconvex-concave minimax problems; its iteration complexity to obtain an $\varepsilon$-stationary point is bounded by $\mathcal{O}(\varepsilon^{-4})$, and the number of function value estimations per iteration is bounded by $\mathcal{O}(d_{x}+d_{y})$. Moreover, we propose a zeroth-order block alternating randomized proximal gradient algorithm (ZO-BAPG) for solving block-wise nonsmooth nonconvex-concave minimax optimization problems; its iteration complexity to obtain an $\varepsilon$-stationary point is bounded by $\mathcal{O}(\varepsilon^{-4})$, and the number of function value estimations per iteration is bounded by $\mathcal{O}(K d_{x}+d_{y})$. To the best of our knowledge, this is the first time that zeroth-order algorithms with iteration complexity guarantees have been developed for solving both general smooth and block-wise nonsmooth nonconvex-concave minimax problems. Numerical results on the data poisoning attack problem and the distributed nonconvex sparse principal component analysis problem validate the efficiency of the proposed algorithms.
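
For context, a generic two-point zeroth-order gradient estimator of the kind such algorithms build on (not ZO-AGP itself) can be sketched as follows:

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, n_dirs=20, seed=0):
    """Two-point zeroth-order gradient estimate: average directional
    finite differences along random Gaussian directions, using only
    function value queries."""
    rng = np.random.default_rng(seed)
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.standard_normal(x.size)
        g += (f(x + mu * u) - f(x)) / mu * u
    return g / n_dirs

f = lambda x: np.sum(x ** 2)     # toy smooth objective
x = np.ones(5)
print(zo_gradient(f, x))         # close to the true gradient 2x
```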

EvadeDroid: A Practical Evasion Attack on Machine Learning for Black-box Android Malware Detection. (arXiv:2110.03301v4 [cs.LG] UPDATED)

Authors: Hamid Bostani, Veelasha Moonsamy

Over the last decade, researchers have extensively explored the vulnerabilities of Android malware detectors to adversarial examples through the development of evasion attacks; however, the practicality of these attacks in real-world scenarios remains arguable. The majority of studies have assumed attackers know the details of the target classifiers used for malware detection, while in reality, malicious actors have limited access to the target classifiers. This paper introduces EvadeDroid, a problem-space adversarial attack designed to effectively evade black-box Android malware detectors in real-world scenarios. EvadeDroid constructs a collection of problem-space transformations derived from benign donors that share opcode-level similarity with malware apps by leveraging an n-gram-based approach. These transformations are then used to morph malware instances into benign ones via an iterative and incremental manipulation strategy. The proposed manipulation technique is a query-efficient optimization algorithm that can find and inject optimal sequences of transformations into malware apps. Our empirical evaluations, carried out on 1K malware apps, demonstrate the effectiveness of our approach in generating real-world adversarial examples in both soft- and hard-label settings. Our findings reveal that EvadeDroid can effectively deceive diverse malware detectors that utilize different features with various feature types. Specifically, EvadeDroid achieves evasion rates of 80%-95% against DREBIN, Sec-SVM, ADE-MA, MaMaDroid, and Opcode-SVM with only 1-9 queries. Furthermore, we show that the proposed problem-space adversarial attack is able to preserve its stealthiness against five popular commercial antiviruses with an average of 79% evasion rate, thus demonstrating its feasibility in the real world.

MCCE: Monte Carlo sampling of realistic counterfactual explanations. (arXiv:2111.09790v2 [stat.ML] UPDATED)

Authors: Annabelle Redelmeier, Martin Jullum, Kjersti Aas, Anders Løland

We introduce MCCE: Monte Carlo sampling of valid and realistic Counterfactual Explanations for tabular data, a novel counterfactual explanation method that generates on-manifold, actionable and valid counterfactuals by modeling the joint distribution of the mutable features given the immutable features and the decision. Unlike other on-manifold methods that tend to rely on variational autoencoders and have strict prediction model and data requirements, MCCE handles any type of prediction model and categorical features with more than two levels. MCCE first models the joint distribution of the features and the decision with an autoregressive generative model where the conditionals are estimated using decision trees. Then, it samples a large set of observations from this model, and finally, it removes the samples that do not obey certain criteria. We compare MCCE with a range of state-of-the-art on-manifold counterfactual methods using four well-known data sets and show that MCCE outperforms these methods on all common performance metrics and speed. In particular, including the decision in the modeling process improves the efficiency of the method substantially.

Risk Measures and Upper Probabilities: Coherence and Stratification. (arXiv:2206.03183v3 [cs.LG] UPDATED)

Authors: Christian Fröhlich, Robert C. Williamson

Machine learning typically presupposes classical probability theory, which implies that aggregation is built upon expectation. There are now multiple reasons to motivate looking at richer alternatives to classical probability theory as a mathematical foundation for machine learning. We systematically examine a powerful and rich class of alternative aggregation functionals, known variously as spectral risk measures, Choquet integrals or Lorentz norms. We present a range of characterization results, and demonstrate what makes this spectral family so special. In doing so we arrive at a natural stratification of all coherent risk measures in terms of the upper probabilities that they induce, by exploiting results from the theory of rearrangement invariant Banach spaces. We empirically demonstrate how this new approach to uncertainty helps tackle practical machine learning problems.

Self-Supervised Training with Autoencoders for Visual Anomaly Detection. (arXiv:2206.11723v7 [cs.CV] UPDATED)

Authors: Alexander Bauer, Shinichi Nakajima, Klaus-Robert Müller

Recently, deep auto-encoders have been used for the task of anomaly detection in the visual domain. The common belief is that, by optimising the reconstruction error on anomaly-free examples, a corresponding network should fail to accurately reconstruct anomalous regions in the application phase. This goal is typically addressed by controlling the capacity of the network, either by reducing the size of the bottleneck layer or by enforcing sparsity constraints on its activations. However, neither of these techniques explicitly penalises the reconstruction of anomalous signals, which often results in poor detection. We tackle this problem by adapting a self-supervised learning regime that allows the use of discriminative information during training while focusing on the data manifold of normal examples. Precisely, we investigate two different training objectives inspired by the task of neural image inpainting. Our main objective regularises the model to produce locally consistent reconstructions while replacing irregularities, therefore acting as a filter that removes anomalous patterns. Our formal analysis shows that under mild conditions the corresponding model resembles a non-linear orthogonal projection of partially corrupted images onto the manifold of uncorrupted (defect-free) examples. This insight makes the reconstruction error a natural choice for defining the anomaly score of a sample according to its distance from a corresponding projection on the data manifold. We emphasise that our approach is very efficient, requiring a single forward pass per input image during both training and prediction. Our experiments on the MVTec AD dataset demonstrate high detection and localisation performance. On the texture subset, in particular, our approach consistently outperforms recent anomaly detection methods by a significant margin.
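
A minimal sketch of the scoring and training ideas, with `model` a stand-in for a trained inpainting-style autoencoder and channel-last images assumed: inference is a single reconstruction pass, while training masks out patches that the network must fill in consistently:

```python
import numpy as np

def anomaly_map(model, image):
    """Anomaly scoring with a reconstruction model trained on normal data
    only: one forward pass reconstructs the image, and the per-pixel
    reconstruction error serves as the anomaly score (its maximum as an
    image-level score)."""
    recon = model(image)                        # single forward pass
    err = np.abs(recon - image).mean(axis=-1)   # per-pixel error (H, W)
    return err, err.max()

def mask_patches(image, patch=16, p=0.3, seed=0):
    """Training-time corruption: randomly erase patches; the network is
    trained so that model(mask_patches(x)) ~= x on normal examples."""
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = image.shape[:2]
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            if rng.random() < p:
                out[i:i + patch, j:j + patch] = 0.0
    return out
```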

Convolutional Persistence Transforms. (arXiv:2208.02107v2 [math.AT] UPDATED)

Authors: Elchanan Solomon, Paul Bendich

In this paper, we consider topological featurizations of data defined over simplicial complexes, like images and labeled graphs, obtained by convolving this data with various filters before computing persistence. Viewing a convolution filter as a local motif, the persistence diagram of the resulting convolution describes the way the motif is distributed across the simplicial complex. This pipeline, which we call convolutional persistence, extends the capacity of topology to observe patterns in such data. Moreover, we prove that (generically speaking) for any two labeled complexes one can find some filter for which they produce different persistence diagrams, so that the collection of all possible convolutional persistence diagrams is an injective invariant. This is proven by showing convolutional persistence to be a special case of another topological invariant, the Persistent Homology Transform. Other advantages of convolutional persistence are improved stability, greater flexibility for data-dependent vectorizations, and reduced computational complexity for certain data types. Additionally, we present a suite of experiments showing that convolutions greatly improve the predictive power of persistence on a host of classification tasks, even if one uses random filters and vectorizes the resulting diagrams by recording only their total persistences.
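
A minimal sketch of the pipeline for images, assuming the gudhi and scipy libraries are available: convolve with a filter, then compute the persistence of the sublevel-set cubical filtration. The random image and filter are illustrative:

```python
import numpy as np
import gudhi
from scipy.ndimage import convolve

def convolutional_persistence(image, filt):
    """Convolve the image with a filter, then compute the persistence
    diagram of the resulting cubical complex (sublevel-set filtration).
    Different filters expose different local motifs; random ones work too."""
    conv = convolve(image.astype(float), filt)
    complex_ = gudhi.CubicalComplex(top_dimensional_cells=conv)
    return complex_.persistence()

rng = np.random.default_rng(0)
image = rng.random((32, 32))
diagram = convolutional_persistence(image, rng.standard_normal((3, 3)))
print(diagram[:3])   # (dimension, (birth, death)) pairs
```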

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds. (arXiv:2210.14051v3 [cs.LG] UPDATED)

Authors: Hao Liang, Zhi-Quan Luo

We study the regret guarantee for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return. By leveraging a key property of the EntRM, the independence property, we establish the risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
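
For reference, the entropic risk measure of a random return $Z$ with risk parameter $\beta \neq 0$ is standardly defined (the paper's exact sign conventions may differ) as

$$\mathrm{EntRM}_{\beta}(Z) = \frac{1}{\beta}\log \mathbb{E}\left[e^{\beta Z}\right],$$

which recovers the risk-neutral objective $\mathbb{E}[Z]$ in the limit $\beta \to 0$ and becomes increasingly risk-sensitive as $|\beta|$ grows, consistent with the $\exp(|\beta| H)$ factor in the bounds below.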

We prove that they both attain $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK})$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, actions, episodes, and the time horizon, respectively. It matches the bound of RSVI2 proposed by Fei et al. (2021), with a novel distributional analysis. To the best of our knowledge, this is the first regret analysis that bridges DRL and RSRL in terms of sample complexity.

Acknowledging the computational inefficiency associated with the model-free DRL algorithm, we propose an alternative DRL algorithm with distribution representation. This approach not only maintains the established regret bounds but also significantly amplifies computational efficiency.

We also prove a tighter minimax lower bound of $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ for the $\beta>0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.

Learning Individual Treatment Effects under Heterogeneous Interference in Networks. (arXiv:2210.14080v2 [cs.LG] UPDATED)

Authors: Ziyu Zhao, Yuqi Bai, Kun Kuang, Ruoxuan Xiong, Fei Wu

Estimates of individual treatment effects from networked observational data are attracting increasing attention these days. One major challenge in network scenarios is the violation of the stable unit treatment value assumption (SUTVA), which assumes that the treatment assignment of a unit does not influence others' outcomes. In network data, due to interference, the outcome of a unit is influenced not only by its treatment (i.e., direct effects) but also by others' treatments (i.e., spillover effects). Furthermore, the influences from other units are always heterogeneous (e.g., friends with similar interests affect a person differently than friends with different interests). In this paper, we focus on the problem of estimating individual treatment effects (both direct and spillover effects) under heterogeneous interference. To address this issue, we propose a novel Dual Weighting Regression (DWR) algorithm by simultaneously learning attention weights that capture the heterogeneous interference and sample weights to eliminate the complex confounding bias in networks. We formulate the entire learning process as a bi-level optimization problem. In theory, we present generalization error bounds for individual treatment effect estimation. Extensive experiments on four benchmark datasets demonstrate that the proposed DWR algorithm outperforms state-of-the-art methods for estimating individual treatment effects under heterogeneous interference.

HyperSound: Generating Implicit Neural Representations of Audio Signals with Hypernetworks. (arXiv:2211.01839v2 [cs.SD] UPDATED)

Authors: Filip Szatkowski, Karol J. Piczak, Przemysław Spurek, Jacek Tabor, Tomasz Trzciński

Implicit neural representations (INRs) are a rapidly growing research field, which provides alternative ways to represent multimedia signals. Recent applications of INRs include image super-resolution, compression of high-dimensional signals, or 3D rendering. However, these solutions usually focus on visual data, and adapting them to the audio domain is not trivial. Moreover, it requires a separately trained model for every data sample. To address this limitation, we propose HyperSound, a meta-learning method leveraging hypernetworks to produce INRs for audio signals unseen at training time. We show that our approach can reconstruct sound waves with quality comparable to other state-of-the-art models.

Transfer Learning for Contextual Multi-armed Bandits. (arXiv:2211.12612v2 [stat.ML] UPDATED)

Authors: Changxiao Cai, T. Tony Cai, Hongzhe Li

Motivated by a range of applications, we study in this paper the problem of transfer learning for nonparametric contextual multi-armed bandits under the covariate shift model, where we have data collected on source bandits before the start of the target bandit learning. The minimax rate of convergence for the cumulative regret is established and a novel transfer learning algorithm that attains the minimax regret is proposed. The results quantify the contribution of the data from the source domains for learning in the target domain in the context of nonparametric contextual multi-armed bandits.

In view of the general impossibility of adaptation to unknown smoothness, we develop a data-driven algorithm that achieves near-optimal statistical guarantees (up to a logarithmic factor) while automatically adapting to the unknown parameters over a large collection of parameter spaces under an additional self-similarity assumption. A simulation study is carried out to illustrate the benefits of utilizing the data from the auxiliary source domains for learning in the target domain.

Machine Learning Systems are Bloated and Vulnerable. (arXiv:2212.09437v3 [cs.SE] UPDATED)

Authors: Huaifeng Zhang, Fahmi Abdulqadir Ahmed, Dyako Fatih, Akayou Kitessa, Mohannad Alhanahnah, Philipp Leitner, Ahmed Ali-Eldin

Today's software is bloated with both code and features that are not used by most users. This bloat is prevalent across the entire software stack, from operating systems and applications to containers. Containers are lightweight virtualization technologies used to package code and dependencies, providing portable, reproducible and isolated environments. For their ease of use, data scientists often utilize machine learning containers to simplify their workflow. However, this convenience comes at a cost: containers are often bloated with unnecessary code and dependencies, resulting in very large sizes. In this paper, we analyze and quantify bloat in machine learning containers. We develop MMLB, a framework for analyzing bloat in software systems, focusing on machine learning containers. MMLB measures the amount of bloat at both the container and package levels, quantifying the sources of bloat. In addition, MMLB integrates with vulnerability analysis tools and performs package dependency analysis to evaluate the impact of bloat on container vulnerabilities. Through experimentation with 15 machine learning containers from TensorFlow, PyTorch, and Nvidia, we show that bloat accounts for up to 80% of machine learning container sizes, increasing container provisioning times by up to 370% and exacerbating vulnerabilities by up to 99%.

GNN-based Passenger Request Prediction. (arXiv:2301.02515v2 [cs.LG] UPDATED)

Authors: Aqsa Ashraf Makhdomi, Iqra Altaf Gillani

Passenger request prediction is essential for operations planning, control, and management in ride-sharing platforms. While the demand prediction problem has been studied extensively, the Origin-Destination (OD) flow prediction of passengers has received less attention from the research community. This paper develops a Graph Neural Network framework with an Attention Mechanism to predict the OD flow of passengers. The proposed framework exploits various linear and non-linear dependencies that arise among requests originating from different locations and captures the repetition patterns and contextual data of each location. Moreover, we determine the optimal size of the grid cell that covers the road network while preserving the complexity and accuracy of the model. Extensive simulations are conducted to examine the characteristics of our proposed approach and its various components. The results show the superior performance of our proposed model compared to the existing baselines.

Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging. (arXiv:2302.01622v3 [eess.IV] UPDATED)

Authors: Soroosh Tayebi Arasteh, Alexander Ziller, Christiane Kuhl, Marcus Makowski, Sven Nebelung, Rickmer Braren, Daniel Rueckert, Daniel Truhn, Georgios Kaissis

Artificial intelligence (AI) models are increasingly used in the medical domain. However, as medical data is highly sensitive, special precautions to ensure its protection are required. The gold standard for privacy preservation is the introduction of differential privacy (DP) to model training. Prior work indicates that DP has negative implications on model accuracy and fairness, which are unacceptable in medicine and represent a main barrier to the widespread use of privacy-preserving techniques. In this work, we evaluated the effect of privacy-preserving training of AI models regarding accuracy and fairness compared to non-private training. For this, we used two datasets: (1) a large dataset (N=193,311) of high quality clinical chest radiographs, and (2) a dataset (N=1,625) of 3D abdominal computed tomography (CT) images, with the task of classifying the presence of pancreatic ductal adenocarcinoma (PDAC). Both were retrospectively collected and manually labeled by experienced radiologists. We then compared non-private deep convolutional neural networks (CNNs) and privacy-preserving (DP) models with respect to privacy-utility trade-offs, measured as area under the receiver operating characteristic curve (AUROC), and privacy-fairness trade-offs, measured as Pearson's r or Statistical Parity Difference. We found that, while the privacy-preserving trainings yielded lower accuracy, they largely did not amplify discrimination against age, sex or co-morbidity. Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
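
For context, differentially private training typically replaces the ordinary gradient step with per-example gradient clipping plus Gaussian noise (DP-SGD). The following minimal sketch illustrates that step on a toy logistic-regression model; it is not the paper's training pipeline, and all hyperparameter values are illustrative:

```python
import numpy as np

# Minimal DP-SGD sketch: clip each example's gradient, sum, add Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = rng.integers(0, 2, size=256).astype(float)
w = np.zeros(10)
clip_norm, noise_mult, lr, batch = 1.0, 1.1, 0.1, 32

for step in range(200):
    idx = rng.choice(len(X), size=batch, replace=False)
    xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-xb @ w))
    per_ex_grads = (p - yb)[:, None] * xb          # [batch, dim] per-example grads
    norms = np.linalg.norm(per_ex_grads, axis=1, keepdims=True)
    clipped = per_ex_grads / np.maximum(1.0, norms / clip_norm)
    noise = rng.normal(scale=noise_mult * clip_norm, size=w.shape)
    w -= lr * (clipped.sum(axis=0) + noise) / batch
```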

RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion. (arXiv:2302.01757v3 [cs.CR] UPDATED)

Authors: Zhuoqun Huang, Neil G. Marchant, Keane Lucas, Lujo Bauer, Olga Ohrimenko, Benjamin I. P. Rubinstein

Randomized smoothing is a leading approach for constructing classifiers that are certifiably robust against adversarial examples. Existing work on randomized smoothing has focused on classifiers with continuous inputs, such as images, where $\ell_p$-norm bounded adversaries are commonly studied. However, there has been limited work for classifiers with discrete or variable-size inputs, such as source code, which require different threat models and smoothing mechanisms. In this work, we adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries. Our proposed smoothing mechanism, randomized deletion (RS-Del), applies random deletion edits, which are (perhaps surprisingly) sufficient to confer robustness against adversarial deletion, insertion and substitution edits. Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences. We present a case study on malware detection--a binary classification problem on byte sequences where classifier evasion is a well-established threat model. When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.
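
The core smoothing mechanism can be sketched in a few lines: classify many independently deleted copies of the input and take a majority vote. The deletion probability, sample count, and stand-in classifier below are illustrative, not the paper's certified configuration:

```python
import random
from collections import Counter

# Sketch of randomized-deletion smoothing: vote over randomly deleted inputs.
# base_classifier is a stand-in for any byte-sequence classifier.
def smoothed_predict(x: bytes, base_classifier, p_del=0.97, n_samples=1000, seed=0):
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        # Delete each byte independently with probability p_del.
        kept = bytes(b for b in x if rng.random() > p_del)
        votes[base_classifier(kept)] += 1
    return votes.most_common(1)[0][0]

# Toy usage with a trivial stand-in classifier.
pred = smoothed_predict(b"\x4d\x5a" * 100, lambda s: int(len(s) > 5))
```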

A Generalized Surface Loss for Reducing the Hausdorff Distance in Medical Imaging Segmentation. (arXiv:2302.03868v3 [eess.IV] UPDATED)

Authors: Adrian Celaya, Beatrice Riviere, David Fuentes

Within medical imaging segmentation, the Dice coefficient and Hausdorff-based metrics are standard measures of success for deep learning models. However, modern loss functions for medical image segmentation often only consider the Dice coefficient or similar region-based metrics during training. As a result, segmentation architectures trained over such loss functions run the risk of achieving high accuracy for the Dice coefficient but low accuracy for Hausdorff-based metrics. Low accuracy on Hausdorff-based metrics can be problematic for applications such as tumor segmentation, where such benchmarks are crucial. For example, high Dice scores accompanied by significant Hausdorff errors could indicate that the predictions fail to detect small tumors. We propose the Generalized Surface Loss function, a novel loss function to minimize Hausdorff-based metrics with more desirable numerical properties than current methods and with weighting terms for class imbalance. Our loss function outperforms other losses when tested on the LiTS and BraTS datasets using the state-of-the-art nnUNet architecture. These results suggest we can improve medical imaging segmentation accuracy with our novel loss function.
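
To make the idea concrete, a distance-weighted segmentation loss can be sketched as below; this is a generic boundary-weighted loss in the same spirit, not the paper's exact Generalized Surface Loss formulation:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Illustrative surface-style loss: weight errors by their distance to the
# ground-truth boundary, so mistakes far from the target surface cost more.
def surface_weighted_loss(pred_prob, gt_mask):
    # Unsigned distance to the ground-truth boundary, computed from both sides.
    dist = distance_transform_edt(gt_mask) + distance_transform_edt(1 - gt_mask)
    err = (pred_prob - gt_mask) ** 2
    return (dist * err).sum() / dist.sum()

gt = np.zeros((64, 64)); gt[20:40, 20:40] = 1
pred = np.clip(gt + 0.1 * np.random.default_rng(0).normal(size=gt.shape), 0, 1)
loss = surface_weighted_loss(pred, gt)
```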

When Can We Track Significant Preference Shifts in Dueling Bandits?. (arXiv:2302.06595v2 [cs.LG] UPDATED)

Authors: Joe Suk, Arpit Agarwal

The $K$-armed dueling bandits problem, where the feedback is in the form of noisy pairwise preferences, has been widely studied due to its applications in information retrieval, recommendation systems, etc. Motivated by concerns that user preferences/tastes can evolve over time, we consider the problem of dueling bandits with distribution shifts. Specifically, we study the recent notion of significant shifts (Suk and Kpotufe, 2022), and ask whether one can design an adaptive algorithm for the dueling problem with $O(\sqrt{K\tilde{L}T})$ dynamic regret, where $\tilde{L}$ is the (unknown) number of significant shifts in preferences. We show that the answer to this question depends on the properties of underlying preference distributions.

Firstly, we give an impossibility result that rules out any algorithm with $O(\sqrt{K\tilde{L}T})$ dynamic regret under the well-studied Condorcet and SST classes of preference distributions. Secondly, we show that $\text{SST} \cap \text{STI}$ is the largest amongst popular classes of preference distributions where it is possible to design such an algorithm. Overall, our results provide an almost complete resolution of the above question for this hierarchy of distribution classes.

Correlation Clustering with Active Learning of Pairwise Similarities. (arXiv:2302.10295v3 [cs.LG] UPDATED)

Authors: Linus Aronsson, Morteza Haghir Chehreghani

Correlation clustering is a well-known unsupervised learning setting that deals with positive and negative pairwise similarities. In this paper, we study the case where the pairwise similarities are not given in advance and must be queried in a cost-efficient way. To this end, we develop a generic active learning framework for this task that benefits from several advantages, e.g., flexibility in the type of feedback that a user/annotator can provide, adaptation to any correlation clustering algorithm and query strategy, and robustness to noise. In addition, we propose and analyze a number of novel query strategies suited to this setting. We demonstrate the effectiveness of our framework and the proposed query strategies via several experimental studies.

Rotation Invariant Quantization for Model Compression. (arXiv:2303.03106v2 [cs.LG] UPDATED)

Authors: Joseph Kampeas, Yury Nahshan, Hanoch Kremer, Gil Lederman, Shira Zaloshinski, Zheng Li, Emir Haleva

Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources. In this study, we investigate the rate-distortion tradeoff for NN model compression. First, we suggest a Rotation-Invariant Quantization (RIQ) technique that utilizes a single parameter to quantize the entire NN model, yielding a different rate at each layer, i.e., mixed-precision quantization. Then, we prove that our rotation-invariant approach is optimal in terms of compression. We rigorously evaluate RIQ and demonstrate its capabilities on various models and tasks. For example, RIQ facilitates $\times 19.4$ and $\times 52.9$ compression ratios on pre-trained VGG dense and pruned models, respectively, with $<0.4\%$ accuracy degradation. Code is available at \url{https://github.com/ehaleva/RIQ}.

Domain Randomization for Robust, Affordable and Effective Closed-loop Control of Soft Robots. (arXiv:2303.04136v2 [cs.RO] UPDATED)

Authors: Gabriele Tiboni, Andrea Protopapa, Tatiana Tommasi, Giuseppe Averta

Soft robots are gaining popularity thanks to their intrinsic safety in contact-rich interactions and their adaptability. However, the potentially infinite number of Degrees of Freedom makes their modeling a daunting task, and in many cases only an approximated description is available. This challenge makes reinforcement learning (RL) based approaches inefficient when deployed in realistic scenarios, due to the large domain gap between models and the real platform. In this work, we demonstrate, for the first time, how Domain Randomization (DR) can solve this problem by enhancing RL policies for soft robots with: i) robustness w.r.t. unknown dynamics parameters; ii) reduced training times by exploiting drastically simpler dynamic models for learning; iii) better environment exploration, which can lead to exploitation of environmental constraints for optimal performance. Moreover, we introduce a novel algorithmic extension to previous adaptive domain randomization methods for the automatic inference of dynamics parameters for deformable objects. We provide an extensive evaluation in simulation on four different tasks and two soft robot designs, opening interesting perspectives for future research on Reinforcement Learning for closed-loop soft robot control.
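
The basic domain-randomization loop is simple to sketch: resample the uncertain dynamics parameters for every training episode so the policy must be robust to the whole range. `make_env` and `train_one_episode` are hypothetical placeholders, and the parameter ranges are illustrative:

```python
import random

# Minimal domain-randomization sketch for uncertain soft-robot dynamics.
PARAM_RANGES = {"stiffness": (50.0, 500.0), "damping": (0.1, 2.0), "mass": (0.05, 0.5)}

def sample_dynamics(rng):
    # Draw one random dynamics configuration from the uncertainty ranges.
    return {k: rng.uniform(*bounds) for k, bounds in PARAM_RANGES.items()}

def domain_randomized_training(make_env, train_one_episode, episodes=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(episodes):
        env = make_env(**sample_dynamics(rng))  # new randomized simulator instance
        train_one_episode(env)
```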

Lipschitz-bounded 1D convolutional neural networks using the Cayley transform and the controllability Gramian. (arXiv:2303.11835v2 [cs.LG] UPDATED)

Authors: Patricia Pauli, Ruigang Wang, Ian R. Manchester, Frank Allgöwer

We establish a layer-wise parameterization for 1D convolutional neural networks (CNNs) with built-in end-to-end robustness guarantees. In doing so, we use the Lipschitz constant of the input-output mapping characterized by a CNN as a robustness measure. We base our parameterization on the Cayley transform that parameterizes orthogonal matrices and on the controllability Gramian of the state space representation of the convolutional layers. The proposed parameterization by design fulfills linear matrix inequalities that are sufficient for Lipschitz continuity of the CNN, which further enables unconstrained training of Lipschitz-bounded 1D CNNs. Finally, we train Lipschitz-bounded 1D CNNs for the classification of heart arrhythmia data and show their improved robustness.
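
The Cayley transform at the heart of this parameterization is easy to verify numerically: any skew-symmetric matrix $A$ yields an orthogonal $W = (I - A)(I + A)^{-1}$, so the free parameter can be trained without constraints. A minimal sketch (matrix size illustrative):

```python
import numpy as np

# Cayley parameterization: skew-symmetric A -> orthogonal W by construction,
# so unconstrained training of A yields norm-preserving layer weights.
rng = np.random.default_rng(0)
n = 8
M = rng.normal(size=(n, n))
A = M - M.T                              # skew-symmetric free parameter
I = np.eye(n)
W = (I - A) @ np.linalg.inv(I + A)       # Cayley transform
print(np.allclose(W.T @ W, I))           # True: W is orthogonal
```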

The effectiveness of MAE pre-pretraining for billion-scale pretraining. (arXiv:2303.13496v3 [cs.CV] UPDATED)

Authors: Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size, making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.7%), ImageNet-ReaL (91.1%), 1-shot ImageNet-1k (63.6%), and zero-shot transfer on Food-101 (96.2%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images, and our models are available publicly.

Structural Group Unfairness: Measurement and Mitigation by means of the Effective Resistance. (arXiv:2305.03223v2 [cs.SI] UPDATED)

Authors: Adrian Arnaiz-Rodriguez, Georgina Curto, Nuria Oliver

Social networks contribute to the distribution of social capital, defined as the relationships, norms of trust and reciprocity within a community or society that facilitate cooperation and collective action. Social capital exists in the relations among individuals, such that better positioned members in a social network benefit from faster access to diverse information and higher influence on information dissemination. A variety of methods have been proposed in the literature to measure social capital at an individual level. However, there is a lack of methods to quantify social capital at a group level, which is particularly important when the groups are defined on the grounds of protected attributes. Furthermore, state-of-the-art approaches fail to model the role of long-range interactions between nodes in the network and their contributions to social capital. To fill this gap, we propose to measure the social capital of a group of nodes by means of their information flow and emphasize the importance of considering the whole network topology. Grounded in spectral graph theory, we introduce three effective resistance-based measures of group social capital, namely group isolation, group diameter and group control. We denote the social capital disparity among different groups in a network as structural group unfairness, and propose to mitigate it by means of a budgeted edge augmentation heuristic that systematically increases the social capital of the most disadvantaged group. In experiments on real networks, we uncover significant levels of structural group unfairness when using gender as the protected attribute, with females being the most disadvantaged group in comparison to males. We also illustrate how our proposed edge augmentation approach is able to not only effectively mitigate the structural group unfairness but also increase the social capital of all groups in the network.
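
For readers unfamiliar with effective resistance, it can be computed from the pseudoinverse of the graph Laplacian; the sketch below shows the standard pairwise formula, while the paper's group-level measures (isolation, diameter, control) build further on it:

```python
import numpy as np

# Effective resistance from the Laplacian pseudoinverse:
# R(u, v) = L+[u,u] + L+[v,v] - 2 * L+[u,v].
def effective_resistance(adj):
    L = np.diag(adj.sum(axis=1)) - adj
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp   # pairwise resistance matrix

# Path graph on 4 nodes: resistance between the two endpoints is 3.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
R = effective_resistance(A)
print(round(R[0, 3], 6))  # 3.0
```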

Diffusion Language Models Generation Can Be Halted Early. (arXiv:2305.10818v3 [cs.LG] UPDATED)

Authors: Sofia Maria Lo Cicero Vaina, Nikita Balagansky, Daniil Gavrilov

Diffusion Language models (DLMs) are a promising avenue for text generation due to their tractable and controllable generation. They also have the advantage of not having to predict text autoregressively. However, despite these notable features, DLMs have not yet reached the performance levels of their autoregressive counterparts. One way to reduce the performance gap between these two types of language models is to speed up the generation of DLMs. Therefore, in this work we propose a novel methodology to address this issue. It enables the execution of more generation steps within a given time frame, leading to higher-quality outputs. Specifically, our methods estimate the completeness of a DLM's text generation and allow adaptive halting of the generation process. We evaluate our methods on Plaid, SSD, and CDCD DLMs and create a cohesive perspective on their generation workflows. Finally, we confirm that our methods allow halting these models and decrease the generation time by $10$-$40$\% without a drop in the quality of model samples.

Mitigating Label Noise through Data Ambiguation. (arXiv:2305.13764v2 [cs.LG] UPDATED)

Authors: Julian Lienen, Eyke Hüllermeier

Label noise poses an important challenge in machine learning, especially in deep learning, in which large models with high expressive power dominate the field. Models of that kind are prone to memorizing incorrect labels, thereby harming generalization performance. Many methods have been proposed to address this problem, including robust loss functions and more complex label correction approaches. Robust loss functions are appealing due to their simplicity, but typically lack flexibility, while label correction usually adds substantial complexity to the training setup. In this paper, we propose to address the shortcomings of both methodologies by "ambiguating" the target information, adding additional, complementary candidate labels in case the learner is not sufficiently convinced of the observed training label. More precisely, we leverage the framework of so-called superset learning to construct set-valued targets based on a confidence threshold, which deliver imprecise yet more reliable beliefs about the ground-truth, effectively helping the learner to suppress the memorization effect. In an extensive empirical evaluation, our method demonstrates favorable learning behavior on synthetic and real-world noise, confirming its effectiveness in detecting and correcting erroneous training labels.
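
A minimal sketch of the ambiguation step (the threshold and candidate rule are illustrative, not the paper's exact superset-learning construction): when confidence in the observed label is low, the target is widened to a set of plausible labels:

```python
import numpy as np

# Illustrative ambiguation: replace a possibly noisy single label with a
# set-valued target when the model is not sufficiently convinced of it.
def ambiguate(probs, observed_label, threshold=0.5):
    if probs[observed_label] >= threshold:
        return {observed_label}                     # keep the precise target
    # Otherwise return all sufficiently plausible labels (superset target).
    candidates = {c for c, p in enumerate(probs) if p >= threshold * probs.max()}
    return candidates | {observed_label}

probs = np.array([0.05, 0.30, 0.60, 0.05])          # model beliefs over 4 classes
print(ambiguate(probs, observed_label=1))           # {1, 2}
```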

Point2SSM: Learning Morphological Variations of Anatomies from Point Cloud. (arXiv:2305.14486v2 [cs.CV] UPDATED)

Authors: Jadie Adams, Shireen Elhabian

We present Point2SSM, a novel unsupervised learning approach for constructing correspondence-based statistical shape models (SSMs) directly from raw point clouds. SSM is crucial in clinical research, enabling population-level analysis of morphological variation in bones and organs. Traditional methods of SSM construction have limitations, including the requirement of noise-free surface meshes or binary volumes, reliance on assumptions or templates, and prolonged inference times due to simultaneous optimization of the entire cohort. Point2SSM overcomes these barriers by providing a data-driven solution that infers SSMs directly from raw point clouds, reducing inference burdens and increasing applicability as point clouds are more easily acquired. While deep learning on 3D point clouds has seen success in unsupervised representation learning and shape correspondence, its application to anatomical SSM construction is largely unexplored. We conduct a benchmark of state-of-the-art point cloud deep networks on the SSM task, revealing their limited robustness to clinical challenges such as noisy, sparse, or incomplete input and limited training data. Point2SSM addresses these issues through an attention-based module, providing effective correspondence mappings from learned point features. Our results demonstrate that the proposed method significantly outperforms existing networks in terms of accurate surface sampling and correspondence, better capturing population-level statistics.

Successor-Predecessor Intrinsic Exploration. (arXiv:2305.15277v3 [cs.LG] UPDATED)

Authors: Changmin Yu, Neil Burgess, Maneesh Sahani, Samuel J. Gershman

Exploration is essential in reinforcement learning, particularly in environments where external rewards are sparse. Here we focus on exploration with intrinsic rewards, where the agent transiently augments the external rewards with self-generated intrinsic rewards. Although the study of intrinsic rewards has a long history, existing methods focus on composing the intrinsic reward based on measures of future prospects of states, ignoring the information contained in the retrospective structure of transition sequences. Here we argue that the agent can utilise retrospective information to generate explorative behaviour with structure-awareness, facilitating efficient exploration based on global instead of local information. We propose Successor-Predecessor Intrinsic Exploration (SPIE), an exploration algorithm based on a novel intrinsic reward combining prospective and retrospective information. We show that SPIE yields more efficient and ethologically plausible exploratory behaviour in environments with sparse rewards and bottleneck states than competing methods. We also implement SPIE in deep reinforcement learning agents, and show that the resulting agent achieves stronger empirical performance than existing methods on sparse-reward Atari games.

Context selectivity with dynamic availability enables lifelong continual learning. (arXiv:2306.01690v2 [cs.LG] UPDATED)

Authors: Martin Barry, Wulfram Gerstner, Guillaume Bellec

"You never forget how to ride a bike", -- but how is that possible? The brain is able to learn complex skills, stop the practice for years, learn other skills in between, and still retrieve the original knowledge when necessary. The mechanisms of this capability, referred to as lifelong learning (or continual learning, CL), are unknown. We suggest a bio-plausible meta-plasticity rule building on classical work in CL which we summarize in two principles: (i) neurons are context selective, and (ii) a local availability variable partially freezes the plasticity if the neuron was relevant for previous tasks. In a new neuro-centric formalization of these principles, we suggest that neuron selectivity and neuron-wide consolidation is a simple and viable meta-plasticity hypothesis to enable CL in the brain. In simulation, this simple model balances forgetting and consolidation leading to better transfer learning than contemporary CL algorithms on image recognition and natural language processing CL benchmarks.

Pure Exploration in Bandits with Linear Constraints. (arXiv:2306.12774v4 [cs.LG] UPDATED)

Authors: Emil Carlsson, Debabrota Basu, Fredrik D. Johansson, Devdatt Dubhashi

We address the problem of identifying the optimal policy with a fixed confidence level in a multi-armed bandit setup, when \emph{the arms are subject to linear constraints}. Unlike the standard best-arm identification problem, which is well studied, the optimal policy in this case may not be deterministic and could mix between several arms. This changes the geometry of the problem, which we characterize via an information-theoretic lower bound. We introduce two asymptotically optimal algorithms for this setting, one based on the Track-and-Stop method and the other based on a game-theoretic approach. Both algorithms try to track an optimal allocation that is based on the lower bound and computed by a weighted projection onto the boundary of a normal cone. Finally, we provide empirical results that validate our bounds and visualize how constraints change the hardness of the problem.

Adversarial Resilience in Sequential Prediction via Abstention. (arXiv:2306.13119v2 [cs.LG] UPDATED)

Authors: Surbhi Goel, Steve Hanneke, Shay Moran, Abhishek Shetty

We study the problem of sequential prediction in the stochastic setting with an adversary that is allowed to inject clean-label adversarial (or out-of-distribution) examples. Algorithms designed to handle purely stochastic data tend to fail in the presence of such adversarial examples, often leading to erroneous predictions. This is undesirable in many high-stakes applications such as medical recommendations, where abstaining from predictions on adversarial examples is preferable to misclassification. On the other hand, assuming fully adversarial data leads to very pessimistic bounds that are often vacuous in practice.

To capture this motivation, we propose a new model of sequential prediction that sits between the purely stochastic and fully adversarial settings by allowing the learner to abstain from making a prediction at no cost on adversarial examples. Assuming access to the marginal distribution on the non-adversarial examples, we design a learner whose error scales with the VC dimension (mirroring the stochastic setting) of the hypothesis class, as opposed to the Littlestone dimension which characterizes the fully adversarial setting. Furthermore, we design a learner for VC dimension~1 classes, which works even in the absence of access to the marginal distribution. Our key technical contribution is a novel measure for quantifying uncertainty for learning VC classes, which may be of independent interest.

Realistic Synthetic Financial Transactions for Anti-Money Laundering Models. (arXiv:2306.16424v3 [cs.AI] UPDATED)

Authors: Erik Altman, Jovan Blanuša, Luc von Niederhäusern, Béni Egressy, Andreea Anghel, Kubilay Atasu

With the widespread digitization of finance and the increasing popularity of cryptocurrencies, the sophistication of fraud schemes devised by cybercriminals is growing. Money laundering -- the movement of illicit funds to conceal their origins -- can cross bank and national boundaries, producing complex transaction patterns. The UN estimates that 2-5\% of global GDP, or \$0.8-\$2.0 trillion, is laundered globally each year. Unfortunately, real data to train machine learning models to detect laundering is generally not available, and previous synthetic data generators have had significant shortcomings. A realistic, standardized, publicly-available benchmark is needed for comparing models and for the advancement of the area.

To this end, this paper contributes a synthetic financial transaction dataset generator and a set of synthetically generated AML (Anti-Money Laundering) datasets. We have calibrated this agent-based generator to match real transactions as closely as possible and made the datasets public. We describe the generator in detail and demonstrate how the datasets generated can help compare different machine learning models in terms of their AML abilities. In a key way, using synthetic data in these comparisons can be even better than using real data: the ground truth labels are complete, whilst many laundering transactions in real data are never detected.

What do self-supervised speech models know about words?. (arXiv:2307.00162v2 [cs.CL] UPDATED)

Authors: Ankita Pasad, Chung-Ming Chien, Shane Settle, Karen Livescu

Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a proper understanding of knowledge encoded at the word level and beyond. In this work, we use lightweight analysis methods to study segment-level linguistic properties -- word identity, boundaries, pronunciation, syntactic features, and semantic features -- encoded in S3Ms. We present a comparative study of layer-wise representations from ten S3Ms and find that (i) the frame-level representations within each word segment are not all equally informative, and (ii) the pre-training objective and model size heavily influence the accessibility and distribution of linguistic information across layers. We also find that on several tasks -- word discrimination, word segmentation, and semantic sentence similarity -- S3Ms trained with visual grounding outperform their speech-only counterparts. Finally, our task-based analyses demonstrate an improved performance on word segmentation and acoustic word discrimination while using simpler methods than prior work.

Variational quantum regression algorithm with encoded data structure. (arXiv:2307.03334v3 [quant-ph] UPDATED)

Authors: C.-C. Joseph Wang, Ryan S. Bennink

Hybrid variational quantum algorithms (VQAs) are promising for solving practical problems such as combinatorial optimization, quantum chemistry simulation, quantum machine learning, and quantum error correction on noisy quantum computers. However, with a typical random ansatz or the quantum alternating operator ansatz, the derived variational quantum algorithms become a black box for model interpretation. In this paper we construct a quantum regression algorithm wherein the quantum state directly encodes the classical data table and the variational parameters correspond directly to the regression coefficients, which are real numbers by construction, providing a high degree of model interpretability and minimal cost to optimize with the right expressiveness. Instead of taking the state preparation for granted, we discuss state preparation with different encoders, their time complexity, and the overall resource cost. We can take advantage of the encoded data structure to cut down the algorithm time complexity. To the best of our knowledge, we show explicitly, for the first time, how the structure of the classical data can be exploited directly through quantum subroutines by construction. For nonlinear regression, our algorithm can be extended by building nonlinear features into the training data, as demonstrated by numerical results. In addition, we demonstrate that model trainability is achievable only when the number of features $M$ is much less than the number of records $L$ for the encoded data structure, which justifies $L\gg M$ in our resource estimation.

DyEdgeGAT: Dynamic Edge via Graph Attention for Early Fault Detection in IIoT Systems. (arXiv:2307.03761v3 [cs.LG] UPDATED)

Authors: Mengjie Zhao, Olga Fink

In the Industrial Internet of Things (IIoT), condition monitoring sensor signals from complex systems often exhibit nonlinear and stochastic spatial-temporal dynamics under varying conditions. These complex dynamics make fault detection particularly challenging. While previous methods effectively model these dynamics, they often neglect the evolution of relationships between sensor signals. Undetected shifts in these relationships can lead to significant system failures. Furthermore, these methods frequently misidentify novel operating conditions as faults. Addressing these limitations, we propose DyEdgeGAT (Dynamic Edge via Graph Attention), a novel approach for early-stage fault detection in IIoT systems. DyEdgeGAT's primary innovation lies in a novel graph inference scheme for multivariate time series that tracks the evolution of relationships between time series, enabled by dynamic edge construction. Another key innovation of DyEdgeGAT is its ability to incorporate operating condition contexts into node dynamics modeling, enhancing its accuracy and robustness. We rigorously evaluated DyEdgeGAT using both a synthetic dataset, simulating varying levels of fault severity, and a real-world industrial-scale multiphase flow facility benchmark with diverse fault types under varying operating conditions and detection complexities. The results show that DyEdgeGAT significantly outperforms other baseline methods in fault detection, particularly in the early stages with low severity, and exhibits robust performance under novel operating conditions.

Grounded Object Centric Learning. (arXiv:2307.09437v2 [cs.LG] UPDATED)

Authors: Avinash Kori, Francesco Locatello, Fabio De Sousa Ribeiro, Francesca Toni, Ben Glocker

The extraction of modular object-centric representations for downstream tasks is an emerging area of research. Learning grounded representations of objects that are guaranteed to be stable and invariant promises robust performance across different tasks and environments. Slot Attention (SA) learns object-centric representations by assigning objects to \textit{slots}, but presupposes a \textit{single} distribution from which all slots are randomly initialised. This results in an inability to learn \textit{specialized} slots which bind to specific object types and remain invariant to identity-preserving changes in object appearance. To address this, we present \emph{\textsc{Co}nditional \textsc{S}lot \textsc{A}ttention} (\textsc{CoSA}) using a novel concept of \emph{Grounded Slot Dictionary} (GSD) inspired by vector quantization. Our proposed GSD comprises (i) canonical object-level property vectors and (ii) parametric Gaussian distributions, which define a prior over the slots. We demonstrate the benefits of our method in multiple downstream tasks such as scene generation, composition, and task adaptation, whilst remaining competitive with SA in popular object discovery benchmarks.

Variational Autoencoding of Dental Point Clouds. (arXiv:2307.10895v2 [cs.CV] UPDATED)

Authors: Johan Ziruo Ye, Thomas Ørkild, Peter Lempel Søndergaard, Søren Hauberg

Digital dentistry has made significant advancements, yet numerous challenges remain. This paper introduces the FDI 16 dataset, an extensive collection of tooth meshes and point clouds. Additionally, we present a novel approach: Variational FoldingNet (VF-Net), a fully probabilistic variational autoencoder designed for point clouds. Notably, prior latent variable models for point clouds lack a one-to-one correspondence between input and output points. Instead, they rely on optimizing the Chamfer distance, a metric that lacks a normalized distributional counterpart, rendering it unsuitable for probabilistic modeling. We replace the explicit minimization of Chamfer distances with a suitable encoder, increasing computational efficiency while simplifying the probabilistic extension. This allows for straightforward application in various tasks, including mesh generation, shape completion, and representation learning. Empirically, we provide evidence of lower reconstruction error in dental reconstruction and interpolation, showcasing state-of-the-art performance in dental sample generation while identifying valuable latent representations.
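
For reference, the Chamfer distance that VF-Net avoids optimizing explicitly is the symmetric average nearest-neighbour distance between two point clouds; a minimal sketch:

```python
import numpy as np

# Chamfer distance: average nearest-neighbour squared distance, both directions.
def chamfer(P, Q):
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # pairwise squared dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(0)
P, Q = rng.normal(size=(128, 3)), rng.normal(size=(256, 3))
print(chamfer(P, Q))
```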

Multi-Objective Optimization for Sparse Deep Multi-Task Learning. (arXiv:2308.12243v3 [cs.LG] UPDATED)

Authors: S. S. Hotegni, M. Berkemeier, S. Peitz

Different conflicting optimization criteria arise naturally in various Deep Learning scenarios. These can address different main tasks (i.e., in the setting of Multi-Task Learning), but also main and secondary tasks such as loss minimization versus sparsity. The usual approach is a simple weighting of the criteria, which formally only works in the convex setting. In this paper, we present a Multi-Objective Optimization algorithm using a modified Weighted Chebyshev scalarization for training Deep Neural Networks (DNNs) with respect to several tasks. By employing this scalarization technique, the algorithm can identify all optimal solutions of the original problem while reducing its complexity to a sequence of single-objective problems. The simplified problems are then solved using an Augmented Lagrangian method, enabling the use of popular optimization techniques such as Adam and Stochastic Gradient Descent, while efficaciously handling constraints. Our work aims to address the (economical and also ecological) sustainability issue of DNN models, with a particular focus on Deep Multi-Task models, which are typically designed with a very large number of weights to perform equally well on multiple tasks. Through experiments conducted on two Machine Learning datasets, we demonstrate the possibility of adaptively sparsifying the model during training without significantly impacting its performance, if we are willing to apply task-specific adaptations to the network weights. The code is available at https://github.com/salomonhotegni/MDMTN
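
The weighted Chebyshev scalarization itself is compact: a vector of objectives is reduced to the worst weighted deviation from an ideal (utopia) point, and sweeping the weights traces out Pareto-optimal trade-offs. A minimal sketch with illustrative values:

```python
import numpy as np

# Weighted Chebyshev scalarization of a multi-objective problem.
def weighted_chebyshev(objectives, weights, utopia):
    return np.max(weights * np.abs(objectives - utopia))

losses = np.array([0.42, 0.13])          # e.g. task loss and sparsity penalty
utopia = np.array([0.0, 0.0])            # ideal value for each objective
for w in ([0.8, 0.2], [0.5, 0.5], [0.2, 0.8]):
    print(w, weighted_chebyshev(losses, np.array(w), utopia))
```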

Heterogeneous Federated Learning via Personalized Generative Networks. (arXiv:2308.13265v2 [cs.LG] UPDATED)

Authors: Zahra Taghiyarrenani, Abdallah Alabdallah, Slawomir Nowaczyk, Sepideh Pashami

Federated Learning (FL) allows several clients to construct a common global machine-learning model without having to share their data. FL, however, faces the challenge of statistical heterogeneity between clients' data, which degrades performance and slows down convergence toward the global model. In this paper, we provide theoretical proof that minimizing heterogeneity between clients facilitates the convergence of a global model for every single client. This becomes particularly important under empirical concept shifts among clients, rather than merely considering imbalanced classes, which have been studied until now. Therefore, we propose a method for knowledge transfer between clients in which the server trains client-specific generators. Each generator generates samples for the corresponding client to remove the conflict with other clients' models. Experiments conducted on synthetic and real data, along with a theoretical study, support the effectiveness of our method in constructing a well-generalizable global model by reducing the conflict between local models.

Towards Generalizable Neural Solvers for Vehicle Routing Problems via Ensemble with Transferrable Local Policy. (arXiv:2308.14104v2 [cs.LG] UPDATED)

Authors: Chengrui Gao, Haopu Shang, Ke Xue, Dong Li, Chao Qian

Machine learning has been adapted to help solve NP-hard combinatorial optimization problems. One prevalent way is learning to construct solutions by deep neural networks, which has been receiving more and more attention due to the high efficiency and less requirement for expert knowledge. However, many neural construction methods for Vehicle Routing Problems (VRPs) focus on synthetic problem instances with specified node distributions and limited scales, leading to poor performance on real-world problems which usually involve complex and unknown node distributions together with large scales. To make neural VRP solvers more practical, we design an auxiliary policy that learns from the local transferable topological features, named local policy, and integrate it with a typical construction policy (which learns from the global information of VRP instances) to form an ensemble policy. With joint training, the aggregated policies perform cooperatively and complementarily to boost generalization. The experimental results on two well-known benchmarks, TSPLIB and CVRPLIB, of travelling salesman problem and capacitated VRP show that the ensemble policy significantly improves both cross-distribution and cross-scale generalization performance, and even performs well on real-world problems with several thousand nodes.

Temporal Inductive Path Neural Network for Temporal Knowledge Graph Reasoning. (arXiv:2309.03251v3 [cs.AI] UPDATED)

Authors: Hao Dong, Pengyang Wang, Meng Xiao, Zhiyuan Ning, Pengfei Wang, Yuanchun Zhou

Temporal Knowledge Graph (TKG) is an extension of traditional Knowledge Graph (KG) that incorporates the dimension of time. Reasoning on TKGs is a crucial task that aims to predict future facts based on historical occurrences. The key challenge lies in uncovering structural dependencies within historical subgraphs and temporal patterns. Most existing approaches model TKGs relying on entity modeling, as nodes in the graph play a crucial role in knowledge representation. However, the real-world scenario often involves an extensive number of entities, with new entities emerging over time. This makes it challenging for entity-dependent methods to cope with extensive volumes of entities, and effectively handling newly emerging entities also becomes a significant challenge. Therefore, we propose Temporal Inductive Path Neural Network (TiPNN), which models historical information in an entity-independent perspective. Specifically, TiPNN adopts a unified graph, namely history temporal graph, to comprehensively capture and encapsulate information from history. Subsequently, we utilize the defined query-aware temporal paths on a history temporal graph to model historical path information related to queries for reasoning. Extensive experiments illustrate that the proposed model not only attains significant performance enhancements but also handles inductive settings, while additionally facilitating the provision of reasoning evidence through history temporal graphs.

PRISM: Leveraging Prototype Patient Representations with Feature-Missing-Aware Calibration for EHR Data Sparsity Mitigation. (arXiv:2309.04160v3 [cs.LG] UPDATED)

Authors: Yinghao Zhu, Zixiang Wang, Long He, Shiyun Xie, Liantao Ma, Chengwei Pan

Electronic Health Record (EHR) data, while rich in information, often suffers from sparsity, posing significant challenges in predictive modeling. Traditional imputation methods inadequately distinguish between real and imputed data, leading to potential inaccuracies in models. Addressing this, we introduce PRISM, a novel approach that indirectly imputes data through prototype representations of similar patients, thus ensuring denser and more accurate embeddings. PRISM innovates further with a feature confidence learner module, which evaluates the reliability of each feature in light of missing data. Additionally, it incorporates a novel patient similarity metric that accounts for feature confidence, avoiding overreliance on imprecise imputed values. Our extensive experiments on the MIMIC-III and MIMIC-IV datasets demonstrate PRISM's superior performance in predicting in-hospital mortality and 30-day readmission tasks, showcasing its effectiveness in handling EHR data sparsity. For the sake of reproducibility and further research, we have made the code publicly available at https://github.com/yhzhu99/PRISM.

Online Infinite-Dimensional Regression: Learning Linear Operators. (arXiv:2309.06548v3 [stat.ML] UPDATED)

Authors: Vinod Raman, Unique Subedi, Ambuj Tewari

We consider the problem of learning linear operators under squared loss between two infinite-dimensional Hilbert spaces in the online setting. We show that the class of linear operators with uniformly bounded $p$-Schatten norm is online learnable for any $p \in [1, \infty)$. On the other hand, we prove an impossibility result by showing that the class of uniformly bounded linear operators with respect to the operator norm is \textit{not} online learnable. Moreover, we show a separation between sequential uniform convergence and online learnability by identifying a class of bounded linear operators that is online learnable but uniform convergence does not hold. Finally, we prove that the impossibility result and the separation between uniform convergence and learnability also hold in the batch setting.

A Strong and Simple Deep Learning Baseline for BCI MI Decoding. (arXiv:2309.07159v2 [eess.SP] UPDATED)

Authors: Yassine El Ouahidi, Vincent Gripon, Bastien Pasdeloup, Ghaith Bouallegue, Nicolas Farrugia, Giulia Lioi

We propose EEG-SimpleConv, a straightforward 1D convolutional neural network for Motor Imagery decoding in BCI. Our main motivation is to propose a simple and performant baseline to compare against, using only very standard ingredients from the literature. We evaluate its performance on four EEG Motor Imagery datasets, including simulated online setups, and compare it to recent Deep Learning and Machine Learning approaches. EEG-SimpleConv is at least as accurate as other approaches, often far more efficient, and shows strong knowledge-transfer capabilities across subjects, all with a low inference time. We advocate that using off-the-shelf ingredients rather than devising ad-hoc solutions can significantly help the adoption of Deep Learning approaches for BCI. We make the code of the models and the experiments accessible.
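
In the same spirit of simplicity, a plain 1D CNN baseline for EEG can be written in a few lines; the layer sizes below are illustrative and are not the published EEG-SimpleConv configuration:

```python
import torch
import torch.nn as nn

# A deliberately plain 1D CNN over the time axis of multi-channel EEG:
# stacked Conv1d/BatchNorm/ReLU blocks, pooling, and a linear classifier.
def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=9, padding=4),
                         nn.BatchNorm1d(c_out), nn.ReLU(), nn.MaxPool1d(2))

model = nn.Sequential(
    conv_block(22, 64),            # 22 EEG channels in, e.g. BCI IV-2a
    conv_block(64, 128),
    conv_block(128, 128),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(128, 4),             # 4 motor-imagery classes
)
logits = model(torch.randn(8, 22, 1000))   # batch of 8 trials, 1000 time samples
```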

MIML: Multiplex Image Machine Learning for High Precision Cell Classification via Mechanical Traits within Microfluidic Systems. (arXiv:2309.08421v2 [eess.IV] UPDATED)

Authors: Khayrul Islam, Ratul Paul, Shen Wang, Yaling Liu

Label-free cell classification is advantageous for supplying pristine cells for further use or examination, yet existing techniques frequently fall short in terms of specificity and speed. In this study, we address these limitations through the development of a novel machine learning framework, Multiplex Image Machine Learning (MIML). This architecture uniquely combines label-free cell images with biomechanical property data, harnessing the vast, often underutilized morphological information intrinsic to each cell. By integrating both types of data, our model offers a more holistic understanding of the cellular properties, utilizing morphological information typically discarded in traditional machine learning models. This approach has led to a remarkable 98.3\% accuracy in cell classification, a substantial improvement over models that only consider a single data type. MIML has been proven effective in classifying white blood cells and tumor cells, with potential for broader application due to its inherent flexibility and transfer learning capability. It's particularly effective for cells with similar morphology but distinct biomechanical properties. This innovative approach has significant implications across various fields, from advancing disease diagnostics to understanding cellular behavior.

Secure and Effective Data Appraisal for Machine Learning. (arXiv:2310.02373v3 [cs.LG] UPDATED)

Authors: Xu Ouyang, Changhong Yang, Felix Xiaozhu Lin, Yangfeng Ji

Essential for an unfettered data market is the ability to discreetly select and evaluate training data before finalizing a transaction between the data owner and model owner. To safeguard the privacy of both data and model, this process involves scrutinizing the target model through Multi-Party Computation (MPC). While prior research has posited that the MPC-based evaluation of Transformer models is excessively resource-intensive, this paper introduces an innovative approach that renders data selection practical. The contributions of this study encompass three pivotal elements: (1) a groundbreaking pipeline for confidential data selection using MPC, (2) replicating intricate high-dimensional operations with simplified low-dimensional MLPs trained on a limited subset of pertinent data, and (3) implementing MPC in a concurrent, multi-phase manner. The proposed method is assessed across an array of Transformer models and NLP/CV benchmarks. In comparison to the direct MPC-based evaluation of the target model, our approach substantially reduces the time required, from thousands of hours to mere tens of hours, with only a nominal 0.20% dip in accuracy when training with the selected data.

Facial Action Unit Detection Based on Multi-task Learning Strategy for Unlabeled Facial Images in the Wild. (arXiv:2310.05207v3 [cs.CV] UPDATED)

Authors: Ziqiao Shang, Bin Liu

Facial Action Unit (AU) detection has in recent years relied on costly accurate labeling or inaccurate pseudo-labeling techniques. How to introduce large amounts of unlabeled facial images in the wild into supervised AU detection frameworks has become a challenging problem. Additionally, nearly every type of AU has the problem of unbalanced positive and negative samples. Inspired by other multi-task learning frameworks, we first propose a multi-task learning strategy that boosts AU detection in the wild by jointly performing facial landmark detection and AU domain separation and reconstruction. Our dual-domain facial landmark detection framework compensates for the lack of accurate facial landmark coordinates during the AU domain separation and reconstruction training process, while the parameters of the homostructural facial extraction modules of these two similar facial tasks are shared. Moreover, we propose a pixel-level feature alignment scheme to maintain the consistency of features obtained from the two separation and reconstruction processes. Furthermore, a weighted asymmetric loss is proposed to change the contribution of positive and negative samples of each type of AU to the updating of model parameters. Experimental results on three widely used benchmarks demonstrate our superiority to most state-of-the-art methods for AU detection.

Domain-invariant Clinical Representation Learning by Bridging Data Distribution Shift across EMR Datasets. (arXiv:2310.07799v2 [cs.LG] UPDATED)

Authors: Zhongji Zhang, Yuhang Wang, Yinghao Zhu, Xinyu Ma, Tianlong Wang, Chaohe Zhang, Yasha Wang, Liantao Ma

Due to limited information about emerging diseases, symptoms are hard to notice and recognize, so the window for clinical intervention may be missed. An effective prognostic model is expected to assist doctors in making the right diagnosis and designing personalized treatment plans, so as to promptly prevent unfavorable outcomes. However, in the early stage of a disease, limited data collection and clinical experience, together with privacy and ethical concerns, may result in restricted data availability for reference, to the extent that even data labels are difficult to mark correctly. In addition, Electronic Medical Record (EMR) data of different diseases, or from different sources for the same disease, often exhibit serious cross-dataset feature misalignment, greatly reducing the effectiveness of deep learning models. This article introduces a domain-invariant representation learning method to build a transition model from a source dataset to a target dataset. By constraining the distribution shift of features generated in disparate domains, we capture domain-invariant features that are exclusively relevant to downstream tasks, so as to cultivate a unified domain-invariant encoder across various task domains and achieve better feature representation. Experimental results on several target tasks demonstrate that our proposed model outperforms competing baseline methods and converges faster during training, especially when dealing with limited data. Extensive experiments confirm the efficacy of our method in providing more accurate predictions for newly emergent pandemics and other diseases.

Benchmarking the Sim-to-Real Gap in Cloth Manipulation. (arXiv:2310.09543v2 [cs.RO] UPDATED)

Authors: David Blanco-Mulero, Oriol Barbany, Gokhan Alcan, Adrià Colomé, Carme Torras, Ville Kyrki

Realistic physics engines play a crucial role for learning to manipulate deformable objects such as garments in simulation. By doing so, researchers can circumvent challenges such as sensing the deformation of the object in the real world. In spite of the extensive use of simulations for this task, few works have evaluated the reality gap between deformable object simulators and real-world data. We present a benchmark dataset to evaluate the sim-to-real gap in cloth manipulation. The dataset is collected by performing a dynamic as well as a quasi-static cloth manipulation task involving contact with a rigid table. We use the dataset to evaluate the reality gap, computational time, and simulation stability of four popular deformable object simulators: MuJoCo, Bullet, Flex, and SOFA. Additionally, we discuss the benefits and drawbacks of each simulator. The benchmark dataset is open-source. Supplementary material, videos, and code can be found at https://sites.google.com/view/cloth-sim2real-benchmark.

A Survey on Trustworthy Edge Intelligence: From Security and Reliability To Transparency and Sustainability. (arXiv:2310.17944v2 [cs.LG] UPDATED)

Authors: Xiaojie Wang, Beibei Wang, Yu Wu, Zhaolong Ning, Song Guo, Fei Richard Yu

Edge Intelligence (EI) integrates Edge Computing (EC) and Artificial Intelligence (AI) to push the capabilities of AI to the network edge for real-time, efficient and secure intelligent decision-making and computation. However, EI faces various challenges due to resource constraints, heterogeneous network environments, and diverse service requirements of different applications, which together affect the trustworthiness of EI in the eyes of stakeholders. This survey comprehensively summarizes the characteristics, architecture, technologies, and solutions of trustworthy EI. Specifically, we first emphasize the need for trustworthy EI in the context of the trend toward large models. We then provide an initial definition of trustworthy EI, explore its key characteristics and give a multi-layered architecture for trustworthy EI. Then, we summarize several important issues that hinder the achievement of trustworthy EI. Subsequently, we present enabling technologies for trustworthy EI systems and provide an in-depth literature review of the state-of-the-art solutions for realizing the trustworthiness of EI. Finally, we discuss the corresponding research challenges and open issues.

Leveraging sinusoidal representation networks to predict fMRI signals from EEG. (arXiv:2311.04234v2 [eess.SP] UPDATED)

Authors: Yamin Li, Ange Lou, Ziyuan Xu, Shiyu Wang, Catie Chang

In modern neuroscience, functional magnetic resonance imaging (fMRI) has been a crucial and irreplaceable tool that provides a non-invasive window into the dynamics of whole-brain activity. Nevertheless, fMRI is limited by hemodynamic blurring as well as high cost, immobility, and incompatibility with metal implants. Electroencephalography (EEG) is complementary to fMRI and can directly record the cortical electrical activity at high temporal resolution, but has more limited spatial resolution and is unable to recover information about deep subcortical brain structures. The ability to obtain fMRI information from EEG would enable cost-effective imaging across a wider set of brain regions. Further, beyond augmenting the capabilities of EEG, cross-modality models would facilitate the interpretation of fMRI signals. However, as both EEG and fMRI are high-dimensional and prone to artifacts, it is currently challenging to model fMRI from EEG. To address this challenge, we propose a novel architecture that can predict fMRI signals directly from multi-channel EEG without explicit feature engineering. Our model achieves this by implementing a Sinusoidal Representation Network (SIREN) to learn frequency information in brain dynamics from EEG, which serves as the input to a subsequent encoder-decoder to effectively reconstruct the fMRI signal from a specific brain region. We evaluate our model using a simultaneous EEG-fMRI dataset with 8 subjects and investigate its potential for predicting subcortical fMRI signals. The present results reveal that our model outperforms a recent state-of-the-art model and indicate the potential of leveraging periodic activation functions in deep neural networks to model functional neuroimaging data.
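
For context, a SIREN layer is simply a linear map followed by a scaled sine activation, $\sin(\omega_0 (Wx + b))$; a minimal sketch (dimensions illustrative, and the mapping to an fMRI region hypothetical). The default $\omega_0 = 30$ follows the original SIREN paper:

```python
import torch
import torch.nn as nn

# Minimal SIREN-style layer: sin(w0 * (Wx + b)), which lets the network fit
# high-frequency structure in temporal signals.
class SineLayer(nn.Module):
    def __init__(self, d_in, d_out, w0=30.0):
        super().__init__()
        self.linear, self.w0 = nn.Linear(d_in, d_out), w0

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

net = nn.Sequential(SineLayer(34, 128), SineLayer(128, 128), nn.Linear(128, 1))
y = net(torch.randn(256, 34))   # e.g. 34 EEG channels -> one fMRI ROI value
```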

Massive Editing for Large Language Models via Meta Learning. (arXiv:2311.04661v3 [cs.CL] UPDATED)

Authors: Chenmien Tan, Ge Zhang, Jie Fu

While large language models (LLMs) have enabled learning knowledge from the pre-training corpora, the acquired knowledge may be fundamentally incorrect or outdated over time, which necessitates rectifying the knowledge of the language model (LM) after training. A promising approach involves employing a hyper-network to generate parameter shifts, whereas existing hyper-networks suffer from inferior scalability in the number of synchronous editing operations. To mitigate the problem, we propose the MAssive Language Model Editing Network (MALMEN), which formulates the parameter shift aggregation as a least squares problem, subsequently updating the LM parameters using the normal equation. To accommodate editing multiple facts simultaneously with limited memory budgets, we separate the computation on the hyper-network and LM, enabling arbitrary batch size on both neural networks. Our method is evaluated by editing up to thousands of facts on LMs with different architectures, i.e., BERT-base, GPT-2, T5-XL (2.8B), and GPT-J (6B), across various knowledge-intensive NLP tasks, i.e., closed book fact-checking and question answering. Remarkably, MALMEN is capable of editing hundreds of times more facts than strong baselines with the identical hyper-network architecture and outperforms an editor specifically designed for GPT. Our code is available at https://github.com/ChenmienTan/malmen.
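
The normal-equation step at the core of this aggregation is the classic (ridge-regularized) least-squares solve, $\theta = (X^\top X + \lambda I)^{-1} X^\top y$; a minimal sketch with illustrative shapes, not the hyper-network's actual parameter-shift computation:

```python
import numpy as np

# Ridge least-squares via the normal equation: aggregate many desired
# per-fact outputs into a single parameter vector in one linear solve.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))      # e.g. per-fact feature rows
y = rng.normal(size=(1000, 1))       # desired per-fact outputs
lam = 1e-3
theta = np.linalg.solve(X.T @ X + lam * np.eye(64), X.T @ y)
```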

2D-RC: Two-Dimensional Neural Network Approach for OTFS Symbol Detection. (arXiv:2311.08543v2 [eess.SP] UPDATED)

Authors: Jiarui Xu, Karim Said, Lizhong Zheng, Lingjia Liu

Orthogonal time frequency space (OTFS) is a promising modulation scheme for wireless communication in high-mobility scenarios. Recently, a reservoir computing (RC) based approach has been introduced for online subframe-based symbol detection in the OTFS system, where only a limited number of over-the-air (OTA) pilot symbols are utilized for training. However, this approach does not leverage the domain knowledge specific to the OTFS system to fully unlock the potential of RC. This paper introduces a novel two-dimensional RC (2D-RC) method that incorporates the domain knowledge of the OTFS system into the design for symbol detection in an online subframe-based manner. Specifically, since the channel interaction in the delay-Doppler (DD) domain is a two-dimensional (2D) circular operation, 2D-RC is designed with a 2D circular padding procedure and a 2D filtering structure to embed this knowledge. With the introduced architecture, 2D-RC can operate in the DD domain with only a single neural network, instead of necessitating multiple RCs to track channel variations in the time domain as in previous work. Numerical experiments demonstrate the advantages of the 2D-RC approach over the previous RC-based approach and compared to model-based methods across different OTFS system variants and modulation orders.
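
The 2D circular padding the design embeds can be sketched in a few lines of PyTorch; the grid size, pad width, and kernel size here are illustrative.

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 1, 16, 16)  # (batch, channel, delay, Doppler) grid
    x_pad = F.pad(x, (2, 2, 2, 2), mode="circular")  # wrap both axes
    print(x_pad.shape)  # torch.Size([1, 1, 20, 20])
    # A convolution over the padded grid then realizes 2D circular filtering.
    y = F.conv2d(x_pad, torch.randn(1, 1, 5, 5))
    print(y.shape)  # torch.Size([1, 1, 16, 16])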

Short vs. Long-term Coordination of Drones: When Distributed Optimization Meets Deep Reinforcement Learning. (arXiv:2311.09852v2 [cs.RO] UPDATED)

Authors: Chuhao Qin, Evangelos Pournaras

Swarms of autonomous interactive drones, with the support of recharging technology, can provide compelling sensing capabilities in Smart Cities, such as traffic monitoring and disaster response. Existing approaches, including distributed optimization and deep reinforcement learning (DRL), aim to coordinate drones to achieve cost-effective, high-quality navigation, sensing, and charging. However, they face grand challenges: short-term optimization is not effective in dynamic environments with unanticipated changes, while long-term learning lacks scalability, resilience, and flexibility. To bridge this gap, this paper introduces a new progressive approach that combines short-term plan generation and selection based on distributed optimization with DRL-based long-term strategic scheduling of flying direction. Extensive experimentation with datasets generated from realistic urban mobility underscores the outstanding performance of the proposed solution compared to the state of the art. We also provide compelling new insights about the role of drone density in different sensing missions, the energy safety of drone operations, and how to prioritize investments for key locations of charging infrastructure.

Can LLMs Patch Security Issues? (arXiv:2312.00024v2 [cs.CR] UPDATED)

Authors: Kamel Alrashedy, Abdullah Aljasser

Large Language Models (LLMs) have shown impressive proficiency in code generation. Nonetheless, similar to human developers, these models might generate code that contains security vulnerabilities and flaws. Writing secure code remains a substantial challenge, as vulnerabilities often arise during interactions between programs and external systems or services, such as databases and operating systems. In this paper, we propose a novel approach, Feedback-Driven Solution Synthesis (FDSS), in which an LLM receives feedback from Bandit, a static code analysis tool, and then generates potential solutions to resolve the detected security vulnerabilities. Each solution, along with the vulnerable code, is then sent back to the LLM for code refinement. Our approach shows a significant improvement over the baseline and outperforms existing approaches. Furthermore, we introduce a new dataset, PythonSecurityEval, collected from real-world scenarios on Stack Overflow, to evaluate the LLMs' ability to generate secure code. Code and data are available at https://github.com/Kamel773/LLM-code-refine
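
A minimal sketch of such a feedback loop, assuming Bandit is installed and using a hypothetical llm_refine stand-in for the actual model call, might look as follows.

    import json
    import subprocess
    import tempfile

    def bandit_findings(code: str) -> list:
        """Run Bandit on a Python snippet and return its reported issues."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        out = subprocess.run(["bandit", "-f", "json", path],
                             capture_output=True, text=True)
        return json.loads(out.stdout).get("results", [])

    def llm_refine(code: str, findings: list) -> str:
        raise NotImplementedError  # call your LLM with the code and findings here

    def fdss_loop(code: str, max_rounds: int = 3) -> str:
        for _ in range(max_rounds):
            findings = bandit_findings(code)
            if not findings:  # no remaining reported issues
                break
            code = llm_refine(code, findings)
        return code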

On the Nystrom Approximation for Preconditioning in Kernel Machines. (arXiv:2312.03311v4 [stat.ML] UPDATED)

Authors: Amirhesam Abedsoltan, Parthe Pandit, Luis Rademacher, Mikhail Belkin

Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed up the convergence of such iterative algorithms for training kernel models. However, computing and storing a spectral preconditioner can itself be expensive, incurring large computational and storage overheads that preclude the application of kernel methods to problems with large datasets. A Nystrom approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nystrom-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.
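
To make the construction concrete, here is a small NumPy sketch of a Nystrom approximation built from a landmark subsample; the kernel, sample sizes, and jitter are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))

    def rbf(A, B, gamma=0.5):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    # Sample a small landmark set (the paper shows logarithmic size suffices).
    m = 64
    idx = rng.choice(len(X), size=m, replace=False)
    K_nm = rbf(X, X[idx])       # (n, m) cross-kernel block
    K_mm = rbf(X[idx], X[idx])  # (m, m) landmark kernel block

    # Nystrom approximation K ~ K_nm K_mm^{-1} K_nm^T = Z Z^T, realized via
    # the eigendecomposition of the small m x m block.
    w, U = np.linalg.eigh(K_mm + 1e-8 * np.eye(m))
    Z = K_nm @ U / np.sqrt(np.maximum(w, 1e-12))
    print(Z.shape)  # (2000, 64); Z's spectrum feeds the preconditioner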

Semi-Supervised Active Learning for Semantic Segmentation in Unknown Environments Using Informative Path Planning. (arXiv:2312.04402v2 [cs.RO] UPDATED)

Authors: Julius Rückin, Federico Magistri, Cyrill Stachniss, Marija Popović

Semantic segmentation enables robots to perceive and reason about their environments beyond geometry. Most such systems build upon deep learning approaches. As autonomous robots are commonly deployed in initially unknown environments, pre-training on static datasets cannot always capture the variety of domains and limits the robot's perception performance during missions. Recently, self-supervised and fully supervised active learning methods emerged to improve a robot's vision. These approaches rely on large in-domain pre-training datasets or require substantial human labelling effort. We propose a planning method for semi-supervised active learning of semantic segmentation that substantially reduces human labelling requirements compared to fully supervised approaches. We leverage an adaptive map-based planner guided towards the frontiers of unexplored space with high model uncertainty, collecting training data for human labelling. A key aspect of our approach is to combine the sparse high-quality human labels with pseudo labels automatically extracted from highly certain environment map areas. Experimental results show that our method reaches segmentation performance close to fully supervised approaches with drastically reduced human labelling effort while outperforming self-supervised approaches.

Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. (arXiv:2312.05934v2 [cs.AI] UPDATED)

Authors: Oded Ovadia, Menachem Brief, Moshik Mishaeli, Oren Elisha

Large language models (LLMs) encapsulate a vast amount of factual information within their pre-trained weights, as evidenced by their ability to answer diverse questions across different domains. However, this knowledge is inherently limited, relying heavily on the characteristics of the training data. Consequently, using external datasets to incorporate new information or refine the capabilities of LLMs on previously seen information poses a significant challenge. In this study, we compare two common approaches: unsupervised fine-tuning and retrieval-augmented generation (RAG). We evaluate both approaches on a variety of knowledge-intensive tasks across different topics. Our findings reveal that while unsupervised fine-tuning offers some improvement, RAG consistently outperforms it, both for existing knowledge encountered during training and entirely new knowledge. Moreover, we find that LLMs struggle to learn new factual information through unsupervised fine-tuning, and that exposing them to numerous variations of the same fact during training could alleviate this problem.

TrojFST: Embedding Trojans in Few-shot Prompt Tuning. (arXiv:2312.10467v2 [cs.LG] UPDATED)

Authors: Mengxin Zheng, Jiaqi Xue, Xun Chen, YanShan Wang, Qian Lou, Lei Jiang

Prompt-tuning has emerged as a highly effective approach for adapting a pre-trained language model (PLM) to handle new natural language processing tasks with limited input samples. However, the success of prompt-tuning has led to adversaries attempting backdoor attacks against this technique. Previous prompt-based backdoor attacks faced challenges when implemented through few-shot prompt-tuning, requiring either full-model fine-tuning or a large training dataset. We observe the difficulty in constructing a prompt-based backdoor using few-shot prompt-tuning, which involves freezing the PLM and tuning a soft prompt with a restricted set of input samples. This approach introduces an imbalanced poisoned dataset, making it susceptible to overfitting and lacking attention awareness. To address these challenges, we introduce TrojFST for backdoor attacks within the framework of few-shot prompt-tuning. TrojFST comprises three modules: balanced poison learning, selective token poisoning, and trojan-trigger attention. In comparison to previous prompt-based backdoor attacks, TrojFST demonstrates significant improvements, enhancing the attack success rate (ASR) by more than 9% and the clean-data accuracy (CDA) by more than 4% across various PLMs and a diverse set of downstream tasks.

A Survey of Reasoning with Foundation Models. (arXiv:2312.11562v5 [cs.AI] UPDATED)

Authors: Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, Yue Wu, Wenhai Wang, Junsong Chen, Zhangyue Yin, Xiaozhe Ren, Jie Fu, Junxian He, Wu Yuan, Qi Liu, Xihui Liu, Yu Li, Hao Dong, Yu Cheng, Ming Zhang, Pheng Ann Heng, Jifeng Dai, Ping Luo, Jingdong Wang, Ji-Rong Wen, Xipeng Qiu, Yike Guo, Hui Xiong, Qun Liu, Zhenguo Li

Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, e.g., Large Language Models (LLMs), there is a growing interest in exploring their abilities in reasoning tasks. In this paper, we introduce seminal foundation models proposed or adaptable for reasoning, highlighting the latest advancements in various reasoning tasks, methods, and benchmarks. We then delve into the potential future directions behind the emergence of reasoning abilities within foundation models. We also discuss the relevance of multimodal learning, autonomous agents, and super alignment in the context of reasoning. By discussing these future research directions, we hope to inspire researchers in their exploration of this field, stimulate further advancements in reasoning with foundation models, and contribute to the development of AGI.

An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training. (arXiv:2312.11819v2 [cs.LG] UPDATED)

Authors: Youshao Xiao, Weichang Wu, Zhenglei Zhou, Fagui Mao, Shangchun Zhao, Lin Ju, Lei Liang, Xiaolu Zhang, Jun Zhou

Recently, large language models (LLMs) such as ChatGPT and InstructGPT have made a significant impact in the AI world. Many works have attempted to reproduce InstructGPT's complex training pipeline, namely Reinforcement Learning with Human Feedback (RLHF). However, mainstream distributed RLHF training methods typically adopt a fixed model placement strategy, referred to as the Flattening strategy. This strategy treats all four interdependent models involved in RLHF as a single entity, distributing them across all devices and applying parallelism techniques designed for a single model, regardless of the different workloads inherent to each model. As a result, this strategy exacerbates the generation bottlenecks in RLHF training and degrades the overall training efficiency. To address these issues, we propose an adaptive model placement framework that offers two flexible model placement strategies. The Interleaving strategy helps reduce memory redundancy and communication costs of RLHF training by placing models without dependencies on exclusive devices with careful orchestration. On the other hand, the Separation strategy improves the throughput of model training by separating the training and inference runtime of the RLHF pipeline with additional shadow models. Furthermore, our framework provides a simple user interface and allows for the agile allocation of models across devices in a fine-grained manner for various training scenarios, involving models of varying sizes and devices of different scales. Extensive experiments have demonstrated that our Interleaving and Separation strategies can achieve notable improvements of up to 11x compared to the current SOTA approaches. The results highlight the effectiveness and adaptability of our approaches in accelerating the training of distributed RLHF.

Robust Neural Pruning with Gradient Sampling Optimization for Residual Neural Networks. (arXiv:2312.16020v2 [cs.LG] UPDATED)

Authors: Juyoung Yun

In this study, we explore an innovative approach for neural network optimization, focusing on the application of gradient sampling techniques, similar to those in StochGradAdam, during the pruning process. Our primary objective is to maintain high accuracy levels in pruned models, a critical challenge in resource-limited scenarios. Our extensive experiments reveal that models optimized with gradient sampling techniques are more effective at preserving accuracy during pruning compared to those using traditional optimization methods. This finding underscores the significance of gradient sampling in facilitating robust learning and enabling networks to retain crucial information even after substantial reduction in their complexity. We validate our approach across various datasets and neural architectures, demonstrating its broad applicability and effectiveness. The paper also delves into the theoretical aspects, explaining how gradient sampling techniques contribute to the robustness of models during pruning. Our results suggest a promising direction for creating efficient neural networks that do not compromise on accuracy, even in environments with constrained computational resources.

Harmonizing Covariance and Expressiveness for Deep Hamiltonian Regression in Crystalline Material Research: a Hybrid Cascaded Regression Framework. (arXiv:2401.00744v5 [physics.comp-ph] UPDATED)

Authors: Shi Yin, Xinyang Pan, Xudong Zhu, Tianyu Gao, Haochong Zhang, Feng Wu, Lixin He

Deep learning for Hamiltonian regression of quantum systems in material research requires satisfying the covariance laws, among which achieving SO(3)-equivariance without sacrificing the expressiveness of networks remains an elusive challenge, due to the restrictions that guaranteeing theoretical equivariance places on non-linear mappings. To alleviate the covariance-expressiveness dilemma, we propose a hybrid framework with two cascaded regression stages. The first stage, i.e., a theoretically-guaranteed covariant neural network modeling symmetry properties of 3D atom systems, predicts baseline Hamiltonians with theoretically covariant features extracted, assisting the second stage in learning covariance. Meanwhile, the second stage, powered by a non-linear 3D graph Transformer network we propose for structural modeling of atomic systems, refines the first stage's output as a fine-grained prediction of Hamiltonians with better expressiveness. The combination of a theoretically covariant yet inevitably less expressive model with a highly expressive non-linear network enables precise, generalizable predictions while maintaining robust covariance under coordinate transformations. Our method achieves state-of-the-art performance in Hamiltonian prediction for electronic structure calculations, confirmed through experiments on six crystalline material databases. The codes and configuration scripts are available in the supplementary material.

Convergence Rate Maximization for Split Learning-based Control of EMG Prosthetic Devices. (arXiv:2401.03233v2 [cs.LG] UPDATED)

Authors: Matea Marinova, Daniel Denkovski, Hristijan Gjoreski, Zoran Hadzi-Velkov, Valentin Rakovic

Split Learning (SL) is a promising distributed learning approach for electromyography (EMG) based prosthetic control, due to its applicability within resource-constrained environments. Other learning approaches, such as Deep Learning and Federated Learning (FL), provide suboptimal solutions, since prosthetic devices are extremely limited in terms of processing power and battery life. The viability of implementing SL in such scenarios stems from its inherent model partitioning, in which clients execute only the smaller model segment. However, selecting an inadequate cut layer hinders the training process in SL systems. This paper presents an algorithm for optimal cut layer selection in terms of maximizing the convergence rate of the model. The performance evaluation demonstrates that the proposed algorithm substantially accelerates convergence in an EMG pattern recognition task, thereby improving prosthetic device control.

Can Probabilistic Feedback Drive User Impacts in Online Platforms?. (arXiv:2401.05304v2 [cs.LG] UPDATED)

Authors: Jessica Dai, Bailey Flanigan, Nika Haghtalab, Meena Jagadeesan, Chara Podimata

A common explanation for negative user impacts of content recommender systems is misalignment between the platform's objective and user welfare. In this work, we show that misalignment in the platform's objective is not the only potential cause of unintended impacts on users: even when the platform's objective is fully aligned with user welfare, the platform's learning algorithm can induce negative downstream impacts on users. The source of these user impacts is that different pieces of content may generate observable user reactions (feedback information) at different rates; these feedback rates may correlate with content properties, such as controversiality or demographic similarity of the creator, that affect the user experience. Since differences in feedback rates can impact how often the learning algorithm engages with different content, the learning algorithm may inadvertently promote content with certain such properties. Using the multi-armed bandit framework with probabilistic feedback, we examine the relationship between feedback rates and a learning algorithm's engagement with individual arms for different no-regret algorithms. We prove that no-regret algorithms can exhibit a wide range of dependencies: if the feedback rate of an arm increases, some no-regret algorithms engage with the arm more, some no-regret algorithms engage with the arm less, and other no-regret algorithms engage with the arm approximately the same number of times. From a platform design perspective, our results highlight the importance of looking beyond regret when measuring an algorithm's performance, and assessing the nature of a learning algorithm's engagement with different types of content as well as their resulting downstream impacts.
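
The setting is easy to simulate. The toy sketch below runs a UCB-style learner on two arms of identical quality but different feedback rates and tracks engagement counts; all rates and horizons are illustrative.

    import math
    import random

    means = [0.5, 0.5]       # two arms with identical content quality
    p_feedback = [0.9, 0.3]  # but very different feedback rates
    counts = [0, 0]          # pulls per arm (engagement)
    obs_n = [1e-9, 1e-9]     # observed-feedback counts
    obs_sum = [0.0, 0.0]

    for t in range(1, 10001):
        ucb = [obs_sum[i] / obs_n[i] + math.sqrt(2 * math.log(t) / obs_n[i])
               for i in range(2)]
        i = max(range(2), key=lambda k: ucb[k])
        counts[i] += 1
        if random.random() < p_feedback[i]:  # feedback may or may not arrive
            obs_n[i] += 1
            obs_sum[i] += 1.0 if random.random() < means[i] else 0.0

    print(counts)  # engagement differs even though both arms are equally good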

Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models. (arXiv:2401.08491v2 [cs.CL] UPDATED)

Authors: Tassilo Klein, Moin Nabi

The generation of undesirable and factually incorrect content by large language models poses a significant challenge and remains largely an unsolved issue. This paper studies the integration of a contrastive learning objective for fine-tuning LLMs for implicit knowledge editing and controlled text generation. Optimizing the training objective entails aligning text perplexities in a contrastive fashion. To facilitate training the model in a self-supervised fashion, we leverage an off-the-shelf LLM for training data generation. We showcase applicability in the domain of detoxification. Herein, the proposed approach leads to a significant decrease in the generation of toxic content while preserving general utility for downstream tasks such as commonsense reasoning and reading comprehension. The proposed approach is conceptually simple but empirically powerful.
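
One plausible reading of the objective is a margin loss over sequence log-perplexities, sketched below; the margin and loss form are illustrative assumptions, not the authors' exact formulation.

    import torch
    import torch.nn.functional as F

    def seq_nll(logits, labels):
        """Mean token negative log-likelihood = log-perplexity of a sequence."""
        return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

    def contrastive_perplexity_loss(logits_pos, labels_pos,
                                    logits_neg, labels_neg, margin=1.0):
        nll_pos = seq_nll(logits_pos, labels_pos)  # preferred: keep low
        nll_neg = seq_nll(logits_neg, labels_neg)  # toxic: push high
        return F.relu(nll_pos - nll_neg + margin)

    # Toy shapes: batch 2, length 5, vocabulary 100.
    lp = torch.randn(2, 5, 100, requires_grad=True)
    ln = torch.randn(2, 5, 100)
    loss = contrastive_perplexity_loss(lp, torch.randint(100, (2, 5)),
                                       ln, torch.randint(100, (2, 5)))
    loss.backward()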

DiConStruct: Causal Concept-based Explanations through Black-Box Distillation. (arXiv:2401.08534v3 [cs.LG] UPDATED)

Authors: Ricardo Moreira, Jacopo Bono, Mário Cardoso, Pedro Saleiro, Mário A. T. Figueiredo, Pedro Bizarro

Model interpretability plays a central role in human-AI decision-making systems. Ideally, explanations should be expressed using human-interpretable semantic concepts. Moreover, the causal relations between these concepts should be captured by the explainer to allow for reasoning about the explanations. Lastly, explanation methods should be efficient and not compromise the performance of the predictive task. Despite the rapid advances in AI explainability in recent years, to the best of our knowledge no method to date fulfills these three properties. Indeed, mainstream methods for local concept explainability do not produce causal explanations and incur a trade-off between explainability and prediction performance. We present DiConStruct, an explanation method that is both concept-based and causal, with the goal of creating more interpretable local explanations in the form of structural causal models and concept attributions. Our explainer works as a distillation model to any black-box machine learning model by approximating its predictions while producing the respective explanations. Because of this, DiConStruct generates explanations efficiently while not impacting the black-box prediction task. We validate our method on an image dataset and a tabular dataset, showing that DiConStruct approximates the black-box models with higher fidelity than other concept explainability baselines, while providing explanations that include the causal relations between the concepts.

SAiD: Speech-driven Blendshape Facial Animation with Diffusion. (arXiv:2401.08655v2 [cs.CV] UPDATED)

Authors: Inkyu Park, Jaewoong Cho

Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets despite extensive research. Most prior works, typically focused on learning regression models on a small dataset using the method of least squares, encounter difficulties generating diverse lip movements from speech and require substantial effort in refining the generated outputs. To address these issues, we propose a speech-driven 3D facial animation method with a diffusion model (SAiD), a lightweight Transformer-based U-Net with a cross-modality alignment bias between the audio and visual modalities to enhance lip synchronization. Moreover, we introduce BlendVOCA, a benchmark dataset of pairs of speech audio and parameters of a blendshape facial model, to address the scarcity of public resources. Our experimental results demonstrate that the proposed approach achieves comparable or superior performance in lip synchronization to baselines, ensures more diverse lip movements, and streamlines the animation editing process.

Decoupled Prototype Learning for Reliable Test-Time Adaptation. (arXiv:2401.08703v2 [cs.LG] UPDATED)

Authors: Guowei Wang, Changxing Ding, Wentao Tan, Mingkui Tan

Test-time adaptation (TTA) is a task that continually adapts a pre-trained source model to the target domain during inference. One popular approach involves fine-tuning the model with cross-entropy loss according to estimated pseudo-labels. However, its performance is significantly affected by noisy pseudo-labels. This study reveals that minimizing the classification error of each sample causes the cross-entropy loss's vulnerability to label noise. To address this issue, we propose a novel Decoupled Prototype Learning (DPL) method that features prototype-centric loss computation. First, we decouple the optimization of class prototypes. For each class prototype, we reduce its distance to positive samples and enlarge its distance to negative samples in a contrastive manner. This strategy prevents the model from overfitting to noisy pseudo-labels. Second, we propose a memory-based strategy to enhance DPL's robustness for the small batch sizes often encountered in TTA. We update each class's pseudo-feature from a memory in a momentum manner and insert an additional DPL loss. Finally, we introduce a consistency regularization-based approach to leverage samples with unconfident pseudo-labels. This approach transfers the feature styles of samples with unconfident pseudo-labels to those with confident pseudo-labels, creating more reliable samples for TTA. The experimental results demonstrate that our methods achieve state-of-the-art performance on domain generalization benchmarks and reliably improve the performance of self-training-based methods on image corruption benchmarks. The code will be released.
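
The prototype-centric contrastive idea can be sketched as a cross-entropy over feature-prototype similarities, which simultaneously pulls features toward their own prototype and pushes them away from the others; the temperature and shapes below are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def prototype_loss(features, pseudo_labels, prototypes, tau=0.1):
        """features: (n, d) L2-normalized; prototypes: (C, d) L2-normalized.
        Cross-entropy over prototype similarities pulls each feature toward
        its own class prototype and pushes it away from the others."""
        logits = features @ prototypes.T / tau  # (n, C) cosine similarities
        return F.cross_entropy(logits, pseudo_labels)

    feats = F.normalize(torch.randn(32, 128), dim=1)
    protos = F.normalize(torch.randn(10, 128), dim=1)
    labels = torch.randint(10, (32,))
    print(prototype_loss(feats, labels, protos))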

Shabari: Delayed Decision-Making for Faster and Efficient Serverless Functions. (arXiv:2401.08859v2 [cs.DC] UPDATED)

Authors: Prasoon Sinha, Kostis Kaffes, Neeraja J. Yadwadkar

Serverless computing relieves developers from the burden of resource management, thus providing ease-of-use to the users and the opportunity to optimize resource utilization for the providers. However, today's serverless systems lack performance guarantees for function invocations, thus limiting support for performance-critical applications: we observed severe performance variability (up to 6x). Providers lack visibility into user functions and hence find it challenging to right-size them: we observed heavy resource underutilization (up to 80%). To understand the causes behind the performance variability and underutilization, we conducted a measurement study of commonly deployed serverless functions and learned that the function performance and resource utilization depend crucially on function semantics and inputs. Our key insight is to delay making resource allocation decisions until after the function inputs are available. We introduce Shabari, a resource management framework for serverless systems that makes decisions as late as possible to right-size each invocation to meet functions' performance objectives (SLOs) and improve resource utilization. Shabari uses an online learning agent to right-size each function invocation based on the features of the function input and makes cold-start-aware scheduling decisions. For a range of serverless functions and inputs, Shabari reduces SLO violations by 11-73% while not wasting any vCPUs and reducing wasted memory by 64-94% in the median case, compared to state-of-the-art systems, including Aquatope, Parrotfish, and Cypress.

cedar: Composable and Optimized Machine Learning Input Data Pipelines. (arXiv:2401.08895v2 [cs.LG] UPDATED)

Authors: Mark Zhao, Emanuel Adamiak, Christos Kozyrakis

The input data pipeline is an essential component of each machine learning (ML) training job. It is responsible for reading massive amounts of training data, processing batches of samples using complex transformations, and loading them onto training nodes at low latency and high throughput. Performant input data systems are becoming increasingly critical, driven by skyrocketing data volumes and training throughput demands. Unfortunately, current input data systems cannot fully leverage key performance optimizations, resulting in hugely inefficient infrastructures that require significant resources -- or worse -- underutilize expensive accelerators.

To address these demands, we present cedar, a programming model and framework that allows users to easily build, optimize, and execute input data pipelines. cedar presents an easy-to-use programming interface, allowing users to define input data pipelines using composable operators that support arbitrary ML frameworks and libraries. Meanwhile, cedar transparently applies a complex and extensible set of optimization techniques (e.g., offloading, caching, prefetching, fusion, and reordering). It then orchestrates processing across a customizable set of local and distributed compute resources in order to maximize processing performance and efficiency, all without user input. On average across six diverse input data pipelines, cedar achieves 2.49x, 1.87x, 2.18x, and 2.74x higher performance compared to tf.data, tf.data service, Ray Data, and PyTorch's DataLoader, respectively.
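
As a flavor of what a composable operator interface looks like, here is a toy pipeline abstraction in Python; the API names are hypothetical and not cedar's actual interface.

    from typing import Callable, Iterable, Iterator

    class Pipeline:
        """Declarative, composable operator chain over an input source."""
        def __init__(self, source: Iterable):
            self.source = source
            self.ops: list[Callable] = []

        def map(self, fn: Callable) -> "Pipeline":
            self.ops.append(lambda it: (fn(x) for x in it))
            return self

        def filter(self, pred: Callable) -> "Pipeline":
            self.ops.append(lambda it: (x for x in it if pred(x)))
            return self

        def __iter__(self) -> Iterator:
            it = iter(self.source)
            for op in self.ops:  # an optimizer could reorder or fuse ops here
                it = op(it)
            return it

    pipe = Pipeline(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
    print(list(pipe))  # [0, 4, 16, 36, 64]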

SymTC: A Symbiotic Transformer-CNN Net for Instance Segmentation of Lumbar Spine MRI. (arXiv:2401.09627v2 [eess.IV] UPDATED)

Authors: Jiasong Chen, Linchen Qian, Linhai Ma, Timur Urakov, Weiyong Gu, Liang Liang

Intervertebral disc disease, a prevalent ailment, frequently leads to intermittent or persistent low back pain, and the diagnosis and assessment of this disease rely on accurate measurement of vertebral bone and intervertebral disc geometries from lumbar MR images. Deep neural network (DNN) models may assist clinicians with more efficient image segmentation of individual instances (discs and vertebrae) of the lumbar spine in an automated way, which is termed instance image segmentation. In this work, we proposed SymTC, an innovative lumbar spine MR image segmentation model that combines the strengths of the Transformer and the Convolutional Neural Network (CNN). Specifically, we designed a parallel dual-path architecture to merge CNN layers and Transformer layers, and we integrated a novel position embedding into the self-attention module of the Transformer, enhancing the utilization of positional information for more accurate segmentation. To further improve model performance, we introduced a new data augmentation technique to create a synthetic yet realistic MR image dataset, named SSMSpine, which is made publicly available. We evaluated our SymTC and 15 other existing image segmentation models on our private in-house dataset and the public SSMSpine dataset, using two metrics, Dice Similarity Coefficient and 95% Hausdorff Distance. The results show that our SymTC has the best performance for segmenting vertebral bones and intervertebral discs in lumbar spine MR images. The SymTC code and SSMSpine dataset are available at https://github.com/jiasongchen/SymTC.

Exploration and Anti-Exploration with Distributional Random Network Distillation. (arXiv:2401.09750v2 [cs.LG] UPDATED)

Authors: Kai Yang, Jian Tao, Jiafei Lyu, Xiu Li

Exploration remains a critical issue in deep reinforcement learning for an agent to attain high returns in unknown environments. Although the prevailing exploration method, Random Network Distillation (RND), has been demonstrated to be effective in numerous environments, it often lacks discriminative power in bonus allocation. This paper highlights the ``bonus inconsistency'' issue within RND, pinpointing its primary limitation. To address this issue, we introduce the Distributional RND (DRND), a derivative of RND. DRND enhances the exploration process by distilling a distribution of random networks and implicitly incorporating pseudo counts to improve the precision of bonus allocation. This refinement encourages agents to engage in more extensive exploration. Our method effectively mitigates the inconsistency issue without introducing significant computational overhead. Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. Our method excels in challenging online exploration scenarios and effectively serves as an anti-exploration mechanism in D4RL offline tasks.
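
A rough sketch of the distributional idea: keep a set of fixed random target networks, train one predictor against their mean, and combine the distillation error with target disagreement into the bonus. The bonus form and network sizes below are illustrative assumptions, not the authors' exact construction.

    import torch
    import torch.nn as nn

    n_targets, d_in, d_out = 5, 8, 16
    make_net = lambda: nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(),
                                     nn.Linear(32, d_out))
    targets = [make_net() for _ in range(n_targets)]
    for t in targets:
        for p in t.parameters():
            p.requires_grad_(False)  # targets stay fixed, as in RND
    predictor = make_net()

    def bonus(x):
        with torch.no_grad():
            outs = torch.stack([t(x) for t in targets])  # (n_targets, B, d_out)
            mu = outs.mean(0)
        err = ((predictor(x) - mu) ** 2).mean(-1)  # distillation error
        spread = outs.var(0).mean(-1)              # target disagreement
        return err + spread

    x = torch.randn(4, d_in)
    print(bonus(x).shape)  # torch.Size([4])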

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences. (arXiv:2401.10529v2 [cs.CV] UPDATED)

Authors: Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang

Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.

Neglected Hessian component explains mysteries in Sharpness regularization. (arXiv:2401.10809v2 [cs.LG] UPDATED)

Authors: Yann N. Dauphin, Atish Agarwala, Hossein Mobahi

Recent work has shown that methods like SAM which either explicitly or implicitly penalize second order information can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We show that these differences can be explained by the structure of the Hessian of the loss. First, we show that a common decomposition of the Hessian can be quantitatively interpreted as separating the feature exploitation from feature exploration. The feature exploration, which can be described by the Nonlinear Modeling Error matrix (NME), is commonly neglected in the literature since it vanishes at interpolation. Our work shows that the NME is in fact important as it can explain why gradient penalties are sensitive to the choice of activation function. Using this insight we design interventions to improve performance. We also provide evidence that challenges the long-held equivalence of weight noise and gradient penalties. This equivalence relies on the assumption that the NME can be ignored, which we find does not hold for modern networks since they involve significant feature learning. We find that regularizing feature exploitation but not feature exploration yields performance similar to gradient penalties.

AI in Supply Chain Risk Assessment: A Systematic Literature Review and Bibliometric Analysis. (arXiv:2401.10895v2 [cs.LG] UPDATED)

Authors: Md Abrar Jahin, Saleh Akram Naife, Anik Kumar Saha, M. F. Mridha

Supply chain risk assessment (SCRA) has witnessed a profound evolution through the integration of artificial intelligence (AI) and machine learning (ML) techniques, revolutionizing predictive capabilities and risk mitigation strategies. The significance of this evolution stems from the critical role of robust risk management strategies in ensuring operational resilience and continuity within modern supply chains. Previous reviews have outlined established methodologies but have overlooked emerging AI/ML techniques, leaving a notable research gap in understanding their practical implications within SCRA. This paper conducts a systematic literature review combined with a comprehensive bibliometric analysis. We meticulously examined 1,717 papers and derived key insights from a select group of 48 articles published between 2014 and 2023. The review fills this research gap by addressing pivotal research questions and exploring existing AI/ML techniques, methodologies, findings, and future trajectories, thereby providing a more encompassing view of the evolving landscape of SCRA. Our study unveils the transformative impact of AI/ML models, such as Random Forest, XGBoost, and hybrids, in substantially enhancing precision within SCRA. It underscores adaptable post-COVID strategies, advocating for resilient contingency plans and aligning with evolving risk landscapes. Significantly, this review surpasses previous examinations by accentuating emerging AI/ML techniques and their practical implications within SCRA. Furthermore, it highlights the contributions through a comprehensive bibliometric analysis, revealing publication trends, influential authors, and highly cited articles.

The Synergy Between Optimal Transport Theory and Multi-Agent Reinforcement Learning. (arXiv:2401.10949v2 [cs.MA] UPDATED)

Authors: Ali Baheri, Mykel J. Kochenderfer

This paper explores the integration of optimal transport (OT) theory with multi-agent reinforcement learning (MARL). This integration uses OT to handle distributions and transportation problems to enhance the efficiency, coordination, and adaptability of MARL. There are five key areas where OT can impact MARL: (1) policy alignment, where OT's Wasserstein metric is used to align divergent agent strategies towards unified goals; (2) distributed resource management, employing OT to optimize resource allocation among agents; (3) addressing non-stationarity, using OT to adapt to dynamic environmental shifts; (4) scalable multi-agent learning, harnessing OT for decomposing large-scale learning objectives into manageable tasks; and (5) enhancing energy efficiency, applying OT principles to develop sustainable MARL systems. This paper articulates how the synergy between OT and MARL can address scalability issues, optimize resource distribution, align agent policies in cooperative environments, and ensure adaptability in dynamically changing conditions.

TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for Lazy Clients. (arXiv:2401.12012v2 [cs.LG] UPDATED)

Authors: Mengdi Wang, Anna Bodonhelyi, Efe Bozkir, Enkelejda Kasneci

Federated learning is a distributed collaborative machine learning paradigm that has gained strong momentum in recent years. In federated learning, a central server periodically coordinates models with clients and aggregates the models trained locally by clients without necessitating access to local data. Despite its potential, the implementation of federated learning continues to encounter several challenges, predominantly the slow convergence that is largely due to data heterogeneity. The slow convergence becomes particularly problematic in cross-device federated learning scenarios where clients may be strongly limited by computing power and storage space, so counteracting methods that induce additional computation or memory cost on the client side, such as auxiliary objective terms and larger training iterations, can be impractical. In this paper, we propose a novel federated aggregation strategy, TurboSVM-FL, that poses no additional computation burden on the client side and can significantly accelerate convergence for federated classification tasks, especially when clients are "lazy" and train their models for only a few epochs before the next global aggregation. TurboSVM-FL extensively utilizes support vector machines to conduct selective aggregation and max-margin spread-out regularization on class embeddings. We evaluate TurboSVM-FL on multiple datasets including FEMNIST, CelebA, and Shakespeare using user-independent validation with non-iid data distributions. Our results show that TurboSVM-FL can significantly outperform existing popular algorithms in convergence rate and reduce communication rounds while delivering better test metrics including accuracy, F1 score, and MCC.
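
The SVM-based selective aggregation step can be sketched with scikit-learn: fit a max-margin classifier on clients' class embeddings and aggregate only the support vectors. The data shapes and the choice of sklearn's SVC are illustrative assumptions, not the authors' exact procedure.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Class embeddings uploaded by 5 clients for 2 classes (hypothetical).
    emb = np.concatenate([rng.normal(0, 1, (5, 16)), rng.normal(3, 1, (5, 16))])
    cls = np.array([0] * 5 + [1] * 5)

    svm = SVC(kernel="linear").fit(emb, cls)
    # Selective aggregation: average only the support-vector embeddings,
    # which lie on the margin and carry the most class-separating signal.
    sv_mask = np.zeros(len(emb), dtype=bool)
    sv_mask[svm.support_] = True
    agg = {c: emb[sv_mask & (cls == c)].mean(0) for c in (0, 1)}
    print({c: v.shape for c, v in agg.items()})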

The Surprising Harmfulness of Benign Overfitting for Adversarial Robustness. (arXiv:2401.12236v2 [cs.LG] UPDATED)

Authors: Yifan Hao, Tong Zhang

Recent empirical and theoretical studies have established the generalization capabilities of large machine learning models that are trained to (approximately or exactly) fit noisy data. In this work, we prove a surprising result: even if the ground truth itself is robust to adversarial examples, and the benignly overfitted model is benign in terms of the ``standard'' out-of-sample risk objective, this benign overfitting process can be harmful when out-of-sample data are subject to adversarial manipulation. More specifically, our main results contain two parts: (i) the min-norm estimator in the overparameterized linear model always leads to adversarial vulnerability in the ``benign overfitting'' setting; (ii) we verify an asymptotic trade-off result between the standard risk and the ``adversarial'' risk of every ridge regression estimator, implying that under suitable conditions these two quantities cannot both be small at the same time for any single choice of the ridge regularization parameter. Furthermore, under the lazy training regime, we demonstrate parallel results for the two-layer neural tangent kernel (NTK) model, which align with empirical observations in deep neural networks. Our finding provides theoretical insights into the puzzling phenomenon observed in practice, where the true target function (e.g., human) is robust against adversarial attack, while benignly overfitted neural networks lead to models that are not robust.
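
Part (i) concerns the minimum-norm interpolator, which is simple to construct explicitly; the dimensions and noise level in this NumPy sketch are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 500  # d >> n: overparameterized regime
    X = rng.normal(size=(n, d))
    beta_star = np.zeros(d)
    beta_star[0] = 1.0
    y = X @ beta_star + 0.1 * rng.normal(size=n)  # noisy labels

    # Minimum-norm interpolator: beta_hat = X^T (X X^T)^{-1} y.
    beta_hat = X.T @ np.linalg.solve(X @ X.T, y)
    print(np.allclose(X @ beta_hat, y))  # True: fits the noisy data exactly
    # An l2-bounded adversary can shift predictions by up to
    # eps * ||beta_hat||_2, so a large norm signals adversarial vulnerability.
    print(np.linalg.norm(beta_hat))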

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models. (arXiv:2401.12522v2 [cs.CL] UPDATED)

Authors: Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao

Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning for the capability in semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7$\times$ speedup on the MT-Bench benchmark. Extensive experiments confirm our method surpasses state-of-the-art acceleration techniques.

Energy-based Automated Model Evaluation. (arXiv:2401.12689v2 [cs.LG] UPDATED)

Authors: Ru Peng, Heming Zou, Haobo Wang, Yawen Zeng, Zenan Huang, Junbo Zhao

The conventional evaluation protocols for machine learning models rely heavily on a labeled, i.i.d-assumed testing dataset, which is not often present in real-world applications. Automated Model Evaluation (AutoEval) offers an alternative to this traditional workflow by forming a proximal prediction pipeline of the testing performance without the presence of ground-truth labels. Despite its recent successes, AutoEval frameworks still suffer from overconfidence, substantial storage requirements, and high computational cost. In that regard, we propose a novel measure -- Meta-Distribution Energy (MDE) -- that allows the AutoEval framework to be both more efficient and effective. The core of MDE is to establish a meta-distribution statistic on the information (energy) associated with individual samples, then offer a smoother representation enabled by energy-based learning. We further provide theoretical insights by connecting MDE with the classification loss. We provide extensive experiments across modalities, datasets, and different architectural backbones to validate MDE's validity, together with its superiority compared with prior approaches. We also prove MDE's versatility by showing its seamless integration with large-scale models and easy adaptation to learning scenarios with noisy or imbalanced labels. Code and data are available: https://github.com/pengr/Energy_AutoEval
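
The per-sample energy underlying MDE is the standard energy score from energy-based models; a minimal sketch follows, with the meta-distribution aggregation simplified to a plain mean for illustration.

    import torch

    def energy(logits, T=1.0):
        """E(x) = -T * logsumexp(logits / T); lower energy ~ higher confidence."""
        return -T * torch.logsumexp(logits / T, dim=-1)

    logits = torch.randn(1000, 10)  # classifier outputs on an unlabeled test set
    per_sample = energy(logits)
    mde_like = per_sample.mean()    # smooth dataset-level statistic
    print(per_sample.shape, float(mde_like))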

Binary structured physics-informed neural networks for solving equations with rapidly changing solutions. (arXiv:2401.12806v2 [cs.LG] UPDATED)

Authors: Yanzhi Liu, Ruifan Wu, Ying Jiang

Physics-informed neural networks (PINNs), rooted in deep learning, have emerged as a promising approach for solving partial differential equations (PDEs). By embedding the physical information described by PDEs into feedforward neural networks, PINNs are trained as surrogate models to approximate solutions without the need for labeled data. Nevertheless, even though PINNs have shown remarkable performance, they can face difficulties, especially when dealing with equations featuring rapidly changing solutions. These difficulties encompass slow convergence, susceptibility to becoming trapped in local minima, and reduced solution accuracy. To address these issues, we propose a binary structured physics-informed neural network (BsPINN) framework, which employs a binary structured neural network (BsNN) as the neural network component. By leveraging a binary structure that reduces inter-neuron connections compared to fully connected neural networks, BsPINNs excel at capturing the local features of solutions more effectively and efficiently. This capability is particularly crucial for learning rapidly changing solutions. In a series of numerical experiments solving the Burgers equation, the Euler equation, the Helmholtz equation, and a high-dimensional Poisson equation, BsPINNs exhibit superior convergence speed and heightened accuracy compared to PINNs. From these experiments, we find that BsPINNs resolve the over-smoothing caused by increasing the number of hidden layers in PINNs, and prevent the decline in accuracy due to the non-smoothness of PDE solutions.
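
The binary structure can be sketched with grouped layers whose group count doubles with depth, reducing inter-neuron connections relative to a fully connected network; the widths, depth, and use of grouped 1x1 convolutions here are illustrative assumptions, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class BsNN(nn.Module):
        """Hidden layers split into 1, 2, 4, ... groups, halving connectivity."""
        def __init__(self, in_dim=2, width=64, depth=4):
            super().__init__()
            self.lift = nn.Linear(in_dim, width)
            layers = []
            for i in range(depth):
                groups = 2 ** i  # 1, 2, 4, 8 independent neuron groups
                layers += [nn.Conv1d(width, width, 1, groups=groups), nn.Tanh()]
            self.net = nn.Sequential(*layers)
            self.head = nn.Conv1d(width, 1, 1)

        def forward(self, coords):  # coords: (batch, in_dim), e.g., (x, t)
            h = torch.tanh(self.lift(coords)).unsqueeze(-1)  # (batch, width, 1)
            return self.head(self.net(h)).squeeze(-1)        # (batch, 1)

    net = BsNN()
    print(net(torch.randn(16, 2)).shape)  # torch.Size([16, 1])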

Debiased Sample Selection for Combating Noisy Labels. (arXiv:2401.13360v2 [cs.LG] UPDATED)

Authors: Qi Wei, Lei Feng, Haobo Wang, Bo An

Learning with noisy labels aims to ensure model generalization given a label-corrupted training set. The sample selection strategy achieves promising performance by selecting a label-reliable subset for model training. In this paper, we empirically reveal that existing sample selection methods suffer from both data bias and training bias, which manifest in practice as imbalanced selected subsets and accumulated errors, respectively. However, only the training bias was handled in previous studies. To address this limitation, we propose a noIse-Tolerant Expert Model (ITEM) for debiased learning in sample selection. Specifically, to mitigate the training bias, we design a robust network architecture that integrates multiple experts. Compared with the prevailing double-branch network, our network exhibits better selection and prediction performance by ensembling these experts while training with fewer parameters. Meanwhile, to mitigate the data bias, we propose a mixed sampling strategy based on two weight-based data samplers. By training on the mixture of two class-discriminative mini-batches, the model mitigates the effect of the imbalanced training set while avoiding the sparse representations that are easily caused by sampling strategies. Extensive experiments and analyses demonstrate the effectiveness of ITEM. Our code is available at https://github.com/1998v7/ITEM.

Continuous-time Riemannian SGD and SVRG Flows on Wasserstein Probabilistic Space. (arXiv:2401.13530v2 [cs.LG] UPDATED)

Authors: Mingyang Yi, Bohan Wang

Recently, optimization on Riemannian manifolds has provided new insights to the optimization community. In this regard, the manifold taken as the probability measure metric space equipped with the second-order Wasserstein distance is of particular interest, since optimization on it can be linked to practical sampling processes. In general, the oracle (continuous) optimization method on Wasserstein space is the Riemannian gradient flow (i.e., Langevin dynamics when minimizing KL divergence). In this paper, we aim to enrich the continuous optimization methods in Wasserstein space by extending the gradient flow into the stochastic gradient descent (SGD) flow and the stochastic variance reduction gradient (SVRG) flow. The two flows in Euclidean space are standard stochastic optimization methods, but their Riemannian counterparts had not yet been explored. By leveraging the structures in Wasserstein space, we construct a stochastic differential equation (SDE) to approximate the discrete dynamics of the desired stochastic methods in the corresponding random vector space. Then, the flows of probability measures are naturally obtained by applying the Fokker-Planck equation to this SDE. Furthermore, the convergence rates of the proposed Riemannian stochastic flows are proven, and they match the results in Euclidean space.
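
The continuous-time objects here are SDEs whose discretizations are easy to simulate. Below is an Euler-Maruyama sketch of Langevin dynamics, whose law follows the Wasserstein gradient flow of a KL objective; the potential, step size, and particle count are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_U(x):  # potential U(x) = x^2 / 2, so the target law is N(0, 1)
        return x

    x = rng.normal(3.0, 0.1, size=10000)  # particles initialized far from target
    dt = 0.01
    for _ in range(2000):
        x = x - grad_U(x) * dt + np.sqrt(2 * dt) * rng.normal(size=x.shape)

    print(x.mean(), x.std())  # approaches N(0, 1): mean ~ 0, std ~ 1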

Finetuning Foundation Models for Joint Analysis Optimization. (arXiv:2401.13536v2 [hep-ex] UPDATED)

Authors: Matthias Vigl, Nicole Hartman, Lukas Heinrich

In this work we demonstrate that significant gains in performance and data efficiency can be achieved in High Energy Physics (HEP) by moving beyond the standard paradigm of sequential optimization of reconstruction and analysis components. We conceptually connect HEP reconstruction and analysis to modern machine learning workflows such as pretraining, finetuning, domain adaptation, and high-dimensional embedding spaces, and quantify the gains in the example use case of searches for heavy resonances decaying via an intermediate di-Higgs system to four $b$-jets.

Masked Particle Modeling on Sets: Towards Self-Supervised High Energy Physics Foundation Models. (arXiv:2401.13537v2 [hep-ph] UPDATED)

Authors: Lukas Heinrich, Tobias Golling, Michael Kagan, Samuel Klein, Matthew Leigh, Margarita Osadchy, John Andrew Raine

We propose masked particle modeling (MPM) as a self-supervised method for learning generic, transferable, and reusable representations on unordered sets of inputs for use in high energy physics (HEP) scientific data. This work provides a novel scheme to perform masked modeling based pre-training to learn permutation invariant functions on sets. More generally, this work provides a step towards building large foundation models for HEP that can be generically pre-trained with self-supervised learning and later fine-tuned for a variety of downstream tasks. In MPM, particles in a set are masked and the training objective is to recover their identity, as defined by a discretized token representation of a pre-trained vector quantized variational autoencoder. We study the efficacy of the method in samples of high energy jets at collider physics experiments, including studies on the impact of discretization, permutation invariance, and ordering. We also study the fine-tuning capability of the model, showing that it can be adapted to tasks such as supervised and weakly supervised jet classification, and that the model can transfer efficiently with small fine-tuning data sets to new classes and new data domains.
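
The masking step itself is simple to sketch; the mask rate and features below are illustrative, and the discrete tokenizer (a pre-trained VQ-VAE in the paper) is stubbed out.

    import torch

    def mask_particles(particles, mask_rate=0.3):
        """particles: (n, d) unordered set; returns the masked set and mask."""
        mask = torch.rand(particles.size(0)) < mask_rate
        masked = particles.clone()
        masked[mask] = 0.0  # stand-in for a learned [MASK] embedding
        return masked, mask

    particles = torch.randn(30, 4)  # e.g., (pt, eta, phi, E) per particle
    masked, mask = mask_particles(particles)
    # Training objective: predict each masked particle's discrete token id
    # (from the pre-trained VQ-VAE) using a permutation-invariant encoder.
    print(int(mask.sum()), "of", len(particles), "particles masked")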

Graph-Informed Neural Networks for Sparse Grid-Based Discontinuity Detectors. (arXiv:2401.13652v2 [cs.LG] UPDATED)

Authors: Francesco Della Santa, Sandra Pieraccini

In this paper, we present a novel approach for detecting the discontinuity interfaces of a discontinuous function. This approach leverages Graph-Informed Neural Networks (GINNs) and sparse grids to address discontinuity detection in domains of dimension larger than 3 as well. GINNs, trained to identify troubled points on sparse grids, exploit graph structures built on the grids to achieve efficient and accurate discontinuity detection performance. We also introduce a recursive algorithm for general sparse grid-based detectors, characterized by convergence properties and easy applicability. Numerical experiments on functions with dimensions n = 2 and n = 4 demonstrate the efficiency and robust generalization of GINNs in detecting discontinuity interfaces. Notably, the trained GINNs offer portability and versatility, allowing integration into various algorithms and sharing among users.

Inadequacy of common stochastic neural networks for reliable clinical decision support. (arXiv:2401.13657v2 [cs.LG] UPDATED)

Authors: Adrian Lindenmeyer, Malte Blattmann, Stefan Franke, Thomas Neumuth, Daniel Schneider

Widespread adoption of AI for medical decision making is still hindered by ethical and safety-related concerns. For AI-based decision support systems in healthcare settings it is paramount to be reliable and trustworthy. Common deep learning approaches, however, have a tendency towards overconfidence under data shift. Such inappropriate extrapolation beyond evidence-based scenarios may have dire consequences. This highlights the importance of reliable estimation of local uncertainty and its communication to the end user. While stochastic neural networks have been heralded as a potential solution to these issues, this study investigates their actual reliability in clinical applications. We centered our analysis on the exemplary use case of mortality prediction for ICU hospitalizations, using EHRs from the MIMIC-III study. For predictions on the EHR time series, Encoder-Only Transformer models were employed. Stochasticity of model functions was achieved by incorporating common methods such as Bayesian neural network layers and model ensembles. Our models achieve state-of-the-art discrimination performance (AUC ROC: 0.868+-0.011, AUC PR: 0.554+-0.034) and calibration on the mortality prediction benchmark. However, epistemic uncertainty is critically underestimated by the selected stochastic deep learning methods. We provide a heuristic argument that a collapse of the functional posterior distribution is responsible. Our findings reveal the inadequacy of commonly used stochastic deep learning approaches to reliably recognize out-of-distribution (OoD) samples. In both methods, unsubstantiated model confidence is not prevented due to strongly biased functional posteriors, rendering them inappropriate for reliable clinical decision support. This highlights the need for approaches with more strictly enforced or inherent distance-awareness to known data points, e.g., using kernel-based techniques.

Instructional Fingerprinting of Large Language Models. (arXiv:2401.12255v1 [cs.CR] CROSS LISTED)

Authors: Jiashu Xu, Fei Wang, Mingyu Derek Ma, Pang Wei Koh, Chaowei Xiao, Muhao Chen

The exorbitant cost of training Large language models (LLMs) from scratch makes it essential to fingerprint the models to protect intellectual property via ownership authentication and to ensure downstream users and developers comply with their license terms (e.g. restricting commercial use). In this study, we present a pilot study on LLM fingerprinting as a form of very lightweight instruction tuning. The model publisher specifies a confidential private key and implants it as an instruction backdoor that causes the LLM to generate specific text when the key is present. Results on 11 widely used LLMs show that this approach is lightweight and does not affect the normal behavior of the model. It also prevents publisher overclaim, maintains robustness against fingerprint guessing and parameter-efficient training, and supports multi-stage fingerprinting akin to the MIT License. Code is available at https://cnut1648.github.io/Model-Fingerprint/.

DittoGym: Learning to Control Soft Shape-Shifting Robots. (arXiv:2401.13231v1 [cs.RO] CROSS LISTED)

Authors: Suning Huang, Boyuan Chen, Huazhe Xu, Vincent Sitzmann

Robot co-design, where the morphology of a robot is optimized jointly with a learned policy to solve a specific task, is an emerging area of research. It holds particular promise for soft robots, which are amenable to novel manufacturing techniques that can realize learned morphologies and actuators. Inspired by nature and recent novel robot designs, we propose to go a step further and explore novel reconfigurable robots, defined as robots that can change their morphology within their lifetime. We formalize control of reconfigurable soft robots as a high-dimensional reinforcement learning (RL) problem. We unify morphology change, locomotion, and environment interaction in the same action space, and introduce an appropriate, coarse-to-fine curriculum that enables us to discover policies that accomplish fine-grained control of the resulting robots. We also introduce DittoGym, a comprehensive RL benchmark for reconfigurable soft robots that require fine-grained morphology changes to accomplish the tasks. Finally, we evaluate our proposed coarse-to-fine algorithm on DittoGym and demonstrate robots that learn to change their morphology several times within a sequence, uniquely enabled by our RL algorithm. More results are available at https://dittogym.github.io.