Authors: Gent Wu
Abstract: Vision Transformers (ViTs) have demonstrated remarkable success on large-scale datasets, but their performance on smaller datasets often falls short of convolutional neural networks (CNNs). This paper explores the design and optimization of Tiny ViTs for small datasets, using CIFAR-10 as a benchmark. We systematically evaluate the impact of data augmentation, patch token initialization, low-rank compression, and multi-class token strategies on model performance. Our experiments reveal that low-rank compression of queries in Multi-Head Latent Attention (MLA) incurs minimal performance loss, indicating redundancy in ViTs. Additionally, introducing multiple CLS tokens improves global representation capacity, boosting accuracy. These findings provide a comprehensive framework for optimizing Tiny ViTs, offering practical insights for efficient and effective designs. Code is available at https://github.com/erow/PoorViTs.
Authors: Chi-Sheng Chen, Ying-Jung Chen
Abstract: Graph Neural Networks (GNNs) have emerged as transformative tools for modeling complex relational data, offering unprecedented capabilities in tasks like forecasting and optimization. This study investigates the application of GNNs to demand forecasting within supply chain networks using the SupplyGraph dataset, a benchmark for graph-based supply chain analysis. By leveraging advanced GNN methodologies, we enhance the accuracy of forecasting models, uncover latent dependencies, and address temporal complexities inherent in supply chain operations. Comparative analyses demonstrate that GNN-based models significantly outperform traditional approaches, including Multilayer Perceptrons (MLPs) and Graph Convolutional Networks (GCNs), particularly in single-node demand forecasting tasks. The integration of graph representation learning with temporal data highlights GNNs' potential to revolutionize predictive capabilities for inventory management, production scheduling, and logistics optimization. This work underscores the pivotal role of forecasting in supply chain management and provides a robust framework for advancing research and applications in this domain.
Authors: Pritisha Sarkar, Kushalava reddy Jala, Mousumi Saha
Abstract: Acknowledging the effects of outdoor air pollution, the literature inadequately addresses indoor air pollution's impacts. Despite daily health risks, existing research primarily focused on monitoring, lacking accuracy in pinpointing indoor pollution sources. In our research work, we thoroughly investigated the influence of indoor activities on pollution levels. A survey of 143 participants revealed limited awareness of indoor air pollution. Leveraging 65 days of diverse data encompassing activities like incense stick usage, indoor smoking, inadequately ventilated cooking, excessive AC usage, and accidental paper burning, we developed a comprehensive monitoring system. We identify pollutant sources and effects with high precision through clustering analysis and interpretability models (LIME and SHAP). Our method integrates Decision Trees, Random Forest, Naive Bayes, and SVM models, excelling at 99.8% accuracy with Decision Trees. Continuous 24-hour data allows personalized assessments for targeted pollution reduction strategies, achieving 91% accuracy in predicting activities and pollution exposure.
Authors: Norman Koch, Siavash Ghiasvand
Abstract: Machine learning (ML) has become crucial in modern life, with growing interest from researchers and the public. Despite its potential, a significant entry barrier prevents widespread adoption, making it challenging for non-experts to understand and implement ML techniques. The increasing desire to leverage ML is counterbalanced by its technical complexity, creating a gap between potential and practical application. This work introduces asanAI, an offline-first, open-source, no-code machine learning toolkit designed for users of all skill levels. It allows individuals to design, debug, train, and test ML models directly in a web browser, eliminating the need for software installations and coding. The toolkit runs on any device with a modern web browser, including smartphones, and ensures user privacy through local computations while utilizing WebGL for enhanced GPU performance. Users can quickly experiment with neural networks and train custom models using various data sources, supported by intuitive visualizations of network structures and data flows. asanAI simplifies the teaching of ML concepts in educational settings and is released under an open-source MIT license, encouraging modifications. It also supports exporting models in industry-ready formats, empowering a diverse range of users to effectively learn and apply machine learning in their projects. The proposed toolkit is successfully utilized by researchers of ScaDS.AI to swiftly draft and test machine learning ideas, by trainers to effectively educate enthusiasts, and by teachers to introduce contemporary ML topics in classrooms with minimal effort and high clarity.
Authors: Biao Li, Qing-Kai Song, Wen-Gang Qi, Fu-Ping Gao
Abstract: Predicting the lateral pile response is challenging due to the complexity of pile-soil interactions. Machine learning (ML) techniques have gained considerable attention for their effectiveness in non-linear analysis and prediction. This study develops an interpretable ML-based model for predicting p-y curves of monopile foundations. An XGBoost model was trained using a database compiled from existing research. The results demonstrate that the model achieves superior predictive accuracy. Shapley Additive Explanations (SHAP) was employed to enhance interpretability. The SHAP value distributions for each variable demonstrate strong alignment with established theoretical knowledge on factors affecting the lateral response of pile foundations.
Authors: Yingbin Chen, Milad Arzani, Xuan Mu, Sophia Jin, Shaoping Xiao
Abstract: Metastructured auxetic patches, characterized by negative Poisson's ratios, offer unique mechanical properties that closely resemble the behavior of human tissues and organs. As a result, these patches have gained significant attention for their potential applications in organ repair and tissue regeneration. This study focuses on neural networks-based computational modeling of auxetic patches with a sinusoidal metastructure fabricated from silk fibroin, a bio-inspired material known for its biocompatibility and strength. The primary objective of this research is to introduce a novel, data-driven framework for patch design. To achieve this, we conducted experimental fabrication and mechanical testing to determine material properties and validate the corresponding finite element models. Finite element simulations were then employed to generate the necessary data, while greedy sampling, an active learning technique, was utilized to reduce the computational cost associated with data labeling. Two neural networks were trained to accurately predict Poisson's ratios and stresses for strains up to 15\%, respectively. Both models achieved $R^2$ scores exceeding 0.995, which indicates highly reliable predictions. Building on this, we developed a neural network-based design model capable of tailoring patch designs to achieve specific mechanical properties. This model demonstrated superior performance when compared to traditional optimization methods, such as genetic algorithms, by providing more efficient and precise design solutions. The proposed framework represents a significant advancement in the design of bio-inspired metastructures for medical applications, paving the way for future innovations in tissue engineering and regenerative medicine.
Authors: Adrien Bufort, Laurent Lebocq, Stefan Cathabard
Abstract: Modeling radio propagation is essential for wireless network design and performance optimization. Traditional methods rely on physics models of radio propagation, which can be inaccurate or inflexible. In this work, we propose using graph neural networks to learn radio propagation behaviors directly from real-world network data. Our approach converts the radio propagation environment into a graph representation, with nodes corresponding to locations and edges representing spatial and ray-tracing relationships between locations. The graph is generated by converting images of the environment into a graph structure, with specific relationships between nodes. The model is trained on this graph representation, using sensor measurements as target data. We demonstrate that the graph neural network, which learns to predict radio propagation directly from data, achieves competitive performance compared to traditional heuristic models. This data-driven approach outperforms classic numerical solvers in terms of both speed and accuracy. To the best of our knowledge, we are the first to apply graph neural networks to real-world radio propagation data to generate coverage maps, enabling generative models of signal propagation with point measurements only.
Authors: Danhua Lei, Jochen Jankowai, Petar Hristov, Hamish Carr, Leif Denby, Talha Bin Masood, Ingrid Hotz
Abstract: Feature level sets (FLS) have shown significant potential in the analysis of multi-field data by using traits defined in attribute space to specify features in the domain. In this work, we address key challenges in the practical use of FLS: trait design and feature selection for rendering. To simplify trait design, we propose a Cartesian decomposition of traits into simpler components, making the process more intuitive and computationally efficient. Additionally, we utilize dictionary learning results to automatically suggest point traits. To enhance feature selection, we introduce trait-induced merge trees (TIMTs), a generalization of merge trees for feature level sets, aimed at topologically analyzing tensor fields or general multi-variate data. The leaves in the TIMT represent areas in the input data that are closest to the defined trait, thereby most closely resembling the defined feature. This merge tree provides a hierarchy of features, enabling the querying of the most relevant and persistent features. Our method includes various query techniques for the tree, allowing the highlighting of different aspects. We demonstrate the cross-application capabilities of this approach through five case studies from different domains.
Authors: Daoyuan Ye, Juntao Li, Yiting Shen
Abstract: Capsule networks(CapsNet) are recently proposed neural network models with new processing layers, specifically for entity representation and discovery of images. It is well known that CapsNet have some advantages over traditional neural networks, especially in generalization capability. At the same time, some studies report negative experimental results. The causes of this contradiction have not been thoroughly analyzed. The preliminary experimental results show that the behavior of routing algorithms does not always produce good results as expected, and in most cases, different routing algorithms do not change the classification results, but simply polarize the link strength, especially when they continue to repeat without stopping. To realize the true potential of the CapsNet, deep mathematical analysis of the routing algorithms is crucial. In this paper, we will give the objective function that is minimized by the dynamic routing algorithm, which is a concave function. The dynamic routing algorithm can be regarded as nonlinear gradient method to solving an optimization algorithm under linear constraints, and its convergence can be strictly proved mathematically. Furthermore, the mathematically rigorous proof of the convergence is given for this class of iterative routing procedures. We analyze the relation between the objective function and the constraints solved by the dynamic routing algorithm in detail, and perform the corresponding routing experiment to analyze the effect of our convergence proof.
Authors: Philip Adzanoukpe
Abstract: This study investigates the efficacy of machine learning models for predicting house rental prices in Ghana, addressing the need for accurate and accessible housing market information. Utilising a comprehensive dataset of rental listings, we trained and evaluated various models, including CatBoost, XGBoost, and Random Forest. CatBoost emerged as the best-performing model, achieving an $R^2$ of 0.876, demonstrating its ability to effectively capture complex relationships within the housing market. Feature importance analysis revealed that location-based features, number of bedrooms, bathrooms, and furnishing status are key drivers of rental prices. Our findings provide valuable insights for stakeholders, including real estate professionals, investors, and policymakers, while also highlighting opportunities for future research, such as incorporating temporal data and exploring regional variations.
Authors: Roberto-Rafael Maura-Rivero, Chirag Nagpal, Roma Patel, Francesco Visin
Abstract: Current methods that train large language models (LLMs) with reinforcement learning feedback, often resort to averaging outputs of multiple rewards functions during training. This overlooks crucial aspects of individual reward dimensions and inter-reward dependencies that can lead to sub-optimal outcomes in generations. In this work, we show how linear aggregation of rewards exhibits some vulnerabilities that can lead to undesired properties of generated text. We then propose a transformation of reward functions inspired by economic theory of utility functions (specifically Inada conditions), that enhances sensitivity to low reward values while diminishing sensitivity to already high values. We compare our approach to the existing baseline methods that linearly aggregate rewards and show how the Inada-inspired reward feedback is superior to traditional weighted averaging. We quantitatively and qualitatively analyse the difference in the methods, and see that models trained with Inada-transformations score as more helpful while being less harmful.
Authors: Qi Sun, Edoardo Cetin, Yujin Tang
Abstract: Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce \implname, a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting only the singular components of their weight matrices. During inference, \implname employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific "expert" vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. \implname demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. \implname represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.
Authors: Daojun Liang, Haixia Zhang, Dongfeng Yuan
Abstract: Long-term and Large-scale Wireless Traffic Forecasting (LL-WTF) is pivotal for strategic network management and comprehensive planning on a macro scale. However, LL-WTF poses greater challenges than short-term ones due to the pronounced non-stationarity of extended wireless traffic and the vast number of nodes distributed at the city scale. To cope with this, we propose a Progressive Supervision method based on Label Decomposition (PSLD). Specifically, we first introduce a Random Subgraph Sampling (RSS) algorithm designed to sample a tractable subset from large-scale traffic data, thereby enabling efficient network training. Then, PSLD employs label decomposition to obtain multiple easy-to-learn components, which are learned progressively at shallow layers and combined at deep layers to effectively cope with the non-stationary problem raised by LL-WTF tasks. Finally, we compare the proposed method with various state-of-the-art (SOTA) methods on three large-scale WT datasets. Extensive experimental results demonstrate that the proposed PSLD significantly outperforms existing methods, with an average 2%, 4%, and 11% performance improvement on three WT datasets, respectively. In addition, we built an open source library for WT forecasting (WTFlib) to facilitate related research, which contains numerous SOTA methods and provides a strong benchmark.Experiments can be reproduced through https://github.com/Anoise/WTFlib.
Authors: Mikhail Terekhov
Abstract: Bandit optimization is a difficult problem, especially if the reward model is high-dimensional. When rewards are modeled by neural networks, sublinear regret has only been shown under strong assumptions, usually when the network is extremely wide. In this thesis, we investigate how pre-training can help us in the regime of smaller models. We consider a stochastic contextual bandit with the rewards modeled by a multi-layer neural network. The last layer is a linear predictor, and the layers before it are a black box neural architecture, which we call a representation network. We model pre-training as an initial guess of the weights of the representation network provided to the learner. To leverage the pre-trained weights, we introduce a novel algorithm we call Explore Twice then Commit (E2TC). During its two stages of exploration, the algorithm first estimates the last layer's weights using Ridge regression, and then runs Stochastic Gradient Decent jointly on all the weights. For a locally convex loss function, we provide conditions on the pre-trained weights under which the algorithm can learn efficiently. Under these conditions, we show sublinear regret of E2TC when the dimension of the last layer and number of actions $K$ are much smaller than the horizon $T$. In the weak training regime, when only the last layer is learned, the problem reduces to a misspecified linear bandit. We introduce a measure of misspecification $\epsilon_0$ for this bandit and use it to provide bounds $O(\epsilon_0\sqrt{d}KT+(KT)^{4 /5})$ or $\tilde{O}(\epsilon_0\sqrt{d}KT+d^{1 /3}(KT)^{2 /3})$ on the regret, depending on regularization strength. The first of these bounds has a dimension-independent sublinear term, made possible by the stochasticity of contexts. We also run experiments to evaluate the regret of E2TC and sample complexity of its exploration in practice.
Authors: Rui Shi, Nedret Billor, Elvan Ceyhan
Abstract: We introduce a new method for clustering based on Cluster Catch Digraphs (CCDs). The new method addresses the limitations of RK-CCDs by employing a new variant of spatial randomness test that employs the nearest neighbor distance (NND) instead of the Ripley's K function used by RK-CCDs. We conduct a comprehensive Monte Carlo analysis to assess the performance of our method, considering factors such as dimensionality, data set size, number of clusters, cluster volumes, and inter-cluster distance. Our method is particularly effective for high-dimensional data sets, comparable to or outperforming KS-CCDs and RK-CCDs that rely on a KS-type statistic or the Ripley's K function. We also evaluate our methods using real and complex data sets, comparing them to well-known clustering methods. Again, our methods exhibit competitive performance, producing high-quality clusters with desirable properties. Keywords: Graph-based clustering, Cluster catch digraphs, High-dimensional data, The nearest neighbor distance, Spatial randomness test
Authors: Alexis Bose, Jonathan Ethier, Ryan G. Dempsey, Yifeng Qiu
Abstract: This research leverages Conformal Prediction (CP) in the form of Conformal Predictive Systems (CPS) to accurately estimate uncertainty in a suite of machine learning (ML)-based radio metric models [1] as well as in a 2-D map-based ML path loss model [2]. Utilizing diverse difficulty estimators, we construct 95% confidence prediction intervals (PIs) that are statistically robust. Our experiments demonstrate that CPS models, trained on Toronto datasets, generalize effectively to other cities such as Vancouver and Montreal, maintaining high coverage and reliability. Furthermore, the employed difficulty estimators identify challenging samples, leading to measurable reductions in RMSE as dataset difficulty decreases. These findings highlight the effectiveness of scalable and reliable uncertainty estimation through CPS in wireless network modeling, offering important potential insights for network planning, operations, and spectrum management.
Authors: Zenon Lamprou, Yashar Moshfeghi
Abstract: Brain decoding has emerged as a rapidly advancing and extensively utilized technique within neuroscience. This paper centers on the application of raw electroencephalogram (EEG) signals for decoding human brain activity, offering a more expedited and efficient methodology for enhancing our understanding of the human brain. The investigation specifically scrutinizes the efficacy of brain-computer interfaces (BCI) in deciphering neural signals associated with speech production, with particular emphasis on the impact of vocabulary size, electrode density, and training data on the framework's performance. The study reveals the competitive word error rates (WERs) achievable on the Librispeech benchmark through pre-training on unlabelled data for speech processing. Furthermore, the study evaluates the efficacy of voice recognition under configurations with limited labeled data, surpassing previous state-of-the-art techniques while utilizing significantly fewer labels. Additionally, the research provides a comprehensive analysis of error patterns in voice recognition and the influence of model size and unlabelled training data. It underscores the significance of factors such as vocabulary size and electrode density in enhancing BCI performance, advocating for an increase in microelectrodes and refinement of language models.
Authors: Evelyn Trautmann, Ian Hales, Martin F. Volk
Abstract: Fine-tuning large language models requires high computational and memory resources, and is therefore associated with significant costs. When training on federated datasets, an increased communication effort is also needed. For this reason, parameter-efficient methods (PEFT) are becoming increasingly important. In this context, very good results have already been achieved by fine-tuning with low-rank adaptation methods (LoRA). The application of LoRA methods in Federated Learning, and especially the aggregation of adaptation matrices, is a current research field. In this article, we propose a novel aggregation method and compare it with different existing aggregation methods of low rank adapters trained in a federated fine-tuning of large machine learning models and evaluate their performance with respect to selected GLUE benchmark datasets.
Authors: Thanh Nguyen-Tang, Raman Arora
Abstract: We study the statistical complexity of offline decision-making with function approximation, establishing (near) minimax-optimal rates for stochastic contextual bandits and Markov decision processes. The performance limits are captured by the pseudo-dimension of the (value) function class and a new characterization of the behavior policy that \emph{strictly} subsumes all the previous notions of data coverage in the offline decision-making literature. In addition, we seek to understand the benefits of using offline data in online decision-making and show nearly minimax-optimal rates in a wide range of regimes.
Authors: Kunpeng Xu, Lifei Chen, Shengrui Wang
Abstract: Kernel-based subspace clustering, which addresses the nonlinear structures in data, is an evolving area of research. Despite noteworthy progressions, prevailing methodologies predominantly grapple with limitations relating to (i) the influence of predefined kernels on model performance; (ii) the difficulty of preserving the original manifold structures in the nonlinear space; (iii) the dependency of spectral-type strategies on the ideal block diagonal structure of the affinity matrix. This paper presents DKLM, a novel paradigm for kernel-induced nonlinear subspace clustering. DKLM provides a data-driven approach that directly learns the kernel from the data's self-representation, ensuring adaptive weighting and satisfying the multiplicative triangle inequality constraint, which enhances the robustness of the learned kernel. By leveraging this learned kernel, DKLM preserves the local manifold structure of data in a nonlinear space while promoting the formation of an optimal block-diagonal affinity matrix. A thorough theoretical examination of DKLM reveals its relationship with existing clustering paradigms. Comprehensive experiments on synthetic and real-world datasets demonstrate the effectiveness of the proposed method.
Authors: Filippo Lazzati, Alberto Maria Metelli
Abstract: In Reward Learning (ReL), we are given feedback on an unknown *target reward*, and the goal is to use this information to find it. When the feedback is not informative enough, the target reward is only *partially identifiable*, i.e., there exists a set of rewards (the feasible set) that are equally-compatible with the feedback. In this paper, we show that there exists a choice of reward, non-necessarily contained in the feasible set that, *depending on the ReL application*, improves the performance w.r.t. selecting the reward arbitrarily among the feasible ones. To this aim, we introduce a new *quantitative framework* to analyze ReL problems in a simple yet expressive way. We exemplify the framework in a *reward transfer* use case, for which we devise three provably-efficient ReL algorithms.
Authors: Malcolm L. Wolff, Shenghao Yang, Kari Torkkola, Michael W. Mahoney
Abstract: Pre-trained Large Language Models (LLMs) encapsulate large amounts of knowledge and take enormous amounts of compute to train. We make use of this resource, together with the observation that LLMs are able to transfer knowledge and performance from one domain or even modality to another seemingly-unrelated area, to help with multivariate demand time series forecasting. Attention in transformer-based methods requires something worth attending to -- more than just samples of a time-series. We explore different methods to map multivariate input time series into the LLM token embedding space. In particular, our novel multivariate patching strategy to embed time series features into decoder-only pre-trained Transformers produces results competitive with state-of-the-art time series forecasting models. We also use recently-developed weight-based diagnostics to validate our findings.
Authors: Maciej Krzywda, Mariusz Wermi\'nski, Szymon {\L}ukasik, Amir H. Gandomi
Abstract: This paper presents the application of Kolmogorov-Arnold Networks (KAN) in classifying metal surface defects. Specifically, steel surfaces are analyzed to detect defects such as cracks, inclusions, patches, pitted surfaces, and scratches. Drawing on the Kolmogorov-Arnold theorem, KAN provides a novel approach compared to conventional multilayer perceptrons (MLPs), facilitating more efficient function approximation by utilizing spline functions. The results show that KAN networks can achieve better accuracy than convolutional neural networks (CNNs) with fewer parameters, resulting in faster convergence and improved performance in image classification.
Authors: Yifei Zong, Alexandre Tartakovsky
Abstract: We define a digital twin (DT) of a physical system governed by partial differential equations (PDEs) as a model for real-time simulations and control of the system behavior under changing conditions. We construct DTs using the Karhunen-Lo\`{e}ve Neural Network (KL-NN) surrogate model and transfer learning (TL). The surrogate model allows fast inference and differentiability with respect to control parameters for control and optimization. TL is used to retrain the model for new conditions with minimal additional data. We employ the moment equations to analyze TL and identify parameters that can be transferred to new conditions. The proposed analysis also guides the control variable selection in DT to facilitate efficient TL. For linear PDE problems, the non-transferable parameters in the KL-NN surrogate model can be exactly estimated from a single solution of the PDE corresponding to the mean values of the control variables under new target conditions. Retraining an ML model with a single solution sample is known as one-shot learning, and our analysis shows that the one-shot TL is exact for linear PDEs. For nonlinear PDE problems, transferring of any parameters introduces errors. For a nonlinear diffusion PDE model, we find that for a relatively small range of control variables, some surrogate model parameters can be transferred without introducing a significant error, some can be approximately estimated from the mean-field equation, and the rest can be found using a linear residual least square problem or an ordinary linear least square problem if a small labeled dataset for new conditions is available. The former approach results in a one-shot TL while the latter approach is an example of a few-shot TL. Both methods are approximate for the nonlinear PDEs.
Authors: Geng Sun, Weilong Ma, Jiahui Li, Zemin Sun, Jiacheng Wang, Dusit Niyato, Shiwen Mao
Abstract: The low-altitude economy (LAE), driven by unmanned aerial vehicles (UAVs) and other aircraft, has revolutionized fields such as transportation, agriculture, and environmental monitoring. In the upcoming six-generation (6G) era, UAV-assisted mobile edge computing (MEC) is particularly crucial in challenging environments such as mountainous or disaster-stricken areas. The computation task offloading problem is one of the key issues in UAV-assisted MEC, primarily addressing the trade-off between minimizing the task delay and the energy consumption of the UAV. In this paper, we consider a UAV-assisted MEC system where the UAV carries the edge servers to facilitate task offloading for ground devices (GDs), and formulate a calculation delay and energy consumption multi-objective optimization problem (CDECMOP) to simultaneously improve the performance and reduce the cost of the system. Then, by modeling the formulated problem as a multi-objective Markov decision process (MOMDP), we propose a multi-objective deep reinforcement learning (DRL) algorithm within an evolutionary framework to dynamically adjust the weights and obtain non-dominated policies. Moreover, to ensure stable convergence and improve performance, we incorporate a target distribution learning (TDL) algorithm. Simulation results demonstrate that the proposed algorithm can better balance multiple optimization objectives and obtain superior non-dominated solutions compared to other methods.
Authors: Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Scott Niekum, Peter Stone
Abstract: Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function. In this paper, we conduct three human studies to asses whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human's unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally unobservable information derived from the reward function; training people to follow a specific preference model; and modifying the preference elicitation question. All intervention types show significant effects, providing practical tools to improve preference data quality and the resultant alignment of the learned reward functions. Overall we establish a novel research direction in model alignment: designing interfaces and training interventions to increase human conformance with the modeling assumptions of the algorithm that will learn from their input.
Authors: Jerry Chee, Arturs Backurs, Rainie Heck, Li Zhang, Janardhan Kulkarni, Thomas Rothvoss, Sivakanth Gopi
Abstract: Quantizing the weights of a neural network has two steps: (1) Finding a good low bit-complexity representation for weights (which we call the quantization grid) and (2) Rounding the original weights to values in the quantization grid. In this paper, we study the problem of rounding optimally given any quantization grid. The simplest and most commonly used way to round is Round-to-Nearest (RTN). By rounding in a data-dependent way instead, one can improve the quality of the quantized model significantly. We study the rounding problem from the lens of \emph{discrepancy theory}, which studies how well we can round a continuous solution to a discrete solution without affecting solution quality too much. We prove that given $m=\mathrm{poly}(1/\epsilon)$ samples from the data distribution, we can round all but $O(m)$ model weights such that the expected approximation error of the quantized model on the true data distribution is $\le \epsilon$ as long as the space of gradients of the original model is approximately low rank (which we empirically validate). Our proof, which is algorithmic, inspired a simple and practical rounding algorithm called \emph{DiscQuant}. In our experiments, we demonstrate that DiscQuant significantly improves over the prior state-of-the-art rounding method called GPTQ and the baseline RTN over a range of benchmarks on Phi3mini-3.8B and Llama3.1-8B. For example, rounding Phi3mini-3.8B to a fixed quantization grid with 3.25 bits per parameter using DiscQuant gets 64\% accuracy on the GSM8k dataset, whereas GPTQ achieves 54\% and RTN achieves 31\% (the original model achieves 84\%). We make our code available at https://github.com/jerry-chee/DiscQuant.
Authors: Yaopei Zeng, Lei Liu, Shaoguo Liu, Hongjian Dou, Baoyuan Wu, Li Liu
Abstract: Vertical Federated Learning (VFL) is a well-known FL variant that enables multiple parties to collaboratively train a model without sharing their raw data. Existing VFL approaches focus on overlapping samples among different parties, while their performance is constrained by the limited number of these samples, leaving numerous non-overlapping samples unexplored. Some previous work has explored techniques for imputing missing values in samples, but often without adequate attention to the quality of the imputed samples. To address this issue, we propose a Reliable Imputed-Sample Assisted (RISA) VFL framework to effectively exploit non-overlapping samples by selecting reliable imputed samples for training VFL models. Specifically, after imputing non-overlapping samples, we introduce evidence theory to estimate the uncertainty of imputed samples, and only samples with low uncertainty are selected. In this way, high-quality non-overlapping samples are utilized to improve VFL model. Experiments on two widely used datasets demonstrate the significant performance gains achieved by the RISA, especially with the limited overlapping samples, e.g., a 48% accuracy gain on CIFAR-10 with only 1% overlapping samples.
Authors: Hojjat Salehinejad, Ricky Rojas, Kingsley Iheasirim, Mohammed Yousufuddin, Bijan Borah
Abstract: Fall risk prediction among hospitalized patients is a critical aspect of patient safety in clinical settings, and accurate models can help prevent adverse events. The Hester Davis Score (HDS) is commonly used to assess fall risk, with current clinical practice relying on a threshold-based approach. In this method, a patient is classified as high-risk when their HDS exceeds a predefined threshold. However, this approach may fail to capture dynamic patterns in fall risk over time. In this study, we model the threshold-based approach and propose two machine learning approaches for enhanced fall prediction: One-step ahead fall prediction and sequence-to-point fall prediction. The one-step ahead model uses the HDS at the current timestamp to predict the risk at the next timestamp, while the sequence-to-point model leverages all preceding HDS values to predict fall risk using deep learning. We compare these approaches to assess their accuracy in fall risk prediction, demonstrating that deep learning can outperform the traditional threshold-based method by capturing temporal patterns and improving prediction reliability. These findings highlight the potential for data-driven approaches to enhance patient safety through more reliable fall prevention strategies.
Authors: Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Wei Wang, Jiahao Zhang
Abstract: Graph Neural Networks (GNNs) have become the standard approach for learning and reasoning over relational data, leveraging the message-passing mechanism that iteratively propagates node embeddings through graph structures. While GNNs have achieved significant empirical success, their theoretical limitations remain an active area of research. Existing studies primarily focus on characterizing GNN expressiveness through Weisfeiler-Lehman (WL) graph isomorphism tests. In this paper, we take a fundamentally different approach by exploring the computational limitations of GNNs through the lens of circuit complexity. Specifically, we analyze the circuit complexity of common GNN architectures and prove that under constraints of constant-depth layers, linear or sublinear embedding sizes, and polynomial precision, GNNs cannot solve key problems such as graph connectivity and graph isomorphism unless $\mathsf{TC}^0 = \mathsf{NC}^1$. These results reveal the intrinsic expressivity limitations of GNNs behind their empirical success and introduce a novel framework for analyzing GNN expressiveness that can be extended to a broader range of GNN models and graph decision problems.
Authors: Surjo Dey, Ankit Sharma, Hritu Raj, Susham Biswas
Abstract: This study introduces an advanced methodology for automatically identifying minor deformations in flat walls caused by vibrations from nearby railway tracks. It leverages high-density Terrestrial Laser Scanner (TLS) LiDAR surveys and AI/ML techniques to collect and analyze data. The scan data is processed into a detailed point cloud, which is segmented to distinguish ground points, trees, buildings, and other objects. The analysis focuses on identifying sections along flat walls and estimating their deformations relative to the ground orientation. Findings from the study, conducted at the RGIPT campus, reveal significant deformations in walls close to the railway corridor, with the highest deformations ranging from 7 to 8 cm and an average of 3 to 4 cm. In contrast, walls further from the corridor show negligible deformations. The developed automated process for feature extraction and deformation monitoring demonstrates potential for structural health monitoring. By integrating LiDAR data with machine learning, the methodology provides an efficient system for identifying and analyzing structural deformations, highlighting the importance of continuous monitoring for ensuring structural integrity and public safety in urban infrastructure. This approach represents a substantial advancement in automated feature extraction and deformation analysis, contributing to more effective management of urban infrastructure.
Authors: Zahra Bami, Ali Behnampour, Hassan Doosti
Abstract: Artificial Intelligent transformed industries, like engineering, medicine, finance. Predictive models use supervised learning, a vital Machine learning subset. Crucial for model evaluation, cross-validation includes re-substitution, hold-out, and K-fold. This study focuses on improving the accuracy of ML algorithms across three different datasets. To evaluate Hold-out, Hold-out with iteration, and K-fold Cross-Validation techniques, we created a flexible Python program. By modifying parameters like test size, Random State, and 'k' values, we were able to improve accuracy assessment. The outcomes demonstrate the Hold-out validation method's persistent superiority, particularly with a test size of 10%. With iterations and Random State settings, hold-out with iteration shows little accuracy variance. It suggests that there are variances according to algorithm, with Decision Tree doing best for Framingham and Naive Bayes and K Nearest Neighbors for COVID-19. Different datasets require different optimal K values in K-Fold Cross Validation, highlighting these considerations. This study challenges the universality of K values in K-Fold Cross Validation and suggests a 10% test size and 90% training size for better outcomes. It also emphasizes the contextual impact of dataset features, sample size, feature count, and selected methodologies. Researchers can adapt these codes for their dataset to obtain highest accuracy with specific evaluation.
Authors: Alexander Korotin, Vladimir V'yugin, Evgeny Burnaev
Abstract: We consider the problem of online aggregation of expert predictions with the quadratic loss function. We propose an algorithm for aggregating expert predictions which does not require a prior knowledge of the upper bound on the losses. The algorithm is based on the exponential reweighing of expert losses.
Authors: Liyuan Hu
Abstract: This paper presents a hierarchical reinforcement learning (RL) approach to address the agent grouping or pairing problem in cooperative multi-agent systems. The goal is to simultaneously learn the optimal grouping and agent policy. By employing a hierarchical RL framework, we distinguish between high-level decisions of grouping and low-level agents' actions. Our approach utilizes the CTDE (Centralized Training with Decentralized Execution) paradigm, ensuring efficient learning and scalable execution. We incorporate permutation-invariant neural networks to handle the homogeneity and cooperation among agents, enabling effective coordination. The option-critic algorithm is adapted to manage the hierarchical decision-making process, allowing for dynamic and optimal policy adjustments.
Authors: Ebenezer R. H. P. Isaac, Joseph H. R. Isaac
Abstract: Multivariate anomaly detection finds its importance in diverse applications. Despite the existence of many detectors to solve this problem, one cannot simply define why an obtained anomaly inferred by the detector is anomalous. This reasoning is required for network operators to understand the root cause of the anomaly and the remedial action that should be taken to counteract its occurrence. Existing solutions in explainable AI may give cues to features that influence an anomaly, but they do not formulate generalizable rules that can be assessed by a domain expert. Furthermore, not all outliers are anomalous in a business sense. There is an unfulfilled need for a system that can interpret anomalies predicted by a multivariate anomaly detector and map these patterns to actionable rules. This paper aims to fulfill this need by proposing a semi-autonomous anomaly rule miner. The proposed method is applicable to both discrete and time series data and is tailored for radio access network (RAN) anomaly detection use cases. The proposed method is demonstrated in this paper with time series RAN data.
Authors: Alessandro Hill, Kalen Goo, Puneet Agarwal
Abstract: Prospective students face the challenging task of selecting a university program that will shape their academic and professional careers. For decision-makers and support services, it is often time-consuming and extremely difficult to match personal interests with suitable programs due to the vast and complex catalogue information available. This paper presents the first information system that provides students with efficient recommendations based on both program content and personal preferences. BERTopic, a powerful topic modeling algorithm, is used that leverages text embedding techniques to generate topic representations. It enables us to mine interest topics from all course descriptions, representing the full body of knowledge taught at the institution. Underpinned by the student's individual choice of topics, a shortlist of the most relevant programs is computed through statistical backtracking in the knowledge map, a novel characterization of the program-course relationship. This approach can be applied to a wide range of educational settings, including professional and vocational training. A case study at a post-secondary school with 80 programs and over 5,000 courses shows that the system provides immediate and effective decision support. The presented interest topics are meaningful, leading to positive effects such as serendipity, personalization, and fairness, as revealed by a qualitative study involving 65 students. Over 98% of users indicated that the recommendations aligned with their interests, and about 94% stated they would use the tool in the future. Quantitative analysis shows the system can be configured to ensure fairness, achieving 98% program coverage while maintaining a personalization score of 0.77. These findings suggest that this real-time, user-centered, data-driven system could improve the program selection process.
Authors: Chunjing Xiao, Xue Jiang, Xianghe Du, Wei Yang, Wei Lu, Xiaomin Wang, Kevin Chetty
Abstract: Data imputation is crucial for addressing challenges posed by missing values in multivariate time series data across various fields, such as healthcare, traffic, and economics, and has garnered significant attention. Among various methods, diffusion model-based approaches show notable performance improvements. However, existing methods often cause disharmonious boundaries between missing and known regions and overlook long-range dependencies in missing data estimation, leading to suboptimal results. To address these issues, we propose a Diffusion-based time Series Data Imputation (DSDI) framework. We develop a weight-reducing injection strategy that incorporates the predicted values of missing points with reducing weights into the reverse diffusion process to mitigate boundary inconsistencies. Further, we introduce a multi-scale S4-based U-Net, which combines hierarchical information from different levels via multi-resolution integration to capture long-term dependencies. Experimental results demonstrate that our model outperforms existing imputation methods.
Authors: Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao
Abstract: Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to efficiently scale. Various model parallelism strategies are used in multi-gpu training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve 30% end-to-end wall clock speed up at inference time with TP sharding over 8 devices. We refer the resulting Transformer model as the Ladder Transformer. We train a 1B and 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens.
Authors: Shireesh Reddy Pyreddy, Tarannum Shaila Zaman
Abstract: The widespread adoption of generative AI has generated diverse opinions, with individuals expressing both support and criticism of its applications. This study investigates the emotional dynamics surrounding generative AI by analyzing human tweets referencing terms such as ChatGPT, OpenAI, Copilot, and LLMs. To further understand the emotional intelligence of ChatGPT, we examine its responses to selected tweets, highlighting differences in sentiment between human comments and LLM-generated responses. We introduce EmoXpt, a sentiment analysis framework designed to assess both human perspectives on generative AI and the sentiment embedded in ChatGPT's responses. Unlike prior studies that focus exclusively on human sentiment, EmoXpt uniquely evaluates the emotional expression of ChatGPT. Experimental results demonstrate that LLM-generated responses are notably more efficient, cohesive, and consistently positive than human responses.
Authors: Yilang Zhang, Bingcong Li, Georgios B. Giannakis
Abstract: Targeting solutions over `flat' regions of the loss landscape, sharpness-aware minimization (SAM) has emerged as a powerful tool to improve generalizability of deep neural network based learning. While several SAM variants have been developed to this end, a unifying approach that also guides principled algorithm design has been elusive. This contribution leverages preconditioning (pre) to unify SAM variants and provide not only unifying convergence analysis, but also valuable insights. Building upon preSAM, a novel algorithm termed infoSAM is introduced to address the so-called adversarial model degradation issue in SAM by adjusting gradients depending on noise estimates. Extensive numerical tests demonstrate the superiority of infoSAM across various benchmarks.
Authors: Anyin Zhao, Zuquan Chen, Zhengyu Fang, Xiaoge Zhang, Jing Li
Abstract: Molecular property prediction has attracted substantial attention recently. Accurate prediction of drug properties relies heavily on effective molecular representations. The structures of chemical compounds are commonly represented as graphs or SMILES sequences. Recent advances in learning drug properties commonly employ Graph Neural Networks (GNNs) based on the graph representation. For the SMILES representation, Transformer-based architectures have been adopted by treating each SMILES string as a sequence of tokens. Because each representation has its own advantages and disadvantages, combining both representations in learning drug properties is a promising direction. We propose a method named Dual-Modality Cross-Attention (DMCA) that can effectively combine the strengths of two representations by employing the cross-attention mechanism. DMCA was evaluated across eight datasets including both classification and regression tasks. Results show that our method achieves the best overall performance, highlighting its effectiveness in leveraging the complementary information from both graph and SMILES modalities.
Authors: Meihua Dang, Anikait Singh, Linqi Zhou, Stefano Ermon, Jiaming Song
Abstract: RLHF techniques like DPO can significantly improve the generation quality of text-to-image diffusion models. However, these methods optimize for a single reward that aligns model generation with population-level preferences, neglecting the nuances of individual users' beliefs or values. This lack of personalization limits the efficacy of these models. To bridge this gap, we introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users. Specifically, our approach (1) leverages a vision-language model (VLM) to extract personal preference embeddings from a small set of pairwise preference examples, and then (2) incorporates the embeddings into diffusion models through cross attention. Conditioning on user embeddings, the text-to-image models are fine-tuned with the DPO objective, simultaneously optimizing for alignment with the preferences of multiple users. Empirical results demonstrate that our method effectively optimizes for multiple reward functions and can interpolate between them during inference. In real-world user scenarios, with as few as four preference examples from a new user, our approach achieves an average win rate of 76\% over Stable Cascade, generating images that more accurately reflect specific user preferences.
Authors: Pinak Mandal, Georg A. Gottwald
Abstract: We show how random feature maps can be used to forecast dynamical systems with excellent forecasting skill. We consider the tanh activation function and judiciously choose the internal weights in a data-driven manner such that the resulting features explore the nonlinear, non-saturated regions of the activation function. We introduce skip connections and construct a deep variant of random feature maps by combining several units. To mitigate the curse of dimensionality, we introduce localization where we learn local maps, employing conditional independence. Our modified random feature maps provide excellent forecasting skill for both single trajectory forecasts as well as long-time estimates of statistical properties, for a range of chaotic dynamical systems with dimensions up to 512. In contrast to other methods such as reservoir computers which require extensive hyperparameter tuning, we effectively need to tune only a single hyperparameter, and are able to achieve state-of-the-art forecast skill with much smaller networks.
Authors: Jiayi Tian (Callie), Jinming Lu (Callie), Hai Li (Callie), Xiangwei Wang (Callie), Cong (Callie), Hao, Ian Young, Zheng Zhang
Abstract: Transformer models have achieved state-of-the-art performance across a wide range of machine learning tasks. There is growing interest in training transformers on resource-constrained edge devices due to considerations such as privacy, domain adaptation, and on-device scientific machine learning. However, the significant computational and memory demands required for transformer training often exceed the capabilities of an edge device. Leveraging low-rank tensor compression, this paper presents the first on-FPGA accelerator for end-to-end transformer training. On the algorithm side, we present a bi-directional contraction flow for tensorized transformer training, significantly reducing the computational FLOPS and intra-layer memory costs compared to existing tensor operations. On the hardware side, we store all highly compressed model parameters and gradient information on chip, creating an on-chip-memory-only framework for each stage in training. This reduces off-chip communication and minimizes latency and energy costs. Additionally, we implement custom computing kernels for each training stage and employ intra-layer parallelism and pipe-lining to further enhance run-time and memory efficiency. Through experiments on transformer models within $36.7$ to $93.5$ MB using FP-32 data formats on the ATIS dataset, our tensorized FPGA accelerator could conduct single-batch end-to-end training on the AMD Alevo U50 FPGA, with a memory budget of less than $6$-MB BRAM and $22.5$-MB URAM. Compared to uncompressed training on the NVIDIA RTX 3090 GPU, our on-FPGA training achieves a memory reduction of $30\times$ to $51\times$. Our FPGA accelerator also achieves up to $3.6\times$ less energy cost per epoch compared with tensor Transformer training on an NVIDIA RTX 3090 GPU.
Authors: John Bradshaw, Anji Zhang, Babak Mahjour, David E. Graff, Marwin H. S. Segler, Connor W. Coley
Abstract: Deep learning models for anticipating the products of organic reactions have found many use cases, including validating retrosynthetic pathways and constraining synthesis-based molecular design tools. Despite compelling performance on popular benchmark tasks, strange and erroneous predictions sometimes ensue when using these models in practice. The core issue is that common benchmarks test models in an in-distribution setting, whereas many real-world uses for these models are in out-of-distribution settings and require a greater degree of extrapolation. To better understand how current reaction predictors work in out-of-distribution domains, we report a series of more challenging evaluations of a prototypical SMILES-based deep learning model. First, we illustrate how performance on randomly sampled datasets is overly optimistic compared to performance when generalizing to new patents or new authors. Second, we conduct time splits that evaluate how models perform when tested on reactions published in years after those in their training set, mimicking real-world deployment. Finally, we consider extrapolation across reaction classes to reflect what would be required for the discovery of novel reaction types. This panel of tasks can reveal the capabilities and limitations of today's reaction predictors, acting as a crucial first step in the development of tomorrow's next-generation models capable of reaction discovery.
Authors: Manisha Padala, Lokesh Nagalapatti, Atharv Tyagi, Ramasuri Narayanam, Shiv Kumar Saini
Abstract: We present an unsupervised method for aggregating anomalies in tabular datasets by identifying the top-k tabular data quality insights. Each insight consists of a set of anomalous attributes and the corresponding subsets of records that serve as evidence to the user. The process of identifying these insight blocks is challenging due to (i) the absence of labeled anomalies, (ii) the exponential size of the subset search space, and (iii) the complex dependencies among attributes, which obscure the true sources of anomalies. Simple frequency-based methods fail to capture these dependencies, leading to inaccurate results. To address this, we introduce Tab-Shapley, a cooperative game theory based framework that uses Shapley values to quantify the contribution of each attribute to the data's anomalous nature. While calculating Shapley values typically requires exponential time, we show that our game admits a closed-form solution, making the computation efficient. We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels.
Authors: Tzu-Heng Huang, Manjot Bilkhu, Frederic Sala, Javier Movellan
Abstract: Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook samples' utility in the training process. Instead, we propose a new approach, Mimic Score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic scores to guide model training results in consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic scores and their associated filters improve upon existing filtering methods and offer accurate estimation of dataset quality.
Authors: Xingshuai Huang, Di Wu, Benoit Boulet
Abstract: Decision Transformer (DT), a trajectory modeling method, has shown competitive performance compared to traditional offline reinforcement learning (RL) approaches on various classic control tasks. However, it struggles to learn optimal policies from suboptimal, reward-labeled trajectories. In this study, we explore the use of conditional generative modeling to facilitate trajectory stitching given its high-quality data generation ability. Additionally, recent advancements in Recurrent Neural Networks (RNNs) have shown their linear complexity and competitive sequence modeling performance over Transformers. We leverage the Test-Time Training (TTT) layer, an RNN that updates hidden states during testing, to model trajectories in the form of DT. We introduce a unified framework, called Diffusion-Refined Decision TTT (DRDT3), to achieve performance beyond DT models. Specifically, we propose the Decision TTT (DT3) module, which harnesses the sequence modeling strengths of both self-attention and the TTT layer to capture recent contextual information and make coarse action predictions. We further integrate DT3 with the diffusion model using a unified optimization objective. With experiments on multiple tasks of Gym and AntMaze in the D4RL benchmark, our DT3 model without diffusion refinement demonstrates improved performance over standard DT, while DRDT3 further achieves superior results compared to state-of-the-art conventional offline RL and DT-based methods.
Authors: Noga Bar, Raja Giryes
Abstract: The increasing computational and memory demands in deep learning present significant challenges, especially in resource-constrained environments. We introduce a zero-order quantized optimization (ZOQO) method designed for training models with quantized parameters and operations. Our approach leverages zero-order approximations of the gradient sign and adapts the learning process to maintain the parameters' quantization without the need for full-precision gradient calculations. We demonstrate the effectiveness of ZOQO through experiments in fine-tuning of large language models and black-box adversarial attacks. Despite the limitations of zero-order and quantized operations training, our method achieves competitive performance compared to full-precision methods, highlighting its potential for low-resource environments.
Authors: Ziming Wang, Changwu Huang, Ke Tang, Xin Yao
Abstract: Fairness in machine learning (ML) has garnered significant attention in recent years. While existing research has predominantly focused on the distributive fairness of ML models, there has been limited exploration of procedural fairness. This paper proposes a novel method to achieve procedural fairness during the model training phase. The effectiveness of the proposed method is validated through experiments conducted on one synthetic and six real-world datasets. Additionally, this work studies the relationship between procedural fairness and distributive fairness in ML models. On one hand, the impact of dataset bias and the procedural fairness of ML model on its distributive fairness is examined. The results highlight a significant influence of both dataset bias and procedural fairness on distributive fairness. On the other hand, the distinctions between optimizing procedural and distributive fairness metrics are analyzed. Experimental results demonstrate that optimizing procedural fairness metrics mitigates biases introduced or amplified by the decision-making process, thereby ensuring fairness in the decision-making process itself, as well as improving distributive fairness. In contrast, optimizing distributive fairness metrics encourages the ML model's decision-making process to favor disadvantaged groups, counterbalancing the inherent preferences for advantaged groups present in the dataset and ultimately achieving distributive fairness.
Authors: Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu, Guanjun Li
Abstract: Multimodal fake news detection is essential for maintaining the authenticity of Internet multimedia information. Significant differences in form and content of multimodal information lead to intensified optimization conflicts, hindering effective model training as well as reducing the effectiveness of existing fusion methods for bimodal. To address this problem, we propose the MTPareto framework to optimize multimodal fusion, using a Targeted Pareto(TPareto) optimization algorithm for fusion-level-specific objective learning with a certain focus. Based on the designed hierarchical fusion network, the algorithm defines three fusion levels with corresponding losses and implements all-modal-oriented Pareto gradient integration for each. This approach accomplishes superior multimodal fusion by utilizing the information obtained from intermediate fusion to provide positive effects to the entire process. Experiment results on FakeSV and FVC datasets show that the proposed framework outperforms baselines and the TPareto optimization algorithm achieves 2.40% and 1.89% accuracy improvement respectively.
Authors: Erlong Liu, Yu-Chang Wu, Xiaobin Huang, Chengrui Gao, Ren-Jian Wang, Ke Xue, Chao Qian
Abstract: Multi-objective decision-making problems have emerged in numerous real-world scenarios, such as video games, navigation and robotics. Considering the clear advantages of Reinforcement Learning (RL) in optimizing decision-making processes, researchers have delved into the development of Multi-Objective RL (MORL) methods for solving multi-objective decision problems. However, previous methods either cannot obtain the entire Pareto front, or employ only a single policy network for all the preferences over multiple objectives, which may not produce personalized solutions for each preference. To address these limitations, we propose a novel decomposition-based framework for MORL, Pareto Set Learning for MORL (PSL-MORL), that harnesses the generation capability of hypernetwork to produce the parameters of the policy network for each decomposition weight, generating relatively distinct policies for various scalarized subproblems with high efficiency. PSL-MORL is a general framework, which is compatible for any RL algorithm. The theoretical result guarantees the superiority of the model capacity of PSL-MORL and the optimality of the obtained policy network. Through extensive experiments on diverse benchmarks, we demonstrate the effectiveness of PSL-MORL in achieving dense coverage of the Pareto front, significantly outperforming state-of-the-art MORL methods in the hypervolume and sparsity indicators.
Authors: Tareque Mohmud Chowdhury, Farzana Tabassum, Sabrina Islam, Abu Raihan Mostofa Kamal
Abstract: Accurately identifying cancer samples is crucial for precise diagnosis and effective patient treatment. Traditional methods falter with high-dimensional and high feature-to-sample count ratios, which are critical for classifying cancer samples. This study aims to develop a novel feature selection framework specifically for transcriptome data and propose two ensemble classifiers. For feature selection, we partition the transcriptome dataset vertically based on feature types. Then apply the Boruta feature selection process on each of the partitions, combine the results, and apply Boruta again on the combined result. We repeat the process with different parameters of Boruta and prepare the final feature set. Finally, we constructed two ensemble ML models based on LR, SVM and XGBoost classifiers with max voting and averaging probability approach. We used 10-fold cross-validation to ensure robust and reliable classification performance. With 97.11\% accuracy and 0.9996 AUC value, our approach performs better compared to existing state-of-the-art methods to classify 33 types of cancers. A set of 12 types of cancer is traditionally challenging to differentiate between each other due to their similarity in tissue of origin. Our method accurately identifies over 90\% of samples from these 12 types of cancers, which outperforms all known methods presented in existing literature. The gene set enrichment analysis reveals that our framework's selected features have enriched the pathways highly related to cancers. This study develops a feature selection framework to select features highly related to cancer development and leads to identifying different types of cancer samples with higher accuracy.
Authors: Yiqing Zhang, Xiaozhong Liu, Fabricio Murai
Abstract: Clinical trials are the gold standard for assessing the effectiveness and safety of drugs for treating diseases. Given the vast design space of drug molecules, elevated financial cost, and multi-year timeline of these trials, research on clinical trial outcome prediction has gained immense traction. Accurate predictions must leverage data of diverse modes such as drug molecules, target diseases, and eligibility criteria to infer successes and failures. Previous Deep Learning approaches for this task, such as HINT, often require wet lab data from synthesized molecules and/or rely on prior knowledge to encode interactions as part of the model architecture. To address these limitations, we propose a light-weight attention-based model, MEXA-CTP, to integrate readily-available multi-modal data and generate effective representations via specialized modules dubbed "mode experts", while avoiding human biases in model design. We optimize MEXA-CTP with the Cauchy loss to capture relevant interactions across modes. Our experiments on the Trial Outcome Prediction (TOP) benchmark demonstrate that MEXA-CTP improves upon existing approaches by, respectively, up to 11.3% in F1 score, 12.2% in PR-AUC, and 2.5% in ROC-AUC, compared to HINT. Ablation studies are provided to quantify the effectiveness of each component in our proposed method.
Authors: Ruoyu Sun, Yue Xi, Angelos Stefanidis, Zhengyong Jiang, Jionglong Su
Abstract: Deep Reinforcement Learning (DRL) has been extensively used to address portfolio optimization problems. The DRL agents acquire knowledge and make decisions through unsupervised interactions with their environment without requiring explicit knowledge of the joint dynamics of portfolio assets. Among these DRL algorithms, the combination of actor-critic algorithms and deep function approximators is the most widely used DRL algorithm. Here, we find that training the DRL agent using the actor-critic algorithm and deep function approximators may lead to scenarios where the improvement in the DRL agent's risk-adjusted profitability is not significant. We propose that such situations primarily arise from the following two problems: sparsity in positive reward and the curse of dimensionality. These limitations prevent DRL agents from comprehensively learning asset price change patterns in the training environment. As a result, the DRL agents cannot explore the dynamic portfolio optimization policy to improve the risk-adjusted profitability in the training process. To address these problems, we propose a novel multi-agent Hierarchical Deep Reinforcement Learning (HDRL) algorithmic framework in this research. Under this framework, the agents work together as a learning system for portfolio optimization. Specifically, by designing an auxiliary agent that works together with the executive agent for optimal policy exploration, the learning system can focus on exploring the policy with higher risk-adjusted return in the action space with positive return and low variance. In this way, we can overcome the issue of the curse of dimensionality and improve the training efficiency in the positive reward sparse environment.
Authors: Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu
Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to $1000\times$ larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset SPAM, a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B, (2) 4-bit LLM pre-training,(3) reinforcement learning, and (4) Time Series Forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git
Authors: Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath
Abstract: Diffusion models produce impressive results in modalities ranging from images and video to protein design and text. However, generating samples with user-specified properties remains a challenge. Recent research proposes fine-tuning models to maximize rewards that capture desired properties, but these methods require expensive training and are prone to mode collapse. In this work, we propose Feynman Kac (FK) steering, an inference-time framework for steering diffusion models with reward functions. FK steering works by sampling a system of multiple interacting diffusion processes, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are selected such that a high value indicates that the particle will yield a high-reward sample. We explore various choices of potentials, intermediate rewards, and samplers. We evaluate FK steering on text-to-image and text diffusion models. For steering text-to-image models with a human preference reward, we find that FK steering a 0.8B parameter model outperforms a 2.6B parameter fine-tuned model on prompt fidelity, with faster sampling and no training. For steering text diffusion models with rewards for text quality and specific text attributes, we find that FK steering generates lower perplexity, more linguistically acceptable outputs and enables gradient-free control of attributes like toxicity. Our results demonstrate that inference-time scaling and steering of diffusion models, even with off-the-shelf rewards, can provide significant sample quality gains and controllability benefits. Code is available at https://github.com/zacharyhorvitz/Fk-Diffusion-Steering .
URLs: https://github.com/zacharyhorvitz/Fk-Diffusion-Steering
Authors: Shourav B. Rabbani, Ibna Kowsar, Manar D. Samad
Abstract: Despite the artificial intelligence (AI) revolution, deep learning has yet to achieve much success with tabular data due to heterogeneous feature space and limited sample sizes without viable transfer learning. The new era of generative AI, powered by large language models (LLM), brings unprecedented learning opportunities to diverse data and domains. This paper investigates the effectiveness of an LLM application programming interface (API) and transfer learning of LLM in tabular data classification. LLM APIs respond to input text prompts with tokenized data and instructions, whereas transfer learning finetunes an LLM for a target classification task. This paper proposes an end-to-end finetuning of LLM to demonstrate cross-data transfer learning on ten benchmark data sets when large pre-trained tabular data models do not exist to facilitate transfer learning. The proposed LLM finetuning method outperforms state-of-the-art machine and deep learning methods on tabular data with less than ten features - a standard feature size for tabular data sets. The transfer learning approach uses a fraction of the computational cost of other deep learning or API-based solutions while ensuring competitive or superior classification performance.
Authors: Timo Saala, Matthias Schott
Abstract: Deep learning techniques have evolved rapidly in recent years, significantly impacting various scientific fields, including experimental particle physics. To effectively leverage the latest developments in computer science for particle physics, a strengthened collaboration between computer scientists and physicists is essential. As all machine learning techniques depend on the availability and comprehensibility of extensive data, clear data descriptions and commonly used data formats are prerequisites for successful collaboration. In this study, we converted open data from the Large Hadron Collider, recorded in the ROOT data format commonly used in high-energy physics, to pandas DataFrames, a well-known format in computer science. Additionally, we provide a brief introduction to the data's content and interpretation. This paper aims to serve as a starting point for future interdisciplinary collaborations between computer scientists and physicists, fostering closer ties and facilitating efficient knowledge exchange.
Authors: Jimeng Shi, Azam Shirali, Bowen Jin, Sizhe Zhou, Wei Hu, Rahuul Rangaraj, Shaowen Wang, Jiawei Han, Zhaonan Wang, Upmanu Lall, Yanzhao Wu, Leonardo Bobadilla, Giri Narasimhan
Abstract: Physics-based numerical models have been the bedrock of atmospheric sciences for decades, offering robust solutions but often at the cost of significant computational resources. Deep learning (DL) models have emerged as powerful tools in meteorology, capable of analyzing complex weather and climate data by learning intricate dependencies and providing rapid predictions once trained. While these models demonstrate promising performance in weather prediction, often surpassing traditional physics-based methods, they still face critical challenges. This paper presents a comprehensive survey of recent deep learning and foundation models for weather prediction. We propose a taxonomy to classify existing models based on their training paradigms: deterministic predictive learning, probabilistic generative learning, and pre-training and fine-tuning. For each paradigm, we delve into the underlying model architectures, address major challenges, offer key insights, and propose targeted directions for future research. Furthermore, we explore real-world applications of these methods and provide a curated summary of open-source code repositories and widely used datasets, aiming to bridge research advancements with practical implementations while fostering open and trustworthy scientific practices in adopting cutting-edge artificial intelligence for weather prediction. The related sources are available at https://github.com/JimengShi/ DL-Foundation-Models-Weather.
Authors: Makoto Otsuka, Kento Kodama, Keisuke Morita, Masayuki Ohzeki
Abstract: This study proposes an approach for removing mislabeled instances from contaminated training datasets by combining surrogate model-based black-box optimization (BBO) with postprocessing and quantum annealing. Mislabeled training instances, a common issue in real-world datasets, often degrade model generalization, necessitating robust and efficient noise-removal strategies. The proposed method evaluates filtered training subsets based on validation loss, iteratively refines loss estimates through surrogate model-based BBO with postprocessing, and leverages quantum annealing to efficiently sample diverse training subsets with low validation error. Experiments on a noisy majority bit task demonstrate the method's ability to prioritize the removal of high-risk mislabeled instances. Integrating D-Wave's clique sampler running on a physical quantum annealer achieves faster optimization and higher-quality training subsets compared to OpenJij's simulated quantum annealing sampler or Neal's simulated annealing sampler, offering a scalable framework for enhancing dataset quality. This work highlights the effectiveness of the proposed method for supervised learning tasks, with future directions including its application to unsupervised learning, real-world datasets, and large-scale implementations.
Authors: Paulo Akira F. Enabe, Rodrigo Provasi
Abstract: A hybrid framework integrating the Virtual Element Method (VEM) with deep learning is presented as an initial step toward developing efficient and flexible numerical models for one-dimensional Euler-Bernoulli beams. The primary aim is to explore a data-driven surrogate model capable of predicting displacement fields across varying material and geometric parameters while maintaining computational efficiency. Building upon VEM's ability to handle higher-order polynomials and non-conforming discretizations, the method offers a robust numerical foundation for structural mechanics. A neural network architecture is introduced to separately process nodal and material-specific data, effectively capturing complex interactions with minimal reliance on large datasets. To address challenges in training, the model incorporates Sobolev training and GradNorm techniques, ensuring balanced loss contributions and enhanced generalization. While this framework is in its early stages, it demonstrates the potential for further refinement and development into a scalable alternative to traditional methods. The proposed approach lays the groundwork for advancing numerical and data-driven techniques in beam modeling, offering a foundation for future research in structural mechanics.
Authors: J. Antonio Lara Benitez, Junyi Guo, Kareem Hegazy, Ivan Dokmani\'c, Michael W. Mahoney, Maarten V. de Hoop
Abstract: We introduce Neural Discrete Equilibrium (NeurDE), a machine learning (ML) approach for long-term forecasting of flow phenomena that relies on a "lifting" of physical conservation laws into the framework of kinetic theory. The kinetic formulation provides an excellent structure for ML algorithms by separating nonlinear, non-local physics into a nonlinear but local relaxation to equilibrium and a linear non-local transport. This separation allows the ML to focus on the local nonlinear components while addressing the simpler linear transport with efficient classical numerical algorithms. To accomplish this, we design an operator network that maps macroscopic observables to equilibrium states in a manner that maximizes entropy, yielding expressive BGK-type collisions. By incorporating our surrogate equilibrium into the lattice Boltzmann (LB) algorithm, we achieve accurate flow forecasts for a wide range of challenging flows. We show that NeurDE enables accurate prediction of compressible flows, including supersonic flows, while tracking shocks over hundreds of time steps, using a small velocity lattice-a heretofore unattainable feat without expensive numerical root finding.
Authors: Vladimir Ja\'cimovi\'c
Abstract: Embedding the data in hyperbolic spaces can preserve complex relationships in very few dimensions, thus enabling compact models and improving efficiency of machine learning (ML) algorithms. The underlying idea is that hyperbolic representations can prevent the loss of important structural information for certain ubiquitous types of data. However, further advances in hyperbolic ML require more principled mathematical approaches and adequate geometric methods. The present study aims at enhancing mathematical foundations of hyperbolic ML by combining group-theoretic and conformal-geometric arguments with optimization and statistical techniques. Precisely, we introduce the notion of the mean (barycenter) and the novel family of probability distributions on hyperbolic balls. We further propose efficient optimization algorithms for computation of the barycenter and for maximum likelihood estimation. One can build upon basic concepts presented here in order to design more demanding algorithms and implement hyperbolic deep learning pipelines.
Authors: Vouk Praun-Petrovic, Aadhvika Koundinya, Lavanya Prahallad
Abstract: Generative AI, powered by large language models (LLMs), has revolutionized applications across text, audio, images, and video. This study focuses on developing and evaluating encoder-decoder architectures for the American Sign Language (ASL) image dataset, consisting of 87,000 images across 29 hand sign classes. Three approaches were compared: Feedforward Autoencoders, Convolutional Autoencoders, and Diffusion Autoencoders. The Diffusion Autoencoder outperformed the others, achieving the lowest mean squared error (MSE) and highest Mean Opinion Score (MOS) due to its probabilistic noise modeling and iterative denoising capabilities. The Convolutional Autoencoder demonstrated effective spatial feature extraction but lacked the robustness of the diffusion process, while the Feedforward Autoencoder served as a baseline with limitations in handling complex image data. Objective and subjective evaluations confirmed the superiority of the Diffusion Autoencoder for high-fidelity image reconstruction, emphasizing its potential in multimodal AI applications such as sign language recognition and generation. This work provides critical insights into designing robust encoder-decoder systems to advance multimodal AI capabilities.
Authors: Shiyun Xu, Zhiqi Bu, Yiliang Zhang, Ian Barnett
Abstract: Differential learning rate (DLR), a technique that applies different learning rates to different model parameters, has been widely used in deep learning and achieved empirical success via its various forms. For example, parameter-efficient fine-tuning (PEFT) applies zero learning rates to most parameters so as to significantly save the computational cost. At the core, DLR leverages the observation that different parameters can have different loss curvature, which is hard to characterize in general. We propose the Hessian-informed differential learning rate (Hi-DLR), an efficient approach that solves the hyperparameter optimization (HPO) of learning rates and captures the loss curvature for any model and optimizer adaptively. Given a proper grouping of parameters, we empirically demonstrate that Hi-DLR can improve the convergence by dynamically determining the learning rates during the training. Furthermore, we can quantify the influence of different parameters and freeze the less-contributing parameters, which leads to a new PEFT that automatically adapts to various tasks and models. Additionally, Hi-DLR also exhibits comparable performance on various full model training tasks.
Authors: Ratneel Deo, Scott Sisson, Jody M. Webster, Rohitash Chandra
Abstract: Bayesian Neural Networks (BNNs) offer robust uncertainty quantification in model predictions, but training them presents a significant computational challenge. This is mainly due to the problem of sampling multimodal posterior distributions using Markov Chain Monte Carlo (MCMC) sampling and variational inference algorithms. Moreover, the number of model parameters scales exponentially with additional hidden layers, neurons, and features in the dataset. Typically, a significant portion of these densely connected parameters are redundant and pruning a neural network not only improves portability but also has the potential for better generalisation capabilities. In this study, we address some of the challenges by leveraging MCMC sampling with network pruning to obtain compact probabilistic models having removed redundant parameters. We sample the posterior distribution of model parameters (weights and biases) and prune weights with low importance, resulting in a compact model. We ensure that the compact BNN retains its ability to estimate uncertainty via the posterior distribution while retaining the model training and generalisation performance accuracy by adapting post-pruning resampling. We evaluate the effectiveness of our MCMC pruning strategy on selected benchmark datasets for regression and classification problems through empirical result analysis. We also consider two coral reef drill-core lithology classification datasets to test the robustness of the pruning model in complex real-world datasets. We further investigate if refining compact BNN can retain any loss of performance. Our results demonstrate the feasibility of training and pruning BNNs using MCMC whilst retaining generalisation performance with over 75% reduction in network size. This paves the way for developing compact BNN models that provide uncertainty estimates for real-world applications.
Authors: Muhammad Umair Danish, Katarina Grolinger
Abstract: Load forecasting plays a crucial role in energy management, directly impacting grid stability, operational efficiency, cost reduction, and environmental sustainability. Traditional Vanilla Recurrent Neural Networks (RNNs) face issues such as vanishing and exploding gradients, whereas sophisticated RNNs such as LSTMs have shown considerable success in this domain. However, these models often struggle to accurately capture complex and sudden variations in energy consumption, and their applicability is typically limited to specific consumer types, such as offices or schools. To address these challenges, this paper proposes the Kolmogorov-Arnold Recurrent Network (KARN), a novel load forecasting approach that combines the flexibility of Kolmogorov-Arnold Networks with RNN's temporal modeling capabilities. KARN utilizes learnable temporal spline functions and edge-based activations to better model non-linear relationships in load data, making it adaptable across a diverse range of consumer types. The proposed KARN model was rigorously evaluated on a variety of real-world datasets, including student residences, detached homes, a home with electric vehicle charging, a townhouse, and industrial buildings. Across all these consumer categories, KARN consistently outperformed traditional Vanilla RNNs, while it surpassed LSTM and Gated Recurrent Units (GRUs) in six buildings. The results demonstrate KARN's superior accuracy and applicability, making it a promising tool for enhancing load forecasting in diverse energy management scenarios.
Authors: Karine Karine, Benjamin M. Marlin
Abstract: Reinforcement learning (RL) is increasingly being used in the healthcare domain, particularly for the development of personalized health adaptive interventions. Inspired by the success of Large Language Models (LLMs), we are interested in using LLMs to update the RL policy in real time, with the goal of accelerating personalization. We use the text-based user preference to influence the action selection on the fly, in order to immediately incorporate the user preference. We use the term "user preference" as a broad term to refer to a user personal preference, constraint, health status, or a statement expressing like or dislike, etc. Our novel approach is a hybrid method that combines the LLM response and the RL action selection to improve the RL policy. Given an LLM prompt that incorporates the user preference, the LLM acts as a filter in the typical RL action selection. We investigate different prompting strategies and action selection strategies. To evaluate our approach, we implement a simulation environment that generates the text-based user preferences and models the constraints that impact behavioral dynamics. We show that our approach is able to take into account the text-based user preferences, while improving the RL policy, thus improving personalization in adaptive intervention.
Authors: Henry Li, Ronen Basri, Yuval Kluger
Abstract: Cascaded models are multi-scale generative models with a marked capacity for producing perceptually impressive samples at high resolutions. In this work, we show that they can also be excellent likelihood models, so long as we overcome a fundamental difficulty with probabilistic multi-scale models: the intractability of the likelihood function. Chiefly, in cascaded models each intermediary scale introduces extraneous variables that cannot be tractably marginalized out for likelihood evaluation. This issue vanishes by modeling the diffusion process on latent spaces induced by a class of transformations we call hierarchical volume-preserving maps, which decompose spatially structured data in a hierarchical fashion without introducing local distortions in the latent space. We demonstrate that two such maps are well-known in the literature for multiscale modeling: Laplacian pyramids and wavelet transforms. Not only do such reparameterizations allow the likelihood function to be directly expressed as a joint likelihood over the scales, we show that the Laplacian pyramid and wavelet transform also produces significant improvements to the state-of-the-art on a selection of benchmarks in likelihood modeling, including density estimation, lossless compression, and out-of-distribution detection. Investigating the theoretical basis of our empirical gains we uncover deep connections to score matching under the Earth Mover's Distance (EMD), which is a well-known surrogate for perceptual similarity. Code can be found at \href{https://github.com/lihenryhfl/pcdm}{this https url}.
Authors: Karishma Thakrar, Jiangqin Ma, Max Diamond, Akash Patel
Abstract: Predicting the impact of single-point amino acid mutations on protein stability is essential for understanding disease mechanisms and advancing drug development. Protein stability, quantified by changes in Gibbs free energy ($\Delta\Delta G$), is influenced by these mutations. However, the scarcity of data and the complexity of model interpretation pose challenges in accurately predicting stability changes. This study proposes the application of deep neural networks, leveraging transfer learning and fusing complementary information from different models, to create a feature-rich representation of the protein stability landscape. We developed four models, with our third model, ThermoMPNN+, demonstrating the best performance in predicting $\Delta\Delta G$ values. This approach, which integrates diverse feature sets and embeddings through latent transfusion techniques, aims to refine $\Delta\Delta G$ predictions and contribute to a deeper understanding of protein dynamics, potentially leading to advancements in disease research and drug discovery.
Authors: Weixin Chen, Simon Yu, Huajie Shao, Lui Sha, Han Zhao
Abstract: End-to-end deep neural networks have achieved remarkable success across various domains but are often criticized for their lack of interpretability. While post hoc explanation methods attempt to address this issue, they often fail to accurately represent these black-box models, resulting in misleading or incomplete explanations. To overcome these challenges, we propose an inherently transparent model architecture called Neural Probabilistic Circuits (NPCs), which enable compositional and interpretable predictions through logical reasoning. In particular, an NPC consists of two modules: an attribute recognition model, which predicts probabilities for various attributes, and a task predictor built on a probabilistic circuit, which enables logical reasoning over recognized attributes to make class predictions. To train NPCs, we introduce a three-stage training algorithm comprising attribute recognition, circuit construction, and joint optimization. Moreover, we theoretically demonstrate that an NPC's error is upper-bounded by a linear combination of the errors from its modules. To further demonstrate the interpretability of NPC, we provide both the most probable explanations and the counterfactual explanations. Empirical results on four benchmark datasets show that NPCs strike a balance between interpretability and performance, achieving results competitive even with those of end-to-end black-box models while providing enhanced interpretability.
Authors: Hoang-Thang Ta, Duy-Quy Thai, Anh Tran, Grigori Sidorov, Alexander Gelbukh
Abstract: Kolmogorov-Arnold Networks (KANs) represent an innovation in neural network architectures, offering a compelling alternative to Multi-Layer Perceptrons (MLPs) in models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. By advancing network design, KANs are driving groundbreaking research and enabling transformative applications across various scientific domains involving neural networks. However, existing KANs often require significantly more parameters in their network layers compared to MLPs. To address this limitation, this paper introduces PRKANs (\textbf{P}arameter-\textbf{R}educed \textbf{K}olmogorov-\textbf{A}rnold \textbf{N}etworks), which employ several methods to reduce the parameter count in KAN layers, making them comparable to MLP layers. Experimental results on the MNIST and Fashion-MNIST datasets demonstrate that PRKANs with attention mechanisms outperform several existing KANs and rival the performance of MLPs, albeit with slightly longer training times. Furthermore, the study highlights the advantages of Gaussian Radial Basis Functions (GRBFs) and layer normalization in KAN designs. The repository for this work is available at: \url{https://github.com/hoangthangta/All-KAN}.
Authors: Zong Ke, Shicheng Zhou, Yining Zhou, Chia Hong Chang, Rong Zhang
Abstract: This study explores the use of Generative Adversarial Networks (GANs) to detect AI deepfakes and fraudulent activities in online payment systems. With the growing prevalence of deepfake technology, which can manipulate facial features in images and videos, the potential for fraud in online transactions has escalated. Traditional security systems struggle to identify these sophisticated forms of fraud. This research proposes a novel GAN-based model that enhances online payment security by identifying subtle manipulations in payment images. The model is trained on a dataset consisting of real-world online payment images and deepfake images generated using advanced GAN architectures, such as StyleGAN and DeepFake. The results demonstrate that the proposed model can accurately distinguish between legitimate transactions and deepfakes, achieving a high detection rate above 95%. This approach significantly improves the robustness of payment systems against AI-driven fraud. The paper contributes to the growing field of digital security, offering insights into the application of GANs for fraud detection in financial services. Keywords- Payment Security, Image Recognition, Generative Adversarial Networks, AI Deepfake, Fraudulent Activities
Authors: Luwei Zeng, Runze Yan
Abstract: Modeling car-following behavior is essential for traffic simulation, analyzing driving patterns, and understanding complex traffic flows with varying levels of autonomous vehicles. Traditional models like the Safe Distance Model and Intelligent Driver Model (IDM) require precise parameter calibration and often lack generality due to simplified assumptions about driver behavior. While machine learning and deep learning methods capture complex patterns, they require large labeled datasets. Foundation models provide a more efficient alternative. Pre-trained on vast, diverse time series datasets, they can be applied directly to various tasks without the need for extensive re-training. These models generalize well across domains, and with minimal fine-tuning, they can be adapted to specific tasks like car-following behavior prediction. In this paper, we apply Chronos, a state-of-the-art public time series foundation model, to analyze car-following behavior using the Open ACC dataset. Without fine-tuning, Chronos outperforms traditional models like IDM and Exponential smoothing with trend and seasonality (ETS), and achieves similar results to deep learning models such as DeepAR and TFT, with an RMSE of 0.60. After fine-tuning, Chronos reduces the error to an RMSE of 0.53, representing a 33.75% improvement over IDM and a 12-37% reduction compared to machine learning models like ETS and deep learning models including DeepAR, WaveNet, and TFT. This demonstrates the potential of foundation models to significantly advance transportation research, offering a scalable, adaptable, and highly accurate approach to predicting and simulating car-following behaviors.
Authors: Botao Zhao, Xiaoyang Qu, Zuheng Kang, Junqing Peng, Jing Xiao, Jianzong Wang
Abstract: In deep regression, capturing the relationship among continuous labels in feature space is a fundamental challenge that has attracted increasing interest. Addressing this issue can prevent models from converging to suboptimal solutions across various regression tasks, leading to improved performance, especially for imbalanced regression and under limited sample sizes. However, existing approaches often rely on order-aware representation learning or distance-based weighting. In this paper, we hypothesize a linear negative correlation between label distances and representation similarities in regression tasks. To implement this, we propose an angle-compensated contrastive regularizer for deep regression, which adjusts the cosine distance between anchor and negative samples within the contrastive learning framework. Our method offers a plug-and-play compatible solution that extends most existing contrastive learning methods for regression tasks. Extensive experiments and theoretical analysis demonstrate that our proposed angle-compensated contrastive regularizer not only achieves competitive regression performance but also excels in data efficiency and effectiveness on imbalanced datasets.
Authors: Zhejun Zhang, Yuanping Chen, Shibing Chu
Abstract: Understanding and predicting the diverse conformational states of molecules is crucial for advancing fields such as chemistry, material science, and drug development. Despite significant progress in generative models, accurately generating complex and biologically or material-relevant molecular structures remains a major challenge. In this work, we introduce a diffusion model for three-dimensional (3D) molecule generation that combines a classifiable diffusion model, Diffusion Transformer, with multihead equivariant self-attention. This method addresses two key challenges: correctly attaching hydrogen atoms in generated molecules through learning representations of molecules after hydrogen atoms are removed; and overcoming the limitations of existing models that cannot generate molecules across multiple classes simultaneously. The experimental results demonstrate that our model not only achieves state-of-the-art performance across several key metrics but also exhibits robustness and versatility, making it highly suitable for early-stage large-scale generation processes in molecular design, followed by validation and further screening to obtain molecules with specific properties.
Authors: Zhengzhong Liu, Bowen Tan, Hongyi Wang, Willie Neiswanger, Tianhua Tao, Haonan Li, Fajri Koto, Yuqi Wang, Suqi Sun, Omkar Pangarkar, Richard Fan, Yi Gu, Victor Miller, Liqun Ma, Liping Tang, Nikhil Ranjan, Yonghao Zhuang, Guowei He, Renxi Wang, Mingkai Deng, Robin Algayres, Yuanzhi Li, Zhiqiang Shen, Preslav Nakov, Eric Xing
Abstract: We detail the training of the LLM360 K2-65B model, scaling up our 360-degree OPEN SOURCE approach to the largest and most powerful models under project LLM360. While open-source LLMs continue to advance, the answer to "How are the largest LLMs trained?" remains unclear within the community. The implementation details for such high-capacity models are often protected due to business considerations associated with their high cost. This lack of transparency prevents LLM researchers from leveraging valuable insights from prior experience, e.g., "What are the best practices for addressing loss spikes?" The LLM360 K2 project addresses this gap by providing full transparency and access to resources accumulated during the training of LLMs at the largest scale. This report highlights key elements of the K2 project, including our first model, K2 DIAMOND, a 65 billion-parameter LLM that surpasses LLaMA-65B and rivals LLaMA2-70B, while requiring fewer FLOPs and tokens. We detail the implementation steps and present a longitudinal analysis of K2 DIAMOND's capabilities throughout its training process. We also outline ongoing projects such as TXT360, setting the stage for future models in the series. By offering previously unavailable resources, the K2 project also resonates with the 360-degree OPEN SOURCE principles of transparency, reproducibility, and accessibility, which we believe are vital in the era of resource-intensive AI research.
Authors: Chenyang Qi, Huiping Li, Panfeng Huang
Abstract: In recent years, meta-reinforcement learning (meta-RL) algorithm has been proposed to improve sample efficiency in the field of decision-making and control, enabling agents to learn new knowledge from a small number of samples. However, most research uses the Gaussian distribution to extract task representation, which is poorly adapted to tasks that change in non-stationary environment. To address this problem, we propose a novel meta-reinforcement learning method by leveraging Gaussian mixture model and the transformer network to construct task inference model. The Gaussian mixture model is utilized to extend the task representation and conduct explicit encoding of tasks. Specifically, the classification of tasks is encoded through transformer network to determine the Gaussian component corresponding to the task. By leveraging task labels, the transformer network is trained using supervised learning. We validate our method on MuJoCo benchmarks with non-stationary and multi-task environments. Experimental results demonstrate that the proposed method dramatically improves sample efficiency and accurately recognizes the classification of the tasks, while performing excellently in the environment.
Authors: Bangchen Yin, Jiaao Wang, Weitao Du, Pengbo Wang, Penghua Ying, Haojun Jia, Zisheng Zhang, Yuanqi Du, Carla P. Gomes, Chenru Duan, Hai Xiao, Graeme Henkelman
Abstract: We present AlphaNet, a local frame-based equivariant model designed to achieve both accurate and efficient simulations for atomistic systems. Recently, machine learning force fields (MLFFs) have gained prominence in molecular dynamics simulations due to their advantageous efficiency-accuracy balance compared to classical force fields and quantum mechanical calculations, alongside their transferability across various systems. Despite the advancements in improving model accuracy, the efficiency and scalability of MLFFs remain significant obstacles in practical applications. AlphaNet enhances computational efficiency and accuracy by leveraging the local geometric structures of atomic environments through the construction of equivariant local frames and learnable frame transitions. We substantiate the efficacy of AlphaNet across diverse datasets, including defected graphene, formate decomposition, zeolites, and surface reactions. AlphaNet consistently surpasses well-established models, such as NequIP and DeepPot, in terms of both energy and force prediction accuracy. Notably, AlphaNet offers one of the best trade-offs between computational efficiency and accuracy among existing models. Moreover, AlphaNet exhibits scalability across a broad spectrum of system and dataset sizes, affirming its versatility.
Authors: Ferdinand Rewicki, Joachim Denzler, Julia Niebling
Abstract: Detecting and classifying abnormal system states is critical for condition monitoring, but supervised methods often fall short due to the rarity of anomalies and the lack of labeled data. Therefore, clustering is often used to group similar abnormal behavior. However, evaluating cluster quality without ground truth is challenging, as existing measures such as the Silhouette Score (SSC) only evaluate the cohesion and separation of clusters and ignore possible prior knowledge about the data. To address this challenge, we introduce the Synchronized Anomaly Agreement Index (SAAI), which exploits the synchronicity of anomalies across multivariate time series to assess cluster quality. We demonstrate the effectiveness of SAAI by showing that maximizing SAAI improves accuracy on the task of finding the true number of anomaly classes K in correlated time series by 0.23 compared to SSC and by 0.32 compared to X-Means. We also show that clusters obtained by maximizing SAAI are easier to interpret compared to SSC.
Authors: Mohammadreza Kavianpour, Parisa Kavianpour, Amin Ramezani, Mohammad TH Beheshti
Abstract: Bearing fault diagnosis under varying working conditions faces challenges, including a lack of labeled data, distribution discrepancies, and resource constraints. To address these issues, we propose a progressive knowledge distillation framework that transfers knowledge from a complex teacher model, utilizing a Graph Convolutional Network (GCN) with Autoregressive moving average (ARMA) filters, to a compact and efficient student model. To mitigate distribution discrepancies and labeling uncertainty, we introduce Enhanced Local Maximum Mean Squared Discrepancy (ELMMSD), which leverages mean and variance statistics in the Reproducing Kernel Hilbert Space (RKHS) and incorporates a priori probability distributions between labels. This approach increases the distance between clustering centers, bridges subdomain gaps, and enhances subdomain alignment reliability. Experimental results on benchmark datasets (CWRU and JNU) demonstrate that the proposed method achieves superior diagnostic accuracy while significantly reducing computational costs. Comprehensive ablation studies validate the effectiveness of each component, highlighting the robustness and adaptability of the approach across diverse working conditions.
Authors: Matthijs de Jong, Jan Viebahn, Yuliya Shapovalova
Abstract: The energy transition necessitates new congestion management methods. One such method is controlling the grid topology with machine learning (ML). This approach has gained popularity following the Learning to Run a Power Network (L2RPN) competitions. Graph neural networks (GNNs) are a class of ML models that reflect graph structure in their computation, which makes them suitable for power grid modeling. Various GNN approaches for topology control have thus been proposed. We propose the first GNN model for grid topology control that uses only GNN layers. Additionally, we identify the busbar information asymmetry problem that the popular homogeneous graph representation suffers from, and propose a heterogeneous graph representation to resolve it. We train both homogeneous and heterogeneous GNNs and fully connected neural networks (FCNN) baselines on an imitation learning task. We evaluate the models according to their classification accuracy and grid operation ability. We find that the heterogeneous GNNs perform best on in-distribution networks, followed by the FCNNs, and lastly, the homogeneous GNNs. We also find that both GNN types generalize better to out-of-distribution networks than FCNNs.
Authors: Haishan Ye, Yinghui Huang, Hao Di, Xiangyu Chang
Abstract: We propose an enhanced zeroth-order stochastic Frank-Wolfe framework to address constrained finite-sum optimization problems, a structure prevalent in large-scale machine-learning applications. Our method introduces a novel double variance reduction framework that effectively reduces the gradient approximation variance induced by zeroth-order oracles and the stochastic sampling variance from finite-sum objectives. By leveraging this framework, our algorithm achieves significant improvements in query efficiency, making it particularly well-suited for high-dimensional optimization tasks. Specifically, for convex objectives, the algorithm achieves a query complexity of O(d \sqrt{n}/\epsilon ) to find an epsilon-suboptimal solution, where d is the dimensionality and n is the number of functions in the finite-sum objective. For non-convex objectives, it achieves a query complexity of O(d^{3/2}\sqrt{n}/\epsilon^2 ) without requiring the computation ofd partial derivatives at each iteration. These complexities are the best known among zeroth-order stochastic Frank-Wolfe algorithms that avoid explicit gradient calculations. Empirical experiments on convex and non-convex machine learning tasks, including sparse logistic regression, robust classification, and adversarial attacks on deep networks, validate the computational efficiency and scalability of our approach. Our algorithm demonstrates superior performance in both convergence rate and query complexity compared to existing methods.
Authors: Marco Barbero Mota, John M. Still, Jorge L. Gamboa, Eric V. Strobl, Charles M. Stein, Vivian K. Kawai, Thomas A. Lasko
Abstract: Systemic lupus erythematosus (SLE) is a complex heterogeneous disease with many manifestational facets. We propose a data-driven approach to discover probabilistic independent sources from multimodal imperfect EHR data. These sources represent exogenous variables in the data generation process causal graph that estimate latent root causes of the presence of SLE in the health record. We objectively evaluated the sources against the original variables from which they were discovered by training supervised models to discriminate SLE from negative health records using a reduced set of labelled instances. We found 19 predictive sources with high clinical validity and whose EHR signatures define independent factors of SLE heterogeneity. Using the sources as input patient data representation enables models to provide with rich explanations that better capture the clinical reasons why a particular record is (not) an SLE case. Providers may be willing to trade patient-level interpretability for discrimination especially in challenging cases.
Authors: Ziqing Wen, Ping Luo, Jiahuan Wang, Xiaoge Deng, Jinping Zou, Kun Yuan, Tao Sun, Dongsheng Li
Abstract: Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate the memory-efficient method beyond low-rank training, proposing a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements for maintaining optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training without sacrificing performance. Through extensive experiments on both pre-training and fine-tuning tasks, we show that GWT achieves state-of-the-art performance compared with advanced memory-efficient optimizers and full-rank approaches in terms of both memory usage and training performance.
Authors: Ping Guo, Cheng Gong, Xi Lin, Fei Liu, Zhichao Lu, Qingfu Zhang, Zhenkun Wang
Abstract: Crafting adversarial examples is crucial for evaluating and enhancing the robustness of Deep Neural Networks (DNNs), presenting a challenge equivalent to maximizing a non-differentiable 0-1 loss function. However, existing single objective methods, namely adversarial attacks focus on a surrogate loss function, do not fully harness the benefits of engaging multiple loss functions, as a result of insufficient understanding of their synergistic and conflicting nature. To overcome these limitations, we propose the Multi-Objective Set-based Attack (MOS Attack), a novel adversarial attack framework leveraging multiple loss functions and automatically uncovering their interrelations. The MOS Attack adopts a set-based multi-objective optimization strategy, enabling the incorporation of numerous loss functions without additional parameters. It also automatically mines synergistic patterns among various losses, facilitating the generation of potent adversarial attacks with fewer objectives. Extensive experiments have shown that our MOS Attack outperforms single-objective attacks. Furthermore, by harnessing the identified synergistic patterns, MOS Attack continues to show superior results with a reduced number of loss functions.
Authors: Monse Guedes-Ayala, Lars Schewe, Zeynep Suvak, Miguel Anjos
Abstract: Machine Learning (ML) models have become a very powerful tool to extract information from large datasets and use it to make accurate predictions and automated decisions. However, ML models can be vulnerable to external attacks, causing them to underperform or deviate from their expected tasks. One way to attack ML models is by injecting malicious data to mislead the algorithm during the training phase, which is referred to as a poisoning attack. We can prepare for such situations by designing anticipated attacks, which are later used for creating and testing defence strategies. In this paper, we propose an algorithm to generate strong poisoning attacks for a ridge regression model containing both numerical and categorical features that explicitly models and poisons categorical features. We model categorical features as SOS-1 sets and formulate the problem of designing poisoning attacks as a bilevel optimization problem that is nonconvex mixed-integer in the upper-level and unconstrained convex quadratic in the lower-level. We present the mathematical formulation of the problem, introduce a single-level reformulation based on the Karush-Kuhn-Tucker (KKT) conditions of the lower level, find bounds for the lower-level variables to accelerate solver performance, and propose a new algorithm to poison categorical features. Numerical experiments show that our method improves the mean squared error of all datasets compared to the previous benchmark in the literature.
Authors: S\'egol\`ene Martin, Jean-Christophe Pesquet, Gabriele Steidl, Ismail Ben Ayed
Abstract: We propose a novel Bregman descent algorithm for minimizing a convex function that is expressed as the sum of a differentiable part (defined over an open set) and a possibly nonsmooth term. The approach, referred to as the Variable Bregman Majorization-Minimization (VBMM) algorithm, extends the Bregman Proximal Gradient method by allowing the Bregman function used in the divergence to adaptively vary at each iteration, provided it satisfies a majorizing condition on the objective function. This adaptive framework enables the algorithm to approximate the objective more precisely at each iteration, thereby allowing for accelerated convergence compared to the traditional Bregman Proximal Gradient descent. We establish the convergence of the VBMM algorithm to a minimizer under mild assumptions on the family of metrics used. Furthermore, we introduce a novel application of both the Bregman Proximal Gradient method and the VBMM algorithm to the estimation of the multidimensional parameters of a Dirichlet distribution through the maximization of its log-likelihood. Numerical experiments confirm that the VBMM algorithm outperforms existing approaches in terms of convergence speed.
Authors: Cornelius Hake (Ing. h.c. F. Porsche AG), Jonas Weigele (Ing. h.c. F. Porsche AG), Frederik Reichert (Hochschule Esslingen), Christian Friedrich (Hochschule Karlsruhe)
Abstract: The present study examines the effectiveness of applying Artificial Intelligence methods in an automotive production environment to predict unknown lead times in a non-cycle-controlled production area. Data structures are analyzed to identify contextual features and then preprocessed using one-hot encoding. Methods selection focuses on supervised machine learning techniques. In supervised learning methods, regression and classification methods are evaluated. Continuous regression based on target size distribution is not feasible. Classification methods analysis shows that Ensemble Learning and Support Vector Machines are the most suitable. Preliminary study results indicate that gradient boosting algorithms LightGBM, XGBoost, and CatBoost yield the best results. After further testing and extensive hyperparameter optimization, the final method choice is the LightGBM algorithm. Depending on feature availability and prediction interval granularity, relative prediction accuracies of up to 90% can be achieved. Further tests highlight the importance of periodic retraining of AI models to accurately represent complex production processes using the database. The research demonstrates that AI methods can be effectively applied to highly variable production data, adding business value by providing an additional metric for various control tasks while outperforming current non AI-based systems.
Authors: Buse Sibel Korkmaz, Rahul Nair, Elizabeth M. Daly, Evangelos Anagnostopoulos, Christos Varytimidis, Antonio del Rio Chanona
Abstract: Foundation models require fine-tuning to ensure their generative outputs align with intended results for specific tasks. Automating this fine-tuning process is challenging, as it typically needs human feedback that can be expensive to acquire. We present AutoRefine, a method that leverages reinforcement learning for targeted fine-tuning, utilizing direct feedback from measurable performance improvements in specific downstream tasks. We demonstrate the method for a problem arising in algorithmic hiring platforms where linguistic biases influence a recommendation system. In this setting, a generative model seeks to rewrite given job specifications to receive more diverse candidate matches from a recommendation engine which matches jobs to candidates. Our model detects and regulates biases in job descriptions to meet diversity and fairness criteria. The experiments on a public hiring dataset and a real-world hiring platform showcase how large language models can assist in identifying and mitigation biases in the real world.
Authors: Haochuan Zhang, Chunhua Yang, Jie Han, Liyang Qin, Xiaoli Wang
Abstract: Multi-modal language model has made advanced progress in vision and audio, but still faces significant challenges in dealing with complex reasoning tasks in the time series domain. The reasons are twofold. First, labels for multi-modal time series data are coarse and devoid of analysis or reasoning processes. Training with these data cannot improve the model's reasoning capabilities. Second, due to the lack of precise tokenization in processing time series, the representation patterns for temporal and textual information are inconsistent, which hampers the effectiveness of multi-modal alignment. To address these challenges, we propose a multi-modal time series data construction approach and a multi-modal time series language model (TLM), TempoGPT. Specially, we construct multi-modal data for complex reasoning tasks by analyzing the variable-system relationships within a white-box system. Additionally, proposed TempoGPT achieves consistent representation between temporal and textual information by quantizing temporal embeddings, where temporal embeddings are quantized into a series of discrete tokens using a predefined codebook; subsequently, a shared embedding layer processes both temporal and textual tokens. Extensive experiments demonstrate that TempoGPT accurately perceives temporal information, logically infers conclusions, and achieves state-of-the-art in the constructed complex time series reasoning tasks. Moreover, we quantitatively demonstrate the effectiveness of quantizing temporal embeddings in enhancing multi-modal alignment and the reasoning capabilities of TLMs. Code and data are available at https://github.com/zhanghaochuan20/TempoGPT.
Authors: Maximilian Bundscherer, Thomas H. Schmitt, Ilja Baumann, Tobias Bocklet
Abstract: This study presents an ML approach for classifying digital radio operating modes evaluated on real-world transmissions. We generated 98 different parameterized radio signals from 17 digital operating modes, transmitted each of them on the 70 cm (UHF) amateur radio band, and recorded our transmissions with two different architectures of SDR receivers. Three lightweight ML models were trained exclusively on spectrograms of limited non-transmitted signals with random characters as payloads. This training involved an online data augmentation pipeline to simulate various radio channel impairments. Our best model, EfficientNetB0, achieved an accuracy of 93.80% across the 17 operating modes and 85.47% across all 98 parameterized radio signals, evaluated on our real-world transmissions with Wikipedia articles as payloads. Furthermore, we analyzed the impact of varying signal durations & the number of FFT bins on classification, assessed the effectiveness of our simulated channel impairments, and tested our models across multiple simulated SNRs.
Authors: Shilong Deng, Zetao Zheng, Hongcai He, Paul Weng, Jie Shao
Abstract: A major challenge in Reinforcement Learning (RL) is the difficulty of learning an optimal policy from sparse rewards. Prior works enhance online RL with conventional Imitation Learning (IL) via a handcrafted auxiliary objective, at the cost of restricting the RL policy to be sub-optimal when the offline data is generated by a non-expert policy. Instead, to better leverage valuable information in offline data, we develop Generalized Imitation Learning from Demonstration (GILD), which meta-learns an objective that distills knowledge from offline data and instills intrinsic motivation towards the optimal policy. Distinct from prior works that are exclusive to a specific RL algorithm, GILD is a flexible module intended for diverse vanilla off-policy RL algorithms. In addition, GILD introduces no domain-specific hyperparameter and minimal increase in computational cost. In four challenging MuJoCo tasks with sparse rewards, we show that three RL algorithms enhanced with GILD significantly outperform state-of-the-art methods.
Authors: Michael Adipoetra, S\'egol\`ene Martin
Abstract: We propose a novel deep clustering method that integrates Variational Autoencoders (VAEs) into the Expectation-Maximization (EM) framework. Our approach models the probability distribution of each cluster with a VAE and alternates between updating model parameters by maximizing the Evidence Lower Bound (ELBO) of the log-likelihood and refining cluster assignments based on the learned distributions. This enables effective clustering and generation of new samples from each cluster. Unlike existing VAE-based methods, our approach eliminates the need for a Gaussian Mixture Model (GMM) prior or additional regularization techniques. Experiments on MNIST and FashionMNIST demonstrate superior clustering performance compared to state-of-the-art methods.
Authors: Vinay Sharma, Olga Fink
Abstract: Accurate, interpretable, and real-time modeling of multi-body dynamical systems is essential for predicting behaviors and inferring physical properties in natural and engineered environments. Traditional physics-based models face scalability challenges and are computationally demanding, while data-driven approaches like Graph Neural Networks (GNNs) often lack physical consistency, interpretability, and generalization. In this paper, we propose Dynami-CAL GraphNet, a Physics-Informed Graph Neural Network that integrates the learning capabilities of GNNs with physics-based inductive biases to address these limitations. Dynami-CAL GraphNet enforces pairwise conservation of linear and angular momentum for interacting nodes using edge-local reference frames that are equivariant to rotational symmetries, invariant to translations, and equivariant to node permutations. This design ensures physically consistent predictions of node dynamics while offering interpretable, edge-wise linear and angular impulses resulting from pairwise interactions. Evaluated on a 3D granular system with inelastic collisions, Dynami-CAL GraphNet demonstrates stable error accumulation over extended rollouts, effective extrapolations to unseen configurations, and robust handling of heterogeneous interactions and external forces. Dynami-CAL GraphNet offers significant advantages in fields requiring accurate, interpretable, and real-time modeling of complex multi-body dynamical systems, such as robotics, aerospace engineering, and materials science. By providing physically consistent and scalable predictions that adhere to fundamental conservation laws, it enables the inference of forces and moments while efficiently handling heterogeneous interactions and external forces.
Authors: RunQing Wu, KaiHui Huang, HanYi Zhang, QiHe Liu, GuoJin Yu, JingSong Deng, Fei Ye
Abstract: Continuously acquiring new knowledge from a dynamic environment is a fundamental capability for animals, facilitating their survival and ability to address various challenges. This capability is referred to as continual learning, which focuses on the ability to learn a sequence of tasks without the detriment of previous knowledge. A prevalent strategy to tackle continual learning involves selecting and storing numerous essential data samples from prior tasks within a fixed-size memory buffer. However, the majority of current memory-based techniques typically utilize a single memory buffer, which poses challenges in concurrently managing newly acquired and previously learned samples. Drawing inspiration from the Complementary Learning Systems (CLS) theory, which defines rapid and gradual learning mechanisms for processing information, we propose an innovative dual memory system called the Information-Theoretic Dual Memory System (ITDMS). This system comprises a fast memory buffer designed to retain temporary and novel samples, alongside a slow memory buffer dedicated to preserving critical and informative samples. The fast memory buffer is optimized employing an efficient reservoir sampling process. Furthermore, we introduce a novel information-theoretic memory optimization strategy that selectively identifies and retains diverse and informative data samples for the slow memory buffer. Additionally, we propose a novel balanced sample selection procedure that automatically identifies and eliminates redundant memorized samples, thus freeing up memory capacity for new data acquisitions, which can deal with a growing array of tasks. Our methodology is rigorously assessed through a series of continual learning experiments, with empirical results underscoring the effectiveness of the proposed system.
Authors: Thomas Chen
Abstract: We derive explicit equations governing the cumulative biases and weights in Deep Learning with ReLU activation function, based on gradient descent for the Euclidean cost in the input layer, and under the assumption that the weights are, in a precise sense, adapted to the coordinate system distinguished by the activations. We show that gradient descent corresponds to a dynamical process in the input layer, whereby clusters of data are progressively reduced in complexity ("truncated") at an exponential rate that increases with the number of data points that have already been truncated. We provide a detailed discussion of several types of solutions to the gradient flow equations. A main motivation for this work is to shed light on the interpretability question in supervised learning.
Authors: Aram Ansary Ogholbake, Qiang Cheng
Abstract: Circadian rhythms regulate the physiology and behavior of humans and animals. Despite advancements in understanding these rhythms and predicting circadian phases at the transcriptional level, predicting circadian phases from proteomic data remains elusive. This challenge is largely due to the scarcity of time labels in proteomic datasets, which are often characterized by small sample sizes, high dimensionality, and significant noise. Furthermore, existing methods for predicting circadian phases from transcriptomic data typically rely on prior knowledge of known rhythmic genes, making them unsuitable for proteomic datasets. To address this gap, we developed a novel computational method using unsupervised deep learning techniques to predict circadian sample phases from proteomic data without requiring time labels or prior knowledge of proteins or genes. Our model involves a two-stage training process optimized for robust circadian phase prediction: an initial greedy one-layer-at-a-time pre-training which generates informative initial parameters followed by fine-tuning. During fine-tuning, a specialized loss function guides the model to align protein expression levels with circadian patterns, enabling it to accurately capture the underlying rhythmic structure within the data. We tested our method on both time-labeled and unlabeled proteomic data. For labeled data, we compared our predictions to the known time labels, achieving high accuracy, while for unlabeled human datasets, including postmortem brain regions and urine samples, we explored circadian disruptions. Notably, our analysis identified disruptions in rhythmic proteins between Alzheimer's disease and control subjects across these samples.
Authors: Muhammad Umair Danish, Mathumitha Sureshkumar, Thanuri Fonseka, Umeshika Uthayakumar, Vinura Galwaduge
Abstract: This research provides an in-depth evaluation of various machine learning models for energy forecasting, focusing on the unique challenges of seasonal variations in student residential settings. The study assesses the performance of baseline models, such as LSTM and GRU, alongside state-of-the-art forecasting methods, including Autoregressive Feedforward Neural Networks, Transformers, and hybrid approaches. Special attention is given to predicting energy consumption amidst challenges like seasonal patterns, vacations, meteorological changes, and irregular human activities that cause sudden fluctuations in usage. The findings reveal that no single model consistently outperforms others across all seasons, emphasizing the need for season-specific model selection or tailored designs. Notably, the proposed Hyper Network based LSTM and MiniAutoEncXGBoost models exhibit strong adaptability to seasonal variations, effectively capturing abrupt changes in energy consumption during summer months. This study advances the energy forecasting field by emphasizing the critical role of seasonal dynamics and model-specific behavior in achieving accurate predictions.
Authors: Ambroise Heurtebise, Omar Chehab, Pierre Ablin, Alexandre Gramfort
Abstract: Machine learning techniques in multi-view settings face significant challenges, particularly when integrating heterogeneous data, aligning feature spaces, and managing view-specific biases. These issues are prominent in neuroscience, where data from multiple subjects exposed to the same stimuli are analyzed to uncover brain activity dynamics. In magnetoencephalography (MEG), where signals are captured at the scalp level, estimating the brain's underlying sources is crucial, especially in group studies where sources are assumed to be similar for all subjects. Common methods, such as Multi-View Independent Component Analysis (MVICA), assume identical sources across subjects, but this assumption is often too restrictive due to individual variability and age-related changes. Multi-View Independent Component Analysis with Delays (MVICAD) addresses this by allowing sources to differ up to a temporal delay. However, temporal dilation effects, particularly in auditory stimuli, are common in brain dynamics, making the estimation of time delays alone insufficient. To address this, we propose Multi-View Independent Component Analysis with Delays and Dilations (MVICAD2), which allows sources to differ across subjects in both temporal delays and dilations. We present a model with identifiable sources, derive an approximation of its likelihood in closed form, and use regularization and optimization techniques to enhance performance. Through simulations, we demonstrate that MVICAD2 outperforms existing multi-view ICA methods. We further validate its effectiveness using the Cam-CAN dataset, and showing how delays and dilations are related to aging.
Authors: Ting-Yu Dai, Hayato Ushijima-Mwesigwa
Abstract: A recent report from the World Meteorological Organization (WMO) highlights that water-related disasters have caused the highest human losses among natural disasters over the past 50 years, with over 91\% of deaths occurring in low-income countries. This disparity is largely due to the lack of adequate ground monitoring stations, such as weather surveillance radars (WSR), which are expensive to install. For example, while the US and Europe combined possess over 600 WSRs, Africa, despite having almost one and half times their landmass, has fewer than 40. To address this issue, satellite-based observations offer a global, near-real-time monitoring solution. However, they face several challenges like accuracy, bias, and low spatial resolution. This study leverages the power of diffusion models and residual learning to address these limitations in a unified framework. We introduce the first diffusion model for correcting the inconsistency between different precipitation products. Our method demonstrates the effectiveness in downscaling satellite precipitation estimates from 10 km to 1 km resolution. Extensive experiments conducted in the Seattle region demonstrate significant improvements in accuracy, bias reduction, and spatial detail. Importantly, our approach achieves these results using only precipitation data, showcasing the potential of a purely computer vision-based approach for enhancing satellite precipitation products and paving the way for further advancements in this domain.
Authors: Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramer, Chiyuan Zhang
Abstract: It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95\%$ accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increases the cost of such attacks. Some of these defenses were present before our collaboration, such as bot protection with Cloudflare, malicious user detection, and rate limiting. Others, including reCAPTCHA and login are being integrated to strengthen the security in Chatbot Arena.
Authors: Mingkang Wu, Devin White, Vernon Lawhern, Nicholas R. Waytowich, Yongcan Cao
Abstract: Reinforcement learning (RL), a common tool in decision making, learns policies from various experiences based on the associated cumulative return/rewards without treating them differently. On the contrary, humans often learn to distinguish from different levels of performance and extract the underlying trends towards improving their decision making for best performance. Motivated by this, this paper proposes a novel RL method that mimics humans' decision making process by differentiating among collected experiences for effective policy learning. The main idea is to extract important directional information from experiences with different performance levels, named ratings, so that policies can be updated towards desired deviation from these experiences with different ratings. Specifically, we propose a new policy loss function that penalizes distribution similarities between the current policy and failed experiences with different ratings, and assign different weights to the penalty terms based on the rating classes. Meanwhile, reward learning from these rated samples can be integrated with the new policy loss towards an integrated reward and policy learning from rated samples. Optimizing the integrated reward and policy loss function will lead to the discovery of directions for policy improvement towards maximizing cumulative rewards and penalizing most from the lowest performance level while least from the highest performance level. To evaluate the effectiveness of the proposed method, we present results for experiments on a few typical environments that show improved convergence and overall performance over the existing rating-based reinforcement learning method with only reward learning.
Authors: Ryan G. Dempsey, Jonathan Ethier, Halim Yanikomeroglu
Abstract: Path loss prediction is a beneficial tool for efficient use of the radio frequency spectrum. Building on prior research on high-resolution map-based path loss models, this paper studies convolutional neural network input representations in more detail. We investigate different methods of representing scalar features in convolutional neural networks. Specifically, we compare using frequency and distance as input channels to convolutional layers or as scalar inputs to regression layers. We assess model performance using three different feature configurations and find that representing scalar features as image channels results in the strongest generalization.
Authors: Haoxiang Yu, Javier Berrocal, Christine Julien
Abstract: Artificial intelligence has been integrated into nearly every aspect of daily life, powering applications from object detection with computer vision to large language models for writing emails and compact models in smart homes. These machine learning models cater to individual users but are often detached from them, as they are typically stored and processed in centralized data centers. This centralized approach raises privacy concerns, incurs high infrastructure costs, and struggles with personalization. Federated and fully decentralized learning methods have been proposed to address these issues, but they still depend on centralized servers or face slow convergence due to communication constraints. To overcome these challenges, we propose ML Mule, a approach that utilizes individual mobile devices as 'Mules' to train and transport model snapshots as they move through physical spaces, sharing these models with the physical 'Spaces' they inhabit. This method implicitly forms affinity groups among devices associated with users who share particular spaces, enabling collaborative model evolution, and protecting users' privacy. Our approach addresses several major shortcomings of traditional, federated, and fully decentralized learning systems. The proposed framework represents a new class of machine learning methods that are more robust, distributed, and personalized, bringing the field closer to realizing the original vision of intelligent, adaptive, and genuinely context-aware smart environments. The results show that ML Mule converges faster and achieves higher model accuracy compared to other existing methods.
Authors: Sana Rahmani, Reetam Chatterjee, Ali Etemad, Javad Hashemi
Abstract: Continual Learning (CL) methods aim to learn from a sequence of tasks while avoiding the challenge of forgetting previous knowledge. We present DREAM-CL, a novel CL method for ECG arrhythmia detection that introduces dynamic prototype rehearsal memory. DREAM-CL selects representative prototypes by clustering data based on learning behavior during each training session. Within each cluster, we apply a smooth sorting operation that ranks samples by training difficulty, compressing extreme values and removing outliers. The more challenging samples are then chosen as prototypes for the rehearsal memory, ensuring effective knowledge retention across sessions. We evaluate our method on time-incremental, class-incremental, and lead-incremental scenarios using two widely used ECG arrhythmia datasets, Chapman and PTB-XL. The results demonstrate that DREAM-CL outperforms the state-of-the-art in CL for ECG arrhythmia detection. Detailed ablation and sensitivity studies are performed to validate the different design choices of our method.
Authors: Saurabh Bodhe, Zhanguang Zhang, Atia Hamidizadeh, Shixiong Kai, Yingxue Zhang, Mingxuan Yuan
Abstract: Pre-routing slack prediction remains a critical area of research in Electronic Design Automation (EDA). Despite numerous machine learning-based approaches targeting this task, there is still a lack of a truly end-to-end framework that engineers can use to obtain TNS/WNS metrics from raw circuit data at the placement stage. Existing works have demonstrated effectiveness in Arrival Time (AT) prediction but lack a mechanism for Required Arrival Time (RAT) prediction, which is essential for slack prediction and obtaining TNS/WNS metrics. In this work, we propose E2ESlack, an end-to-end graph-based framework for pre-routing slack prediction. The framework includes a TimingParser that supports DEF, SDF and LIB files for feature extraction and graph construction, an arrival time prediction model and a fast RAT estimation module. To the best of our knowledge, this is the first work capable of predicting path-level slacks at the pre-routing stage. We perform extensive experiments and demonstrate that our proposed RAT estimation method outperforms the SOTA ML-based prediction method and also pre-routing STA tool. Additionally, the proposed E2ESlack framework achieves TNS/WNS values comparable to post-routing STA results while saving up to 23x runtime.
Authors: Aziida Nanyonga, Keith Joiner, Ugur Turhan, Graham Wild
Abstract: This study explores using Natural Language Processing in aviation safety, focusing on machine learning algorithms to enhance safety measures. There are currently May 2024, 34 Scopus results from the keyword search natural language processing and aviation safety. Analyzing these studies allows us to uncover trends in the methodologies, findings and implications of NLP in aviation. Both qualitative and quantitative tools have been used to investigate the current state of literature on NLP for aviation safety. The qualitative analysis summarises the research motivations, objectives, and outcomes, showing how NLP can be utilized to help identify critical safety issues and improve aviation safety. This study also identifies research gaps and suggests areas for future exploration, providing practical recommendations for the aviation industry. We discuss challenges in implementing NLP in aviation safety, such as the need for large, annotated datasets, and the difficulty in interpreting complex models. We propose solutions like active learning for data annotation and explainable AI for model interpretation. Case studies demonstrate the successful application of NLP in improving aviation safety, highlighting its potential to make aviation safer and more efficient.
Authors: Xinger Li, Zhiqiang Zhong, Bo Huang, Yang Yang
Abstract: This paper is the first-place solution for ICASSP MEIJU@2025 Track I, which focuses on low-resource multimodal emotion and intention recognition. How to effectively utilize a large amount of unlabeled data, while ensuring the mutual promotion of different difficulty levels tasks in the interaction stage, these two points become the key to the competition. In this paper, pseudo-label labeling is carried out on the model trained with labeled data, and samples with high confidence and their labels are selected to alleviate the problem of low resources. At the same time, the characteristic of easy represented ability of intention recognition found in the experiment is used to make mutually promote with emotion recognition under different attention heads, and higher performance of intention recognition is achieved through fusion. Finally, under the refined processing data, we achieve the score of 0.5532 in the Test set, and win the championship of the track.
Authors: Phillip Maire, Samson G. King, Jonathan Andrew Cheung, Stefanie Walker, Samuel Andrew Hires
Abstract: The rodent vibrissal system is pivotal in advancing neuroscience research, particularly for studies of cortical plasticity, learning, decision-making, sensory encoding, and sensorimotor integration. Despite the advantages, curating touch events is labor intensive and often requires >3 hours per million video frames, even after leveraging automated tools like the Janelia Whisker Tracker. We address this limitation by introducing Whisker Automatic Contact Classifier (WhACC), a python package designed to identify touch periods from high-speed videos of head-fixed behaving rodents with human-level performance. WhACC leverages ResNet50V2 for feature extraction, combined with LightGBM for Classification. Performance is assessed against three expert human curators on over one million frames. Pairwise touch classification agreement on 99.5% of video frames, equal to between-human agreement. Finally, we offer a custom retraining interface to allow model customization on a small subset of data, which was validated on four million frames across 16 single-unit electrophysiology recordings. Including this retraining step, we reduce human hours required to curate a 100 million frame dataset from ~333 hours to ~6 hours.
Authors: Yiqin Zhang, Meiling Chen, Zhengjie Zhang
Abstract: Whether during the early days of popularization or in the present, the window setting in Computed Tomography (CT) has always been an indispensable part of the CT analysis process. Although research has investigated the capabilities of CT multi-window fusion in enhancing neural networks, there remains a paucity of domain-invariant, intuitively interpretable methodologies for Auto Window Setting. In this work, we propose an plug-and-play module originate from Tanh activation function, which is compatible with mainstream deep learning architectures. Starting from the physical principles of CT, we adhere to the principle of interpretability to ensure the module's reliability for medical implementations. The domain-invariant design facilitates observation of the preference decisions rendered by the adaptive mechanism from a clinically intuitive perspective. This enables the proposed method to be understood not only by experts in neural networks but also garners higher trust from clinicians. We confirm the effectiveness of the proposed method in multiple open-source datasets, yielding 10%~200% Dice improvements on hard segment targets.
Authors: Yangyang Li, Zhengya Qia, Yuelin Lia, Haorui Yanga, Ronghua Shanga, Licheng Jiaoa
Abstract: Medical images are characterized by intricate and complex features, requiring interpretation by physicians with medical knowledge and experience. Classical neural networks can reduce the workload of physicians, but can only handle these complex features to a limited extent. Theoretically, quantum computing can explore a broader parameter space with fewer parameters, but it is currently limited by the constraints of quantum hardware.Considering these factors, we propose a distributed hybrid quantum convolutional neural network based on quantum circuit splitting. This model leverages the advantages of quantum computing to effectively capture the complex features of medical images, enabling efficient classification even in resource-constrained environments. Our model employs a quantum convolutional neural network (QCNN) to extract high-dimensional features from medical images, thereby enhancing the model's expressive capability.By integrating distributed techniques based on quantum circuit splitting, the 8-qubit QCNN can be reconstructed using only 5 qubits.Experimental results demonstrate that our model achieves strong performance across 3 datasets for both binary and multiclass classification tasks. Furthermore, compared to recent technologies, our model achieves superior performance with fewer parameters, and experimental results validate the effectiveness of our model.
Authors: Arash Dehghani, Hossein Saberi
Abstract: This paper reviews the state-of-the-art in deepfake generation and detection, focusing on modern deep learning technologies and tools based on the latest scientific advancements. The rise of deepfakes, leveraging techniques like Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion models and other generative models, presents significant threats to privacy, security, and democracy. This fake media can deceive individuals, discredit real people and organizations, facilitate blackmail, and even threaten the integrity of legal, political, and social systems. Therefore, finding appropriate solutions to counter the potential threats posed by this technology is essential. We explore various deepfake methods, including face swapping, voice conversion, reenactment and lip synchronization, highlighting their applications in both benign and malicious contexts. The review critically examines the ongoing "arms race" between deepfake generation and detection, analyzing the challenges in identifying manipulated contents. By examining current methods and highlighting future research directions, this paper contributes to a crucial understanding of this rapidly evolving field and the urgent need for robust detection strategies to counter the misuse of this powerful technology. While focusing primarily on audio, image, and video domains, this study allows the reader to easily grasp the latest advancements in deepfake generation and detection.
Authors: Joaquin Delgado Fernandez, Sergio Potenciano Menci, Alessio Magitteri
Abstract: In the evolving landscape of data privacy, the anonymization of electric load profiles has become a critical issue, especially with the enforcement of the General Data Protection Regulation (GDPR) in Europe. These electric load profiles, which are essential datasets in the energy industry, are classified as personal behavioral data, necessitating stringent protective measures. This article explores the implications of this classification, the importance of data anonymization, and the potential of forecasting using microaggregated data. The findings underscore that effective anonymization techniques, such as microaggregation, do not compromise the performance of forecasting models under certain conditions (i.e., forecasting aggregated). In such an aggregated level, microaggregated data maintains high levels of utility, with minimal impact on forecasting accuracy. The implications for the energy sector are profound, suggesting that privacy-preserving data practices can be integrated into smart metering technology applications without hindering their effectiveness.
Authors: Zhiyong Yu, Yuning Jiang, Xin Liu, Yuanming Shi, Chunxiao Jiang, Linling Kuang
Abstract: With the growing demand for Earth observation, it is important to provide reliable real-time remote sensing inference services to meet the low-latency requirements. The Space Computing Power Network (Space-CPN) offers a promising solution by providing onboard computing and extensive coverage capabilities for real-time inference. This paper presents a remote sensing artificial intelligence applications deployment framework designed for Low Earth Orbit satellite constellations to achieve real-time inference performance. The framework employs the microservice architecture, decomposing monolithic inference tasks into reusable, independent modules to address high latency and resource heterogeneity. This distributed approach enables optimized microservice deployment, minimizing resource utilization while meeting quality of service and functional requirements. We introduce Robust Optimization to the deployment problem to address data uncertainty. Additionally, we model the Robust Optimization problem as a Partially Observable Markov Decision Process and propose a robust reinforcement learning algorithm to handle the semi-infinite Quality of Service constraints. Our approach yields sub-optimal solutions that minimize accuracy loss while maintaining acceptable computational costs. Simulation results demonstrate the effectiveness of our framework.
Authors: Sina Moradi
Abstract: Optimal Transport (OT) has established itself as a robust framework for quantifying differences between distributions, with applications that span fields such as machine learning, data science, and computer vision. This paper offers a detailed examination of the OT problem, beginning with its theoretical foundations, including the classical formulations of Monge and Kantorovich and their extensions to modern computational techniques. It explores cutting-edge algorithms, including Sinkhorn iterations, primal-dual strategies, and reduction-based approaches, emphasizing their efficiency and scalability in addressing high-dimensional problems. The paper also highlights emerging trends, such as integrating OT into machine learning frameworks, the development of novel problem variants, and ongoing theoretical advancements. Applications of OT are presented across a range of domains, with particular attention to its innovative application in time series data analysis via Optimal Transport Warping (OTW), a robust alternative to methods like Dynamic Time Warping. Despite the significant progress made, challenges related to scalability, robustness, and ethical considerations remain, necessitating further research. The paper underscores OT's potential to bridge theoretical depth and practical utility, fostering impactful advancements across diverse disciplines.
Authors: Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Yutaka Matsuo
Abstract: Sparse autoencoders (SAEs) have gained a lot of attention as a promising tool to improve the interpretability of large language models (LLMs) by mapping the complex superposition of polysemantic neurons into monosemantic features and composing a sparse dictionary of words. However, traditional performance metrics like Mean Squared Error and L0 sparsity ignore the evaluation of the semantic representational power of SAEs -- whether they can acquire interpretable monosemantic features while preserving the semantic relationship of words. For instance, it is not obvious whether a learned sparse feature could distinguish different meanings in one word. In this paper, we propose a suite of evaluations for SAEs to analyze the quality of monosemantic features by focusing on polysemous words. Our findings reveal that SAEs developed to improve the MSE-L0 Pareto frontier may confuse interpretability, which does not necessarily enhance the extraction of monosemantic features. The analysis of SAEs with polysemous words can also figure out the internal mechanism of LLMs; deeper layers and the Attention module contribute to distinguishing polysemy in a word. Our semantics focused evaluation offers new insights into the polysemy and the existing SAE objective and contributes to the development of more practical SAEs.
Authors: Jelena Bratuli\'c, Sudhanshu Mittal, Christian Rupprecht, Thomas Brox
Abstract: Large Language Models (LLMs) have demonstrated impressive performance in various tasks, including In-Context Learning (ICL), where the model performs new tasks by conditioning solely on the examples provided in the context, without updating the model's weights. While prior research has explored the roles of pretraining data and model architecture, the key mechanism behind ICL remains unclear. In this work, we systematically uncover properties present in LLMs that support the emergence of ICL. To disambiguate these factors, we conduct a study with a controlled dataset and data sequences using a deep autoregressive model. We show that conceptual repetitions in the data sequences are crucial for ICL, more so than previously indicated training data properties like burstiness or long-tail distribution. Conceptual repetitions could refer to $n$-gram repetitions in textual data or exact image copies in image sequence data. Such repetitions also offer other previously overlooked benefits such as reduced transiency in ICL performance. Furthermore, we show that the emergence of ICL depends on balancing the in-weight learning objective with the in-context solving ability during training.
Authors: Zenon Lamprou, Frank Polick, Yashar Moshfeghi
Abstract: Contemporary neural networks intended for natural language processing (NLP) are not designed with specific linguistic rules. It suggests that they may acquire a general understanding of language. This attribute has led to extensive research in deciphering their internal representations. A pioneering method involves an experimental setup using human brain data to explore if a translation between brain and neural network representations can be established. Since this technique emerged, more sophisticated NLP models have been developed. In our study, we apply this method to evaluate four new NLP models aiming to identify the one most compatible with brain activity. Additionally, to explore how the brain comprehends text semantically, we alter the text by removing punctuation in four different ways to understand its impact on semantic processing by the human brain. Our findings indicate that the RoBERTa model aligns best with brain activity, outperforming BERT in accuracy according to our metrics. Furthermore, for BERT, higher accuracy was noted when punctuation was excluded, and increased context length did not significantly diminish accuracy compared to the original results with punctuation.
Authors: Jos\'e Ram\'on Pareja Monturiol, Alejandro Pozas-Kerstjens, David P\'erez-Garc\'ia
Abstract: We present a tensorization algorithm for constructing tensor train representations of functions, drawing on sketching and cross interpolation ideas. The method only requires black-box access to the target function and a small set of sample points defining the domain of interest. Thus, it is particularly well-suited for machine learning models, where the domain of interest is naturally defined by the training dataset. We show that this approach can be used to enhance the privacy and interpretability of neural network models. Specifically, we apply our decomposition to (i) obfuscate neural networks whose parameters encode patterns tied to the training data distribution, and (ii) estimate topological phases of matter that are easily accessible from the tensor train representation. Additionally, we show that this tensorization can serve as an efficient initialization method for optimizing tensor trains in general settings, and that, for model compression, our algorithm achieves a superior trade-off between memory and time complexity compared to conventional tensorization methods of neural networks.
Authors: Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, Jan Eric Lenssen
Abstract: We introduce MEt3R, a metric for multi-view consistency in generated images. Large-scale generative models for multi-view image generation are rapidly advancing the field of 3D inference from sparse observations. However, due to the nature of generative modeling, traditional reconstruction metrics are not suitable to measure the quality of generated outputs and metrics that are independent of the sampling procedure are desperately needed. In this work, we specifically address the aspect of consistency between generated multi-view images, which can be evaluated independently of the specific scene. Our approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner, which are used to warp image contents from one view into the other. Then, feature maps of these images are compared to obtain a similarity score that is invariant to view-dependent effects. Using MEt3R, we evaluate the consistency of a large set of previous methods for novel view and video generation, including our open, multi-view latent diffusion model.
Authors: Jitao Wang, Chengchun Shi, John D. Piette, Joshua R. Loftus, Donglin Zeng, Zhenke Wu
Abstract: When applied in healthcare, reinforcement learning (RL) seeks to dynamically match the right interventions to subjects to maximize population benefit. However, the learned policy may disproportionately allocate efficacious actions to one subpopulation, creating or exacerbating disparities in other socioeconomically-disadvantaged subgroups. These biases tend to occur in multi-stage decision making and can be self-perpetuating, which if unaccounted for could cause serious unintended consequences that limit access to care or treatment benefit. Counterfactual fairness (CF) offers a promising statistical tool grounded in causal inference to formulate and study fairness. In this paper, we propose a general framework for fair sequential decision making. We theoretically characterize the optimal CF policy and prove its stationarity, which greatly simplifies the search for optimal CF policies by leveraging existing RL algorithms. The theory also motivates a sequential data preprocessing algorithm to achieve CF decision making under an additive noise assumption. We prove and then validate our policy learning approach in controlling unfairness and attaining optimal value through simulations. Analysis of a digital health dataset designed to reduce opioid misuse shows that our proposal greatly enhances fair access to counseling.
Authors: Matyas Bohacek, Hany Farid
Abstract: From a simple text prompt, generative-AI image models can create stunningly realistic and creative images bounded, it seems, by only our imagination. These models have achieved this remarkable feat thanks, in part, to the ingestion of billions of images collected from nearly every corner of the internet. Many creators have understandably expressed concern over how their intellectual property has been ingested without their permission or a mechanism to opt out of training. As a result, questions of fair use and copyright infringement have quickly emerged. We describe a method that allows us to determine if a model was trained on a specific image or set of images. This method is computationally efficient and assumes no explicit knowledge of the model architecture or weights (so-called black-box membership inference). We anticipate that this method will be crucial for auditing existing models and, looking ahead, ensuring the fairer development and deployment of generative AI models.
Authors: Stella C. Dong, James R. Finlay
Abstract: Reinsurance optimization is critical for insurers to manage risk exposure, ensure financial stability, and maintain solvency. Traditional approaches often struggle with dynamic claim distributions, high-dimensional constraints, and evolving market conditions. This paper introduces a novel hybrid framework that integrates {Generative Models}, specifically Variational Autoencoders (VAEs), with {Reinforcement Learning (RL)} using Proximal Policy Optimization (PPO). The framework enables dynamic and scalable optimization of reinsurance strategies by combining the generative modeling of complex claim distributions with the adaptive decision-making capabilities of reinforcement learning. The VAE component generates synthetic claims, including rare and catastrophic events, addressing data scarcity and variability, while the PPO algorithm dynamically adjusts reinsurance parameters to maximize surplus and minimize ruin probability. The framework's performance is validated through extensive experiments, including out-of-sample testing, stress-testing scenarios (e.g., pandemic impacts, catastrophic events), and scalability analysis across portfolio sizes. Results demonstrate its superior adaptability, scalability, and robustness compared to traditional optimization techniques, achieving higher final surpluses and computational efficiency. Key contributions include the development of a hybrid approach for high-dimensional optimization, dynamic reinsurance parameterization, and validation against stochastic claim distributions. The proposed framework offers a transformative solution for modern reinsurance challenges, with potential applications in multi-line insurance operations, catastrophe modeling, and risk-sharing strategy design.
Authors: Shang Wu, Yazhen Wang
Abstract: The seminal paper of Jordan, Kinderlehrer, and Otto introduced what is now widely known as the JKO scheme, an iterative algorithmic framework for computing distributions. This scheme can be interpreted as a Wasserstein gradient flow and has been successfully applied in machine learning contexts, such as deriving policy solutions in reinforcement learning. In this paper, we extend the JKO scheme to accommodate models with unknown parameters. Specifically, we develop statistical methods to estimate these parameters and adapt the JKO scheme to incorporate the estimated values. To analyze the adopted statistical JKO scheme, we establish an asymptotic theory via stochastic partial differential equations that describes its limiting dynamic behavior. Our framework allows both the sample size used in parameter estimation and the number of algorithmic iterations to go to infinity. This study offers a unified framework for joint computational and statistical asymptotic analysis of the statistical JKO scheme. On the computational side, we examine the scheme's dynamic behavior as the number of iterations increases, while on the statistical side, we investigate the large-sample behavior of the resulting distributions computed through the scheme. We conduct numerical simulations to evaluate the finite-sample performance of the proposed methods and validate the developed asymptotic theory.
Authors: Bin Feng, Meng Zheng, Wei Liang, Lei Zhang
Abstract: In this paper, we propose a generalizable deep neural network model for indoor pathloss radio map prediction (termed as IPP-Net). IPP-Net is based on a UNet architecture and learned from both large-scale ray tracing simulation data and a modified 3GPP indoor hotspot model. The performance of IPP-Net is evaluated in the First Indoor Pathloss Radio Map Prediction Challenge in ICASSP 2025. The evaluation results show that IPP-Net achieves a weighted root mean square error of 9.501 dB on three competition tasks and obtains the second overall ranking.
Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
Abstract: Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation of language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines including MHA, MQA, GQA, and MLA across various metrics, including perplexity and a range of renowned evaluation benchmarks. Notably, TPAs memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
Authors: Zegeye Mekasha Kidane, Waltenegus Dargie
Abstract: A large number of heterogeneous wireless networks share the unlicensed spectrum designated as the ISM (Industry, Scientific, and Medicine) radio band. These networks do not adhere to a common medium access rule and differ in their specifications considerably. As a result, when concurrently active, they cause cross-technology interference (CTI) on each other. The effect of this interference is not reciprocal, the networks using high transmission power and advanced transmission schemes often causing disproportionate disruptions to those with modest communication and computation resources. CTI corrupts packets, incurs packet retransmission cost, introduces end-to-end latency and jitter, and make networks unpredictable. The purpose of this paper is to closely examine its impact on low-power networks which are based on the IEEE 802.15.4 standard. It discusses latest developments on CTI detection, coexistence and avoidance mechanisms as well on messaging schemes which attempt to enable heterogeneous networks directly communicate with one another to coordinate packet transmission and channel assignment.
Authors: Obed Morrison Atsu, Salmane Naoumi, Roberto Bomfin, Marwa Chafii
Abstract: This paper introduces a novel Multi-Agent Reinforcement Learning (MARL) framework to enhance integrated sensing and communication (ISAC) networks using unmanned aerial vehicle (UAV) swarms as sensing radars. By framing the positioning and trajectory optimization of UAVs as a Partially Observable Markov Decision Process, we develop a MARL approach that leverages centralized training with decentralized execution to maximize the overall sensing performance. Specifically, we implement a decentralized cooperative MARL strategy to enable UAVs to develop effective communication protocols, therefore enhancing their environmental awareness and operational efficiency. Additionally, we augment the MARL solution with a transmission power adaptation technique to mitigate interference between the communicating drones and optimize the communication protocol efficiency. Moreover, a transmission power adaptation technique is incorporated to mitigate interference and optimize the learned communication protocol efficiency. Despite the increased complexity, our solution demonstrates robust performance and adaptability across various scenarios, providing a scalable and cost-effective enhancement for future ISAC networks.
Authors: Yousef Heider, Fadi Aldakheel, Wolfgang Ehlers
Abstract: This work introduces a novel application for predicting the macroscopic intrinsic permeability tensor in deformable porous media, using a limited set of micro-CT images of real microgeometries. The primary goal is to develop an efficient, machine-learning (ML)-based method that overcomes the limitations of traditional permeability estimation techniques, which often rely on time-consuming experiments or computationally expensive fluid dynamics simulations. The novelty of this work lies in leveraging Convolutional Neural Networks (CNN) to predict pore-fluid flow behavior under deformation and anisotropic flow conditions. Particularly, the described approach employs binarized CT images of porous micro-structure as inputs to predict the symmetric second-order permeability tensor, a critical parameter in continuum porous media flow modeling. The methodology comprises four key steps: (1) constructing a dataset of CT images from Bentheim sandstone at different volumetric strain levels; (2) performing pore-scale simulations of single-phase flow using the lattice Boltzmann method (LBM) to generate permeability data; (3) training the CNN model with the processed CT images as inputs and permeability tensors as outputs; and (4) exploring techniques to improve model generalization, including data augmentation and alternative CNN architectures. Examples are provided to demonstrate the CNN's capability to accurately predict the permeability tensor, a crucial parameter in various disciplines such as geotechnical engineering, hydrology, and material science. An exemplary source code is made available for interested readers.
Authors: Mehrshad Saadatinia, Minoo Ahmadi, Armin Abdollahi
Abstract: Understanding emotions in videos is a challenging task. However, videos contain several modalities which make them a rich source of data for machine learning and deep learning tasks. In this work, we aim to improve video sentiment classification by focusing on two key aspects: the video itself, the accompanying text, and the acoustic features. To address the limitations of relying on large labeled datasets, we are developing a method that utilizes clustering-based semi-supervised pre-training to extract meaningful representations from the data. This pre-training step identifies patterns in the video and text data, allowing the model to learn underlying structures and relationships without requiring extensive labeled information at the outset. Once these patterns are established, we fine-tune the system in a supervised manner to classify the sentiment expressed in videos. We believe that this multi-modal approach, combining clustering with supervised fine-tuning, will lead to more accurate and insightful sentiment classification, especially in cases where labeled data is limited.
Authors: Aziida Nanyonga, Hassan Wasswa, Ugur Turhan, Oleksandra Molloy, Graham Wild
Abstract: Safety is a critical aspect of the air transport system given even slight operational anomalies can result in serious consequences. To reduce the chances of aviation safety occurrences, accidents and incidents are reported to establish the root cause, propose safety recommendations etc. However, analysis narratives of the pre-accident events are presented using human-understandable, raw, unstructured, text that a computer system cannot understand. The ability to classify and categorise safety occurrences from their textual narratives would help aviation industry stakeholders make informed safety-critical decisions. To classify and categorise safety occurrences, we applied natural language processing (NLP) and AI (Artificial Intelligence) models to process text narratives. The study aimed to answer the question. How well can the damage level caused to the aircraft in a safety occurrence be inferred from the text narrative using natural language processing. The classification performance of various deep learning models including LSTM, BLSTM, GRU, sRNN, and combinations of these models including LSTM and GRU, BLSTM+GRU, sRNN and LSTM, sRNN and BLSTM, sRNN and GRU, sRNN and BLSTM and GRU, and sRNN and LSTM and GRU was evaluated on a set of 27,000 safety occurrence reports from the NTSB. The results of this study indicate that all models investigated performed competitively well recording an accuracy of over 87.9% which is well above the random guess of 25% for a four-class classification problem. Also, the models recorded high precision, recall, and F1 scores above 80%, 88%, and 85%, respectively. sRNN slightly outperformed other single models in terms of recall (90%) and accuracy (90%) while LSTM reported slightly better performance in terms of precision (87%).
Authors: Jianian Wang, Rui Song
Abstract: To represent the causal relationships between variables, a directed acyclic graph (DAG) is widely utilized in many areas, such as social sciences, epidemics, and genetics. Many causal structure learning approaches are developed to learn the hidden causal structure utilizing deep-learning approaches. However, these approaches have a hidden assumption that the causal relationship remains unchanged over time, which may not hold in real life. In this paper, we develop a new framework to model the dynamic causal graph where the causal relations are allowed to be time-varying. We incorporate the basis approximation method into the score-based causal discovery approach to capture the dynamic pattern of the causal graphs. Utilizing the autoregressive model structure, we could capture both contemporaneous and time-lagged causal relationships while allowing them to vary with time. We propose an algorithm that could provide both past-time estimates and future-time predictions on the causal graphs, and conduct simulations to demonstrate the usefulness of the proposed method. We also apply the proposed method for the covid-data analysis, and provide causal estimates on how policy restriction's effect changes.
Authors: Tomohiko Nakamura, Kwanghee Choi, Keigo Hojo, Yoshiaki Bando, Satoru Fukayama, Shinji Watanabe
Abstract: Self-supervised speech models (S3Ms) have become a common tool for the speech processing community, leveraging representations for downstream tasks. Clustering S3M representations yields discrete speech units (DSUs), which serve as compact representations for speech signals. DSUs are typically obtained by k-means clustering. Using DSUs often leads to strong performance in various tasks, including automatic speech recognition (ASR). However, even with the high dimensionality and redundancy of S3M representations, preprocessing S3M representations for better clustering remains unexplored, even though it can affect the quality of DSUs. In this paper, we investigate the potential of linear preprocessing methods for extracting DSUs. We evaluate standardization, principal component analysis, whitening, and independent component analysis (ICA) on DSU-based ASR benchmarks and demonstrate their effectiveness as preprocessing for k-means. We also conduct extensive analyses of their behavior, such as orthogonality or interpretability of individual components of ICA.
Authors: Aziida Nanyonga, Hassan Wasswa, Oleksandra Molloy, Ugur Turhan, Graham Wild
Abstract: The air transport system recognizes the criticality of safety, as even minor anomalies can have severe consequences. Reporting accidents and incidents play a vital role in identifying their causes and proposing safety recommendations. However, the narratives describing pre-accident events are presented in unstructured text that is not easily understood by computer systems. Classifying and categorizing safety occurrences based on these narratives can support informed decision-making by aviation industry stakeholders. In this study, researchers applied natural language processing (NLP) and artificial intelligence (AI) models to process text narratives to classify the flight phases of safety occurrences. The classification performance of two deep learning models, ResNet and sRNN was evaluated, using an initial dataset of 27,000 safety occurrence reports from the NTSB. The results demonstrated good performance, with both models achieving an accuracy exceeding 68%, well above the random guess rate of 14% for a seven-class classification problem. The models also exhibited high precision, recall, and F1 scores. The sRNN model greatly outperformed the simplified ResNet model architecture used in this study. These findings indicate that NLP and deep learning models can infer the flight phase from raw text narratives, enabling effective analysis of safety occurrences.
Authors: Jian Cheng Wong, Abhishek Gupta, Chin Chun Ooi, Pao-Hsiung Chiu, Jiao Liu, Yew-Soon Ong
Abstract: Deep learning models trained on finite data lack a complete understanding of the physical world. On the other hand, physics-informed neural networks (PINNs) are infused with such knowledge through the incorporation of mathematically expressible laws of nature into their training loss function. By complying with physical laws, PINNs provide advantages over purely data-driven models in limited-data regimes. This feature has propelled them to the forefront of scientific machine learning, a domain characterized by scarce and costly data. However, the vision of accurate physics-informed learning comes with significant challenges. This review examines PINNs for the first time in terms of model optimization and generalization, shedding light on the need for new algorithmic advances to overcome issues pertaining to the training speed, precision, and generalizability of today's PINN models. Of particular interest are the gradient-free methods of neuroevolution for optimizing the uniquely complex loss landscapes arising in PINN training. Methods synergizing gradient descent and neuroevolution for discovering bespoke neural architectures and balancing multiple conflicting terms in physics-informed learning objectives are positioned as important avenues for future research. Yet another exciting track is to cast neuroevolution as a meta-learner of generalizable PINN models.
Authors: Vincent Cohen-Addad, Andrew Draganov, Matteo Russo, David Saulpic, Chris Schwiegelshohn
Abstract: We consider coresets for $k$-clustering problems, where the goal is to assign points to centers minimizing powers of distances. A popular example is the $k$-median objective $\sum_{p}\min_{c\in C}dist(p,C)$. Given a point set $P$, a coreset $\Omega$ is a small weighted subset that approximates the cost of $P$ for all candidate solutions $C$ up to a $(1\pm\varepsilon )$ multiplicative factor. In this paper, we give a sharp VC-dimension based analysis for coreset construction. As a consequence, we obtain improved $k$-median coreset bounds for the following metrics: Coresets of size $\tilde{O}\left(k\varepsilon^{-2}\right)$ for shortest path metrics in planar graphs, improving over the bounds $\tilde{O}\left(k\varepsilon^{-6}\right)$ by [Cohen-Addad, Saulpic, Schwiegelshohn, STOC'21] and $\tilde{O}\left(k^2\varepsilon^{-4}\right)$ by [Braverman, Jiang, Krauthgamer, Wu, SODA'21]. Coresets of size $\tilde{O}\left(kd\ell\varepsilon^{-2}\log m\right)$ for clustering $d$-dimensional polygonal curves of length at most $m$ with curves of length at most $\ell$ with respect to Frechet metrics, improving over the bounds $\tilde{O}\left(k^3d\ell\varepsilon^{-3}\log m\right)$ by [Braverman, Cohen-Addad, Jiang, Krauthgamer, Schwiegelshohn, Toftrup, and Wu, FOCS'22] and $\tilde{O}\left(k^2d\ell\varepsilon^{-2}\log m \log |P|\right)$ by [Conradi, Kolbe, Psarros, Rohde, SoCG'24].
Authors: Phillip Rieger, Alessandro Pegoraro, Kavita Kumari, Tigist Abera, Jonathan Knauer, Ahmad-Reza Sadeghi
Abstract: Split Learning (SL) is a distributed deep learning approach enabling multiple clients and a server to collaboratively train and infer on a shared deep neural network (DNN) without requiring clients to share their private local data. The DNN is partitioned in SL, with most layers residing on the server and a few initial layers and inputs on the client side. This configuration allows resource-constrained clients to participate in training and inference. However, the distributed architecture exposes SL to backdoor attacks, where malicious clients can manipulate local datasets to alter the DNN's behavior. Existing defenses from other distributed frameworks like Federated Learning are not applicable, and there is a lack of effective backdoor defenses specifically designed for SL. We present SafeSplit, the first defense against client-side backdoor attacks in Split Learning (SL). SafeSplit enables the server to detect and filter out malicious client behavior by employing circular backward analysis after a client's training is completed, iteratively reverting to a trained checkpoint where the model under examination is found to be benign. It uses a two-fold analysis to identify client-induced changes and detect poisoned models. First, a static analysis in the frequency domain measures the differences in the layer's parameters at the server. Second, a dynamic analysis introduces a novel rotational distance metric that assesses the orientation shifts of the server's layer parameters during training. Our comprehensive evaluation across various data distributions, client counts, and attack scenarios demonstrates the high efficacy of this dual analysis in mitigating backdoor attacks while preserving model utility.
Authors: Haoxiang Gao, Yu Zhao
Abstract: Autonomous driving (AD) has experienced significant improvements in recent years and achieved promising 3D detection, classification, and localization results. However, many challenges remain, e.g. semantic understanding of pedestrians' behaviors, and downstream handling for pedestrian interactions. Recent studies in applications of Large Language Models (LLM) and Vision-Language Models (VLM) have achieved promising results in scene understanding and high-level maneuver planning in diverse traffic scenarios. However, deploying the billion-parameter LLMs to vehicles requires significant computation and memory resources. In this paper, we analyzed effective knowledge distillation of semantic labels to smaller Vision networks, which can be used for the semantic representation of complex scenes for downstream decision-making for planning and control.
Authors: Sanghyun Hong, Fan Wu, Anthony Gruber, Kookjin Lee
Abstract: Neural ordinary differential equations (NODEs) are an emerging paradigm in scientific computing for modeling dynamical systems. By accurately learning underlying dynamics in data in the form of differential equations, NODEs have been widely adopted in various domains, such as healthcare, finance, computer vision, and language modeling. However, there remains a limited understanding of the privacy implications of these fundamentally different models, particularly with regard to their membership inference risks. In this work, we study the membership inference risks associated with NODEs. We first comprehensively evaluate NODEs against membership inference attacks. We show that NODEs are twice as resistant to these privacy attacks compared to conventional feedforward models such as ResNets. By analyzing the variance in membership risks across different NODE models, we identify the factors that contribute to their lower risks. We then demonstrate, both theoretically and empirically, that membership inference risks can be further mitigated by utilizing a stochastic variant of NODEs: Neural stochastic differential equations (NSDEs). We show that NSDEs are differentially-private (DP) learners that provide the same provable privacy guarantees as DP-SGD, the de-facto mechanism for training private models. NSDEs are also effective in mitigating existing membership inference attacks, demonstrating risks comparable to private models trained with DP-SGD while offering an improved privacy-utility trade-off. Moreover, we propose a drop-in-replacement strategy that efficiently integrates NSDEs into conventional feedforward models to enhance their privacy.
Authors: Kun Yang, Jing Yang, Cong Shen
Abstract: In this paper, we address a crucial but often overlooked issue in applying reinforcement learning (RL) to radio resource management (RRM) in wireless communications: the mismatch between the discounted reward RL formulation and the undiscounted goal of wireless network optimization. To the best of our knowledge, we are the first to systematically investigate this discrepancy, starting with a discussion of the problem formulation followed by simulations that quantify the extent of the gap. To bridge this gap, we introduce the use of average reward RL, a method that aligns more closely with the long-term objectives of RRM. We propose a new method called the Average Reward Off policy Soft Actor Critic (ARO SAC) is an adaptation of the well known Soft Actor Critic algorithm in the average reward framework. This new method achieves significant performance improvement our simulation results demonstrate a 15% gain in the system performance over the traditional discounted reward RL approach, underscoring the potential of average reward RL in enhancing the efficiency and effectiveness of wireless network optimization.
Authors: Duy Khanh Lam
Abstract: This paper investigates the investment problem of constructing an optimal no-short sequential portfolio strategy in a market with a latent dependence structure between asset prices and partly unobservable side information, which is often high-dimensional. The results demonstrate that a dynamic strategy, which forms a portfolio based on perfect knowledge of the dependence structure and full market information over time, may not grow at a higher rate infinitely often than a constant strategy, which remains invariant over time. Specifically, if the market is stationary, implying that the dependence structure is statistically stable, the growth rate of an optimal dynamic strategy, utilizing the maximum capacity of the entire market information, almost surely decays over time into an equilibrium state, asymptotically converging to the growth rate of a constant strategy. Technically, this work reassesses the common belief that a constant strategy only attains the optimal limiting growth rate of dynamic strategies when the market process is identically and independently distributed. By analyzing the dynamic log-optimal portfolio strategy as the optimal benchmark in a stationary market with side information, we show that a random optimal constant strategy almost surely exists, even when a limiting growth rate for the dynamic strategy does not. Consequently, two approaches to learning algorithms for portfolio construction are discussed, demonstrating the safety of removing side information from the learning process while still guaranteeing an asymptotic growth rate comparable to that of the optimal dynamic strategy.
Authors: Edouardo Honig, Andrew Lizarraga, Zijun Frank Zhang, Ying Nian Wu
Abstract: Prompt compression is a promising approach to speeding up language model inference without altering the generative model. Prior works compress prompts into smaller sequences of learned tokens using an encoder that is trained as a LowRank Adaptation (LoRA) of the inference language model. However, we show that the encoder does not need to keep the original language model's architecture to achieve useful compression. We introduce the Attention-Only Compressor (AOC), which learns a prompt compression encoder after removing the multilayer perceptron (MLP) layers in the Transformer blocks of a language model, resulting in an encoder with roughly 67% less parameters compared to the original model. Intriguingly we find that, across a range of compression ratios up to 480x, AOC can better regenerate prompts and outperform a baseline compression encoder that is a LoRA of the inference language model without removing MLP layers. These results demonstrate that the architecture of prompt compression encoders does not need to be identical to that of the original decoder language model, paving the way for further research into architectures and approaches for prompt compression.
Authors: Jie Mei, Alejandro Rodriguez-Garcia, Daigo Takeuchi, Gabriel Wainstein, Nina Hubig, Yalda Mohsenzadeh, Srikanth Ramaswamy
Abstract: Continuous, adaptive learning-the ability to adapt to the environment and improve performance-is a hallmark of both natural and artificial intelligence. Biological organisms excel in acquiring, transferring, and retaining knowledge while adapting to dynamic environments, making them a rich source of inspiration for artificial neural networks (ANNs). This study explores how neuromodulation, a fundamental feature of biological learning systems, can help address challenges such as catastrophic forgetting and enhance the robustness of ANNs in continuous learning scenarios. Driven by neuromodulators including dopamine (DA), acetylcholine (ACh), serotonin (5-HT) and noradrenaline (NA), neuromodulatory processes in the brain operate at multiple scales, facilitating dynamic responses to environmental changes through mechanisms ranging from local synaptic plasticity to global network-wide adaptability. Importantly, the relationship between neuromodulators, and their interplay in the modulation of sensory and cognitive processes are more complex than expected, demonstrating a "many-to-one" neuromodulator-to-task mapping. To inspire the design of novel neuromodulation-aware learning rules, we highlight (i) how multi-neuromodulatory interactions enrich single-neuromodulator-driven learning, (ii) the impact of neuromodulators at multiple spatial and temporal scales, and correspondingly, (iii) strategies to integrate neuromodulated learning into or approximate it in ANNs. To illustrate these principles, we present a case study to demonstrate how neuromodulation-inspired mechanisms, such as DA-driven reward processing and NA-based cognitive flexibility, can enhance ANN performance in a Go/No-Go task. By integrating multi-scale neuromodulation, we aim to bridge the gap between biological learning and artificial systems, paving the way for ANNs with greater flexibility, robustness, and adaptability.
Authors: Mengfan Wu, Shenshen Yan, Jie Ren
Abstract: Data-driven machine learning (ML) has demonstrated tremendous potential in material property predictions. However, the scarcity of materials data with costly property labels in the vast chemical space presents a significant challenge for ML in efficiently predicting properties and uncovering structure-property relationships. Here, we propose a novel hierarchy-boosted funnel learning (HiBoFL) framework, which is successfully applied to identify semiconductors with ultralow lattice thermal conductivity ($\kappa_\mathrm{L}$). By training on only a few hundred materials targeted by unsupervised learning from a pool of hundreds of thousands, we achieve efficient and interpretable supervised predictions of ultralow $\kappa_\mathrm{L}$, thereby circumventing large-scale brute-force calculations without clear objectives. As a result, we provide a list of candidates with ultralow $\kappa_\mathrm{L}$ for potential thermoelectric applications and discover a new factor that significantly influences structural anharmonicity. This study offers a novel practical pathway for accelerating the discovery of functional materials.
Authors: Jihoon Park, Jeongin Choe, Dohyun Kim, Jae-Joon Kim
Abstract: Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are currently limited to the case where all weights are assumed to be on-chip. This limitation becomes apparent with the significantly increasing network sizes compared to the in-memory footprint. Weight replacement schemes are essential to address this issue. We propose COMPASS, a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. COMPASS is specially targeted for networks that exceed the capacity of PIM crossbar arrays, necessitating access to external memories. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip. Our scheme takes into account the data dependence between layers, core utilization, and the number of write instructions to minimize latency, memory accesses, and improve energy efficiency. Simulation results demonstrate that COMPASS can accommodate much more networks using a minimal memory footprint, while improving throughput by 1.78X and providing 1.28X savings in energy-delay product (EDP) over baseline partitioning methods.
Authors: Marcos Matabuena
Abstract: Many problems within personalized medicine and digital health rely on the analysis of continuous-time functional biomarkers and other complex data structures emerging from high-resolution patient monitoring. In this context, this work proposes new optimization-based variable selection methods for multivariate, functional, and even more general outcomes in metrics spaces based on best-subset selection. Our framework applies to several types of regression models, including linear, quantile, or non parametric additive models, and to a broad range of random responses, such as univariate, multivariate Euclidean data, functional, and even random graphs. Our analysis demonstrates that our proposed methodology outperforms state-of-the-art methods in accuracy and, especially, in speed-achieving several orders of magnitude improvement over competitors across various type of statistical responses as the case of mathematical functions. While our framework is general and is not designed for a specific regression and scientific problem, the article is self-contained and focuses on biomedical applications. In the clinical areas, serves as a valuable resource for professionals in biostatistics, statistics, and artificial intelligence interested in variable selection problem in this new technological AI-era.
Authors: Haojun Yu, Youcheng Li, Nan Zhang, Zihan Niu, Xuantong Gong, Yanwen Luo, Haotian Ye, Siyu He, Quanlin Wu, Wangyan Qin, Mengyuan Zhou, Jie Han, Jia Tao, Ziwei Zhao, Di Dai, Di He, Dong Wang, Binghui Tang, Ling Huo, James Zou, Qingli Zhu, Yong Wang, Liwei Wang
Abstract: Foundational models have emerged as powerful tools for addressing various tasks in clinical settings. However, their potential development to breast ultrasound analysis remains untapped. In this paper, we present BUSGen, the first foundational generative model specifically designed for breast ultrasound image analysis. Pretrained on over 3.5 million breast ultrasound images, BUSGen has acquired extensive knowledge of breast structures, pathological features, and clinical variations. With few-shot adaptation, BUSGen can generate repositories of realistic and informative task-specific data, facilitating the development of models for a wide range of downstream tasks. Extensive experiments highlight BUSGen's exceptional adaptability, significantly exceeding real-data-trained foundational models in breast cancer screening, diagnosis, and prognosis. In breast cancer early diagnosis, our approach outperformed all board-certified radiologists (n=9), achieving an average sensitivity improvement of 16.5% (P-value<0.0001). Additionally, we characterized the scaling effect of using generated data which was as effective as the collected real-world data for training diagnostic models. Moreover, extensive experiments demonstrated that our approach improved the generalization ability of downstream models. Importantly, BUSGen protected patient privacy by enabling fully de-identified data sharing, making progress forward in secure medical data utilization. An online demo of BUSGen is available at https://aibus.bio.
URLs: https://aibus.bio.
Authors: Sadia Kamal, Tim Oates
Abstract: As deep learning models gain attraction in medical data, ensuring transparent and trustworthy decision-making is essential. In skin cancer diagnosis, while advancements in lesion detection and classification have improved accuracy, the black-box nature of these methods poses challenges in understanding their decision processes, leading to trust issues among physicians. This study leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on different skin lesion datasets, to capture meaningful relationships between visual features and diagnostic criteria terms. To further enhance transparency, we propose a method called MedGrad E-CLIP, which builds on gradient-based E-CLIP by incorporating a weighted entropy mechanism designed for complex medical imaging like skin lesions. This approach highlights critical image regions linked to specific diagnostic descriptions. The developed integrated pipeline not only classifies skin lesions by matching corresponding descriptions but also adds an essential layer of explainability developed especially for medical data. By visually explaining how different features in an image relates to diagnostic criteria, this approach demonstrates the potential of advanced vision-language models in medical image analysis, ultimately improving transparency, robustness, and trust in AI-driven diagnostic systems.
Authors: Alankrita Bhatt, Or Ordentlich, Oron Sabag
Abstract: In online betting, the bookmaker can update the payoffs it offers on a particular event many times before the event takes place, and the updated payoffs may depend on the bets accumulated thus far. We study the problem of bookmaking with the goal of maximizing the return in the worst-case, with respect to the gamblers' behavior and the event's outcome. We formalize this problem as the \emph{Optimal Online Bookmaking game}, and provide the exact solution for the binary case. To this end, we develop the optimal bookmaking strategy, which relies on a new technique called bi-balancing trees, that assures that the house loss is the same for all \emph{decisive} betting sequences, where the gambler bets all its money on a single outcome in each round.
Authors: Lars van der Laan, David Hubbard, Allen Tran, Nathan Kallus, Aur\'elien Bibaut
Abstract: Double reinforcement learning (DRL) enables statistically efficient inference on the value of a policy in a nonparametric Markov Decision Process (MDP) given trajectories generated by another policy. However, this approach necessarily requires stringent overlap between the state distributions, which is often violated in practice. To relax this requirement and extend DRL, we study efficient inference on linear functionals of the $Q$-function (of which policy value is a special case) in infinite-horizon, time-invariant MDPs under semiparametric restrictions on the $Q$-function. These restrictions can reduce the overlap requirement and lower the efficiency bound, yielding more precise estimates. As an important example, we study the evaluation of long-term value under domain adaptation, given a few short trajectories from the new domain and restrictions on the difference between the domains. This can be used for long-term causal inference. Our method combines flexible estimates of the $Q$-function and the Riesz representer of the functional of interest (e.g., the stationary state density ratio for policy value) and is automatic in that we do not need to know the form of the latter - only the functional we care about. To address potential model misspecification bias, we extend the adaptive debiased machine learning (ADML) framework of \citet{van2023adaptive} to construct nonparametrically valid and superefficient estimators that adapt to the functional form of the $Q$-function. As a special case, we propose a novel adaptive debiased plug-in estimator that uses isotonic-calibrated fitted $Q$-iteration - a new calibration algorithm for MDPs - to circumvent the computational challenges of estimating debiasing nuisances from min-max objectives.
Authors: Zhenyu Lei, Yushun Dong, Weiyu Li, Rong Ding, Qi Wang, Jundong Li
Abstract: Large language models (LLMs) have revolutionized scientific research with their exceptional capabilities and transformed various fields. Among their practical applications, LLMs have been playing a crucial role in mitigating threats to human life, infrastructure, and the environment. Despite growing research in disaster LLMs, there remains a lack of systematic review and in-depth analysis of LLMs for natural disaster management. To address the gap, this paper presents a comprehensive survey of existing LLMs in natural disaster management, along with a taxonomy that categorizes existing works based on disaster phases and application scenarios. By collecting public datasets and identifying key challenges and opportunities, this study aims to guide the professional community in developing advanced LLMs for disaster management to enhance the resilience against natural disasters.
Authors: Evgeny Ugolkov, Xupeng He, Hyung Kwak, Hussein Hoteit
Abstract: We develop a procedure for substantially improving the quality of segmented 3D micro-Computed Tomography (micro-CT) images of rocks with a Machine Learning (ML) Generative Model. The proposed model enhances the resolution eightfold (8x) and addresses segmentation inaccuracies due to the overlapping X-ray attenuation in micro-CT measurement for different rock minerals and phases. The proposed generative model is a 3D Deep Convolutional Wasserstein Generative Adversarial Network with Gradient Penalty (3D DC WGAN-GP). The algorithm is trained on segmented 3D low-resolution micro-CT images and segmented unpaired complementary 2D high-resolution Laser Scanning Microscope (LSM) images. The algorithm was demonstrated on multiple samples of Berea sandstones. We achieved high-quality super-resolved 3D images with a resolution of 0.4375 micro-m/voxel and accurate segmentation for constituting minerals and pore space. The described procedure can significantly expand the modern capabilities of digital rock physics.
Authors: Venkatakrishnan Vaidyanathapuram Krishnan, Noel Alben, Anish Nair, Nathaniel Condit-Schultz
Abstract: Music source separation demixes a piece of music into its individual sound sources (vocals, percussion, melodic instruments, etc.), a task with no simple mathematical solution. It requires deep learning methods involving training on large datasets of isolated music stems. The most commonly available datasets are made from commercial Western music, limiting the models' applications to non-Western genres like Carnatic music. Carnatic music is a live tradition, with the available multi-track recordings containing overlapping sounds and bleeds between the sources. This poses a challenge to commercially available source separation models like Spleeter and Hybrid Demucs. In this work, we introduce 'Sanidha', the first open-source novel dataset for Carnatic music, offering studio-quality, multi-track recordings with minimal to no overlap or bleed. Along with the audio files, we provide high-definition videos of the artists' performances. Additionally, we fine-tuned Spleeter, one of the most commonly used source separation models, on our dataset and observed improved SDR performance compared to fine-tuning on a pre-existing Carnatic multi-track dataset. The outputs of the fine-tuned model with 'Sanidha' are evaluated through a listening study.
Authors: Juntao Ren, Priya Sundaresan, Dorsa Sadigh, Sanjiban Choudhury, Jeannette Bohg
Abstract: Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for either human hands or robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT-pi) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT-pi completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT-pi achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website https://portal-cornell.github.io/motion_track_policy/.
URLs: https://portal-cornell.github.io/motion_track_policy/.
Authors: Jannik Graebner, Ryne Beeson
Abstract: Long time-duration low-thrust nonlinear optimal spacecraft trajectory global search is a computationally and time expensive problem characterized by clustering patterns in locally optimal solutions. During preliminary mission design, mission parameters are subject to frequent changes, necessitating that trajectory designers efficiently generate high-quality control solutions for these new scenarios. Generative machine learning models can be trained to learn how the solution structure varies with respect to a conditional parameter, thereby accelerating the global search for missions with updated parameters. In this work, state-of-the-art diffusion models are integrated with the indirect approach for trajectory optimization within a global search framework. This framework is tested on two low-thrust transfers of different complexity in the circular restricted three-body problem. By generating and analyzing a training data set, we develop mathematical relations and techniques to understand the complex structures in the costate domain of locally optimal solutions for these problems. A diffusion model is trained on this data and successfully accelerates the global search for both problems. The model predicts how the costate solution structure changes, based on the maximum spacecraft thrust magnitude. Warm-starting a numerical solver with diffusion model samples for the costates at the initial time increases the number of solutions generated per minute for problems with unseen thrust magnitudes by one to two orders of magnitude in comparison to samples from a uniform distribution and from an adjoint control transformation.
Authors: Benjamin Schiffer, Shirley Zhang
Abstract: We study online fair division when there are a finite number of item types and the player values for the items are drawn randomly from distributions with unknown means. In this setting, a sequence of indivisible items arrives according to a random online process, and each item must be allocated to a single player. The goal is to maximize expected social welfare while maintaining that the allocation satisfies proportionality in expectation. When player values are normalized, we show that it is possible to with high probability guarantee proportionality constraint satisfaction and achieve $\tilde{O}(\sqrt{T})$ regret. To achieve this result, we present an upper confidence bound (UCB) algorithm that uses two rounds of linear optimization. This algorithm highlights fundamental aspects of proportionality constraints that allow for a UCB algorithm despite the presence of many (potentially tight) constraints. This result improves upon the previous best regret rate of $\tilde{O}(T^{2/3})$.
Authors: Xiucheng Wang, Peilin Zheng, Nan Cheng
Abstract: In this paper, a signal detection method based on the denoise diffusion model (DM) is proposed, which outperforms the maximum likelihood (ML) estimation method that has long been regarded as the optimal signal detection technique. Theoretically, a novel mathematical theory for intelligent signal detection based on stochastic differential equations (SDEs) is established in this paper, demonstrating the effectiveness of DM in reducing the additive white Gaussian noise in received signals. Moreover, a mathematical relationship between the signal-to-noise ratio (SNR) and the timestep in DM is established, revealing that for any given SNR, a corresponding optimal timestep can be identified. Furthermore, to address potential issues with out-of-distribution inputs in the DM, we employ a mathematical scaling technique that allows the trained DM to handle signal detection across a wide range of SNRs without any fine-tuning. Building on the above theoretical foundation, we propose a DM-based signal detection method, with the diffusion transformer (DiT) serving as the backbone neural network, whose computational complexity of this method is $\mathcal{O}(n^2)$. Simulation results demonstrate that, for BPSK and QAM modulation schemes, the DM-based method achieves a significantly lower symbol error rate (SER) compared to ML estimation, while maintaining a much lower computational complexity.
Authors: Jialin Wu, Kaikai Pan, Yanjiao Chen, Jiangyi Deng, Shengyuan Pang, Wenyuan Xu
Abstract: Transformer models have excelled in natural language tasks, prompting the vision community to explore their implementation in computer vision problems. However, these models are still influenced by adversarial examples. In this paper, we investigate the attack capabilities of six common adversarial attacks on three pretrained ViT models to reveal the vulnerability of ViT models. To understand and analyse the bias in neural network decisions when the input is adversarial, we use two visualisation techniques that are attention rollout and grad attention rollout. To prevent ViT models from adversarial attack, we propose Protego, a detection framework that leverages the transformer intrinsic capabilities to detection adversarial examples of ViT models. Nonetheless, this is challenging due to a diversity of attack strategies that may be adopted by adversaries. Inspired by the attention mechanism, we know that the token of prediction contains all the information from the input sample. Additionally, the attention region for adversarial examples differs from that of normal examples. Given these points, we can train a detector that achieves superior performance than existing detection methods to identify adversarial examples. Our experiments have demonstrated the high effectiveness of our detection method. For these six adversarial attack methods, our detector's AUC scores all exceed 0.95. Protego may advance investigations in metaverse security.
Authors: Nikola Pavlovic, Sudeep Salgia, Qing Zhao
Abstract: We consider the problem of contextual kernel bandits with stochastic contexts, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space (RKHS). We study this problem under the additional constraint of joint differential privacy, where the agents needs to ensure that the sequence of query points is differentially private with respect to both the sequence of contexts and rewards. We propose a novel algorithm that improves upon the state of the art and achieves an error rate of $\mathcal{O}\left(\sqrt{\frac{\gamma_T}{T}} + \frac{\gamma_T}{T \varepsilon}\right)$ after $T$ queries for a large class of kernel families, where $\gamma_T$ represents the effective dimensionality of the kernel and $\varepsilon > 0$ is the privacy parameter. Our results are based on a novel estimator for the reward function that simultaneously enjoys high utility along with a low-sensitivity to observed rewards and contexts, which is crucial to obtain an order optimal learning performance with improved dependence on the privacy parameter.
Authors: Yee-Fan Tan, Jun Lin Liow, Pei-Sze Tan, Fuad Noman, Raphael C. -W. Phan, Hernando Ombao, Chee-Ming Ting
Abstract: Modern brain imaging technologies have enabled the detailed reconstruction of human brain connectomes, capturing structural connectivity (SC) from diffusion MRI and functional connectivity (FC) from functional MRI. Understanding the intricate relationships between SC and FC is vital for gaining deeper insights into the brain's functional and organizational mechanisms. However, obtaining both SC and FC modalities simultaneously remains challenging, hindering comprehensive analyses. Existing deep generative models typically focus on synthesizing a single modality or unidirectional translation between FC and SC, thereby missing the potential benefits of bi-directional translation, especially in scenarios where only one connectome is available. Therefore, we propose Structural-Functional Connectivity GAN (SFC-GAN), a novel framework for bidirectional translation between SC and FC. This approach leverages the CycleGAN architecture, incorporating convolutional layers to effectively capture the spatial structures of brain connectomes. To preserve the topological integrity of these connectomes, we employ a structure-preserving loss that guides the model in capturing both global and local connectome patterns while maintaining symmetry. Our framework demonstrates superior performance in translating between SC and FC, outperforming baseline models in similarity and graph property evaluations compared to ground truth data, each translated modality can be effectively utilized for downstream classification.
Authors: Nour Makke, Sanjay Chawla
Abstract: Machine learning is rapidly making its path into natural sciences, including high-energy physics. We present the first study that infers, directly from experimental data, a functional form of fragmentation functions. The latter represent a key ingredient to describe physical observables measured in high-energy physics processes that involve hadron production, and predict their values at different energy. Fragmentation functions can not be calculated in theory and have to be determined instead from data. Traditional approaches rely on global fits of experimental data using a pre-assumed functional form inspired from phenomenological models to learn its parameters. This novel approach uses a ML technique, namely symbolic regression, to learn an analytical model from measured charged hadron multiplicities. The function learned by symbolic regression resembles the Lund string function and describes the data well, thus representing a potential candidate for use in global FFs fits. This study represents an approach to follow in such QCD-related phenomenology studies and more generally in sciences.
Authors: Csaba T\'oth, Danilo Jr Dela Cruz, Harald Oberhauser
Abstract: The signature kernel is a positive definite kernel for sequential and temporal data that has become increasingly popular in machine learning applications due to powerful theoretical guarantees, strong empirical performance, and recently introduced various scalable variations. In this chapter, we give a short introduction to $\texttt{KSig}$, a $\texttt{Scikit-Learn}$ compatible Python package that implements various GPU-accelerated algorithms for computing signature kernels, and performing downstream learning tasks. We also introduce a new algorithm based on tensor sketches which gives strong performance compared to existing algorithms. The package is available at $\href{https://github.com/tgcsaba/ksig}{\texttt{https://github.com/tgcsaba/ksig}}$.
URLs: https://github.com/tgcsaba/ksig, https://github.com/tgcsaba/ksig
Authors: Paul Melki (IMS), Lionel Bombrun (IMS), Boubacar Diallo (IMS), J\'er\^ome Dias (IMS), Jean-Pierre da Costa (IMS)
Abstract: Precision agriculture in general, and precision weeding in particular, have greatly benefited from the major advancements in deep learning and computer vision. A large variety of commercial robotic solutions are already available and deployed. However, the adoption by farmers of such solutions is still low for many reasons, an important one being the lack of trust in these systems. This is in great part due to the opaqueness and complexity of deep neural networks and the manufacturers' inability to provide valid guarantees on their performance. Conformal prediction, a well-established methodology in the machine learning community, is an efficient and reliable strategy for providing trustworthy guarantees on the predictions of any black-box model under very minimal constraints. Bridging the gap between the safe machine learning and precision agriculture communities, this article showcases conformal prediction in action on the task of precision weeding through deep learning-based image classification. After a detailed presentation of the conformal prediction methodology and the development of a precision spraying pipeline based on a ''conformalized'' neural network and well-defined spraying decision rules, the article evaluates this pipeline on two real-world scenarios: one under in-distribution conditions, the other reflecting a near out-of-distribution setting. The results show that we are able to provide formal, i.e. certifiable, guarantees on spraying at least 90% of the weeds.
Authors: Laifa Tao, Zhengduo Zhao, Xuesong Wang, Bin Li, Wenchao Zhan, Xuanyuan Su, Shangyu Li, Qixuan Huang, Haifei Liu, Chen Lu, Zhixuan Lian
Abstract: Accurately predicting the remaining useful life (RUL) of rotating machinery, such as bearings, is essential for ensuring equipment reliability and minimizing unexpected industrial failures. Traditional data-driven deep learning methods face challenges in practical settings due to inconsistent training and testing data distributions and limited generalization for long-term predictions.
Authors: Aryan Chaudhari, Ankush Singh, Sanchi Gajbhiye, Pratham Agrawal
Abstract: In this paper we discuss lung cancer detection using hybrid model of Convolutional-Neural-Networks (CNNs) and Support-Vector-Machines-(SVMs) in order to gain early detection of tumors, benign or malignant. The work uses this hybrid model by training upon the Computed Tomography scans (CT scans) as dataset. Using deep learning for detecting lung cancer early is a cutting-edge method.
Authors: Amir Pouya Masoumi, Leo Creedon, Ramen Ghosh, Nimra Munir, Ross McMorrow, Marion McAfee
Abstract: This article discusses the integration of the Artificial Bee Colony (ABC) algorithm with two supervised learning methods, namely Artificial Neural Networks (ANNs) and Adaptive Network-based Fuzzy Inference System (ANFIS), for feature selection from Near-Infrared (NIR) spectra for predicting the molecular weight of medical-grade Polylactic Acid (PLA). During extrusion processing of PLA, in-line NIR spectra were captured along with extrusion process and machine setting data. With a dataset comprising 63 observations and 512 input features, appropriate machine learning tools are essential for interpreting data and selecting features to improve prediction accuracy. Initially, the ABC optimization algorithm is coupled with ANN/ANFIS to forecast PLA molecular weight. The objective functions of the ABC algorithm are to minimize the root mean square error (RMSE) between experimental and predicted PLA molecular weights while also minimizing the number of input features. Results indicate that employing ABC-ANFIS yields the lowest RMSE of 282 Da and identifies four significant parameters (NIR wavenumbers 6158 cm-1, 6310 cm-1, 6349 cm-1, and melt temperature) for prediction. These findings demonstrate the effectiveness of using the ABC algorithm with ANFIS for selecting a minimal set of features to predict PLA molecular weight with high accuracy during processing
Authors: Amir Sartipi, Joaquin Delgado Fernandez, Sergio Potenciano Menci, Alessio Magitteri
Abstract: The integrity of time series data in smart grids is often compromised by missing values due to sensor failures, transmission errors, or disruptions. Gaps in smart meter data can bias consumption analyses and hinder reliable predictions, causing technical and economic inefficiencies. As smart meter data grows in volume and complexity, conventional techniques struggle with its nonlinear and nonstationary patterns. In this context, Generative Artificial Intelligence offers promising solutions that may outperform traditional statistical methods. In this paper, we evaluate two general-purpose Large Language Models and five Time Series Foundation Models for smart meter data imputation, comparing them with conventional Machine Learning and statistical models. We introduce artificial gaps (30 minutes to one day) into an anonymized public dataset to test inference capabilities. Results show that Time Series Foundation Models, with their contextual understanding and pattern recognition, could significantly enhance imputation accuracy in certain cases. However, the trade-off between computational cost and performance gains remains a critical consideration.
Authors: Yuchen Lu, Kun Fang
Abstract: Quantum relative entropy, a quantum generalization of the well-known Kullback-Leibler divergence, serves as a fundamental measure of the distinguishability between quantum states and plays a pivotal role in quantum information science. Despite its importance, efficiently estimating quantum relative entropy between two quantum states on quantum computers remains a significant challenge. In this work, we propose the first quantum algorithm for estimating quantum relative entropy and Petz R\'{e}nyi divergence from two unknown quantum states on quantum computers, addressing open problems highlighted in [Phys. Rev. A 109, 032431 (2024)] and [IEEE Trans. Inf. Theory 70, 5653-5680 (2024)]. This is achieved by combining quadrature approximations of relative entropies, the variational representation of quantum f-divergences, and a new technique for parameterizing Hermitian polynomial operators to estimate their traces with quantum states. Notably, the circuit size of our algorithm is at most 2n+1 with n being the number of qubits in the quantum states and it is directly applicable to distributed scenarios, where quantum states to be compared are hosted on cross-platform quantum computers. We validate our algorithm through numerical simulations, laying the groundwork for its future deployment on quantum hardware devices.
Authors: Tri Kurniawan Wijaya, Edoardo D'Amico, Xinyang Shao
Abstract: [This is a position paper and does not contain any empirical or theoretical results] Recommender systems have become a cornerstone of personalized user experiences, yet their development typically involves significant manual intervention, including dataset-specific feature engineering, hyperparameter tuning, and configuration. To this end, we introduce a novel paradigm: Dataset-Agnostic Recommender Systems (DAReS) that aims to enable a single codebase to autonomously adapt to various datasets without the need for fine-tuning, for a given recommender system task. Central to this approach is the Dataset Description Language (DsDL), a structured format that provides metadata about the dataset's features and labels, and allow the system to understand dataset's characteristics, allowing it to autonomously manage processes like feature selection, missing values imputation, noise removal, and hyperparameter optimization. By reducing the need for domain-specific expertise and manual adjustments, DAReS offers a more efficient and scalable solution for building recommender systems across diverse application domains. It addresses critical challenges in the field, such as reusability, reproducibility, and accessibility for non-expert users or entry-level researchers.
Authors: Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
Abstract: Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs of such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process to outcome-based assessment in BoN Optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on the mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
Authors: Kankana Roy, Lars Kr\"amer, Sebastian Domaschke, Malik Haris, Roland Aydin, Fabian Isensee, Martin Held
Abstract: Learning from tabular data is of paramount importance, as it complements the conventional analysis of image and video data by providing a rich source of structured information that is often critical for comprehensive understanding and decision-making processes. We present Multi-task Contrastive Masked Tabular Modeling (MT-CMTM), a novel method aiming to enhance tabular models by leveraging the correlation between tabular data and corresponding images. MT-CMTM employs a dual strategy combining contrastive learning with masked tabular modeling, optimizing the synergy between these data modalities. Central to our approach is a 1D Convolutional Neural Network with residual connections and an attention mechanism (1D-ResNet-CBAM), designed to efficiently process tabular data without relying on images. This enables MT-CMTM to handle purely tabular data for downstream tasks, eliminating the need for potentially costly image acquisition and processing. We evaluated MT-CMTM on the DVM car dataset, which is uniquely suited for this particular scenario, and the newly developed HIPMP dataset, which connects membrane fabrication parameters with image data. Our MT-CMTM model outperforms the proposed tabular 1D-ResNet-CBAM, which is trained from scratch, achieving a relative 1.48% improvement in relative MSE on HIPMP and a 2.38% increase in absolute accuracy on DVM. These results demonstrate MT-CMTM's robustness and its potential to advance the field of multi-modal learning.
Authors: Daniel Steininger, Julia Simon, Andreas Trondl, Markus Murschitz
Abstract: Timber represents an increasingly valuable and versatile resource. However, forestry operations such as harvesting, handling and measuring logs still require substantial human labor in remote environments posing significant safety risks. Progressively automating these tasks has the potential of increasing their efficiency as well as safety, but requires an accurate detection of individual logs as well as live trees and their context. Although initial approaches have been proposed for this challenging application domain, specialized data and algorithms are still too scarce to develop robust solutions. To mitigate this gap, we introduce the TimberVision dataset, consisting of more than 2k annotated RGB images containing a total of 51k trunk components including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in terms of both quantity and detail by a large margin. Based on this data, we conduct a series of ablation experiments for oriented object detection and instance segmentation and evaluate the influence of multiple scene parameters on model performance. We introduce a generic framework to fuse the components detected by our models for both tasks into unified trunk representations. Furthermore, we automatically derive geometric properties and apply multi-object tracking to further enhance robustness. Our detection and tracking approach provides highly descriptive and accurate trunk representations solely from RGB image data, even under challenging environmental conditions. Our solution is suitable for a wide range of application scenarios and can be readily combined with other sensor modalities.
Authors: Dong Liu, Esther Lopez Ramos
Abstract: Semantic retrieval (also known as dense retrieval) based on textual data has been extensively studied for both web search and product search application fields, where the relevance of a query and a potential target document is computed by their dense vector representation comparison. Product image is crucial for e-commence search interactions and is a key factor for customers at product explorations. But its impact for semantic retrieval has not been well studied yet. In this research, we build a multimodal representation for product items in e-commerece search in contrast to pure-text representation of products, and investigate the impact of such representations. The models are developed and evaluated on e-commerce datasets. We demonstrate that a multimodal representation scheme for a product can show improvement either on purchase recall or relevance accuracy in semantic retrieval. Additionally, we provide numerical analysis for exclusive matches retrieved by a multimodal semantic retrieval model versus a text-only semantic retrieval model, to demonstrate the validation of multimodal solutions.
Authors: Dominic Schuh, Janik Kreit, Evan Berkowitz, Lena Funcke, Thomas Luu, Kim A. Nicoli, Marcel Rodekamp
Abstract: Generative models, particularly normalizing flows, have shown exceptional performance in learning probability distributions across various domains of physics, including statistical mechanics, collider physics, and lattice field theory. In the context of lattice field theory, normalizing flows have been successfully applied to accurately learn the Boltzmann distribution, enabling a range of tasks such as direct estimation of thermodynamic observables and sampling independent and identically distributed (i.i.d.) configurations. In this work, we present a proof-of-concept demonstration that normalizing flows can be used to learn the Boltzmann distribution for the Hubbard model. This model is widely employed to study the electronic structure of graphene and other carbon nanomaterials. State-of-the-art numerical simulations of the Hubbard model, such as those based on Hybrid Monte Carlo (HMC) methods, often suffer from ergodicity issues, potentially leading to biased estimates of physical observables. Our numerical experiments demonstrate that leveraging i.i.d.\ sampling from the normalizing flow effectively addresses these issues.
Authors: Amit Vishwakarma, KS Subrahamanian Moosath
Abstract: In this paper, a distance between the Gaussian Mixture Models(GMMs) is obtained based on an embedding of the K-component Gaussian Mixture Model into the manifold of the symmetric positive definite matrices. Proof of embedding of K-component GMMs into the manifold of symmetric positive definite matrices is given and shown that it is a submanifold. Then, proved that the manifold of GMMs with the pullback of induced metric is isometric to the submanifold with the induced metric. Through this embedding we obtain a general lower bound for the Fisher-Rao metric. This lower bound is a distance measure on the manifold of GMMs and we employ it for the similarity measure of GMMs. The effectiveness of this framework is demonstrated through an experiment on standard machine learning benchmarks, achieving accuracy of 98%, 92%, and 93.33% on the UIUC, KTH-TIPS, and UMD texture recognition datasets respectively.
Authors: Sze Ming Lee, Yunxiao Chen
Abstract: Most statistical models for pairwise comparisons, including the Bradley-Terry (BT) and Thurstone models and many extensions, make a relatively strong assumption of stochastic transitivity. This assumption imposes the existence of an unobserved global ranking among all the players/teams/items and monotone constraints on the comparison probabilities implied by the global ranking. However, the stochastic transitivity assumption does not hold in many real-world scenarios of pairwise comparisons, especially games involving multiple skills or strategies. As a result, models relying on this assumption can have suboptimal predictive performance. In this paper, we propose a general family of statistical models for pairwise comparison data without a stochastic transitivity assumption, substantially extending the BT and Thurstone models. In this model, the pairwise probabilities are determined by a (approximately) low-dimensional skew-symmetric matrix. Likelihood-based estimation methods and computational algorithms are developed, which allow for sparse data with only a small proportion of observed pairs. Theoretical analysis shows that the proposed estimator achieves minimax-rate optimality, which adapts effectively to the sparsity level of the data. The spectral theory for skew-symmetric matrices plays a crucial role in the implementation and theoretical analysis. The proposed method's superiority against the BT model, along with its broad applicability across diverse scenarios, is further supported by simulations and real data analysis.
Authors: Brendan Mallery, James M. Murphy, Shuchin Aeron
Abstract: We consider synthesis and analysis of probability measures using the entropy-regularized Wasserstein-2 cost and its unbiased version, the Sinkhorn divergence. The synthesis problem consists of computing the barycenter, with respect to these costs, of $m$ reference measures given a set of coefficients belonging to the $m$-dimensional simplex. The analysis problem consists of finding the coefficients for the closest barycenter in the Wasserstein-2 distance to a given measure $\mu$. Under the weakest assumptions on the measures thus far in the literature, we compute the derivative of the entropy-regularized Wasserstein-2 cost. We leverage this to establish a characterization of regularized barycenters as solutions to a fixed-point equation for the average of the entropic maps from the barycenter to the reference measures. This characterization yields a finite-dimensional, convex, quadratic program for solving the analysis problem when $\mu$ is a barycenter. It is shown that these coordinates, as well as the value of the barycenter functional, can be estimated from samples with dimension-independent rates of convergence, a hallmark of entropy-regularized optimal transport, and we verify these rates experimentally. We also establish that barycentric coordinates are stable with respect to perturbations in the Wasserstein-2 metric, suggesting a robustness of these coefficients to corruptions. We employ the barycentric coefficients as features for classification of corrupted point cloud data, and show that compared to neural network baselines, our approach is more efficient in small training data regimes.
Authors: Haonan Xu, Alessio Brini
Abstract: This paper applies deep reinforcement learning (DRL) to optimize liquidity provisioning in Uniswap v3, a decentralized finance (DeFi) protocol implementing an automated market maker (AMM) model with concentrated liquidity. We model the liquidity provision task as a Markov Decision Process (MDP) and train an active liquidity provider (LP) agent using the Proximal Policy Optimization (PPO) algorithm. The agent dynamically adjusts liquidity positions by using information about price dynamics to balance fee maximization and impermanent loss mitigation. We use a rolling window approach for training and testing, reflecting realistic market conditions and regime shifts. This study compares the data-driven performance of the DRL-based strategy against common heuristics adopted by small retail LP actors that do not systematically modify their liquidity positions. By promoting more efficient liquidity management, this work aims to make DeFi markets more accessible and inclusive for a broader range of participants. Through a data-driven approach to liquidity management, this work seeks to contribute to the ongoing development of more efficient and user-friendly DeFi markets.
Authors: Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas
Abstract: Automated chest radiographs interpretation requires both accurate disease classification and detailed radiology report generation, presenting a significant challenge in the clinical workflow. Current approaches either focus on classification accuracy at the expense of interpretability or generate detailed but potentially unreliable reports through image captioning techniques. In this study, we present RadAlign, a novel framework that combines the predictive accuracy of vision-language models (VLMs) with the reasoning capabilities of large language models (LLMs). Inspired by the radiologist's workflow, RadAlign first employs a specialized VLM to align visual features with key medical concepts, achieving superior disease classification with an average AUC of 0.885 across multiple diseases. These recognized medical conditions, represented as text-based concepts in the aligned visual-language space, are then used to prompt LLM-based report generation. Enhanced by a retrieval-augmented generation mechanism that grounds outputs in similar historical cases, RadAlign delivers superior report quality with a GREEN score of 0.678, outperforming state-of-the-art methods' 0.634. Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI. Code is available at https://github.com/difeigu/RadAlign.
Authors: Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli\'c, Furu Wei
Abstract: Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
Authors: Serguei Barannikov, Alexander Korotin, Dmitry Oganesyan, Daniil Emtsev, Evgeny Burnaev
Abstract: We propose to study neural networks' loss surfaces by methods of topological data analysis. We suggest to apply barcodes of Morse complexes to explore topology of loss surfaces. An algorithm for calculations of the loss function's barcodes of local minima is described. We have conducted experiments for calculating barcodes of local minima for benchmark functions and for loss surfaces of small neural networks. Our experiments confirm our two principal observations for neural networks' loss surfaces. First, the barcodes of local minima are located in a small lower part of the range of values of neural networks' loss function. Secondly, increase of the neural network's depth and width lowers the barcodes of local minima. This has some natural implications for the neural network's learning and for its generalization properties.
Authors: Yuheng Lei, Yao Lyu, Guojian Zhan, Tao Zhang, Jiangtao Li, Jianyu Chen, Shengbo Eben Li, Sifa Zheng
Abstract: Evolutionary algorithms (EAs) have shown promise in solving sequential decision problems (SDPs) by simplifying them to static optimization problems and searching for the optimal policy parameters in a zeroth-order way. While these methods are highly versatile, they often suffer from high sample complexity due to their ignorance of the underlying temporal structures. In contrast, reinforcement learning (RL) methods typically formulate SDPs as Markov Decision Process (MDP). Although more sample efficient than EAs, RL methods are restricted to differentiable policies and prone to getting stuck in local optima. To address these issues, we propose a novel evolutionary framework Zeroth-Order Actor-Critic (ZOAC). We propose to use step-wise exploration in parameter space and theoretically derive the zeroth-order policy gradient. We further utilize the actor-critic architecture to effectively leverage the Markov property of SDPs and reduce the variance of gradient estimators. In each iteration, ZOAC employs samplers to collect trajectories with parameter space exploration, and alternates between first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM). To evaluate the effectiveness of ZOAC, we apply it to a challenging multi-lane driving task, optimizing the parameters in a rule-based, non-differentiable driving policy that consists of three sub-modules: behavior selection, path planning, and trajectory tracking. We also compare it with gradient-based RL methods on three Gymnasium tasks, optimizing neural network policies with thousands of parameters. Experimental results demonstrate the strong capability of ZOAC in solving SDPs. ZOAC significantly outperforms EAs that treat the problem as static optimization and matches the performance of gradient-based RL methods even without first-order information, in terms of total average return across all tasks.
Authors: Andrew Bolt, Carolyn Huston, Petra Kuhnert, Joel Janek Dabrowski, James Hilton, Conrad Sanderson
Abstract: Computational simulations of wildfire spread typically employ empirical rate-of-spread calculations under various conditions (such as terrain, fuel type, weather). Small perturbations in conditions can often lead to significant changes in fire spread (such as speed and direction), necessitating a computationally expensive large set of simulations to quantify uncertainty. Model emulation seeks alternative representations of physical models using machine learning, aiming to provide more efficient and/or simplified surrogate models. We propose a dedicated spatio-temporal neural network based framework for model emulation, able to capture the complex behaviour of fire spread models. The proposed approach can approximate forecasts at fine spatial and temporal resolutions that are often challenging for neural network based approaches. Furthermore, the proposed approach is robust even with small training sets, due to novel data augmentation methods. Empirical experiments show good agreement between simulated and emulated firefronts, with an average Jaccard score of 0.76.
Authors: Jun Zhang, Shuyang Jiang, Jiangtao Feng, Lin Zheng, Lingpeng Kong
Abstract: Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve transformer's efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely-used benchmark to test these efficient methods' capability on long-range modeling is Long Range Arena (LRA). However, LRA only focuses on the standard bidirectional (or noncausal) self attention, and completely ignores cross attentions and unidirectional (or causal) attentions, which are equally important to downstream applications. In this paper, we propose Comprehensive Attention Benchmark (CAB) under a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attentions. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Among these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performances of nine widely-used efficient attention architectures designed with different philosophies on CAB. Extensive experimental results also shed light on the fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.
Authors: Shengchao Liu, Yanjing Li, Zhuoxinran Li, Anthony Gitter, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Arvind Ramanathan, Chaowei Xiao, Jian Tang, Hongyu Guo, Anima Anandkumar
Abstract: Current AI-assisted protein design mainly utilizes protein sequential and structural information. Meanwhile, there exists tremendous knowledge curated by humans in the text format describing proteins' high-level functionalities. Yet, whether the incorporation of such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
Authors: Paiheng Xu, Yuhang Zhou, Bang An, Wei Ai, Furong Huang
Abstract: Given the growing concerns about fairness in machine learning and the impressive performance of Graph Neural Networks (GNNs) on graph data learning, algorithmic fairness in GNNs has attracted significant attention. While many existing studies improve fairness at the group level, only a few works promote individual fairness, which renders similar outcomes for similar individuals. A desirable framework that promotes individual fairness should (1) balance between fairness and performance, (2) accommodate two commonly-used individual similarity measures (externally annotated and computed from input features), (3) generalize across various GNN models, and (4) be computationally efficient. Unfortunately, none of the prior work achieves all the desirables. In this work, we propose a novel method, GFairHint, which promotes individual fairness in GNNs and achieves all aforementioned desirables. GFairHint learns fairness representations through an auxiliary link prediction task, and then concatenates the representations with the learned node embeddings in original GNNs as a "fairness hint". Through extensive experimental investigations on five real-world graph datasets under three prevalent GNN models covering both individual similarity measures above, GFairHint achieves the best fairness results in almost all combinations of datasets with various backbone models, while generating comparable utility results, with much less computational cost compared to the previous state-of-the-art (SoTA) method.
Authors: Yue Wan, Jialu Wu, Tingjun Hou, Chang-Yu Hsieh, Xiaowei Jia
Abstract: Reliable molecular property prediction is essential for various scientific endeavors and industrial applications, such as drug discovery. However, the data scarcity, combined with the highly non-linear causal relationships between physicochemical and biological properties and conventional molecular featurization schemes, complicates the development of robust molecular machine learning models. Self-supervised learning (SSL) has emerged as a popular solution, utilizing large-scale, unannotated molecular data to learn a foundational representation of chemical space that might be advantageous for downstream tasks. Yet, existing molecular SSL methods largely overlook chemical knowledge, including molecular structure similarity, scaffold composition, and the context-dependent aspects of molecular properties when operating over the chemical space. They also struggle to learn the subtle variations in structure-activity relationship. This paper introduces a novel pre-training framework that learns robust and generalizable chemical knowledge. It leverages the structural hierarchy within the molecule, embeds them through distinct pre-training tasks across channels, and aggregates channel information in a task-specific manner during fine-tuning. Our approach demonstrates competitive performance across various molecular property benchmarks and offers strong advantages in particularly challenging yet ubiquitous scenarios like activity cliffs.
Authors: Sahil Verma, Gantavya Bhatt, Avi Schwarzschild, Soumye Singhal, Arnav Mohanty Das, Chirag Shah, John P Dickerson, Pin-Yu Chen, Jeff Bilmes
Abstract: Despite the advanced capabilities of contemporary machine learning (ML) models, they remain vulnerable to adversarial and backdoor attacks. This vulnerability is particularly concerning in real-world deployments, where compromised models may exhibit unpredictable behavior in critical scenarios. Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for training multimodal models, as these datasets may harbor backdoors. Various techniques have been proposed to mitigate the effects of backdooring in multimodal models, such as CleanCLIP, which is the current state-of-the-art approach. In this work, we demonstrate that the efficacy of CleanCLIP in mitigating backdoors is highly dependent on the particular objective used during model pre-training. We observe that stronger pre-training objectives that lead to higher zero-shot classification performance correlate with harder to remove backdoors behaviors. We show this by training multimodal models on two large datasets consisting of 3 million (CC3M) and 6 million (CC6M) datapoints, under various pre-training objectives, followed by poison removal using CleanCLIP. We find that CleanCLIP, even with extensive hyperparameter tuning, is ineffective in poison removal when stronger pre-training objectives are used. Our findings underscore critical considerations for ML practitioners who train models using large-scale web-curated data and are concerned about potential backdoor threats.
Authors: Peter J. Ehlers, Hendra I. Nurdin, Daniel Soh
Abstract: Reservoir computing, using nonlinear dynamical systems, offers a cost-effective alternative to neural networks for complex tasks involving processing of sequential data, time series modeling, and system identification. Echo state networks (ESNs), a type of reservoir computer, mirror neural networks but simplify training. They apply fixed, random linear transformations to the internal state, followed by nonlinear changes. This process, guided by input signals and linear regression, adapts the system to match target characteristics, reducing computational demands. A potential drawback of ESNs is that the fixed reservoir may not offer the complexity needed for specific problems. While directly altering (training) the internal ESN would reintroduce the computational burden, an indirect modification can be achieved by redirecting some output as input. This feedback can influence the internal reservoir state, yielding ESNs with enhanced complexity suitable for broader challenges. In this paper, we demonstrate that by feeding some component of the reservoir state back into the network through the input, we can drastically improve upon the performance of a given ESN. We rigorously prove that, for any given ESN, feedback will almost always improve the accuracy of the output. For a set of three tasks, each representing different problem classes, we find that with feedback the average error measures are reduced by $30\%-60\%$. Remarkably, feedback provides at least an equivalent performance boost to doubling the initial number of computational nodes, a computationally expensive and technologically challenging alternative. These results demonstrate the broad applicability and substantial usefulness of this feedback scheme.
Authors: Md Abrar Jahin, Saleh Akram Naife, Anik Kumar Saha, M. F. Mridha
Abstract: Supply chain risk assessment (SCRA) has witnessed a profound evolution through the integration of artificial intelligence (AI) and machine learning (ML) techniques, revolutionizing predictive capabilities and risk mitigation strategies. The significance of this evolution stems from the critical role of robust risk management strategies in ensuring operational resilience and continuity within modern supply chains. Previous reviews have outlined established methodologies but have overlooked emerging AI/ML techniques, leaving a notable research gap in understanding their practical implications within SCRA. This paper conducts a systematic literature review combined with a comprehensive bibliometric analysis. We meticulously examined 1,439 papers and derived key insights from a select group of 51 articles published between 2015 and 2024. The review fills this research gap by addressing pivotal research questions and exploring existing AI/ML techniques, methodologies, findings, and future trajectories, thereby providing a more encompassing view of the evolving landscape of SCRA. Our study unveils the transformative impact of AI/ML models, such as Random Forest, XGBoost, and hybrids, in substantially enhancing precision within SCRA. It underscores adaptable post-COVID strategies, advocating for resilient contingency plans and aligning with evolving risk landscapes. Significantly, this review surpasses previous examinations by accentuating emerging AI/ML techniques and their practical implications within SCRA. Furthermore, it highlights the contributions through a comprehensive bibliometric analysis, revealing publication trends, influential authors, and highly cited articles.
Authors: Lanqing Li, Hai Zhang, Xinyu Zhang, Shatong Zhu, Yang Yu, Junqiao Zhao, Pheng-Ann Heng
Abstract: As a marriage between offline RL and meta-RL, the advent of offline meta-reinforcement learning (OMRL) has shown great promise in enabling RL agents to multi-task and quickly adapt while acquiring knowledge safely. Among which, context-based OMRL (COMRL) as a popular paradigm, aims to learn a universal policy conditioned on effective task representations. In this work, by examining several key milestones in the field of COMRL, we propose to integrate these seemingly independent methodologies into a unified framework. Most importantly, we show that the pre-existing COMRL algorithms are essentially optimizing the same mutual information objective between the task variable $M$ and its latent representation $Z$ by implementing various approximate bounds. Such theoretical insight offers ample design freedom for novel algorithms. As demonstrations, we propose a supervised and a self-supervised implementation of $I(Z; M)$, and empirically show that the corresponding optimization algorithms exhibit remarkable generalization across a broad spectrum of RL benchmarks, context shift scenarios, data qualities and deep learning architectures. This work lays the information theoretic foundation for COMRL methods, leading to a better understanding of task representation learning in the context of reinforcement learning. Given its generality, we envision our framework as a promising offline pre-training paradigm of foundation models for decision making.
Authors: Vijaya Krishna Yalavarthi, Randolf Scholz, Stefan Born, Lars Schmidt-Thieme
Abstract: Probabilistic forecasting of irregularly sampled multivariate time series with missing values is an important problem in many fields, including health care, astronomy, and climate. State-of-the-art methods for the task estimate only marginal distributions of observations in single channels and at single timepoints, assuming a fixed-shape parametric distribution. In this work, we propose a novel model, ProFITi, for probabilistic forecasting of irregularly sampled time series with missing values using conditional normalizing flows. The model learns joint distributions over the future values of the time series conditioned on past observations and queried channels and times, without assuming any fixed shape of the underlying distribution. As model components, we introduce a novel invertible triangular attention layer and an invertible non-linear activation function on and onto the whole real line. We conduct extensive experiments on four datasets and demonstrate that the proposed model provides $4$ times higher likelihood over the previously best model.
Authors: Sunghyeon Woo, Baeseong Park, Byeongwook Kim, Minjung Jo, Se Jung Kwon, Dongsuk Jeon, Dongsoo Lee
Abstract: Large language models (LLMs) have achieved significant success across various domains. However, training these LLMs typically involves substantial memory and computational costs during both forward and backward propagation. While parameter-efficient fine-tuning (PEFT) considerably reduces the training memory associated with parameters, it does not address the significant computational costs and activation memory. In this paper, we propose Dropping Backward Propagation (DropBP), a novel approach designed to reduce computational costs and activation memory while maintaining accuracy. DropBP randomly drops layers during backward propagation, which is essentially equivalent to training shallow submodules generated by undropped layers and residual connections. Additionally, DropBP calculates the sensitivity of each layer to assign an appropriate drop rate, thereby stabilizing the training process. DropBP is not only applicable to full fine-tuning but can also be orthogonally integrated with all types of PEFT by dropping layers during backward propagation. Specifically, DropBP can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5x, and enable training with a sequence length 6.2x larger on a single NVIDIA-A100 GPU. Furthermore, our DropBP enabled a throughput increase of 79% on a NVIDIA A100 GPU and 117% on an Intel Gaudi2 HPU. The code is available at https://github.com/WooSunghyeon/dropbp.
Authors: Ali Beikmohammadi, Sarit Khirirat, Sindri Magn\'usson
Abstract: Parallel stochastic gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes. However, obtaining unbiased stochastic gradients, which have been the focus of most theoretical research, is challenging in many distributed machine learning applications. The gradient estimations easily become biased, for example, when gradients are compressed or clipped, when data is shuffled, and in meta-learning and reinforcement learning. In this work, we establish worst-case bounds on parallel momentum methods under biased gradient estimation on both general non-convex and $\mu$-PL problems. Our analysis covers general distributed optimization problems, and we work out the implications for special cases where gradient estimates are biased, i.e. in meta-learning and when the gradients are compressed or clipped. Our numerical experiments verify our theoretical findings and show faster convergence performance of momentum methods than traditional biased gradient descent.
Authors: Arthur Thuy, Dries F. Benoit
Abstract: Image classification with neural networks (NNs) is widely used in industrial processes, situations where the model likely encounters unknown objects during deployment, i.e., out-of-distribution (OOD) data. Worryingly, NNs tend to make confident yet incorrect predictions when confronted with OOD data. To increase the models' reliability, they should quantify the uncertainty in their own predictions, communicating when the output should (not) be trusted. Deep ensembles, composed of multiple independent NNs, have been shown to perform strongly but are computationally expensive. Recent research has proposed more efficient NN ensembles, namely the snapshot, batch, and multi-input multi-output ensemble. This study investigates the predictive and uncertainty performance of efficient NN ensembles in the context of image classification for industrial processes. It is the first to provide a comprehensive comparison and it proposes a novel Diversity Quality metric to quantify the ensembles' performance on the in-distribution and OOD sets in one single metric. The results highlight the batch ensemble as a cost-effective and competitive alternative to the deep ensemble. It matches the deep ensemble in both uncertainty and accuracy while exhibiting considerable savings in training time, test time, and memory storage.
Authors: Jaeill Kim, Wonseok Lee, Moonjung Eo, Wonjong Rhee
Abstract: Class Incremental Learning (CIL) constitutes a pivotal subfield within continual learning, aimed at enabling models to progressively learn new classification tasks while retaining knowledge obtained from prior tasks. Although previous studies have predominantly focused on backward compatible approaches to mitigate catastrophic forgetting, recent investigations have introduced forward compatible methods to enhance performance on novel tasks and complement existing backward compatible methods. In this study, we introduce an effective-Rank based Feature Richness enhancement (RFR) method, designed for improving forward compatibility. Specifically, this method increases the effective rank of representations during the base session, thereby facilitating the incorporation of more informative features pertinent to unseen novel tasks. Consequently, RFR achieves dual objectives in backward and forward compatibility: minimizing feature extractor modifications and enhancing novel task performance, respectively. To validate the efficacy of our approach, we establish a theoretical connection between effective rank and the Shannon entropy of representations. Subsequently, we conduct comprehensive experiments by integrating RFR into eleven well-known CIL methods. Our results demonstrate the effectiveness of our approach in enhancing novel-task performance while mitigating catastrophic forgetting. Furthermore, our method notably improves the average incremental accuracy across all eleven cases examined.
Authors: Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong
Abstract: Sequence parallelism (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods like linear attention, existing SP approaches do not take advantage of their right-product-first feature, resulting in sub-optimal communication efficiency and usability. In this paper, we introduce Linear Attention Sequence Parallelism (LASP), an efficient SP approach designed for linear attention-based transformer models. Specifically, we design an efficient point-to-point ring-style communication mechanism to leverage the right-product kernel trick of linear attention, which sharply decreases the communication overhead, comparing with existing SP methods. We enhance the computation efficiency of LASP by performing kernel fusion and intermediate state caching, making the implementation of LASP hardware-friendly on GPUs. Furthermore, we meticulously ensure the compatibility of sequence-level LASP with all types of batch-level data parallel methods, which is vital for distributed training on large clusters with very-long sequences. We also discuss the generalization of LASP on other linear sequence modeling methods. Extensive experiments on linear attention-based models are conducted with varying sequence lengths from 2K to 4096K. LASP scales sequence length up to 4096K on 128 GPUs, which is 8$\times$ longer than existing SP methods. The code is available at https://github.com/OpenNLPLab/LASP.
Authors: Rongguang Ye, Wei-Bin Kou, Ming Tang
Abstract: Fairness in federated learning has emerged as a critical concern, aiming to develop an unbiased model among groups (e.g., male or female) of diverse sensitive features. However, there is a trade-off between model performance and fairness, i.e., improving model fairness will decrease model performance. Existing approaches have characterized such a trade-off by introducing hyperparameters to quantify client's preferences for model fairness and model performance. Nevertheless, these approaches are limited to scenarios where each client has only a single pre-defined preference, and fail to work in practical systems where each client generally has multiple preferences. To this end, we propose a Preference-aware scheme in Fair Federated Learning (called PraFFL) to generate preference-specific models in real time. PraFFL can adaptively adjust the model based on each client's preferences to meet their needs. We theoretically prove that PraFFL can offer the optimal model tailored to an arbitrary preference of each client, and show its linear convergence. Experimental results show that our proposed PraFFL outperforms six fair federated learning algorithms in terms of the model's capability of adapting to clients' different preferences. Our implementation is available at https://github.com/rG223/PraFFL.
Authors: Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno
Abstract: The LLaMA family, a collection of foundation language models ranging from 7B to 65B parameters, has become one of the most powerful open-source large language models (LLMs) and the popular LLM backbone of multi-modal large language models (MLLMs), widely used in computer vision and natural language understanding tasks. In particular, LLaMA3 models have recently been released and have achieved impressive performance in various domains with super-large scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-constrained scenarios, we explore LLaMA3's capabilities when quantized to low bit-width. This exploration can potentially provide new insights and challenges for the low-bit quantization of LLaMA3 and other future LLMs, especially in addressing performance degradation issues that suffer in LLM compression. Specifically, we comprehensively evaluate the 10 existing post-training quantization and LoRA fine-tuning (LoRA-FT) methods of LLaMA3 on 1-8 bits and various datasets to reveal the low-bit quantization performance of LLaMA3. To uncover the capabilities of low-bit quantized MLLM, we assessed the performance of the LLaMA3-based LLaVA-Next-8B model under 2-4 ultra-low bits with post-training quantization methods. Our experimental results indicate that LLaMA3 still suffers from non-negligible degradation in linguistic and visual contexts, particularly under ultra-low bit widths. This highlights the significant performance gap at low bit-width that needs to be addressed in future developments. We expect that this empirical study will prove valuable in advancing future models, driving LLMs and MLLMs to achieve higher accuracy at lower bit to enhance practicality. Our project is released on https://github.com/Macaronlin/LLaMA3-Quantization , and quantized models are released at https://huggingface.co/Efficient-ML .
URLs: https://github.com/Macaronlin/LLaMA3-Quantization, https://huggingface.co/Efficient-ML
Authors: Jiangce Chen, Wenzhuo Xu, Zeda Xu, Noelia Grande Guti\'errez, Sneha Prabha Narra, Christopher McComb
Abstract: Time-dependent partial differential equations (PDEs) for classic physical systems are established based on the conservation of mass, momentum, and energy, which are ubiquitous in scientific and engineering applications. These PDEs are strictly local-dependent according to the principle of locality in physics, which means that the evolution at a point is only influenced by the neighborhood around it whose size is determined by the length of timestep multiplied with the speed of characteristic information traveling in the system. However, deep learning architecture cannot strictly enforce the local-dependency as it inevitably increases the scope of information to make local predictions as the number of layers increases. Under limited training data, the extra irrelevant information results in sluggish convergence and compromised generalizability. This paper aims to solve this problem by proposing a data decomposition method to strictly limit the scope of information for neural operators making local predictions, which is called data decomposition enforcing local-dependency (DDELD). The numerical experiments over multiple physical phenomena show that DDELD significantly accelerates training convergence and reduces test errors of benchmark models on large-scale engineering simulations.
Authors: Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, Zhi-Qin John Xu
Abstract: Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate. In this work, we investigate the mechanisms of how transformers behave on unseen compositional tasks. We discover that the parameter initialization scale plays a critical role in determining whether the model learns inferential (reasoning-based) solutions, which capture the underlying compositional primitives, or symmetric (memory-based) solutions, which simply memorize mappings without understanding the compositional structure. By analyzing the information flow and vector representations within the model, we reveal the distinct mechanisms underlying these solution types. We further find that inferential (reasoning-based) solutions exhibit low complexity bias, which we hypothesize is a key factor enabling them to learn individual mappings for single anchors. We validate our conclusions on various real-world datasets. Our findings provide valuable insights into the role of initialization scale in tuning the reasoning and memorizing ability and we propose the initialization rate $\gamma$ to be a convenient tunable hyper-parameter in common deep learning frameworks, where $1/d_{\mathrm{in}}^\gamma$ is the standard deviation of parameters of the layer with $d_{\mathrm{in}}$ input neurons.
Authors: Zijiang Yan, Hina Tabassum
Abstract: We develop a novel multi-objective reinforcement learning (MORL) framework to jointly optimize wireless network selection and autonomous driving policies in a multi-band vehicular network operating on conventional sub-6GHz spectrum and Terahertz frequencies. The proposed framework is designed to 1. maximize the traffic flow and 2. minimize collisions by controlling the vehicle's motion dynamics (i.e., speed and acceleration), and enhance the ultra-reliable low-latency communication (URLLC) while minimizing handoffs (HOs). We cast this problem as a multi-objective Markov Decision Process (MOMDP) and develop solutions for both predefined and unknown preferences of the conflicting objectives. Specifically, deep-Q-network and double deep-Q-network-based solutions are developed first that consider scalarizing the transportation and telecommunication rewards using predefined preferences. We then develop a novel envelope MORL solution which develop policies that address multiple objectives with unknown preferences to the agent. While this approach reduces reliance on scalar rewards, policy effectiveness varying with different preferences is a challenge. To address this, we apply a generalized version of the Bellman equation and optimize the convex envelope of multi-objective Q values to learn a unified parametric representation capable of generating optimal policies across all possible preference configurations. Following an initial learning phase, our agent can execute optimal policies under any specified preference or infer preferences from minimal data samples.Numerical results validate the efficacy of the envelope-based MORL solution and demonstrate interesting insights related to the inter-dependency of vehicle motion dynamics, HOs, and the communication data rate. The proposed policies enable autonomous vehicles to adopt safe driving behaviors with improved connectivity.
Authors: Wanghan Xu, Fenghua Ling, Wenlong Zhang, Tao Han, Hao Chen, Wanli Ouyang, Lei Bai
Abstract: Data-driven artificial intelligence (AI) models have made significant advancements in weather forecasting, particularly in medium-range and nowcasting. However, most data-driven weather forecasting models are black-box systems that focus on learning data mapping rather than fine-grained physical evolution in the time dimension. Consequently, the limitations in the temporal scale of datasets prevent these models from forecasting at finer time scales. This paper proposes a physics-AI hybrid model (i.e., WeatherGFT) which generalizes weather forecasts to finer-grained temporal scales beyond training dataset. Specifically, we employ a carefully designed PDE kernel to simulate physical evolution on a small time scale (e.g., 300 seconds) and use a parallel neural networks with a learnable router for bias correction. Furthermore, we introduce a lead time-aware training framework to promote the generalization of the model at different lead times. The weight analysis of physics-AI modules indicates that physics conducts major evolution while AI performs corrections adaptively. Extensive experiments show that WeatherGFT trained on an hourly dataset, effectively generalizes forecasts across multiple time scales, including 30-minute, which is even smaller than the dataset's temporal resolution.
Authors: Shukai Duan, Heng Ping, Nikos Kanakaris, Xiongye Xiao, Panagiotis Kyriakis, Nesreen K. Ahmed, Peiyu Zhang, Guixiang Ma, Mihai Capota, Shahin Nazarian, Theodore L. Willke, Paul Bogdan
Abstract: Computation graphs are Directed Acyclic Graphs (DAGs) where the nodes correspond to mathematical operations and are used widely as abstractions in optimizations of neural networks. The device placement problem aims to identify optimal allocations of those nodes to a set of (potentially heterogeneous) devices. Existing approaches rely on two types of architectures known as grouper-placer and encoder-placer, respectively. In this work, we bridge the gap between encoder-placer and grouper-placer techniques and propose a novel framework for the task of device placement, relying on smaller computation graphs extracted from the OpenVINO toolkit. The framework consists of five steps, including graph coarsening, node representation learning and policy optimization. It facilitates end-to-end training and takes into account the DAG nature of the computation graphs. We also propose a model variant, inspired by graph parsing networks and complex network analysis, enabling graph representation learning and jointed, personalized graph partitioning, using an unspecified number of groups. To train the entire framework, we use reinforcement learning using the execution time of the placement as a reward. We demonstrate the flexibility and effectiveness of our approach through multiple experiments with three benchmark models, namely Inception-V3, ResNet, and BERT. The robustness of the proposed framework is also highlighted through an ablation study. The suggested placements improve the inference speed for the benchmark models by up to 58.2% over CPU execution and by up to 60.24% compared to other commonly used baselines.
Authors: Sujai Hiremath, Jacqueline R. M. A. Maasch, Mengxiao Gao, Promit Ghosal, Kyra Gan
Abstract: Learning the unique directed acyclic graph corresponding to an unknown causal model is a challenging task. Methods based on functional causal models can identify a unique graph, but either suffer from the curse of dimensionality or impose strong parametric assumptions. To address these challenges, we propose a novel hybrid approach for global causal discovery in observational data that leverages local causal substructures. We first present a topological sorting algorithm that leverages ancestral relationships in linear structural causal models to establish a compact top-down hierarchical ordering, encoding more causal information than linear orderings produced by existing methods. We demonstrate that this approach generalizes to nonlinear settings with arbitrary noise. We then introduce a nonparametric constraint-based algorithm that prunes spurious edges by searching for local conditioning sets, achieving greater accuracy than current methods. We provide theoretical guarantees for correctness and worst-case polynomial time complexities, with empirical validation on synthetic data.
Authors: Antoine Gonon, Nicolas Brisebarre, Elisa Riccietti, R\'emi Gribonval
Abstract: Lipschitz bounds on neural network parameterizations are important to establish generalization, quantization or pruning guarantees, as they control the robustness of the network with respect to parameter changes. Yet, there are few Lipschitz bounds with respect to parameters in the literature, and existing ones only apply to simple feedforward architectures, while also failing to capture the intrinsic rescaling-symmetries of ReLU networks. This paper proves a new Lipschitz bound in terms of the so-called path-metrics of the parameters. Since this bound is intrinsically invariant with respect to the rescaling symmetries of the networks, it sharpens previously known Lipschitz bounds. It is also, to the best of our knowledge, the first bound of its kind that is broadly applicable to modern networks such as ResNets, VGGs, U-nets, and many more.
Authors: Jihao Andreas Lin, Shreyas Padhy, Bruno Mlodozeniec, Javier Antor\'an, Jos\'e Miguel Hern\'andez-Lobato
Abstract: Scaling hyperparameter optimisation to very large datasets remains an open problem in the Gaussian process community. This paper focuses on iterative methods, which use linear system solvers, like conjugate gradients, alternating projections or stochastic gradient descent, to construct an estimate of the marginal likelihood gradient. We discuss three key improvements which are applicable across solvers: (i) a pathwise gradient estimator, which reduces the required number of solver iterations and amortises the computational cost of making predictions, (ii) warm starting linear system solvers with the solution from the previous step, which leads to faster solver convergence at the cost of negligible bias, (iii) early stopping linear system solvers after a limited computational budget, which synergises with warm starting, allowing solver progress to accumulate over multiple marginal likelihood steps. These techniques provide speed-ups of up to $72\times$ when solving to tolerance, and decrease the average residual norm by up to $7\times$ when stopping early.
Authors: Ziqi Zhang, Zifeng Zhuang, Jingzehua Xu, Yiyuan Yang, Yubo Huang, Donglin Wang, Shuai Zhang
Abstract: We propose a novel one-step supervised imitation learning (IL) framework called Adversarial Density Regression (ADR). This IL framework aims to correct the policy learned on unknown-quality to match the expert distribution by utilizing demonstrations, without relying on the Bellman operator. Specifically, ADR addresses several limitations in previous IL algorithms: First, most IL algorithms are based on the Bellman operator, which inevitably suffer from cumulative offsets from sub-optimal rewards during multi-step update processes. Additionally, off-policy training frameworks suffer from Out-of-Distribution (OOD) state-actions. Second, while conservative terms help solve the OOD issue, balancing the conservative term is difficult. To address these limitations, we fully integrate a one-step density-weighted Behavioral Cloning (BC) objective for IL with auxiliary imperfect demonstration. Theoretically, we demonstrate that this adaptation can effectively correct the distribution of policies trained on unknown-quality datasets to align with the expert policy's distribution. Moreover, the difference between the empirical and the optimal value function is proportional to the upper bound of ADR's objective, indicating that minimizing ADR's objective is akin to approaching the optimal value. Experimentally, we validated the performance of ADR by conducting extensive evaluations. Specifically, ADR outperforms all of the selected IL algorithms on tasks from the Gym-Mujoco domain. Meanwhile, it achieves an 89.5% improvement over IQL when utilizing ground truth rewards on tasks from the Adroit and Kitchen domains. Our codebase will be released at: https://github.com/stevezhangzA/Adverserial_Density_Regression.
URLs: https://github.com/stevezhangzA/Adverserial_Density_Regression.
Authors: Siddarth Venkatraman, Moksh Jain, Luca Scimeca, Minsu Kim, Marcin Sendera, Mohsin Hasan, Luke Rowe, Sarthak Mittal, Pablo Lemos, Emmanuel Bengio, Alexandre Adam, Jarrid Rector-Brooks, Yoshua Bengio, Glen Berseth, Nikolay Malkin
Abstract: Diffusion models have emerged as effective distribution estimators in vision, language, and reinforcement learning, but their use as priors in downstream tasks poses an intractable posterior inference problem. This paper studies amortized sampling of the posterior over data, $\mathbf{x}\sim p^{\rm post}(\mathbf{x})\propto p(\mathbf{x})r(\mathbf{x})$, in a model that consists of a diffusion generative model prior $p(\mathbf{x})$ and a black-box constraint or likelihood function $r(\mathbf{x})$. We state and prove the asymptotic correctness of a data-free learning objective, relative trajectory balance, for training a diffusion model that samples from this posterior, a problem that existing methods solve only approximately or in restricted cases. Relative trajectory balance arises from the generative flow network perspective on diffusion models, which allows the use of deep reinforcement learning techniques to improve mode coverage. Experiments illustrate the broad potential of unbiased inference of arbitrary posteriors under diffusion priors: in vision (classifier guidance), language (infilling under a discrete diffusion LLM), and multimodal data (text-to-image generation). Beyond generative modeling, we apply relative trajectory balance to the problem of continuous control with a score-based behavior prior, achieving state-of-the-art results on benchmarks in offline reinforcement learning.
Authors: Dong Liu, Roger Waleffe, Meng Jiang, Shivaram Venkataraman
Abstract: In our recent research, we have developed a framework called GraphSnapShot, which has been proven an useful tool for graph learning acceleration. GraphSnapShot is a framework for fast cache, storage, retrieval and computation for graph learning. It can quickly store and update the local topology of graph structure and allows us to track patterns in the structure of graph networks, just like take snapshots of the graphs. In experiments, GraphSnapShot shows efficiency, it can achieve up to 30% training acceleration and 73% memory reduction for lossless graph ML training compared to current baselines such as dgl.This technique is particular useful for large dynamic graph learning tasks such as social media analysis and recommendation systems to process complex relationships between entities. The code for GraphSnapShot is publicly available at https://github.com/NoakLiu/GraphSnapShot.
Authors: Sao Mai Nguyen, Maxime Devanne, Olivier Remy-Neris, Mathieu Lempereur, Andr\'e Thepaut
Abstract: While automatic monitoring and coaching of exercises are showing encouraging results in non-medical applications, they still have limitations such as errors and limited use contexts. To allow the development and assessment of physical rehabilitation by an intelligent tutoring system, we identify in this article four challenges to address and propose a medical dataset of clinical patients carrying out low back-pain rehabilitation exercises. The dataset includes 3D Kinect skeleton positions and orientations, RGB videos, 2D skeleton data, and medical annotations to assess the correctness, and error classification and localisation of body part and timespan. Along this dataset, we perform a complete research path, from data collection to processing, and finally a small benchmark. We evaluated on the dataset two baseline movement recognition algorithms, pertaining to two different approaches: the probabilistic approach with a Gaussian Mixture Model (GMM), and the deep learning approach with a Long-Short Term Memory (LSTM). This dataset is valuable because it includes rehabilitation relevant motions in a clinical setting with patients in their rehabilitation program, using a cost-effective, portable, and convenient sensor, and because it shows the potential for improvement on these challenges.
Authors: Enqiang Zhu, Xiang Li, Chanjuan Liu, Nikhil R. Pal
Abstract: The extraction of biomedical data has significant academic and practical value in contemporary biomedical sciences. In recent years, drug repositioning, a cost-effective strategy for drug development by discovering new indications for approved drugs, has gained increasing attention. However, many existing drug repositioning methods focus on mining information from adjacent nodes in biomedical networks without considering the potential inter-relationships between the feature spaces of drugs and diseases. This can lead to inaccurate encoding, resulting in biased mined drug-disease association information. To address this limitation, we propose a new model called Dual-Feature Drug Repurposing Neural Network (DFDRNN). DFDRNN allows the mining of two features (similarity and association) from the drug-disease biomedical networks to encode drugs and diseases. A self-attention mechanism is utilized to extract neighbor feature information. It incorporates two dual-feature extraction modules: the single-domain dual-feature extraction (SDDFE) module for extracting features within a single domain (drugs or diseases) and the cross-domain dual-feature extraction (CDDFE) module for extracting features across domains. By utilizing these modules, we ensure more appropriate encoding of drugs and diseases. A cross-dual-domain decoder is also designed to predict drug-disease associations in both domains. Our proposed DFDRNN model outperforms six state-of-the-art methods on four benchmark datasets, achieving an average AUROC of 0.946 and an average AUPR of 0.597. Case studies on two diseases show that the proposed DFDRNN model can be applied in real-world scenarios, demonstrating its significant potential in drug repositioning.
Authors: Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui
Abstract: Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have adopted strategies such as recomputation and various forms of parallelisms. Nevertheless, these techniques rely on redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). In this paper, we propose MEMO, a novel LLM training framework designed for fine-grained activation memory management. Given the quadratic scaling of computation and linear scaling of memory with sequence lengths when using FlashAttention, we offload memory-consuming activations to CPU memory after each layer's forward pass and fetch them during the backward pass. To maximize the swapping of activations without hindering computation, and to avoid exhausting limited CPU memory, we implement a token-wise activation recomputation and swapping mechanism. Furthermore, we tackle the memory fragmentation issue by employing a bi-level Mixed Integer Programming (MIP) approach, optimizing memory reuse across transformer layers. Empirical results demonstrate that MEMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed, respectively. This improvement is attributed to MEMO's ability to minimize memory fragmentation, reduce recomputation and intensive communication, and circumvent the delays associated with the memory reorganization process due to fragmentation. By leveraging fine-grained activation memory management, MEMO facilitates efficient training of 7B LLM with 1 million sequence length on just 8 A800 GPUs, achieving an MFU of 52.30%.
Authors: Johan Peralez, Aur\`elien Delage, Jacopo Castellini, Rafael F. Cunha, Jilles S. Dibangoye
Abstract: The centralized training for decentralized execution paradigm emerged as the state-of-the-art approach to $\epsilon$-optimally solving decentralized partially observable Markov decision processes. However, scalability remains a significant issue. This paper presents a novel and more scalable alternative, namely the sequential-move centralized training for decentralized execution. This paradigm further pushes the applicability of the Bellman's principle of optimality, raising three new properties. First, it allows a central planner to reason upon sufficient sequential-move statistics instead of prior simultaneous-move ones. Next, it proves that $\epsilon$-optimal value functions are piecewise linear and convex in such sufficient sequential-move statistics. Finally, it drops the complexity of the backup operators from double exponential to polynomial at the expense of longer planning horizons. Besides, it makes it easy to use single-agent methods, e.g., SARSA algorithm enhanced with these findings, while still preserving convergence guarantees. Experiments on two- as well as many-agent domains from the literature against $\epsilon$-optimal simultaneous-move solvers confirm the superiority of our novel approach. This paradigm opens the door for efficient planning and reinforcement learning methods for multi-agent systems.
Authors: Manuel Lecha, Andrea Cavallo, Francesca Dominici, Elvin Isufi, Claudio Battiloro
Abstract: Topological Deep Learning (TDL) has emerged as a paradigm to process and learn from signals defined on higher-order combinatorial topological spaces, such as simplicial or cell complexes. Although many complex systems have an asymmetric relational structure, most TDL models forcibly symmetrize these relationships. In this paper, we first introduce a novel notion of higher-order directionality and we then design Directed Simplicial Neural Networks (Dir-SNNs) based on it. Dir-SNNs are message-passing networks operating on directed simplicial complexes able to leverage directed and possibly asymmetric interactions among the simplices. To our knowledge, this is the first TDL model using a notion of higher-order directionality. We theoretically and empirically prove that Dir-SNNs are more expressive than their directed graph counterpart in distinguishing isomorphic directed graphs. Experiments on a synthetic source localization task demonstrate that Dir-SNNs outperform undirected SNNs when the underlying complex is directed, and perform comparably when the underlying complex is undirected.
Authors: Haolin Chen, Philip N. Garner
Abstract: Motivated by the sensitivity-based importance score of the adaptive low-rank adaptation (AdaLoRA), we utilize more theoretically supported metrics, including the signal-to-noise ratio (SNR), along with the Improved Variational Online Newton (IVON) optimizer, for adaptive parameter budget allocation. The resulting Bayesian counterpart not only has matched or surpassed the performance of using the sensitivity-based importance metric but is also a faster alternative to AdaLoRA with Adam. Our theoretical analysis reveals a significant connection between the two metrics, providing a Bayesian perspective on the efficacy of sensitivity as an importance score. Furthermore, our findings suggest that the magnitude, rather than the variance, is the primary indicator of the importance of parameters.
Authors: Anwar Said, Waseem Abbas, Xenofon Koutsoukos
Abstract: Graph machine learning, particularly using graph neural networks, fundamentally relies on node features. Nevertheless, numerous real-world systems, such as social and biological networks, often lack node features due to various reasons, including privacy concerns, incomplete or missing data, and limitations in data collection. In such scenarios, researchers typically resort to methods like structural and positional encoding to construct node features. However, the length of such features is contingent on the maximum value within the property being encoded, for example, the highest node degree, which can be exceedingly large in applications like scale-free networks. Furthermore, these encoding schemes are limited to categorical data and might not be able to encode metrics returning other type of values. In this paper, we introduce a novel, universally applicable encoder, termed \emph{PropEnc}, which constructs expressive node embedding from any given graph metric. \emph{PropEnc} leverages histogram construction combined with reversed index encoding, offering a flexible method for node features initialization. It supports flexible encoding in terms of both dimensionality and type of input, demonstrating its effectiveness across diverse applications. \emph{PropEnc} allows encoding metrics in low-dimensional space which effectively address the sparsity challenge and enhances the efficiency of the models. We show that \emph{PropEnc} can construct node features that either exactly replicate one-hot encoding or closely approximate indices under various settings. Our extensive evaluations in graph classification setting across multiple social networks that lack node features support our hypothesis. The empirical results conclusively demonstrate that \emph{PropEnc} is both an efficient and effective mechanism for constructing node features from diverse set of graph metrics.
Authors: Omar Le\'on, V\'ictor Rivera, Angel V\'azquez-Pati\~no, Jacinto Ulloa, Esteban Samaniego
Abstract: We explore the possibilities of using energy minimization for the numerical modeling of strain localization in solids as a sharp discontinuity in the displacement field. For this purpose, we consider (regularized) strong discontinuity kinematics in elastoplastic solids. The corresponding mathematical model is discretized using Artificial Neural Networks (ANNs), aiming to predict both the magnitude and location of the displacement jump from energy minimization, $\textit{i.e.}$, within a variational setting. The architecture takes care of the kinematics, while the loss function takes care of the variational statement of the boundary value problem. The main idea behind this approach is to solve both the equilibrium problem and the location of the localization band by means of trainable parameters in the ANN. As a proof of concept, we show through both 1D and 2D numerical examples that the computational modeling of strain localization for elastoplastic solids using energy minimization is feasible.
Authors: Michael S. Albergo, Eric Vanden-Eijnden
Abstract: We propose an algorithm, termed the Non-Equilibrium Transport Sampler (NETS), to sample from unnormalized probability distributions. NETS can be viewed as a variant of annealed importance sampling (AIS) based on Jarzynski's equality, in which the stochastic differential equation used to perform the non-equilibrium sampling is augmented with an additional learned drift term that lowers the impact of the unbiasing weights used in AIS. We show that this drift is the minimizer of a variety of objective functions, which can all be estimated in an unbiased fashion without backpropagating through solutions of the stochastic differential equations governing the sampling. We also prove that some these objectives control the Kullback-Leibler divergence of the estimated distribution from its target. NETS is shown to be unbiased and, in addition, has a tunable diffusion coefficient which can be adjusted post-training to maximize the effective sample size. We demonstrate the efficacy of the method on standard benchmarks, high-dimensional Gaussian mixture distributions, and a model from statistical lattice field theory, for which it surpasses the performances of related work and existing baselines.
Authors: Barak Gahtan, Shany Funk, Einat Kodesh, Itay Ketko, Tsvi Kuflik, Alex M. Bronstein
Abstract: Musculoskeletal injuries during military training significantly impact readiness, making prevention through activity monitoring crucial. While Human Activity Recognition (HAR) using wearable devices offers promising solutions, it faces challenges in processing continuous data streams and recognizing diverse activities without predefined sessions. This paper introduces an end-to-end framework for preprocessing, analyzing, and recognizing activities from wearable data in military training contexts. Using data from 135 soldiers wearing \textit{Garmin--55} smartwatches over six months with over 15 million minutes. We develop a hierarchical deep learning approach that achieves 93.8% accuracy in temporal splits and 83.8% in cross-user evaluation. Our framework addresses missing data through physiologically-informed methods, reducing unknown sleep states from 40.38% to 3.66%. We demonstrate that while longer time windows (45-60 minutes) improve basic state classification, they present trade-offs in detecting fine-grained activities. Additionally, we introduce an intuitive visualization system that enables real-time comparison of individual performance against group metrics across multiple physiological indicators. This approach to activity recognition and performance monitoring provides military trainers with actionable insights for optimizing training programs and preventing injuries.
Authors: Yuyang Chen, Kaiyan Zhao, Yiming Wang, Ming Yang, Jian Zhang, Xiaoguang Niu
Abstract: Nowadays transformer-based Large Language Models (LLM) for code generation tasks usually apply sampling and filtering pipelines. Due to the sparse reward problem in code generation tasks caused by one-token incorrectness, transformer-based models will sample redundant programs till they find a correct one, leading to low efficiency. To overcome the challenge, we incorporate Experience Replay (ER) in the fine-tuning phase, where codes and programs produced are stored and will be replayed to give the LLM agent a chance to learn from past experiences. Based on the spirit of ER, we introduce a novel approach called BTP pipeline which consists of three phases: beam search sampling, testing phase, and prioritized experience replay phase. The approach makes use of failed programs collected by code models and replays programs with high Possibility and Pass-rate Prioritized value (P2Value) from the replay buffer to improve efficiency. P2Value comprehensively considers the possibility of transformers' output and pass rate and can make use of the redundant resources caused by the problem that most programs collected by LLMs fail to pass any tests. We empirically apply our approach in several LLMs, demonstrating that it enhances their performance in code generation tasks and surpasses existing baselines.
Authors: Adrienne M. Propp, Daniel M. Tartakovsky
Abstract: The development of efficient surrogates for partial differential equations (PDEs) is a critical step towards scalable modeling of complex, multiscale systems-of-systems. Convolutional neural networks (CNNs) have gained popularity as the basis for such surrogate models due to their success in capturing high-dimensional input-output mappings and the negligible cost of a forward pass. However, the high cost of generating training data -- typically via classical numerical solvers -- raises the question of whether these models are worth pursuing over more straightforward alternatives with well-established theoretical foundations, such as Monte Carlo methods. To reduce the cost of data generation, we propose training a CNN surrogate model on a mixture of numerical solutions to both the $d$-dimensional problem and its ($d-1$)-dimensional approximation, taking advantage of the efficiency savings guaranteed by the curse of dimensionality. We demonstrate our approach on a multiphase flow test problem, using transfer learning to train a dense fully-convolutional encoder-decoder CNN on the two classes of data. Numerical results from a sample uncertainty quantification task demonstrate that our surrogate model outperforms Monte Carlo with several times the data generation budget.
Authors: Jing Zhang, Linjiajie Fang, Kexin Shi, Wenjia Wang, Bing-Yi Jing
Abstract: ``Distribution shift'' is the main obstacle to the success of offline reinforcement learning. A learning policy may take actions beyond the behavior policy's knowledge, referred to as Out-of-Distribution (OOD) actions. The Q-values for these OOD actions can be easily overestimated. As a result, the learning policy is biased by using incorrect Q-value estimates. One common approach to avoid Q-value overestimation is to make a pessimistic adjustment. Our key idea is to penalize the Q-values of OOD actions associated with high uncertainty. In this work, we propose Q-Distribution Guided Q-Learning (QDQ), which applies a pessimistic adjustment to Q-values in OOD regions based on uncertainty estimation. This uncertainty measure relies on the conditional Q-value distribution, learned through a high-fidelity and efficient consistency model. Additionally, to prevent overly conservative estimates, we introduce an uncertainty-aware optimization objective for updating the Q-value function. The proposed QDQ demonstrates solid theoretical guarantees for the accuracy of Q-value distribution learning and uncertainty measurement, as well as the performance of the learning policy. QDQ consistently shows strong performance on the D4RL benchmark and achieves significant improvements across many tasks.
Authors: Valentyn Melnychuk, Stefan Feuerriegel, Mihaela van der Schaar
Abstract: Estimating causal quantities from observational data is crucial for understanding the safety and effectiveness of medical treatments. However, to make reliable inferences, medical practitioners require not only estimating averaged causal quantities, such as the conditional average treatment effect, but also understanding the randomness of the treatment effect as a random variable. This randomness is referred to as aleatoric uncertainty and is necessary for understanding the probability of benefit from treatment or quantiles of the treatment effect. Yet, the aleatoric uncertainty of the treatment effect has received surprisingly little attention in the causal machine learning community. To fill this gap, we aim to quantify the aleatoric uncertainty of the treatment effect at the covariate-conditional level, namely, the conditional distribution of the treatment effect (CDTE). Unlike average causal quantities, the CDTE is not point identifiable without strong additional assumptions. As a remedy, we employ partial identification to obtain sharp bounds on the CDTE and thereby quantify the aleatoric uncertainty of the treatment effect. We then develop a novel, orthogonal learner for the bounds on the CDTE, which we call AU-learner. We further show that our AU-learner has several strengths in that it satisfies Neyman-orthogonality and, thus, quasi-oracle efficiency. Finally, we propose a fully-parametric deep learning instantiation of our AU-learner.
Authors: Hala Hawashin, Mehrnoosh Sadrzadeh
Abstract: While large language models (LLMs) have advanced the field of natural language processing (NLP), their "black box" nature obscures their decision-making processes. To address this, researchers developed structured approaches using higher order tensors. These are able to model linguistic relations, but stall when training on classical computers due to their excessive size. Tensors are natural inhabitants of quantum systems and training on quantum computers provides a solution by translating text to variational quantum circuits. In this paper, we develop MultiQ-NLP: a framework for structure-aware data processing with multimodal text+image data. Here, "structure" refers to syntactic and grammatical relationships in language, as well as the hierarchical organization of visual elements in images. We enrich the translation with new types and type homomorphisms and develop novel architectures to represent structure. When tested on a main stream image classification task (SVO Probes), our best model showed a par performance with the state of the art classical models; moreover the best model was fully structured.
Authors: Zhibai Huang, Yihan Shen, Yongchen Xie, Zhixiang Wei, Yun wang, Fangxin Liu, Tao Song, Zhengwei Qi
Abstract: Peripheral Component Interconnect Express (PCIe) is the de facto interconnect standard for high-speed peripherals and CPUs. Prototyping and optimizing PCIe devices for emerging scenarios is an ongoing challenge. Since Transaction Layer Packets (TLPs) capture device-CPU interactions, it is crucial to analyze and generate realistic TLP traces for effective device design and optimization. Generative AI offers a promising approach for creating intricate, custom TLP traces necessary for PCIe hardware and software development. However, existing models often generate impractical traces due to the absence of PCIe-specific constraints, such as TLP ordering and causality. This paper presents Phantom, the first framework that treats TLP trace generation as a generative AI problem while incorporating PCIe-specific constraints. We validate Phantom's effectiveness by generating TLP traces for an actual PCIe network interface card. Experimental results show that Phantom produces practical, large-scale TLP traces, significantly outperforming existing models, with improvements of up to 1000$\times$ in task-specific metrics and up to 2.19$\times$ in Frechet Inception Distance (FID) compared to backbone-only methods.
Authors: Satish Kumar Keshri, Nazreen Shah, Ranjitha Prasad
Abstract: The holy grail of machine learning is to enable Continual Federated Learning (CFL) to enhance the efficiency, privacy, and scalability of AI systems while learning from streaming data. The primary challenge of a CFL system is to overcome global catastrophic forgetting, wherein the accuracy of the global model trained on new tasks declines on the old tasks. In this work, we propose Continual Federated Learning with Aggregated Gradients (C-FLAG), a novel replay-memory based federated strategy consisting of edge-based gradient updates on memory and aggregated gradients on the current data. We provide convergence analysis of the C-FLAG approach which addresses forgetting and bias while converging at a rate of $O(1/\sqrt{T})$ over $T$ communication rounds. We formulate an optimization sub-problem that minimizes catastrophic forgetting, translating CFL into an iterative algorithm with adaptive learning rates that ensure seamless learning across tasks. We empirically show that C-FLAG outperforms several state-of-the-art baselines on both task and class-incremental settings with respect to metrics such as accuracy and forgetting.
Authors: Hongbo Liu
Abstract: Contrastive Language-Image Pre-training (CLIP) has attracted a surge of attention for its superior zero-shot performance and excellent transferability to downstream tasks. However, training such large-scale models usually requires substantial computation and storage, which poses barriers for general users with consumer-level computers. Motivated by this observation, in this paper we investigate how to achieve competitive performance on only one Nvidia RTX3090 GPU and with one terabyte for storing dataset. On one hand, we simplify the transformer block structure and combine Weight Inheritance with multi-stage Knowledge Distillation (WIKD), thereby reducing the parameters and improving the inference speed during training along with deployment. On the other hand, confronted with the convergence challenge posed by small dataset, we generate synthetic captions for each sample as data augmentation, and devise a novel Pair Matching (PM) loss to fully exploit the distinguishment among positive and negative image-text pairs. Extensive experiments demonstrate that our model can achieve a new state-of-the-art datascale-parameter-accuracy tradeoff, which could further popularize the CLIP model in the related research community.
Authors: Zhihao Li, Haoze Song, Di Xiao, Zhilu Lai, Wei Wang
Abstract: Partial Differential Equations (PDEs) underpin many scientific phenomena, yet traditional computational approaches often struggle with complex, nonlinear systems and irregular geometries. This paper introduces the \textbf{AMG} method, a \textbf{M}ulti-\textbf{G}raph neural operator approach designed for efficiently solving PDEs on \textbf{A}rbitrary geometries. AMG leverages advanced graph-based techniques and dynamic attention mechanisms within a novel GraphFormer architecture, enabling precise management of diverse spatial domains and complex data interdependencies. By constructing multi-scale graphs to handle variable feature frequencies and a physics graph to encapsulate inherent physical properties, AMG significantly outperforms previous methods, which are typically limited to uniform grids. We present a comprehensive evaluation of AMG across six benchmarks, demonstrating its consistent superiority over existing state-of-the-art models. Our findings highlight the transformative potential of tailored graph neural operators in surmounting the challenges faced by conventional PDE solvers. Our code and datasets are available on \url{https://github.com/lizhihao2022/AMG}.
Authors: Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey
Abstract: The use of machine learning and AI on electronic health records (EHRs) holds substantial potential for clinical insight. However, this approach faces challenges due to data heterogeneity, sparsity, temporal misalignment, and limited labeled outcomes. In this context, we leverage a linked EHR dataset of approximately one million de-identified individuals from Bristol, North Somerset, and South Gloucestershire, UK, to characterize urinary tract infections (UTIs). We implemented a data pre-processing and curation pipeline that transforms the raw EHR data into a structured format suitable for developing predictive models focused on data fairness, accountability and transparency. Given the limited availability and biases of ground truth UTI outcomes, we introduce a UTI risk estimation framework informed by clinical expertise to estimate UTI risk across individual patient timelines. Pairwise XGBoost models are trained using this framework to differentiate UTI risk categories with explainable AI techniques applied to identify key predictors and support interpretability. Our findings reveal differences in clinical and demographic predictors across risk groups. While this study highlights the potential of AI-driven insights to support UTI clinical decision-making, further investigation of patient sub-strata and extensive validation are needed to ensure robustness and applicability in clinical practice.
Authors: Zekun Shi, Zheyuan Hu, Min Lin, Kenji Kawaguchi
Abstract: Optimizing neural networks with loss that contain high-dimensional and high-order differential operators is expensive to evaluate with back-propagation due to $\mathcal{O}(d^{k})$ scaling of the derivative tensor size and the $\mathcal{O}(2^{k-1}L)$ scaling in the computation graph, where $d$ is the dimension of the domain, $L$ is the number of ops in the forward computation graph, and $k$ is the derivative order. In previous works, the polynomial scaling in $d$ was addressed by amortizing the computation over the optimization process via randomization. Separately, the exponential scaling in $k$ for univariate functions ($d=1$) was addressed with high-order auto-differentiation (AD). In this work, we show how to efficiently perform arbitrary contraction of the derivative tensor of arbitrary order for multivariate functions, by properly constructing the input tangents to univariate high-order AD, which can be used to efficiently randomize any differential operator. When applied to Physics-Informed Neural Networks (PINNs), our method provides >1000$\times$ speed-up and >30$\times$ memory reduction over randomization with first-order AD, and we can now solve \emph{1-million-dimensional PDEs in 8 minutes on a single NVIDIA A100 GPU}. This work opens the possibility of using high-order differential operators in large-scale problems.
Authors: Oscar Lares, Hao Zhen, Jidong J. Yang
Abstract: Reliable and interpretable traffic crash modeling is essential for understanding causality and improving road safety. This study introduces a novel approach to predicting collision types by utilizing a comprehensive dataset fused from multiple sources, including weather data, crash reports, high-resolution traffic information, pavement geometry, and facility characteristics. Central to our approach is the development of a Feature Group Tabular Transformer (FGTT) model, which organizes disparate data into meaningful feature groups, represented as tokens. These group-based tokens serve as rich semantic components, enabling effective identification of collision patterns and interpretation of causal mechanisms. The FGTT model is benchmarked against widely used tree ensemble models, including Random Forest, XGBoost, and CatBoost, demonstrating superior predictive performance. Furthermore, model interpretation reveals key influential factors, providing fresh insights into the underlying causality of distinct crash types.
Authors: Lekai Song, Pengyu Liu, Yang Liu, Jingfang Pei, Wenyu Cui, Songwei Liu, Yingyi Wen, Teng Ma, Kong-Pang Pun, Leonard W. T. Ng, Guohua Hu
Abstract: Brains perform decision-making by Bayes theorem. The theorem quantifies events as probabilities and, based on probability rules, renders the decisions. Learning from this, Bayes theorem can be applied to enable efficient user-scene interactions. However, given the probabilistic nature, implementing Bayes theorem in hardware using conventional deterministic computing can incur excessive computational cost and decision latency. Though challenging, here we present a probabilistic computing approach based on memristors to implement the Bayes theorem. We integrate memristors with Boolean logics and, by exploiting the volatile stochastic switching of the memristors, realise probabilistic logic operations, key for hardware Bayes theorem implementation. To empirically validate the efficacy of the hardware Bayes theorem in user-scene interactions, we develop lightweight Bayesian inference and fusion hardware operators using the probabilistic logics and apply the operators in road scene parsing for self-driving, including route planning and obstacle detection. The results show our operators can achieve reliable decisions in less than 0.4 ms (or equivalently 2,500 fps), outperforming human decision-making and the existing driving assistance systems.
Authors: Korbinian P\"oppel, Maximilian Beck, Sepp Hochreiter
Abstract: While Transformers and other sequence-parallelizable neural network architectures seem like the current state of the art in sequence modeling, they specifically lack state-tracking capabilities. These are important for time-series tasks and logical reasoning. Traditional RNNs like LSTMs and GRUs, as well as modern variants like sLSTM do have these capabilities at the cost of strictly sequential processing. While this is often seen as a strong limitation, we show how fast these networks can get with our hardware-optimization FlashRNN in Triton and CUDA, optimizing kernels to the register level on modern GPUs. We extend traditional RNNs with a parallelization variant that processes multiple RNNs of smaller hidden state in parallel, similar to the head-wise processing in Transformers. To enable flexibility on different GPU variants, we introduce a new optimization framework for hardware-internal cache sizes, memory and compute handling. It models the hardware in a setting using polyhedral-like constraints, including the notion of divisibility. This speeds up the solution process in our ConstrINT library for general integer constraint satisfaction problems (integer CSPs). We show that our kernels can achieve 50x speed-ups over a vanilla PyTorch implementation and allow 40x larger hidden sizes compared to our Triton implementation. Our open-source kernels and the optimization library are released here to boost research in the direction of state-tracking enabled RNNs and sequence modeling: \url{https://github.com/NX-AI/flashrnn}
Authors: Yaorui Shi, Sihang Li, Taiyan Zhang, Xi Fang, Jiankun Wang, Zhiyuan Liu, Guojiang Zhao, Zhengdan Zhu, Zhifeng Gao, Renxin Zhong, Linfeng Zhang, Guolin Ke, Weinan E, Hengxing Cai, Xiang Wang
Abstract: Automated drug discovery offers significant potential for accelerating the development of novel therapeutics by substituting labor-intensive human workflows with machine-driven processes. However, molecules generated by artificial intelligence may unintentionally infringe on existing patents, posing legal and financial risks that impede the full automation of drug discovery pipelines. This paper introduces PatentFinder, a novel multi-agent and tool-enhanced intelligence system that can accurately and comprehensively evaluate small molecules for patent infringement. PatentFinder features five specialized agents that collaboratively analyze patent claims and molecular structures with heuristic and model-based tools, generating interpretable infringement reports. To support systematic evaluation, we curate MolPatent-240, a benchmark dataset tailored for patent infringement assessment algorithms. On this benchmark, PatentFinder outperforms baseline methods that rely solely on large language models or specialized chemical tools, achieving a 13.8% improvement in F1-score and a 12% increase in accuracy. Additionally, PatentFinder autonomously generates detailed and interpretable patent infringement reports, showcasing enhanced accuracy and improved interpretability. The high accuracy and interpretability of PatentFinder make it a valuable and reliable tool for automating patent infringement assessments, offering a practical solution for integrating patent protection analysis into the drug discovery pipeline.
Authors: Manman Yuan, Weiming Jia, Xiong Luo, Jiazhen Ye, Peican Zhu, Junlin Li
Abstract: The precise detection of mild cognitive impairment (MCI) is of significant importance in preventing the deterioration of patients in a timely manner. Although hypergraphs have enhanced performance by learning and analyzing brain networks, they often only depend on vector distances between features at a single scale to infer interactions. In this paper, we deal with a more arduous challenge, hypergraph modelling with synchronization between brain regions, and design a novel framework, i.e., A Multi-scale Hypergraph Network for MCI Detection via Synchronous and Attentive Fusion (MHSA), to tackle this challenge. Specifically, our approach employs the Phase-Locking Value (PLV) to calculate the phase synchronization relationship in the spectrum domain of regions of interest (ROIs) and designs a multi-scale feature fusion mechanism to integrate dynamic connectivity features of functional magnetic resonance imaging (fMRI) from both the temporal and spectrum domains. To evaluate and optimize the direct contribution of each ROI to phase synchronization in the temporal domain, we structure the PLV coefficients dynamically adjust strategy, and the dynamic hypergraph is modelled based on a comprehensive temporal-spectrum fusion matrix. Experiments on the real-world dataset indicate the effectiveness of our strategy. The code is available at https://github.com/Jia-Weiming/MHSA.
Authors: Yihong Jin, Ze Yang
Abstract: Due to the increasing abuse of fraudulent activities that result in significant financial and reputational harm, Ethereum smart contracts face a significant problem in detecting fraud. Existing monitoring methods typically rely on lease code analysis or physically extracted features, which suffer from scalability and adaptability limitations. In this study, we use graph representation learning to observe purchase trends and find fraudulent deals. We can achieve powerful categorisation performance by using innovative machine learning versions and transforming Ethereum invoice data into graph structures. Our method addresses label imbalance through SMOTE-ENN techniques and evaluates models like Multi-Layer Perceptron ( MLP ) and Graph Convolutional Networks ( GCN). Experimental results show that the MLP type surpasses the GCN in this environment, with domain-specific assessments closely aligned with real-world assessments. This study provides a scalable and efficient way to improve Ethereum's ecosystem's confidence and security.
Authors: Neil He, Menglin Yang, Rex Ying
Abstract: Hyperbolic neural networks have emerged as a powerful tool for modeling hierarchical data structures prevalent in real-world datasets. Notably, residual connections, which facilitate the direct flow of information across layers, have been instrumental in the success of deep neural networks. However, current methods for constructing hyperbolic residual networks suffer from limitations such as increased model complexity, numerical instability, and errors due to multiple mappings to and from the tangent space. To address these limitations, we introduce LResNet, a novel Lorentzian residual neural network based on the weighted Lorentzian centroid in the Lorentz model of hyperbolic geometry. Our method enables the efficient integration of residual connections in Lorentz hyperbolic neural networks while preserving their hierarchical representation capabilities. We demonstrate that our method can theoretically derive previous methods while offering improved stability, efficiency, and effectiveness. Extensive experiments on both graph and vision tasks showcase the superior performance and robustness of our method compared to state-of-the-art Euclidean and hyperbolic alternatives. Our findings highlight the potential of LResNet for building more expressive neural networks in hyperbolic embedding space as a generally applicable method to multiple architectures, including CNNs, GNNs, and graph Transformers.
Authors: Yuchen Yan, Yuzhong Chen, Huiyuan Chen, Xiaoting Li, Zhe Xu, Zhichen Zeng, Lihui Liu, Zhining Liu, Hanghang Tong
Abstract: Graph Neural Networks (GNNs) have exhibited remarkable efficacy in diverse graph learning tasks, particularly on static homophilic graphs. Recent attention has pivoted towards more intricate structures, encompassing (1) static heterophilic graphs encountering the edge heterophily issue in the spatial domain and (2) event-based continuous graphs in the temporal domain. State-of-the-art (SOTA) has been concurrently addressing these two lines of work but tends to overlook the presence of heterophily in the temporal domain, constituting the temporal heterophily issue. Furthermore, we highlight that the edge heterophily issue and the temporal heterophily issue often co-exist in event-based continuous graphs, giving rise to the temporal edge heterophily challenge. To tackle this challenge, this paper first introduces the temporal edge heterophily measurement. Subsequently, we propose the Temporal Heterophilic Graph Convolutional Network (THeGCN), an innovative model that incorporates the low/high-pass graph signal filtering technique to accurately capture both edge (spatial) heterophily and temporal heterophily. Specifically, the THeGCN model consists of two key components: a sampler and an aggregator. The sampler selects events relevant to a node at a given moment. Then, the aggregator executes message-passing, encoding temporal information, node attributes, and edge attributes into node embeddings. Extensive experiments conducted on 5 real-world datasets validate the efficacy of THeGCN.
Authors: Artyom Stitsyuk, Jaesik Choi
Abstract: In recent years, the application of transformer-based models in time-series forecasting has received significant attention. While often demonstrating promising results, the transformer architecture encounters challenges in fully exploiting the temporal relations within time series data due to its attention mechanism. In this work, we design eXponential Patch (xPatch for short), a novel dual-stream architecture that utilizes exponential decomposition. Inspired by the classical exponential smoothing approaches, xPatch introduces the innovative seasonal-trend exponential decomposition module. Additionally, we propose a dual-flow architecture that consists of an MLP-based linear stream and a CNN-based non-linear stream. This model investigates the benefits of employing patching and channel-independence techniques within a non-transformer model. Finally, we develop a robust arctangent loss function and a sigmoid learning rate adjustment scheme, which prevent overfitting and boost forecasting performance. The code is available at the following repository: https://github.com/stitsyuk/xPatch.
Authors: Ali Forootani, Raffaele Iervolino
Abstract: Federated Learning (FL) has emerged as a powerful paradigm for decentralized machine learning, enabling collaborative model training across diverse clients without sharing raw data. However, traditional FL approaches often face limitations in scalability and efficiency due to their reliance on synchronous client updates, which can result in significant delays and increased communication overhead, particularly in heterogeneous and dynamic environments. To address these challenges in this paper, we propose an Asynchronous Federated Learning (AFL) algorithm, which allows clients to update the global model independently and asynchronously. Our key contributions include a comprehensive convergence analysis of AFL in the presence of client delays and model staleness. By leveraging martingale difference sequence theory and variance bounds, we ensure robust convergence despite asynchronous updates. Assuming strongly convex local objective functions, we establish bounds on gradient variance under random client sampling and derive a recursion formula quantifying the impact of client delays on convergence. The proposed AFL algorithm addresses key limitations of traditional FL methods, such as inefficiency due to global synchronization and susceptibility to client drift. It enhances scalability, robustness, and efficiency in real-world settings with heterogeneous client populations and dynamic network conditions. Our results underscore the potential of AFL to drive advancements in distributed learning systems, particularly for large-scale, privacy-preserving applications in resource-constrained environments.
Authors: Zhengchao Wan, Qingsong Wang, Gal Mishne, Yusu Wang
Abstract: Diffusion-based generative models have become the standard for image generation. ODE-based samplers and flow matching models improve efficiency, in comparison to diffusion models, by reducing sampling steps through learned vector fields. However, the theoretical foundations of flow matching models remain limited, particularly regarding the convergence of individual sample trajectories at terminal time - a critical property that impacts sample quality and being critical assumption for models like the consistency model. In this paper, we advance the theory of flow matching models through a comprehensive analysis of sample trajectories, centered on the denoiser that drives ODE dynamics. We establish the existence, uniqueness and convergence of ODE trajectories at terminal time, ensuring stable sampling outcomes under minimal assumptions. Our analysis reveals how trajectories evolve from capturing global data features to local structures, providing the geometric characterization of per-sample behavior in flow matching models. We also explain the memorization phenomenon in diffusion-based training through our terminal time analysis. These findings bridge critical gaps in understanding flow matching models, with practical implications for sampling stability and model design.
Authors: Xiaoyi Wu, Bo Ji, Bin Li
Abstract: Combinatorial Multi-Armed Bandit with fairness constraints is a framework where multiple arms form a super arm and can be pulled in each round under uncertainty to maximize cumulative rewards while ensuring the minimum average reward required by each arm. The existing pessimistic-optimistic algorithm linearly combines virtual queue-lengths (tracking the fairness violations) and Upper Confidence Bound estimates as a weight for each arm and selects a super arm with the maximum total weight. The number of super arms could be exponential to the number of arms in many scenarios. In wireless networks, interference constraints can cause the number of super arms to grow exponentially with the number of arms. Evaluating all the feasible super arms to find the one with the maximum total weight can incur extremely high computational complexity in the pessimistic-optimistic algorithm. To avoid this, we develop a low-complexity fair learning algorithm based on the so-called pick-and-compare approach that involves randomly picking $M$ feasible super arms to evaluate. By setting $M$ to a constant, the number of comparison steps in the pessimistic-optimistic algorithm can be reduced to a constant, thereby significantly reducing the computational complexity. Our theoretical proof shows this low-complexity design incurs only a slight sacrifice in fairness and regret performance. Finally, we validate the theoretical result by extensive simulations.
Authors: Xiaopeng Ke
Abstract: Deep Metric Learning (DML) has shown remarkable successes in many domains by taking advantage of powerful deep neural networks. Deep neural networks are prone to adversarial attacks and could be easily fooled by adversarial examples. The current progress on this robustness issue is mainly about deep classification models but pays little attention to DML models. Existing works fail to thoroughly inspect the robustness of DML and neglect an important DML scenario, the clustering-based inference. In this work, we first point out the robustness issue of DML models in clustering-based inference scenarios. We find that, for the clustering-based inference, existing defenses designed DML are unable to be reused and the adaptions of defenses designed for deep classification models cannot achieve satisfactory robustness performance. To alleviate the hazard of adversarial examples, we propose a new defense, the Ensemble Adversarial Training (EAT), which exploits ensemble learning and adversarial training. EAT promotes the diversity of the ensemble, encouraging each model in the ensemble to have different robustness features, and employs a self-transferring mechanism to make full use of the robustness statistics of the whole ensemble in the update of every single model. We evaluate the EAT method on three widely-used datasets with two popular model architectures. The results show that the proposed EAT method greatly outperforms the adaptions of defenses designed for deep classification models.
Authors: Yongxian Wei, Anke Tang, Li Shen, Chun Yuan, Xiaochun Cao
Abstract: Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. Existing methods attempt to alleviate task conflicts by sparsifying task vectors or promoting orthogonality among them. However, they overlook the fundamental requirement of model merging: ensuring the merged model performs comparably to task-specific models on respective tasks. We find these methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance. Based on our findings, we frame model merging as a constrained optimization problem ($\textit{i.e.}$, minimizing the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge) and solve it via adaptive projective gradient descent. Specifically, we align the merged model with individual models by decomposing and reconstituting the loss function, alleviating conflicts through $\textit{data-free}$ optimization of task vectors. To retain shared knowledge, we optimize this objective by projecting gradients within a $\textit{shared subspace}$ spanning all tasks. Moreover, we view merging coefficients as adaptive learning rates and propose a task-aware, training-free strategy. Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
Authors: Zhongyi Zhou, Yaxin Peng, Pin Yi, Minjie Zhu, Chaomin Shen
Abstract: Continual Learning enables models to learn and adapt to new tasks while retaining prior knowledge. Introducing new tasks, however, can naturally lead to feature entanglement across tasks, limiting the model's capability to distinguish between new domain data. In this work, we propose a method called Feature Realignment through Experts on hyperSpHere in Continual Learning (Fresh-CL). By leveraging predefined and fixed simplex equiangular tight frame (ETF) classifiers on a hypersphere, our model improves feature separation both intra and inter tasks. However, the projection to a simplex ETF shifts with new tasks, disrupting structured feature representation of previous tasks and degrading performance. Therefore, we propose a dynamic extension of ETF through mixture of experts, enabling adaptive projections onto diverse subspaces to enhance feature representation. Experiments on 11 datasets demonstrate a 2% improvement in accuracy compared to the strongest baseline, particularly in fine-grained datasets, confirming the efficacy of combining ETF and MoE to improve feature distinction in continual learning scenarios.
Authors: Zheng Chun
Abstract: In this work, we build upon the offline reinforcement learning algorithm TD7, which incorporates State-Action Learned Embeddings (SALE) and a prioritized experience replay buffer (LAP). We propose a model-free actor-critic algorithm that integrates ensemble Q-networks and a gradient diversity penalty from EDAC. The ensemble Q-networks introduce penalties to guide the actor network toward in-distribution actions, effectively addressing the challenge of out-of-distribution actions. Meanwhile, the gradient diversity penalty encourages diverse Q-value gradients, further suppressing overestimation for out-of-distribution actions. Additionally, our method retains an adjustable behavior cloning (BC) term that directs the actor network toward dataset actions during early training stages, while gradually reducing its influence as the precision of the Q-ensemble improves. These enhancements work synergistically to improve the stability and precision of the training. Experimental results on the D4RL MuJoCo benchmarks demonstrate that our algorithm achieves higher convergence speed, stability, and performance compared to existing methods.
Authors: Alexander Scheinker
Abstract: Adaptive physics-informed super-resolution diffusion is developed for non-invasive virtual diagnostics of the 6D phase space density of charged particle beams. An adaptive variational autoencoder (VAE) embeds initial beam condition images and scalar measurements to a low-dimensional latent space from which a 326 pixel 6D tensor representation of the beam's 6D phase space density is generated. Projecting from a 6D tensor generates physically consistent 2D projections. Physics-guided super-resolution diffusion transforms low-resolution images of the 6D density to high resolution 256x256 pixel images. Un-supervised adaptive latent space tuning enables tracking of time-varying beams without knowledge of time-varying initial conditions. The method is demonstrated with experimental data and multi-particle simulations at the HiRES UED. The general approach is applicable to a wide range of complex dynamic systems evolving in high-dimensional phase space. The method is shown to be robust to distribution shift without re-training.
Authors: Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Xuan Shen, Pu Zhao, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Dong Huang, Yanzhi Wang
Abstract: Fine-tuning helps large language models (LLM) recover degraded information and enhance task performance. Although Low-Rank Adaptation (LoRA) is widely used and effective for fine-tuning, we have observed that its scaling factor can limit or even reduce performance as the rank size increases. To address this issue, we propose RoRA (Rank-adaptive Reliability Optimization), a simple yet effective method for optimizing LoRA's scaling factor. By replacing $\alpha/r$ with $\alpha/\sqrt{r}$, RoRA ensures improved performance as rank size increases. Moreover, RoRA enhances low-rank adaptation in fine-tuning uncompressed models and excels in the more challenging task of accuracy recovery when fine-tuning pruned models. Extensive experiments demonstrate the effectiveness of RoRA in fine-tuning both uncompressed and pruned models. RoRA surpasses the state-of-the-art (SOTA) in average accuracy and robustness on LLaMA-7B/13B, LLaMA2-7B, and LLaMA3-8B, specifically outperforming LoRA and DoRA by 6.5% and 2.9% on LLaMA-7B, respectively. In pruned model fine-tuning, RoRA shows significant advantages; for SHEARED-LLAMA-1.3, a LLaMA-7B with 81.4% pruning, RoRA achieves 5.7% higher average accuracy than LoRA and 3.9% higher than DoRA.
Authors: Fuhang Liang, Rucong Xu, Deng Lin
Abstract: Current deep regression models usually learn in point-wise way that treat each sample as an independent input, neglecting the relative ordering among different data. Consequently, the regression model could neglect the data 's interrelationships, potentially resulting in suboptimal performance. Moreover, the existence of aleatoric uncertainty in the training data may drive the model to capture non-generalizable patterns, contributing to increased overfitting. To address these issues, we propose a novel adaptive pairwise learning framework (AdaPRL) for regression tasks which leverages the relative differences between data points and integrates with deep probabilistic models to quantify the uncertainty associated with the predictions. Additionally, we adapt AdaPRL for applications in multi-task learning and multivariate time series forecasting. Extensive experiments with several real-world regression datasets including recommendation systems, age estimation, time series forecasting, natural language understanding, finance, and industry datasets show that AdaPRL is compatible with different backbone networks in various tasks and achieves state-of-the-art performance on the vast majority of tasks, highlighting its notable potential including enhancing prediction accuracy and ranking ability, increasing generalization capability, improving robustness to noisy data, improving resilience to reduced data, and enhancing interpretability, etc.
Authors: Sanchit Bedi (Civil Engineering Department, Indian Institute of Technology Delhi, New Delhi, India), Karn Tiwari (Electrical Communications Department, Indian Institute of Science Bengaluru, Bengaluru), Prathosh A. P. (Electrical Communications Department, Indian Institute of Science Bengaluru, Bengaluru), Sri Harsha Kota (Civil Engineering Department, Indian Institute of Technology Delhi, New Delhi, India), N. M. Anoop Krishnan (Civil Engineering Department, Indian Institute of Technology Delhi, New Delhi, India, Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi, New Delhi, India)
Abstract: Carbon Monoxide (CO) is a dominant pollutant in urban areas due to the energy generation from fossil fuels for industry, automobile, and domestic requirements. Forecasting the evolution of CO in real-time can enable the deployment of effective early warning systems and intervention strategies. However, the computational cost associated with the physics and chemistry-based simulation makes it prohibitive to implement such a model at the city and country scale. To address this challenge, here, we present a machine learning model based on neural operator, namely, Complex Neural Operator for Air Quality (CoNOAir), that can effectively forecast CO concentrations. We demonstrate this by developing a country-level model for short-term (hourly) and long-term (72-hour) forecasts of CO concentrations. Our model outperforms state-of-the-art models such as Fourier neural operators (FNO) and provides reliable predictions for both short and long-term forecasts. We further analyse the capability of the model to capture extreme events and generate forecasts in urban cities in India. Interestingly, we observe that the model predicts the next hour CO concentrations with R2 values greater than 0.95 for all the cities considered. The deployment of such a model can greatly assist the governing bodies to provide early warning, plan intervention strategies, and develop effective strategies by considering several what-if scenarios. Altogether, the present approach could provide a fillip to real-time predictions of CO pollution in urban cities.
Authors: M\'elissande Machefer, Michele Ronco, Anne-Claire Thomas, Michael Assouline, Melanie Rabier, Christina Corbane, Felix Rembold
Abstract: Food security is a complex, multidimensional concept challenging to measure comprehensively. Effective anticipation, monitoring, and mitigation of food crises require timely and comprehensive global data. This paper introduces the Harmonized Food Insecurity Dataset (HFID), an open-source resource consolidating four key data sources: the Integrated Food Security Phase Classification (IPC)/Cadre Harmonis\'e (CH) phases, the Famine Early Warning Systems Network (FEWS NET) IPC-compatible phases, and the World Food Program's (WFP) Food Consumption Score (FCS) and reduced Coping Strategy Index (rCSI). Updated monthly and using a common reference system for administrative units, the HFID offers extensive spatial and temporal coverage. It serves as a vital tool for food security experts and humanitarian agencies, providing a unified resource for analyzing food security conditions and highlighting global data disparities. The scientific community can also leverage the HFID to develop data-driven predictive models, enhancing the capacity to forecast and prevent future food crises.
Authors: Chanwoo Lee, Miaoyan Wang
Abstract: We consider the problem of structured tensor denoising in the presence of unknown permutations. Such data problems arise commonly in recommendation system, neuroimaging, community detection, and multiway comparison applications. Here, we develop a general family of smooth tensor models up to arbitrary index permutations; the model incorporates the popular tensor block models and Lipschitz hypergraphon models as special cases. We show that a constrained least-squares estimator in the block-wise polynomial family achieves the minimax error bound. A phase transition phenomenon is revealed with respect to the smoothness threshold needed for optimal recovery. In particular, we find that a polynomial of degree up to $(m-2)(m+1)/2$ is sufficient for accurate recovery of order-$m$ tensors, whereas higher degree exhibits no further benefits. This phenomenon reveals the intrinsic distinction for smooth tensor estimation problems with and without unknown permutations. Furthermore, we provide an efficient polynomial-time Borda count algorithm that provably achieves optimal rate under monotonicity assumptions. The efficacy of our procedure is demonstrated through both simulations and Chicago crime data analysis.
Authors: Joyce Chew, Matthew Hirn, Smita Krishnaswamy, Deanna Needell, Michael Perlmutter, Holly Steach, Siddharth Viswanath, Hau-Tieng Wu
Abstract: The scattering transform is a multilayered, wavelet-based transform initially introduced as a model of convolutional neural networks (CNNs) that has played a foundational role in our understanding of these networks' stability and invariance properties. Subsequently, there has been widespread interest in extending the success of CNNs to data sets with non-Euclidean structure, such as graphs and manifolds, leading to the emerging field of geometric deep learning. In order to improve our understanding of the architectures used in this new field, several papers have proposed generalizations of the scattering transform for non-Euclidean data structures such as undirected graphs and compact Riemannian manifolds without boundary. In this paper, we introduce a general, unified model for geometric scattering on measure spaces. Our proposed framework includes previous work on geometric scattering as special cases but also applies to more general settings such as directed graphs, signed graphs, and manifolds with boundary. We propose a new criterion that identifies to which groups a useful representation should be invariant and show that this criterion is sufficient to guarantee that the scattering transform has desirable stability and invariance properties. Additionally, we consider finite measure spaces that are obtained from randomly sampling an unknown manifold. We propose two methods for constructing a data-driven graph on which the associated graph scattering transform approximates the scattering transform on the underlying manifold. Moreover, we use a diffusion-maps based approach to prove quantitative estimates on the rate of convergence of one of these approximations as the number of sample points tends to infinity. Lastly, we showcase the utility of our method on spherical images, directed graphs, and on high-dimensional single-cell data.
Authors: Yinghao Zhang, Peng Li, Yue Hu
Abstract: The tensor low-rank prior has attracted considerable attention in dynamic MR reconstruction. Tensor low-rank methods preserve the inherent high-dimensional structure of data, allowing for improved extraction and utilization of intrinsic low-rank characteristics. However, most current methods are still confined to utilizing low-rank structures either in the image domain or predefined transformed domains. Designing an optimal transformation adaptable to dynamic MRI reconstruction through manual efforts is inherently challenging. In this paper, we propose a deep unrolling network that utilizes the convolutional neural network (CNN) to adaptively learn the transformed domain for leveraging tensor low-rank priors. Under the supervised mechanism, the learning of the tensor low-rank domain is directly guided by the reconstruction accuracy. Specifically, we generalize the traditional t-SVD to a transformed version based on arbitrary high-dimensional unitary transformations and introduce a novel unitary transformed tensor nuclear norm (UTNN). Subsequently, we present a dynamic MRI reconstruction model based on UTNN and devise an efficient iterative optimization algorithm using ADMM, which is finally unfolded into the proposed T2LR-Net. Experiments on two dynamic cardiac MRI datasets demonstrate that T2LR-Net outperforms the state-of-the-art optimization-based and unrolling network-based methods.
Authors: Luo Luo, Yunyan Bai, Lesi Chen, Yuxing Liu, Haishan Ye
Abstract: We study the decentralized optimization problem $\min_{{\bf x}\in{\mathbb R}^d} f({\bf x})\triangleq \frac{1}{m}\sum_{i=1}^m f_i({\bf x})$, where the local function on the $i$-th agent has the form of $f_i({\bf x})\triangleq \frac{1}{n}\sum_{j=1}^n f_{i,j}({\bf x})$ and every individual $f_{i,j}$ is smooth but possibly nonconvex. We propose a stochastic algorithm called DEcentralized probAbilistic Recursive gradiEnt deScenT (DEAREST) method, which achieves an $\epsilon$-stationary point at each agent with the communication rounds of $\tilde{\mathcal O}(L\epsilon^{-2}/\sqrt{\gamma}\,)$, the computation rounds of $\tilde{\mathcal O}(n+(L+\min\{nL, \sqrt{n/m}\bar L\})\epsilon^{-2})$, and the local incremental first-oracle calls of ${\mathcal O}(mn + {\min\{mnL, \sqrt{mn}\bar L\}}{\epsilon^{-2}})$, where $L$ is the smoothness parameter of the objective function, $\bar L$ is the mean-squared smoothness parameter of all individual functions, and $\gamma$ is the spectral gap of the mixing matrix associated with the network. We then establish the lower bounds to show that the proposed method is near-optimal. Notice that the smoothness parameters $L$ and $\bar L$ used in our algorithm design and analysis are global, leading to sharper complexity bounds than existing results that depend on the local smoothness. We further extend DEAREST to solve the decentralized finite-sum optimization problem under the Polyak-{\L}ojasiewicz condition, also achieving the near-optimal complexity bounds.
Authors: John Chiang
Abstract: In this paper, we present a practical solution to implement privacy-preserving CNN training based on mere Homomorphic Encryption (HE) technique. To our best knowledge, this is the first attempt successfully to crack this nut and no work ever before has achieved this goal. Several techniques combine to accomplish the task:: (1) with transfer learning, privacy-preserving CNN training can be reduced to homomorphic neural network training, or even multiclass logistic regression (MLR) training; (2) via a faster gradient variant called $\texttt{Quadratic Gradient}$, an enhanced gradient method for MLR with a state-of-the-art performance in convergence speed is applied in this work to achieve high performance; (3) we employ the thought of transformation in mathematics to transform approximating Softmax function in the encryption domain to the approximation of the Sigmoid function. A new type of loss function termed $\texttt{Squared Likelihood Error}$ has been developed alongside to align with this change.; and (4) we use a simple but flexible matrix-encoding method named $\texttt{Volley Revolver}$ to manage the data flow in the ciphertexts, which is the key factor to complete the whole homomorphic CNN training. The complete, runnable C++ code to implement our work can be found at: \href{https://github.com/petitioner/HE.CNNtraining}{$\texttt{https://github.com/petitioner/HE.CNNtraining}$}. We select $\texttt{REGNET\_X\_400MF}$ as our pre-trained model for transfer learning. We use the first 128 MNIST training images as training data and the whole MNIST testing dataset as the testing data. The client only needs to upload 6 ciphertexts to the cloud and it takes $\sim 21$ mins to perform 2 iterations on a cloud with 64 vCPUs, resulting in a precision of $21.49\%$.
URLs: https://github.com/petitioner/HE.CNNtraining, https://github.com/petitioner/HE.CNNtraining
Authors: Sunder Ali Khowaja, Parus Khuwaja, Kapal Dev, Weizheng Wang, Lewis Nkenyereye
Abstract: ChatGPT is another large language model (LLM) vastly available for the consumers on their devices but due to its performance and ability to converse effectively, it has gained a huge popularity amongst research as well as industrial community. Recently, many studies have been published to show the effectiveness, efficiency, integration, and sentiments of chatGPT and other LLMs. In contrast, this study focuses on the important aspects that are mostly overlooked, i.e. sustainability, privacy, digital divide, and ethics and suggests that not only chatGPT but every subsequent entry in the category of conversational bots should undergo Sustainability, PrivAcy, Digital divide, and Ethics (SPADE) evaluation. This paper discusses in detail the issues and concerns raised over chatGPT in line with aforementioned characteristics. We also discuss the recent EU AI Act briefly in accordance with the SPADE evaluation. We support our hypothesis by some preliminary data collection and visualizations along with hypothesized facts. We also suggest mitigations and recommendations for each of the concerns. Furthermore, we also suggest some policies and recommendations for EU AI policy act concerning ethics, digital divide, and sustainability
Authors: Yossi Arjevani, Gal Vinograd
Abstract: We consider the nonconvex optimization problem associated with the decomposition of a real symmetric tensor into a sum of rank one terms. Use is made of the rich symmetry structure to construct infinite families of critical points represented by Puiseux series in the problem dimension, and so obtain precise analytic estimates on the value of the objective function and the Hessian spectrum. The results allow an analytic characterization of various obstructions to using local optimization methods, revealing in particular a complex array of saddles and minima differing by their symmetry, structure and analytic properties. A~desirable phenomenon, occurring for all critical points considered, concerns the number of negative Hessian eigenvalues increasing with the value of the objective function. Our approach makes use of Newton polyhedra as well as results from real algebraic geometry, notably the Curve Selection Lemma, to determine the extremal character of degenerate critical points, establishing in particular the existence of infinite families of third-order saddles which can significantly slow down the optimization process.
Authors: Wisdom Oluchi Ikezogwo, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Stefan Chan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, Linda Shapiro
Abstract: Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of analogous data in the medical field, specifically in histopathology, has slowed comparable progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering $1,087$ hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate QUILT: a large-scale vision-language dataset consisting of $802, 144$ image and text pairs. QUILT was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around $200$K samples. We combine QUILT with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset: QUILT-1M, with $1$M paired image-text samples, marking it as the largest vision-language histopathology dataset to date. We demonstrate the value of QUILT-1M by fine-tuning a pre-trained CLIP model. Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new histopathology images across $13$ diverse patch-level datasets of $8$ different sub-pathologies and cross-modal retrieval tasks.
Authors: Seunghoon Paik, Michael Celentano, Alden Green, Ryan J. Tibshirani
Abstract: Integral probability metrics (IPMs) constitute a general class of nonparametric two-sample tests that are based on maximizing the mean difference between samples from one distribution $P$ versus another $Q$, over all choices of data transformations $f$ living in some function space $\mathcal{F}$. Inspired by recent work that connects what are known as functions of $\textit{Radon bounded variation}$ (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study the IPM defined by taking $\mathcal{F}$ to be the unit ball in the RBV space of a given smoothness degree $k \geq 0$. This test, which we refer to as the $\textit{Radon-Kolmogorov-Smirnov}$ (RKS) test, can be viewed as a generalization of the well-known and classical Kolmogorov-Smirnov (KS) test to multiple dimensions and higher orders of smoothness. It is also intimately connected to neural networks: we prove that the witness in the RKS test -- the function $f$ achieving the maximum mean difference -- is always a ridge spline of degree $k$, i.e., a single neuron in a neural network. We can thus leverage the power of modern neural network optimization toolkits to (approximately) maximize the criterion that underlies the RKS test. We prove that the RKS test has asymptotically full power at distinguishing any distinct pair $P \not= Q$ of distributions, derive its asymptotic null distribution, and carry out experiments to elucidate the strengths and weaknesses of the RKS test versus the more traditional kernel MMD test.
Authors: Yusuf Brima, Ulf Krumnack, Simone Pika, Gunther Heidemann
Abstract: Disentangled representation learning in speech processing has lagged behind other domains, largely due to the lack of datasets with annotated generative factors for robust evaluation. To address this, we propose SynSpeech, a novel large-scale synthetic speech dataset specifically designed to enable research on disentangled speech representations. SynSpeech includes controlled variations in speaker identity, spoken text, and speaking style, with three dataset versions to support experimentation at different levels of complexity. In this study, we present a comprehensive framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics to assess the modularity, compactness, and informativeness of the representations learned by a state-of-the-art model. Using the RAVE model as a test case, we find that SynSpeech facilitates benchmarking across a range of factors, achieving promising disentanglement of simpler features like gender and speaking style, while highlighting challenges in isolating complex attributes like speaker identity. This benchmark dataset and evaluation framework fills a critical gap, supporting the development of more robust and interpretable speech representation learning methods.
Authors: Nixon Jerez-Lillo, Francisco A. Rodrigues, Paulo H. Ferreira, Pedro L. Ramos
Abstract: Scale-free networks play a fundamental role in the study of complex networks and various applied fields due to their ability to model a wide range of real-world systems. A key characteristic of these networks is their degree distribution, which often follows a power-law distribution, where the probability mass function is proportional to $x^{-\alpha}$, with $\alpha$ typically ranging between $2 < \alpha < 3$. In this paper, we introduce Bayesian inference methods to obtain more accurate estimates than those obtained using traditional methods, which often yield biased estimates, and precise credible intervals. Through a simulation study, we demonstrate that our approach provides nearly unbiased estimates for the scaling parameter, enhancing the reliability of inferences. We also evaluate new goodness-of-fit tests to improve the effectiveness of the Kolmogorov-Smirnov test, commonly used for this purpose. Our findings show that the Watson test offers superior power while maintaining a controlled type I error rate, enabling us to better determine whether data adheres to a power-law distribution. Finally, we propose a piecewise extension of this model to provide greater flexibility, evaluating the estimation and its goodness-of-fit features as well. In the complex networks field, this extension allows us to model the full degree distribution, instead of just focusing on the tail, as is commonly done. We demonstrate the utility of these novel methods through applications to two real-world datasets, showcasing their practical relevance and potential to advance the analysis of power-law behavior.
Authors: Jae Hyung Ju, Jaiyoung Park, Jongmin Kim, Minsik Kang, Donghwan Kim, Jung Hee Cheon, Jung Ho Ahn
Abstract: Fully homomorphic encryption (FHE) is a promising cryptographic primitive for realizing private neural network inference (PI) services by allowing a client to fully offload the inference task to a cloud server while keeping the client data oblivious to the server. This work proposes NeuJeans, an FHE-based solution for the PI of deep convolutional neural networks (CNNs). NeuJeans tackles the critical problem of the enormous computational cost for the FHE evaluation of CNNs. We introduce a novel encoding method called Coefficients-in-Slot (CinS) encoding, which enables multiple convolutions in one HE multiplication without costly slot permutations. We further observe that CinS encoding is obtained by conducting the first several steps of the Discrete Fourier Transform (DFT) on a ciphertext in conventional Slot encoding. This property enables us to save the conversion between CinS and Slot encodings as bootstrapping a ciphertext starts with DFT. Exploiting this, we devise optimized execution flows for various two-dimensional convolution (conv2d) operations and apply them to end-to-end CNN implementations. NeuJeans accelerates the performance of conv2d-activation sequences by up to 5.68 times compared to state-of-the-art FHE-based PI work and performs the PI of a CNN at the scale of ImageNet within a mere few seconds.
Authors: Jun Liu, Yiming Meng, Maxwell Fitzsimmons, Ruikun Zhou
Abstract: We provide a systematic investigation of using physics-informed neural networks to compute Lyapunov functions. We encode Lyapunov conditions as a partial differential equation (PDE) and use this for training neural network Lyapunov functions. We analyze the analytical properties of the solutions to the Lyapunov and Zubov PDEs. In particular, we show that employing the Zubov equation in training neural Lyapunov functions can lead to approximate regions of attraction close to the true domain of attraction. We also examine approximation errors and the convergence of neural approximations to the unique solution of Zubov's equation. We then provide sufficient conditions for the learned neural Lyapunov functions that can be readily verified by satisfiability modulo theories (SMT) solvers, enabling formal verification of both local stability analysis and region-of-attraction estimates in the large. Through a number of nonlinear examples, ranging from low to high dimensions, we demonstrate that the proposed framework can outperform traditional sums-of-squares (SOS) Lyapunov functions obtained using semidefinite programming (SDP).
Authors: Amir H. Ashouri, Muhammad Asif Manzoor, Duc Minh Vu, Raymond Zhang, Colin Toft, Ziwen Wang, Angel Zhang, Bryan Chan, Tomasz S. Czajkowski, Yaoqing Gao
Abstract: The key to performance optimization of a program is to decide correctly when a certain transformation should be applied by a compiler. This is an ideal opportunity to apply machine-learning models to speed up the tuning process; while this realization has been around since the late 90s, only recent advancements in ML enabled a practical application of ML to compilers as an end-to-end framework. This paper presents ACPO: An AI-Enabled Compiler Framework, a novel framework that provides LLVM with simple and comprehensive tools to benefit from employing ML models for different optimization passes. We first showcase the high-level view, class hierarchy, and functionalities of ACPO and subsequently, demonstrate \taco{a couple of use cases of ACPO by ML-enabling the Loop Unroll and Function Inlining passes used in LLVM's O3. and finally, describe how ACPO can be leveraged to optimize other passes. Experimental results reveal that the ACPO model for Loop Unroll can gain on average 4\%, 3\%, 5.4\%, and 0.2\% compared to LLVM's vanilla O3 optimization when deployed on Polybench, Coral-2, CoreMark, and Graph-500, respectively. Furthermore, by including both Function Inlining and Loop Unroll models, ACPO can provide a combined speedup of 4.5\% on Polybench and 2.4\% on Cbench when compared with LLVM's O3, respectively.
Authors: \v{S}t\v{e}p\'an \v{S}m\'id, Roberto Bondesan
Abstract: In this work, we consider a fundamental task in quantum many-body physics - finding and learning ground states of quantum Hamiltonians and their properties. Recent works have studied the task of predicting the ground state expectation value of sums of geometrically local observables by learning from data. For short-range gapped Hamiltonians, a sample complexity that is logarithmic in the number of qubits and quasipolynomial in the error was obtained. Here we extend these results beyond the local requirements on both Hamiltonians and observables, motivated by the relevance of long-range interactions in molecular and atomic systems. For interactions decaying as a power law with exponent greater than twice the dimension of the system, we recover the same efficient logarithmic scaling with respect to the number of qubits, but the dependence on the error worsens to exponential. Further, we show that learning algorithms equivariant under the automorphism group of the interaction hypergraph achieve a sample complexity reduction, leading in particular to a constant number of samples for learning sums of local observables in systems with periodic boundary conditions. We demonstrate the efficient scaling in practice by learning from DMRG simulations of $1$D long-range and disordered systems with up to $128$ qubits. Finally, we provide an analysis of the concentration of expectation values of global observables stemming from the central limit theorem, resulting in increased prediction accuracy.
Authors: Ci-Siang Lin, Chien-Yi Wang, Yu-Chiang Frank Wang, Min-Hung Chen
Abstract: Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation models using image data with only image-level supervision. Since precise pixel-level annotations are not accessible, existing methods typically focus on producing pseudo masks for training segmentation models by refining CAM-like heatmaps. However, the produced heatmaps may capture only the discriminative image regions of object categories or the associated co-occurring backgrounds. To address the issues, we propose a Semantic Prompt Learning for WSSS (SemPLeS) framework, which learns to effectively prompt the CLIP latent space to enhance the semantic alignment between the segmented regions and the target object categories. More specifically, we propose Contrastive Prompt Learning and Prompt-guided Semantic Refinement to learn the prompts that adequately describe and suppress the co-occurring backgrounds associated with each object category. In this way, SemPLeS can perform better semantic alignment between object regions and class labels, resulting in desired pseudo masks for training segmentation models. The proposed SemPLeS framework achieves competitive performance on standard WSSS benchmarks, PASCAL VOC 2012 and MS COCO 2014, and shows compatibility with other WSSS methods. Code: https://github.com/NVlabs/SemPLeS.
Authors: Junpei Komiyama, Shinji Ito, Yuichi Yoshida, Souta Koshino
Abstract: We consider a replicable stochastic multi-armed bandit algorithm that ensures, with high probability, that the algorithm's sequence of actions is not affected by the randomness inherent in the dataset. Replicability allows third parties to reproduce published findings and assists the original researcher in applying standard statistical tests. We observe that existing algorithms require $O(K^2/\rho^2)$ times more regret than nonreplicable algorithms, where $K$ is the number of arms and $\rho$ is the level of nonreplication. However, we demonstrate that this additional cost is unnecessary when the time horizon $T$ is sufficiently large for a given $K, \rho$, provided that the magnitude of the confidence bounds is chosen carefully. Therefore, for a large $T$, our algorithm only suffers $K^2/\rho^2$ times smaller amount of exploration than existing algorithms. To ensure the replicability of the proposed algorithms, we incorporate randomness into their decision-making processes. We propose a principled approach to limiting the probability of nonreplication. This approach elucidates the steps that existing research has implicitly followed. Furthermore, we derive the first lower bound for the two-armed replicable bandit problem, which implies the optimality of the proposed algorithms up to a $\log\log T$ factor for the two-armed case.
Authors: Qichen Xu, Zhuanglin Shen, Alexander Edstr\"om, I. P. Miranda, Zhiwei Lu, Anders Bergman, Danny Thonig, Wanjian Yin, Olle Eriksson, Anna Delin
Abstract: Despite extensive research on magnetic skyrmions and antiskyrmions, a significant challenge remains in crafting nontrivial high-order skyrmionic textures with varying, or even tailor-made, topologies. We address this challenge, by focusing on a construction pathway of skyrmionic metamaterials within a monolayer thin film and suggest several skyrmionic metamaterials that are surprisingly stable, i.e., long-lived, due to a self-stabilization mechanism. This makes these new textures promising for applications. Central to our approach is the concept of 'simulated controlled assembly', in short, a protocol inspired by 'click chemistry' that allows for positioning topological magnetic structures where one likes, and then allowing for energy minimization to elucidate the stability. Utilizing high-throughput atomistic-spin-dynamic simulations alongside state-of-the-art AI-driven tools, we have isolated skyrmions (topological charge Q=1), antiskyrmions (Q=-1), and skyrmionium (Q=0). These entities serve as foundational 'skyrmionic building blocks' to form the here reported intricate textures. In this work, two key contributions are introduced to the field of skyrmionic systems. First, we present a a novel combination of atomistic spin dynamics simulations and controlled assembly protocols for the stabilization and investigation of new topological magnets. Second, using the aforementioned methods we report on the discovery of skyrmionic metamaterials.
Authors: Daniel Schug, Tyler J. Kovach, M. A. Wolfe, Jared Benson, Sanghyeok Park, J. P. Dodson, J. Corrigan, M. A. Eriksson, Justyna P. Zwolak
Abstract: The rapid development of quantum dot (QD) devices for quantum computing has necessitated more efficient and automated methods for device characterization and tuning. This work demonstrates the feasibility and advantages of applying explainable machine learning techniques to the analysis of quantum dot measurements, paving the way for further advances in automated and transparent QD device tuning. Many of the measurements acquired during the tuning process come in the form of images that need to be properly analyzed to guide the subsequent tuning steps. By design, features present in such images capture certain behaviors or states of the measured QD devices. When considered carefully, such features can aid the control and calibration of QD devices. An important example of such images are so-called $\textit{triangle plots}$, which visually represent current flow and reveal characteristics important for QD device calibration. While image-based classification tools, such as convolutional neural networks (CNNs), can be used to verify whether a given measurement is $\textit{good}$ and thus warrants the initiation of the next phase of tuning, they do not provide any insights into how the device should be adjusted in the case of $\textit{bad}$ images. This is because CNNs sacrifice prediction and model intelligibility for high accuracy. To ameliorate this trade-off, a recent study introduced an image vectorization approach that relies on the Gabor wavelet transform (Schug $\textit{et al.}$ 2024 $\textit{Proc. XAI4Sci: Explainable Machine Learning for Sciences Workshop (AAAI 2024) (Vancouver, Canada)}$ pp 1-6). Here we propose an alternative vectorization method that involves mathematical modeling of synthetic triangles to mimic the experimental data. Using explainable boosting machines, we show that this new method offers superior explainability of model prediction without sacrificing accuracy.
Authors: Anirudh Prabhakaran, YeKun Xiao, Ching-Yu Cheng, Dianbo Liu
Abstract: Ocular diseases, including diabetic retinopathy and glaucoma, present a significant public health challenge due to their high prevalence and potential for causing vision impairment. Early and accurate diagnosis is crucial for effective treatment and management. In recent years, deep learning models have emerged as powerful tools for analysing medical images, such as retina imaging. However, challenges persist in model relibability and uncertainty estimation, which are critical for clinical decision-making. This study leverages the probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over latent discrete dropout masks for the classification and analysis of ocular diseases using fundus images. We develop a robust and generalizable method that utilizes GFlowOut integrated with ResNet18 and ViT models as the backbone in identifying various ocular conditions. This study employs a unique set of dropout masks - none, random, bottomup, and topdown - to enhance model performance in analyzing these fundus images. Our results demonstrate that our learnable probablistic latents significantly improves accuracy, outperforming the traditional dropout approach. We utilize a gradient map calculation method, Grad-CAM, to assess model explainability, observing that the model accurately focuses on critical image regions for predictions. The integration of GFlowOut in neural networks presents a promising advancement in the automated diagnosis of ocular diseases, with implications for improving clinical workflows and patient outcomes.
Authors: Jun Liu, Zhenglun Kong, Pu Zhao, Changdi Yang, Hao Tang, Xuan Shen, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Dong Huang, Yanzhi Wang
Abstract: Structured pruning for large language models (LLMs) has garnered significant academic interest due to its ability to efficiently compress and accelerate LLMs by eliminating redundant weight groups at a coarse-grained granularity. Current structured pruning methods for LLMs typically depend on a singular granularity for assessing weight importance, resulting in notable performance degradation in downstream tasks. Intriguingly, our empirical investigations reveal that utilizing unstructured pruning, which achieves better performance retention by pruning weights at a finer granularity, \emph{i.e.}, individual weights, yields significantly varied sparse LLM structures when juxtaposed to structured pruning. This suggests that evaluating both holistic and individual assessment for weight importance is essential for LLM pruning. Building on this insight, we introduce the Hybrid-grained Weight Importance Assessment (HyWIA), a novel method that merges fine-grained and coarse-grained evaluations of weight importance for the pruning of LLMs. Leveraging an attention mechanism, HyWIA adaptively determines the optimal blend of granularity in weight importance assessments in an end-to-end pruning manner. Extensive experiments on LLaMA-V1/V2, Vicuna, Baichuan, and Bloom across various benchmarks demonstrate the effectiveness of HyWIA in pruning LLMs. For example, HyWIA surpasses the cutting-edge LLM-Pruner by an average margin of 2.82% in accuracy across seven downstream tasks when pruning LLaMA-7B by 50%.
Authors: Abdelrahman Sadallah, Daria Kotova, Ekaterina Kochmar
Abstract: Cryptic crosswords are puzzles that rely not only on general knowledge but also on the solver's ability to manipulate language on different levels and deal with various types of wordplay. Previous research suggests that solving such puzzles is a challenge even for modern NLP models. However, the abilities of large language models (LLMs) have not yet been tested on this task. In this paper, we establish the benchmark results for three popular LLMs -- LLaMA2, Mistral, and ChatGPT -- showing that their performance on this task is still far from that of humans.
Authors: Thomas Melistas, Nikos Spyrou, Nefeli Gkouti, Pedro Sanchez, Athanasios Vlontzos, Yannis Panagakis, Giorgos Papanastasiou, Sotirios A. Tsaftaris
Abstract: Generative AI has revolutionised visual content editing, empowering users to effortlessly modify images and videos. However, not all edits are equal. To perform realistic edits in domains such as natural image or medical imaging, modifications must respect causal relationships inherent to the data generation process. Such image editing falls into the counterfactual image generation regime. Evaluating counterfactual image generation is substantially complex: not only it lacks observable ground truths, but also requires adherence to causal constraints. Although several counterfactual image generation methods and evaluation metrics exist, a comprehensive comparison within a unified setting is lacking. We present a comparison framework to thoroughly benchmark counterfactual image generation methods. We integrate all models that have been used for the task at hand and expand them to novel datasets and causal graphs, demonstrating the superiority of Hierarchical VAEs across most datasets and metrics. Our framework is implemented in a user-friendly Python package that can be extended to incorporate additional SCMs, causal methods, generative models, and datasets for the community to build on. Code: https://github.com/gulnazaki/counterfactual-benchmark.
URLs: https://github.com/gulnazaki/counterfactual-benchmark.
Authors: Ainesh Bakshi, Allen Liu, Ankur Moitra, Ewin Tang
Abstract: We study the problem of Hamiltonian structure learning from real-time evolution: given the ability to apply $e^{-\mathrm{i} Ht}$ for an unknown local Hamiltonian $H = \sum_{a = 1}^m \lambda_a E_a$ on $n$ qubits, the goal is to recover $H$. This problem is already well-understood under the assumption that the interaction terms, $E_a$, are given, and only the interaction strengths, $\lambda_a$, are unknown. But how efficiently can we learn a local Hamiltonian without prior knowledge of its interaction structure? We present a new, general approach to Hamiltonian learning that not only solves the challenging structure learning variant, but also resolves other open questions in the area, all while achieving the gold standard of Heisenberg-limited scaling. In particular, our algorithm recovers the Hamiltonian to $\varepsilon$ error with total evolution time $O(\log (n)/\varepsilon)$, and has the following appealing properties: (1) it does not need to know the Hamiltonian terms; (2) it works beyond the short-range setting, extending to any Hamiltonian $H$ where the sum of terms interacting with a qubit has bounded norm; (3) it evolves according to $H$ in constant time $t$ increments, thus achieving constant time resolution. As an application, we can also learn Hamiltonians exhibiting power-law decay up to accuracy $\varepsilon$ with total evolution time beating the standard limit of $1/\varepsilon^2$.
Authors: Kilian Tscharke, Sebastian Issel, Pascal Debus
Abstract: Quantum computing (QC) seems to show potential for application in machine learning (ML). In particular quantum kernel methods (QKM) exhibit promising properties for use in supervised ML tasks. However, a major disadvantage of kernel methods is their unfavorable quadratic scaling with the number of training samples. Together with the limits imposed by currently available quantum hardware (NISQ devices) with their low qubit coherence times, small number of qubits, and high error rates, the use of QC in ML at an industrially relevant scale is currently impossible. As a small step in improving the potential applications of QKMs, we introduce QUACK, a quantum kernel algorithm whose time complexity scales linear with the number of samples during training, and independent of the number of training samples in the inference stage. In the training process, only the kernel entries for the samples and the centers of the classes are calculated, i.e. the maximum shape of the kernel for n samples and c classes is (n, c). During training, the parameters of the quantum kernel and the positions of the centroids are optimized iteratively. In the inference stage, for every new sample the circuit is only evaluated for every centroid, i.e. c times. We show that the QUACK algorithm nevertheless provides satisfactory results and can perform at a similar level as classical kernel methods with quadratic scaling during training. In addition, our (simulated) algorithm is able to handle high-dimensional datasets such as MNIST with 784 features without any dimensionality reduction.
Authors: Simmaco Di Lillo, Domenico Marinucci, Michele Salvi, Stefano Vigogna
Abstract: It is well-known that randomly initialized, push-forward, fully-connected neural networks weakly converge to isotropic Gaussian processes, in the limit where the width of all layers goes to infinity. In this paper, we propose to use the angular power spectrum of the limiting field to characterize the complexity of the network architecture. In particular, we define sequences of random variables associated with the angular power spectrum, and provide a full characterization of the network complexity in terms of the asymptotic distribution of these sequences as the depth diverges. On this basis, we classify neural networks as low-disorder, sparse, or high-disorder; we show how this classification highlights a number of distinct features for standard activation functions, and in particular, sparsity properties of ReLU networks. Our theoretical results are also validated by numerical simulations.
Authors: Fernando Moreno-Pino, \'Alvaro Arroyo, Harrison Waldon, Xiaowen Dong, \'Alvaro Cartea
Abstract: Time-series data in real-world settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In these settings, traditional sequence-based recurrent models struggle. To overcome this, researchers often replace recurrent architectures with Neural ODE-based models to account for irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of even moderate length. To address this challenge, we introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences and incurs significantly lower computational costs. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global (multi-scale) dependencies in the input data, while remaining robust to changes in the sequence length and sampling frequency and yielding improved spatial processing. We find that, on a variety of time-series-related tasks, Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the representational benefits of Neural ODE-based models, all at a fraction of the computational time and memory resources.
Authors: Venkata Sai Pranav Bachina, Ankit Gangwal, Aaryan Ajay Sharma, Charu Sharma
Abstract: Graph Neural Networks (GNNs) have become invaluable intellectual property in graph-based machine learning. However, their vulnerability to model stealing attacks when deployed within Machine Learning as a Service (MLaaS) necessitates robust Ownership Demonstration (OD) techniques. Watermarking is a promising OD framework for Deep Neural Networks, but existing methods fail to generalize to GNNs due to the non-Euclidean nature of graph data. Previous works on GNN watermarking have primarily focused on node and graph classification, overlooking Link Prediction (LP). In this paper, we propose GENIE (watermarking Graph nEural Networks for lInk prEdiction), the first-ever scheme to watermark GNNs for LP. GENIE creates a novel backdoor for both node-representation and subgraph-based LP methods, utilizing a unique trigger set and a secret watermark vector. Our OD scheme is equipped with Dynamic Watermark Thresholding (DWT), ensuring high verification probability (>99.99%) while addressing practical issues in existing watermarking schemes. We extensively evaluate GENIE across 4 model architectures (i.e., SEAL, GCN, GraphSAGE and NeoGNN) and 7 real-world datasets. Furthermore, we validate the robustness of GENIE against 11 state-of-the-art watermark removal techniques and 3 model extraction attacks. We also show GENIE's resilience against ownership piracy attacks. Finally, we discuss a defense strategy to counter adaptive attacks against GENIE.
Authors: Dan Saattrup Nielsen, Kenneth Enevoldsen, Peter Schneider-Kamp
Abstract: This paper explores the performance of encoder and decoder language models on multilingual Natural Language Understanding (NLU) tasks, with a broad focus on Germanic languages. Building upon the ScandEval benchmark, initially restricted to evaluating encoder models, we extend the evaluation framework to include decoder models. We introduce a method for evaluating decoder models on NLU tasks and apply it to the languages Danish, Swedish, Norwegian, Icelandic, Faroese, German, Dutch, and English. Through a series of experiments and analyses, we also address research questions regarding the comparative performance of encoder and decoder models, the impact of NLU task types, and the variation across language resources. Our findings reveal that encoder models can achieve significantly better NLU performance than decoder models despite having orders of magnitude fewer parameters. Additionally, we investigate the correlation between decoders and task performance via a UMAP analysis, shedding light on the unique capabilities of decoder and encoder models. This study contributes to a deeper understanding of language model paradigms in NLU tasks and provides valuable insights for model selection and evaluation in multilingual settings.
Authors: Siddharth Chandak, Ilai Bistritz, Nicholas Bambos
Abstract: Consider a game where the players' utility functions include a reward function and a linear term for each dimension, with coefficients that are controlled by the manager. We assume that the game is strongly monotone, so gradient play converges to a unique Nash equilibrium (NE). The NE is typically globally inefficient. The global performance at NE can be improved by imposing linear constraints on the NE. We therefore want the manager to pick the controlled coefficients that impose the desired constraint on the NE. However, this requires knowing the players' reward functions and action sets. Obtaining this game information is infeasible in a large-scale network and violates user privacy. To overcome this, we propose a simple algorithm that learns to shift the NE to meet the linear constraints by adjusting the controlled coefficients online. Our algorithm only requires the linear constraints violation as feedback and does not need to know the reward functions or the action sets. We prove that our algorithm converges with probability 1 to the set of NE that satisfy target linear constraints. We then prove an L2 convergence rate of near-$O(t^{-1/4})$.
Authors: Lorenzo Bonasera, Emilio Carrizosa
Abstract: Tree ensembles are very popular machine learning models, known for their effectiveness in supervised classification and regression tasks. Their performance derives from aggregating predictions of multiple decision trees, which are renowned for their interpretability properties. However, tree ensemble models do not reliably exhibit interpretable output. Our work aims to extract an optimized list of rules from a trained tree ensemble, providing the user with a condensed, interpretable model that retains most of the predictive power of the full model. Our approach consists of solving a set partitioning problem formulated through Integer Programming. The proposed method works with either tabular or time series data, for both classification and regression tasks, and its flexible formulation can include any arbitrary loss or regularization functions. Our extensive computational experiments offer statistically significant evidence that our method is competitive with other rule extraction methods in terms of predictive performance and fidelity towards the tree ensemble. Moreover, we empirically show that the proposed method effectively extracts interpretable rules from tree ensemble that are designed for time series data.
Authors: Ahmed Frikha, Nassim Walha, Ricardo Mendes, Krishna Kanth Nakka, Xue Jiang, Xuebing Zhou
Abstract: This work addresses the timely yet underexplored problem of performing inference and finetuning of a proprietary LLM owned by a model provider entity on the confidential/private data of another data owner entity, in a way that ensures the confidentiality of both the model and the data. Hereby, the finetuning is conducted offsite, i.e., on the computation infrastructure of a third-party cloud provider. We tackle this problem by proposing ObfuscaTune, a novel, efficient and fully utility-preserving approach that combines a simple yet effective obfuscation technique with an efficient usage of confidential computing (only 5% of the model parameters are placed on TEE). We empirically demonstrate the effectiveness of ObfuscaTune by validating it on GPT-2 models with different sizes on four NLP benchmark datasets. Finally, we compare to a na\"ive version of our approach to highlight the necessity of using random matrices with low condition numbers in our approach to reduce errors induced by the obfuscation.
Authors: Aidan Furlong, Farah Alsafadi, Scott Palmtag, Andrew Godfrey, Xu Wu
Abstract: The development of Crud-Induced Power Shift (CIPS) is an operational challenge in Pressurized Water Reactors that is due to the development of crud on the fuel rod cladding. The available predictive tools developed previously, usually based on fundamental physics, are computationally expensive and have shown differing degrees of accuracy. This work proposes a completely top-down approach to predict CIPS instances on an assembly level with reactor-specific calibration built-in. Built using artificial neural networks, this work uses a three-dimensional convolutional approach to leverage the image-like layout of the input data. As a classifier, the convolutional neural network model predicts whether a given assembly will experience CIPS as well as the time of occurrence during a given cycle. This surrogate model is both trained and tested using a combination of calculated core model parameters and measured plant data from Unit 1 of the Catawba Nuclear Station. After the evaluation of its performance using various metrics, Monte Carlo dropout is employed for extensive uncertainty quantification of the model predictions. The results indicate that this methodology could be a viable approach in predicting CIPS with an assembly-level resolution across both clean and afflicted cycles, while using limited computational resources.
Authors: Yongqi Leng, Deyi Xiong
Abstract: While large language models (LLMs) have demonstrated superior multi-task capabilities, understanding the learning mechanisms behind this is still a challenging problem. In this paper, we attempt to understand such mechanisms from the perspective of neurons. Specifically, we detect task-sensitive neurons in LLMs via gradient attribution on task-specific data. Through extensive deactivation and fine-tuning experiments, we demonstrate that the detected neurons are highly correlated with the given task, which we term as task-specific neurons. With these identified task-specific neurons, we delve into two common problems in multi-task learning and continuous learning: Generalization and Catastrophic Forgetting. We find that the overlap of task-specific neurons is strongly associated with generalization and specialization across tasks. Interestingly, at certain layers of LLMs, there is a high similarity in the parameters of different task-specific neurons, and such similarity is highly correlated with the generalization performance. Inspired by these findings, we propose a neuron-level continuous fine-tuning method that only fine-tunes the current task-specific neurons during continuous learning, and extensive experiments demonstrate the effectiveness of the proposed method. Our study provides insights into the interpretability of LLMs in multi-task learning.
Authors: Delyan Boychev, Radostin Cholakov
Abstract: Recent generative models produce images with a level of authenticity that makes them nearly indistinguishable from real photos and artwork. Potential harmful use cases of these models, necessitate the creation of robust synthetic image detectors. However, current datasets in the field contain generated images with questionable quality or have examples from one predominant content type which leads to poor generalizability of the underlying detectors. We find that the curation of a balanced amount of high-resolution generated images across various content types is crucial for the generalizability of detectors, and introduce ImagiNet, a dataset of 200K examples, spanning four categories: photos, paintings, faces, and miscellaneous. Synthetic images in ImagiNet are produced with both open-source and proprietary generators, whereas real counterparts for each content type are collected from public datasets. The structure of ImagiNet allows for a two-track evaluation system: i) classification as real or synthetic and ii) identification of the generative model. To establish a strong baseline, we train a ResNet-50 model using a self-supervised contrastive objective (SelfCon) for each track which achieves evaluation AUC of up to 0.99 and balanced accuracy ranging from 86% to 95%, even under conditions that involve compression and resizing. The provided model is generalizable enough to achieve zero-shot state-of-the-art performance on previous synthetic detection benchmarks. We provide ablations to demonstrate the importance of content types and publish code and data.
Authors: Amber Hu, David Zoltowski, Aditya Nair, David Anderson, Lea Duncker, Scott Linderman
Abstract: Understanding how the collective activity of neural populations relates to computation and ultimately behavior is a key goal in neuroscience. To this end, statistical methods which describe high-dimensional neural time series in terms of low-dimensional latent dynamics have played a fundamental role in characterizing neural systems. Yet, what constitutes a successful method involves two opposing criteria: (1) methods should be expressive enough to capture complex nonlinear dynamics, and (2) they should maintain a notion of interpretability often only warranted by simpler linear models. In this paper, we develop an approach that balances these two objectives: the Gaussian Process Switching Linear Dynamical System (gpSLDS). Our method builds on previous work modeling the latent state evolution via a stochastic differential equation whose nonlinear dynamics are described by a Gaussian process (GP-SDEs). We propose a novel kernel function which enforces smoothly interpolated locally linear dynamics, and therefore expresses flexible -- yet interpretable -- dynamics akin to those of recurrent switching linear dynamical systems (rSLDS). Our approach resolves key limitations of the rSLDS such as artifactual oscillations in dynamics near discrete state boundaries, while also providing posterior uncertainty estimates of the dynamics. To fit our models, we leverage a modified learning objective which improves the estimation accuracy of kernel hyperparameters compared to previous GP-SDE fitting approaches. We apply our method to synthetic data and data recorded in two neuroscience experiments and demonstrate favorable performance in comparison to the rSLDS.
Authors: Mohammad Dehghani Tezerjani, Mohammad Khoshnazar, Mohammadhamed Tangestanizadeh, Arman Kiani, Qing Yang
Abstract: The emergence of mobile robotics, particularly in the automotive industry, introduces a promising era of enriched user experiences and adept handling of complex navigation challenges. The realization of these advancements necessitates a focused technological effort and the successful execution of numerous intricate tasks, particularly in the critical domain of Simultaneous Localization and Mapping (SLAM). Various artificial intelligence (AI) methodologies, such as deep learning and reinforcement learning, present viable solutions to address the challenges in SLAM. This study specifically explores the application of reinforcement learning in the context of SLAM. By enabling the agent (the robot) to iteratively interact with and receive feedback from its environment, reinforcement learning facilitates the acquisition of navigation and mapping skills, thereby enhancing the robot's decision-making capabilities. This approach offers several advantages, including improved navigation proficiency, increased resilience, reduced dependence on sensor precision, and refinement of the decision-making process. The findings of this study, which provide an overview of reinforcement learning's utilization in SLAM, reveal significant advancements in the field. The investigation also highlights the evolution and innovative integration of these techniques.
Authors: Dong Liu, Yanxuan Yu, Zhixin Lai, Yite Wang, Jing Wu, Zhongwei Wan, Sina Alinejad, Benjamin Lengerich, Ying Nian Wu
Abstract: This paper focuses on modern efficient training and inference technologies on foundation models and illustrates them from two perspectives: model and system design. Model and System Design optimize LLM training and inference from different aspects to save computational resources, making LLMs more efficient, affordable, and more accessible. The paper list repository is available at \url{https://github.com/NoakLiu/Efficient-Foundation-Models-Survey}
URLs: https://github.com/NoakLiu/Efficient-Foundation-Models-Survey
Authors: Thomas Thebaud, Anna Favaro, Casey Chen, Gabrielle Chavez, Laureano Moro-Velazquez, Ankur Butala, Najim Dehak
Abstract: Motor dysfunction is a common sign of neurodegenerative diseases (NDs) such as Parkinson's disease (PD) and Alzheimer's disease (AD), but may be difficult to detect, especially in the early stages. In this work, we examine the behavior of a wide array of explainable metrics extracted from the handwriting signals of 113 subjects performing multiple tasks on a digital tablet, as part of the Neurological Signals dataset. The aim is to measure their effectiveness in characterizing NDs, including AD and PD. To this end, task-agnostic and task-specific metrics are extracted from 14 distinct tasks. Subsequently, through statistical analysis and a series of classification experiments, we investigate which metrics provide greater discriminative power between NDs and healthy controls and amongst different NDs. Preliminary results indicate that the tasks at hand can all be effectively leveraged to distinguish between the considered set of NDs, specifically by measuring the stability, the speed of writing, the time spent not writing, and the pressure variations between groups from our handcrafted explainable metrics, which shows p-values lower than 0.0001 for multiple tasks. Using various binary classification algorithms on the computed metrics, we obtain up to 87 % accuracy for the discrimination between AD and healthy controls (CTL), and up to 69 % for the discrimination between PD and CTL.
Authors: Ruiqi Zhong, Heng Wang, Dan Klein, Jacob Steinhardt
Abstract: To make sense of massive data, we often fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often high-dimensional and hard to interpret. To make model parameters directly interpretable, we introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates. For example, a cluster of text about COVID could be parameterized by the predicate "discusses COVID". To learn these statistical models effectively, we develop a model-agnostic algorithm that optimizes continuous relaxations of predicate parameters with gradient descent and discretizes them by prompting language models (LMs). Finally, we apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other, clustering math problems based on subareas, and explaining visual features in memorable images. Our framework is highly versatile, applicable to both textual and visual domains, can be easily steered to focus on specific properties (e.g. subareas), and explains sophisticated concepts that classical methods (e.g. n-gram analysis) struggle to produce.
Authors: Theodoros Sotirou, Vassilis Lyberatos, Orfeas Menis Mastromichalakis, Giorgos Stamou
Abstract: Multimodal models are critical for music understanding tasks, as they capture the complex interplay between audio and lyrics. However, as these models become more prevalent, the need for explainability grows-understanding how these systems make decisions is vital for ensuring fairness, reducing bias, and fostering trust. In this paper, we introduce MusicLIME, a model-agnostic feature importance explanation method designed for multimodal music models. Unlike traditional unimodal methods, which analyze each modality separately without considering the interaction between them, often leading to incomplete or misleading explanations, MusicLIME reveals how audio and lyrical features interact and contribute to predictions, providing a holistic view of the model's decision-making. Additionally, we enhance local explanations by aggregating them into global explanations, giving users a broader perspective of model behavior. Through this work, we contribute to improving the interpretability of multimodal music models, empowering users to make informed choices, and fostering more equitable, fair, and transparent music understanding systems.
Authors: Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang
Abstract: In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
Authors: Ali Mehrabian, Ehsan Hoseinzade, Mahdi Mazloum, Xiaohong Chen
Abstract: Stock markets play an important role in the global economy, where accurate stock price predictions can lead to significant financial returns. While existing transformer-based models have outperformed long short-term memory networks and convolutional neural networks in financial time series prediction, their high computational complexity and memory requirements limit their practicality for real-time trading and long-sequence data processing. To address these challenges, we propose SAMBA, an innovative framework for stock return prediction that builds on the Mamba architecture and integrates graph neural networks. SAMBA achieves near-linear computational complexity by utilizing a bidirectional Mamba block to capture long-term dependencies in historical price data and employing adaptive graph convolution to model dependencies between daily stock features. Our experimental results demonstrate that SAMBA significantly outperforms state-of-the-art baseline models in prediction accuracy, maintaining low computational complexity. The code and datasets are available at github.com/Ali-Meh619/SAMBA.
Authors: Gustavo A. Bas\'ilio, Thiago B. Pereira, Alessandro L. Koerich, Hermano Tavares, Ludmila Dias, Maria das Gra\c{c}as da S. Teixeira, Rafael T. Sousa, Wilian H. Hisatugu, Amanda S. Mota, Anilton S. Garcia, Marco Aur\'elio K. Galletta, Thiago M. Paix\~ao
Abstract: Major Depressive Disorder and anxiety disorders affect millions globally, contributing significantly to the burden of mental health issues. Early screening is crucial for effective intervention, as timely identification of mental health issues can significantly improve treatment outcomes. Artificial intelligence (AI) can be valuable for improving the screening of mental disorders, enabling early intervention and better treatment outcomes. AI-driven screening can leverage the analysis of multiple data sources, including facial features in digital images. However, existing methods often rely on controlled environments or specialized equipment, limiting their broad applicability. This study explores the potential of AI models for ubiquitous depression-anxiety screening given face-centric selfies. The investigation focuses on high-risk pregnant patients, a population that is particularly vulnerable to mental health issues. To cope with limited training data resulting from our clinical setup, pre-trained models were utilized in two different approaches: fine-tuning convolutional neural networks (CNNs) originally designed for facial expression recognition and employing vision-language models (VLMs) for zero-shot analysis of facial expressions. Experimental results indicate that the proposed VLM-based method significantly outperforms CNNs, achieving an accuracy of 77.6%. Although there is significant room for improvement, the results suggest that VLMs can be a promising approach for mental health screening.
Authors: Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, Chuan Guo
Abstract: Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the Internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be injected into external data sources to override the system's intended instruction and instead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the injection). We then perform preference optimization on this dataset to teach the LLM to prefer the secure output over the insecure one. This provides the first known method that reduces the success rates of various prompt injections to around 0%, even against attacks much more sophisticated than ones seen during training. This indicates our defense generalizes well against unknown and yet-to-come attacks. Also, our defended models are still practical with similar utility to the one before our defensive training. Our code is at https://github.com/facebookresearch/SecAlign
Authors: Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao
Abstract: Recently, quantization has been widely used for the compression and acceleration of large language models~(LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with the equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still remain steep and outspread. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach to enhance flatness of weights and activations. Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead, we apply Kronecker decomposition to the transformation matrices, and fuse all operations in FlatQuant into a single kernel. Extensive experiments show that FlatQuant sets up a new state-of-the-art quantization benchmark. For instance, it achieves less than $\textbf{1}\%$ accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by $\textbf{7.5}\%$. For inference latency, FlatQuant reduces the slowdown induced by pre-quantization transformation from 0.26x of QuaRot to merely $\textbf{0.07x}$, bringing up to $\textbf{2.3x}$ speedup for prefill and $\textbf{1.7x}$ speedup for decoding, respectively. Code is available at: \url{https://github.com/ruikangliu/FlatQuant}.
Authors: Alaa Awad, Salah A. Aly
Abstract: Leukemia, a severe form of blood cancer, claims thousands of lives each year. This study focuses on the detection of Acute Lymphoblastic Leukemia (ALL) using advanced image processing and deep learning techniques. By leveraging recent advancements in artificial intelligence, the research evaluates the reliability of these methods in practical, real-world scenarios. Specifically, it examines the performance of state-of-the-art YOLO models, including YOLOv8 and YOLOv11, to distinguish between malignant and benign white blood cells and accurately identify different stages of ALL, including early stages. Moreover, the models demonstrate the ability to detect hematogones, which are frequently misclassified as ALL. With accuracy rates reaching 98.8%, this study highlights the potential of these algorithms to provide robust and precise leukemia detection across diverse datasets and conditions.
Authors: Hossein Nasiri, Seda Dogan-Tusha, Muhammad Iqbal Rochman, Monisha Ghosh
Abstract: Robust classification of the operational environment of wireless devices is becoming increasingly important for wireless network optimization, particularly in a shared spectrum environment. Distinguishing between indoor and outdoor devices can enhance reliability and improve coexistence with existing, outdoor, incumbents. For instance, the unlicensed but shared 6 GHz band (5.925 - 7.125 GHz) enables sharing by imposing lower transmit power for indoor unlicensed devices and a spectrum coordination requirement for outdoor devices. Further, indoor devices are prohibited from using battery power, external antennas, and weatherization to prevent outdoor operations. As these rules may be circumvented, we propose a robust indoor/outdoor classification method by leveraging the fact that the radio-frequency environment faced by a device are quite different indoors and outdoors. We first collect signal strength data from all cellular and Wi-Fi bands that can be received by a smartphone in various environments (indoor interior, indoor near windows, and outdoors), along with GPS accuracy, and then evaluate three machine learning (ML) methods: deep neural network (DNN), decision tree, and random forest to perform classification into these three categories. Our results indicate that the DNN model performs the best, particularly in minimizing the most important classification error, that of classifying outdoor devices as indoor interior devices.
Authors: Joris Postmus, Steven Abreu
Abstract: Large language models have transformed AI, yet reliably controlling their outputs remains a challenge. This paper explores activation engineering, where outputs of pre-trained LLMs are controlled by manipulating their activations at inference time. Unlike traditional methods using a single steering vector, we introduce conceptors - mathematical constructs that represent sets of activation vectors as ellipsoidal regions. Conceptors act as soft projection matrices and offer more precise control over complex activation patterns. Our experiments demonstrate that conceptors outperform traditional methods across multiple steering tasks. We further use Boolean operations on conceptors for combined steering goals that empirically outperform additively combining steering vectors on a set of tasks. These results highlight conceptors as a promising tool for more effective steering of LLMs. Our code is available on github.com/jorispos/conceptorsteering.
Authors: Qishuai Wen, Chun-Guang Li
Abstract: State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that are used to extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justifications or interpretations, thus hindering potentially principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between the Transformer decoders and Principal Component Analysis (PCA). From such a perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segemenTation (DEPICT), with the interpretations as follows: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most information; 2) the cross-attention operator seeks to find a low-rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace and corresponds to the predefined classes; 3) the dot-product operation yields compact representation for image embeddings as segmentation masks. Experiments conducted on dataset ADE20K find that DEPICT consistently outperforms its black-box counterpart, Segmenter, and it is light weight and more robust.
Authors: Hanyu Meng, Jeroen Breebaart, Jeremy Stoddard, Vidhyasaharan Sethu, Eliathamby Ambikairajah
Abstract: Estimating frequency-varying acoustic parameters is essential for enhancing immersive perception in realistic spatial audio creation. In this paper, we propose a unified framework that blindly estimates reverberation time (T60), direct-to-reverberant ratio (DRR), and clarity (C50) across 10 frequency bands using first-order Ambisonics (FOA) speech recordings as inputs. The proposed framework utilizes a novel feature named Spectro-Spatial Covariance Vector (SSCV), efficiently representing temporal, spectral as well as spatial information of the FOA signal. Our models significantly outperform existing single-channel methods with only spectral information, reducing estimation errors by more than half for all three acoustic parameters. Additionally, we introduce FOA-Conv3D, a novel back-end network for effectively utilising the SSCV feature with a 3D convolutional encoder. FOA-Conv3D outperforms the convolutional neural network (CNN) and recurrent convolutional neural network (CRNN) backends, achieving lower estimation errors and accounting for a higher proportion of variance (PoV) for all 3 acoustic parameters.
Authors: Aviv Netanyahu, Yilun Du, Antonia Bronars, Jyothish Pari, Joshua Tenenbaum, Tianmin Shu, Pulkit Agrawal
Abstract: Learning the intents of an agent, defined by its goals or motion style, is often extremely challenging from just a few examples. We refer to this problem as task concept learning and present our approach, Few-Shot Task Learning through Inverse Generative Modeling (FTL-IGM), which learns new task concepts by leveraging invertible neural generative models. The core idea is to pretrain a generative model on a set of basic concepts and their demonstrations. Then, given a few demonstrations of a new concept (such as a new goal or a new action), our method learns the underlying concepts through backpropagation without updating the model weights, thanks to the invertibility of the generative model. We evaluate our method in five domains -- object rearrangement, goal-oriented navigation, motion caption of human actions, autonomous driving, and real-world table-top manipulation. Our experimental results demonstrate that via the pretrained generative model, we successfully learn novel concepts and generate agent plans or motion corresponding to these concepts in (1) unseen environments and (2) in composition with training concepts.
Authors: Jilong Wang, Javokhirbek Rajabov, Chaoyi Xu, Yiming Zheng, He Wang
Abstract: Legged robots with advanced manipulation capabilities have the potential to significantly improve household duties and urban maintenance. Despite considerable progress in developing robust locomotion and precise manipulation methods, seamlessly integrating these into cohesive whole-body control for real-world applications remains challenging. In this paper, we present a modular framework for robust and generalizable whole-body loco-manipulation controller based on a single arm-mounted camera. By using reinforcement learning (RL), we enable a robust low-level policy for command execution over 5 dimensions (5D) and a grasp-aware high-level policy guided by a novel metric, Generalized Oriented Reachability Map (GORM). The proposed system achieves state-of-the-art one-time grasping accuracy of 89% in the real world, including challenging tasks such as grasping transparent objects. Through extensive simulations and real-world experiments, we demonstrate that our system can effectively manage a large workspace, from floor level to above body height, and perform diverse whole-body loco-manipulation tasks.
Authors: Zhangchi Zhu, Wei Zhang
Abstract: In this paper, we analyze the feature-based knowledge distillation for recommendation from the frequency perspective. By defining knowledge as different frequency components of the features, we theoretically demonstrate that regular feature-based knowledge distillation is equivalent to equally minimizing losses on all knowledge and further analyze how this equal loss weight allocation method leads to important knowledge being overlooked. In light of this, we propose to emphasize important knowledge by redistributing knowledge weights. Furthermore, we propose FreqD, a lightweight knowledge reweighting method, to avoid the computational cost of calculating losses on each knowledge. Extensive experiments demonstrate that FreqD consistently and significantly outperforms state-of-the-art knowledge distillation methods for recommender systems. Our code is available at https://github.com/woriazzc/KDs.
Authors: Ryan G. Dempsey, Jonathan Ethier, Halim Yanikomeroglu
Abstract: Radio deployments and spectrum planning benefit from path loss predictions. Obstructions along a communications link are often considered implicitly or through derived metrics such as representative clutter height or total obstruction depth. In this paper, we propose a path-specific path loss prediction method that uses convolutional neural networks to automatically perform feature extraction from high-resolution obstruction height maps. Our methods result in low prediction error in a variety of environments without requiring derived metrics.
Authors: Navid Salami Pargoo, Mahshid Ghasemi, Shuren Xia, Mehmet Kerem Turkcan, Taqiya Ehsan, Chengbo Zang, Yuan Sun, Javad Ghaderi, Gil Zussman, Zoran Kostic, Jorge Ortiz
Abstract: As urban populations grow, cities are becoming more complex, driving the deployment of interconnected sensing systems to realize the vision of smart cities. These systems aim to improve safety, mobility, and quality of life through applications that integrate diverse sensors with real-time decision-making. Streetscape applications-focusing on challenges like pedestrian safety and adaptive traffic management-depend on managing distributed, heterogeneous sensor data, aligning information across time and space, and enabling real-time processing. These tasks are inherently complex and often difficult to scale. The Streetscape Application Services Stack (SASS) addresses these challenges with three core services: multimodal data synchronization, spatiotemporal data fusion, and distributed edge computing. By structuring these capabilities as clear, composable abstractions with clear semantics, SASS allows developers to scale streetscape applications efficiently while minimizing the complexity of multimodal integration. We evaluated SASS in two real-world testbed environments: a controlled parking lot and an urban intersection in a major U.S. city. These testbeds allowed us to test SASS under diverse conditions, demonstrating its practical applicability. The Multimodal Data Synchronization service reduced temporal misalignment errors by 88%, achieving synchronization accuracy within 50 milliseconds. Spatiotemporal Data Fusion service improved detection accuracy for pedestrians and vehicles by over 10%, leveraging multicamera integration. The Distributed Edge Computing service increased system throughput by more than an order of magnitude. Together, these results show how SASS provides the abstractions and performance needed to support real-time, scalable urban applications, bridging the gap between sensing infrastructure and actionable streetscape intelligence.
Authors: Zicheng Lin, Tian Liang, Jiahao Xu, Qiuzhi Lin, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, Zhaopeng Tu
Abstract: Mathematical reasoning tasks pose significant challenges for large language models (LLMs) because they require precise logical deduction and sequence analysis. In this work, we introduce the concept of critical tokens -- elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling and demonstrate their substantial divergence from traditional error tokens. Through extensive experiments on datasets such as GSM8K and MATH500, we show that identifying and replacing critical tokens significantly improves model accuracy. We propose an efficient methodology for pinpointing these tokens in large-scale datasets using contrastive estimation and extend this framework to enhance model training processes with direct preference optimization (DPO). Experimental results on GSM8K and MATH500 benchmarks with the widely used models Llama-3 (8B and 70B) and Deepseek-math (7B) demonstrate the effectiveness of the proposed approach, cDPO. Our results underscore the potential of leveraging critical tokens to reduce errors in reasoning tasks, advancing the development of AI systems capable of robust logical deduction. Our code, annotated datasets, and trained models are available at https://github.com/chenzhiling9954/Critical-Tokens-Matter to support and encourage future research in this promising field.
URLs: https://github.com/chenzhiling9954/Critical-Tokens-Matter
Authors: Kazuki Watanabe, Noboru Isobe
Abstract: Sinkhorn algorithm is the de-facto standard approximation algorithm for optimal transport, which has been applied to a variety of applications, including image processing and natural language processing. In theory, the proof of its convergence follows from the convergence of the Sinkhorn--Knopp algorithm for the matrix scaling problem, and Altschuler et al. show that its worst-case time complexity is in near-linear time. Very recently, sequentially composed optimal transports were proposed by Watanabe and Isobe as a hierarchical extension of optimal transports. In this paper, we present an efficient approximation algorithm, namely Sinkhorn algorithm for sequentially composed optimal transports, for its entropic regularization. Furthermore, we present a theoretical analysis of the Sinkhorn algorithm, namely (i) its exponential convergence to the optimal solution with respect to the Hilbert pseudometric, and (ii) a worst-case complexity analysis for the case of one sequential composition.
Authors: Wan He, Tina Eliassi-Rad, Samuel V. Scarpino
Abstract: Classifying genome sequences based on metadata has been an active area of research in comparative genomics for decades with many important applications across the life sciences. Established methods for classifying genomes can be broadly grouped into sequence alignment-based and alignment-free models. Conventional alignment-based models rely on genome similarity measures calculated based on local sequence alignments or consistent ordering among sequences. However, such methods are computationally expensive when dealing with large ensembles of even moderately sized genomes. In contrast, alignment-free (AF) approaches measure genome similarity based on summary statistics in an unsupervised setting and are efficient enough to analyze large datasets. However, both alignment-based and AF methods typically assume fixed scoring rubrics that lack the flexibility to assign varying importance to different parts of the sequences based on prior knowledge. In this study, we integrate AI and network science approaches to develop a comparative genomic analysis framework that addresses these limitations. Our approach, termed the Genome Misclassification Network Analysis (GMNA), simultaneously leverages misclassified instances, a learned scoring rubric, and label information to classify genomes based on associated metadata and better understand potential drivers of misclassification. We evaluate the utility of the GMNA using Naive Bayes and convolutional neural network models, supplemented by additional experiments with transformer-based models, to construct SARS-CoV-2 sampling location classifiers using over 500,000 viral genome sequences and study the resulting network of misclassifications. We demonstrate the global health potential of the GMNA by leveraging the SARS-CoV-2 genome misclassification networks to investigate the role human mobility played in structuring geographic clustering of SARS-CoV-2.
Authors: Xin Zhao, Xiaojun Chen, Yuexin Xuan, Zhendong Zhao, Xiaojun Jia, Xinfeng Li, Xiaofeng Wang
Abstract: The rise of deep learning models in the digital era has raised substantial concerns regarding the generation of Not-Safe-for-Work (NSFW) content. Existing defense methods primarily involve model fine-tuning and post-hoc content moderation. Nevertheless, these approaches largely lack scalability in eliminating harmful content, degrade the quality of benign image generation, or incur high inference costs. To address these challenges, we propose an innovative framework named \textit{Buster}, which injects backdoors into the text encoder to prevent NSFW content generation. Buster leverages deep semantic information rather than explicit prompts as triggers, redirecting NSFW prompts towards targeted benign prompts. Additionally, Buster employs energy-based training data generation through Langevin dynamics for adversarial knowledge augmentation, thereby ensuring robustness in harmful concept definition. This approach demonstrates exceptional resilience and scalability in mitigating NSFW content. Particularly, Buster fine-tunes the text encoder of Text-to-Image models within merely five minutes, showcasing its efficiency. Our extensive experiments denote that Buster outperforms nine state-of-the-art baselines, achieving a superior NSFW content removal rate of at least 91.2\% while preserving the quality of harmless images.
Authors: Patrick Styll, Dowon Kim, Jiook Cha
Abstract: Brain development in the first few months of human life is a critical phase characterized by rapid structural growth and functional organization. Accurately predicting developmental outcomes during this time is crucial for identifying delays and enabling timely interventions. This study introduces the SwiFT (Swin 4D fMRI Transformer) model, designed to predict Bayley-III composite scores using neonatal fMRI data from the Developing Human Connectome Project (dHCP). To enhance predictive accuracy, we apply dimensionality reduction via group independent component analysis (ICA) and pretrain SwiFT on large adult fMRI datasets to address the challenges of limited neonatal data. Our analysis shows that SwiFT significantly outperforms baseline models in predicting cognitive, motor, and language outcomes, leveraging both single-label and multi-label prediction strategies. The model's attention-based architecture processes spatiotemporal data end-to-end, delivering superior predictive performance. Additionally, we use Integrated Gradients with Smoothgrad sQuare (IG-SQ) to interpret predictions, identifying neural spatial representations linked to early cognitive and behavioral development. These findings underscore the potential of Transformer models to advance neurodevelopmental research and clinical practice.
Authors: Pablo Morales-\'Alvarez, Stergios Christodoulidis, Maria Vakalopoulou, Pablo Piantanida, Jose Dolz
Abstract: The emergence of large pre-trained vision-language models (VLMs) represents a paradigm shift in machine learning, with unprecedented results in a broad span of visual recognition tasks. CLIP, one of the most popular VLMs, has exhibited remarkable zero-shot and transfer learning capabilities in classification. To transfer CLIP to downstream tasks, adapters constitute a parameter-efficient approach that avoids backpropagation through the large model (unlike related prompt learning methods). However, CLIP adapters have been developed to target discriminative performance, and the quality of their uncertainty estimates has been overlooked. In this work we show that the discriminative performance of state-of-the-art CLIP adapters does not always correlate with their uncertainty estimation capabilities, which are essential for a safe deployment in real-world scenarios. We also demonstrate that one of such adapters is obtained through MAP inference from a more general probabilistic framework. Based on this observation we introduce BayesAdapter, which leverages Bayesian inference to estimate a full probability distribution instead of a single point, better capturing the variability inherent in the parameter space. In a comprehensive empirical evaluation we show that our approach obtains high quality uncertainty estimates in the predictions, standing out in calibration and selective classification. Our code will be publicly available upon acceptance of the paper.
Authors: Purba Mukherjee, Anjan A. Sen
Abstract: Here we explore certain subtle features imprinted in data from the completed Sloan Digital Sky Survey IV (SDSS-IV) extended Baryon Oscillation Spectroscopic Survey (eBOSS) as a combined probe for the background and perturbed Universe. We reconstruct the baryon Acoustic Oscillation (BAO) and Redshift Space Distortion (RSD) observables as functions of redshift, using measurements from SDSS alone. We apply the Multi-Task Gaussian Process (MTGP) framework to model the interdependencies of cosmological observables $D_M(z)/r_d$, $D_H(z)/r_d$, and $f\sigma_8(z)$, and track their evolution across different redshifts. Subsequently, we obtain constrained three-dimensional phase space containing $D_M(z)/r_d$, $D_H(z)/r_d$, and $f\sigma_8(z)$ at different redshifts probed by the SDSS-IV eBOSS survey. Furthermore, assuming the $\Lambda$CDM model, we obtain constraints on model parameters $\Omega_{m}$, $H_{0}r_{d}$, $\sigma_{8}$ and $S_{8}$ at each redshift probed by SDSS-IV eBOSS. This indicates redshift-dependent trends in $H_0$, $\Omega_m$, $\sigma_8$ and $S_8$ in the $\Lambda$CDM model, suggesting a possible inconsistency in the $\Lambda$CDM model. Ours is a template for model-independent extraction of information for both background and perturbed Universe using a single galaxy survey taking into account all the existing correlations between background and perturbed observables and this can be easily extended to future DESI-3YR as well as Euclid results.
Authors: Xinchen Zhang, Running Zhao, Zhihan Jiang, Handi Chen, Yulong Ding, Edith C. H. Ngai, Shuang-Hua Yang
Abstract: Intrusion Detection Systems (IDS) are crucial for safeguarding digital infrastructure. In dynamic network environments, both threat landscapes and normal operational behaviors are constantly changing, resulting in concept drift. While continuous learning mitigates the adverse effects of concept drift, insufficient attention to drift patterns and excessive preservation of outdated knowledge can still hinder the IDS's adaptability. In this paper, we propose SSF (Strategic Selection and Forgetting), a novel continual learning method for IDS, providing continuous model updates with a constantly refreshed memory buffer. Our approach features a strategic sample selection algorithm to select representative new samples and a strategic forgetting mechanism to drop outdated samples. The proposed strategic sample selection algorithm prioritizes new samples that cause the `drifted' pattern, enabling the model to better understand the evolving landscape. Additionally, we introduce strategic forgetting upon detecting significant drift by discarding outdated samples to free up memory, allowing the incorporation of more recent data. SSF captures evolving patterns effectively and ensures the model is aligned with the change of data patterns, significantly enhancing the IDS's adaptability to concept drift. The state-of-the-art performance of SSF on NSL-KDD and UNSW-NB15 datasets demonstrates its superior adaptability to concept drift for network intrusion detection.
Authors: Krai Cheamsawat, Thiparat Chotibut
Abstract: Quantum reservoir computing (QRC) has emerged as a promising paradigm for harnessing near-term quantum devices to tackle temporal machine learning tasks. Yet identifying the mechanisms that underlie enhanced performance remains challenging, particularly in many-body open systems where nonlinear interactions and dissipation intertwine in complex ways. Here, we investigate a minimal model of a driven-dissipative quantum reservoir described by two coupled Kerr-nonlinear oscillators, an experimentally realizable platform that features controllable coupling, intrinsic nonlinearity, and tunable photon loss. Using Partial Information Decomposition (PID), we examine how different dynamical regimes encode input drive signals in terms of redundancy (information shared by each oscillator) and synergy (information accessible only through their joint observation). Our key results show that, near a critical point marking a dynamical bifurcation, the system transitions from predominantly redundant to synergistic encoding. We further demonstrate that synergy amplifies short-term responsiveness, thereby enhancing immediate memory retention, whereas strong dissipation leads to more redundant encoding that supports long-term memory retention. These findings elucidate how the interplay of instability and dissipation shapes information processing in small quantum systems, providing a fine-grained, information-theoretic perspective for analyzing and designing QRC platforms.
Authors: Oudom Hean, Utsha Saha, Binita Saha
Abstract: In recent years, Large Language Models (LLMs) have emerged as a transformative development in artificial intelligence (AI), drawing significant attention from industry and academia. Trained on vast datasets, these sophisticated AI systems exhibit impressive natural language processing and content generation capabilities. This paper explores the potential of LLMs to address key challenges in personal finance, focusing on the United States. We evaluate several leading LLMs, including OpenAI's ChatGPT, Google's Gemini, Anthropic's Claude, and Meta's Llama, to assess their effectiveness in providing accurate financial advice on topics such as mortgages, taxes, loans, and investments. Our findings show that while these models achieve an average accuracy rate of approximately 70%, they also display notable limitations in certain areas. Specifically, LLMs struggle to provide accurate responses for complex financial queries, with performance varying significantly across different topics. Despite these limitations, the analysis reveals notable improvements in newer versions of these models, highlighting their growing utility for individuals and financial advisors. As these AI systems continue to evolve, their potential for advancing AI-driven applications in personal finance becomes increasingly promising.
Authors: Wenkun He, Yun Liu, Ruitao Liu, Li Yi
Abstract: Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. This complexity introduces significant challenges in synchronizing motions due to the high correlations and mutual influences among bodies. To address these challenges, we introduce SyncDiff, a novel method for multi-body interaction synthesis using a synchronized motion diffusion strategy. SyncDiff employs a single diffusion model to capture the joint distribution of multi-body motions. To enhance motion fidelity, we propose a frequency-domain motion decomposition scheme. Additionally, we introduce a new set of alignment scores to emphasize the synchronization of different body motions. SyncDiff jointly optimizes both data sample likelihood and alignment likelihood through an explicit synchronization strategy. Extensive experiments across four datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.
Authors: Yijia Xiao, Edward Sun, Di Luo, Wei Wang
Abstract: Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs). In finance, efforts have largely focused on single-agent systems handling specific tasks or multi-agent frameworks independently gathering data. However, multi-agent systems' potential to replicate real-world trading firms' collaborative dynamics remains underexplored. TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance. Detailed architecture and extensive experiments reveal its superiority over baseline models, with notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown, highlighting the potential of multi-agent LLM frameworks in financial trading. More details on TradingAgents are available at https://TradingAgents-AI.github.io.
Authors: Yihan He, Yuan Cao, Hong-Yu Chen, Dennis Wu, Jianqing Fan, Han Liu
Abstract: Transformers demonstrate significant advantages as the building block of modern LLMs. In this work, we study the capacities of Transformers in performing unsupervised learning. We show that multi-layered Transformers, given a sufficiently large set of pre-training instances, are able to learn the algorithms themselves and perform statistical estimation tasks given new instances. This learning paradigm is distinct from the in-context learning setup and is similar to the learning procedure of human brains where skills are learned through past experience. Theoretically, we prove that pre-trained Transformers can learn the spectral methods and use the classification of bi-class Gaussian mixture model as an example. Our proof is constructive using algorithmic design techniques. Our results are built upon the similarities of multi-layered Transformer architecture with the iterative recovery algorithms used in practice. Empirically, we verify the strong capacity of the multi-layered (pre-trained) Transformer on unsupervised learning through the lens of both the PCA and the Clustering tasks performed on the synthetic and real-world datasets.
Authors: Ying Chen, Rami Al-Maskari, Izabela Horvath, Mayar Ali, Luciano Hoher, Kaiyuan Yang, Zengming Lin, Zhiwei Zhai, Mengzhe Shen, Dejin Xun, Yi Wang, Tony Xu, Maged Goubran, Yunheng Wu, Kensaku Mori, Johannes C. Paetzold, Ali Erturk
Abstract: Recent innovations in light sheet microscopy, paired with developments in tissue clearing techniques, enable the 3D imaging of large mammalian tissues with cellular resolution. Combined with the progress in large-scale data analysis, driven by deep learning, these innovations empower researchers to rapidly investigate the morphological and functional properties of diverse biological samples. Segmentation, a crucial preliminary step in the analysis process, can be automated using domain-specific deep learning models with expert-level performance. However, these models exhibit high sensitivity to domain shifts, leading to a significant drop in accuracy when applied to data outside their training distribution. To address this limitation, and inspired by the recent success of self-supervised learning in training generalizable models, we organized the SELMA3D Challenge during the MICCAI 2024 conference. SELMA3D provides a vast collection of light-sheet images from cleared mice and human brains, comprising 35 large 3D images-each with over 1000^3 voxels-and 315 annotated small patches for finetuning, preliminary testing and final testing. The dataset encompasses diverse biological structures, including vessel-like and spot-like structures. Five teams participated in all phases of the challenge, and their proposed methods are reviewed in this paper. Quantitative and qualitative results from most participating teams demonstrate that self-supervised learning on large datasets improves segmentation model performance and generalization. We will continue to support and extend SELMA3D as an inaugural MICCAI challenge focused on self-supervised learning for 3D microscopy image segmentation.
Authors: Hounsu Kim, Taegyun Kwon, Juhan Nam
Abstract: Diffusion models have been widely used in the generative domain due to their convincing performance in modeling complex data distributions. Moreover, they have shown competitive results on discriminative tasks, such as image segmentation. While diffusion models have also been explored for automatic music transcription, their performance has yet to reach a competitive level. In this paper, we focus on discrete diffusion model's refinement capabilities and present a novel architecture for piano transcription. Our model utilizes Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the finetuned features of a pretrained acoustic model. To further enhance refinement, we devise a novel strategy which applies distinct transition states during training and inference stage of discrete diffusion models. Experiments on the MAESTRO dataset show that our approach outperforms previous diffusion-based piano transcription models and the baseline model in terms of F1 score. Our code is available in https://github.com/hanshounsu/d3rm.
Authors: Yuxuan Li, Cheng Zhang, Wen Wang, Yongming Huang
Abstract: Radio map, or pathloss map prediction, is a crucial method for wireless network modeling and management. By leveraging deep learning to construct pathloss patterns from geographical maps, an accurate digital replica of the transmission environment could be established with less computational overhead and lower prediction error compared to traditional model-driven techniques. While existing state-of-the-art (SOTA) methods predominantly rely on convolutional architectures, this paper introduces a hybrid transformer-convolution model, termed RMTransformer, to enhance the accuracy of radio map prediction. The proposed model features a multi-scale transformer-based encoder for efficient feature extraction and a convolution-based decoder for precise pixel-level image reconstruction. Simulation results demonstrate that the proposed scheme significantly improves prediction accuracy, and over a 30% reduction in root mean square error (RMSE) is achieved compared to typical SOTA approaches.
Authors: Tianle Tao, Shizhao Peng, Tianyu Mei, Shoumo Li, Haogang Zhu
Abstract: Accurate nonlinear computation is a key challenge in privacy-preserving machine learning (PPML). Most existing frameworks approximate it through linear operations, resulting in significant precision loss. This paper proposes an efficient, verifiable and accurate security 2-party logistic regression framework (EVA-S2PLoR), which achieves accurate nonlinear function computation through a novel secure element-wise multiplication protocol and its derived protocols. Our framework primarily includes secure 2-party vector element-wise multiplication, addition to multiplication, reciprocal, and sigmoid function based on data disguising technology, where high efficiency and accuracy are guaranteed by the simple computation flow based on the real number domain and the few number of fixed communication rounds. We provide secure and robust anomaly detection through dimension transformation and Monte Carlo methods. EVA-S2PLoR outperforms many advanced frameworks in terms of precision (improving the performance of the sigmoid function by about 10 orders of magnitude compared to most frameworks) and delivers the best overall performance in secure logistic regression experiments.
Authors: Ludwic Leonard, Nils Thuerey, Ruediger Westermann
Abstract: We introduce a single-view reconstruction technique of volumetric fields in which multiple light scattering effects are omnipresent, such as in clouds. We model the unknown distribution of volumetric fields using an unconditional diffusion model trained on a novel benchmark dataset comprising 1,000 synthetically simulated volumetric density fields. The neural diffusion model is trained on the latent codes of a novel, diffusion-friendly, monoplanar representation. The generative model is used to incorporate a tailored parametric diffusion posterior sampling technique into different reconstruction tasks. A physically-based differentiable volume renderer is employed to provide gradients with respect to light transport in the latent space. This stands in contrast to classic NeRF approaches and makes the reconstructions better aligned with observed data. Through various experiments, we demonstrate single-view reconstruction of volumetric clouds at a previously unattainable quality.