Authors: Yair Bleiberg, Michael Werman
Abstract: Coordinate-based Multi-Layer Perceptrons (MLPs) are known to have difficulty reconstructing the high frequencies of the training data. A common solution to this problem is Positional Encoding (PE), which has become quite popular. However, PE has drawbacks. It introduces high-frequency artifacts and adds another hyperparameter, just like batch normalization and dropout do. We believe that under certain circumstances PE is not necessary, and that a smarter construction of the network architecture together with a smart training method is sufficient to achieve similar results. In this paper, we show that very simple MLPs can quite easily output a frequency when given the half-frequency and quarter-frequency as input. Using this, we design a network architecture in blocks, where the input to each block is the output of the two previous blocks along with the original input. We call this a {\it Fibonacci Network}. By training each block on the corresponding frequencies of the signal, we show that Fibonacci Networks can reconstruct arbitrarily high frequencies.
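A minimal PyTorch sketch of the block wiring described above; the widths, depth, and activations are placeholder assumptions, and the per-block frequency-matched training the paper uses is not shown:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class FibonacciNetwork(nn.Module):
    """Each block sees the outputs of the two previous blocks plus the raw input."""
    def __init__(self, input_dim=1, feat=32, n_blocks=5):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(n_blocks):
            prev = min(i, 2) * feat          # 0, 1, or 2 previous block outputs
            self.blocks.append(Block(input_dim + prev, 64, feat))
        self.head = nn.Linear(feat, 1)

    def forward(self, x):
        outs = []
        for block in self.blocks:
            inp = torch.cat([x] + outs[-2:], dim=-1)
            outs.append(block(inp))
        return self.head(outs[-1])

model = FibonacciNetwork()
y = model(torch.rand(8, 1))  # coordinates in, signal values out
```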
Authors: Xin Zhong, Agnibh Dasgupta, Abdullah Tanvir
Abstract: This paper presents a novel framework for watermarking language models through prompts generated by language models. The proposed approach utilizes a multi-model setup, incorporating a Prompting language model to generate watermarking instructions, a Marking language model to embed watermarks within generated content, and a Detecting language model to verify the presence of these watermarks. Experiments are conducted using ChatGPT and Mistral as the Prompting and Marking language models, with detection accuracy evaluated using a pretrained classifier model. Results demonstrate that the proposed framework achieves high classification accuracy across various configurations, with 95% accuracy for ChatGPT and 88.79% for Mistral. These findings validate the effectiveness and adaptability of the proposed watermarking strategy across different language model architectures. Hence, the proposed framework holds promise for applications in content attribution, copyright protection, and model authentication.
Authors: Eduardo Sany Laber, Miguel Bastista
Abstract: Average-link is widely recognized as one of the most popular and effective methods for hierarchical agglomerative clustering. The available theoretical analyses show that this method has a much better approximation guarantee than other popular heuristics, such as single-linkage and complete-linkage, regarding variants of Dasgupta's cost function [STOC 2016]. However, these analyses do not separate average-link from a random hierarchy, and they are not appealing for metric spaces since every hierarchical clustering has a 1/2 approximation with regard to the variant of Dasgupta's function that is employed for dissimilarity measures [Moseley and Yang 2020]. In this paper, we present a comprehensive study of the performance of average-link in metric spaces, regarding several natural criteria that capture separability and cohesion and are more interpretable than Dasgupta's cost function and its variants. We also present experimental results with real datasets that, together with our theoretical analyses, suggest that average-link is a better choice than other related methods when both cohesion and separability are important goals.
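For reference, a naive O(n^3) sketch of the average-link merge rule itself (mean pairwise dissimilarity between clusters); this illustrates the heuristic under study, not the paper's analysis:

```python
import numpy as np

def average_link(D):
    """D: symmetric dissimilarity matrix. Returns the merge sequence."""
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average-link distance: mean pairwise dissimilarity
                d = np.mean([D[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

pts = np.random.rand(6, 2)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)  # metric dissimilarities
for left, right, d in average_link(D):
    print(left, "+", right, f"at {d:.3f}")
```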
Authors: Victor M. Tenorio, Antonio G. Marques
Abstract: Graph neural networks (GNNs) have emerged as a promising solution to deal with unstructured data, outperforming traditional deep learning architectures. However, most of the current GNN models are designed to work with a single graph, which limits their applicability in many real-world scenarios where multiple graphs may be involved. To address this limitation, we propose a novel graph-based deep learning architecture to handle tasks where two sets of signals exist, each defined on a different graph. First, we consider the setting where the input is represented as a signal on top of one graph (input graph) and the output is a graph signal defined over a different graph (output graph). For this setup, we propose a three-block architecture where we first process the input data using a GNN that operates over the input graph, then apply a transformation function that operates in a latent space and maps the signals from the input to the output graph, and finally implement a second GNN that operates over the output graph. Our goal is not to propose a single specific definition for each of the three blocks, but rather to provide a flexible approach to solve tasks involving data defined on two graphs. The second part of the paper addresses a self-supervised setup, where the focus is not on the output space but on the underlying latent space and, inspired by Canonical Correlation Analysis, we seek informative representations of the data that can be leveraged to solve a downstream task. By leveraging information from multiple graphs, the proposed architecture can capture more intricate relationships between different entities in the data. We test this in several experimental setups using synthetic and real-world datasets, and observe that the proposed architecture works better than traditional deep learning architectures, showcasing the importance of leveraging the information of the two graphs.
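A minimal sketch of the three-block pipeline under stated assumptions: the single-layer graph convolutions and the linear latent map below are placeholders, since the paper deliberately leaves each block's definition flexible.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, A, X):            # A: (n, n) adjacency, X: (n, f) signal
        return torch.relu(self.lin(A @ X))

class TwoGraphNet(nn.Module):
    def __init__(self, f_in, f_lat, f_out, n_in, n_out):
        super().__init__()
        self.gnn_in = GraphConv(f_in, f_lat)      # block 1: input graph
        self.latent_map = nn.Linear(n_in, n_out)  # block 2: latent-space mapping
        self.gnn_out = GraphConv(f_lat, f_out)    # block 3: output graph

    def forward(self, A_in, X, A_out):
        Z = self.gnn_in(A_in, X)                  # (n_in, f_lat)
        Z = self.latent_map(Z.T).T                # (n_out, f_lat)
        return self.gnn_out(A_out, Z)             # (n_out, f_out)

A_in, A_out = torch.rand(10, 10), torch.rand(7, 7)
y = TwoGraphNet(3, 16, 2, 10, 7)(A_in, torch.rand(10, 3), A_out)
```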
Authors: Prakash Chourasia, Heramb Lonkar, Sarwan Ali, Murray Patterson
Abstract: Advancements in genomics technology lead to a rising volume of viral (e.g., SARS-CoV-2) sequence data, resulting in increased usage of machine learning (ML) in bioinformatics. Traditional ML techniques require centralized data collection and processing, posing challenges in realistic healthcare scenarios. Additionally, privacy, ownership, and stringent regulation issues exist when pooling medical data into centralized storage to train a powerful deep learning (DL) model. The federated learning (FL) approach overcomes such issues by setting up a central aggregator server and a shared global model. It also facilitates data privacy by extracting knowledge while keeping the actual data private. This work proposes a cutting-edge Privacy enhancement through Iterative Collaboration (EPIC) architecture. The network is divided and distributed between local and centralized servers. We demonstrate the EPIC approach on a supervised classification problem: estimating the lineage of SARS-CoV-2 genomic sequences without explicitly transferring raw sequence data. We aim to create a universal decentralized optimization framework that allows various data holders to work together and converge to a single predictive model. The findings demonstrate that privacy-preserving strategies can be successfully used with aggregation approaches without materially altering the degree of learning convergence. Finally, we highlight a few potential issues and prospects for study in FL-based approaches to healthcare applications.
Authors: Prakash Chourasia, Tamkanat E Ali, Sarwan Ali, Murray Patterson
Abstract: Federated Learning (FL) is a distributed learning technique that maintains data privacy by providing a decentralized training method for machine learning models using distributed big data. This promising federated learning approach has also gained popularity in bioinformatics, where the privacy of biomedical data holds immense importance, especially when patient data is involved. Despite the successful implementation of federated learning in biological sequence analysis, rigorous consideration is still required to improve accuracy in a way that does not compromise data privacy. Additionally, the optimal integration of federated learning, especially in protein sequence analysis, has not been fully explored. We propose a deep feed-forward neural network-based enhanced federated learning method for protein sequence classification to overcome these challenges. Our method introduces novel enhancements to improve classification accuracy. We introduce dynamic weighted federated learning (DWFL), a federated learning-based approach in which local model weights are adjusted using weighted averaging based on their performance metrics. By assigning higher weights to well-performing models, we aim to create a more potent initial global model for the federated learning process, leading to improved accuracy. We conduct experiments using real-world protein sequence datasets to assess the effectiveness of DWFL. The results obtained using our proposed approach demonstrate significant improvements in model accuracy, making federated learning a preferred, more robust, and privacy-preserving approach for collaborative machine-learning tasks.
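A minimal sketch of performance-weighted aggregation in the spirit of DWFL; weighting client models in proportion to a validation metric is an assumption, as the paper's exact weighting formula is not reproduced here:

```python
import torch
import torch.nn as nn

def weighted_average(client_states, client_accs):
    """Average client model weights, weighting each client by its accuracy."""
    w = torch.tensor(client_accs, dtype=torch.float32)
    w = w / w.sum()                                   # normalize weights to sum to 1
    global_state = {}
    for key in client_states[0]:
        stacked = torch.stack([s[key].float() for s in client_states])
        shape = (-1,) + (1,) * (stacked.dim() - 1)    # broadcast weights per tensor
        global_state[key] = (w.view(shape) * stacked).sum(dim=0)
    return global_state

# Usage: three clients with validation accuracies 0.9, 0.7, 0.5
clients = [nn.Linear(4, 2).state_dict() for _ in range(3)]
global_state = weighted_average(clients, [0.9, 0.7, 0.5])
```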
Authors: Leo Benac, Abhishek Sharma, Sonali Parbhoo, Finale Doshi-Velez
Abstract: We consider the problem of estimating the transition dynamics $T^*$ from near-optimal expert trajectories in the context of offline model-based reinforcement learning. We develop a novel constraint-based method, Inverse Transition Learning, that treats the limited coverage of the expert trajectories as a \emph{feature}: we use the fact that the expert is near-optimal to inform our estimate of $T^*$. We integrate our constraints into a Bayesian approach. Across both synthetic environments and real healthcare scenarios like Intensive Care Unit (ICU) patient management in hypotension, we demonstrate not only significant improvements in decision-making, but that our posterior can inform when transfer will be successful.
Authors: Usman Anwar, Johannes Von Oswald, Louis Kirsch, David Krueger, Spencer Frei
Abstract: Transformers have demonstrated remarkable in-context learning capabilities across various domains, including statistical learning tasks. While previous work has shown that transformers can implement common learning algorithms, the adversarial robustness of these learned algorithms remains unexplored. This work investigates the vulnerability of in-context learning in transformers to \textit{hijacking attacks}, focusing on the setting of linear regression tasks. Hijacking attacks are prompt-manipulation attacks in which the adversary's goal is to manipulate the prompt to force the transformer to generate a specific output. We first prove that single-layer linear transformers, known to implement gradient descent in-context, are non-robust and can be manipulated to output arbitrary predictions by perturbing a single example in the in-context training set. While our experiments show these attacks succeed on linear transformers, we find they do not transfer to more complex transformers with GPT-2 architectures. Nonetheless, we show that these transformers can be hijacked using gradient-based adversarial attacks. We then demonstrate that adversarial training enhances transformers' robustness against hijacking attacks, even when just applied during finetuning. Additionally, we find that in some settings, adversarial training against a weaker attack model can lead to robustness to a stronger attack model. Lastly, we investigate the transferability of hijacking attacks across transformers of varying scales and initialization seeds, as well as between transformers and ordinary least squares (OLS). We find that while attacks transfer effectively between small-scale transformers, they show poor transferability in other scenarios (small-to-large scale, large-to-large scale, and between transformers and OLS).
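A hedged sketch of the single-example attack idea: gradient-descend on a perturbation of one in-context example until the prediction reaches an adversarial target. The model below is a stand-in linear map, not the paper's trained transformers.

```python
import torch

def hijack(model, prompt, target, idx=0, steps=300, lr=0.05):
    """Perturb the in-context example at `idx` to push the prediction to `target`."""
    delta = torch.zeros_like(prompt[idx], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        adv = prompt.clone()
        adv[idx] = prompt[idx] + delta        # only one example is modified
        loss = torch.nn.functional.mse_loss(model(adv), target)
        loss.backward()
        opt.step()
    return (prompt[idx] + delta).detach()

# Stand-in "transformer": a linear map over the flattened prompt.
model = torch.nn.Sequential(torch.nn.Flatten(start_dim=0), torch.nn.Linear(10 * 3, 1))
prompt = torch.rand(10, 3)                    # 9 (x, y) examples plus the query
adv_example = hijack(model, prompt, target=torch.tensor([10.0]))
```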
Authors: Joey Hong, Anca Dragan, Sergey Levine
Abstract: Value-based reinforcement learning (RL) can in principle learn effective policies for a wide range of multi-turn problems, from games to dialogue to robotic control, including via offline RL from static previously collected datasets. However, despite the widespread use of policy gradient methods to train large language models for single-turn tasks (e.g., question answering), value-based methods for multi-turn RL in an off-policy or offline setting have proven particularly challenging to scale to the setting of large language models. This setting requires effectively leveraging pretraining, scaling to large architectures with billions of parameters, and training on large datasets, all of which represent major challenges for current value-based RL methods. In this work, we propose a novel offline RL algorithm that addresses these drawbacks, casting Q-learning as a modified supervised fine-tuning (SFT) problem where the probabilities of tokens directly translate to Q-values. In this way, we obtain an algorithm that smoothly transitions from maximizing the likelihood of the data during pretraining to learning a near-optimal Q-function during finetuning. Our algorithm has strong theoretical foundations, enjoying performance bounds similar to state-of-the-art Q-learning methods, while in practice utilizing an objective that closely resembles SFT. Because of this, our approach can enjoy the full benefits of the pretraining of language models, without the need to reinitialize any weights before RL finetuning, and without the need to initialize new heads for predicting values or advantages. Empirically, we evaluate our method on both pretrained LLMs and VLMs, on a variety of tasks including both natural language dialogue and robotic manipulation and navigation from images.
Authors: Joey Hong, Jessica Lin, Anca Dragan, Sergey Levine
Abstract: Recent progress on large language models (LLMs) has enabled dialogue agents to generate highly naturalistic and plausible text. However, current LLM language generation focuses on responding accurately to questions and requests with a single effective response. In reality, many real dialogues are interactive, meaning an agent's utterances will influence their conversational partner, elicit information, or change their opinion. Accounting for how an agent can effectively steer a conversation is a crucial ability in many dialogue tasks, from healthcare to preference elicitation. Existing methods for fine-tuning dialogue agents to accomplish such tasks would rely on curating some amount of expert data. However, doing so often requires understanding the underlying cognitive processes of the conversational partner, which is a skill neither humans nor LLMs trained on human data can reliably do. Our key insight is that while LLMs may not be adept at identifying effective strategies for steering conversations a priori, or in the middle of an ongoing conversation, they can do so post-hoc, or in hindsight, after seeing how their conversational partner responds. We use this fact to rewrite and augment existing suboptimal data, and train via offline reinforcement learning (RL) an agent that outperforms both prompting and learning from unaltered human demonstrations. We apply our approach to two domains that require understanding human mental state, intelligent interaction, and persuasion: mental health support, and soliciting charitable donations. Our results in a user study with real humans show that our approach greatly outperforms existing state-of-the-art dialogue agents.
Authors: Siting Li, Pang Wei Koh, Simon Shaolei Du
Abstract: Recent research suggests that the failures of Vision-Language Models (VLMs) at visual reasoning often stem from erroneous agreements -- when semantically distinct images are ambiguously encoded by the CLIP image encoder into embeddings with high cosine similarity. In this paper, we show that erroneous agreements are not always the main culprit, as Multimodal Large Language Models (MLLMs) can still extract distinct information from them. For instance, when distinguishing objects on the left vs right in the What'sUp benchmark, the CLIP image embeddings of the left/right pairs have an average cosine similarity $>0.99$, and CLIP performs at random chance; but LLaVA-1.5-7B, which uses the same CLIP image encoder, achieves nearly $100\%$ accuracy. We find that the extractable information in CLIP image embeddings is likely obscured by CLIP's inadequate vision-language alignment: Its matching score learned by the contrastive objective might not capture all diverse image-text correspondences. We also study the MMVP benchmark, on which prior work has shown that LLaVA-1.5 cannot distinguish image pairs with high cosine similarity. We observe a performance gain brought by attending more to visual input through an alternative decoding algorithm. Further, the accuracy significantly increases if the model can take both images as input to emphasize their nuanced differences. Both findings indicate that LLaVA-1.5 did not utilize extracted visual information sufficiently. In conclusion, our findings suggest that while improving image encoders could benefit VLMs, there is still room to enhance models with a fixed image encoder by applying better strategies for extracting and utilizing visual information.
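A small illustration of the geometry behind the claim: two embeddings can be nearly parallel (cosine similarity close to 1) while still carrying distinct information. The vectors below are stand-ins for CLIP image embeddings, not outputs of the actual encoder.

```python
import torch
import torch.nn.functional as F

# Stand-ins for CLIP embeddings of a "left" image and a "right" image; real
# What'sUp pairs exhibit the near-parallel geometry (> 0.99) reported above.
emb_a = torch.randn(512)
emb_b = emb_a + 0.05 * torch.randn(512)        # small directional perturbation
sim = F.cosine_similarity(emb_a, emb_b, dim=0)
print(f"cosine similarity: {sim:.4f}")          # close to 1 despite distinct content
```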
Authors: Cheng Zhang, Hanna Foerster, Robert D. Mullins, Yiren Zhao, Ilia Shumailov
Abstract: It is now a common business practice to buy access to large language model (LLM) inference rather than self-host, because of significant upfront hardware infrastructure and energy costs. However, as a buyer, there is no mechanism to verify the authenticity of the advertised service, including the serving hardware platform, e.g. that it is actually being served using an NVIDIA H100. Furthermore, there are reports suggesting that model providers may deliver models that differ slightly from the advertised ones, often to make them run on less expensive hardware. That way, a client pays a premium for access to a capable model on more expensive hardware, yet ends up being served by a (potentially less capable) cheaper model on cheaper hardware. In this paper we introduce \textit{\textbf{hardware and software platform inference (HSPI)}} -- a method for identifying the underlying GPU architecture and software stack of a (black-box) machine learning model solely based on its input-output behavior. Our method leverages the inherent differences of various GPU architectures and compilers to distinguish between different GPU types and software stacks. By analyzing the numerical patterns in the model's outputs, we propose a classification framework capable of accurately identifying the GPU used for model inference as well as the underlying software configuration. Our findings demonstrate the feasibility of inferring GPU type from black-box models. We evaluate HSPI against models served on different real hardware and find that in a white-box setting we can distinguish between different GPUs with between $83.9\%$ and $100\%$ accuracy. Even in a black-box setting we are able to achieve results that are up to three times higher than random guess accuracy.
Authors: Raef Bassily, Crist\'obal Guzm\'an, Michael Menart
Abstract: In this work, we conduct a systematic study of stochastic saddle point problems (SSP) and stochastic variational inequalities (SVI) under the constraint of $(\epsilon,\delta)$-differential privacy (DP) in both Euclidean and non-Euclidean setups. We first consider Lipschitz convex-concave SSPs in the $\ell_p/\ell_q$ setup, $p,q\in[1,2]$. Here, we obtain a bound of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon}\big)$ on the strong SP-gap, where $n$ is the number of samples and $d$ is the dimension. This rate is nearly optimal for any $p,q\in[1,2]$. Without additional assumptions, such as smoothness or linearity requirements, prior work under DP has only obtained this rate when $p=q=2$ (i.e., only in the Euclidean setup). Further, existing algorithms have each only been shown to work for specific settings of $p$ and $q$ and under certain assumptions on the loss and the feasible set, whereas we provide a general algorithm for DP SSPs whenever $p,q\in[1,2]$. Our result is obtained via a novel analysis of the recursive regularization algorithm. In particular, we develop new tools for analyzing generalization, which may be of independent interest. Next, we turn our attention towards SVIs with a monotone, bounded and Lipschitz operator and consider $\ell_p$-setups, $p\in[1,2]$. Here, we provide the first analysis which obtains a bound on the strong VI-gap of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon}\big)$. For $p-1=\Omega(1)$, this rate is near optimal due to existing lower bounds. To obtain this result, we develop a modified version of recursive regularization. Our analysis builds on the techniques we develop for SSPs as well as employing additional novel components which handle difficulties arising from adapting the recursive regularization framework to SVIs.
Authors: Ryan D'Orazio, Danilo Vucetic, Zichu Liu, Junhyung Lyle Kim, Ioannis Mitliagkas, Gauthier Gidel
Abstract: Deep learning has proven to be effective in a wide variety of loss minimization problems. However, many applications of interest, like minimizing projected Bellman error and min-max optimization, cannot be modelled as minimizing a scalar loss function but instead correspond to solving a variational inequality (VI) problem. This difference in setting has caused many practical challenges, as naive gradient-based approaches from supervised learning tend to diverge and cycle in the VI case. In this work, we propose a principled surrogate-based approach compatible with deep learning to solve VIs. We show that our surrogate-based approach has three main benefits: (1) under assumptions that are realistic in practice (hidden monotone structure, interpolation, and sufficient optimization of the surrogates), it guarantees convergence, (2) it provides a unifying perspective of existing methods, and (3) it is amenable to existing deep learning optimizers like ADAM. Experimentally, we demonstrate our surrogate-based approach is effective in min-max optimization and minimizing projected Bellman error. Furthermore, in the deep reinforcement learning case, we propose a novel variant of TD(0) which is more compute and sample efficient.
Authors: Debmalya Mandal, Goran Radanovic
Abstract: We study the setting of \emph{performative reinforcement learning} where the deployed policy affects both the reward and the transition of the underlying Markov decision process. Prior work~\parencite{MTR23} has addressed this problem under the tabular setting and established last-iterate convergence of repeated retraining with iteration complexity explicitly depending on the number of states. In this work, we generalize the results to \emph{linear Markov decision processes}, which are the primary theoretical model of large-scale MDPs. The main challenge with linear MDPs is that the regularized objective is no longer strongly convex, and we want a bound that scales with the dimension of the features rather than the number of states, which can be infinite. Our first result shows that repeatedly optimizing a regularized objective converges to a \emph{performatively stable policy}. In the absence of strong convexity, our analysis leverages a new recurrence relation that uses a specific linear combination of optimal dual solutions for proving convergence. We then tackle the finite sample setting where the learner has access to a set of trajectories drawn from the current policy. We consider a reparametrized version of the primal problem, and construct an empirical Lagrangian which is to be optimized from the samples. We show that, under a \emph{bounded coverage} condition, repeatedly solving a saddle point of this empirical Lagrangian converges to a performatively stable solution, and we also construct a primal-dual algorithm that solves the empirical Lagrangian efficiently. Finally, we show several applications of the general framework of performative RL, including multi-agent systems.
Authors: Inko Bovenzi, Adi Carmel, Michael Hu, Rebecca M. Hurwitz, Fiona McBride, Leo Benac, Jos\'e Roberto Tello Ayala, Finale Doshi-Velez
Abstract: Aiming to uncover insights into medical decision-making embedded within observational data from clinical settings, we present a novel application of Inverse Reinforcement Learning (IRL) that identifies suboptimal clinician actions based on the actions of their peers. This approach combines two stages of IRL with an intermediate step that prunes trajectories displaying behavior that deviates significantly from the consensus. This enables us to effectively identify clinical priorities and values from ICU data containing both optimal and suboptimal clinician decisions. We observe that the benefits of removing suboptimal actions vary by disease and differentially impact certain demographic groups.
Authors: Simon Wagner, Leif Seute, Vsevolod Viliuga, Nicolas Wolf, Frauke Gr\"ater, Jan St\"uhmer
Abstract: We introduce a generative model for protein backbone design utilizing geometric products and higher order message passing. In particular, we propose Clifford Frame Attention (CFA), an extension of the invariant point attention (IPA) architecture from AlphaFold2, in which the backbone residue frames and geometric features are represented in the projective geometric algebra. This enables the construction of geometrically expressive messages between residues, including higher order terms, using the bilinear operations of the algebra. We evaluate our architecture by incorporating it into the framework of FrameFlow, a state-of-the-art flow matching model for protein backbone generation. The proposed model achieves high designability, diversity and novelty, while also sampling protein backbones that follow the statistical distribution of secondary structure elements found in naturally occurring proteins, a property that many state-of-the-art generative models have so far achieved only insufficiently.
Authors: Moshik Hershcovitch, Andrew Wood, Leshem Choshen, Guy Girmonsky, Roy Leibovitz, Ilias Ennmouri, Michal Malka, Peter Chin, Swaminathan Sundararaman, Danny Harnik
Abstract: With the growth of model sizes and the scale of their deployment, their sheer size burdens the infrastructure, requiring more network bandwidth and more storage to accommodate them. While there is a vast model compression literature on deleting parts of the model weights for faster inference, we investigate a more traditional type of compression - one that represents the model in a compact form and is coupled with a decompression algorithm that returns it to its original form and size - namely lossless compression. We present ZipNN, a lossless compression method tailored to neural networks. Somewhat surprisingly, we show that specific lossless compression can gain significant network and storage reduction on popular models, often saving 33% and at times reducing over 50% of the model size. We investigate the source of model compressibility and introduce specialized compression variants tailored for models that further increase the effectiveness of compression. On popular models (e.g. Llama 3) ZipNN shows space savings that are over 17% better than vanilla compression while also improving compression and decompression speeds by 62%. We estimate that these methods could save over an ExaByte per month of network traffic downloaded from a large model hub like Hugging Face.
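A back-of-envelope illustration of lossless weight compression: zlib stands in for ZipNN's tailored codecs, the byte-grouping trick only mirrors the specialized variants in spirit, and ratios on random data understate the gains reported on real models.

```python
import zlib
import numpy as np

weights = np.random.randn(1_000_000).astype(np.float32)   # stand-in for a real layer
raw = weights.tobytes()
print(f"plain zlib ratio: {len(zlib.compress(raw)) / len(raw):.2%}")

# One source of compressibility is the skewed distribution of exponent bytes.
# Grouping the four bytes of each float into separate streams lets the entropy
# coder exploit that skew.
b = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 4)
grouped = b"".join(b[:, i].tobytes() for i in range(4))
print(f"byte-grouped ratio: {len(zlib.compress(grouped)) / len(raw):.2%}")
```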
Authors: Kai Zhao, Xuhao Li, Qiyu Kang, Feng Ji, Qinxu Ding, Yanan Zhao, Wenfei Liang, Wee Peng Tay
Abstract: We introduce the Distributed-order fRActional Graph Operating Network (DRAGON), a novel continuous Graph Neural Network (GNN) framework that incorporates distributed-order fractional calculus. Unlike traditional continuous GNNs that utilize integer-order or single fractional-order differential equations, DRAGON uses a learnable probability distribution over a range of real numbers for the derivative orders. By allowing a flexible and learnable superposition of multiple derivative orders, our framework captures complex graph feature updating dynamics beyond the reach of conventional models. We provide a comprehensive interpretation of our framework's capability to capture intricate dynamics through the lens of a non-Markovian graph random walk with node feature updating driven by an anomalous diffusion process over the graph. Furthermore, to highlight the versatility of the DRAGON framework, we conduct empirical evaluations across a range of graph learning tasks. The results consistently demonstrate superior performance when compared to traditional continuous GNN models. The implementation code is available at \url{https://github.com/zknus/NeurIPS-2024-DRAGON}.
Authors: Sajal Regmi, Chetan Phakami Pun
Abstract: Large Language Models (LLMs), such as GPT (Radford et al., 2019), have significantly advanced artificial intelligence by enabling sophisticated natural language understanding and generation. However, the high computational and financial costs associated with frequent API calls to these models present a substantial bottleneck, especially for applications like customer service chatbots that handle repetitive queries. In this paper, we introduce GPT Semantic Cache, a method that leverages semantic caching of query embeddings in in-memory storage (Redis). By storing embeddings of user queries, our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the LLM. This technique reduces operational costs and improves response times, enhancing the efficiency of LLM-powered applications.
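A minimal semantic-cache sketch under stated assumptions: a Python list stands in for Redis, the embedding function is a placeholder, and the similarity threshold is illustrative.

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed_fn, threshold=0.92):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.keys, self.values = [], []          # embeddings and cached responses

    def get(self, query):
        q = self.embed_fn(query)
        q = q / np.linalg.norm(q)
        for k, v in zip(self.keys, self.values):
            if float(q @ k) >= self.threshold:   # cosine similarity on unit vectors
                return v                          # cache hit: skip the LLM call
        return None

    def put(self, query, response):
        q = self.embed_fn(query)
        self.keys.append(q / np.linalg.norm(q))
        self.values.append(response)

def fake_embed(text):                             # placeholder embedding model
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(384)

cache = SemanticCache(fake_embed)
cache.put("What are your opening hours?", "We are open 9-5.")
print(cache.get("What are your opening hours?"))  # hit
```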
Authors: Shairoz Sohail
Abstract: Kolmogorov-Arnold Networks (KANs) have recently been introduced as a flexible alternative to Multi-Layer Perceptron (MLP) architectures. In this paper, we examine the training dynamics of different KAN architectures and compare them with corresponding MLP formulations. We train with a variety of initialization schemes, optimizers, and learning rates, and also utilize backpropagation-free approaches like the HSIC Bottleneck. We find that (when judged by test accuracy) KANs are an effective alternative to MLP architectures on high-dimensional datasets and have somewhat better parameter efficiency, but suffer from more unstable training dynamics. Finally, we provide recommendations for improving training stability of larger KAN models.
Authors: Dong Shu, Bingbing Duan, Kai Guo, Kaixiong Zhou, Jiliang Tang, Mengnan Du
Abstract: Latent representation alignment has become a foundational technique for constructing multimodal large language models (MLLMs) by mapping embeddings from different modalities into a shared space, often aligned with the embedding space of large language models (LLMs) to enable effective cross-modal understanding. While preliminary protein-focused MLLMs have emerged, they have predominantly relied on heuristic approaches, lacking a fundamental understanding of optimal alignment practices across representations. In this study, we explore the alignment of multimodal representations between LLMs and Geometric Deep Models (GDMs) in the protein domain. We comprehensively evaluate three state-of-the-art LLMs (Gemma2-2B, LLaMa3.1-8B, and LLaMa3.1-70B) with four protein-specialized GDMs (GearNet, GVP, ScanNet, GAT). Our work examines alignment factors from both model and protein perspectives, identifying challenges in current alignment methodologies and proposing strategies to improve the alignment process. Our key findings reveal that GDMs incorporating both graph and 3D structural information align better with LLMs, larger LLMs demonstrate improved alignment capabilities, and protein rarity significantly impacts alignment performance. We also find that increasing GDM embedding dimensions, using two-layer projection heads, and fine-tuning LLMs on protein-specific data substantially enhance alignment quality. These strategies offer potential enhancements to the performance of protein-related multimodal models. Our code and data are available at https://github.com/Tizzzzy/LLM-GDM-alignment.
Authors: Yanhui Zhu, Samik Basu, A. Pavan
Abstract: Submodular optimization has become increasingly prominent in machine learning, and fairness has drawn much attention. In this paper, we propose to study the fair $k$-submodular maximization problem and develop a $\frac{1}{3}$-approximation greedy algorithm with a running time of $\mathcal{O}(knB)$. To the best of our knowledge, our work is the first to incorporate fairness in the context of $k$-submodular maximization, and our theoretical guarantee matches the best-known $k$-submodular maximization results without fairness constraints. In addition, we have developed a faster threshold-based algorithm that achieves a $(\frac{1}{3} - \epsilon)$ approximation with $\mathcal{O}(\frac{kn}{\epsilon} \log \frac{B}{\epsilon})$ evaluations of the function $f$. Furthermore, for both algorithms, we provide approximation guarantees when the $k$-submodular function cannot be accessed exactly but only approximately. We have extensively validated our theoretical findings through empirical research and examined the practical implications of fairness. Specifically, we have addressed the question: ``What is the price of fairness?'' through case studies on influence maximization with $k$ topics and sensor placement with $k$ types. The experimental results show that the fairness constraints do not significantly undermine the quality of solutions.
Authors: Weijie Chen, Alan McMillan
Abstract: This paper introduces an efficient sub-model ensemble framework aimed at enhancing the interpretability of medical deep learning models, thus increasing their clinical applicability. By generating uncertainty maps, this framework enables end-users to evaluate the reliability of model outputs. We developed a strategy to derive diverse models from a single well-trained checkpoint, facilitating the training of a model family. This involves producing multiple outputs from a single input, fusing them into a final output, and estimating uncertainty based on output disagreements. Implemented using U-Net and UNETR models for segmentation and synthesis tasks, this approach was tested on CT body segmentation and MR-CT synthesis datasets. It achieved a mean Dice coefficient of 0.814 in segmentation and a Mean Absolute Error of 88.17 HU in synthesis, improved from 89.43 HU by pruning. Additionally, the framework was evaluated under corruption and undersampling, maintaining correlation between uncertainty and error, which highlights its robustness. These results suggest that the proposed approach not only maintains the performance of well-trained models but also enhances interpretability through effective uncertainty estimation, applicable to both convolutional and transformer models in a range of imaging tasks.
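A hedged sketch of uncertainty from output disagreement: here sub-models are diversified by keeping dropout active at inference, an assumption standing in for the paper's checkpoint-derived model family.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                    nn.Dropout2d(0.2), nn.Conv2d(8, 1, 3, padding=1))

def predict_with_uncertainty(net, x, n_members=8):
    net.train()                       # keep dropout active to diversify members
    with torch.no_grad():
        preds = torch.stack([net(x) for _ in range(n_members)])
    return preds.mean(dim=0), preds.std(dim=0)   # fused output, uncertainty map

x = torch.rand(1, 1, 64, 64)          # e.g., a CT slice
fused, uncertainty = predict_with_uncertainty(net, x)
```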
Authors: Jaewon Chu, Jinyoung Park, Seunghun Lee, Hyunwoo J. Kim
Abstract: Latent Bayesian optimization (LBO) approaches have successfully adopted Bayesian optimization over a continuous latent space by employing an encoder-decoder architecture to address the challenge of optimization in a high dimensional or discrete input space. LBO learns a surrogate model to approximate the black-box objective function in the latent space. However, we observed that most LBO methods suffer from the ``misalignment problem'', which is induced by the reconstruction error of the encoder-decoder architecture. It hinders learning an accurate surrogate model and generating high-quality solutions. In addition, several trust region-based LBO methods select the anchor, the center of the trust region, based solely on the objective function value without considering the trust region's potential to enhance the optimization process. To address these issues, we propose Inversion-based Latent Bayesian Optimization (InvBO), a plug-and-play module for LBO. InvBO consists of two components: an inversion method and a potential-aware trust region anchor selection. The inversion method searches the latent code that completely reconstructs the given target data. The potential-aware trust region anchor selection considers the potential capability of the trust region for better local optimization. Experimental results demonstrate the effectiveness of InvBO on nine real-world benchmarks, such as molecule design and arithmetic expression fitting tasks. Code is available at https://github.com/mlvlab/InvBO.
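A minimal sketch of the inversion idea: search the latent space for a code whose decoding reconstructs the target. The decoder, loss, and optimizer settings below are placeholders.

```python
import torch

def invert(decoder, target, z_dim=32, steps=500, lr=0.05):
    """Optimize a latent code so that decoder(z) reconstructs `target`."""
    z = torch.zeros(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(decoder(z), target)
        loss.backward()
        opt.step()
    return z.detach()                 # latent code aligned with the target data

decoder = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.Tanh(),
                              torch.nn.Linear(64, 16))
z_star = invert(decoder, target=torch.rand(1, 16))
```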
Authors: Kun Wang, Sumanth Varambally, Duncan Watson-Parris, Yi-An Ma, Rose Yu
Abstract: Many important phenomena in scientific fields such as climate, neuroscience, and epidemiology are naturally represented as spatiotemporal gridded data with complex interactions. For example, in climate science, researchers aim to uncover how large-scale events, such as the North Atlantic Oscillation (NAO) and the Antarctic Oscillation (AAO), influence other global processes. Inferring causal relationships from these data is a challenging problem compounded by the high dimensionality of such data and the correlations between spatially proximate points. We present SPACY (SPAtiotemporal Causal discoverY), a novel framework based on variational inference, designed to explicitly model latent time-series and their causal relationships from spatially confined modes in the data. Our method uses an end-to-end training process that maximizes an evidence lower bound (ELBO) on the data likelihood. Theoretically, we show that, under some conditions, the latent variables are identifiable up to transformation by an invertible matrix. Empirically, we show that SPACY outperforms state-of-the-art baselines on synthetic data, remains scalable for large grids, and identifies key known phenomena from real-world climate data.
Authors: Pochun Li, Yuyang Xiao, Jinghua Yan, Xuan Li, Xiaoye Wang
Abstract: This study presents a novel Q-learning-based scheduling algorithm for computer system performance optimization and adaptive workload management. In modern computing environments, characterized by increasing data volumes, task complexity, and dynamic workloads, traditional static scheduling methods such as Round-Robin and Priority Scheduling fail to meet the demands of efficient resource allocation and real-time adaptability. By contrast, Q-learning, a reinforcement learning algorithm, continuously learns from system state changes, enabling dynamic scheduling and resource optimization. Through extensive experiments, the superiority of the proposed approach is demonstrated in both task completion time and resource utilization, outperforming traditional and dynamic resource allocation (DRA) algorithms. These findings are critical as they highlight the potential of intelligent scheduling algorithms based on reinforcement learning to address the growing complexity and unpredictability of computing environments. This research provides a foundation for the integration of AI-driven adaptive scheduling in future large-scale systems, offering a scalable, intelligent solution to enhance system performance, reduce operating costs, and support sustainable energy consumption. The broad applicability of this approach makes it a promising candidate for next-generation computing frameworks, such as edge computing, cloud computing, and the Internet of Things.
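For concreteness, a minimal tabular Q-learning loop for a toy scheduling problem; the state space, actions, and reward are illustrative assumptions, not the paper's design.

```python
import numpy as np

n_states, n_actions = 10, 3            # e.g., load levels x {assign to queue 0/1/2}
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def env_step(s, a):                     # toy dynamics: reward shorter completion
    reward = -abs(s - 3 * a)            # placeholder cost model
    return rng.integers(n_states), reward

s = 0
for _ in range(10_000):
    # epsilon-greedy action selection
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = env_step(s, a)
    # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```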
Authors: Ahmed Salah, David Yevick
Abstract: This paper demonstrates that grokking behavior in modular arithmetic with a modulus P in a neural network can be controlled by modifying the profile of the activation function as well as the depth and width of the model. Plotting the even PCA projections of the weights of the last NN layer against their odd projections further yields patterns which become significantly more uniform when the nonlinearity is increased by incrementing the number of layers. These patterns can be employed to factor P when P is nonprime. Finally, a metric for the generalization ability of the network is inferred from the entropy of the layer weights while the degree of nonlinearity is related to correlations between the local entropy of the weights of the neurons in the final layer.
Authors: Xingyu Ai, Bin Huang, Fang Chen, Liu Shi, Binxuan Li, Shaoyu Wang, Qiegen Liu
Abstract: Recent advances in diffusion models have demonstrated exceptional performance in generative tasks across various fields. In positron emission tomography (PET), the reduction in tracer dose leads to information loss in sinograms. Using diffusion models to reconstruct missing information can improve imaging quality. Traditional diffusion models effectively use Gaussian noise for image reconstruction. However, in low-dose PET reconstruction, Gaussian noise can worsen the already sparse data by introducing artifacts and inconsistencies. To address this issue, we propose a diffusion model named residual estimation diffusion (RED). From the perspective of the diffusion mechanism, RED uses the residual between sinograms to replace Gaussian noise in the diffusion process, setting the low-dose and full-dose sinograms as the starting point and endpoint of reconstruction, respectively. This mechanism helps preserve the original information in the low-dose sinogram, thereby enhancing reconstruction reliability. From the perspective of data consistency, RED introduces a drift correction strategy to reduce accumulated prediction errors during the reverse process. Calibrating the intermediate results of reverse iterations helps maintain data consistency and enhances the stability of the reconstruction process. Experimental results show that RED effectively improves the quality of low-dose sinograms as well as the reconstruction results. The code is available at: https://github.com/yqx7150/RED.
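A minimal sketch of the residual bridge described above: the forward process interpolates between the full-dose endpoint and the low-dose starting point, with the residual playing the role Gaussian noise usually does. The linear schedule is an assumption.

```python
import torch

def forward_state(x_full, x_low, t, T=1000):
    """t = 0 gives the full-dose sinogram; t = T gives the low-dose one."""
    residual = x_low - x_full           # replaces the Gaussian noise term
    return x_full + (t / T) * residual

x_full = torch.rand(1, 1, 128, 128)     # stand-in full-dose sinogram
x_low = torch.rand(1, 1, 128, 128)      # stand-in low-dose sinogram
x_mid = forward_state(x_full, x_low, t=500)
```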
Authors: Saheli Saha, Debasmita Banerjee, Rishi Ram, Gowtham Reddy, Debashree Guha, Arnab Sarkar, Bapi Dutta, Moses ArunSingh S, Suman Chakraborty, Indranil Mallick
Abstract: Dose prediction is an area of ongoing research that facilitates radiotherapy planning. Most commercial models utilise imaging data and intense computing resources. This study aimed to predict the dose-volume of the rectum and bladder from the volumes of the target, organs at risk, and their overlap regions using machine learning. Dose-volume information of 94 patients with prostate cancer planned for 6000cGy in 20 fractions was exported from the treatment planning system as text files and mined to create a training dataset. Several statistical modelling and machine learning methods, and a new fuzzy rule-based prediction (FRBP) model, were explored and validated on an independent dataset of 39 patients. The median absolute error was 2.0%-3.7% for the bladder and 1.7%-2.4% for the rectum in the 4000-6420cGy range. For 5300cGy, 5600cGy and 6000cGy, the median difference was less than 2.5% for the rectum and 3.8% for the bladder. At these dose levels, the FRBP model produced errors of 1.2%, 1.3%, 0.9% for the rectum and 1.6%, 1.2%, 0.1% for the bladder. These findings indicate the feasibility of obtaining accurate predictions of the clinically important dose-volume parameters for the rectum and bladder using just the volumes of these structures.
Authors: Yo-Hwan Choi, Seon-Yu Kang, Minjong Cheon
Abstract: As global warming increases the complexity of weather patterns, the precision of weather forecasting becomes increasingly important. Our study proposes a novel preprocessing method and a convolutional autoencoder model developed to improve the interpretation of synoptic weather maps, which are critical for meteorologists seeking a thorough understanding of weather conditions. This model can retrieve historical synoptic weather maps that nearly match current atmospheric conditions, marking a significant step forward for modern technology in meteorological forecasting. Our comparison comprises unsupervised learning models like VQ-VAE, supervised learning models like VGG16, VGG19, Xception, InceptionV3, and ResNet50 trained on the ImageNet dataset, and newer models like EfficientNet and ConvNeXt. Our findings show that, while these models perform well in various settings, their ability to identify comparable synoptic weather maps has certain limits. Our research, motivated by the primary goal of significantly increasing meteorologists' efficiency in labor-intensive tasks, found that cosine similarity is the most effective metric, as determined by a combination of quantitative and qualitative assessments to accurately identify relevant historical weather patterns. This study broadens our understanding by shifting the emphasis from numerical precision to practical application, ensuring that our model is effective in theory, practical in use, and accessible in the complex and dynamic field of meteorology.
Authors: Yassine Abbahaddou, Sofiane Ennadir, Johannes F. Lutzeyer, Fragkiskos D. Malliaros, Michalis Vazirgiannis
Abstract: Graph Neural Networks (GNNs), which are nowadays the benchmark approach in graph representation learning, have been shown to be vulnerable to adversarial attacks, raising concerns about their real-world applicability. While existing defense techniques primarily concentrate on the training phase of GNNs, involving adjustments to message passing architectures or pre-processing methods, there is a noticeable gap in methods focusing on increasing robustness during inference. In this context, this study introduces RobustCRF, a post-hoc approach aiming to enhance the robustness of GNNs at the inference stage. Our proposed method, founded on statistical relational learning using a Conditional Random Field, is model-agnostic and does not require prior knowledge about the underlying model architecture. We validate the efficacy of this approach across various models, leveraging benchmark node classification datasets.
Authors: Xiangyu Zhao, Zhiwang Zhou, Wenlong Zhang, Yihao Liu, Xiangyu Chen, Junchao Gong, Hao Chen, Ben Fei, Shiqi Chen, Wanli Ouyang, Xiao-Ming Wu, Lei Bai
Abstract: The Earth's weather system encompasses intricate weather data modalities and diverse weather understanding tasks, which hold significant value to human life. Existing data-driven models focus on single weather understanding tasks (e.g., weather forecasting). Although these models have achieved promising results, they fail to tackle various complex tasks within a single and unified model. Moreover, the paradigm that relies on limited real observations for a single scenario limits the model's performance upper bound. In response to these limitations, we draw inspiration from the in-context learning paradigm employed in state-of-the-art visual foundation models and large language models. In this paper, we introduce the first generalist weather foundation model (WeatherGFM), designed to address a wide spectrum of weather understanding tasks in a unified manner. More specifically, we initially unify the representation and definition of the diverse weather understanding tasks. Subsequently, we devise weather prompt formats to manage different weather data modalities, namely single, multiple, and temporal modalities. Finally, we adopt a visual prompting question-answering paradigm for the training of unified weather understanding tasks. Extensive experiments indicate that our WeatherGFM can effectively handle up to ten weather understanding tasks, including weather forecasting, super-resolution, weather image translation, and post-processing. Our method also showcases generalization ability on unseen tasks.
Authors: Levi Rauchwerger, Stefanie Jegelka, Ron Levie
Abstract: We analyze the universality and generalization of graph neural networks (GNNs) on attributed graphs, i.e., with node attributes. To this end, we propose pseudometrics over the space of all attributed graphs that describe the fine-grained expressivity of GNNs. Namely, GNNs are both Lipschitz continuous with respect to our pseudometrics and can separate attributed graphs that are distant in the metric. Moreover, we prove that the space of all attributed graphs is relatively compact with respect to our metrics. Based on these properties, we prove a universal approximation theorem for GNNs and generalization bounds for GNNs on any data distribution of attributed graphs. The proposed metrics compute the similarity between the structures of attributed graphs via a hierarchical optimal transport between computation trees. Our work extends and unites previous approaches which either derived theory only for graphs with no attributes, derived compact metrics under which GNNs are continuous but without separation power, or derived metrics under which GNNs are continuous and separate points but the space of graphs is not relatively compact, which prevents universal approximation and generalization analysis.
Authors: Peidong Liu, Wenbo Zhang, Xue Zhe, Jiancheng Lv, Xianggen Liu
Abstract: The efficacy of diffusion models in generating a spectrum of data modalities, including images, text, and videos, has spurred inquiries into their utility in molecular generation, yielding significant advancements in the field. However, the molecular generation process with diffusion models involves multiple autoregressive steps over a finite time horizon, inherently leading to exposure bias. To address the exposure bias issue, we propose a training framework named GapDiff. The core idea of GapDiff is to probabilistically utilize model-predicted conformations as ground truth during training, aiming to mitigate the data distributional disparity between training and inference, thereby enhancing the affinity of generated molecules. We conduct experiments using a 3D molecular generation model on the CrossDocked2020 dataset, and the Vina energy and diversity results demonstrate the potency of our framework, with superior affinity. GapDiff is available at \url{https://github.com/HUGHNew/gapdiff}.
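A hedged sketch of the core idea: with some probability, rebuild the training input from the model's own prediction rather than the ground truth, shrinking the train/inference gap. The probability, schedule, and x0-parameterization below are assumptions.

```python
import torch

def make_noisy_input(model, x0_true, t, alpha_bar, p_self=0.3):
    """With prob. p_self, rebuild x_t from the model's own x0 estimate so that
    training inputs resemble the imperfect ones seen at inference."""
    noise = torch.randn_like(x0_true)
    x_t = alpha_bar[t].sqrt() * x0_true + (1 - alpha_bar[t]).sqrt() * noise
    if torch.rand(()) < p_self:
        with torch.no_grad():
            x0_hat = model(x_t, t)                       # model's own estimate
        x_t = alpha_bar[t].sqrt() * x0_hat + (1 - alpha_bar[t]).sqrt() * noise
    return x_t, noise

model = lambda x, t: x                                   # placeholder denoiser
alpha_bar = torch.linspace(0.999, 0.01, 1000)
x_t, noise = make_noisy_input(model, torch.rand(2, 8), t=500, alpha_bar=alpha_bar)
```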
Authors: Bo Li, Wei Wang, Peng Ye
Abstract: Differential privacy (DP) is a formal notion that restricts the privacy leakage of an algorithm when running on sensitive data, in which the privacy-utility trade-off is one of the central problems in private data analysis. In this work, we investigate the fundamental limits of differential privacy in online learning algorithms and present evidence that separates three types of constraints: no DP, pure DP, and approximate DP. We first describe a hypothesis class that is online learnable under approximate DP but not online learnable under pure DP in the adaptive adversarial setting. This indicates that approximate DP must be adopted when dealing with adaptive adversaries. We then prove that any private online learner must make an infinite number of mistakes for almost all hypothesis classes. This essentially generalizes previous results and shows a strong separation between private and non-private settings, since a finite mistake bound is always attainable (as long as the class is online learnable) when there is no privacy requirement.
Authors: Jonah K\"omen, Hannah Marienwald, Jonas Dippel, Julius Hense
Abstract: Deep learning has led to remarkable advancements in computational histopathology, e.g., in diagnostics, biomarker prediction, and outcome prognosis. Yet, the lack of annotated data and the impact of batch effects, e.g., systematic technical data differences across hospitals, hamper model robustness and generalization. Recent histopathological foundation models -- pretrained on millions to billions of images -- have been reported to improve generalization performances on various downstream tasks. However, it has not been systematically assessed whether they fully eliminate batch effects. In this study, we empirically show that the feature embeddings of the foundation models still contain distinct hospital signatures that can lead to biased predictions and misclassifications. We further find that the signatures are not removed by stain normalization methods, dominate distances in feature space, and are evident across various principal components. Our work provides a novel perspective on the evaluation of medical foundation models, paving the way for more robust pretraining strategies and downstream predictors.
Authors: Ricard Montal\`a, Bernat Font, Pol Su\'arez, Jean Rabault, Oriol Lehmkuhl, Ivette Rodriguez
Abstract: This paper presents a deep reinforcement learning (DRL) framework for active flow control (AFC) to reduce drag in aerodynamic bodies. Tested on a 3D cylinder at Re = 100, the DRL approach achieved a 9.32% drag reduction and a 78.4% decrease in lift oscillations by learning advanced actuation strategies. The methodology integrates a CFD solver with a DRL model using an in-memory database for efficient communication between the solver and the learning agent.
Authors: Pablo G\'omez, Roland D. Vavrek, Guillermo Buenadicha, John Hoar, Sandor Kruk, Jan Reerink
Abstract: State-of-the-art space science missions increasingly rely on automation due to spacecraft complexity and the costs of human oversight. The high volume of data, including scientific and telemetry data, makes manual inspection challenging. Machine learning offers significant potential to meet these demands. The Euclid space telescope, in its survey phase since February 2024, exemplifies this shift. Euclid's success depends on accurate monitoring and interpretation of housekeeping telemetry and science-derived data. Thousands of telemetry parameters, monitored as time series, may or may not impact the quality of scientific data. These parameters have complex interdependencies, often due to physical relationships (e.g., proximity of temperature sensors). Optimising science operations requires careful anomaly detection and identification of hidden parameter states. Moreover, understanding the interactions between known anomalies and physical quantities is crucial yet complex, as related parameters may display anomalies with varied timing and intensity. We address these challenges by analysing temperature anomalies in Euclid's telemetry from February to August 2024, focusing on eleven temperature parameters and 35 covariates. We use a predictive XGBoost model to forecast temperatures based on historical values, detecting anomalies as deviations from predictions. A second XGBoost model predicts anomalies from covariates, capturing their relationships to temperature anomalies. We identify the top three anomalies per parameter and analyse their interactions with covariates using SHAP (Shapley Additive Explanations), enabling rapid, automated analysis of complex parameter relationships. Our method demonstrates how machine learning can enhance telemetry monitoring, offering scalable solutions for other missions with similar data challenges.
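A compact sketch of the two-model pipeline on synthetic stand-in data: forecast a temperature, flag large residuals as anomalies, and attribute them to covariates with SHAP. Feature construction and the anomaly threshold are assumptions.

```python
import numpy as np
import xgboost
import shap

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 35))                   # covariates + lagged values
y = X[:, 0] * 0.8 + rng.standard_normal(2000) * 0.1  # stand-in temperature series

model = xgboost.XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X, y)

# Anomalies are deviations from the model's forecast.
residuals = y - model.predict(X)
threshold = 4 * residuals.std()                       # illustrative cutoff
anomalies = np.where(np.abs(residuals) > threshold)[0]

# Explain anomalous points via per-covariate SHAP attributions.
explainer = shap.TreeExplainer(model)
if len(anomalies):
    shap_values = explainer.shap_values(X[anomalies])
```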
Authors: Zhihong Liu, Xin Xu, Peng Qiao, Dongsheng Li
Abstract: Deep reinforcement learning has led to dramatic breakthroughs in the field of artificial intelligence over the past few years. As the amount of rollout experience data and the size of neural networks for deep reinforcement learning have grown continuously, handling the training process and reducing the time consumption using parallel and distributed computing is becoming an urgent and essential need. In this paper, we perform a broad and thorough investigation of training acceleration methodologies for deep reinforcement learning based on parallel and distributed computing, providing a comprehensive survey of this field with state-of-the-art methods and pointers to core references. In particular, a taxonomy of literature is provided, along with a discussion of emerging topics and open issues. This incorporates learning system architectures, simulation parallelism, computing parallelism, distributed synchronization mechanisms, and deep evolutionary reinforcement learning. Further, we compare 16 current open-source libraries and platforms with criteria of facilitating rapid development. Finally, we extrapolate future directions that deserve further research.
Authors: Ayobami Adewale, Chris Lee, Amnir Hadachi, Nicolly Lima da Silva
Abstract: As we move towards a mixed-traffic scenario of autonomous vehicles (AVs) and human-driven vehicles (HDVs), understanding car-following behaviour is important for improving traffic efficiency and road safety. Using a real-world trajectory dataset, this study uses descriptive and statistical analysis to investigate the car-following behaviours of three vehicle pairs in mixed traffic: HDV-AV, AV-HDV and HDV-HDV. The ANOVA test showed that differences in car-following behaviour across vehicle pairs are statistically significant (p-value < 0.05). We also introduce a data-driven Knowledge Distillation Neural Network (KDNN) model for predicting car-following behaviour in terms of speed. The KDNN model demonstrates predictive accuracy comparable to its teacher network, a Long Short-Term Memory (LSTM) network, and outperforms both the standalone student network, a Multilayer Perceptron (MLP), and traditional physics-based models like the Gipps model. Notably, the KDNN model better prevents collisions, as measured by minimum Time-to-Collision (TTC), and operates with lower computational power, making it ideal for AVs or driving simulators requiring efficient computing.
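A minimal sketch of the distillation objective: the MLP student fits both the ground-truth speed and the LSTM teacher's prediction. The architecture sizes and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn

teacher = nn.LSTM(input_size=4, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)                         # teacher's speed-prediction head
student = nn.Sequential(nn.Flatten(), nn.Linear(10 * 4, 64), nn.ReLU(), nn.Linear(64, 1))

def kd_loss(x, y_true, alpha=0.5):
    with torch.no_grad():
        h, _ = teacher(x)                       # (B, T, 32)
        y_teacher = head(h[:, -1])              # teacher speed prediction
    y_student = student(x)
    # Weighted mix of ground-truth loss and teacher-imitation loss.
    return alpha * nn.functional.mse_loss(y_student, y_true) + \
           (1 - alpha) * nn.functional.mse_loss(y_student, y_teacher)

x = torch.rand(8, 10, 4)                        # 10 timesteps of gap/speed features
loss = kd_loss(x, y_true=torch.rand(8, 1))
loss.backward()
```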
Authors: Zhilong Zhang, Ruifeng Chen, Junyin Ye, Yihao Sun, Pengyuan Wang, Jingcheng Pang, Kaiyuan Li, Tianshuo Liu, Haoxin Lin, Yang Yu, Zhi-Hua Zhou
Abstract: World models play a crucial role in decision-making within embodied environments, enabling cost-free explorations that would otherwise be expensive in the real world. To facilitate effective decision-making, world models must be equipped with strong generalizability to support faithful imagination in out-of-distribution (OOD) regions and provide reliable uncertainty estimation to assess the credibility of the simulated experiences, both of which present significant challenges for prior scalable approaches. This paper introduces WHALE, a framework for learning generalizable world models, consisting of two key techniques: behavior-conditioning and retracing-rollout. Behavior-conditioning addresses the policy distribution shift, one of the primary sources of the world model generalization error, while retracing-rollout enables efficient uncertainty estimation without the necessity of model ensembles. These techniques are universal and can be combined with any neural network architecture for world model learning. Incorporating these two techniques, we present Whale-ST, a scalable spatial-temporal transformer-based world model with enhanced generalizability. We demonstrate the superiority of Whale-ST in simulation tasks by evaluating both value estimation accuracy and video generation fidelity. Additionally, we examine the effectiveness of our uncertainty estimation technique, which enhances model-based policy optimization in fully offline scenarios. Furthermore, we propose Whale-X, a 414M parameter world model trained on 970K trajectories from Open X-Embodiment datasets. We show that Whale-X exhibits promising scalability and strong generalizability in real-world manipulation scenarios using minimal demonstrations.
Authors: Samira Maghool, Paolo Ceravolo
Abstract: In this paper, we propose an innovative approach to thoroughly explore dataset features that introduce bias in downstream machine-learning tasks. Depending on the data format, we use different techniques to map instances into a similarity feature space. Our method's ability to adjust the resolution of pairwise similarity provides clear insights into the relationship between the dataset classification complexity and model fairness. Experimental results confirm the promising applicability of the similarity network in promoting fair models. Moreover, our methodology not only shows promise for fair downstream tasks such as classification, but also performs well in imputing and augmenting the dataset to satisfy fairness criteria, such as demographic parity, even in the presence of imbalanced classes.
Authors: Xiaoyang Hou, Tian Zhu, Milong Ren, Dongbo Bu, Xin Gao, Chunming Zhang, Shiwei Sun
Abstract: Generating molecular graphs is crucial in drug design and discovery but remains challenging due to the complex interdependencies between nodes and edges. While diffusion models have demonstrated their potential in molecular graph design, they often suffer from unstable training and inefficient sampling. To enhance generation performance and training stability, we propose GGFlow, a discrete flow matching generative model for molecular graphs that incorporates optimal transport and an edge-augmented graph transformer to enable direct communication among chemical bonds. Additionally, GGFlow introduces a novel goal-guided generation framework to control the generative trajectory of our model, aiming to design novel molecular structures with the desired properties. GGFlow demonstrates superior performance on both unconditional and conditional molecule generation tasks, outperforming existing baselines and underscoring its effectiveness and potential for wider application.
Authors: Yi Li, Zhichun Guo, Guanpeng Li, Bingzhe Li
Abstract: Graph neural networks (GNNs) have become essential tools for analyzing non-Euclidean data across various domains. During the training stage, sampling plays an important role in reducing latency by limiting the number of nodes processed, particularly in large-scale applications. However, as the demand for better prediction performance grows, existing sampling algorithms become increasingly complex, leading to significant overhead. To mitigate this, we propose YOSO (You-Only-Sample-Once), an algorithm designed to achieve efficient training while preserving prediction accuracy. YOSO introduces a compressed sensing (CS)-based sampling and reconstruction framework, where nodes are sampled once at the input layer, followed by a lossless reconstruction at the output layer in each epoch. By integrating the reconstruction process with the loss function of specific learning tasks, YOSO not only avoids the costly computations of traditional CS methods, such as orthonormal basis calculations, but also ensures high-probability accuracy retention equivalent to full node participation. Experimental results on node classification and link prediction demonstrate the effectiveness and efficiency of YOSO, reducing GNN training time by an average of 75% compared to state-of-the-art methods, while maintaining accuracy on par with top-performing baselines.
Authors: Puqian Wang, Nikos Zarifis, Ilias Diakonikolas, Jelena Diakonikolas
Abstract: A single-index model (SIM) is a function of the form $\sigma(\mathbf{w}^{\ast} \cdot \mathbf{x})$, where $\sigma: \mathbb{R} \to \mathbb{R}$ is a known link function and $\mathbf{w}^{\ast}$ is a hidden unit vector. We study the task of learning SIMs in the agnostic (a.k.a. adversarial label noise) model with respect to the $L_2^2$-loss under the Gaussian distribution. Our main result is a sample and computationally efficient agnostic proper learner that attains $L_2^2$-error of $O(\mathrm{OPT})+\epsilon$, where $\mathrm{OPT}$ is the optimal loss. The sample complexity of our algorithm is $\tilde{O}(d^{\lceil k^{\ast}/2\rceil}+d/\epsilon)$, where $k^{\ast}$ is the information-exponent of $\sigma$, corresponding to the degree of its first non-zero Hermite coefficient. This sample bound nearly matches known CSQ lower bounds, even in the realizable setting. Prior algorithmic work in this setting had focused on learning in the realizable case or in the presence of semi-random noise. Prior computationally efficient robust learners required significantly stronger assumptions on the link function.
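For readers unfamiliar with the information exponent, the standard definition (textbook background consistent with the abstract's usage, not text from the paper) expands the link function in the Hermite basis under the Gaussian measure:

```latex
\sigma(t) = \sum_{j \ge 0} c_j\, h_j(t),
\qquad
c_j = \mathbb{E}_{t \sim \mathcal{N}(0,1)}\!\big[\sigma(t)\, h_j(t)\big],
\qquad
k^{\ast} = \min\{\, j \ge 1 : c_j \neq 0 \,\},
```

where the $h_j$ are the normalized Hermite polynomials; $k^{\ast}$, the degree of the first non-zero Hermite coefficient, is exactly what drives the $d^{\lceil k^{\ast}/2\rceil}$ term in the sample complexity above.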
Authors: Abdulkadir Gokce, Martin Schrimpf
Abstract: When trained on large-scale object classification datasets, certain artificial neural network models begin to approximate core object recognition (COR) behaviors and neural response patterns in the primate visual ventral stream (VVS). While recent machine learning advances suggest that scaling model size, dataset size, and compute resources improve task performance, the impact of scaling on brain alignment remains unclear. In this study, we explore scaling laws for modeling the primate VVS by systematically evaluating over 600 models trained under controlled conditions on benchmarks spanning V1, V2, V4, IT and COR behaviors. We observe that while behavioral alignment continues to scale with larger models, neural alignment saturates. This observation holds across model architectures and training datasets, even though models with stronger inductive biases and datasets with higher-quality images are more compute-efficient. Increased scaling is especially beneficial for higher-level visual areas, where small models trained on few samples exhibit poor alignment. Finally, we develop a scaling recipe, indicating that a greater proportion of compute should be allocated to data samples over model size. Our results suggest that while scaling alone might suffice for alignment with human core object recognition behavior, it will not yield improved models of the brain's visual ventral stream with current architectures and datasets, highlighting the need for novel strategies in building brain-like models.
Authors: William Cappelletti, Pascal Frossard
Abstract: Representing and exploiting multivariate signals requires capturing complex relations between variables. We define a novel Graph-Dictionary signal model, where a finite set of graphs characterizes relationships in data distribution through a weighted sum of their Laplacians. We propose a framework to infer the graph dictionary representation from observed data, along with a bilinear generalization of the primal-dual splitting algorithm to solve the learning problem. Our new formulation allows us to include a priori knowledge on signal properties, as well as on the underlying graphs and their coefficients. We show the capability of our method to reconstruct graphs from signals in multiple synthetic settings, where our model outperforms previous baselines. Then, we exploit graph-dictionary representations in a motor imagery decoding task on brain activity data, where we classify imagined motion better than standard methods relying on many more features.
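A toy illustration of the signal model itself (a sketch of the model only, not the paper's primal-dual inference algorithm): the effective graph is a weighted sum of dictionary Laplacians, and signals are sampled to be smooth with respect to it, a common prior in graph signal processing.

```python
# Graph-Dictionary model sketch: L = sum_k w_k L_k, signals ~ N(0, (L + eps I)^-1).
import numpy as np

rng = np.random.default_rng(0)
n, K = 20, 3                                  # nodes, dictionary size

def random_laplacian(n, p=0.3):
    A = (rng.random((n, n)) < p).astype(float)
    A = np.triu(A, 1); A = A + A.T            # symmetric adjacency, no self-loops
    return np.diag(A.sum(1)) - A

dictionary = [random_laplacian(n) for _ in range(K)]
w = rng.dirichlet(np.ones(K))                 # nonnegative coefficients summing to 1
L = sum(wk * Lk for wk, Lk in zip(w, dictionary))

# Sample signals whose precision matrix is L (plus a ridge for invertibility).
cov = np.linalg.inv(L + 1e-2 * np.eye(n))
signals = rng.multivariate_normal(np.zeros(n), cov, size=100)
print("average smoothness x^T L x:",
      float(np.mean(np.einsum('bi,ij,bj->b', signals, L, signals))))
```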
Authors: Lucas Rosenblatt, Yuliia Lut, Eitan Turok, Marco Avella-Medina, Rachel Cummings
Abstract: Imbalanced learning occurs in classification settings where the distribution of class-labels is highly skewed in the training data, such as when predicting rare diseases or in fraud detection. This class imbalance presents a significant algorithmic challenge, which can be further exacerbated when privacy-preserving techniques such as differential privacy are applied to protect sensitive training data. Our work formalizes these challenges and provides a number of algorithmic solutions. We consider DP variants of pre-processing methods that privately augment the original dataset to reduce the class imbalance; these include oversampling, SMOTE, and private synthetic data generation. We also consider DP variants of in-processing techniques, which adjust the learning algorithm to account for the imbalance; these include model bagging, class-weighted empirical risk minimization and class-weighted deep learning. For each method, we either adapt an existing imbalanced learning technique to the private setting or demonstrate its incompatibility with differential privacy. Finally, we empirically evaluate these privacy-preserving imbalanced learning methods under various data and distributional settings. We find that private synthetic data methods perform well as a data pre-processing step, while class-weighted ERMs are an alternative in higher-dimensional settings where private synthetic data suffers from the curse of dimensionality.
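As a concrete anchor for the in-processing family, here is a minimal sketch of class-weighted ERM in PyTorch; the class ratio and inverse-frequency weights are invented for illustration, and the differential-privacy component (per-sample gradient clipping and noising) is deliberately omitted rather than approximated.

```python
# Class-weighted ERM sketch (non-private core; DP-SGD's clipping and noise
# addition are omitted here, not approximated).
import torch
import torch.nn as nn

torch.manual_seed(0)
n_pos, n_neg, d = 50, 950, 20                      # 5% positive rate (assumed)
X = torch.randn(n_pos + n_neg, d)
y = torch.cat([torch.ones(n_pos), torch.zeros(n_neg)]).long()

model = nn.Linear(d, 2)
# Inverse-frequency class weights upweight the rare class in the objective.
counts = torch.bincount(y).float()
weights = counts.sum() / (2 * counts)
loss_fn = nn.CrossEntropyLoss(weight=weights)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()
print("per-class weights:", weights.tolist())
```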
Authors: Mayee F. Chen, Michael Y. Hu, Nicholas Lourie, Kyunghyun Cho, Christopher R\'e
Abstract: Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity per group. In this paper, we study the cause of this inconsistency by unifying existing methods into a standard optimization framework. We show that all methods set proportions to minimize total loss, subject to a method-specific mixing law -- an assumption on how loss is a function of mixture proportions. We find that existing parameterizations of mixing laws can express the true loss-proportion relationship empirically, but the methods themselves often set the mixing law parameters inaccurately, resulting in poor and inconsistent performance. Finally, we leverage the insights from our framework to derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Empirically, Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.28 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.01 test perplexity points.
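In symbols, the unifying framework described above can be sketched as follows (our paraphrase of the abstract, with a generic parametric mixing law; the paper's specific parameterizations are not reproduced):

```latex
\min_{p \,\in\, \Delta^{m-1}} \ \sum_{g=1}^{m} L_g(p)
\qquad \text{subject to the mixing law} \qquad
L_g(p) \;=\; f_g(p;\,\theta_g),
```

where $p$ are the mixture proportions over $m$ data groups, $L_g$ is the loss on group $g$, and each method commits to a particular family $f_g$ and a way of estimating $\theta_g$; Aioli's distinguishing move is estimating the $\theta_g$ online during training and re-solving for $p$ as it goes.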
Authors: Wangyang Ying, Haoyue Bai, Kunpeng Liu, Yanjie Fu
Abstract: Feature space is an environment where data points are vectorized to represent the original dataset. Reconstructing a good feature space is essential to augment the AI power of data, improve model generalization, and increase the availability of downstream ML models. Existing literature, such as feature transformation and feature selection, is labor-intensive (e.g., relying heavily on empirical experience) and mostly designed for tabular data. Moreover, these methods regard data samples as independent, which ignores the unique topological structure when applied to graph data, thus resulting in a suboptimal reconstructed feature space. Can we consider topological information to automatically reconstruct the feature space for graph data without heavy experiential knowledge? To fill this gap, we leverage topology-aware reinforcement learning to automate and optimize feature space reconstruction for graph data. Our approach combines the extraction of core subgraphs to capture essential structural information with a graph neural network (GNN) to encode topological features and reduce computing complexity. Then we introduce three reinforcement agents within a hierarchical structure to systematically generate meaningful features through an iterative process, effectively reconstructing the feature space. This framework provides a principled solution for attributed graph feature space reconstruction. Extensive experiments demonstrate the effectiveness and efficiency of including topological awareness.
Authors: Joseph Pollock, Igor Shilov, Euodia Dodd, Yves-Alexandre de Montjoye
Abstract: Membership inference attacks (MIAs) are widely used to empirically assess the privacy risks of samples used to train a target machine learning model. State-of-the-art methods, however, require training hundreds of shadow models with the same size and architecture as the target model, solely to evaluate the privacy risk. While one might be able to afford this for small models, the cost often becomes prohibitive for medium and large models. We instead propose a novel approach to identify the at-risk samples using only artifacts available during training, with little to no additional computational overhead. Our method analyzes individual per-sample loss traces and uses them to identify the vulnerable data samples. We demonstrate the effectiveness of our artifact-based approach through experiments on the CIFAR10 dataset, showing high precision in identifying vulnerable samples as determined by a SOTA shadow model-based MIA (LiRA). Impressively, our method reaches the same precision as another SOTA MIA when measured against LiRA, despite being orders of magnitude cheaper. We then show our loss-trace aggregation score, LT-IQR, to outperform alternative loss aggregation methods, perform ablation studies on hyperparameters, and validate the robustness of our method to the target metric. Finally, we study the evolution of the vulnerability score distribution throughout training as a metric for model-level risk assessment.
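The abstract does not spell out LT-IQR; the name suggests aggregating each sample's loss trace by its inter-quartile range, so the following is an illustrative guess at that statistic, not the authors' exact method.

```python
# Sketch: score samples from per-sample loss traces recorded during training.
# The IQR aggregation below is an assumption based on the name "LT-IQR".
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_epochs = 1000, 50
# loss_traces[i, t]: training loss of sample i at epoch t (synthetic, decreasing)
loss_traces = np.abs(rng.standard_normal((n_samples, n_epochs))).cumsum(axis=1)[:, ::-1] / 10

q75 = np.percentile(loss_traces, 75, axis=1)
q25 = np.percentile(loss_traces, 25, axis=1)
vulnerability = q75 - q25                 # larger spread -> flagged as more at-risk (assumed)

top_at_risk = np.argsort(vulnerability)[::-1][:20]
print("candidate at-risk sample indices:", top_at_risk)
```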
Authors: Rhys Gould, Hidenori Tanaka
Abstract: Adaptive optimization algorithms, particularly Adam and its variant AdamW, are fundamental components of modern deep learning. However, their training dynamics lack comprehensive theoretical understanding, with limited insight into why common practices, such as specific hyperparameter choices and normalization layers, contribute to successful generalization. This work presents a continuous-time formulation of Adam and AdamW, facilitating a tractable analysis of training dynamics that can shed light on such practical questions. We theoretically derive a stable region for Adam's hyperparameters $(\beta, \gamma)$ that ensures bounded updates, empirically verifying these predictions by observing unstable exponential growth of parameter updates outside this region. Furthermore, we theoretically justify the success of normalization layers by uncovering an implicit meta-adaptive effect of scale-invariant architectural components. This insight leads to an explicit optimizer, $2$-Adam, which we generalize to $k$-Adam, an optimizer that applies an adaptive normalization procedure $k$ times, encompassing Adam (corresponding to $k=1$) and Adam with a normalization layer (corresponding to $k=2$). Overall, our continuous-time formulation of Adam facilitates a principled analysis, offering deeper understanding of optimal hyperparameter choices and architectural decisions in modern deep learning.
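For reference, these are the standard discrete-time Adam updates that a continuous-time formulation takes to a limit (usual $\beta_1, \beta_2$ notation; the paper's $(\beta, \gamma)$ presumably play these roles, which is our assumption, not a statement from the paper):

```latex
m_{t} = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_{t} = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \qquad
\theta_{t} = \theta_{t-1} - \eta\, \frac{m_t/(1-\beta_1^t)}{\sqrt{v_t/(1-\beta_2^t)} + \epsilon}.
```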
Authors: Shreen Gul, Mohamed Elmahallawy, Sanjay Madria, Ardhendu Tripathy
Abstract: Deep learning (DL) models are popular across various domains due to their remarkable performance and efficiency. However, their effectiveness relies heavily on large amounts of labeled data, which are often time-consuming and labor-intensive to generate manually. To overcome this challenge, it is essential to develop strategies that reduce reliance on extensive labeled data while preserving model performance. In this paper, we propose FisherMask, a Fisher information-based active learning (AL) approach that identifies key network parameters by masking them based on their Fisher information values. FisherMask enhances batch AL by using Fisher information to select the most critical parameters, allowing the identification of the most impactful samples during AL training. Moreover, Fisher information possesses favorable statistical properties, offering valuable insights into model behavior and providing a better understanding of the performance characteristics within the AL pipeline. Our extensive experiments demonstrate that FisherMask significantly outperforms state-of-the-art methods on diverse datasets, including CIFAR-10 and FashionMNIST, especially under imbalanced settings. These improvements lead to substantial gains in labeling efficiency and show that Fisher information serves as an effective tool for measuring the sensitivity of model parameters to data samples. Our code is available at \url{https://github.com/sgchr273/FisherMask}.
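A minimal sketch of masking parameters by empirical Fisher information, in the spirit of the approach above; the diagonal batch-gradient approximation and the 5% mask fraction are assumptions, not the paper's exact recipe.

```python
# Rough diagonal Fisher estimate via squared gradients, then keep the top-k
# most informative parameters in a binary mask.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(256, 784), torch.randint(0, 10, (256,))

# Empirical (diagonal, batch-level) Fisher: squared log-likelihood gradients.
model.zero_grad()
loss_fn(model(x), y).backward()
fisher = torch.cat([p.grad.detach().flatten() ** 2 for p in model.parameters()])

k = int(0.05 * fisher.numel())            # keep top 5% most informative weights (assumed)
threshold = fisher.topk(k).values.min()
mask = (fisher >= threshold).float()
print(f"masked-in parameters: {int(mask.sum())} / {fisher.numel()}")
```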
Authors: Ankita Joshi, Ashutosh Sharma, Anoushkrit Goel, Ranjeet Ranjan Jha, Chirag Ahuja, Arnav Bhavsar, Aditya Nigam
Abstract: Fiber tractography is a cornerstone of neuroimaging, enabling the detailed mapping of the brain's white matter pathways through diffusion MRI. This is crucial for understanding brain connectivity and function, making it a valuable tool in neurological applications. Despite its importance, tractography faces challenges due to its complexity and susceptibility to false positives, which can misrepresent vital pathways. To address these issues, recent strategies have shifted towards deep learning, utilizing supervised learning, which depends on precise ground truth, or reinforcement learning, which operates without it. In this work, we propose Tract-RLFormer, a network utilizing both supervised and reinforcement learning in a two-stage policy refinement process that markedly improves accuracy and generalizability across various datasets. By employing a tract-specific approach, our network directly delineates the tracts of interest, bypassing the traditional segmentation process. Through rigorous validation on datasets such as TractoInferno, HCP, and ISMRM-2015, our methodology demonstrates a leap forward in tractography, showcasing its ability to accurately map the brain's white matter tracts.
Authors: Ye Zeng, Li Qiao, Zhen Gao, Tong Qin, Zhonghuai Wu, Emad Khalaf, Sheng Chen, Mohsen Guizani
Abstract: In massive multiple-input multiple-output (MIMO) systems, how to reliably acquire downlink channel state information (CSI) with low overhead is challenging. In this work, by integrating the generative pre-trained Transformer (GPT) with federated-tuning, we propose a CSI-GPT approach to realize efficient downlink CSI acquisition. Specifically, we first propose a Swin Transformer-based channel acquisition network (SWTCAN) to acquire downlink CSI, where pilot signals, downlink channel estimation, and uplink CSI feedback are jointly designed. Furthermore, to solve the problem of insufficient training data, we propose a variational auto-encoder-based channel sample generator (VAE-CSG), which can generate sufficient CSI samples based on a limited amount of high-quality CSI data obtained from the current cell. The CSI dataset generated from VAE-CSG will be used for pre-training SWTCAN. To fine-tune the pre-trained SWTCAN for improved performance, we propose an online federated-tuning method, where only a small subset of SWTCAN parameters are unfrozen and updated using over-the-air computation, avoiding the high communication overhead caused by aggregating the complete CSI samples from user equipment (UEs) to the base station (BS) for centralized fine-tuning. Simulation results verify the advantages of the proposed SWTCAN and the communication efficiency of the proposed federated-tuning method. Our code is publicly available at https://github.com/BIT-ZY/CSI-GPT.
Authors: Jonathan Light, Yue Wu, Yiyou Sun, Wenchao Yu, Yanchi liu, Xujiang Zhao, Ziniu Hu, Haifeng Chen, Wei Cheng
Abstract: We propose a novel approach to scaling LLM inference for code generation. We frame code generation as a black box optimization problem within the code space, and employ optimization-inspired techniques to enhance exploration. Specifically, we introduce Scattered Forest Search to enhance solution diversity while searching for solutions. Our theoretical analysis illustrates how these methods avoid local optima during optimization. Extensive experiments on HumanEval, MBPP, APPS, CodeContests, and Leetcode reveal significant performance improvements. For instance, our method achieves a pass@1 rate of 67.1% on HumanEval+ and 87.2% on HumanEval with GPT-3.5, marking improvements of 8.6% and 4.3% over the state-of-the-art, while also halving the iterations needed to find the correct solution. Furthermore, our method scales more efficiently than existing search techniques, including tree search, line search, and repeated sampling.
Authors: Stanis{\l}aw {\L}aniewski, Robert \'Slepaczuk
Abstract: This study utilizes machine learning algorithms to analyze and organize knowledge in the field of algorithmic trading. By filtering a dataset of 136 million research papers, we identified 14,342 relevant articles published between 1956 and Q1 2020. We compare traditional practices, such as keyword-based algorithms and embedding techniques, with state-of-the-art topic modeling methods that employ dimensionality reduction and clustering. This comparison allows us to assess the popularity and evolution of different approaches and themes within algorithmic trading. We demonstrate the usefulness of Natural Language Processing (NLP) in the automatic extraction of knowledge, highlighting the new possibilities created by the latest iterations of Large Language Models (LLMs) like ChatGPT. The rationale for focusing on this topic stems from our analysis, which reveals that research articles on algorithmic trading are increasing at a faster rate than the overall number of publications. While stocks and main indices comprise more than half of all assets considered, certain asset classes, such as cryptocurrencies, exhibit a much stronger growth trend. Machine learning models have become the most popular methods in recent years. The study demonstrates the efficacy of LLMs in refining datasets and addressing intricate questions about the analyzed articles, such as comparing the efficiency of different models. Our research shows that by decomposing tasks into smaller components and incorporating reasoning steps, we can effectively tackle complex questions supported by case analyses. This approach contributes to a deeper understanding of algorithmic trading methodologies and underscores the potential of advanced NLP techniques in literature reviews.
Authors: J. Soenen, A. Yurtman, T. Becker, K. Vanthournout, H. Blockeel
Abstract: To enable the transition from fossil fuels towards renewable energy, the low-voltage grid needs to be reinforced at a faster pace and on a larger scale than was historically the case. To efficiently plan reinforcements, one needs to estimate the currents and voltages throughout the grid, which are unknown but can be calculated from the grid layout and the electricity consumption time series of each consumer. However, for many consumers, these time series are unknown and have to be estimated from the available consumer information. We refer to this task as scenario generation. The state-of-the-art approach that generates electricity consumption scenarios is complex, resulting in a computationally expensive procedure with only limited interpretability. To alleviate these drawbacks, we propose a fast and interpretable scenario generation technique based on predictive clustering trees (PCTs) that does not compromise accuracy. In our experiments on three datasets from different locations, we found that our proposed approach generates time series that are at least as accurate as the state-of-the-art while being at least 7 times faster in training and prediction. Moreover, the interpretability of the PCT allows domain experts to gain insight into their data while simultaneously building trust in the predictions of the model.
Authors: Jan P. Williams, J. Nathan Kutz, Krithika Manohar
Abstract: Model predictive control (MPC) is an industry standard control technique that iteratively solves an open-loop optimization problem to guide a system towards a desired state or trajectory. Consequently, an accurate forward model of system dynamics is critical for the efficacy of MPC and much recent work has been aimed at the use of neural networks to act as data-driven surrogate models to enable MPC. Perhaps the most common network architecture applied to this task is the recurrent neural network (RNN) due to its natural interpretation as a dynamical system. In this work, we assess the ability of RNN variants to both learn the dynamics of benchmark control systems and serve as surrogate models for MPC. We find that echo state networks (ESNs) have a variety of benefits over competing architectures, namely reductions in computational complexity, longer valid prediction times, and reductions in cost of the MPC objective function.
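A minimal echo state network sketch, the kind of surrogate assessed above: a fixed random reservoir with a ridge-regression readout. The sizes, spectral radius, and one-step prediction task are illustrative choices, not the paper's configuration.

```python
# ESN sketch: only the linear readout is trained, which is the main source of
# the computational savings relative to fully trained RNNs.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 200
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))     # spectral radius below 1 (echo state property)

def run_reservoir(u):
    """u: (T, n_in) input sequence -> (T, n_res) reservoir states."""
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        x = np.tanh(W_in @ u_t + W @ x)
        states.append(x.copy())
    return np.array(states)

# Fit the readout to predict the next input value (a one-step dynamics surrogate).
u = np.sin(np.linspace(0, 60, 3000))[:, None]
X, Y = run_reservoir(u[:-1]), u[1:]
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)
print("train MSE:", float(np.mean((X @ W_out - Y) ** 2)))
```

Once fit, the surrogate can be rolled forward inside the MPC optimization loop in place of the true dynamics.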
Authors: Hala Hawashin
Abstract: Despite significant advances in quantum computing across various domains, research on applying quantum approaches to language compositionality, such as modeling linguistic structures and interactions, remains limited. This gap extends to the integration of quantum language data with real-world data from sources like images, video, and audio. This thesis explores how quantum computational methods can enhance the compositional modeling of language through multimodal data integration. Specifically, it advances Multimodal Quantum Natural Language Processing (MQNLP) by applying the Lambeq toolkit to conduct a comparative analysis of four compositional models and evaluate their influence on image-text classification tasks. Results indicate that syntax-based models, particularly DisCoCat and TreeReader, excel in effectively capturing grammatical structures, while bag-of-words and sequential models struggle due to limited syntactic awareness. These findings underscore the potential of quantum methods to enhance language modeling and drive breakthroughs as quantum technology evolves.
Authors: Tianbo Li, Min Lin, Stephen Dale, Zekun Shi, A. H. Castro Neto, Kostya S. Novoselov, Giovanni Vignale
Abstract: We present a novel approach to address the challenges of variable occupation numbers in direct optimization of density functional theory (DFT). By parameterizing both the eigenfunctions and the occupation matrix, our method minimizes the free energy with respect to these parameters. As the stationary conditions require the occupation matrix and the Kohn-Sham Hamiltonian to be simultaneously diagonalizable, this leads to the concept of ``self-diagonalization,'' where, by assuming a diagonal occupation matrix without loss of generality, the Hamiltonian matrix naturally becomes diagonal at stationary points. Our method incorporates physical constraints on both the eigenfunctions and the occupations into the parameterization, transforming the constrained optimization into a fully differentiable unconstrained problem, which is solvable via gradient descent. Implemented in JAX, our method was tested on aluminum and silicon, confirming that it achieves efficient self-diagonalization, produces the correct Fermi-Dirac distribution of the occupation numbers, and yields band structures consistent with those obtained with SCF methods in Quantum Espresso.
Authors: Mansi Sakarvadia
Abstract: Answering multi-hop reasoning questions requires retrieving and synthesizing information from diverse sources. Language models (LMs) struggle to perform such reasoning consistently. We propose an approach to pinpoint and rectify multi-hop reasoning failures through targeted memory injections on LM attention heads. First, we analyze the per-layer activations of GPT-2 models in response to single- and multi-hop prompts. We then propose a mechanism that allows users to inject relevant prompt-specific information, which we refer to as "memories," at critical LM locations during inference. By thus enabling the LM to incorporate additional relevant information during inference, we enhance the quality of multi-hop prompt completions. We empirically show that a simple, efficient, and targeted memory injection into a key attention layer often increases the probability of the desired next token in multi-hop tasks by up to 424%. We observe that small subsets of attention heads can significantly impact the model prediction during multi-hop reasoning. To more faithfully interpret these heads, we develop Attention Lens: an open source tool that translates the outputs of attention heads into vocabulary tokens via learned transformations called lenses. We demonstrate the use of lenses to reveal how a model arrives at its answer and use them to localize sources of model failures such as biased and malicious language generation.
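A generic sketch of injecting a vector into a GPT-2 layer's hidden states at inference time via a forward hook. This is only one plausible mechanism: the paper targets attention heads specifically, and how the memory vector is constructed is not reproduced here; the layer index, scale, and placeholder vector are assumptions.

```python
# Hook-based "memory" injection into a GPT-2 block during inference.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

memory = torch.zeros(model.config.n_embd)
memory[:8] = 1.0                     # placeholder "memory"; a real one would encode a fact
layer_idx, scale = 6, 0.1            # injection site and strength (assumptions)

def inject(module, inputs, output):
    hidden = output[0]               # GPT-2 block output is a tuple; [0] is hidden states
    hidden = hidden + scale * memory.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(inject)
ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits
handle.remove()
print(tok.decode([int(logits[0, -1].argmax())]))
```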
Authors: Mohamed Hussein Abo El-Ela, Ali Hamdi Fergany
Abstract: This paper introduces a novel approach to urban pathfinding by transforming traditional heuristic-based algorithms into deep learning models that leverage real-time contextual data, such as traffic and weather conditions. We propose two methods: an enhanced A* algorithm that dynamically adjusts routes based on current environmental conditions, and a neural network model that predicts the next optimal path segment using historical and live data. An extensive benchmark was conducted to compare the performance of different deep learning models, including MLP, GRU, LSTM, Autoencoders, and Transformers. Both methods were evaluated in a simulated urban environment in Berlin, with the neural network model outperforming traditional methods, reducing travel times by up to 40%, while the enhanced A* algorithm achieved a 34% improvement. These results demonstrate the potential of deep learning to optimize urban navigation in real time, providing more adaptable and efficient routing solutions.
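A sketch of the "enhanced A*" idea: edge costs are scaled by live context (e.g., traffic multipliers) at query time. The toy graph, the multipliers, and the straight-line heuristic are our assumptions, not the paper's Berlin setup.

```python
# A* where each edge's base cost is multiplied by a live condition factor.
import heapq
import math

# graph[u] = list of (v, base_cost); coords feed the straight-line heuristic
graph = {"A": [("B", 1.0), ("C", 2.5)], "B": [("D", 2.0)], "C": [("D", 1.5)], "D": []}
coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 0)}
traffic = {("A", "B"): 3.0}            # live congestion multiplier on edge A->B

def a_star(start, goal):
    def h(u):                          # admissible straight-line heuristic
        (x1, y1), (x2, y2) = coords[u], coords[goal]
        return math.hypot(x2 - x1, y2 - y1)
    frontier = [(h(start), 0.0, start, [start])]
    best = {}
    while frontier:
        f, g, u, path = heapq.heappop(frontier)
        if u == goal:
            return g, path
        if best.get(u, math.inf) <= g:
            continue
        best[u] = g
        for v, base in graph[u]:
            cost = base * traffic.get((u, v), 1.0)   # dynamic adjustment
            heapq.heappush(frontier, (g + cost + h(v), g + cost, v, path + [v]))
    return math.inf, []

print(a_star("A", "D"))   # congestion on A->B makes A->C->D the cheaper route
```

The same search code serves the static baseline by dropping the multiplier lookup, which makes the comparison in the abstract straightforward to reproduce in spirit.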
Authors: Rongjie Yi, Xiang Li, Weikai Xie, Zhenyan Lu, Chenghua Wang, Ao Zhou, Shangguang Wang, Xiwen Zhang, Mengwei Xu
Abstract: The interest in developing small language models (SLMs) for on-device deployment is growing fast. However, existing SLM designs hardly consider device hardware characteristics. Instead, this work presents a simple yet effective principle for SLM design: architecture searching for (near-)optimal runtime efficiency before pre-training. Guided by this principle, we develop the PhoneLM SLM family (currently with 0.5B and 1.5B versions), which achieves a state-of-the-art capability-efficiency tradeoff among models of similar parameter size. We fully open-source the code, weights, and training datasets of PhoneLM for reproducibility and transparency, including both base and instructed versions. We also release a finetuned version of PhoneLM capable of accurate Android Intent invocation, and an end-to-end Android demo. All materials are available at https://github.com/UbiquitousLearning/PhoneLM.
Authors: Jesse Yao, Saurav Acharya, Priyaranjan Parida, Srinivas Attipalli, Ali Dasdan
Abstract: Enterprise search requires users to have complex knowledge of queries, configurations, and metadata, rendering it difficult for them to access information as needed. Most go-to-market (GTM) platforms utilize advanced search, an interface that enables users to filter queries by various fields using categories or keywords; historically, however, this has proven exceedingly cumbersome, as users face seemingly hundreds of options, fields, and buttons. Consequently, querying with natural language has long been ideal, a notion further empowered by Large Language Models (LLMs). In this paper, we implement and evaluate a solution for the Zoominfo product for sellers, which prompts the LLM with natural language, producing search fields through entity extraction that are then converted into a search query. The intermediary search fields offer numerous advantages for each query, including the elimination of syntax errors, simpler ground truths, and an intuitive format for the LLM to interpret. We paired this pipeline with advanced prompt engineering strategies, featuring an intricate system message, few-shot prompting, chain-of-thought (CoT) reasoning, and execution refinement. Furthermore, we manually created the ground truth for 500+ natural language queries, enabling the supervised fine-tuning of Llama-3-8B-Instruct and the introduction of sophisticated numerical metrics. Comprehensive experiments with closed-source, open-source, and fine-tuned LLM models were conducted through exact, Jaccard, cosine, and semantic similarity on individual search entities to demonstrate the efficacy of our approach. Overall, the most accurate closed model had an average accuracy of 97% per query, with only one field performing under 90%, and comparable results were observed from the fine-tuned models.
Authors: Yongqi Jiang, Yansong Gao, Chunyi Zhou, Hongsheng Hu, Anmin Fu, Willy Susilo
Abstract: With the growing application of Deep Learning (DL), and especially the recent spectacular achievements of Large Language Models (LLMs) such as ChatGPT and LLaMA, the commercial significance of these remarkable models has soared. However, acquiring well-trained models is costly and resource-intensive. It requires a considerable high-quality dataset, substantial investment in dedicated architecture design, expensive computational resources, and efforts to develop technical expertise. Consequently, safeguarding the Intellectual Property (IP) of well-trained models is attracting increasing attention. In contrast to existing surveys, which focus mainly on model-level Intellectual Property Protection (IPP), this survey covers not only the protection of model-level intelligence but also that of valuable dataset intelligence. First, according to the requirements for effective IPP design, this work systematically summarizes general and scheme-specific performance evaluation metrics. Second, from the perspectives of proactive IP infringement prevention and reactive IP ownership verification, it comprehensively investigates and analyzes existing IPP methods for both dataset and model intelligence. Additionally, from the standpoint of training settings, it delves into the unique challenges that distributed settings pose to IPP compared to centralized settings. Furthermore, this work examines various attacks faced by deep IPP techniques. Finally, we outline prospects for promising future directions that may serve as a guide for innovative research.
Authors: Youssef Boulaimen, Gabriele Fossi, Leila Outemzabet, Nathalie Jeanray, Oleksandr Levenets, Stephane Gerart, Sebastien Vachenc, Salvatore Raieli, Joanna Giemza
Abstract: The classification of genetic variants, particularly Variants of Uncertain Significance (VUS), poses a significant challenge in clinical genetics and precision medicine. Large Language Models (LLMs) have emerged as transformative tools in this realm. These models can uncover intricate patterns and predictive insights that traditional methods might miss, thus enhancing the predictive accuracy of genetic variant pathogenicity. This study investigates the integration of state-of-the-art LLMs, including GPN-MSA, ESM1b, and AlphaMissense, which leverage DNA and protein sequence data alongside structural insights to form a comprehensive analytical framework for variant classification. Our approach evaluates these integrated models using the well-annotated ProteinGym and ClinVar datasets, setting new benchmarks in classification performance. The models were rigorously tested on a set of challenging variants, demonstrating substantial improvements over existing state-of-the-art tools, especially in handling ambiguous and clinically uncertain variants. The results of this research underline the efficacy of combining multiple modeling approaches to significantly refine the accuracy and reliability of genetic variant classification systems. These findings support the deployment of these advanced computational models in clinical environments, where they can significantly enhance the diagnostic processes for genetic disorders, ultimately pushing the boundaries of personalized medicine by offering more detailed and actionable genetic insights.
Authors: David Chapman, Parniyan Farvardin
Abstract: We present a novel empirical approach toward measuring the Probability Density Function (PDF) of the deep features of Convolutional Neural Networks (CNNs). Measuring the deep feature PDF is valuable for several reasons: (a) understanding the deep feature PDF yields new insight into deep representations, and (b) feature density methods are important for tasks such as anomaly detection, which can improve the robustness of deep learning models in the wild. Interpretable measurement of the deep feature PDF is challenging due to the Curse of Dimensionality (CoD) and the spatial intuition limitation. Our novel measurement technique combines copula analysis with the Method of Orthogonal Moments (MOM) in order to directly measure the Generalized Characteristic Function (GCF) of the multivariate deep feature PDF. We find that, surprisingly, the one-dimensional marginals of non-negative deep CNN features after major blocks are not well approximated by a Gaussian distribution, and that these features increasingly approximate an exponential distribution with increasing network depth. Furthermore, we observe that deep features become increasingly independent with increasing network depth within their typical ranges. However, surprisingly, we also observe that many deep features exhibit strong dependence (either correlation or anti-correlation) with other extremely strong detections, even when these features are independent within typical ranges. We elaborate on these findings in our discussion, where we propose a new hypothesis: exponentially infrequent large-valued features correspond to strong computer vision detections of semantic targets, which would imply that these large-valued features are not outliers but rather an important detection signal.
Authors: Rina Bao, Sheng He, Ellen Grant, Yangming Ou
Abstract: Hypoxic-Ischemic Encephalopathy (HIE) affects 1 to 5 out of every 1,000 newborns, with 30% to 50% of cases resulting in adverse neurocognitive outcomes. However, these outcomes can only be reliably assessed as early as age 2. Therefore, early and accurate prediction of HIE-related neurocognitive outcomes using deep learning models is critical for improving clinical decision-making, guiding treatment decisions, and assessing novel therapies. A major challenge in developing deep learning models for this purpose is the scarcity of large, annotated HIE datasets. We have assembled the first and largest public dataset; however, it contains only 156 cases with 2-year neurocognitive outcome labels. In contrast, we have collected 8,859 normal brain Magnetic Resonance Images (MRIs) spanning ages 0-97 years that are available for brain age estimation using deep learning models. In this paper, we introduce AGE2HIE, which transfers knowledge learned by deep learning models from healthy controls' brain MRIs to a diseased cohort, from structural to diffusion MRIs, from regression of continuous age estimation to prediction of binary neurocognitive outcomes, and from lifespan ages (0-97 years) to infancy (0-2 weeks). Compared to training from scratch, transfer learning from brain age estimation significantly improves not only prediction accuracy (3% or 2% improvement in same-site or multi-site settings, respectively), but also model generalization across different sites (5% improvement in cross-site validation).
Authors: Turker Berk Donmez
Abstract: In democratic societies, electoral systems play a crucial role in translating public preferences into political representation. Among these, the D'Hondt method is widely used to ensure proportional representation, balancing fair representation with governmental stability. Recently, there has been a growing interest in applying similar principles of proportional representation to enhance interpretability in machine learning, specifically in Explainable AI (XAI). This study investigates the integration of D'Hondt-based voting principles in the DhondtXAI method, which leverages resource allocation concepts to interpret feature importance within AI models. Through a comparison of SHAP (Shapley Additive Explanations) and DhondtXAI, we evaluate their effectiveness in feature attribution within CatBoost and XGBoost models for breast cancer and diabetes prediction, respectively. The DhondtXAI approach allows for alliance formation and thresholding to enhance interpretability, representing feature importance as seats in a parliamentary view. Statistical correlation analyses between SHAP values and DhondtXAI allocations support the consistency of interpretations, demonstrating DhondtXAI's potential as a complementary tool for understanding feature importance in AI models. The results highlight that integrating electoral principles, such as proportional representation and alliances, into AI explainability can improve user understanding, especially in high-stakes fields like healthcare.
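The D'Hondt allocation at the core of DhondtXAI, in its classic electoral form: repeatedly award a seat to the party with the highest quotient votes / (seats won + 1). The feature-importance "votes" and the 10-seat total below are illustrative assumptions.

```python
# Classic D'Hondt seat allocation by highest quotients.
def dhondt(votes, seats):
    won = {party: 0 for party in votes}
    for _ in range(seats):
        best = max(votes, key=lambda p: votes[p] / (won[p] + 1))
        won[best] += 1
    return won

# e.g., treat (grouped) feature importances as vote counts
importances = {"tumor_size": 340_000, "age": 280_000, "glucose": 160_000, "bmi": 60_000}
print(dhondt(importances, seats=10))
# -> {'tumor_size': 4, 'age': 4, 'glucose': 2, 'bmi': 0}
```

Alliances and thresholds, as mentioned in the abstract, amount to merging vote counts or zeroing out parties below a cutoff before running the same loop.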
Authors: Graham M. Jones, Shai Satran, Arvind Satyanarayan
Abstract: This article proposes a new integration of linguistic anthropology and machine learning (ML) around convergent interests in both the underpinnings of language and making language technologies more socially responsible. While linguistic anthropology focuses on interpreting the cultural basis for human language use, the ML field of interpretability is concerned with uncovering the patterns that Large Language Models (LLMs) learn from human verbal behavior. Through the analysis of a conversation between a human user and an LLM-powered chatbot, we demonstrate the theoretical feasibility of a new, conjoint field of inquiry, cultural interpretability (CI). By focusing attention on the communicative competence involved in the way human users and AI chatbots co-produce meaning in the articulatory interface of human-computer interaction, CI emphasizes how the dynamic relationship between language and culture makes contextually sensitive, open-ended conversation possible. We suggest that, by examining how LLMs internally "represent" relationships between language and culture, CI can: (1) provide insight into long-standing linguistic anthropological questions about the patterning of those relationships; and (2) aid model developers and interface designers in improving value alignment between language models and stylistically diverse speakers and culturally diverse speech communities. Our discussion proposes three critical research axes: relativity, variation, and indexicality.
Authors: David Willmes, Nick Krall, James Tanis, Zachary Terner, Fernando Tavares, Joe Haberlin III, Matt Crichton, Alexander Schlichting
Abstract: With billions of people facing moderate or severe food insecurity, the resilience of the global food supply will be of increasing concern due to the effects of climate change and geopolitical events. In this paper we describe a framework to better identify food security hotspots using a combination of remote sensing, deep learning, crop yield modeling, and causal modeling of the food distribution system. While we feel that the methods are adaptable to other regions of the world, we focus our analysis on the wheat breadbasket of northern India, which supplies a large percentage of the world's population. We present a quantitative analysis of deep learning domain adaptation methods for wheat farm identification based on curated remote sensing data from France. We model climate change impacts on crop yields using the existing crop yield modeling tool WOFOST and we identify key drivers of crop simulation error using a longitudinal penalized functional regression. A description of a system dynamics model of the food distribution system in India is also presented, along with results of food insecurity identification based on seeding this model with the predicted crop yields.
Authors: Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris M. Kitani, L\'aszl\'o Jeni
Abstract: Transformers are slow to train on videos due to extremely large numbers of input tokens, even though many video tokens are repeated over time. Existing methods to remove such uninformative tokens either have significant overhead, negating any speedup, or require tuning for different datasets and examples. We present Run-Length Tokenization (RLT), a simple approach to speed up video transformers inspired by run-length encoding for data compression. RLT efficiently finds and removes runs of patches that are repeated over time prior to model inference, then replaces them with a single patch and a positional encoding to represent the resulting token's new length. Our method is content-aware, requiring no tuning for different datasets, and fast, incurring negligible overhead. RLT yields a large speedup in training, reducing the wall-clock time to fine-tune a video transformer by 30% while matching baseline model performance. RLT also works without any training, increasing model throughput by 35% with only 0.1% drop in accuracy. RLT speeds up training at 30 FPS by more than 100%, and on longer video datasets, can reduce the token count by up to 80%. Our project page is at https://rccchoudhury.github.io/projects/rlt/.
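A content-aware sketch of the run-length idea: drop patches that repeat (within a tolerance) at the same spatial location across frames, keeping one token per run plus its length. The threshold and patch size are illustrative, not the paper's settings.

```python
# Run-length tokenization of video patches: one token stands in for each
# temporal run of (near-)identical patches.
import numpy as np

rng = np.random.default_rng(0)
T, H, W, P = 8, 32, 32, 16                      # frames, height, width, patch size
video = np.repeat(rng.random((1, H, W)), T, 0)  # static video: every frame identical
video[5:] += 0.5                                # content change at frame 5

def run_length_tokenize(video, thresh=0.01):
    ph, pw = H // P, W // P
    # patches[t, n]: flattened patch n of frame t
    patches = video.reshape(T, ph, P, pw, P).transpose(0, 1, 3, 2, 4).reshape(T, ph * pw, P * P)
    tokens = []                                  # (frame, patch_idx, run_length)
    for n in range(patches.shape[1]):
        t = 0
        while t < T:
            run = 1
            while t + run < T and np.abs(patches[t + run, n] - patches[t, n]).mean() < thresh:
                run += 1
            tokens.append((t, n, run))           # one token per run, with its length
            t += run
    return tokens

tokens = run_length_tokenize(video)
print(f"{len(tokens)} tokens instead of {T * (H // P) * (W // P)}")  # 8 vs 32 here
```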
Authors: Boqi Chen, Yuanzhi Zhu, Yunke Ao, Sebastiano Caprara, Reto Sutter, Gunnar R\"atsch, Ender Konukoglu, Anna Susmelj
Abstract: Single-source domain generalization (SDG) aims to learn a model from a single source domain that can generalize well on unseen target domains. This is an important task in computer vision, particularly relevant to medical imaging where domain shifts are common. In this work, we consider a challenging yet practical setting: SDG for cross-modality medical image segmentation. We combine causality-inspired theoretical insights on learning domain-invariant representations with recent advancements in diffusion-based augmentation to improve generalization across diverse imaging modalities. Guided by the ``intervention-augmentation equivariant'' principle, we use controlled diffusion models (DMs) to simulate diverse imaging styles while preserving the content, leveraging rich generative priors in large-scale pretrained DMs to comprehensively perturb the multidimensional style variable. Extensive experiments on challenging cross-modality segmentation tasks demonstrate that our approach consistently outperforms state-of-the-art SDG methods across three distinct anatomies and imaging modalities. The source code is available at \href{https://github.com/ratschlab/ICMSeg}{https://github.com/ratschlab/ICMSeg}.
URLs: https://github.com/ratschlab/ICMSeg
Authors: Adriana Caraeni, Alexander Scarlatos, Andrew Lan
Abstract: Recent advances in generative artificial intelligence (AI) have shown promise in accurately grading open-ended student responses. However, few prior works have explored grading handwritten responses due to a lack of data and the challenge of combining visual and textual information. In this work, we leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams. Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques. We find that while providing rubrics improves alignment, the model's overall accuracy is still too low for real-world settings, showing there is significant room for growth in this task.
Authors: Yingying Fang, Zihao Jin, Shaojie Guo, Jinda Liu, Yijian Gao, Junzhi Ning, Zhiling Yue, Zhi Li, Simon LF Walsh, Guang Yang
Abstract: Despite significant advancements in report generation methods, a critical limitation remains: the lack of interpretability in the generated text. This paper introduces an innovative approach to enhance the explainability of text generated by report generation models. Our method employs cyclic text manipulation and visual comparison to identify and elucidate the features in the original content that influence the generated text. By manipulating the generated reports and producing corresponding images, we create a comparative framework that highlights key attributes and their impact on the text generation process. This approach not only identifies the image features aligned with the generated text, but also improves transparency and provides deeper insights into the decision-making mechanisms of the report generation models. Our findings demonstrate the potential of this method to significantly enhance the interpretability and transparency of AI-generated reports.
Authors: Chi-en Amy Tai, Oustan Ding, Alexander Wong
Abstract: In the United States, skin cancer ranks as the most commonly diagnosed cancer, presenting a significant public health issue due to its high rates of occurrence and the risk of serious complications if not caught early. Recent advancements in dataset curation and deep learning have shown promise in quick and accurate detection of skin cancer. However, current open-source datasets have significant class imbalances which impede the effectiveness of these deep learning models. In healthcare, generative artificial intelligence (AI) models have been employed to create synthetic data, addressing data imbalance in datasets by augmenting underrepresented classes and enhancing the overall quality and performance of machine learning models. In this paper, we build on previous work by leveraging new advancements in generative AI, notably Stable Diffusion and DreamBooth. We introduce Cancer-Net SCa-Synth, an open-access, synthetically generated 2D skin lesion dataset for skin cancer classification. A further analysis of data effectiveness, comparing ISIC 2020 test-set performance when training a simple model with and without these synthetic images, highlights the benefits of leveraging synthetic data to improve performance. Cancer-Net SCa-Synth is publicly available at https://github.com/catai9/Cancer-Net-SCa-Synth as part of a global open-source initiative for accelerating machine learning for cancer care.
Authors: Sreyas Venkataraman, Yufei Wang, Ziyu Wang, Zackory Erickson, David Held
Abstract: Offline reinforcement learning can enable policy learning from pre-collected, sub-optimal datasets without online interactions. This makes it ideal for real-world robots and safety-critical scenarios, where collecting online data or expert demonstrations is slow, costly, and risky. However, most existing offline RL works assume the dataset is already labeled with the task rewards, a process that often requires significant human effort, especially when ground-truth states are hard to ascertain (e.g., in the real world). In this paper, we build on prior work, specifically RL-VLM-F, and propose a novel system that automatically generates reward labels for offline datasets using preference feedback from a vision-language model and a text description of the task. Our method then learns a policy using offline RL with the reward-labeled dataset. We demonstrate the system's applicability to a complex real-world robot-assisted dressing task, where we first learn a reward function using a vision-language model on a sub-optimal offline dataset, and then use the learned reward with Implicit Q-Learning to develop an effective dressing policy. Our method also performs well in simulation tasks involving the manipulation of rigid and deformable objects, and significantly outperforms baselines such as behavior cloning and inverse RL. In summary, we propose a new system that enables automatic reward labeling and policy learning from unlabeled, sub-optimal offline datasets.
Authors: Saksham Rastogi, Danish Pruthi
Abstract: Amidst rising concerns about the internet being proliferated with content generated from language models (LMs), watermarking is seen as a principled way to certify whether text was generated from a model. Many recent watermarking techniques slightly modify the output probabilities of LMs to embed a signal in the generated output that can later be detected. Since early proposals for text watermarking, questions about their robustness to paraphrasing have been prominently discussed. Lately, some techniques are deliberately designed and claimed to be robust to paraphrasing. However, such watermarking schemes do not adequately account for the ease with which they can be reverse-engineered. We show that with access to only a limited number of generations from a black-box watermarked model, we can drastically increase the effectiveness of paraphrasing attacks to evade watermark detection, thereby rendering the watermark ineffective.
Authors: Zijian Hu, Jipeng Zhang, Rui Pan, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, Tong Zhang
Abstract: We present Fox-1, a series of small language models (SLMs) consisting of Fox-1-1.6B and Fox-1-1.6B-Instruct-v0.1. These models are pre-trained on 3 trillion tokens of web-scraped document data and fine-tuned with 5 billion tokens of instruction-following and multi-turn conversation data. Aiming to improve pre-training efficiency, the Fox-1-1.6B model introduces a novel 3-stage data curriculum across all the training data with 2K-8K sequence lengths. In architecture design, Fox-1 features a deeper layer structure, an expanded vocabulary, and Grouped Query Attention (GQA), offering a performant and efficient architecture compared to other SLMs. Fox-1 achieves better or on-par performance on various benchmarks compared to StableLM-2-1.6B, Gemma-2B, Qwen1.5-1.8B, and OpenELM-1.1B, with competitive inference speed and throughput. The model weights have been released under the Apache 2.0 license, where we aim to promote the democratization of LLMs and make them fully accessible to the whole open-source community.
Authors: Akshat Ramachandran, Souvik Kundu, Tushar Krishna
Abstract: Quantization of foundational models (FMs) is significantly more challenging than traditional DNNs due to the emergence of large magnitude features called outliers. Existing outlier-aware algorithm/architecture co-design techniques either use mixed-precision, retaining outliers at high precision but compromise hardware efficiency, or quantize inliers and outliers at the same precision, improving hardware efficiency at the cost of accuracy. To address this mutual exclusivity, in this paper, we propose MicroScopiQ, a novel co-design technique that leverages pruning to complement outlier-aware quantization. MicroScopiQ retains outliers at higher precision while pruning a certain fraction of least important weights to distribute the additional outlier bits; ensuring high accuracy, aligned memory and hardware efficiency. We design a high-throughput, low overhead accelerator architecture composed of simple multi-precision INT processing elements and a novel network-on-chip called ReCoN that efficiently abstracts the complexity of supporting high-precision outliers. Additionally, unlike existing alternatives, MicroScopiQ does not assume any locality of outlier weights, enabling applicability to a broad range of FMs. Extensive experiments across various quantization settings show that MicroScopiQ achieves SoTA quantization performance while simultaneously improving inference performance by 3x and reducing energy by 2x over existing alternatives.
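An illustrative sketch of the co-design idea: keep outlier weights at high precision, prune the least important inliers to "make room" for the extra outlier bits, and quantize the surviving inliers to low-bit integers. The threshold, bit-widths, and pruning budget are assumptions; this is not the MicroScopiQ algorithm or its accelerator datapath.

```python
# Outlier-aware quantization complemented by pruning (toy numpy version).
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
w[rng.choice(1024, 8, replace=False)] *= 20.0     # inject a few large-magnitude outliers

outlier_mask = np.abs(w) > 4.0 * w.std()          # outliers kept at full precision
keep = ~outlier_mask
inlier_idx = np.where(keep)[0]

# Prune as many of the smallest-magnitude inliers as there are outliers,
# roughly trading their bits for the outliers' extra precision.
n_prune = int(outlier_mask.sum())
prune_idx = inlier_idx[np.argsort(np.abs(w[inlier_idx]))[:n_prune]]
keep[prune_idx] = False                            # pruned weights become zero

# Symmetric INT4 quantization of the surviving inliers.
inliers = np.where(keep, w, 0.0)
scale = np.abs(inliers).max() / 7.0
q = np.clip(np.round(inliers / scale), -8, 7).astype(np.int8)
recon = q * scale + np.where(outlier_mask, w, 0.0)  # outliers re-inserted at full precision
print("outliers:", int(outlier_mask.sum()),
      " mean abs error:", float(np.abs(recon - w).mean()))
```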
Authors: Ziwei Su, Diego Klabjan
Abstract: Stochastic simulation models are generative models that mimic complex systems to help with decision-making. The reliability of these models heavily depends on well-calibrated input model parameters. However, in many practical scenarios, only output-level data are available to learn the input model parameters, which is challenging due to the often intractable likelihood of the stochastic simulation model. Moreover, stochastic simulation models are frequently inexact, with discrepancies between the model and the target system. No existing methods can effectively learn and quantify the uncertainties of input parameters using only output-level data. In this paper, we propose to learn differentiable input parameters of stochastic simulation models using output-level data via kernel score minimization with stochastic gradient descent. We quantify the uncertainties of the learned input parameters using a frequentist confidence set procedure based on a new asymptotic normality result that accounts for model inexactness. The proposed method is evaluated on exact and inexact G/G/1 queueing models.
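A minimal sketch of the training idea, assuming the energy score (a strictly proper kernel score) and a toy differentiable location-scale "simulator" standing in for a queueing model; the paper's confidence-set procedure for uncertainty quantification is not reproduced here.

```python
import torch

def simulate(theta, n):
    # Toy differentiable simulator via reparameterized noise (an assumption).
    eps = torch.randn(n, 1)
    return theta[0] + theta[1].abs() * eps

def energy_score(samples, y):
    # Energy score, a strictly proper kernel score: E||X-y|| - 0.5*E||X-X'||.
    term1 = torch.cdist(samples, y.view(1, -1)).mean()
    term2 = torch.cdist(samples, samples).mean()
    return term1 - 0.5 * term2

theta = torch.tensor([0.0, 1.0], requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.05)
data = 3.0 + 0.5 * torch.randn(256, 1)           # output-level observations

for step in range(500):
    y = data[torch.randint(len(data), (1,))].squeeze(0)
    opt.zero_grad()
    loss = energy_score(simulate(theta, 64), y)  # kernel score minimization
    loss.backward()
    opt.step()
# theta should drift toward the data-generating values (3.0, 0.5).
```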
Authors: Wentang Song, Zhiyuan Yan, Yuzhen Lin, Taiping Yao, Changsheng Chen, Shen Chen, Yandan Zhao, Shouhong Ding, Bin Li
Abstract: This paper addresses the generalization issue in deepfake detection by harnessing forgery quality in training data. Generally, the forgery quality of different deepfakes varies: some have easily recognizable forgery clues, while others are highly realistic. Existing works often train detectors on a mix of deepfakes with varying forgery qualities, potentially leading detectors to take shortcuts via the easy-to-spot artifacts of low-quality forgery samples, thereby hurting generalization performance. To tackle this issue, we propose a novel quality-centric framework for generic deepfake detection, which is composed of a Quality Evaluator, a low-quality data enhancement module, and a learning pacing strategy that explicitly incorporates forgery quality into the training process. The framework is inspired by curriculum learning and is designed to gradually enable the detector to learn more challenging deepfake samples, starting with easier samples and progressing to more realistic ones. We employ both static and dynamic assessments of forgery quality, combining their scores to produce a final rating for each training sample. The rating score guides the selection of deepfake samples for training, with higher-rated samples having a higher probability of being chosen. Furthermore, we propose a novel frequency data augmentation method specifically designed for low-quality forgery samples, which helps to reduce obvious forgery traces and improve their overall realism. Extensive experiments show that our method can be applied in a plug-and-play manner and significantly enhances generalization performance.
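The rating-guided selection can be pictured as temperature-controlled sampling over per-sample quality ratings, as in the sketch below; the equal-weight fusion of static and dynamic scores and the linear temperature schedule are illustrative assumptions rather than the paper's exact pacing strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
static_score = rng.random(n)    # stand-in for the Quality Evaluator's rating
dynamic_score = rng.random(n)   # stand-in for a training-time assessment

# Combine the two assessments into one rating per sample (equal weights assumed).
rating = 0.5 * static_score + 0.5 * dynamic_score

def sample_batch(rating, batch_size, temperature):
    # Higher-rated (more realistic) samples get a higher selection probability;
    # the temperature paces the curriculum from near-uniform to hard-focused.
    p = np.exp(rating / temperature)
    p /= p.sum()
    return rng.choice(len(rating), size=batch_size, replace=False, p=p)

for temperature in np.linspace(2.0, 0.2, 10):   # easy mix -> realistic focus
    batch = sample_batch(rating, batch_size=256, temperature=temperature)
```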
Authors: Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos
Abstract: Current automated fact-checking (AFC) approaches commonly evaluate evidence either implicitly via the predicted verdicts or by comparing retrieved evidence with a predefined closed knowledge source, such as Wikipedia. However, these methods suffer from limitations, resulting from their reliance on evaluation metrics developed for different purposes and constraints imposed by closed knowledge sources. Recent advances in natural language generation (NLG) evaluation offer new possibilities for evidence assessment. In this work, we introduce Ev2R, an evaluation framework for AFC that comprises three types of approaches for evidence evaluation: reference-based, proxy-reference, and reference-less. We evaluate their effectiveness through agreement with human ratings and adversarial tests, and demonstrate that prompt-based scorers, particularly those leveraging LLMs and reference evidence, outperform traditional evaluation approaches.
Authors: Hendrik Krone, Pierre Haritz, Thomas Liebig
Abstract: The topics of Artificial Intelligence (AI) and especially Machine Learning (ML) are increasingly making their way into educational curricula. To facilitate access for students, a variety of platforms, visual tools, and digital games are already being used to introduce ML concepts and strengthen the understanding of how AI works. We take a look at didactic principles that are employed for teaching computer science, define criteria, and, based on those, evaluate a selection of prominent existing platforms, tools, and games. Additionally, we criticize the common approach of portraying ML mostly as a black box and the resulting lack of focus on building an understanding of the data, algorithms, and models involved. To tackle this issue, we present ICE-T, a concept covering intermodal transfer, computational and explanatory thinking, as an extension of known didactic principles. With our multi-faceted concept, we believe that planners of learning units, creators of learning platforms, and educators can improve the teaching of ML.
Authors: Yuanyuan Li, Philipp Grohs, Philipp Petersen
Abstract: In recent work it has been shown that determining a feedforward ReLU neural network to within high uniform accuracy from point samples suffers from the curse of dimensionality in terms of the number of samples needed. As a consequence, feedforward ReLU neural networks are of limited use for applications where guaranteed high uniform accuracy is required. We consider the question of whether the sampling complexity can be improved by restricting the specific neural network architecture. To this end, we investigate invertible residual neural networks, which are foundational architectures in deep learning and are widely employed in models that power modern generative methods. Our main result shows that the residual neural network architecture and invertibility do not help overcome the complexity barriers encountered with simpler feedforward architectures. Specifically, we demonstrate that the computational complexity of approximating invertible residual neural networks from point samples in the uniform norm suffers from the curse of dimensionality. Similar results are established for invertible convolutional residual neural networks.
Authors: Mohammad Imran Hossain, Muhammad Zain Amin, Daniel Tweneboah Anyimadu, Taofik Ahmed Suleiman
Abstract: Automatic brain tissue segmentation from Magnetic Resonance Imaging (MRI) images is vital for accurate diagnosis and further analysis in medical imaging. Despite advancements in segmentation techniques, a comprehensive comparison between traditional statistical methods and modern deep learning approaches using pre-processing techniques like N4 Bias Field Correction and Anisotropic Diffusion remains underexplored. This study provides a comparative analysis of various segmentation models, including Probabilistic ATLAS, U-Net, nnU-Net, and LinkNet, enhanced with these pre-processing techniques to segment brain tissues (white matter (WM), grey matter (GM), and cerebrospinal fluid (CSF)) on the Internet Brain Segmentation Repository (IBSR18) dataset. Our results demonstrate that the 3D nnU-Net model outperforms others, achieving the highest mean Dice Coefficient score (0.937 +- 0.012), while the 2D nnU-Net model recorded the lowest mean Hausdorff Distance (5.005 +- 0.343 mm) and the lowest mean Absolute Volumetric Difference (3.695 +- 2.931 mm) across five unseen test samples. The findings highlight the superiority of nnU-Net models in brain tissue segmentation, particularly when combined with N4 Bias Field Correction and Anisotropic Diffusion pre-processing techniques. Our implemented code can be accessed via GitHub.
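For reference, the headline Dice metric can be computed per tissue class as in this standard sketch (not code from the authors' repository):

```python
import numpy as np

def dice_coefficient(pred, target, label):
    # Dice score for one tissue class between predicted and ground-truth volumes.
    p, t = (pred == label), (target == label)
    denom = p.sum() + t.sum()
    return 2.0 * np.logical_and(p, t).sum() / denom if denom > 0 else 1.0

# Example on random 3D label volumes (0=background, 1=CSF, 2=GM, 3=WM).
rng = np.random.default_rng(0)
pred = rng.integers(0, 4, size=(64, 64, 64))
gt = rng.integers(0, 4, size=(64, 64, 64))
scores = {label: dice_coefficient(pred, gt, label) for label in (1, 2, 3)}
```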
Authors: Simone Brivio, Stefania Fresca, Andrea Manzoni
Abstract: Deep Learning-based Reduced Order Models (DL-ROMs) provide nowadays a well-established class of accurate surrogate models for complex physical systems described by parametrized PDEs, by nonlinearly compressing the solution manifold into a handful of latent coordinates. Until now, design and application of DL-ROMs mainly focused on physically parameterized problems. Within this work, we provide a novel extension of these architectures to problems featuring geometrical variability and parametrized domains, namely, we propose Continuous Geometry-Aware DL-ROMs (CGA-DL-ROMs). In particular, the space-continuous nature of the proposed architecture matches the need to deal with multi-resolution datasets, which are quite common in the case of geometrically parametrized problems. Moreover, CGA-DL-ROMs are endowed with a strong inductive bias that makes them aware of geometrical parametrizations, thus enhancing both the compression capability and the overall performance of the architecture. Within this work, we justify our findings through a thorough theoretical analysis, and we practically validate our claims by means of a series of numerical tests encompassing physically-and-geometrically parametrized PDEs, ranging from the unsteady Navier-Stokes equations for fluid dynamics to advection-diffusion-reaction equations for mathematical biology.
Authors: Lingkai Zhu, Can Deniz Bezek, Orcun Goksel
Abstract: In recent years, the increasing size of deep learning models and their growing demand for computational resources have drawn significant attention to the practice of pruning neural networks while aiming to preserve their accuracy. In unstructured gradual pruning, which sparsifies a network by gradually removing individual network parameters until a targeted network sparsity is reached, recent works show that both gradient and weight magnitudes should be considered. In this work, we show that the specifics of such a mechanism, e.g., the order of prioritization and the selection criteria, are essential. We introduce a gradient-first magnitude-next strategy for choosing the parameters to prune, and show that a fixed-rate subselection criterion between these steps works better than the annealing approach in the literature. We validate this on the CIFAR-10 dataset, with multiple randomized initializations on both VGG-19 and ResNet-50 network backbones, for pruning targets of 90, 95, and 98% sparsity and for both initially dense and 50% sparse networks. Our proposed fixed-rate gradient-first gradual pruning (FGGP) approach outperforms its state-of-the-art alternatives in most of the above experimental settings, even occasionally surpassing the upper bound of the corresponding dense network results, and has the highest ranking across the considered experimental settings.
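The gradient-first magnitude-next selection with a fixed-rate subselection can be sketched in a few lines of NumPy; the subselection rate below is an illustrative assumption, not the paper's tuned value.

```python
import numpy as np

def fggp_select(weights, grads, n_prune, subselect_rate=4.0):
    # Step 1 (gradient-first): fixed-rate shortlist of low-|gradient| parameters.
    n_candidates = min(weights.size, int(subselect_rate * n_prune))
    cand = np.argsort(np.abs(grads).ravel())[:n_candidates]
    # Step 2 (magnitude-next): prune the lowest-|weight| entries of the shortlist.
    prune = cand[np.argsort(np.abs(weights).ravel()[cand])[:n_prune]]
    mask = np.ones(weights.size, dtype=bool)
    mask[prune] = False
    return mask.reshape(weights.shape)   # False marks pruned parameters

rng = np.random.default_rng(0)
w, g = rng.normal(size=(512, 512)), rng.normal(size=(512, 512))
mask = fggp_select(w, g, n_prune=26_000)   # one gradual-pruning step
```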
Authors: Nan Song, Xiaofeng Yang, Ze Yang, Guosheng Lin
Abstract: Lifelong few-shot customization for text-to-image diffusion aims to continually generalize existing models to new tasks with minimal data while preserving old knowledge. Current customization diffusion models excel at few-shot tasks but struggle with catastrophic forgetting in lifelong generation. In this study, we identify and categorize the catastrophic forgetting problems into two types: relevant concepts forgetting and previous concepts forgetting. To address these challenges, we first devise a data-free knowledge distillation strategy to tackle relevant concepts forgetting. Unlike existing methods that rely on additional real data or offline replay of original concept data, our approach enables on-the-fly knowledge distillation to retain the previous concepts while learning new ones, without accessing any previous data. Second, we develop an In-Context Generation (ICGen) paradigm that allows the diffusion model to be conditioned on the input vision context, which facilitates few-shot generation and mitigates the issue of previous concepts forgetting. Extensive experiments show that the proposed Lifelong Few-Shot Diffusion (LFS-Diffusion) method can produce high-quality and accurate images while maintaining previously learned knowledge.
Authors: Laure Ciernik, Lorenz Linhardt, Marco Morik, Jonas Dippel, Simon Kornblith, Lukas Muttenthaler
Abstract: The Platonic Representation Hypothesis claims that recent foundation models are converging to a shared representation space as a function of their downstream task performance, irrespective of the objectives and data modalities used to train these models. Representational similarity is generally measured for individual datasets and is not necessarily consistent across datasets. Thus, one may wonder whether this convergence of model representations is confounded by the datasets commonly used in machine learning. Here, we propose a systematic way to measure how representational similarity between models varies with the set of stimuli used to construct the representations. We find that the objective function is the most crucial factor in determining the consistency of representational similarities across datasets. Specifically, self-supervised vision models learn representations whose relative pairwise similarities generalize better from one dataset to another compared to those of image classification or image-text models. Moreover, the correspondence between representational similarities and the models' task behavior is dataset-dependent, being most strongly pronounced for single-domain datasets. Our work provides a framework for systematically measuring similarities of model representations across datasets and linking those similarities to differences in task behavior.
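One common way to quantify the representational similarity discussed here is linear CKA; the sketch below poses the consistency question by computing the similarity of the same model pair on two different stimulus sets (random features stand in for real model activations).

```python
import numpy as np

def linear_cka(X, Y):
    # Linear Centered Kernel Alignment between two (stimuli x features) matrices.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
# Features of models A and B on dataset 1 and dataset 2 (placeholders).
a1, b1 = rng.normal(size=(100, 64)), rng.normal(size=(100, 32))
a2, b2 = rng.normal(size=(100, 64)), rng.normal(size=(100, 32))
sim_d1, sim_d2 = linear_cka(a1, b1), linear_cka(a2, b2)
# Consistency across datasets asks whether sim_d1 and sim_d2 agree.
```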
Authors: Hejer Ammar, Nikita Kiselov, Guillaume Lapouge, Romaric Audigier
Abstract: In real-world applications where confidence is key, like autonomous driving, the accurate detection and appropriate handling of classes differing from those used during training are crucial. Despite the proposal of various unknown object detection approaches, we have observed widespread inconsistencies among them regarding the datasets, metrics, and scenarios used, alongside a notable absence of a clear definition for unknown objects, which hampers meaningful evaluation. To counter these issues, we introduce two benchmarks: a unified VOC-COCO evaluation, and the new OpenImagesRoad benchmark, which provides a clear hierarchical object definition besides new evaluation metrics. Complementing the benchmarks, we exploit the performance of recent self-supervised Vision Transformers to improve pseudo-labeling-based OpenSet Object Detection (OSOD) through OW-DETR++. State-of-the-art methods are extensively evaluated on the proposed benchmarks. This study provides a clear problem definition, ensures consistent evaluations, and draws new conclusions about the effectiveness of OSOD strategies.
Authors: Ruben Schmeitz, Joris Remmers, Olga Mula, Olaf van der Sluis
Abstract: This study introduces a surrogate modeling framework merging proper orthogonal decomposition, long short-term memory networks, and multi-task learning, to accurately predict elastoplastic deformations in real-time. Superior to single-task neural networks, this approach achieves a mean absolute error below 0.40\% across various state variables, with the multi-task model showing enhanced generalization by mitigating overfitting through shared layers. Moreover, in our use cases, a pre-trained multi-task model can effectively train additional variables with as few as 20 samples, demonstrating its deep understanding of complex scenarios. This is notably efficient compared to single-task models, which typically require around 100 samples. Significantly faster than traditional finite element analysis, our model accelerates computations by approximately a million times, making it a substantial advancement for real-time predictive modeling in engineering applications. While it necessitates further testing on more intricate models, this framework shows substantial promise in elevating both efficiency and accuracy in engineering applications, particularly for real-time scenarios.
Authors: Shuyuan Wu, Bin Du, Xuetong Li, Hansheng Wang
Abstract: We systematically study various network Expectation-Maximization (EM) algorithms for the Gaussian mixture model within the framework of decentralized federated learning. Our theoretical investigation reveals that directly extending the classical decentralized supervised learning method to the EM algorithm exhibits poor estimation accuracy with heterogeneous data across clients and struggles to converge numerically when Gaussian components are poorly-separated. To address these issues, we propose two novel solutions. First, to handle heterogeneous data, we introduce a momentum network EM (MNEM) algorithm, which uses a momentum parameter to combine information from both the current and historical estimators. Second, to tackle the challenge of poorly-separated Gaussian components, we develop a semi-supervised MNEM (semi-MNEM) algorithm, which leverages partially labeled data. Rigorous theoretical analysis demonstrates that MNEM can achieve statistical efficiency comparable to that of the whole sample estimator when the mixture components satisfy certain separation conditions, even in heterogeneous scenarios. Moreover, the semi-MNEM estimator enhances the convergence speed of the MNEM algorithm, effectively addressing the numerical convergence challenges in poorly-separated scenarios. Extensive simulation and real data analyses are conducted to justify our theoretical findings.
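The momentum idea can be illustrated on a single client with a 1-D Gaussian mixture: each fresh EM estimate is blended with the historical estimator. The network averaging over neighbors and the semi-supervised variant are omitted, and the momentum value is an illustrative assumption.

```python
import numpy as np

def em_step(x, means, variances, weights):
    # One EM iteration for a 1-D Gaussian mixture (a local client update).
    dens = np.stack([w * np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)
                     for m, v, w in zip(means, variances, weights)])
    resp = dens / dens.sum(axis=0, keepdims=True)                    # E-step
    nk = resp.sum(axis=1)
    new_means = (resp @ x) / nk                                      # M-step
    new_vars = (resp * (x[None, :] - new_means[:, None]) ** 2).sum(axis=1) / nk
    return new_means, new_vars, nk / len(x)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])
means, variances, weights = np.array([-1.0, 1.0]), np.ones(2), np.full(2, 0.5)

momentum = 0.8   # illustrative value; the paper treats this as a tunable parameter
for _ in range(50):
    m_new, v_new, w_new = em_step(x, means, variances, weights)
    # Momentum update: combine the current EM estimate with the historical one.
    means = momentum * means + (1 - momentum) * m_new
    variances = momentum * variances + (1 - momentum) * v_new
    weights = momentum * weights + (1 - momentum) * w_new
```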
Authors: Yuqing Huang, Bastian Wittmann, Olga Demler, Bjoern Menze, Neda Davoudi
Abstract: Early identification of stroke is crucial for intervention, requiring reliable models. We propose an efficient retinal image representation combined with clinical information to capture a comprehensive overview of cardiovascular health, leveraging large multimodal datasets for new medical insights. Our approach is one of the first contrastive frameworks that integrates graph and tabular data, using vessel graphs derived from retinal images for efficient representation. This method, combined with multimodal contrastive learning, significantly enhances stroke prediction accuracy by integrating data from multiple sources and using contrastive learning for transfer learning. The self-supervised learning techniques employed allow the model to learn effectively from unlabeled data, reducing the dependency on large annotated datasets. Our framework showed an AUROC improvement of 3.78% from supervised to self-supervised approaches. Additionally, the graph-level representation approach achieved superior performance to image encoders while significantly reducing pre-training and fine-tuning runtimes. These findings indicate that retinal images are a cost-effective method for improving cardiovascular disease predictions and pave the way for future research into retinal and cerebral vessel connections and the use of graph-based retinal vessel representations.
Authors: Cristiano Patr\'icio, Lu\'is F. Teixeira, Jo\~ao C. Neves
Abstract: The main challenges hindering the adoption of deep learning-based systems in clinical settings are the scarcity of annotated data and the lack of interpretability and trust in these systems. Concept Bottleneck Models (CBMs) offer inherent interpretability by constraining the final disease prediction on a set of human-understandable concepts. However, this inherent interpretability comes at the cost of greater annotation burden. Additionally, adding new concepts requires retraining the entire system. In this work, we introduce a novel two-step methodology that addresses both of these challenges. By simulating the two stages of a CBM, we utilize a pretrained Vision Language Model (VLM) to automatically predict clinical concepts, and a Large Language Model (LLM) to generate disease diagnoses based on the predicted concepts. We validate our approach on three skin lesion datasets, demonstrating that it outperforms traditional CBMs and state-of-the-art explainable methods, all without requiring any training and utilizing only a few annotated examples. The code is available at https://github.com/CristianoPatricio/2-step-concept-based-skin-diagnosis.
URLs: https://github.com/CristianoPatricio/2-step-concept-based-skin-diagnosis.
Authors: Daniela Schkoda, Philipp Faller, Patrick Bl\"obaum, Dominik Janzing
Abstract: We propose a new approach to falsify causal discovery algorithms without ground truth, which is based on testing the causal model on a pair of variables that has been dropped when learning the causal model. To this end, we use the "Leave-One-Variable-Out (LOVO)" prediction where $Y$ is inferred from $X$ without any joint observations of $X$ and $Y$, given only training data from $X,Z_1,\dots,Z_k$ and from $Z_1,\dots,Z_k,Y$. We demonstrate that causal models on the two subsets, in the form of Acyclic Directed Mixed Graphs (ADMGs), often entail conclusions on the dependencies between $X$ and $Y$, enabling this type of prediction. The prediction error can then be estimated since the joint distribution $P(X, Y)$ is assumed to be available, and $X$ and $Y$ have only been omitted for the purpose of falsification. After presenting this graphical method, which is applicable to general causal discovery algorithms, we illustrate how to construct a LOVO predictor tailored towards algorithms relying on specific a priori assumptions, such as linear additive noise models. Simulations indicate that the LOVO prediction error is indeed correlated with the accuracy of the causal outputs, affirming the method's effectiveness.
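In the simplest linear case, LOVO prediction amounts to composing two regressions through the shared variables, as sketched below with plain linear models in place of the ADMG-based construction; the held-back joint data play the role of the distribution $P(X, Y)$ used for falsification.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 4000
Z = rng.normal(size=(n, 3))
X = Z @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.normal(size=n)
Y = Z @ np.array([0.3, 0.8, -1.0]) + 0.1 * rng.normal(size=n)

# Dataset 1 provides (X, Z); dataset 2 provides (Z, Y); no joint (X, Y) is used.
half = n // 2
z_from_x = LinearRegression().fit(X[:half, None], Z[:half])
y_from_z = LinearRegression().fit(Z[half:], Y[half:])

# LOVO predictor: compose the two maps through Z, then score against the
# held-back joint observations to falsify (or support) the causal model.
y_hat = y_from_z.predict(z_from_x.predict(X[:half, None]))
lovo_mse = np.mean((y_hat - Y[:half]) ** 2)
```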
Authors: Hunor Csala, Arvind Mohan, Daniel Livescu, Amirhossein Arzani
Abstract: Computational cardiovascular flow modeling plays a crucial role in understanding blood flow dynamics. While 3D models provide accurate detail, they are computationally expensive, especially with fluid-structure interaction (FSI) simulations. 1D models offer a computationally efficient alternative by simplifying the 3D Navier-Stokes equations through an axisymmetric flow assumption and cross-sectional averaging. However, traditional 1D models based on finite element methods (FEM) often lack accuracy compared to 3D averaged solutions. This study introduces a novel physics-constrained machine learning technique that enhances the accuracy of 1D blood flow models while maintaining computational efficiency. Our approach, utilizing a physics-constrained coupled neural differential equation (PCNDE) framework, demonstrates superior performance compared to conventional FEM-based 1D models across a wide range of inlet boundary condition waveforms and stenosis blockage ratios. A key innovation lies in the spatial formulation of the momentum conservation equation, departing from the traditional temporal approach and capitalizing on the inherent temporal periodicity of blood flow. This spatial neural differential equation formulation switches space and time and overcomes issues related to coupling stability and smoothness, while simplifying boundary condition implementation. The model accurately captures flow rate, area, and pressure variations for unseen waveforms and geometries. We evaluate the model's robustness to input noise and explore the loss landscapes associated with the inclusion of different physics terms. By combining the strengths of physics-based and data-driven modeling, this advanced 1D modeling technique offers promising potential for fast and accurate cardiovascular simulations.
Authors: Zhuowen Yin, Chengru Li, Xingbo Dong
Abstract: To address the challenges of high computational costs and long-distance dependencies in existing video understanding methods, such as CNNs and Transformers, this work introduces RWKV to the video domain in a novel way. We propose an LSTM CrossRWKV (LCR) framework, designed for spatiotemporal representation learning to tackle the video understanding task. Specifically, the proposed linear-complexity LCR incorporates a novel Cross RWKV gate to facilitate interaction between current frame edge information and past features, enhancing the focus on the subject through edge features and globally aggregating inter-frame features over time. LCR stores long-term memory for video processing through an enhanced LSTM recurrent execution mechanism. By leveraging the Cross RWKV gate and recurrent execution, LCR effectively captures both spatial and temporal features. Additionally, the edge information serves as a forgetting gate for the LSTM, guiding long-term memory management. A tube masking strategy reduces redundant information in the video input and reduces overfitting. These advantages enable LSTM CrossRWKV to set a new benchmark in video understanding, offering a scalable and efficient solution for comprehensive video analysis. All code and models are publicly available.
Authors: Ilia Mahrooghi, Mahshad Moradi, Sina Akbari, Negar Kiyavash
Abstract: While significant progress has been made in designing algorithms that minimize regret in online decision-making, real-world scenarios often introduce additional complexities, perhaps the most challenging of which is missing outcomes. Overlooking this aspect or simply assuming random missingness invariably leads to biased estimates of the rewards and may result in linear regret. Despite the practical relevance of this challenge, no rigorous methodology currently exists for systematically handling missingness, especially when the missingness mechanism is not random. In this paper, we address this gap in the context of multi-armed bandits (MAB) with missing outcomes by analyzing the impact of different missingness mechanisms on achievable regret bounds. We introduce algorithms that account for missingness under both missing at random (MAR) and missing not at random (MNAR) models. Through both analytical and simulation studies, we demonstrate the drastic improvements in decision-making by accounting for missingness in these settings.
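As a toy illustration of handling missingness, one can pair UCB with inverse-propensity-weighted reward estimates under MAR; the known observation probabilities and the UCB form below are assumptions for the sketch, not the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
obs_prob = np.array([0.9, 0.6, 0.8])    # P(outcome observed | arm), assumed known

K, T = len(true_means), 5000
pulls, ipw_sum = np.zeros(K), np.zeros(K)

for t in range(T):
    safe = np.maximum(pulls, 1)
    ucb = ipw_sum / safe + np.sqrt(2 * np.log(t + 1) / safe)
    arm = t if t < K else int(np.argmax(ucb))   # pull each arm once, then UCB
    pulls[arm] += 1
    reward = rng.binomial(1, true_means[arm])
    if rng.random() < obs_prob[arm]:            # outcome may go missing (MAR)
        ipw_sum[arm] += reward / obs_prob[arm]  # unbiased IPW correction

# ipw_sum / pulls estimates each arm's mean despite the missing outcomes.
```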
Authors: Xiwen Wei, Guihong Li, Radu Marculescu
Abstract: Catastrophic forgetting is a significant challenge in online continual learning (OCL), especially for non-stationary data streams that do not have well-defined task boundaries. This challenge is exacerbated by the memory constraints and privacy concerns inherent in rehearsal buffers. To tackle catastrophic forgetting, in this paper, we introduce Online-LoRA, a novel framework for task-free OCL. Online-LoRA finetunes pre-trained Vision Transformer (ViT) models in real time to address the limitations of rehearsal buffers and leverage the performance benefits of pre-trained models. As the main contribution, our approach features a novel online weight regularization strategy to identify and consolidate important model parameters. Moreover, Online-LoRA leverages the training dynamics of loss values to enable automatic recognition of data distribution shifts. Extensive experiments across many task-free OCL scenarios and benchmark datasets (including CIFAR-100, ImageNet-R, ImageNet-S, CUB-200 and CORe50) demonstrate that Online-LoRA can be robustly adapted to various ViT architectures, while achieving better performance compared to SOTA methods. Our code will be publicly available at: https://github.com/Christina200/Online-LoRA-official.git.
URLs: https://github.com/Christina200/Online-LoRA-official.git.
Authors: Xiulong Liu, Kun Su, Eli Shlizerman
Abstract: The content of visual and audio scenes is multi-faceted, such that a video can be paired with various audio tracks and vice versa. Thereby, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and an optional textual description of the audio. Such a framework has two advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate for the video by generating audio captions. VATT consists of two key modules: VATT Converter, an LLM that is fine-tuned for instructions and includes a projection layer that maps video features to the LLM vector space; and VATT Audio, a transformer that generates audio tokens from visual frames and from the optional text prompt using iterative parallel decoding. The audio tokens are converted to a waveform by a pretrained neural codec. Experiments show that when VATT is compared to existing video-to-audio generation methods in objective metrics, it achieves competitive performance when the audio caption is not provided. When the audio caption is provided as a prompt, VATT achieves even more refined performance (lowest KLD score of 1.41). Furthermore, subjective studies show that audio generated by VATT Audio was preferred over audio generated by existing methods. VATT enables controllable video-to-audio generation through text as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.
Authors: Hongyi Pan, Ziliang Hong, Gorkem Durak, Elif Keles, Halil Ertugrul Aktas, Yavuz Taktak, Alpay Medetalibeyoglu, Zheyuan Zhang, Yury Velichko, Concetto Spampinato, Ivo Schoots, Marco J. Bruno, Pallavi Tiwari, Candice Bolan, Tamas Gonda, Frank Miller, Rajesh N. Keswani, Michael B. Wallace, Ziyue Xu, Ulas Bagci
Abstract: Accurate classification of Intraductal Papillary Mucinous Neoplasms (IPMN) is essential for identifying high-risk cases that require timely intervention. In this study, we develop a federated learning framework for multi-center IPMN classification utilizing a comprehensive pancreas MRI dataset. This dataset includes 653 T1-weighted and 656 T2-weighted MRI images, accompanied by corresponding IPMN risk scores from 7 leading medical institutions, making it the largest and most diverse dataset for IPMN classification to date. We assess the performance of DenseNet-121 in both centralized and federated settings for training on distributed data. Our results demonstrate that the federated learning approach achieves high classification accuracy comparable to centralized learning while ensuring data privacy across institutions. This work marks a significant advancement in collaborative IPMN classification, facilitating secure and high-accuracy model training across multiple centers.
Authors: Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla
Abstract: Convolutional Neural Networks (CNNs) have seen significant performance improvements in recent years. However, due to their size and complexity, they function as black boxes, leading to transparency concerns. State-of-the-art saliency methods generate local explanations that highlight the area in the input image where a class is identified but cannot explain how a concept of interest contributes to the prediction, which is essential for bias mitigation. On the other hand, concept-based methods, such as TCAV (Testing with Concept Activation Vectors), provide insights into how sensitive the network is to a concept, but can neither compute its attribution in a specific prediction nor show its location within the input image. This paper introduces a novel post-hoc explainability framework, Visual-TCAV, which aims to bridge the gap between these methods by providing both local and global explanations for CNN-based image classification. Visual-TCAV uses Concept Activation Vectors (CAVs) to generate saliency maps that show where concepts are recognized by the network. Moreover, it can estimate the attribution of these concepts to the output of any class using a generalization of Integrated Gradients. This framework is evaluated on popular CNN architectures, with its validity further confirmed via experiments where ground truth for explanations is known, and a comparison with TCAV. Our code will be made available soon.
Authors: Ethan King, Jaime Rodriguez, Diego Llanes, Timothy Doster, Tegan Emerson, James Koch
Abstract: We present a sensor-agnostic spectral transformer as the basis for spectral foundation models. To that end, we introduce a Universal Spectral Representation (USR) that leverages sensor meta-data, such as sensing kernel specifications and sensing wavelengths, to encode spectra obtained from any spectral instrument into a common representation, such that a single model can ingest data from any sensor. Furthermore, we develop a methodology for pre-training such models in a self-supervised manner using a novel random sensor-augmentation and reconstruction pipeline to learn spectral features independent of the sensing paradigm. We demonstrate that our architecture can learn sensor independent spectral features that generalize effectively to sensors not seen during training. This work sets the stage for training foundation models that can both leverage and be effective for the growing diversity of spectral data.
Authors: Puze Liu, Jonas G\"unster, Niklas Funk, Simon Gr\"oger, Dong Chen, Haitham Bou-Ammar, Julius Jankowski, Ante Mari\'c, Sylvain Calinon, Andrej Orsula, Miguel Olivares-Mendez, Hongyi Zhou, Rudolf Lioutikov, Gerhard Neumann, Amarildo Likmeta, Amirhossein Zhalehmehrabi, Thomas Bonenfant, Marcello Restelli, Davide Tateo, Ziyuan Liu, Jan Peters
Abstract: Machine learning methods have a groundbreaking impact in many application domains, but their application on real robotic platforms is still limited. Despite the many challenges associated with combining machine learning technology with robotics, robot learning remains one of the most promising directions for enhancing the capabilities of robots. When deploying learning-based approaches on real robots, extra effort is required to address the challenges posed by various real-world factors. To investigate the key factors influencing real-world deployment and to encourage original solutions from different researchers, we organized the Robot Air Hockey Challenge at the NeurIPS 2023 conference. We selected the air hockey task as a benchmark, encompassing low-level robotics problems and high-level tactics. Different from other machine learning-centric benchmarks, participants need to tackle practical challenges in robotics, such as the sim-to-real gap, low-level control issues, safety problems, real-time requirements, and the limited availability of real-world data. Furthermore, we focus on a dynamic environment, removing the typical assumption of quasi-static motions of other real-world benchmarks. The competition's results show that solutions combining learning-based approaches with prior knowledge outperform those relying solely on data when real-world deployment is challenging. Our ablation study reveals which real-world factors may be overlooked when building a learning-based solution. The successful real-world air hockey deployment of best-performing agents sets the foundation for future competitions and follow-up research directions.
Authors: G. J. E. van Otterdijk, S. Moradi, S. Weiland, R. T\'oth, N. O. Jaensson, M. Schoukens
Abstract: Port-Hamiltonian neural networks (pHNNs) are emerging as a powerful modeling tool that integrates physical laws with deep learning techniques. While most research has focused on modeling the entire dynamics of interconnected systems, the potential for identifying and modeling individual subsystems while operating as part of a larger system has been overlooked. This study addresses this gap by introducing a novel method for using pHNNs to identify such subsystems based solely on input-output measurements. By utilizing the inherent compositional property of the port-Hamiltonian systems, we developed an algorithm that learns the dynamics of individual subsystems, without requiring direct access to their internal states. On top of that, by choosing an output error (OE) model structure, we have been able to handle measurement noise effectively. The effectiveness of the proposed approach is demonstrated through tests on interconnected systems, including multi-physics scenarios, demonstrating its potential for identifying subsystem dynamics and facilitating their integration into new interconnected models.
Authors: Jerry Yao-Chieh Hu, Erzhi Liu, Han Liu, Zhao Song, Lichen Zhang
Abstract: Given a database of bit strings $A_1,\ldots,A_m\in \{0,1\}^n$, a fundamental data structure task is to estimate the distances between a given query $B\in \{0,1\}^n$ and all the strings in the database. In addition, one might further want to ensure the integrity of the database by releasing these distance statistics in a secure manner. In this work, we propose differentially private (DP) data structures for this type of task, with a focus on Hamming and edit distance. On top of the strong privacy guarantees, our data structures are also time- and space-efficient. In particular, our data structure is $\epsilon$-DP against any sequence of queries of arbitrary length, and for any query $B$ such that the maximum distance to any string in the database is at most $k$, we output $m$ distance estimates. Moreover, for Hamming distance, our data structure answers any query in $\widetilde O(mk+n)$ time, and each estimate deviates from the true distance by at most $\widetilde O(k/e^{\epsilon/\log k})$; for edit distance, our data structure answers any query in $\widetilde O(mk^2+n)$ time, and each estimate deviates from the true distance by at most $\widetilde O(k/e^{\epsilon/(\log k \log n)})$. For moderate $k$, both data structures support sublinear query operations. We obtain these results via a novel adaptation of the randomized response technique as a bit-flipping procedure, applied to the sketched strings.
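The bit-flipping idea can be illustrated directly on raw strings (the paper applies it to sketched strings with additional machinery): flip each database bit by randomized response, then debias the observed Hamming distance to the query.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1.0
p_flip = 1.0 / (1.0 + np.exp(eps))      # randomized response flip probability

A = rng.integers(0, 2, size=(100, 64))              # database of bit strings
noisy_A = A ^ (rng.random(A.shape) < p_flip)        # privately released bits

def debiased_hamming(noisy_row, query, p=p_flip):
    # E[noisy distance] = d*(1-p) + (n-d)*p; solve for the true distance d.
    n = len(query)
    noisy_d = np.sum(noisy_row != query)
    return (noisy_d - n * p) / (1.0 - 2.0 * p)

B = rng.integers(0, 2, size=64)                     # query string
estimates = np.array([debiased_hamming(row, B) for row in noisy_A])
```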
Authors: Yilun Zhao, Yitao Long, Yuru Jiang, Chengye Wang, Weiyuan Chen, Hongjun Liu, Yiming Zhang, Xiangru Tang, Chen Zhao, Arman Cohan
Abstract: We introduce FinDVer, a comprehensive benchmark specifically designed to evaluate the explainable claim verification capabilities of LLMs in the context of understanding and analyzing long, hybrid-content financial documents. FinDVer contains 2,400 expert-annotated examples, divided into three subsets: information extraction, numerical reasoning, and knowledge-intensive reasoning, each addressing common scenarios encountered in real-world financial contexts. We assess a broad spectrum of LLMs under long-context and RAG settings. Our results show that even the current best-performing system, GPT-4o, still lags behind human experts. We further provide in-depth analyses of the long-context and RAG settings, Chain-of-Thought reasoning, and model reasoning errors, offering insights to drive future advancements. We believe that FinDVer can serve as a valuable benchmark for evaluating LLMs in claim verification over complex, expert-domain documents.
Authors: Guixian Xu, Jinglai Li, Junqi Tang
Abstract: Equivariant Imaging (EI) regularization has become the de-facto technique for unsupervised training of deep imaging networks, without any need for ground-truth data. Observing that the EI-based unsupervised training paradigm currently has significant computational redundancy, leading to inefficiency in high-dimensional applications, we propose a sketched EI regularization which leverages randomized sketching techniques for acceleration. We then extend our sketched EI regularization to develop an accelerated deep internal learning framework -- Sketched Equivariant Deep Image Prior (Sk.EI-DIP) -- which can be efficiently applied for single-image and task-adapted reconstruction. Our numerical study on X-ray CT image reconstruction tasks demonstrates that our approach can achieve an order-of-magnitude computational acceleration over the standard EI-based counterpart in the single-input setting, as well as network adaptation at test time.
Authors: Maxime Jacovella, Ali Keshavarzi, Elsa Angelini
Abstract: Despite advances with deep learning (DL), automated airway segmentation from chest CT scans continues to face challenges in segmentation quality and generalization across cohorts. To address these, we propose integrating Curriculum Learning (CL) into airway segmentation networks, distributing the training set into batches according to ad-hoc complexity scores derived from CT scans and corresponding ground-truth tree features. We specifically investigate few-shot domain adaptation, targeting scenarios where manual annotation of a full fine-tuning dataset is prohibitively expensive. Results are reported on two large open cohorts (ATM22 and AIIB23), showing high performance using CL for both full training (source domain) and few-shot fine-tuning (target domain), along with insights into potentially detrimental effects of using a classic bootstrapping scoring function or improper scan sequencing.
Authors: Minjae Lee, Kyungmin Kim, Taesoo Kim, Sangdon Park
Abstract: Trustworthiness of generative language models (GLMs) is crucial in their deployment to critical decision-making systems. Hence, certified risk control methods such as selective prediction and conformal prediction have been applied to mitigating the hallucination problem in various supervised downstream tasks. However, the lack of an appropriate correctness metric hinders applying such principled methods to language generation tasks. In this paper, we circumvent this problem by leveraging the concept of textual entailment to evaluate the correctness of the generated sequence, and propose two selective generation algorithms which control the false discovery rate with respect to the textual entailment relation (FDR-E) with a theoretical guarantee: $\texttt{SGen}^{\texttt{Sup}}$ and $\texttt{SGen}^{\texttt{Semi}}$. $\texttt{SGen}^{\texttt{Sup}}$, a direct modification of selective prediction, is a supervised learning algorithm which exploits entailment-labeled data, annotated by humans. Since human annotation is costly, we further propose a semi-supervised version, $\texttt{SGen}^{\texttt{Semi}}$, which fully utilizes unlabeled data by pseudo-labeling, leveraging an entailment set function learned via conformal prediction. Furthermore, $\texttt{SGen}^{\texttt{Semi}}$ enables the use of a more general class of selection functions, neuro-selection functions, and provides users with an optimal selection function class given multiple candidates. Finally, we demonstrate the efficacy of the $\texttt{SGen}$ family in achieving a desired FDR-E level with comparable selection efficiency to baselines on both open and closed source GLMs. Code and datasets are provided at https://github.com/ml-postech/selective-generation.
Authors: Xiaoyu Wen, Xudong Yu, Rui Yang, Haoyuan Chen, Chenjia Bai, Zhen Wang
Abstract: To obtain a near-optimal policy with fewer interactions in Reinforcement Learning (RL), a promising approach involves the combination of offline RL, which enhances sample efficiency by leveraging offline datasets, and online RL, which explores informative transitions by interacting with the environment. Offline-to-Online (O2O) RL provides a paradigm for improving an offline trained agent within limited online interactions. However, due to the significant distribution shift between online experiences and offline data, most offline RL algorithms suffer from performance drops and fail to achieve stable policy improvement in O2O adaptation. To address this problem, we propose the Robust Offline-to-Online (RO2O) algorithm, designed to enhance offline policies through uncertainty and smoothness, and to mitigate the performance drop in online adaptation. Specifically, RO2O incorporates Q-ensemble for uncertainty penalty and adversarial samples for policy and value smoothness, which enable RO2O to maintain a consistent learning procedure in online adaptation without requiring special changes to the learning objective. Theoretical analyses in linear MDPs demonstrate that the uncertainty and smoothness lead to a tighter optimality bound in O2O against distribution shift. Experimental results illustrate the superiority of RO2O in facilitating stable offline-to-online learning and achieving significant improvement with limited online interactions.
Authors: Eungyeup Kim, Mingjie Sun, Christina Baek, Aditi Raghunathan, J. Zico Kolter
Abstract: Recently, Miller et al. (2021) and Baek et al. (2022) empirically demonstrated strong linear correlations between in-distribution (ID) versus out-of-distribution (OOD) accuracy and agreement. These trends, coined accuracy-on-the-line (ACL) and agreement-on-the-line (AGL), enable OOD model selection and performance estimation without labeled data. However, these phenomena also break for certain shifts, such as CIFAR10-C Gaussian Noise, posing a critical bottleneck. In this paper, we make a key finding that recent test-time adaptation (TTA) methods not only improve OOD performance, but drastically strengthen the ACL and AGL trends in models, even in shifts where models showed very weak correlations before. To analyze this, we revisit the theoretical conditions from Miller et al. (2021) that outline the types of distribution shifts needed for perfect ACL in linear models. Surprisingly, these conditions are satisfied after applying TTA to deep models in the penultimate feature embedding space. In particular, TTA collapses complex distribution shifts into ones that can be expressed by a single scaling variable in the feature space. Our results show that by combining TTA with AGL-based estimation methods, we can estimate the OOD performance of models with high precision for a broader set of distribution shifts. This lends us a simple system for selecting the best hyperparameters and adaptation strategy without any OOD labeled data.
Authors: Ziliang Wang, Xiaohong Zhang, Kechi Zhang, Ze Shi Li, Meng Yan
Abstract: Individual objects, whether users or services, within a specific region often exhibit similar network states due to their shared origin from the same city or autonomous system (AS). Despite this regional network similarity, many existing techniques overlook its potential, resulting in subpar performance arising from challenges such as data sparsity and label imbalance. In this paper, we introduce the regional-based dual latent state learning network (R2SL), a novel deep learning framework designed to overcome the pitfalls of traditional individual-object-based prediction techniques in Quality of Service (QoS) prediction. Unlike its predecessors, R2SL captures the nuances of regional network behavior by deriving two distinct regional network latent states: the city-network latent state and the AS-network latent state. These states are constructed from aggregated data of common regions rather than individual object data. Furthermore, R2SL adopts an enhanced Huber loss function that adjusts its linear loss component, providing a remedy for prevalent label imbalance issues. To cap off the prediction process, a multi-scale perception network is leveraged to interpret the integrated feature map, a fusion of regional network latent features and other pertinent information, ultimately accomplishing the QoS prediction. Through rigorous testing on real-world QoS datasets, R2SL demonstrates superior performance compared to prevailing state-of-the-art methods. Our R2SL approach ushers in an innovative avenue for precise QoS predictions by fully harnessing the regional network similarities inherent in objects.
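The adjustable linear component can be sketched as a Huber-style loss whose linear-branch slope is a tunable knob, kept continuous at the threshold; the scale value is an illustrative assumption, not R2SL's exact formulation.

```python
import numpy as np

def enhanced_huber(residual, delta=1.0, linear_scale=2.0):
    # Huber-style loss with a rescaled linear branch (continuous at delta);
    # linear_scale=1 recovers the standard Huber loss.
    r = np.abs(residual)
    quad = 0.5 * r ** 2
    lin = linear_scale * delta * r - (linear_scale - 0.5) * delta ** 2
    return np.where(r <= delta, quad, lin)

# Rare large QoS values yield large residuals; upweighting the linear branch
# keeps them from being drowned out by the many small residuals.
residuals = np.array([0.1, 0.5, 2.0, 5.0])
standard = enhanced_huber(residuals, linear_scale=1.0)
enhanced = enhanced_huber(residuals, linear_scale=2.0)
```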
Authors: Quan Zhou, Jakub Marecek
Abstract: Fairness holds a pivotal role in the realm of machine learning, particularly when it comes to addressing groups categorised by protected attributes, e.g., gender or race. Prevailing algorithms in fair learning predominantly hinge on accessibility or estimations of these protected attributes, at least in the training process. We design a single group-blind projection map that aligns the feature distributions of both groups in the source data, achieving (demographic) group parity, without requiring values of the protected attribute for individual samples in the computation of the map, or in its use. Instead, our approach utilises the feature distributions of the privileged and unprivileged groups in a broader population and the essential assumption that the source data are an unbiased representation of the population. We present numerical results on synthetic data and real data.
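In one dimension, such a single group-blind map can be sketched with quantile functions: a monotone map built only from population-level group distributions and applied identically to every sample. The construction below is a simplified stand-in for the paper's projection map; the quantile-averaged target is one possible common distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
priv = rng.normal(1.0, 1.0, 100_000)      # privileged-group population features
unpriv = rng.normal(-1.0, 1.5, 100_000)   # unprivileged-group population features

qs = np.linspace(0.001, 0.999, 999)
target_q = 0.5 * (np.quantile(priv, qs) + np.quantile(unpriv, qs))  # barycenter
mixture_q = np.quantile(np.concatenate([priv, unpriv]), qs)

def group_blind_map(x):
    # Applied identically to every sample; no protected attribute is needed.
    return np.interp(x, mixture_q, target_q)

aligned = group_blind_map(np.concatenate([priv[:5], unpriv[:5]]))
# After mapping, the two groups' feature distributions approximately coincide.
```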
Authors: Franziska Heeg, Ingo Scholtes
Abstract: Node centralities play a pivotal role in network science, social network analysis, and recommender systems. In temporal data, static path-based centralities like closeness or betweenness can give misleading results about the true importance of nodes in a temporal graph. To address this issue, temporal generalizations of betweenness and closeness have been defined that are based on the shortest time-respecting paths between pairs of nodes. However, a major issue of those generalizations is that the calculation of such paths is computationally expensive. Addressing this issue, we study the application of De Bruijn Graph Neural Networks (DBGNN), a time-aware graph neural network architecture, to predict temporal path-based centralities in time series data. We experimentally evaluate our approach in 13 temporal graphs from biological and social systems and show that it considerably improves the prediction of betweenness and closeness centrality compared to (i) a static Graph Convolutional Neural Network, (ii) an efficient sampling-based approximation technique for temporal betweenness, and (iii) two state-of-the-art time-aware graph learning techniques for dynamic graphs.
Authors: Suraj Singireddy, Precious Nwaorgu, Andre Beckus, Aden McKinney, Chinwendu Enyioha, Sumit Kumar Jha, George K. Atia, Alvaro Velasquez
Abstract: Reinforcement learning (RL) is a powerful tool for finding optimal policies in sequential decision processes. However, deep RL methods have two weaknesses: collecting the amount of agent experience required for practical RL problems is prohibitively expensive, and the learned policies exhibit poor generalization on tasks outside the training data distribution. To mitigate these issues, we introduce automaton distillation, a form of neuro-symbolic transfer learning in which Q-value estimates from a teacher are distilled into a low-dimensional representation in the form of an automaton. We then propose methods for generating Q-value estimates where symbolic information is extracted from a teacher's Deep Q-Network (DQN). The resulting Q-value estimates are used to bootstrap learning in target discrete and continuous environments via a modified DQN and Twin Delayed Deep Deterministic Policy Gradient (TD3) loss function, respectively. We demonstrate that automaton distillation decreases the time required to find optimal policies for various decision tasks in new environments, even when the target environment differs in structure from the source environment.
Authors: Jaehyeon Son, Soochan Lee, Gunhee Kim
Abstract: Over the past decade, deep neural networks have demonstrated significant success using the training scheme that involves mini-batch stochastic gradient descent on extensive datasets. Expanding upon this accomplishment, there has been a surge in research exploring the application of neural networks in other learning scenarios. One notable framework that has garnered significant attention is meta-learning. Often described as "learning to learn," meta-learning is a data-driven approach to optimize the learning algorithm. Other branches of interest are continual learning and online learning, both of which involve incrementally updating a model with streaming data. While these frameworks were initially developed independently, recent works have started investigating their combinations, proposing novel problem settings and learning algorithms. However, due to the elevated complexity and lack of unified terminology, discerning differences between the learning frameworks can be challenging even for experienced researchers. To facilitate a clear understanding, this paper provides a comprehensive survey that organizes various problem settings using consistent terminology and formal descriptions. By offering an overview of these learning paradigms, our work aims to foster further advancements in this promising area of research.
Authors: Ga\"etan Frusque, Ismail Nejjar, Majid Nabavi, Olga Fink
Abstract: The Health Index (HI) is crucial for evaluating system health and is important for tasks like anomaly detection and Remaining Useful Life (RUL) prediction of safety-critical systems. Real-time, meticulous monitoring of system conditions is essential, especially in manufacturing high-quality and safety-critical components such as spray coatings. However, acquiring accurate health status information (HI labels) in real scenarios can be difficult or costly because it requires continuous, precise measurements that fully capture the system's health. As a result, using datasets from systems run-to-failure, which provide limited HI labels only at the healthy and end-of-life phases, becomes a practical approach. We employ Deep Semi-supervised Anomaly Detection (DeepSAD) embeddings to tackle the challenge of extracting features associated with the system's health state. Additionally, we introduce a diversity loss to further enrich the DeepSAD embeddings. We also propose applying an alternating projection algorithm with isotonic constraints to transform the embedding into a normalized HI with an increasing trend. Validation on the PHME2010 milling dataset, a recognized benchmark with ground truth HIs, confirms the efficacy of our proposed HI estimations. Our methodology is further applied to monitor the wear states of thermal spray coatings using high-frequency voltage. These contributions facilitate more accessible and reliable HI estimation, particularly in scenarios where obtaining ground truth HI labels is impossible.
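The final monotonization step can be approximated with off-the-shelf isotonic regression, as in the sketch below, which substitutes scikit-learn's solver for the paper's alternating-projection algorithm with isotonic constraints.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
t = np.arange(200)                                    # operating time index
raw_score = 0.01 * t + 0.3 * rng.normal(size=t.size)  # noisy embedding-based score

hi = IsotonicRegression(increasing=True).fit_transform(t, raw_score)
hi = (hi - hi.min()) / (hi.max() - hi.min())          # normalized, increasing HI
```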
Authors: Jungtaek Kim
Abstract: Bayesian optimization is a principled optimization strategy for a black-box objective function. It shows its effectiveness in a wide variety of real-world applications such as scientific discovery and experimental design. In general, the performance of Bayesian optimization is reported through regret-based metrics such as instantaneous, simple, and cumulative regrets. These metrics rely only on function evaluations, so they do not consider geometric relationships between query points and global solutions, or between query points themselves. Notably, they cannot discriminate whether multiple global solutions have been successfully found. Moreover, they do not evaluate Bayesian optimization's ability to exploit and explore a given search space. To tackle these issues, we propose four new geometric metrics, i.e., precision, recall, average degree, and average distance. These metrics allow us to compare Bayesian optimization algorithms by considering the geometry of both query points and global optima, or of the query points alone. However, they are accompanied by an extra parameter, which needs to be carefully determined. We therefore devise parameter-free forms of the respective metrics by integrating out the additional parameter. Finally, we validate that our proposed metrics can provide a more nuanced interpretation of Bayesian optimization, on top of assessment via the conventional metrics.
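The proposed precision and recall admit a direct implementation from their verbal description, as sketched below; eps is the extra parameter that the parameter-free variants integrate out.

```python
import numpy as np

def precision_recall(queries, optima, eps):
    # Precision: fraction of query points within eps of some global optimum.
    # Recall: fraction of global optima within eps of some query point.
    d = np.linalg.norm(queries[:, None, :] - optima[None, :, :], axis=-1)
    return float(np.mean(d.min(axis=1) <= eps)), float(np.mean(d.min(axis=0) <= eps))

rng = np.random.default_rng(0)
queries = rng.uniform(-1, 1, size=(50, 2))        # points queried by BO
optima = np.array([[0.5, 0.5], [-0.5, -0.5]])     # multiple global solutions
prec, rec = precision_recall(queries, optima, eps=0.2)
```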
Authors: Luis M\"uller, Daniel Kusuma, Blai Bonet, Christopher Morris
Abstract: Graph learning architectures based on the k-dimensional Weisfeiler-Leman (k-WL) hierarchy offer a theoretically well-understood expressive power. However, such architectures often fail to deliver solid predictive performance on real-world tasks, limiting their practical impact. In contrast, global attention-based models such as graph transformers demonstrate strong performance in practice, but comparing their expressive power with the k-WL hierarchy remains challenging, particularly since these architectures rely on positional or structural encodings for their expressivity and predictive performance. To address this, we show that the recently proposed Edge Transformer, a global attention model operating on node pairs instead of nodes, has at least 3-WL expressive power. Empirically, we demonstrate that the Edge Transformer surpasses other theoretically aligned architectures regarding predictive performance while not relying on positional or structural encodings. Our code is available at https://github.com/luis-mueller/towards-principled-gts
URLs: https://github.com/luis-mueller/towards-principled-gts
Authors: Evan Chen, Frank Po-Chen Lin, Dong-Jun Han, Christopher G. Brinton
Abstract: While federated learning (FL) eliminates the transmission of raw data over a network, it is still vulnerable to privacy breaches from the communicated model parameters. In this work, we propose Multi-Tier Federated Learning with Multi-Tier Differential Privacy (M^2FDP), a DP-enhanced FL methodology for jointly optimizing privacy and performance in hierarchical networks. One of the key concepts of M^2FDP is to extend the concept of hierarchical differential privacy (HDP) towards Multi-Tier Differential Privacy (MDP), while also adapting DP noise injection at different layers of an established FL hierarchy -- edge devices, edge servers, and cloud servers -- according to the trust models within particular subnetworks. We conduct a comprehensive analysis of the convergence behavior of M^2FDP, revealing conditions on parameter tuning under which the training process converges sublinearly to a finite stationarity gap that depends on the network hierarchy, trust model, and target privacy level. Subsequent numerical evaluations demonstrate that M^2FDP obtains substantial improvements in these metrics over baselines for different privacy budgets, and validate the impact of different system configurations.
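The tier-dependent noise injection can be sketched as clipping plus Gaussian noise whose placement and scale follow the trust model of each aggregation layer; all noise scales below are illustrative, not M^2FDP's calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)
clip = 1.0
sigma = {"device": 1.2, "edge": 0.6}  # more noise where the aggregator is less trusted

def clip_update(u, c=clip):
    return u * min(1.0, c / np.linalg.norm(u))

device_updates = [rng.normal(size=10) for _ in range(8)]

# Untrusted edge server: each device adds noise locally before uploading.
noisy = [clip_update(u) + rng.normal(0, sigma["device"] * clip, 10)
         for u in device_updates]
edge_agg_untrusted = np.mean(noisy, axis=0)

# Trusted edge, untrusted cloud: noise is added once to the edge aggregate,
# so far less total noise reaches the cloud for the same privacy target.
edge_agg_trusted = (np.mean([clip_update(u) for u in device_updates], axis=0)
                    + rng.normal(0, sigma["edge"] * clip / len(device_updates), 10))
```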
Authors: Andy Zhou, Bo Li, Haohan Wang
Abstract: Despite advances in AI alignment, large language models (LLMs) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries can modify prompts to induce unwanted behavior. While some defenses have been proposed, they have not been adapted to newly proposed attacks and more challenging threat models. To address this, we propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm, Robust Prompt Optimization (RPO), to create robust system-level defenses. Our approach directly incorporates the adversary into the defensive objective and optimizes a lightweight and transferable suffix, enabling RPO to adapt to worst-case adaptive attacks. Our theoretical and experimental results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench, setting the state-of-the-art. Code can be found at https://github.com/lapisrocks/rpo
Authors: Ruoqi Zhang, Ziwei Luo, Jens Sj\"olund, Thomas B. Sch\"on, Per Mattsson
Abstract: This paper presents advanced techniques for training diffusion policies for offline reinforcement learning (RL). At the core is a mean-reverting stochastic differential equation (SDE) that transforms a complex action distribution into a standard Gaussian and then samples actions conditioned on the environment state with a corresponding reverse-time SDE, like a typical diffusion policy. We show that such an SDE has a solution that we can use to calculate the log probability of the policy, yielding an entropy regularizer that improves the exploration of offline datasets. To mitigate the impact of inaccurate value functions from out-of-distribution data points, we further propose to learn the lower confidence bound of Q-ensembles for more robust policy improvement. By combining the entropy-regularized diffusion policy with Q-ensembles in offline RL, our method achieves state-of-the-art performance on most tasks in D4RL benchmarks. Code is available at https://github.com/ruoqizzz/Entropy-Regularized-Diffusion-Policy-with-QEnsemble.
URLs: https://github.com/ruoqizzz/Entropy-Regularized-Diffusion-Policy-with-QEnsemble
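As background for the mean-reverting construction, the sketch below shows how an Ornstein-Uhlenbeck-style SDE with unit stationary variance, discretized with Euler-Maruyama, transports an arbitrary action sample toward a standard Gaussian. This is an illustrative stand-in, not the paper's exact SDE or noise schedule.

```python
import numpy as np

def forward_mean_reverting(a0, theta=2.0, n_steps=100, dt=0.01, rng=None):
    """Euler-Maruyama discretization of dx = -theta * x dt + sqrt(2*theta) dW.

    This OU-style SDE has N(0, 1) as its stationary distribution, so a
    complex action sample a0 is gradually transported toward a standard
    Gaussian -- the role the abstract assigns to the mean-reverting SDE.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(a0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - theta * x * dt + np.sqrt(2.0 * theta * dt) * noise
    return x

# A bimodal "action distribution" gets pushed toward N(0, 1).
actions = np.concatenate([np.full(500, -2.0), np.full(500, 2.0)])
out = forward_mean_reverting(actions)
print(out.mean(), out.std())  # approximately 0 and 1
```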
Authors: Jean Pinsolle, Olivier Goudet, Cyrille Enderli, Sylvain Lamprier, Jin-Kao Hao
Abstract: In this paper, we propose a new deinterleaving method for mixtures of discrete renewal Markov chains. This method relies on the maximization of a penalized likelihood score. It exploits all available information about both the sequence of the different symbols and their arrival times. A theoretical analysis is carried out to prove that maximizing this score allows recovery of the true partition of symbols in the large sample limit, under mild conditions on the component processes. This theoretical analysis is then validated by experiments on synthetic data. Finally, the method is applied to deinterleave pulse trains received from different emitters in an RESM (Radar Electronic Support Measures) context, and we show that the proposed method competes favorably with state-of-the-art methods on simulated warfare datasets.
Authors: Jinlong Pang, Jialu Wang, Zhaowei Zhu, Yuanshun Yao, Chen Qian, Yang Liu
Abstract: The pursuit of fairness in machine learning (ML), ensuring that the models do not exhibit biases toward protected demographic groups, typically results in a compromise scenario. This compromise can be explained by a Pareto frontier where given certain resources (e.g., data), reducing the fairness violations often comes at the cost of lowering the model accuracy. In this work, we aim to train models that mitigate group fairness disparity without causing harm to model accuracy. Intuitively, acquiring more data is a natural and promising approach to achieve this goal by reaching a better Pareto frontier of the fairness-accuracy tradeoff. The current data acquisition methods, such as fair active learning approaches, typically require annotating sensitive attributes. However, these sensitive attribute annotations should be protected due to privacy and safety concerns. In this paper, we propose a tractable active data sampling algorithm that does not rely on training group annotations, instead only requiring group annotations on a small validation set. Specifically, the algorithm first scores each new example by its influence on fairness and accuracy evaluated on the validation dataset, and then selects a certain number of examples for training. We theoretically analyze how acquiring more data can improve fairness without causing harm, and validate the feasibility of our sampling approach in the context of risk disparity. We also provide upper bounds on the generalization error and risk disparity, as well as the connections between them. Extensive experiments on real-world data demonstrate the effectiveness of our proposed algorithm. Our code is available at https://github.com/UCSC-REAL/FairnessWithoutHarm.
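To make the scoring step concrete, here is a hedged sketch of one generic way influence-style selection can work: score each candidate by the alignment of its loss gradient with validation accuracy and fairness gradients, then acquire the top-scoring examples. This estimator is a stand-in for illustration, not the paper's exact algorithm; all names and data are hypothetical.

```python
import numpy as np

def score_candidates(grad_candidates, grad_val_acc, grad_val_fair):
    """Score unlabeled-group examples by their estimated influence.

    grad_candidates: (n, d) per-example loss gradients (no group labels needed).
    grad_val_acc:    (d,) gradient of the validation accuracy loss.
    grad_val_fair:   (d,) gradient of the validation fairness violation,
                     computed on the small group-annotated validation set.
    An example helps if a step along its gradient reduces both validation
    losses, i.e. both inner products are positive.
    """
    acc_gain = grad_candidates @ grad_val_acc
    fair_gain = grad_candidates @ grad_val_fair
    # Favor fairness improvement, but veto examples that hurt accuracy.
    return np.where(acc_gain >= 0, fair_gain, -np.inf)

rng = np.random.default_rng(0)
g = rng.standard_normal((1000, 32))        # hypothetical per-example gradients
scores = score_candidates(g, rng.standard_normal(32), rng.standard_normal(32))
selected = np.argsort(scores)[-64:]        # acquire the top-64 examples
```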
Authors: Hongyu Shen, Yici Yan, Zhizhen Zhao
Abstract: Model-X knockoff has garnered significant attention among various feature selection methods due to its guarantees for controlling the false discovery rate (FDR). Since its introduction in parametric design, knockoff techniques have evolved to handle arbitrary data distributions using deep learning-based generative models. However, we have observed limitations in the current implementations of the deep Model-X knockoff framework. Notably, the "swap property" that knockoffs require often faces challenges at the sample level, resulting in diminished selection power. To address these issues, we develop "Deep Dependency Regularized Knockoff (DeepDRK)," a distribution-free deep learning method that effectively balances FDR and power. In DeepDRK, we introduce a novel formulation of the knockoff model as a learning problem under multi-source adversarial attacks. By employing an innovative perturbation technique, we achieve lower FDR and higher power. Our model outperforms existing benchmarks across synthetic, semi-synthetic, and real-world datasets, particularly when sample sizes are small and data distributions are non-Gaussian.
Authors: Sam Adam-Day, Michael Benedikt, \.Ismail \.Ilkan Ceylan, Ben Finkelshtein
Abstract: We present a new angle on the expressive power of graph neural networks (GNNs) by studying how the predictions of real-valued GNN classifiers, such as those classifying graphs probabilistically, evolve as we apply them on larger graphs drawn from some random graph model. We show that the output converges to a constant function, which upper-bounds what these classifiers can uniformly express. This strong convergence phenomenon applies to a very wide class of GNNs, including state-of-the-art models, with aggregates including mean and the attention-based mechanism of graph transformers. Our results apply to a broad class of random graph models, including sparse and dense variants of the Erd\H{o}s-R\'enyi model, the stochastic block model, and the Barab\'asi-Albert model. We empirically validate these findings, observing that the convergence phenomenon appears not only on random graphs but also on some real-world graphs.
Authors: Naichen Shi, Salar Fattahi, Raed Al Kontar
Abstract: In this work, we study the problem of common and unique feature extraction from noisy data. When we have N observation matrices from N different and associated sources corrupted by sparse and potentially gross noise, can we recover the common and unique components from these noisy observations? This is a challenging task as the number of parameters to estimate is approximately thrice the number of observations. Despite the difficulty, we propose an intuitive alternating minimization algorithm called triple component matrix factorization (TCMF) to recover the three components exactly. TCMF is distinguished from existing works in the literature thanks to two salient features. First, TCMF is a principled method that provably separates the three components given noisy observations. Second, the bulk of the computation in TCMF can be distributed. On the technical side, we formulate the problem as a constrained nonconvex nonsmooth optimization problem. Despite the intricate nature of the problem, we provide a Taylor series characterization of its solution by solving the corresponding Karush-Kuhn-Tucker conditions. Using this characterization, we can show that the alternating minimization algorithm makes significant progress at each iteration and converges to the ground truth at a linear rate. Numerical experiments in video segmentation and anomaly detection highlight the superior feature extraction abilities of TCMF.
Authors: Simon Vary, Pierre Ablin, Bin Gao, P. -A. Absil
Abstract: Optimization over the set of matrices $X$ that satisfy $X^\top B X = I_p$, referred to as the generalized Stiefel manifold, appears in many applications involving sampled covariance matrices such as the canonical correlation analysis (CCA), independent component analysis (ICA), and the generalized eigenvalue problem (GEVP). Solving these problems is typically done by iterative methods that require a fully formed $B$. We propose a cheap stochastic iterative method that solves the optimization problem while having access only to random estimates of $B$. Our method does not enforce the constraint in every iteration; instead, it produces iterates that converge to critical points on the generalized Stiefel manifold defined in expectation. The method has lower per-iteration cost, requires only matrix multiplications, and has the same convergence rates as its Riemannian optimization counterparts that require the full matrix $B$. Experiments demonstrate its effectiveness in various machine learning applications involving generalized orthogonality constraints, including CCA, ICA, and the GEVP.
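A minimal sketch of the infeasible, penalty-based flavor of such methods, assuming a quadratic penalty N(X) = 0.25 ||X^T B X - I||_F^2 and unbiased random estimates B_hat of B; it illustrates the idea of approaching the constraint only on average and is not the authors' exact iteration. The toy problem and all constants are hypothetical.

```python
import numpy as np

def stochastic_penalty_step(X, grad_f, B_hat, eta=0.01, lam=10.0):
    """One infeasible step for min f(X) s.t. X^T B X = I_p, sketched.

    Rather than enforcing the generalized Stiefel constraint each iteration
    (which would need the full B), descend on f plus the quadratic penalty
    N(X) = 0.25 * ||X^T B X - I||_F^2 evaluated with a random estimate
    B_hat of B. Iterates approach feasibility only on average.
    """
    p = X.shape[1]
    gap = X.T @ B_hat @ X - np.eye(p)
    grad_penalty = B_hat @ X @ gap       # gradient of N for symmetric B_hat
    return X - eta * (grad_f(X) + lam * grad_penalty)

# Toy run with B = I estimated from minibatches (sampled covariances).
rng = np.random.default_rng(0)
n, p = 20, 3
A = 0.1 * np.eye(n)                      # simple objective f(X) = 0.1 ||X||_F^2
X = 0.1 * rng.standard_normal((n, p))
for _ in range(2000):
    Z = rng.standard_normal((1000, n))   # fresh samples each iteration
    B_hat = Z.T @ Z / 1000               # unbiased random estimate of B
    X = stochastic_penalty_step(X, lambda X: 2 * A @ X, B_hat)
print(np.linalg.norm(X.T @ X - np.eye(p)))  # far smaller than at initialization
```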
Authors: Mudit Gaur, Amrit Singh Bedi, Di Wang, Vaneet Aggarwal
Abstract: The current state-of-the-art theoretical analysis of Actor-Critic (AC) algorithms significantly lags in addressing the practical aspects of AC implementations. This crucial gap needs bridging to bring the analysis in line with practical implementations of AC. To address this, we advocate for considering the MMCLG criteria: \textbf{M}ulti-layer neural network parametrization for actor/critic, \textbf{M}arkovian sampling, \textbf{C}ontinuous state-action spaces, the performance of the \textbf{L}ast iterate, and \textbf{G}lobal optimality. These aspects are practically significant and have been largely overlooked in existing theoretical analyses of AC algorithms. In this work, we address these gaps by providing the first comprehensive theoretical analysis of AC algorithms that encompasses all five crucial practical aspects (covers MMCLG criteria). We establish global convergence sample complexity bounds of $\tilde{\mathcal{O}}\left({\epsilon^{-3}}\right)$. We achieve this result through our novel use of the weak gradient domination property of MDPs and our unique analysis of the error in critic estimation.
Authors: Xiaodan Xing, Fadong Shi, Jiahao Huang, Yinzhe Wu, Yang Nan, Sheng Zhang, Yingying Fang, Mike Roberts, Carola-Bibiane Sch\"onlieb, Javier Del Ser, Guang Yang
Abstract: Generative Artificial Intelligence (AI) technologies and large models are producing realistic outputs across various domains, such as images, text, speech, and music. Creating these advanced generative models requires significant resources, particularly large and high-quality datasets. To minimise training expenses, many algorithm developers use data created by the models themselves as a cost-effective training solution. However, not all synthetic data effectively improve model performance, necessitating a strategic balance in the use of real versus synthetic data to optimise outcomes. Currently, the previously well-controlled integration of real and synthetic data is becoming uncontrollable. The widespread and unregulated dissemination of synthetic data online leads to the contamination of datasets traditionally compiled through web scraping, now mixed with unlabeled synthetic data. This trend, known as the AI autophagy phenomenon, suggests a future where generative AI systems may increasingly consume their own outputs without discernment, raising concerns about model performance, reliability, and ethical implications. What will happen if generative AI continuously consumes itself without discernment? What measures can we take to mitigate the potential adverse effects? To address these research questions, this study examines the existing literature, delving into the consequences of AI autophagy, analyzing the associated risks, and exploring strategies to mitigate its impact. Our aim is to provide a comprehensive perspective on this phenomenon, advocating for a balanced approach that promotes the sustainable development of generative AI technologies in the era of large models.
Authors: Emile Pierret, Bruno Galerne
Abstract: Diffusion or score-based models recently showed high performance in image generation. They rely on forward and backward stochastic differential equations (SDEs). The sampling of a data distribution is achieved by solving numerically the backward SDE or its associated flow ODE. Studying the convergence of these models requires controlling four types of error: initialization, truncation, discretization, and score approximation. In this paper, we study theoretically the behavior of diffusion models and their numerical implementation when the data distribution is Gaussian. In this restricted framework where the score function is a linear operator, we derive the analytical solutions of the backward SDE and the probability flow ODE. We prove that these solutions and their discretizations are all Gaussian processes, which allows us to compute exact Wasserstein errors induced by each error type for any sampling scheme. Monitoring convergence directly in the data space instead of relying on Inception features, our experiments show that the recommended numerical schemes from the diffusion models literature are also the best sampling schemes for Gaussian distributions.
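The quantity that makes exact error monitoring possible in this Gaussian setting is the closed-form 2-Wasserstein distance between Gaussians; a small sketch using the standard formula, with illustrative data:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(mu1, cov1, mu2, cov2):
    """Exact 2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2).

    W2^2 = ||mu1 - mu2||^2
           + tr(cov1 + cov2 - 2 (cov2^{1/2} cov1 cov2^{1/2})^{1/2}).
    The closed form is what allows exact error tracking in data space,
    rather than relying on Inception features.
    """
    s2 = sqrtm(cov2)
    cross = np.real(sqrtm(s2 @ cov1 @ s2))
    w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * cross)
    return np.sqrt(max(w2_sq, 0.0))

# Compare a "sampled" Gaussian against the data distribution.
mu_data, cov_data = np.zeros(2), np.array([[2.0, 0.5], [0.5, 1.0]])
mu_gen, cov_gen = 0.1 * np.ones(2), np.array([[1.8, 0.4], [0.4, 1.1]])
print(w2_gaussian(mu_data, cov_data, mu_gen, cov_gen))
```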
Authors: Sujai Hiremath, Jacqueline R. M. A. Maasch, Mengxiao Gao, Promit Ghosal, Kyra Gan
Abstract: Learning the unique directed acyclic graph corresponding to an unknown causal model is a challenging task. Methods based on functional causal models can identify a unique graph, but either suffer from the curse of dimensionality or impose strong parametric assumptions. To address these challenges, we propose a novel hybrid approach for global causal discovery in observational data that leverages local causal substructures. We first present a topological sorting algorithm that leverages ancestral relationships in linear structural equation models to establish a compact top-down hierarchical ordering, encoding more causal information than linear orderings produced by existing methods. We demonstrate that this approach generalizes to nonlinear settings with arbitrary noise. We then introduce a nonparametric constraint-based algorithm that prunes spurious edges by searching for local conditioning sets, achieving greater accuracy than current methods. We provide theoretical guarantees for correctness and worst-case polynomial time complexities, with empirical validation on synthetic data.
Authors: Dimitrios Bachtis, Giulio Biroli, Aur\'elien Decelle, Beatriz Seoane
Abstract: In this paper, we investigate the feature encoding process in a prototypical energy-based generative model, the Restricted Boltzmann Machine (RBM). We start with an analytical investigation using simplified architectures and data structures, and end with numerical analysis of real training runs on real datasets. Our study tracks the evolution of the model's weight matrix through its singular value decomposition, revealing a series of phase transitions associated with progressive learning of the principal modes of the empirical probability distribution. The model first learns the center of mass of the modes and then progressively resolves all modes through a cascade of phase transitions. We first describe this process in a controlled setup that allows us to study the training dynamics analytically. We then validate our theoretical results by training the Bernoulli-Bernoulli RBM on real datasets. By using datasets of increasing dimension, we show that learning indeed leads to sharp phase transitions in the high-dimensional limit. Moreover, we propose and test a mean-field finite-size scaling hypothesis. This shows that the first phase transition is in the same universality class as the one we studied analytically, which is reminiscent of the mean-field paramagnetic-to-ferromagnetic phase transition.
Authors: Nicolas B\'ereux, Aur\'elien Decelle, Cyril Furtlehner, Lorenzo Rosset, Beatriz Seoane
Abstract: Restricted Boltzmann Machines (RBMs) are effective tools for modeling complex systems and deriving insights from data. However, training these models with highly structured data presents significant challenges due to the slow mixing characteristics of Markov Chain Monte Carlo processes. In this study, we build upon recent theoretical advancements in RBM training to significantly reduce the computational cost of training (particularly on highly clustered datasets), evaluation, and sampling in RBMs in general. The learning process is analogous to thermodynamic continuous phase transitions observed in ferromagnetic models, where new modes in the probability measure emerge in a continuous manner. Such continuous transitions are associated with the critical slowdown effect, which adversely affects the accuracy of gradient estimates, particularly during the initial stages of training with clustered data. To mitigate this issue, we propose a pre-training phase that encodes the principal components into a low-rank RBM through a convex optimization process. This approach enables efficient static Monte Carlo sampling and accurate computation of the partition function. We exploit the continuous and smooth nature of the parameter annealing trajectory to achieve reliable and computationally efficient log-likelihood estimations, enabling online assessment during training, and propose a novel sampling strategy named parallel trajectory tempering (PTT) which outperforms previously optimized MCMC methods. Our results show that this training strategy enables RBMs to effectively address highly structured datasets that conventional methods struggle with. We also provide evidence that our log-likelihood estimation is more accurate than traditional, more computationally intensive approaches in controlled scenarios. The PTT algorithm significantly accelerates MCMC processes compared to existing and conventional methods.
Authors: Jiyuan Tan, Jose Blanchet, Vasilis Syrgkanis
Abstract: Recent progress in Neural Causal Models (NCMs) showcased how identification and partial identification of causal effects can be automatically carried out via training of neural generative models that respect the constraints encoded in a given causal graph [Xia et al. 2022, Balazadeh et al. 2022]. However, formal consistency of these methods has only been proven for discrete variables or for linear causal models. In this work, we prove the consistency of partial identification via NCMs in a general setting with both continuous and categorical variables. Further, our results highlight the impact of the design of the underlying neural network architecture in terms of depth and connectivity as well as the importance of applying Lipschitz regularization in the training phase. In particular, we provide a counterexample showing that without Lipschitz regularization this method may not be asymptotically consistent. Our results are enabled by new results on the approximability of Structural Causal Models (SCMs) via neural generative models, together with an analysis of the sample complexity of the resulting architectures and how that translates into an error in the constrained optimization problem that defines the partial identification bounds.
Authors: Xiaoyun Xu, Zhuoran Liu, Stefanos Koffas, Shujian Yu, Stjepan Picek
Abstract: Backdoor attacks on deep learning represent a recent threat that has gained significant attention in the research community. Backdoor defenses are mainly based on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, it suffers from high computational overhead, and we also find that it overly relies on prominent backdoor features that are highly distinguishable from benign features. To tackle these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially increase the loss of backdoored models with respect to weights to activate the backdoor effect, based on which we can easily differentiate backdoored and clean models. Experimental results demonstrate our defense, BAN, is 1.37$\times$ (on CIFAR-10) and 5.11$\times$ (on ImageNet200) more efficient with an average 9.99\% higher detection success rate than the state-of-the-art defense BTI-DBF. Our code and trained models are publicly available at~\url{https://github.com/xiaoyunxxy/ban}.
Authors: Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele
Abstract: Inference on large language models (LLMs) can be expensive in terms of the compute and memory costs involved, especially when long sequence lengths are used. In particular, the self-attention mechanism used in LLM inference contributes significantly to these costs, which has sparked an interest in approximating the self-attention computation to reduce such costs. In this work, we propose to approximate self-attention by focusing on the dimensionality of key vectors computed in the attention block. Our analysis reveals that key vectors lie in a significantly lower-dimensional space, consistently across several datasets and models. Exploiting this observation, we propose Loki, a novel sparse attention method that ranks and selects tokens in the KV-cache based on attention scores computed in low-dimensional space. Our evaluations show that Loki is able to speed up the attention computation due to reduced data movement (load/store) and compute costs while maintaining the efficacy of the models better than other popular approximation methods.
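A hedged sketch of the two observations behind this approach: (i) checking the low intrinsic dimensionality of cached keys with PCA, and (ii) ranking tokens by attention scores computed in the low-dimensional subspace. The synthetic key matrix and the 90% variance threshold are illustrative choices, not the paper's configuration.

```python
import numpy as np

# K: cached key vectors for one attention head, shape (seq_len, d_head).
# Built low-rank by construction here so the phenomenon is visible.
rng = np.random.default_rng(0)
K = rng.standard_normal((4096, 16)) @ rng.standard_normal((16, 128))

# PCA via SVD: how many directions carry 90% of the keys' variance?
Kc = K - K.mean(axis=0)
_, s, Vt = np.linalg.svd(Kc, full_matrices=False)
var = s**2 / np.sum(s**2)
r = int(np.searchsorted(np.cumsum(var), 0.90)) + 1
print(f"90% of variance in {r} of {K.shape[1]} dimensions")

# Approximate attention scores for a query in the rank-r subspace only,
# then keep the top-k tokens (a Loki-style selection step, sketched).
q = rng.standard_normal(128)
P = Vt[:r].T                          # (d_head, r) projection
approx_scores = (K @ P) @ (P.T @ q)   # in practice K @ P would be cached once
topk = np.argsort(approx_scores)[-64:]
```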
Authors: Abdullah Akg\"ul, Manuel Hau{\ss}mann, Melih Kandemir
Abstract: Current approaches to model-based offline reinforcement learning often incorporate uncertainty-based reward penalization to address the distributional shift problem. These approaches, commonly known as pessimistic value iteration, use Monte Carlo sampling to estimate the Bellman target to perform temporal-difference-based policy evaluation. We find that the randomness caused by this sampling step significantly delays convergence. We present a theoretical result demonstrating the strong dependency of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation. Our main contribution is a deterministic approximation to the Bellman target that uses progressive moment matching, a method developed originally for deterministic variational inference. The resulting algorithm, which we call Moment Matching Offline Model-Based Policy Optimization (MOMBO), propagates the uncertainty of the next state through a nonlinear Q-network in a deterministic fashion by approximating the distributions of hidden layer activations by a normal distribution. We show that it is possible to provide tighter guarantees for the suboptimality of MOMBO than the existing Monte Carlo sampling approaches. We also observe MOMBO to converge faster than these approaches in a large set of benchmark tasks.
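The core primitive of such deterministic moment matching is propagating a Gaussian through a ReLU in closed form; the formulas below are the standard Gaussian-ReLU moments (a sketch of the primitive, not the MOMBO codebase):

```python
import numpy as np
from scipy.stats import norm

def relu_moments(mu, sigma):
    """Closed-form mean/std of ReLU(X) for X ~ N(mu, sigma^2).

    These standard Gaussian-ReLU moment formulas let each hidden layer's
    activation distribution be approximated by a normal, replacing Monte
    Carlo sampling of the Bellman target with a deterministic computation.
    """
    a = mu / sigma
    mean = mu * norm.cdf(a) + sigma * norm.pdf(a)
    second = (mu**2 + sigma**2) * norm.cdf(a) + mu * sigma * norm.pdf(a)
    var = np.maximum(second - mean**2, 1e-12)
    return mean, np.sqrt(var)

# Sanity check against Monte Carlo.
mu, sigma = 0.5, 1.2
m, s = relu_moments(mu, sigma)
x = np.maximum(np.random.default_rng(0).normal(mu, sigma, 1_000_000), 0.0)
print(m, x.mean())  # close
print(s, x.std())   # close
```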
Authors: Zhiyang Wang, Juan Cervino, Alejandro Ribeiro
Abstract: Graph Neural Networks (GNNs) extend convolutional neural networks to operate on graphs. Despite their impressive performances in various graph learning tasks, the theoretical understanding of their generalization capability is still lacking. Previous GNN generalization bounds ignore the underlying graph structures, often leading to bounds that increase with the number of nodes -- a behavior contrary to the one experienced in practice. In this paper, we take a manifold perspective to establish the statistical generalization theory of GNNs on graphs sampled from a manifold in the spectral domain. We prove, and demonstrate empirically, that the generalization bounds of GNNs decrease linearly with the size of the graphs on a logarithmic scale, and increase linearly with the spectral continuity constants of the filter functions. Notably, our theory explains both node-level and graph-level tasks. Our result has two implications: i) guaranteeing the generalization of GNNs to unseen data over manifolds; ii) providing insights into the practical design of GNNs, i.e., restrictions on the discriminability of GNNs are necessary to obtain a better generalization performance. We demonstrate our generalization bounds of GNNs using synthetic and multiple real-world datasets.
Authors: Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
Abstract: Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely unexplored. This study investigates two critical components believed to influence this uncertainty: the recently discovered entropy neurons and a new set of components that we term token frequency neurons. Entropy neurons are characterized by an unusually high weight norm and influence the final layer normalization (LayerNorm) scale to effectively scale down the logits. Our work shows that entropy neurons operate by writing onto an unembedding null space, allowing them to impact the residual stream norm with minimal direct effect on the logits themselves. We observe the presence of entropy neurons across a range of models, up to 7 billion parameters. On the other hand, token frequency neurons, which we discover and describe here for the first time, boost or suppress each token's logit proportionally to its log frequency, thereby shifting the output distribution towards or away from the unigram distribution. Finally, we present a detailed case study where entropy neurons actively manage confidence in the setting of induction, i.e. detecting and continuing repeated subsequences.
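A hedged sketch of the null-space diagnostic: decompose the unembedding with an SVD and measure how much of a neuron's output direction falls on the smallest singular directions, which move the residual-stream norm (and hence the LayerNorm scale) while barely moving the logits. The matrices and the tolerance below are hypothetical; note that a random unembedding, unlike a trained one, may exhibit no such small directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 512, 8192
W_U = rng.standard_normal((d_model, d_vocab))   # unembedding (hypothetical)
w_out = rng.standard_normal(d_model)            # a neuron's output direction

# SVD of the unembedding; directions with tiny singular values form an
# effective null space: writing there changes the residual stream norm
# (hence the final LayerNorm scale) while barely moving the logits.
U, s, _ = np.linalg.svd(W_U, full_matrices=False)
null_mask = s < 0.05 * s.max()                  # tolerance is a choice
coeffs = U.T @ w_out
null_frac = np.sum(coeffs[null_mask] ** 2) / np.sum(coeffs**2)
print(f"fraction of w_out in effective null space: {null_frac:.3f}")
```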
Authors: Matthijs Pals, A Erdem Sa\u{g}tekin, Felix Pei, Manuel Gloeckler, Jakob H Macke
Abstract: A central aim in computational neuroscience is to relate the activity of large populations of neurons to an underlying dynamical system. Models of these neural dynamics should ideally be both interpretable and fit the observed data well. Low-rank recurrent neural networks (RNNs) exhibit such interpretability by having tractable dynamics. However, it is unclear how to best fit low-rank RNNs to data consisting of noisy observations of an underlying stochastic system. Here, we propose to fit stochastic low-rank RNNs with variational sequential Monte Carlo methods. We validate our method on several datasets consisting of both continuous and spiking neural data, where we obtain lower-dimensional latent dynamics than current state-of-the-art methods. Additionally, for low-rank models with piecewise linear nonlinearities, we show how to efficiently identify all fixed points in polynomial rather than exponential cost in the number of units, making analysis of the inferred dynamics tractable for large RNNs. Our method both elucidates the dynamical systems underlying experimental recordings and provides a generative model whose trajectories match observed variability.
Authors: Tian Tian, Lin F. Yang, Csaba Szepesv\'ari
Abstract: The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives while maximizing cumulative reward. However, how to learn efficiently in a CMDP environment with a potentially infinite number of states remains an open question, particularly when function approximation is applied to the value functions. In this paper, we address the learning problem given linear function approximation with $q_{\pi}$-realizability, where the value functions of all policies are linearly representable with a known feature map, a setting known to be more general and challenging than other linear settings. Utilizing a local-access model, we propose a novel primal-dual algorithm that, after $\tilde{O}(\text{poly}(d) \epsilon^{-3})$ queries, outputs with high probability a policy that strictly satisfies the constraints while nearly optimizing the value with respect to a reward function. Here, $d$ is the feature dimension and $\epsilon > 0$ is a given error. The algorithm relies on a carefully crafted off-policy evaluation procedure to evaluate the policy using historical data, which informs policy updates through policy gradients and conserves samples. To our knowledge, this is the first result achieving polynomial sample complexity for CMDP in the $q_{\pi}$-realizable setting.
Authors: Bogdan Kulynych, Juan Felipe Gomez, Georgios Kaissis, Flavio du Pin Calmon, Carmela Troncoso
Abstract: Differential privacy (DP) is a widely used approach for mitigating privacy risks when training machine learning models on sensitive data. DP mechanisms add noise during training to limit the risk of information leakage. The scale of the added noise is critical, as it determines the trade-off between privacy and utility. The standard practice is to select the noise scale to satisfy a given privacy budget $\varepsilon$. This privacy budget is in turn interpreted in terms of operational attack risks, such as accuracy, sensitivity, and specificity of inference attacks aimed to recover information about the training data records. We show that first calibrating the noise scale to a privacy budget $\varepsilon$, and then translating $\varepsilon$ to attack risk leads to overly conservative risk assessments and unnecessarily low utility. Instead, we propose methods to directly calibrate the noise scale to a desired attack risk level, bypassing the step of choosing $\varepsilon$. For a given notion of attack risk, our approach significantly decreases noise scale, leading to increased utility at the same level of privacy. We empirically demonstrate that calibrating noise to attack sensitivity/specificity, rather than $\varepsilon$, when training privacy-preserving ML models substantially improves model accuracy for the same risk level. Our work provides a principled and practical way to improve the utility of privacy-preserving ML without compromising on privacy. The code is available at https://github.com/Felipe-Gomez/riskcal
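For intuition on direct calibration, consider the Gaussian mechanism under its hypothesis-testing characterization ($\mu$-GDP, Dong et al.): an attacker operating at false positive rate alpha achieves true positive rate at most Phi(Phi^{-1}(alpha) + mu), with mu = sensitivity / noise scale. Inverting this gives a noise scale for a target attack risk. The sketch below covers a single mechanism invocation (training a full model additionally requires composition accounting) and is illustrative, not the paper's code.

```python
from scipy.stats import norm

def noise_for_attack_risk(alpha, max_tpr, sensitivity=1.0):
    """Smallest Gaussian-mechanism noise scale meeting an attack-risk target.

    Under mu-GDP, any membership attack operating at false positive rate
    alpha achieves true positive rate at most
        Phi(Phi^{-1}(alpha) + mu),   with mu = sensitivity / noise_scale.
    Calibrating directly to (alpha, max_tpr) sidesteps choosing epsilon.
    """
    mu = norm.ppf(max_tpr) - norm.ppf(alpha)
    if mu <= 0:
        raise ValueError("target risk is already met without any noise constraint")
    return sensitivity / mu

# Require: attacks at 1% FPR succeed at most 5% of the time.
sigma = noise_for_attack_risk(alpha=0.01, max_tpr=0.05)
print(f"noise multiplier: {sigma:.3f}")
```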
Authors: Thomas Kwa, Drake Thomas, Adri\`a Garriga-Alonso
Abstract: When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error. It is common to mitigate this by regularizing the policy with KL divergence from a base model, with the hope that balancing reward with regularization will achieve desirable outcomes despite this reward misspecification. We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model--a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization.
Authors: Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goul\~ao, Andreas Kallinteris, Markus Krimmel, Arjun KG, Rodrigo Perez-Vicente, Andrea Pierr\'e, Sander Schulhoff, Jun Jet Tai, Hannah Tan, Omar G. Younis
Abstract: Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at https://github.com/Farama-Foundation/Gymnasium
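For reference, the standard agent-environment loop under Gymnasium's API looks as follows; CartPole and the random policy are placeholders for any registered environment and any agent:

```python
import gymnasium as gym

# The standard agent-environment loop against Gymnasium's API.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
episode_return = 0.0
for _ in range(1000):
    action = env.action_space.sample()          # replace with a policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:                 # episode end vs. time limit
        obs, info = env.reset()
        episode_return = 0.0
env.close()
```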
Authors: Chunan Liu, Lilian Denzler, Yihong Chen, Andrew Martin, Brooks Paige
Abstract: Epitope identification is vital for antibody design yet challenging due to the inherent variability in antibodies. While many deep learning methods have been developed for general protein binding site prediction tasks, whether they work for epitope prediction remains an understudied research question. The challenge is also heightened by the lack of a consistent evaluation pipeline with sufficient dataset size and epitope diversity. We introduce a filtered antibody-antigen complex structure dataset, AsEP (Antibody-specific Epitope Prediction). AsEP is the largest of its kind and provides clustered epitope groups, allowing the community to develop and test novel epitope prediction methods and evaluate their generalisability. AsEP comes with an easy-to-use interface in Python and pre-built graph representations of each antibody-antigen complex while also supporting customizable embedding methods. Using this new dataset, we benchmark several representative general protein-binding site prediction methods and find that their performances fall short of expectations for epitope prediction. To address this, we propose a novel method, WALLE, which leverages both unstructured modeling from protein language models and structural modeling from graph neural networks. WALLE demonstrates a 3-10X performance improvement over the baseline methods. Our empirical findings suggest that epitope prediction benefits from combining sequential features provided by language models with geometrical information from graph representations. This provides a guideline for future epitope prediction method design. In addition, we reformulate the task as bipartite link prediction, allowing convenient model performance attribution and interpretability. We open source our data and code at https://github.com/biochunan/AsEP-dataset.
Authors: Xavier Gonzalez, Andrew Warrington, Jimmy T. H. Smith, Scott W. Linderman
Abstract: Conventional nonlinear RNNs are not naturally parallelizable across the sequence length, unlike transformers and linear RNNs. Lim et al. (2024) therefore tackle parallelized evaluation of nonlinear RNNs, posing it as a fixed point problem solved with Newton's method. By deriving and applying a parallelized form of Newton's method, they achieve large speedups over sequential evaluation. However, their approach inherits cubic computational complexity and numerical instability. We tackle these weaknesses. To reduce the computational complexity, we apply quasi-Newton approximations and show they converge comparably, use less memory, and are faster, compared to full-Newton. To stabilize Newton's method, we leverage a connection between Newton's method damped with trust regions and Kalman smoothing. This connection allows us to stabilize the iteration, per the trust region, and use efficient parallelized Kalman algorithms to retain performance. We compare these methods empirically and highlight use cases where each algorithm excels.
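The underlying fixed-point formulation can be sketched without the Newton machinery: treat all hidden states as unknowns and update every timestep in parallel from the previous sweep (a Jacobi/Picard iteration). After at most T sweeps this matches sequential evaluation exactly; Newton and quasi-Newton methods accelerate exactly this iteration. A minimal illustrative sketch, not the paper's algorithm:

```python
import numpy as np

def rnn_states_parallel(W, U, x, n_iters=50, tol=1e-8):
    """Evaluate h_t = tanh(W h_{t-1} + U x_t) for all t via fixed-point iteration.

    Each sweep updates every timestep simultaneously from the previous
    sweep's states (Jacobi style), so the work parallelizes across the
    sequence. Plain iteration is shown for clarity; Newton-type methods
    refine this same fixed-point problem.
    """
    T, _ = x.shape
    d = W.shape[0]
    h = np.zeros((T, d))
    for _ in range(n_iters):
        h_prev = np.vstack([np.zeros((1, d)), h[:-1]])   # shift states by one
        h_new = np.tanh(h_prev @ W.T + x @ U.T)          # all t at once
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    return h

rng = np.random.default_rng(0)
T, d, k = 64, 8, 4
W, U = 0.5 * rng.standard_normal((d, d)), rng.standard_normal((d, k))
x = rng.standard_normal((T, k))

# Check against sequential evaluation.
h_seq, hp = np.zeros((T, d)), np.zeros(d)
for t in range(T):
    hp = np.tanh(W @ hp + U @ x[t]); h_seq[t] = hp
print(np.max(np.abs(rnn_states_parallel(W, U, x, n_iters=T) - h_seq)))  # ~0
```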
Authors: Chang-Wei Shi, Yi-Rui Yang, Wu-Jun Li
Abstract: Distributed learning is essential for training large-scale deep models. Asynchronous SGD (ASGD) and its variants are commonly used distributed learning methods, particularly in scenarios where the computing capabilities of workers in the cluster are heterogeneous. Momentum has been acknowledged for its benefits in both optimization and generalization in deep model training. However, existing works have found that naively incorporating momentum into ASGD can impede the convergence. In this paper, we propose a novel method called ordered momentum (OrMo) for ASGD. In OrMo, momentum is incorporated into ASGD by organizing the gradients in order based on their iteration indexes. We theoretically prove the convergence of OrMo with both constant and delay-adaptive learning rates for non-convex problems. To the best of our knowledge, this is the first work to establish the convergence analysis of ASGD with momentum without dependence on the maximum delay. Empirical results demonstrate that OrMo can achieve better convergence performance compared with ASGD and other asynchronous methods with momentum.
Authors: Vincenzo Marco De Luca, Antonio Longa, Andrea Passerini, Pietro Li\`o
Abstract: Graph Neural Networks (GNNs) have emerged as the predominant paradigm for learning from graph-structured data, offering a wide range of applications from social network analysis to bioinformatics. Despite their versatility, GNNs face challenges such as lack of generalization and poor interpretability, which hinder their wider adoption and reliability in critical applications. Dropping has emerged as an effective paradigm for improving the generalization capabilities of GNNs. However, existing approaches often rely on random or heuristic-based selection criteria, lacking a principled method to identify and exclude nodes that contribute to noise and over-complexity in the model. In this work, we argue that explainability should be a key indicator of a model's quality throughout its training phase. To this end, we introduce xAI-Drop, a novel topological-level dropping regularizer that leverages explainability to pinpoint noisy network elements to be excluded from the GNN propagation mechanism. An empirical evaluation on diverse real-world datasets demonstrates that our method outperforms current state-of-the-art dropping approaches in accuracy, and improves explanation quality.
Authors: Ethan Che, Daniel R. Jiang, Hongseok Namkoong, Jimmy Wang
Abstract: Real-world experiments involve batched & delayed feedback, non-stationarity, multiple objectives & constraints, and (often some) personalization. Tailoring adaptive methods to address these challenges on a per-problem basis is infeasible, and static designs remain the de facto standard. Focusing on short-horizon ($\le 10$) adaptive experiments, we move away from bespoke algorithms and present a mathematical programming formulation that can flexibly incorporate a wide range of objectives, constraints, and statistical procedures. We formulate a dynamic program based on central limit approximations, which enables the use of scalable optimization methods based on auto-differentiation and GPU parallelization. To evaluate our framework, we implement a simple heuristic planning method ("solver") and benchmark it across hundreds of problem instances involving non-stationarity, personalization, and multiple objectives & constraints. Unlike bespoke methods (e.g., Thompson sampling variants), our mathematical programming framework provides consistent gains over static randomized control trials and exhibits robust performance across problem instances.
Authors: Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy
Abstract: Large language models (LLMs) represent a groundbreaking advancement in the domain of natural language processing due to their impressive reasoning abilities. Recently, there has been considerable interest in increasing the context lengths for these models to enhance their applicability to complex tasks. However, at long context lengths and large batch sizes, the key-value (KV) cache, which stores the attention keys and values, emerges as the new bottleneck in memory usage during inference. To address this, we propose Eigen Attention, which performs the attention operation in a low-rank space, thereby reducing the KV cache memory overhead. Our proposed approach is orthogonal to existing KV cache compression techniques and can be used synergistically with them. Through extensive experiments over OPT, MPT, and Llama model families, we demonstrate that Eigen Attention results in up to 40% reduction in KV cache sizes and up to 60% reduction in attention operation latency with minimal drop in performance. Code is available at https://github.com/UtkarshSaxena1/EigenAttn.
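A hedged sketch of attention in a low-rank basis: obtain directions from calibration keys, cache projected keys, and form attention scores in r dimensions. The calibration scheme, the sqrt(r) scaling, and all shapes below are illustrative choices, not the paper's implementation.

```python
import numpy as np

def low_rank_attention(Q, K, V, basis):
    """Attention computed in a rank-r basis so the KV cache can store r dims.

    basis: (d_head, r) orthonormal directions, e.g. the top right-singular
    vectors of calibration-set keys. Caching K @ basis instead of K shrinks
    the key cache by a factor of d_head / r.
    """
    Qr, Kr = Q @ basis, K @ basis                  # project to r dims
    scores = Qr @ Kr.T / np.sqrt(basis.shape[1])   # scaling is a choice here
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d, r, T = 128, 32, 256
mix = rng.standard_normal((r, d))                  # keys concentrate in r dims
K = rng.standard_normal((T, r)) @ mix
Q = rng.standard_normal((T, r)) @ mix
V = rng.standard_normal((T, d))
_, _, Vt = np.linalg.svd(K, full_matrices=False)   # "calibration" step
out = low_rank_attention(Q, K, V, Vt[:r].T)
```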
Authors: Zhuoping Zhou, Davoud Ataee Tarzanagh, Bojian Hou, Qi Long, Li Shen
Abstract: This paper examines the issue of fairness in the estimation of graphical models (GMs), particularly Gaussian, Covariance, and Ising models. These models play a vital role in understanding complex relationships in high-dimensional data. However, standard GMs can result in biased outcomes, especially when the underlying data involves sensitive characteristics or protected groups. To address this, we introduce a comprehensive framework designed to reduce bias in the estimation of GMs related to protected attributes. Our approach involves the integration of the pairwise graph disparity error and a tailored loss function into a nonsmooth multi-objective optimization problem, striving to achieve fairness across different sensitive groups while maintaining the effectiveness of the GMs. Experimental evaluations on synthetic and real-world datasets demonstrate that our framework effectively mitigates bias without undermining GMs' performance.
Authors: Zeyu Zhou, Tianci Liu, Ruqi Bai, Jing Gao, Murat Kocaoglu, David I. Inouye
Abstract: In high-stakes domains such as healthcare and hiring, the role of machine learning (ML) in decision-making raises significant fairness concerns. This work focuses on Counterfactual Fairness (CF), which posits that an ML model's outcome on any individual should remain unchanged if they had belonged to a different demographic group. Previous works have proposed methods that guarantee CF. Notwithstanding, their effects on the model's predictive performance remain largely unclear. To fill this gap, we provide a theoretical study on the inherent trade-off between CF and predictive performance in a model-agnostic manner. We first propose a simple but effective method to cast an optimal but potentially unfair predictor into a fair one without losing the optimality. By analyzing its excess risk in order to achieve CF, we quantify this inherent trade-off. Further analysis on our method's performance with access to only incomplete causal knowledge is also conducted. Built upon it, we propose a performant algorithm that can be applied in such scenarios. Experiments on both synthetic and semi-synthetic datasets demonstrate the validity of our analysis and methods.
Authors: Qinglong Ma, Peizhi Zhao, Sen Wang, Tao Song
Abstract: In recent years, solving partial differential equations has shifted the focus of neural network research from finite-dimensional Euclidean spaces to generalized function spaces. A novel methodology is to learn an operator as a means of approximating the mapping between input and output function spaces. Researchers have proposed a variety of operator architectures, but the majority adopt an iterative update design in which a single operator is learned within the same function space. In practical physical science problems, the numerical solutions of partial differential equations are complex, and a single serial operator is unable to accurately approximate the intricate mapping between input and output. We therefore propose a deep parallel operator model (DPNO) for efficiently and accurately solving partial differential equations. DPNO employs convolutional neural networks to extract local features and map data into distinct latent spaces, and designs parallel blocks of double Fourier neural operators to mitigate the iterative error problem. DPNO approximates complex mappings between inputs and outputs by learning multiple operators in different latent spaces within its parallel blocks. DPNO achieved the best performance on five of the benchmark datasets, with an average improvement of 10.5\%, and ranked second on one dataset.
Authors: Deheng Yuan, Tao Guo, Zhongyi Huang
Abstract: Consider the communication-constrained estimation of discrete distributions under $\ell^p$ losses, where each distributed terminal holds multiple independent samples and uses a limited number of bits to describe the samples. We obtain the minimax optimal rates of the problem in most parameter regimes. An elbow effect of the optimal rates at $p=2$ is clearly identified. To show the optimal rates, we first design estimation protocols to achieve them. The key ingredient of these protocols is to introduce adaptive refinement mechanisms, which first generate a rough estimate from partial information and then establish a refined estimate in subsequent steps guided by the rough estimate. The protocols leverage successive refinement, sample compression, thresholding and random hashing methods to achieve the optimal rates in different parameter regimes. The optimality of the protocols is shown by deriving compatible minimax lower bounds.
Authors: Ryan Liu, Jiayi Geng, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, Thomas L. Griffiths
Abstract: Chain-of-thought (CoT) prompting has become a widely used strategy for working with large language and multimodal models. While CoT has been shown to improve performance across many tasks, determining the settings in which it is effective remains an ongoing effort. In particular, it is still an open question in what settings CoT systematically reduces model performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, looking at cases where (i) verbal thinking or deliberation hurts performance in humans, and (ii) the constraints governing human performance generalize to language models. Three such cases are implicit statistical learning, visual recognition, and classifying with patterns containing exceptions. In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to a 36.3% absolute accuracy drop for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts. We also identify three tasks that satisfy condition (i) but not (ii), and find that while verbal thinking reduces human performance in these tasks, CoT retains or increases model performance. Overall, our results show that while there is not an exact parallel between the cognitive processes of models and those of humans, considering cases where thinking has negative consequences for human performance can help us identify settings where it negatively impacts models. By connecting the literature on human deliberation with evaluations of CoT, we offer a new tool that can be used in understanding the impact of prompt choices and inference-time reasoning.
Authors: Zeyuan Li, Yangfan He, Lewei He, Jianhui Wang, Tianyu Shi, Bin Lei, Yuchen Li, Qiuwu Chen
Abstract: Recently, large language models (LLMs) have achieved significant progress in automated code generation. Despite their strong instruction-following capabilities, these models frequently struggle to align with user intent in coding scenarios. In particular, they are hampered by datasets that lack diversity and fail to address specialized tasks or edge cases. Furthermore, challenges in supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) lead to failures in generating precise, human-intent-aligned code. To tackle these challenges and improve the code generation performance for automated programming systems, we propose Feedback-driven Adaptive Long/short-term memory reinforced Coding Optimization (i.e., FALCON). FALCON is structured into two hierarchical levels. At the global level, long-term memory improves code quality by retaining and applying learned knowledge. At the local level, short-term memory allows for the incorporation of immediate feedback from compilers and AI systems. Additionally, we introduce meta-reinforcement learning with feedback rewards to solve the global-local bi-level optimization problem and enhance the model's adaptability across diverse code generation tasks. Extensive experiments demonstrate that our technique achieves state-of-the-art performance, leading other reinforcement learning methods by more than 4.5 percentage points on the MBPP benchmark and 6.1 percentage points on the HumanEval benchmark. The open-sourced code is publicly available at https://github.com/titurte/FALCON.
Authors: Alexander Christie, Matan Leibovich, Miguel Moscoso, Alexei Novikov, George Papanicolaou, Chrysoula Tsogka
Abstract: We propose a methodology that exploits large and diverse data sets to accurately estimate the ambient medium's Green's functions in strongly scattering media. Given these estimates, obtained with and without the use of neural networks, excellent imaging results are achieved, with a resolution that is better than that of a homogeneous medium. This phenomenon, also known as super-resolution, occurs because the ambient scattering medium effectively enhances the physical imaging aperture.
Authors: Armand Kassa\"i Koupa\"i, Jorge Mifsut Benet, Yuan Yin, Jean-No\"el Vittaut, Patrick Gallinari
Abstract: Solving parametric partial differential equations (PDEs) presents significant challenges for data-driven methods due to the sensitivity of spatio-temporal dynamics to variations in PDE parameters. Machine learning approaches often struggle to capture this variability. To address this, data-driven approaches learn parametric PDEs by sampling a very large variety of trajectories with varying PDE parameters. We first show that incorporating conditioning mechanisms for learning parametric PDEs is essential and that among them, $\textit{adaptive conditioning}$, allows stronger generalization. As existing adaptive conditioning methods do not scale well with respect to the number of parameters to adapt in the neural solver, we propose GEPS, a simple adaptation mechanism to boost GEneralization in Pde Solvers via a first-order optimization and low-rank rapid adaptation of a small set of context parameters. We demonstrate the versatility of our approach for both fully data-driven and physics-aware neural solvers. Validation performed on a whole range of spatio-temporal forecasting problems demonstrates excellent performance for generalizing to unseen conditions including initial conditions, PDE coefficients, forcing terms and solution domain. $\textit{Project page}$: https://geps-project.github.io
Authors: Hamidreza Eivazi, Andr\'e Hebenbrock, Raphael Ginster, Steffen Bl\"omeke, Stefan Wittek, Christoph Herrmann, Thomas S. Spengler, Thomas Turek, Andreas Rausch
Abstract: Battery degradation remains a critical challenge in the pursuit of green technologies and sustainable energy solutions. Despite significant research efforts, predicting battery capacity loss accurately remains a formidable task due to its complex nature, influenced by both aging and cycling behaviors. To address this challenge, we introduce a novel general-purpose model for battery degradation prediction and synthesis, DiffBatt. Leveraging an innovative combination of conditional and unconditional diffusion models with classifier-free guidance and transformer architecture, DiffBatt achieves high expressivity and scalability. DiffBatt operates as a probabilistic model to capture uncertainty in aging behaviors and a generative model to simulate battery degradation. The model excels in prediction tasks while also enabling the generation of synthetic degradation curves, facilitating enhanced model training by data augmentation. In the remaining useful life prediction task, DiffBatt provides accurate results with a mean RMSE of 196 cycles across all datasets, outperforming all other models and demonstrating superior generalizability. This work represents an important step towards developing foundational models for battery degradation.
Authors: Kola Ayonrinde
Abstract: Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a sparsifying activation function that implicitly defines a set of token-feature matches. We frame the token-feature matching as a resource allocation problem constrained by a total sparsity upper bound. For example, TopK SAEs solve this allocation problem with the additional constraint that each token matches with at most $k$ features. In TopK SAEs, the $k$ active features per token constraint is the same across tokens, despite some tokens being more difficult to reconstruct than others. To address this limitation, we propose two novel SAE variants, Feature Choice SAEs and Mutual Choice SAEs, which each allow for a variable number of active features per token. Feature Choice SAEs solve the sparsity allocation problem under the additional constraint that each feature matches with at most $m$ tokens. Mutual Choice SAEs solve the unrestricted allocation problem where the total sparsity budget can be allocated freely between tokens and features. Additionally, we introduce a new auxiliary loss function, $\mathtt{aux\_zipf\_loss}$, which generalises the $\mathtt{aux\_k\_loss}$ to mitigate dead and underutilised features. Our methods result in SAEs with fewer dead features and improved reconstruction loss at equivalent sparsity levels as a result of the inherent adaptive computation. More accurate and scalable feature extraction methods provide a path towards better understanding and more precise control of foundation models.
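The three allocation schemes can be contrasted directly as masking rules on a (tokens x features) pre-activation matrix under the same total sparsity budget; a minimal sketch with synthetic activations (shapes and budgets are illustrative):

```python
import numpy as np

def topk_mask(acts, k):
    """TopK SAE: every token (row) keeps exactly its k largest pre-activations."""
    mask = np.zeros_like(acts, dtype=bool)
    np.put_along_axis(mask, np.argsort(acts, axis=1)[:, -k:], True, axis=1)
    return mask

def feature_choice_mask(acts, m):
    """Feature Choice SAE: every feature (column) keeps its m largest tokens."""
    mask = np.zeros_like(acts, dtype=bool)
    np.put_along_axis(mask, np.argsort(acts, axis=0)[-m:, :], True, axis=0)
    return mask

def mutual_choice_mask(acts, budget):
    """Mutual Choice SAE: keep the `budget` largest matches anywhere."""
    thresh = np.partition(acts.ravel(), -budget)[-budget]
    return acts >= thresh

rng = np.random.default_rng(0)
n_tok, n_feat, k = 64, 128, 4
acts = rng.standard_normal((n_tok, n_feat))   # (tokens, features) pre-activations
budget = n_tok * k                            # identical total sparsity budget
masks = {
    "topk": topk_mask(acts, k),
    "feature_choice": feature_choice_mask(acts, budget // n_feat),
    "mutual_choice": mutual_choice_mask(acts, budget),
}
for name, m in masks.items():
    per_token = m.sum(axis=1)
    print(name, int(m.sum()), per_token.min(), per_token.max())
# TopK gives every token exactly k features; the other two let per-token
# counts vary, allocating more features to harder-to-reconstruct tokens.
```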
Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Taiji Suzuki, Qingfu Zhang, Hau-San Wong
Abstract: Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. Existing empirical studies have revealed a strong connection between these LLMs' impressive emergence abilities and their in-context learning (ICL) capacity, allowing them to solve new tasks using only task-specific prompts without further fine-tuning. On the other hand, existing empirical and theoretical studies also show that there is a linear regularity of the multi-concept encoded semantic representation behind transformer-based LLMs. However, existing theoretical work fails to build an understanding of the connection between this regularity and the innovative power of ICL. Additionally, prior work often focuses on simplified, unrealistic scenarios involving linear transformers or unrealistic loss functions, and achieves only linear or sub-linear convergence rates. In contrast, this work provides a fine-grained mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities, offering insights into how transformers innovate solutions for certain unseen tasks encoded with multiple cross-concept semantics. Inspired by empirical studies on the linear latent geometry of LLMs, the analysis is based on a concept-based low-noise sparse coding prompt model. Leveraging advanced techniques, this work showcases the exponential 0-1 loss convergence over the highly non-convex training dynamics, and is the first to jointly incorporate the challenges of softmax self-attention, ReLU-activated MLPs, and cross-entropy loss. Empirical simulations corroborate the theoretical findings.
Authors: Ethan Baron, Victor Hau, Zeke Weng
Abstract: Using data from professional bouldering competitions from 2008 to 2022, we train a logistic regression to predict climber results and measure climber skill. However, this approach is limited, as a single numeric coefficient per climber cannot adequately capture the intricacies of climbers' varying strengths and weaknesses in different boulder problems. For example, some climbers might prefer more static, technical routes while other climbers may specialize in powerful, dynamic problems. To this end, we apply Probabilistic Matrix Factorization (PMF), a framework commonly used in recommender systems, to represent the unique characteristics of climbers and problems with latent, multi-dimensional vectors. In this framework, a climber's performance on a given problem is predicted by the dot product of the corresponding climber and problem vectors. PMF effectively handles sparse datasets, such as our dataset where only a subset of climbers attempt each particular problem, by extrapolating patterns from similar climbers. We contrast the empirical performance of PMF to the logistic regression approach and investigate the multivariate representations produced by PMF to gain insights into climber characteristics. Our results show that the multivariate PMF representations improve predictive performance on professional bouldering competitions by capturing both the overall strength of climbers and their specialized skill sets. We provide our code open-source at https://github.com/baronet2/boulder2vec.
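Code sketch: The dot-product prediction at the core of this framework fits in a few lines; the data below are randomly generated stand-ins, not the competition dataset, and the training loop is plain SGD on the logistic loss.

    import numpy as np

    rng = np.random.default_rng(1)
    n_climbers, n_problems, dim = 30, 50, 4
    U = 0.1 * rng.standard_normal((n_climbers, dim))   # climber vectors
    V = 0.1 * rng.standard_normal((n_problems, dim))   # problem vectors

    # Sparse observations: (climber, problem, sent?) triples.
    obs = [(rng.integers(n_climbers), rng.integers(n_problems), rng.integers(2))
           for _ in range(400)]

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    lr, reg = 0.1, 0.01
    for _ in range(50):
        for c, p, y in obs:
            u, v = U[c].copy(), V[p].copy()
            grad = sigmoid(u @ v) - y            # gradient of the log loss wrt the logit
            U[c] -= lr * (grad * v + reg * u)
            V[p] -= lr * (grad * u + reg * v)

    c, p, _ = obs[0]
    print(f"P(climber {c} sends problem {p}) = {sigmoid(U[c] @ V[p]):.2f}")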
Authors: Mateusz Cedro, David Martens
Abstract: Graph Neural Networks (GNNs) are a powerful technique for machine learning on graph-structured data, yet they pose interpretability challenges, especially for non-expert users. Existing GNN explanation methods often yield technical outputs such as subgraphs and feature importance scores, which are not easily understood. Building on recent insights from social science and other Explainable AI (XAI) methods, we propose GraphXAIN, a natural language narrative that explains individual predictions made by GNNs. We present a model-agnostic and explainer-agnostic XAI approach that complements graph explainers by generating GraphXAINs, using Large Language Models (LLMs) and integrating graph data, individual predictions from GNNs, explanatory subgraphs, and feature importances. We define XAI Narratives and XAI Descriptions, highlighting their distinctions and emphasizing the importance of narrative principles in effective explanations. By incorporating natural language narratives, our approach supports graph practitioners and non-expert users, aligning with social science research on explainability and enhancing user understanding and trust in complex GNN models. We demonstrate GraphXAIN's capabilities on a real-world graph dataset, illustrating how its generated narratives can aid understanding compared to traditional graph explainer outputs or other descriptive explanation methods.
Authors: Benedict Aaron Tjandra, Federico Barbero, Michael Bronstein
Abstract: Despite the successful application of Temporal Graph Networks (TGNs) for tasks such as dynamic node classification and link prediction, they still perform poorly on the task of dynamic node affinity prediction -- where the goal is to predict 'how much' two nodes will interact in the future. In fact, simple heuristic approaches such as persistent forecasts and moving averages over ground-truth labels significantly and consistently outperform TGNs. Building on this observation, we find that computing heuristics over messages is an equally competitive approach, outperforming TGN and all current temporal graph (TG) models on dynamic node affinity prediction. In this paper, we prove that no formulation of TGN can represent persistent forecasting or moving averages over messages, and propose to enhance the expressivity of TGNs by adding source-target identification to each interaction event message. We show that this modification is required to represent persistent forecasting, moving averages, and the broader class of autoregressive models over messages. Our proposed method, TGNv2, significantly outperforms TGN and all current TG models on all Temporal Graph Benchmark (TGB) dynamic node affinity prediction datasets.
Authors: Themistoklis Haris
Abstract: Despite their power, Transformers face challenges with long sequences due to the quadratic complexity of self-attention. To address this limitation, methods like $k$-Nearest-Neighbor ($k$NN) attention have been introduced [Roy, Saffar, Vaswani, Grangier, 2021] enabling each token to attend to only its $k$ closest tokens. While $k$NN attention has shown empirical success in making Transformers more efficient, its exact approximation guarantees have not been theoretically analyzed. In this work, we establish a theoretical framework for $k$NN attention, reformulating self-attention as expectations over softmax distributions and leveraging lazy Gumbel sampling [Mussmann, Levy, Ermon, 2017] with $k$NN indices for efficient approximation. Building on this framework, we also propose novel sub-quadratic algorithms that approximate self-attention gradients by leveraging efficient sampling techniques, such as Markov Chain-based estimation. Finally, we demonstrate the practical effectiveness of these algorithms through empirical experiments, showcasing their benefits in both training and inference.
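Code sketch: The core approximation is easy to state in NumPy (an illustration of the idea, not the paper's algorithms or guarantees): each query attends only to its k highest-scoring keys, which in practice a kNN index would retrieve without scoring all n keys.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d, k = 64, 16, 8
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    scores = Q @ K.T / np.sqrt(d)
    full = softmax(scores) @ V                    # exact O(n^2) attention

    knn_idx = np.argsort(scores, axis=1)[:, -k:]  # a kNN index would avoid full scoring
    approx = np.zeros_like(full)
    for i in range(n):
        idx = knn_idx[i]
        approx[i] = softmax(scores[i, idx]) @ V[idx]

    print("mean abs error:", np.abs(full - approx).mean())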
Authors: Hala Hawashin, Mehrnoosh Sadrzadeh
Abstract: While large language models (LLMs) have advanced the field of natural language processing (NLP), their "black box" nature obscures their decision-making processes. To address this, researchers developed structured approaches using higher order tensors. These are able to model linguistic relations, but stall when training on classical computers due to their excessive size. Tensors are natural inhabitants of quantum systems and training on quantum computers provides a solution by translating text to variational quantum circuits. In this paper, we develop MultiQ-NLP: a framework for structure-aware data processing with multimodal text+image data. Here, "structure" refers to syntactic and grammatical relationships in language, as well as the hierarchical organization of visual elements in images. We enrich the translation with new types and type homomorphisms and develop novel architectures to represent structure. When tested on a mainstream image classification task (SVO Probes), our best model performed on par with the state-of-the-art classical models; moreover, the best model was fully structured.
Authors: Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Natraj Raman, Sriram Gopalakrishnan, Tanmoy Chakraborty
Abstract: Large Language Models (LLMs) are highly resource-intensive to fine-tune due to their enormous size. While low-rank adaptation is a prominent parameter-efficient fine-tuning approach, it suffers from sensitivity to hyperparameter choices, leading to unstable model performance on downstream fine-tuning tasks. This paper highlights the importance of effective parameterization in low-rank fine-tuning to reduce estimator variance and enhance the stability of final model outputs. We propose MonteCLoRA, an efficient fine-tuning technique, employing Monte Carlo estimation to learn an unbiased posterior estimation of low-rank parameters with low expected variance, which stabilizes fine-tuned LLMs with only O(1) additional parameters. MonteCLoRA shows significant improvements in accuracy and robustness, achieving up to 3.8% higher accuracy and 8.6% greater robustness than existing efficient fine-tuning methods on natural language understanding tasks with pre-trained RoBERTa-base. Furthermore, in generative tasks with pre-trained LLaMA-1-7B, MonteCLoRA demonstrates robust zero-shot performance with 50% lower variance than the contemporary efficient fine-tuning methods. The theoretical and empirical results presented in the paper underscore how parameterization and hyperpriors balance exploration-exploitation in the low-rank parametric space, leading to better and more robust parameter estimation during efficient fine-tuning.
Authors: Marlon Tobaben, Marcus Klasson, Rui Li, Arno Solin, Antti Honkela
Abstract: This work explores the intersection of continual learning (CL) and differential privacy (DP). Crucially, continual learning models must retain knowledge across tasks, but this conflicts with the differential privacy requirement that no individual sample be memorised in the model. We propose using pre-trained models to address the trade-offs between privacy and performance in a continual learning setting. More specifically, we present necessary assumptions to enable privacy-preservation and propose combining pre-trained models with parameter-free classifiers and parameter-efficient adapters that are learned under differential privacy. Our experiments demonstrate their effectiveness and provide insights into balancing the competing demands of continual learning and privacy.
Authors: Marco Due\~nas, Federico Nutarelli, V\'ictor Ortiz, Massimo Riccaboni, Francesco Serti
Abstract: Our paper presents a methodology to study the heterogeneous effects of economy-wide shocks and applies it to the case of the impact of the COVID-19 crisis on exports. This methodology is applicable in scenarios where the pervasive nature of the shock hinders the identification of a control group unaffected by the shock, as well as the ex-ante definition of each unit's intensity of exposure to the shock. In particular, our study investigates the effectiveness of various Machine Learning (ML) techniques in predicting firms' trade and, by building on recent developments in causal ML, uses these predictions to reconstruct the counterfactual distribution of firms' trade under different COVID-19 scenarios and to study treatment effect heterogeneity. Specifically, we focus on the probability of Colombian firms surviving in the export market under two different scenarios: a COVID-19 setting and a non-COVID-19 counterfactual situation. On average, we find that the COVID-19 shock decreased a firm's probability of surviving in the export market by about 20 percentage points in April 2020. We study the treatment effect heterogeneity by employing a classification analysis that compares the characteristics of the firms on the tails of the estimated distribution of the individual treatment effects.
Authors: Amol Damare, Shouvik Roy, Roshan Sharma, Keith DSouza, Scott A. Smolka, Scott D. Stoller
Abstract: We present Barrier-based Simplex (Bb-Simplex), a new, provably correct design for runtime assurance of continuous dynamical systems. Bb-Simplex is centered around the Simplex control architecture, which consists of a high-performance advanced controller that is not guaranteed to maintain safety of the plant, a verified-safe baseline controller, and a decision module that switches control of the plant between the two controllers to ensure safety without sacrificing performance. In Bb-Simplex, Barrier certificates are used to prove that the baseline controller ensures safety. Furthermore, Bb-Simplex features a new automated method for deriving, from the barrier certificate, the conditions for switching between the controllers. Our method is based on the Taylor expansion of the barrier certificate and yields computationally inexpensive switching conditions. We also propose extensions to Bb-Simplex to enable its use in hybrid systems, which have multiple modes each with its own dynamics, and to support its use when only approximate dynamics (not exact dynamics) are available, for both continuous-time and hybrid dynamical systems. We consider significant applications of Bb-Simplex to microgrids featuring advanced controllers in the form of neural networks trained using reinforcement learning. These microgrids are modeled in RTDS, an industry-standard high-fidelity, real-time power systems simulator. Our results demonstrate that Bb-Simplex can automatically derive switching conditions for complex continuous-time and hybrid systems, the switching conditions are not overly conservative, and Bb-Simplex ensures safety even in the presence of adversarial attacks on the neural controller when only approximate dynamics (with an error bound) are available.
Authors: Benedikt Schulz, Lutz K\"ohler, Sebastian Lerch
Abstract: The importance of accurately quantifying forecast uncertainty has motivated much recent research on probabilistic forecasting. In particular, a variety of deep learning approaches has been proposed, with forecast distributions obtained as the output of neural networks. These neural network-based methods are often used in the form of an ensemble, e.g., based on multiple model runs from different random initializations or more sophisticated ensembling strategies such as dropout, resulting in a collection of forecast distributions that need to be aggregated into a final probabilistic prediction. With the aim of consolidating findings from the machine learning literature on ensemble methods and the statistical literature on forecast combination, we address the question of how to aggregate distribution forecasts based on such `deep ensembles'. Using theoretical arguments and a comprehensive analysis on twelve benchmark data sets, we systematically compare probability- and quantile-based aggregation methods for three neural network-based approaches with different forecast distribution types as output. Our results show that combining forecast distributions from deep ensembles can substantially improve the predictive performance. We propose a general quantile aggregation framework for deep ensembles that allows for corrections of systematic deficiencies and performs well in a variety of settings, often outperforming a linear combination of the forecast densities. Finally, we investigate the effects of the ensemble size and derive recommendations for aggregating distribution forecasts from deep ensembles in practice.
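Code sketch: The probability- versus quantile-based distinction can be illustrated with three Gaussian member forecasts (a toy setup, not the paper's experiments): the linear pool averages the predictive CDFs, while quantile aggregation (Vincentization) averages the quantile functions, generally giving a sharper combined forecast.

    import numpy as np
    from scipy import stats

    means, sds = np.array([0.0, 0.5, 1.0]), np.array([1.0, 1.2, 0.8])
    levels = np.linspace(0.01, 0.99, 99)

    # Linear pool: average the predictive CDFs (a mixture distribution).
    xs = np.linspace(-5, 6, 1001)
    mix_cdf = np.mean([stats.norm.cdf(xs, m, s) for m, s in zip(means, sds)], axis=0)
    mix_q = np.interp(levels, mix_cdf, xs)       # invert the mixture CDF numerically

    # Quantile aggregation (Vincentization): average the quantile functions.
    vinc_q = np.mean([stats.norm.ppf(levels, m, s) for m, s in zip(means, sds)], axis=0)

    print("linear pool 90% interval width:  ", mix_q[94] - mix_q[4])
    print("quantile agg. 90% interval width:", vinc_q[94] - vinc_q[4])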
Authors: Nika Haghtalab, Thodoris Lykouris, Sloan Nietert, Alexander Wei
Abstract: We study Stackelberg games where a principal repeatedly interacts with a non-myopic long-lived agent, without knowing the agent's payoff function. Although learning in Stackelberg games is well-understood when the agent is myopic, dealing with non-myopic agents poses additional complications. In particular, non-myopic agents may strategize and select actions that are inferior in the present in order to mislead the principal's learning algorithm and obtain better outcomes in the future. We provide a general framework that reduces learning in presence of non-myopic agents to robust bandit optimization in the presence of myopic agents. Through the design and analysis of minimally reactive bandit algorithms, our reduction trades off the statistical efficiency of the principal's learning algorithm against its effectiveness in inducing near-best-responses. We apply this framework to Stackelberg security games (SSGs), pricing with unknown demand curve, general finite Stackelberg games, and strategic classification. In each setting, we characterize the type and impact of misspecifications present in near-best responses and develop a learning algorithm robust to such misspecifications. On the way, we improve the state-of-the-art query complexity of learning in SSGs with $n$ targets from $O(n^3)$ to a near-optimal $\widetilde{O}(n)$ by uncovering a fundamental structural property of these games. The latter result is of independent interest beyond learning with non-myopic agents.
Authors: V. A. Aksyuk
Abstract: Machine learning algorithms have achieved superhuman performance in specific complex domains. However, learning online from few examples and compositional learning for efficient generalization across domains remain elusive. In humans, such learning includes specific declarative memory formation and is closely associated with consciousness. Predictive processing has been advanced as a principled Bayesian framework for understanding the cortex as implementing deep generative models for both sensory perception and action control. However, predictive processing offers little direct insight into fast compositional learning or into the separation between conscious and unconscious contents. Here, we propose that access consciousness arises as a consequence of a particular learning mechanism operating within a predictive processing system. We extend predictive processing by adding online, single-example new structure learning via hierarchical binding of unpredicted inferences. This system learns new causes by quickly connecting together novel combinations of perceptions, which manifests as working memories that can become short- and long-term declarative memories retrievable by associative recall. The contents of such bound representations are unified yet differentiated, can be maintained by selective attention and are globally available. The proposed learning process explains contrast and masking manipulations, postdictive perceptual integration, and other paradigm cases of consciousness research. 'Phenomenal conscious experience' is how the learning system transparently models its own functioning, giving rise to perceptual illusions underlying the meta-problem of consciousness. Our proposal naturally unifies the feature binding, recurrent processing, predictive processing, and global workspace theories of consciousness.
Authors: Kaizheng Wang
Abstract: We develop and analyze a principled approach to kernel ridge regression under covariate shift. The goal is to learn a regression function with small mean squared error over a target distribution, based on unlabeled data from there and labeled data that may have a different feature distribution. We propose to split the labeled data into two subsets, and conduct kernel ridge regression on them separately to obtain a collection of candidate models and an imputation model. We use the latter to fill the missing labels and then select the best candidate accordingly. Our non-asymptotic excess risk bounds demonstrate that our estimator adapts effectively to both the structure of the target distribution and the covariate shift. This adaptation is quantified through a notion of effective sample size that reflects the value of labeled source data for the target regression task. Our estimator achieves the minimax optimal error rate up to a polylogarithmic factor, and we find that using pseudo-labels for model selection does not significantly hinder performance.
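Code sketch: A one-dimensional illustration of the split-and-impute selection step (our reading of the procedure, with synthetic data), using scikit-learn's KernelRidge: candidate models are fit on one half of the labeled source data, an imputation model fit on the other half pseudo-labels the target features, and the candidate with the smallest pseudo-labeled target error is selected.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    rng = np.random.default_rng(5)
    f = lambda x: np.sin(3 * x)
    Xs = rng.uniform(-1.0, 0.5, 200)[:, None]          # labeled source features
    ys = f(Xs).ravel() + 0.1 * rng.standard_normal(200)
    Xt = rng.uniform(-0.5, 1.0, 300)[:, None]          # unlabeled, shifted target features

    X1, y1, X2, y2 = Xs[:100], ys[:100], Xs[100:], ys[100:]
    candidates = {a: KernelRidge(kernel="rbf", alpha=a).fit(X1, y1)
                  for a in [1e-3, 1e-2, 1e-1, 1.0]}
    imputer = KernelRidge(kernel="rbf", alpha=1e-2).fit(X2, y2)

    pseudo = imputer.predict(Xt)                       # fill in the missing target labels
    errs = {a: np.mean((m.predict(Xt) - pseudo) ** 2) for a, m in candidates.items()}
    print("selected ridge parameter:", min(errs, key=errs.get))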
Authors: Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, In So Kweon, Junmo Kim
Abstract: This survey reviews the progress of diffusion models in generating images from text, ~\textit{i.e.} text-to-image diffusion models. As a self-contained work, this survey starts with a brief introduction of how diffusion models work for image synthesis, followed by the background for text-conditioned image synthesis. Based on that, we present an organized review of pioneering methods and their improvements on text-to-image generation. We further summarize applications beyond image generation, such as text-guided generation for various modalities like videos, and text-guided image editing. Beyond the progress made so far, we discuss existing challenges and promising future directions.
Authors: Lawrence A. Bull, Matthew R. Jones, Elizabeth J. Cross, Andrew Duncan, Mark Girolami
Abstract: In practice, non-destructive testing (NDT) procedures tend to consider experiments (and their respective models) as distinct, conducted in isolation and associated with independent data. In contrast, this work looks to capture the interdependencies between acoustic emission (AE) experiments (as meta-models) and then use the resulting functions to predict the model hyperparameters for previously unobserved systems. We utilise a Bayesian multilevel approach (similar to deep Gaussian Processes) where a higher level meta-model captures the inter-task relationships. Our key contribution is how knowledge of the experimental campaign can be encoded between tasks as well as within tasks. We present an example of AE time-of-arrival mapping for source localisation, to illustrate how multilevel models naturally lend themselves to representing aggregate systems in engineering. We constrain the meta-model based on domain knowledge, then use the inter-task functions for transfer learning, predicting hyperparameters for models of previously unobserved experiments (for a specific design).
Authors: Ziliang Wang, Xiaohong Zhang, Ze Shi Li, Sheng Huang, Meng Yan
Abstract: Accurate Quality of Service (QoS) prediction is essential for enhancing user satisfaction in web recommendation systems, yet existing prediction models often overlook feature noise, focusing predominantly on label noise. In this paper, we present the Probabilistic Deep Supervision Network (PDS-Net), a robust framework designed to effectively identify and mitigate feature noise, thereby improving QoS prediction accuracy. PDS-Net operates with a dual-branch architecture: the main branch utilizes a decoder network to learn a Gaussian-based prior distribution from known features, while the second branch derives a posterior distribution based on true labels. A key innovation of PDS-Net is its condition-based noise recognition loss function, which enables precise identification of noisy features in objects (users or services). Once noisy features are identified, PDS-Net refines the feature's prior distribution, aligning it with the posterior distribution, and propagates this adjusted distribution to intermediate layers, effectively reducing noise interference. Extensive experiments conducted on two real-world QoS datasets demonstrate that PDS-Net consistently outperforms existing models, achieving an average improvement of 8.91% in MAE on Dataset D1 and 8.32% on Dataset D2 compared to the state-of-the-art. These results highlight PDS-Net's ability to accurately capture complex user-service relationships and handle feature noise, underscoring its robustness and versatility across diverse QoS prediction environments.
Authors: Patrick Feeney, Michael C. Hughes
Abstract: The information noise-contrastive estimation (InfoNCE) loss function provides the basis of many self-supervised deep learning methods due to its strong empirical results and theoretical motivation. Previous work suggests a supervised contrastive (SupCon) loss to extend InfoNCE to learn from available class labels. This SupCon loss has been widely used due to reports of good empirical performance. However, in this work we find that the prior SupCon loss formulation has questionable justification because it can encourage some images from the same class to repel one another in the learned embedding space. This problematic intra-class repulsion gets worse as the number of images sharing one class label increases. We propose the Supervised InfoNCE REvisited (SINCERE) loss as a theoretically-justified supervised extension of InfoNCE that eliminates intra-class repulsion. Experiments show that SINCERE leads to better separation of embeddings from different classes and improves transfer learning classification accuracy. We additionally utilize probabilistic modeling to derive an information-theoretic bound that relates SINCERE loss to the symmetrized KL divergence between data-generating distributions for a target class and all other classes.
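Code sketch: The claimed difference lies in the denominator: for each anchor-positive pair, SINCERE's denominator contains only that positive and the true negatives, so other same-class embeddings are never repelled. The sketch below is our reading of the loss from this description, not the authors' code.

    import numpy as np

    def logsumexp(x):
        m = x.max()
        return m + np.log(np.exp(x - m).sum())

    def sincere_loss(z, labels, temp=0.1):
        """z: (n, d) L2-normalized embeddings; labels: (n,) integer class ids."""
        sim = z @ z.T / temp
        terms = []
        for i in range(len(z)):
            same = labels == labels[i]
            neg = sim[i, ~same]                  # only true negatives in the denominator
            for j in range(len(z)):
                if j != i and same[j]:           # every other same-class image is a positive
                    terms.append(logsumexp(np.append(neg, sim[i, j])) - sim[i, j])
        return np.mean(terms)

    rng = np.random.default_rng(0)
    z = rng.standard_normal((8, 16))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    print(sincere_loss(z, np.array([0, 0, 1, 1, 2, 2, 3, 3])))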
Authors: Runze Cheng, Yao Sun, Dusit Niyato, Lan Zhang, Lei Zhang, Muhammad Ali Imran
Abstract: With the significant advances in AI-generated content (AIGC) and the proliferation of mobile devices, providing high-quality AIGC services via wireless networks is becoming the future direction. However, the primary challenges of AIGC services provisioning in wireless networks lie in unstable channels, limited bandwidth resources, and unevenly distributed computational resources. To this end, this paper proposes a semantic communication (SemCom)-empowered AIGC (SemAIGC) generation and transmission framework, where only semantic information of the content rather than all the binary bits should be generated and transmitted by using SemCom. Specifically, SemAIGC integrates diffusion models within the semantic encoder and decoder to design a workload-adjustable transceiver, thereby allowing adjustment of computational resource utilization at the edge and locally. In addition, a Resource-aware wOrklOad Trade-off (ROOT) scheme is devised to intelligently make workload adaptation decisions for the transceiver, thus efficiently generating, transmitting, and fine-tuning content as per dynamic wireless channel conditions and service requirements. Simulations verify the superiority of our proposed SemAIGC framework in terms of latency and content quality compared to conventional approaches.
Authors: Adam Lechowicz, Nicolas Christianson, Bo Sun, Noman Bashir, Mohammad Hajiesmaili, Adam Wierman, Prashant Shenoy
Abstract: We introduce and study online conversion with switching costs, a family of online problems that capture emerging problems at the intersection of energy and sustainability. In this problem, an online player attempts to purchase (alternatively, sell) fractional shares of an asset during a fixed time horizon with length $T$. At each time step, a cost function (alternatively, price function) is revealed, and the player must irrevocably decide an amount of asset to convert. The player also incurs a switching cost whenever their decision changes in consecutive time steps, i.e., when they increase or decrease their purchasing amount. We introduce competitive (robust) threshold-based algorithms for both the minimization and maximization variants of this problem, and show they are optimal among deterministic online algorithms. We then propose learning-augmented algorithms that take advantage of untrusted black-box advice (such as predictions from a machine learning model) to achieve significantly better average-case performance without sacrificing worst-case competitive guarantees. Finally, we empirically evaluate our proposed algorithms using a carbon-aware EV charging case study, showing that our algorithms substantially improve on baseline methods for this problem.
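Code sketch: A toy threshold rule for the minimization (purchasing) variant, with a hypothetical threshold shape rather than the paper's optimal one and with switching costs omitted: buy in small increments whenever the revealed price beats a threshold that decays with the fraction already purchased, and trade compulsorily at the deadline.

    def threshold(w, L=1.0, U=10.0):
        """Price we are willing to pay after buying a fraction w (hypothetical form)."""
        return U * (L / U) ** w                  # decays from U (at w=0) to L (at w=1)

    def run(prices, step=0.05):
        bought, cost = 0.0, 0.0
        for p in prices:
            while bought < 1.0 and p <= threshold(bought):
                bought += step                   # convert a small increment at price p
                cost += step * p
        if bought < 1.0:                         # deadline: compulsory final purchase
            cost += (1.0 - bought) * prices[-1]
        return cost

    print(run([9.0, 6.0, 4.0, 7.0, 2.5, 8.0]))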
Authors: Andrea Wynn, Ilia Sucholutsky, Thomas L. Griffiths
Abstract: How can we build AI systems that can learn any set of individual human values both quickly and safely, avoiding causing harm or violating societal standards for acceptable behavior during the learning process? We explore the effects of representational alignment between humans and AI agents on learning human values. Making AI systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance. We demonstrate that this kind of representational alignment can also support safely learning and exploring human values in the context of personalization. We begin with a theoretical prediction, show that it applies to learning human morality judgments, then show that our results generalize to ten different aspects of human values -- including ethics, honesty, and fairness -- training AI agents on each set of values in a multi-armed bandit setting, where rewards reflect human value judgments over the chosen action. Using a set of textual action descriptions, we collect value judgments from humans, as well as similarity judgments from both humans and multiple language models, and demonstrate that representational alignment enables both safe exploration and improved generalization when learning human values.
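Code sketch: A toy version of the bandit setting (our construction, using epsilon-greedy exploration rather than the paper's method): rewards are linear in the action representations, so an agent that generalizes across similarly represented actions recovers the value structure from few trials.

    import numpy as np

    rng = np.random.default_rng(7)
    n_actions, dim = 20, 3
    reps = rng.standard_normal((n_actions, dim))     # action representations
    true_w = rng.standard_normal(dim)                # latent human values

    A, b = np.eye(dim), np.zeros(dim)                # ridge-regression statistics
    est = np.zeros(dim)
    for _ in range(30):
        explore = rng.random() < 0.2
        a = int(rng.integers(n_actions)) if explore else int(np.argmax(reps @ est))
        r = reps[a] @ true_w + 0.1 * rng.standard_normal()   # human value judgment
        A += np.outer(reps[a], reps[a])
        b += r * reps[a]
        est = np.linalg.solve(A, b)

    print("correlation with true values:", np.corrcoef(reps @ est, reps @ true_w)[0, 1])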
Authors: Nicola Bariletto, Nhat Ho
Abstract: Training machine learning and statistical models often involves optimizing a data-driven risk criterion. The risk is usually computed with respect to the empirical data distribution, but this may result in poor and unstable out-of-sample performance due to distributional uncertainty. In the spirit of distributionally robust optimization, we propose a novel robust criterion by combining insights from Bayesian nonparametric (i.e., Dirichlet process) theory and a recent decision-theoretic model of smooth ambiguity-averse preferences. First, we highlight novel connections with standard regularized empirical risk minimization techniques, including Ridge and LASSO regression. Then, we theoretically demonstrate the existence of favorable finite-sample and asymptotic statistical guarantees on the performance of the robust optimization procedure. For practical implementation, we propose and study tractable approximations of the criterion based on well-known Dirichlet process representations. We also show that the smoothness of the criterion naturally leads to standard gradient-based numerical optimization. Finally, we provide insights into the workings of our method by applying it to a variety of tasks based on simulated and real datasets.
Authors: Juan M Gorriz, R. Martin Clemente, F Segovia, J Ramirez, A Ortiz, J. Suckling
Abstract: As a technique that can compactly represent complex patterns, machine learning has significant potential for predictive inference. K-fold cross-validation (CV) is the most common approach to ascertaining the likelihood that a machine learning outcome is generated by chance, and it frequently outperforms conventional hypothesis testing. This improvement uses measures directly obtained from machine learning classifications, such as accuracy, that do not have a parametric description. To approach a frequentist analysis within machine learning pipelines, a permutation test or simple statistics from data partitions (i.e., folds) can be added to estimate confidence intervals. Unfortunately, neither parametric nor non-parametric tests solve the inherent problems of partitioning small sample-size datasets and learning from heterogeneous data sources. The fact that machine learning strongly depends on the learning parameters and the distribution of data across folds recapitulates familiar difficulties around excess false positives and replication. A novel statistical test based on K-fold CV and the Upper Bound of the actual risk (K-fold CUBV) is proposed, where uncertain predictions of machine learning with CV are bounded by the worst case through the evaluation of concentration inequalities. Probably Approximately Correct-Bayesian upper bounds for linear classifiers in combination with K-fold CV are derived and used to estimate the actual risk. The performance with simulated and neuroimaging datasets suggests that K-fold CUBV is a robust criterion for detecting effects and validating accuracy values obtained from machine learning and classical CV schemes, while avoiding excess false positives.
Authors: Vincent Zhihao Zheng, Lijun Sun
Abstract: Accurately modeling the correlation structure of errors is critical for reliable uncertainty quantification in probabilistic time series forecasting. While recent deep learning models for multivariate time series have developed efficient parameterizations for time-varying contemporaneous covariance, they often assume temporal independence of errors for simplicity. However, real-world data often exhibit significant error autocorrelation and cross-lag correlation due to factors such as missing covariates. In this paper, we introduce a plug-and-play method that learns the covariance structure of errors over multiple steps for autoregressive models with Gaussian-distributed errors. To ensure scalable inference and computational efficiency, we model the contemporaneous covariance using a low-rank-plus-diagonal parameterization and capture cross-covariance through a group of independent latent temporal processes. The learned covariance matrix is then used to calibrate predictions based on observed residuals. We evaluate our method on probabilistic models built on RNNs and Transformer architectures, and the results confirm the effectiveness of our approach in improving predictive accuracy and uncertainty quantification without significantly increasing the parameter size.
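Code sketch: The low-rank-plus-diagonal idea in isolation (illustrative parameters, not the learned model): Sigma = L L^T + diag(d) needs only O(nr + n) parameters, and errors can be sampled without materializing the n x n matrix.

    import numpy as np

    rng = np.random.default_rng(4)
    n, r = 100, 5
    L = 0.3 * rng.standard_normal((n, r))        # low-rank factor (learned in practice)
    d = 0.1 + 0.05 * rng.random(n)               # positive diagonal (learned in practice)

    # eps = L z + sqrt(d) * w with z ~ N(0, I_r), w ~ N(0, I_n) has covariance
    # L L^T + diag(d) without ever forming the full matrix.
    eps = L @ rng.standard_normal(r) + np.sqrt(d) * rng.standard_normal(n)

    # Empirical check of the implied covariance.
    S = L @ rng.standard_normal((r, 20000)) + np.sqrt(d)[:, None] * rng.standard_normal((n, 20000))
    print(np.abs(np.cov(S) - (L @ L.T + np.diag(d))).max())   # small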
Authors: Ludovico Theo Giorgini, Katherine Deck, Tobias Bischoff, Andre Souza
Abstract: We introduce an approach for analyzing the responses of dynamical systems to external perturbations that combines score-based generative modeling with the Generalized Fluctuation-Dissipation Theorem (GFDT). The methodology enables accurate estimation of system responses, including those with non-Gaussian statistics. We numerically validate our approach using time-series data from three different stochastic partial differential equations of increasing complexity: an Ornstein-Uhlenbeck process with spatially correlated noise, a modified stochastic Allen-Cahn equation, and the 2D Navier-Stokes equations. We demonstrate the improved accuracy of the methodology over conventional methods and discuss its potential as a versatile tool for predicting the statistical behavior of complex dynamical systems.
Authors: Dmitry Yarotsky
Abstract: We explore the theoretical possibility of learning $d$-dimensional targets with $W$-parameter models by gradient flow (GF) when $W<d$.
Authors: Shabnam Tafreshi, Shubham Vatsal, Mona Diab
Abstract: It is important to be able to analyze the emotional state of people around the globe. There are 7100+ active languages spoken around the world and building emotion classification for each language is labor intensive. Particularly for low-resource and endangered languages, building emotion classification can be quite challenging. We present a cross-lingual emotion classifier, where we train an emotion classifier with resource-rich languages (i.e. \textit{English} in our work) and transfer the learning to low- and moderate-resource languages. We compare and contrast two approaches of transfer learning from a high-resource language to a low- or moderate-resource language. One approach projects annotations from the high-resource language onto low- and moderate-resource languages via parallel corpora, and the other transfers directly from the high-resource language to the other languages. We show the efficacy of our approaches on 6 languages: Farsi, Arabic, Spanish, Ilocano, Odia, and Azerbaijani. Our results indicate that our approaches outperform random baselines and transfer emotions across languages successfully. For all languages, the direct cross-lingual transfer of emotion yields better results. We also create annotated emotion-labeled resources for four languages: Farsi, Azerbaijani, Ilocano and Odia.
Authors: Pulkit Pattnaik, Rishabh Maheshwary, Kelechi Ogueji, Vikas Yadav, Sathwik Tejaswi Madhusudhan
Abstract: Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (usually one chosen and rejected response pair per user prompt) to align LLMs to human preferences. In practice, multiple responses can exist for a given prompt with varying quality relative to each other. With the availability of such quality ratings for multiple responses, we propose utilizing these responses to create multiple preference pairs for a given prompt. Our work focuses on systematically using the constructed multiple preference pairs in DPO training via curriculum learning methodology. In particular, we order these multiple pairs of preference data from easy to hard (emulating curriculum training) according to various criteria. We show detailed comparisons of our proposed approach to the standard single-pair DPO setting. Our method, which we call Curry-DPO, consistently shows increased performance gains on MT-Bench, Vicuna, WizardLM, and the UltraFeedback test set, highlighting its effectiveness. More specifically, Curry-DPO achieves a score of 7.43 on MT-Bench with the Zephyr-7B model, outperforming the majority of existing LLMs with similar parameter size. Curry-DPO also achieves the highest adjusted win rates on Vicuna, WizardLM, and UltraFeedback test datasets (90.7%, 87.1%, and 87.9% respectively) in our experiments, with notable gains of up to 7.5% when compared to the standard DPO technique. We release the preference pairs used in alignment at: https://huggingface.co/datasets/ServiceNow-AI/Curriculum_DPO_preferences
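Code sketch: The pair-construction and ordering step is simple to illustrate (hypothetical ratings; the easy-to-hard criterion shown here, larger chosen-rejected rating gap first, is one of several the abstract alludes to).

    from itertools import combinations

    responses = [("resp_a", 9.0), ("resp_b", 7.5), ("resp_c", 4.0), ("resp_d", 2.0)]

    pairs = []
    for (txt_i, r_i), (txt_j, r_j) in combinations(responses, 2):
        chosen, rejected = (txt_i, txt_j) if r_i > r_j else (txt_j, txt_i)
        pairs.append({"chosen": chosen, "rejected": rejected, "gap": abs(r_i - r_j)})

    # Curriculum: large rating gaps are "easy" preference pairs, so train on them first.
    for p in sorted(pairs, key=lambda q: -q["gap"]):
        print(f'{p["chosen"]} > {p["rejected"]}  (gap={p["gap"]})')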
URLs: https://huggingface.co/datasets/ServiceNow-AI/Curriculum_DPO_preferences
Authors: Matthew Finlayson, Xiang Ren, Swabha Swayamdipta
Abstract: Large language model (LLM) providers often hide the architectural details and parameters of their proprietary models by restricting public access to a limited API. In this work we show that, with only a conservative assumption about the model architecture, it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries (e.g., costing under $1000 USD for OpenAI's gpt-3.5-turbo). Our findings are centered on one key observation: most modern LLMs suffer from a softmax bottleneck, which restricts the model outputs to a linear subspace of the full output space. We exploit this fact to unlock several capabilities, including (but not limited to) obtaining cheap full-vocabulary outputs, auditing for specific types of model updates, identifying the source LLM given a single full LLM output, and even efficiently discovering the LLM's hidden size. Our empirical investigations show the effectiveness of our methods, which allow us to estimate the embedding size of OpenAI's gpt-3.5-turbo to be about 4096. Lastly, we discuss ways that LLM providers can guard against these attacks, as well as how these capabilities can be viewed as a feature (rather than a bug) by allowing for greater transparency and accountability.
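Code sketch: The softmax-bottleneck observation can be reproduced with a simulated model instead of API queries (hypothetical sizes): full-vocabulary logit vectors all lie in the d-dimensional subspace spanned by the unembedding matrix, so the numerical rank of a stack of outputs reveals the hidden size.

    import numpy as np

    rng = np.random.default_rng(3)
    V, d = 5000, 64                        # vocabulary size, hidden size (hypothetical)
    W = rng.standard_normal((V, d))        # unembedding matrix

    n = 200                                # "API responses": logit vectors for n prompts
    logits = rng.standard_normal((n, d)) @ W.T   # every row lies in a rank-d subspace

    sv = np.linalg.svd(logits, compute_uv=False)
    print("estimated hidden size:", int((sv > sv[0] * 1e-8).sum()))   # ~64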
Authors: Nikita Kornilov, Petr Mokrov, Alexander Gasnikov, Alexander Korotin
Abstract: Over recent years, there has been a boom in the development of Flow Matching (FM) methods for generative modeling. One intriguing property pursued by the community is the ability to learn flows with straight trajectories which realize the Optimal Transport (OT) displacements. Straightness is crucial for the fast integration (inference) of the learned flow's paths. Unfortunately, most existing flow straightening methods are based on non-trivial iterative FM procedures which accumulate the error during training or exploit heuristics based on minibatch OT. To address these issues, we develop and theoretically justify the novel \textbf{Optimal Flow Matching} (OFM) approach which allows recovering the straight OT displacement for the quadratic transport in just one FM step. The main idea of our approach is the employment of vector fields for FM that are parameterized by convex functions.
Authors: Yue Ding, Sen Yan, Maqsood Hussain Shah, Hongyuan Fang, Ji Li, Mingming Liu
Abstract: The escalating challenges of traffic congestion and environmental degradation underscore the critical importance of embracing E-Mobility solutions in urban spaces. In particular, micro E-Mobility tools such as E-scooters and E-bikes, play a pivotal role in this transition, offering sustainable alternatives for urban commuters. However, the energy consumption patterns for these tools are a critical aspect that impacts their effectiveness in real-world scenarios and is essential for trip planning and boosting user confidence in using these tools. To this effect, recent studies have utilised physical models customised for specific mobility tools and conditions, but these models struggle with generalization and effectiveness in real-world scenarios due to a notable absence of open datasets for thorough model evaluation and verification. To fill this gap, our work presents an open dataset, collected in Dublin, Ireland, specifically designed for energy modelling research related to E-Scooters and E-Bikes. Furthermore, we provide a comprehensive analysis of energy consumption modelling based on the dataset using a set of representative machine learning algorithms and compare their performance against the contemporary mathematical models as a baseline. Our results demonstrate a notable advantage for data-driven models in comparison to the corresponding mathematical models for estimating energy consumption. Specifically, data-driven models outperform physical models in accuracy by up to 83.83% for E-Bikes and 82.16% for E-Scooters based on an in-depth analysis of the dataset under certain assumptions.
Authors: Mingmeng Geng, Roberto Trotta
Abstract: Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT's writing style in their abstracts through a statistical analysis of word frequency changes. Our model is calibrated and validated on a mixture of real abstracts and ChatGPT-modified abstracts (simulated data) after a careful noise analysis. The words used for estimation are not fixed but adaptive, including those with decreasing frequency. We find that large language models (LLMs), represented by ChatGPT, are having an increasing impact on arXiv abstracts, especially in the field of computer science, where the fraction of LLM-style abstracts is estimated to be approximately 35%, if we take the responses of GPT-3.5 to one simple prompt, "revise the following sentences", as a baseline. We conclude with an analysis of both positive and negative aspects of the penetration of LLMs into academics' writing style.
Authors: Ozan Baris Mulayim, Edson Severnini, Mario Berg\'es
Abstract: In single-zone multi-node systems (SZMNSs), temperature controls rely on a single probe near the thermostat, resulting in temperature discrepancies that cause thermal discomfort and energy waste. Augmenting smart thermostats (STs) with per-room sensors has gained acceptance by major ST manufacturers. This paper leverages additional sensory information to empirically characterize the services provided by buildings, including thermal comfort, energy efficiency, and demand response (DR). Utilizing room-level time-series data from 1,000 houses, metadata from 110,000 houses across the United States, and data from two real-world testbeds, we examine the limitations of SZMNSs and explore the potential of remote sensors. We discovered that comfortable DR durations (CDRDs) for rooms are typically 70% longer or 40% shorter than for the room with the thermostat. On average, rooms at the bounds of the control temperature typically deviate from the average by around -3{\deg}F to 2.5{\deg}F. Moreover, in 95% of houses, we identified rooms experiencing notably higher solar gains compared to the rest of the rooms, while 85% and 70% of houses demonstrated lower heat input and poor insulation, respectively. Lastly, it became evident that the consumption of cooling energy escalates with the increase in the number of sensors, whereas heating usage experiences fluctuations ranging from -19% to +25%. This study serves as a benchmark for assessing the thermal comfort and DR services in the existing housing stock, while also highlighting the energy efficiency impacts of sensing technologies. Our approach sets the stage for more granular, precise control strategies of SZMNSs.
Authors: Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang
Abstract: Most text-to-video (T2V) generative models often produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real-world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce a simple and effective Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video with the representations of the first scene description (e.g., 'a red panda climbing a tree') and second scene description (e.g., 'the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions while remaining visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline by achieving a relative gain of 29% in the overall score, which averages visual consistency and text adherence using human evaluation.
Authors: Thomas Nagler, Lennart Schneider, Bernd Bischl, Matthias Feurer
Abstract: Hyperparameter optimization is crucial for obtaining peak performance of machine learning models. The standard protocol evaluates various hyperparameter configurations using a resampling estimate of the generalization error to guide optimization and select a final hyperparameter configuration. Without much evidence, paired resampling splits, i.e., either a fixed train-validation split or a fixed cross-validation scheme, are often recommended. We show that, surprisingly, reshuffling the splits for every configuration often improves the final model's generalization performance on unseen data. Our theoretical analysis explains how reshuffling affects the asymptotic behavior of the validation loss surface and provides a bound on the expected regret in the limiting regime. This bound connects the potential benefits of reshuffling to the signal and noise characteristics of the underlying optimization problem. We confirm our theoretical results in a controlled simulation study and demonstrate the practical usefulness of reshuffling in a large-scale, realistic hyperparameter optimization experiment. While reshuffling leads to test performances that are competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol and can often make holdout become competitive with standard CV while being computationally cheaper.
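Code sketch: The protocol difference in miniature (a toy setup with scikit-learn, not the paper's experiments): scoring every configuration on one fixed holdout split versus drawing a fresh split for each configuration.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=600, random_state=0)
    configs = [10.0 ** c for c in range(-3, 3)]      # candidate values of C

    def score(C, seed):
        Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.25, random_state=seed)
        return LogisticRegression(C=C, max_iter=1000).fit(Xtr, ytr).score(Xva, yva)

    fixed = {C: score(C, seed=0) for C in configs}                      # one shared split
    reshuffled = {C: score(C, seed=i) for i, C in enumerate(configs)}   # fresh split each time

    print("fixed-split choice:      C =", max(fixed, key=fixed.get))
    print("reshuffled-split choice: C =", max(reshuffled, key=reshuffled.get))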
Authors: Zikang Zhou, Haibo Hu, Xinhong Chen, Jianping Wang, Nan Guan, Kui Wu, Yung-Hui Li, Yu-Kai Huang, Chun Jason Xue
Abstract: Simulating realistic interactions among traffic agents is crucial for efficiently validating the safety of autonomous driving systems. Existing leading simulators primarily use an encoder-decoder structure to encode the historical trajectories for future simulation. However, such a paradigm complicates the model architecture, and the manual separation of history and future trajectories leads to low data utilization. To address these challenges, we propose Behavior Generative Pre-trained Transformers (BehaviorGPT), a decoder-only, autoregressive architecture designed to simulate the sequential motion of multiple agents. Crucially, our approach discards the traditional separation between "history" and "future," treating each time step as the "current" one, resulting in a simpler, more parameter- and data-efficient design that scales seamlessly with data and computation. Additionally, we introduce the Next-Patch Prediction Paradigm (NP3), which enables models to reason at the patch level of trajectories and capture long-range spatial-temporal interactions. BehaviorGPT ranks first across several metrics on the Waymo Sim Agents Benchmark, demonstrating its exceptional performance in multi-agent and agent-map interactions. We outperformed state-of-the-art models with a realism score of 0.741 and improved the minADE metric to 1.540, with an approximately 91.6% reduction in model parameters.
Authors: Jessica N. Howard, Marc S. Klinger, Anindita Maiti, Alexander G. Stapleton
Abstract: The Neural Network Field Theory correspondence (NNFT) is a mapping from neural network (NN) architectures into the space of statistical field theories (SFTs). The Bayesian renormalization group (BRG) is an information-theoretic coarse graining scheme that generalizes the principles of the exact renormalization group (ERG) to arbitrarily parameterized probability distributions, including those of NNs. In BRG, coarse graining is performed in parameter space with respect to an information-theoretic distinguishability scale set by the Fisher information metric. In this paper, we unify NNFT and BRG to form a powerful new framework for exploring the space of NNs and SFTs, which we coin BRG-NNFT. With BRG-NNFT, NN training dynamics can be interpreted as inducing a flow in the space of SFTs from the information-theoretic `IR' $\rightarrow$ `UV'. Conversely, applying an information-shell coarse graining to the trained network's parameters induces a flow in the space of SFTs from the information-theoretic `UV' $\rightarrow$ `IR'. When the information-theoretic cutoff scale coincides with a standard momentum scale, BRG is equivalent to ERG. We demonstrate the BRG-NNFT correspondence on two analytically tractable examples. First, we construct BRG flows for trained, infinite-width NNs, of arbitrary depth, with generic activation functions. As a special case, we then restrict to architectures with a single infinitely-wide layer, scalar outputs, and generalized cos-net activations. In this case, we show that BRG coarse-graining corresponds exactly to the momentum-shell ERG flow of a free scalar SFT. Our analytic results are corroborated by a numerical experiment in which an ensemble of asymptotically wide NNs are trained and subsequently renormalized using an information-shell BRG scheme.
Authors: Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, Kai Yuan
Abstract: Natural language is often the easiest and most convenient modality for humans to specify tasks for robots. However, learning to ground language to behavior typically requires impractical amounts of diverse, language-annotated demonstrations collected on each target robot. In this work, we aim to separate the problem of what to accomplish from how to accomplish it, as the former can benefit from substantial amounts of external observation-only data, and only the latter depends on a specific robot embodiment. To this end, we propose Video-Language Critic, a reward model that can be trained on readily available cross-embodiment data using contrastive learning and a temporal ranking objective, and use it to score behavior traces from a separate actor. When trained on Open X-Embodiment data, our reward model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only, despite a significant domain gap. Using in-domain data but in a challenging task generalization setting on Meta-World, we further demonstrate more sample-efficient training than is possible with prior language-conditioned reward models that are either trained with binary classification, use static images, or do not leverage the temporal information present in video data.
Authors: Md Saiful Islam, Tariq Adnan, Jan Freyberg, Sangwu Lee, Abdelrahman Abdelkader, Meghan Pawlik, Cathe Schwartz, Karen Jaffe, Ruth B. Schneider, E Ray Dorsey, Ehsan Hoque
Abstract: Limited accessibility to neurological care leads to underdiagnosed Parkinson's Disease (PD), preventing early intervention. Existing AI-based PD detection methods primarily focus on unimodal analysis of motor or speech tasks, overlooking the multifaceted nature of the disease. To address this, we introduce a large-scale, multi-task video dataset consisting of 1102 sessions (each containing videos of finger tapping, facial expression, and speech tasks captured via webcam) from 845 participants (272 with PD). We propose a novel Uncertainty-calibrated Fusion Network (UFNet) that leverages this multimodal data to enhance diagnostic accuracy. UFNet employs independent task-specific networks, trained with Monte Carlo Dropout for uncertainty quantification, followed by self-attended fusion of features, with attention weights dynamically adjusted based on task-specific uncertainties. To ensure patient-centered evaluation, the participants were randomly split into three sets: 60% for training, 20% for model selection, and 20% for final performance evaluation. UFNet significantly outperformed single-task models in terms of accuracy, area under the ROC curve (AUROC), and sensitivity while maintaining non-inferior specificity. Withholding uncertain predictions further boosted the performance, achieving 88.0+-0.3% accuracy, 93.0+-0.2% AUROC, 79.3+-0.9% sensitivity, and 92.6+-0.3% specificity, at the expense of not being able to predict for 2.3+-0.3% of the data (+- denotes 95% confidence interval). Further analysis suggests that the trained model does not exhibit any detectable bias across sex and ethnic subgroups and is most effective for individuals aged between 50 and 80. Requiring only a webcam and microphone, our approach facilitates accessible home-based PD screening, especially in regions with limited healthcare resources.
Authors: Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach
Abstract: Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
Authors: Pandiyaraju V, Shravan Venkatraman, Abeshek A, Aravintakshan S A, Pavan Kumar S, Kannan A
Abstract: Alzheimer's disease (AD) is the most common form of neurodegeneration, which impacts millions of people each year. Diagnosing and classifying AD accurately with neuroimaging data is an ongoing challenge in the field of medicine. Traditional Convolutional Neural Networks (CNNs) are good at capturing low-level information from images, but their capability to extract subtle, high-level features is suboptimal, which is a significant challenge in detecting AD from MRI scans. To overcome this, we propose a novel Granular Feature Integration method to combine information extraction at different scales combined with an efficient information flow. We also propose a Bi-Focal Perspective mechanism to highlight focus on subtle neurofibrillary tangles and amyloid plaques in MRI scans. Our model yielded an F1-Score of 99.31%, a precision of 99.24%, and a recall of 99.51%, which shows a major improvement in comparison to existing state-of-the-art (SOTA) CNNs.
Authors: Emanuele De Angelis (CNR-IASI, Rome, Italy), Maurizio Proietti (CNR-IASI, Rome, Italy), Francesca Toni (Imperial, London, UK)
Abstract: Assumption-based Argumentation (ABA) is advocated as a unifying formalism for various forms of non-monotonic reasoning, including logic programming. It allows capturing defeasible knowledge, subject to argumentative debate. While, in much existing work, ABA frameworks are given up-front, in this paper we focus on the problem of automating their learning from background knowledge and positive/negative examples. Unlike prior work, we newly frame the problem in terms of brave reasoning under stable extensions for ABA. We present a novel algorithm based on transformation rules (such as Rote Learning, Folding, Assumption Introduction and Fact Subsumption) and an implementation thereof that makes use of Answer Set Programming. Finally, we compare our technique to state-of-the-art ILP systems that learn defeasible knowledge.
Authors: Ruixuan Liu, Alan Chen, Weiye Zhao, Changliu Liu
Abstract: Combinatorial assembly uses standardized unit primitives to build objects that satisfy user specifications. This paper studies assembly sequence planning (ASP) for physical combinatorial assembly. Given the shape of the desired object, the goal is to find a sequence of actions for placing unit primitives to build the target object. In particular, we aim to ensure the planned assembly sequence is physically executable. However, ASP for combinatorial assembly is particularly challenging due to its combinatorial nature. To address the challenge, we employ deep reinforcement learning to learn a construction policy for placing unit primitives sequentially to build the desired object. Specifically, we design an online physics-aware action mask that filters out invalid actions, which effectively guides policy learning and ensures violation-free deployment. In the end, we apply the proposed method to Lego assembly with more than 250 3D structures. The experiment results demonstrate that the proposed method plans physically valid assembly sequences to build all structures, achieving a $100\%$ success rate, whereas the best comparable baseline fails on more than $40$ structures. Our implementation is available at \url{https://github.com/intelligent-control-lab/PhysicsAwareCombinatorialASP}.
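Code sketch: The action-masking pattern in isolation (a generic sketch with a stand-in validity vector; the paper derives validity from online physics checks): invalid placements are assigned probability zero before the policy samples.

    import numpy as np

    def masked_sample(logits, valid, rng):
        """Sample an action index, forbidding actions where valid[i] is False."""
        masked = np.where(valid, logits, -np.inf)    # invalid actions get -inf logit
        p = np.exp(masked - masked.max())
        p /= p.sum()
        return rng.choice(len(logits), p=p)

    rng = np.random.default_rng(6)
    logits = rng.standard_normal(10)                 # policy scores for 10 placements
    valid = np.array([True, False, True, True, False, True, True, False, True, True])
    print("chosen action:", masked_sample(logits, valid, rng))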
URLs: https://github.com/intelligent-control-lab/PhysicsAwareCombinatorialASP
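The action-masking idea lends itself to a short sketch: invalid placements are removed from the policy's distribution by driving their logits to negative infinity. This is the generic masking pattern under our assumptions about tensor shapes, not the authors' implementation.

```python
import torch

def masked_policy_logits(logits, valid_action_mask):
    """Invalid actions (mask == False) get probability ~0 after the softmax,
    so the policy can neither sample nor be updated toward them."""
    return logits.masked_fill(~valid_action_mask, float("-inf"))

# Usage (illustrative shapes): logits (batch, n_actions), boolean mask alike.
# probs = torch.softmax(masked_policy_logits(logits, mask), dim=-1)
```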
Authors: Huaqing Zhang, Lesi Chen, Jing Xu, Jingzhao Zhang
Abstract: This paper studies simple bilevel problems, where a convex upper-level function is minimized over the optimal solutions of a convex lower-level problem. We first establish a fundamental difficulty of simple bilevel problems: their approximate optimal value cannot be obtained by first-order zero-respecting algorithms. We therefore follow recent work and pursue weak approximate solutions, proposing novel near-optimal methods for smooth and nonsmooth problems by reformulating them into functionally constrained problems.
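To make the reformulation concrete, here is our reading of how a simple bilevel problem is relaxed into a functionally constrained one; the tolerance notation is illustrative:
\[
\min_{x} f(x) \ \ \text{s.t.}\ \ x \in \operatorname*{arg\,min}_{y} g(y)
\qquad\leadsto\qquad
\min_{x} f(x) \ \ \text{s.t.}\ \ g(x) - \min_{y} g(y) \le \epsilon_g ,
\]
where a point that additionally satisfies $f(x) - f^{\star} \le \epsilon_f$, with $f^{\star}$ the bilevel optimal value, is a weak approximate solution.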
Authors: Yan Huang, Li Wang
Abstract: Inspired by the gradient flow viewpoint of the Landau equation and corresponding dynamic formulation of the Landau metric in [arXiv:2007.08591], we develop a novel implicit particle method for the Landau equation in the framework of the JKO scheme. We first reformulate the Landau metric in a computationally friendly form, and then translate it into the Lagrangian viewpoint using the flow map. A key observation is that, while the flow map evolves according to a rather complicated integral equation, the unknown component is merely a score function of the corresponding density plus an additional term in the null space of the collision kernel. This insight guides us in designing and training the neural network for the flow map. Additionally, the objective function is in a double summation form, making it highly suitable for stochastic methods. Consequently, we design a tailored version of stochastic gradient descent that maintains particle interactions and significantly reduces the computational complexity. Compared to other deterministic particle methods, the proposed method enjoys exact entropy dissipation and unconditional stability, therefore making it suitable for large-scale plasma simulations over extended time periods.
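Because the objective is a double summation over particles, a standard stochastic treatment is to subsample index pairs. The sketch below shows that generic estimator under our assumptions (uniform sampling, a placeholder `pair_term` for the unspecified pairwise interaction); it is not the authors' tailored SGD.

```python
import torch

def minibatch_double_sum(particles, pair_term, batch=256):
    """Unbiased mini-batch estimate of sum_i sum_j phi(x_i, x_j):
    n^2 times the mean of phi over uniformly sampled index pairs."""
    n = particles.shape[0]
    i = torch.randint(n, (batch,))
    j = torch.randint(n, (batch,))
    return (n * n) * pair_term(particles[i], particles[j]).mean()
```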
Authors: Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, S\'ebastien Brati\`eres, Paolo Merialdo, Simone Scardapane
Abstract: The remarkable performance achieved by Large Language Models (LLM) has driven research efforts to leverage them for a wide range of tasks and input modalities. In speech-to-text (S2T) tasks, the emerging solution consists of projecting the output of the encoder of a Speech Foundational Model (SFM) into the LLM embedding space through an adapter module. However, no work has yet investigated how much the downstream-task performance depends on each component (SFM, adapter, LLM) nor whether the best design of the adapter depends on the chosen SFM and LLM. To fill this gap, we evaluate the combination of 5 adapter modules, 2 LLMs (Mistral and Llama), and 2 SFMs (Whisper and SeamlessM4T) on two widespread S2T tasks, namely Automatic Speech Recognition and Speech Translation. Our results demonstrate that the SFM plays a pivotal role in downstream performance, while the adapter choice has moderate impact and depends on the SFM and LLM.
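The adapter pattern evaluated above can be sketched generically: project the SFM encoder's hidden states into the LLM embedding space. The two-layer projection and the dimensions below are illustrative assumptions; the paper compares five adapter designs, none of which is claimed to be this one.

```python
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Maps SFM encoder states (e.g., Whisper) into the embedding space of
    an LLM (e.g., Llama); the LLM then consumes them like token embeddings."""
    def __init__(self, sfm_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sfm_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, sfm_hidden):        # (batch, frames, sfm_dim)
        return self.proj(sfm_hidden)      # (batch, frames, llm_dim)
```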
Authors: Lalith Bharadwaj Baru, Shilhora Akshay Patel, Rohit Boddeda
Abstract: The evolution of digital image manipulation, particularly with the advancement of deep generative models, significantly challenges existing deepfake detection methods, especially when the origin of the deepfake is obscure. To tackle the increasing complexity of these forgeries, we propose \textbf{Wavelet-CLIP}, a deepfake detection framework that integrates wavelet transforms with features derived from the ViT-L/14 architecture, pre-trained in the CLIP fashion. Wavelet-CLIP utilizes Wavelet Transforms to deeply analyze both spatial and frequency features of images, thus enhancing the model's capability to detect sophisticated deepfakes. To verify the effectiveness of our approach, we conducted extensive evaluations against existing state-of-the-art methods for cross-dataset generalization and detection of unseen images generated by standard diffusion models. Our method showcases outstanding performance, achieving an average AUC of 0.749 for cross-dataset generalization and 0.893 for robustness against unseen deepfakes, outperforming all compared methods. The code is available at: \url{https://github.com/lalithbharadwajbaru/Wavelet-CLIP}
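As a rough illustration of pairing wavelet transforms with CLIP features, the sketch below applies a one-level discrete wavelet transform along the feature axis; the Haar wavelet and the concatenation of low- and high-frequency bands are our assumptions, not the paper's exact design.

```python
import numpy as np
import pywt

def wavelet_features(clip_features):
    """Split CLIP ViT-L/14 features into approximation (low-frequency) and
    detail (high-frequency) coefficients, then concatenate both bands."""
    approx, detail = pywt.dwt(clip_features, "haar", axis=-1)
    return np.concatenate([approx, detail], axis=-1)
```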
Authors: Zelei Cheng, Xian Wu, Jiahao Yu, Shuo Han, Xin-Qiang Cai, Xinyu Xing
Abstract: Toxicity classification in textual content remains a significant problem. Data with labels from a single annotator fall short of capturing the diversity of human perspectives. Therefore, there is a growing need to incorporate crowdsourced annotations for training an effective toxicity classifier. Additionally, the standard approach to training a classifier using empirical risk minimization (ERM) may fail to address the potential shifts between the training set and testing set due to exploiting spurious correlations. This work introduces a novel bi-level optimization framework that integrates crowdsourced annotations with the soft-labeling technique and optimizes the soft-label weights by Group Distributionally Robust Optimization (GroupDRO) to enhance the robustness against out-of-distribution (OOD) risk. We theoretically prove the convergence of our bi-level optimization algorithm. Experimental results demonstrate that our approach outperforms existing baseline methods in terms of both average and worst-group accuracy, confirming its effectiveness in leveraging crowdsourced annotations to achieve more effective and robust toxicity classification.
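The GroupDRO building block that the framework optimizes against can be sketched as the usual exponentiated-gradient update over group weights; this omits the paper's bi-level soft-label optimization and assumes every group appears in the batch.

```python
import torch

def groupdro_step(per_sample_loss, group_ids, group_weights, eta=0.01):
    """Up-weight the worst-performing groups, then return the weighted loss."""
    n_groups = group_weights.numel()
    group_losses = torch.stack([
        per_sample_loss[group_ids == g].mean()   # assumes group g is present
        for g in range(n_groups)
    ])
    group_weights = group_weights * torch.exp(eta * group_losses.detach())
    group_weights = group_weights / group_weights.sum()
    return (group_weights * group_losses).sum(), group_weights
```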
Authors: Gokul Puthumanaillam, Paulo Padrao, Jose Fuentes, Leonardo Bobadilla, Melkior Ornik
Abstract: Robots performing navigation tasks in complex environments face significant challenges due to uncertainty in state estimation. Effectively managing this uncertainty is crucial, but the optimal approach varies depending on the specific details of the task: different tasks require varying levels of precision in different regions of the environment. For instance, a robot navigating a crowded space might need precise localization near obstacles but can operate effectively with less precise state estimates in open areas. This varying need for certainty in different parts of the environment, depending on the task, calls for policies that can adapt their uncertainty management strategies based on task-specific requirements. In this paper, we present a framework for integrating task-specific uncertainty requirements directly into navigation policies. We introduce Task-Specific Uncertainty Map (TSUM), which represents acceptable levels of state estimation uncertainty across different regions of the operating environment for a given task. Using TSUM, we propose Generalized Uncertainty Integration for Decision-Making and Execution (GUIDE), a policy conditioning framework that incorporates these uncertainty requirements into the robot's decision-making process. We find that conditioning policies on TSUMs provides an effective way to express task-specific uncertainty requirements and enables the robot to reason about the context-dependent value of certainty. We show how integrating GUIDE into reinforcement learning frameworks allows the agent to learn navigation policies without the need for explicit reward engineering to balance task completion and uncertainty management. We evaluate GUIDE on a variety of real-world navigation tasks and find that it demonstrates significant improvements in task completion rates compared to baselines. Evaluation videos can be found at https://guided-agents.github.io.
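One simple way to condition a policy on a TSUM, consistent with the description but entirely our assumption, is to append the acceptable uncertainty level at the robot's current location to its observation vector:

```python
import numpy as np

def condition_observation(obs, position, tsum):
    """Append the region-specific uncertainty bound from the TSUM so the
    policy can trade localization precision against task progress."""
    allowed_uncertainty = tsum[tuple(position)]   # e.g., a 2D grid lookup
    return np.concatenate([obs, [allowed_uncertainty]])
```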
Authors: Guanhang Lei, Zhen Lei, Lei Shi
Abstract: Neural operators have shown promise in solving many types of Partial Differential Equations (PDEs). They are significantly faster compared to traditional numerical solvers once they have been trained with a certain amount of observed data. However, their numerical performance in solving time-dependent PDEs, particularly in long-time prediction of dynamic systems, still needs improvement. In this paper, we focus on solving the long-time integration of nonlinear wave equations via neural operators by replacing the initial condition with the prediction in a recurrent manner. Given limited observed temporal trajectory data, we utilize some intrinsic features of these nonlinear wave equations, such as conservation laws and well-posedness, to improve the algorithm design and reduce accumulated error. Our numerical experiments examine these improvements in the Korteweg-de Vries (KdV) equation, the sine-Gordon equation, and the Klein-Gordon wave equation on an irregular domain.
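The recurrent prediction scheme is easy to state in code: each prediction becomes the initial condition for the next window, which is exactly why errors accumulate. The sketch below shows this loop only; the paper's conservation-law and well-posedness corrections are omitted.

```python
def rollout(neural_operator, u0, n_steps):
    """Long-time integration by recurrence: u_{k+1} = G(u_k), where G is
    the trained neural operator mapping one time window to the next."""
    states, u = [u0], u0
    for _ in range(n_steps):
        u = neural_operator(u)
        states.append(u)
    return states
```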
Authors: Nayoung Choi, Youngjune Lee, Gyu-Hwung Cho, Haeyu Jeong, Jungmin Kong, Saehun Kim, Keunchan Park, Jaeho Choi, Sarah Cho, Inchang Jeong, Gyohee Nam, Sunghoon Han, Wonil Yang
Abstract: Large Language Models (LLMs) excel at understanding the semantic relationships between queries and documents, even for lengthy and complex long-tail queries. These queries are challenging for feedback-based rankings due to sparse user engagement and limited feedback, making LLMs' ranking ability highly valuable. However, the large size and slow inference of LLMs necessitate the development of smaller, more efficient models (sLLMs). Recently, integrating ranking label generation into distillation techniques has become crucial, but existing methods underutilize LLMs' capabilities and are cumbersome. Our research, RRADistill: Re-Ranking Ability Distillation, proposes an efficient label generation pipeline and novel sLLM training methods for both encoder and decoder models. We introduce an encoder-based method using a Term Control Layer to capture term-matching signals and a decoder-based model with a ranking layer for enhanced understanding. A/B testing on a Korean-based search platform validates the effectiveness of our approach in improving re-ranking for long-tail queries.
Authors: Yue Yu, Prayag Tiwari
Abstract: Large Language Models (LLMs), such as ChatGPT, Phi3 and Llama-3, are leading a significant leap in AI, as they can generalize knowledge from their training to new tasks without fine-tuning. However, their application in the financial domain remains relatively limited. The financial field is inherently complex, requiring a deep understanding across various perspectives, from macro- and micro-economic trends to quantitative analysis. Motivated by this complexity, a mixture of expert LLMs tailored to specific financial domains could offer a more comprehensive understanding of intricate financial tasks. In this paper, we present FinTeamExperts, a role-specialized LLM framework structured as a Mixture of Experts (MOEs) for financial analysis. The framework simulates a collaborative team setting by training each model to specialize in a distinct role: Macro Analysts, Micro Analysts, and Quantitative Analysts. This role-specific specialization enhances the models' ability to integrate domain-specific expertise. We achieve this by training three 8-billion-parameter models on different corpora, each dedicated to excelling in specific finance-related roles. We then instruct-tune FinTeamExperts on downstream tasks to align with practical financial tasks. The experimental results show that FinTeamExperts outperform all models of the same size and larger on three out of four datasets. On the fourth dataset, which presents a more complex task, FinTeamExperts still surpass all models of the same size. This highlights the success of our role-based specialization approach and the continued training approach for FinTeamExperts.
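How the role experts are combined is not detailed in the abstract, so the following mixture sketch is hypothetical: a learned gate blends the outputs of macro, micro, and quantitative experts. The `RoleMixture` name, the mean-pooled gating, and the output blending are all our assumptions.

```python
import torch
import torch.nn as nn

class RoleMixture(nn.Module):
    """Blend role-specialized experts (macro / micro / quant) with a gate."""
    def __init__(self, experts, hidden_dim):
        super().__init__()
        self.experts = nn.ModuleList(experts)       # each maps (B,T,H)->(B,D)
        self.gate = nn.Linear(hidden_dim, len(experts))

    def forward(self, x):                           # x: (B, T, H)
        weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)    # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, D)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)          # (B, D)
```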
Authors: Piotr Kaniewski, Fariba Yousefi, Yeman Brhane Hagos, Talha Qaiser, Nikolay Burlutskiy
Abstract: In drug discovery, accurate lung tumor segmentation is an important step for assessing tumor size and its progression using \textit{in-vivo} imaging such as MRI. While deep learning models have been developed to automate this process, the focus has predominantly been on human subjects, neglecting the pivotal role of animal models in pre-clinical drug development. In this work, we focus on optimizing lung tumor segmentation in mice. First, we demonstrate that the nnU-Net model outperforms the U-Net, U-Net3+, and DeepMeta models. Most importantly, we achieve better results with nnU-Net 3D models than with 2D models, demonstrating the importance of spatial context for segmentation in MRI mouse scans. Finally, we outperform the prior state-of-the-art approach, which requires combined segmentation of the lungs and the tumors within them, while using only lung tumor annotations, saving annotation time and effort. This work (https://anonymous.4open.science/r/lung-tumour-mice-mri-64BB) is an important step in automating pre-clinical animal studies to quantify the efficacy of experimental drugs, particularly in assessing tumor changes.
URLs: https://anonymous.4open.science/r/lung-tumour-mice-mri-64BB
Authors: Dongwon Kim, Seoyeon Kim, Suha Kwak
Abstract: Object-centric learning (OCL) aims to learn representations of individual objects within visual scenes without manual supervision, facilitating efficient and effective visual reasoning. Traditional OCL methods primarily employ bottom-up approaches that aggregate homogeneous visual features to represent objects. However, in complex visual environments, these methods often fall short due to the heterogeneous nature of visual features within an object. To address this, we propose a novel OCL framework incorporating a top-down pathway. This pathway first bootstraps the semantics of individual objects and then modulates the model to prioritize features relevant to these semantics. By dynamically modulating the model based on its own output, our top-down pathway enhances the representational quality of objects. Our framework achieves state-of-the-art performance across multiple synthetic and real-world object-discovery benchmarks.
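The abstract describes the top-down pathway only at a high level; a FiLM-style modulation, shown below, is one plausible instantiation we are assuming for illustration: object semantics predict per-channel scales and shifts applied to bottom-up features.

```python
import torch.nn as nn

class SemanticModulation(nn.Module):
    """Top-down modulation: semantics -> per-channel (scale, shift) applied
    to the feature map, prioritizing semantics-relevant features."""
    def __init__(self, sem_dim, feat_ch):
        super().__init__()
        self.to_scale_shift = nn.Linear(sem_dim, 2 * feat_ch)

    def forward(self, features, semantics):   # (B,C,H,W), (B,sem_dim)
        scale, shift = self.to_scale_shift(semantics).chunk(2, dim=-1)
        return features * (1 + scale[..., None, None]) + shift[..., None, None]
```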
Authors: Xiao Hu, Ziqi Chen, Bo Peng, Daniel Adu-Ampratwum, Xia Ning
Abstract: Accurate prediction of chemical reaction yields is crucial for optimizing organic synthesis, potentially reducing time and resources spent on experimentation. With the rise of artificial intelligence (AI), there is growing interest in leveraging AI-based methods to accelerate yield predictions without conducting in vitro experiments. We present log-RRIM, an innovative graph transformer-based framework designed for predicting chemical reaction yields. Our approach implements a unique local-to-global reaction representation learning strategy: it initially captures detailed molecule-level information and then models and aggregates intermolecular interactions, ensuring that the impact of molecular fragments of varying sizes on yield is accurately accounted for. Another key feature of log-RRIM is its integration of a cross-attention mechanism that focuses on the interplay between reagents and reaction centers. This design reflects a fundamental principle in chemical reactions: the crucial role of reagents in influencing bond-breaking and formation processes, which ultimately affect reaction yields. log-RRIM outperforms existing methods in our experiments, especially for medium- to high-yielding reactions, proving its reliability as a predictor. Its advanced modeling of reactant-reagent interactions and sensitivity to small molecular fragments make it a valuable tool for reaction planning and optimization in chemical synthesis. The data and codes of log-RRIM are accessible through https://github.com/ninglab/Yield_log_RRIM.
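The reagent-to-reaction-center cross-attention can be sketched with a standard attention module: reaction-center embeddings query the reagent embeddings. The dimensions and head count below are illustrative, and the paper's actual layer likely differs in detail.

```python
import torch.nn as nn

class ReagentCenterAttention(nn.Module):
    """Reaction centers attend to reagents, letting reagent context modulate
    the representation of bond-breaking/forming sites."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, center_emb, reagent_emb):
        out, _ = self.attn(center_emb, reagent_emb, reagent_emb)
        return out
```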
Authors: Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, Jane Yu
Abstract: Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
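The core ScPO data-construction step can be sketched simply: sample many answers per unlabeled problem, then prefer the most self-consistent answer over the least consistent one. Exact-match voting here is a simplification of answer extraction, and feeding such pairs to a DPO-style loss is our assumption about the training step.

```python
from collections import Counter

def scpo_preference_pairs(problem_answers):
    """For each problem, pair the majority (most consistent) answer as
    'chosen' with the rarest answer as 'rejected'."""
    pairs = []
    for answers in problem_answers:        # list of sampled answer strings
        counts = Counter(answers)
        chosen, _ = counts.most_common(1)[0]
        rejected, _ = counts.most_common()[-1]
        if chosen != rejected:
            pairs.append((chosen, rejected))
    return pairs
```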
Authors: Timo Klein, Lukas Miklautz, Kevin Sidak, Claudia Plant, Sebastian Tschiatschek
Abstract: Akin to neuroplasticity in human brains, the plasticity of deep neural networks enables their quick adaptation to new data. This makes plasticity particularly crucial for deep Reinforcement Learning (RL) agents: Once plasticity is lost, an agent's performance will inevitably plateau because it cannot improve its policy to account for changes in the data distribution, which are a necessary consequence of its learning process. Thus, developing well-performing and sample-efficient agents hinges on their ability to remain plastic during training. Furthermore, the loss of plasticity can be connected to many other issues plaguing deep RL, such as training instabilities, scaling failures, overestimation bias, and insufficient exploration. With this survey, we aim to provide an overview of the emerging research on plasticity loss for academics and practitioners of deep reinforcement learning. First, we propose a unified definition of plasticity loss based on recent works, relate it to definitions from the literature, and discuss metrics for measuring plasticity loss. Then, we categorize and discuss numerous possible causes of plasticity loss before reviewing currently employed mitigation strategies. Our taxonomy is the first systematic overview of the current state of the field. Lastly, we discuss prevalent issues within the literature, such as a necessity for broader evaluation, and provide recommendations for future research, like gaining a better understanding of an agent's neural activity and behavior.
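Among the plasticity-loss metrics such a survey covers, the dormant-neuron ratio is easy to illustrate; the threshold and normalization below follow the common recipe in the literature and are not specific to this survey.

```python
import torch

def dormant_neuron_ratio(activations, tau=0.025):
    """Fraction of units whose layer-normalized mean activation falls below
    tau; a rising ratio over training signals plasticity loss."""
    score = activations.abs().mean(dim=0)      # per-unit mean |activation|
    score = score / (score.mean() + 1e-8)      # normalize within the layer
    return (score < tau).float().mean().item()
```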
Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han
Abstract: Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, and conventional post-training quantization methods for large language models, such as smoothing, become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing, which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to absorb the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides. However, na\"{\i}vely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-$\Sigma$, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5$\times$, achieving 3.0$\times$ speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.
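The decomposition at the heart of SVDQuant, as described, can be sketched directly: split a weight matrix into a high-precision low-rank branch plus a 4-bit residual. The per-tensor scale and symmetric rounding below are simplifying assumptions; the smoothing step and the Nunchaku kernel fusion are omitted.

```python
import torch

def svd_lowrank_plus_int4(W, rank=32):
    """W ~= L1 @ L2 + R_q * scale: low-rank factors absorb the outliers,
    and the residual is quantized to 4-bit integers."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]        # (out, rank), kept in high precision
    L2 = Vh[:rank]                     # (rank, in)
    R = W - L1 @ L2                    # residual after outlier absorption
    scale = R.abs().max() / 7          # symmetric int4 range [-8, 7]
    R_q = torch.clamp((R / scale).round(), -8, 7)
    return L1, L2, R_q, scale
```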